| 1 | +.. _slep_015: |
| 2 | + |
| 3 | +================================== |
| 4 | +SLEP015: Feature Names Propagation |
| 5 | +================================== |
| 6 | + |
| 7 | +:Author: Thomas J Fan |
| 8 | +:Status: Rejected |
| 9 | +:Type: Standards Track |
| 10 | +:Created: 2020-10-03 |
| 11 | + |
| 12 | +Abstract |
| 13 | +######## |
| 14 | + |
| 15 | +This SLEP proposes adding the ``get_feature_names_out`` method to all |
| 16 | +transformers and the ``feature_names_in_`` attribute for all estimators. |
| 17 | +The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``, |
| 18 | +contains the feature names. |
| 19 | + |
| 20 | +Motivation |
| 21 | +########## |
| 22 | + |
| 23 | +``scikit-learn`` is commonly used as a part of a larger data processing |
| 24 | +pipeline. When this pipeline is used to transform data, the result is a |
| 25 | +NumPy array, discarding column names. The current workflow for |
| 26 | +extracting the feature names requires calling ``get_feature_names`` on the |
| 27 | +transformer that created the feature. This interface can be cumbersome when used |
| 28 | +together with a pipeline with multiple column names:: |
| 29 | + |
| 30 | + X = pd.DataFrame({'letter': ['a', 'b', 'c'], |
| 31 | + 'pet': ['dog', 'snake', 'dog'], |
| 32 | + 'distance': [1, 2, 3]}) |
| 33 | + y = [0, 0, 1] |
| 34 | + orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance'] |
| 35 | + |
| 36 | + ct = ColumnTransformer( |
| 37 | + [('cat', OneHotEncoder(), orig_cat_cols), |
| 38 | + ('num', StandardScaler(), orig_num_cols)]) |
| 39 | + pipe = make_pipeline(ct, LogisticRegression()).fit(X, y) |
| 40 | + |
| 41 | + cat_names = (pipe['columntransformer'] |
| 42 | + .named_transformers_['cat'] |
| 43 | + .get_feature_names(orig_cat_cols)) |
| 44 | + |
| 45 | + feature_names = np.r_[cat_names, orig_num_cols] |
| 46 | + |
| 47 | +The ``feature_names`` extracted above corresponds to the features directly |
| 48 | +passed into ``LogisticRegression``. As demonstrated above, the process of |
| 49 | +extracting ``feature_names`` requires knowing the order of the selected |
| 50 | +categories in the ``ColumnTransformer``. Furthermore, if there is feature |
| 51 | +selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method |
| 52 | +would need to be used to infer the column names that were selected. |
| 53 | + |
| 54 | +Solution |
| 55 | +######## |
| 56 | + |
| 57 | +This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators |
| 58 | +that will extract the feature names of ``X`` during ``fit``. This will also |
| 59 | +be used for validation during non-``fit`` methods such as ``transform`` or |
| 60 | +``predict``. If ``X`` is not a recognized container with column names, then |
| 61 | +``feature_names_in_`` can be undefined. If ``feature_names_in_`` is undefined, |
| 62 | +then it will not be validated. |
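
The behaviour described above can be sketched without scikit-learn. This is only an illustration under stated assumptions: ``NamedTable``, ``ToyEstimator`` and the helper ``_column_names`` are hypothetical names, and the choice to skip validation whenever either side has no names is an assumption, not something this SLEP specifies.

```python
class NamedTable:
    """Hypothetical stand-in for a container with column names (e.g. a DataFrame)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]


def _column_names(X):
    # Returns None for containers without column names, such as nested lists.
    columns = getattr(X, "columns", None)
    return None if columns is None else list(columns)


class ToyEstimator:
    """Toy estimator illustrating the proposed behaviour; not scikit-learn code."""

    def fit(self, X, y=None):
        names = _column_names(X)
        if names is not None:
            self.feature_names_in_ = names
        elif hasattr(self, "feature_names_in_"):
            # Assumption: refitting on unnamed data leaves the attribute undefined.
            del self.feature_names_in_
        return self

    def transform(self, X):
        self._validate_names(X)
        return X

    def _validate_names(self, X):
        expected = getattr(self, "feature_names_in_", None)
        seen = _column_names(X)
        # Assumption: validation is skipped when either side has no names.
        if expected is None or seen is None:
            return
        if seen != expected:
            raise ValueError(
                f"Feature names {seen} do not match those seen in fit: {expected}"
            )


train = NamedTable(["letter", "pet"], [["a", "dog"], ["b", "snake"]])
est = ToyEstimator().fit(train)
print(est.feature_names_in_)  # ['letter', 'pet']
```

In this sketch, calling ``est.transform`` on a table whose columns are reordered raises ``ValueError``, while fitting on a plain nested list leaves ``feature_names_in_`` undefined.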
| 63 | + |
| 64 | +Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)`` |
| 65 | +to all transformers. By default, the input features will be determined by the |
| 66 | +``feature_names_in_`` attribute. The feature names of a pipeline can then be |
| 67 | +easily extracted as follows:: |
| 68 | + |
| 69 | + pipe[:-1].get_feature_names_out() |
| 70 | + # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c', |
| 71 | + #  'cat__pet_dog', 'cat__pet_snake', 'num__distance'] |
| 72 | + |
| 73 | +Note that ``get_feature_names_out`` does not require ``input_names`` |
| 74 | +because the feature names were stored in the pipeline itself. These |
| 75 | +features will be passed to each step's ``get_feature_names_out`` method to |
| 76 | +obtain the output feature names of the ``Pipeline`` itself. |
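
One way to picture this chaining is the sketch below. It is not the scikit-learn implementation: ``ToyOneHot``, ``ToyMaskSelector`` and ``chain_feature_names_out`` are hypothetical, and the naming scheme is simplified (no ``cat__`` style prefixes). Each step only needs to map input names to output names.

```python
class ToyOneHot:
    """Hypothetical encoder: one output name per (feature, category) pair."""

    def __init__(self, categories):
        # categories maps each input feature name to its list of categories.
        self.categories = categories

    def get_feature_names_out(self, input_names=None):
        return [
            f"{name}_{category}"
            for name in input_names
            for category in self.categories[name]
        ]


class ToyMaskSelector:
    """Hypothetical selector: keeps the names where the boolean mask is True."""

    def __init__(self, mask):
        self.mask = list(mask)

    def get_feature_names_out(self, input_names=None):
        return [name for name, keep in zip(input_names, self.mask) if keep]


def chain_feature_names_out(steps, input_names):
    """Feed each step's output names to the next step, in order."""
    names = list(input_names)
    for step in steps:
        names = step.get_feature_names_out(names)
    return names


steps = [
    ToyOneHot({"letter": ["a", "b", "c"], "pet": ["dog", "snake"]}),
    ToyMaskSelector([True, False, True, True, False]),
]
names_out = chain_feature_names_out(steps, ["letter", "pet"])
print(names_out)  # ['letter_a', 'letter_c', 'pet_dog']
```

Because the selector is just another step with a ``get_feature_names_out`` method, the caller no longer has to combine ``get_support`` with the encoder's names by hand.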
| 77 | + |
| 78 | +Enabling Functionality |
| 79 | +###################### |
| 80 | + |
| 81 | +The following enhancements are **not** a part of this SLEP. These features are |
| 82 | +made possible if this SLEP gets accepted. |
| 83 | + |
| 84 | +1. This SLEP enables us to implement an ``array_out`` keyword argument to |
| 85 | + all ``transform`` methods to specify the array container returned by |
| 86 | + ``transform``. An implementation of ``array_out`` requires |
| 87 | + ``feature_names_in_`` to validate that the names in ``fit`` and |
| 88 | + ``transform`` are consistent. An implementation of ``array_out`` needs |
| 89 | + a way to map from the input feature names to output feature names, which is |
| 90 | + provided by ``get_feature_names_out``. |
| 91 | + |
| 92 | +2. An alternative to ``array_out``: Transformers in a pipeline may wish to |
| 93 | + receive an ``X`` that carries feature names. This can be enabled by adding |
| 94 | + an ``array_input`` parameter to ``Pipeline``:: |
| 95 | + |
| 96 | + pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(), |
| 97 | + array_input='pandas') |
| 98 | + |
| 99 | + In this case, the pipeline will construct a pandas DataFrame to be passed |
| 100 | + into ``MyTransformer`` and ``LogisticRegression``. The feature names |
| 101 | + will be constructed by calling ``get_feature_names_out`` as data is passed |
| 102 | + through the ``Pipeline``. This feature implies that ``Pipeline`` is |
| 103 | + doing the construction of the DataFrame. |
| 104 | + |
| 105 | +Considerations and Limitations |
| 106 | +############################## |
| 107 | + |
| 108 | +1. The names returned by ``get_feature_names_out`` will be constructed using |
| 109 | + the name generation specification from :ref:`slep_007`. |
| 110 | + |
| 111 | +2. For a ``Pipeline`` with only one estimator, slicing will not work and one |
| 112 | + would need to access the feature names directly:: |
| 113 | + |
| 114 | + pipe1 = make_pipeline(StandardScaler(), LogisticRegression()) |
| 115 | + pipe1[:-1].feature_names_in_ # Works |
| 116 | + |
| 117 | + pipe2 = make_pipeline(LogisticRegression()) |
| 118 | + pipe2[:-1].feature_names_in_ # Does not work |
| 119 | + |
| 120 | + ``pipe2[:-1]`` raises an error because slicing would result in a |
| 121 | + pipeline with no steps. We can work around this by allowing pipelines |
| 122 | + with no steps. |
| 123 | + |
| 124 | +3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as a list or |
| 125 | + an ndarray. |
| 126 | + |
| 127 | +4. Meta-estimators will delegate the setting and validation of |
| 128 | + ``feature_names_in_`` to their inner estimators. The meta-estimator will |
| 129 | + define ``feature_names_in_`` by referencing its inner estimators. For |
| 130 | + example, the ``Pipeline`` can use ``steps[0].feature_names_in_`` as |
| 131 | + the input feature names. If the inner estimators do not define |
| 132 | + ``feature_names_in_``, then the meta-estimator will not define |
| 133 | + ``feature_names_in_`` either. |
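
The delegation in item 4 could look roughly like the following sketch. ``ToyPipeline`` and ``Inner`` are hypothetical, and for brevity the steps here are bare estimators rather than named pairs. A read-only property forwards to the first step, so the attribute is undefined on the meta-estimator exactly when it is undefined on the inner estimator.

```python
class Inner:
    """Hypothetical inner estimator; fit would normally set feature_names_in_."""


class ToyPipeline:
    """Hypothetical meta-estimator that delegates feature_names_in_."""

    def __init__(self, steps):
        self.steps = list(steps)

    @property
    def feature_names_in_(self):
        # If the first step has no feature_names_in_, the AttributeError
        # propagates, so hasattr(pipeline, "feature_names_in_") is False.
        return self.steps[0].feature_names_in_


named = Inner()
named.feature_names_in_ = ["letter", "pet", "distance"]

with_names = ToyPipeline([named])
without_names = ToyPipeline([Inner()])

print(with_names.feature_names_in_)                 # ['letter', 'pet', 'distance']
print(hasattr(without_names, "feature_names_in_"))  # False
```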
| 134 | + |
| 135 | +Backward compatibility |
| 136 | +###################### |
| 137 | + |
| 138 | +1. This SLEP is fully backward compatible with previous versions. With the |
| 139 | + introduction of ``get_feature_names_out``, ``get_feature_names`` will |
| 140 | + be deprecated. Note that ``get_feature_names_out``'s signature will |
| 141 | + always contain ``input_names`` which can be used or ignored. This |
| 142 | + helps standardize the interface for the get feature names method. |
| 143 | + |
| 144 | +2. The inclusion of a ``get_feature_names_out`` method will not introduce any |
| 145 | + overhead to estimators. |
| 146 | + |
| 147 | +3. The inclusion of a ``feature_names_in_`` attribute will increase the size of |
| 148 | + estimators because they would store the feature names. Users who want |
| 149 | + to drop the stored names and disable validation can call |
| 150 | + ``del est.feature_names_in_``. |
| 151 | + |
| 152 | +Alternatives |
| 153 | +############ |
| 154 | + |
| 155 | +There have been many attempts to address this issue: |
| 156 | + |
| 157 | +1. ``array_out`` keyword parameter in ``transform`` : This approach requires |
| 158 | + third party estimators to unwrap and wrap array containers in transform, |
| 159 | + which introduces more burden for third party estimator maintainers. |
| 160 | + Furthermore, ``array_out`` with sparse data will introduce an overhead when |
| 161 | + being passed along in a ``Pipeline``. This overhead comes from the |
| 162 | + construction of the sparse data container that has the feature names. |
| 163 | + |
| 164 | +2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute |
| 165 | + while this SLEP proposes a ``get_feature_names_out`` method to accomplish |
| 166 | + the same task. The benefit of the ``get_feature_names_out`` method is that |
| 167 | + it can be used even if the feature names were not passed in ``fit`` with a |
| 168 | + dataframe. For example, in a ``Pipeline`` the feature names are not passed |
| 169 | + through to each step and a ``get_feature_names_out`` method can be used to |
| 170 | + get the names of each step with slicing. |
| 171 | + |
| 172 | +3. :ref:`slep_012` : The ``InputArray`` was developed to work around the |
| 173 | + overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The |
| 174 | + introduction of another data structure into the Python data ecosystem would |
| 175 | + lead to more burden for third party estimator maintainers. |
| 176 | + |
| 177 | + |
| 178 | +References and Footnotes |
| 179 | +######################## |
| 180 | + |
| 181 | +.. [1] Each SLEP must either be explicitly labeled as placed in the public |
| 182 | + domain (see this SLEP as an example) or licensed under the `Open |
| 183 | + Publication License`_. |
| 184 | + |
| 185 | +.. _Open Publication License: https://www.opencontent.org/openpub/ |
| 186 | + |
| 187 | + |
| 188 | +Copyright |
| 189 | +######### |
| 190 | + |
| 191 | +This document has been placed in the public domain. [1]_ |