SLEP015: Feature Names Propagation (#48)

thomasjpfan · web-flow · commit 221362bf8ba4 · 2022-11-30T09:48:09.000+01:00
diff --git a/index.rst b/index.rst
@@ -40,6 +40,7 @@
     :caption: Rejected
 
     slep014/proposal
+    slep015/proposal
 
 .. toctree::
     :maxdepth: 1
diff --git a/slep015/proposal.rst b/slep015/proposal.rst
@@ -0,0 +1,191 @@
+.. _slep_015:
+
+==================================
+SLEP015: Feature Names Propagation
+==================================
+
+:Author: Thomas J Fan
+:Status: Rejected
+:Type: Standards Track
+:Created: 2020-10-03
+
+Abstract
+########
+
+This SLEP proposes adding the ``get_feature_names_out`` method to all
+transformers and the ``feature_names_in_`` attribute for all estimators.
+The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``,
+contains the feature names.
+
+Motivation
+##########
+
+``scikit-learn`` is commonly used as a part of a larger data processing
+pipeline. When this pipeline is used to transform data, the result is a
+NumPy array, discarding column names. The current workflow for
+extracting the feature names requires calling ``get_feature_names`` on the
+transformer that created the feature. This interface can be cumbersome when used
+together with a pipeline with multiple column names::
+
+    X = pd.DataFrame({'letter': ['a', 'b', 'c'],
+                      'pet': ['dog', 'snake', 'dog'],
+                      'distance': [1, 2, 3]})
+    y = [0, 0, 1]
+    orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['num']
+
+    ct = ColumnTransformer(
+        [('cat', OneHotEncoder(), orig_cat_cols),
+         ('num', StandardScaler(), orig_num_cols)])
+    pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
+
+    cat_names = (pipe['columntransformer']
+                 .named_transformers_['onehotencoder']
+                 .get_feature_names(orig_cat_cols))
+
+    feature_names = np.r_[cat_names, orig_num_cols]
+
+The ``feature_names`` extracted above corresponds to the features directly
+passed into ``LogisticRegression``. As demonstrated above, the process of
+extracting ``feature_names`` requires knowing the order of the selected
+categories in the ``ColumnTransformer``. Furthermore, if there is feature
+selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
+would need to be used to infer the column names that were selected.
+
+Solution
+########
+
+This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
+that will extract the feature names of ``X`` during ``fit``. This will also
+be used for validation during non-``fit`` methods such as ``transform`` or
+``predict``. If the ``X`` is not a recognized container with columns, then
+``feature_names_in_`` can be undefined. If ``feature_names_in_`` is undefined,
+then it will not be validated.
+
+Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)``
+to all transformers. By default, the input features will be determined by the
+``feature_names_in_`` attribute. The feature names of a pipeline can then be
+easily extracted as follows::
+
+    pipe[:-1].get_feature_names_out()
+    # ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
+       'cat__pet_dog', 'cat__pet_snake', 'num__distance']
+
+Note that ``get_feature_names_out`` does not require ``input_names``
+because the feature names was stored in the pipeline itself. These
+features will be passed to each step's ``get_feature_names_out`` method to
+obtain the output feature names of the ``Pipeline`` itself.
+
+Enabling Functionality
+######################
+
+The following enhancements are **not** a part of this SLEP. These features are
+made possible if this SLEP gets accepted.
+
+1. This SLEP enables us to implement an ``array_out`` keyword argument to
+   all ``transform`` methods to specify the array container outputted by
+   ``transform``. An implementation of ``array_out`` requires
+   ``feature_names_in_`` to validate that the names in ``fit`` and
+   ``transform`` are consistent. An implementation of ``array_out`` needs
+   a way to map from the input feature names to output feature names, which is
+   provided by ``get_feature_names_out``.
+
+2. An alternative to ``array_out``: Transformers in a pipeline may wish to have
+   feature names passed in as ``X``. This can be enabled by adding a
+   ``array_input`` parameter to ``Pipeline``::
+
+        pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
+                             array_input='pandas')
+
+   In this case, the pipeline will construct a pandas DataFrame to be inputted
+   into ``MyTransformer`` and ``LogisticRegression``. The feature names
+   will be constructed by calling ``get_feature_names_out`` as data is passed
+   through the ``Pipeline``. This feature implies that ``Pipeline`` is
+   doing the construction of the DataFrame.
+
+Considerations and Limitations
+##############################
+
+1. The ``get_feature_names_out`` will be constructed using the name generation
+   specification from :ref:`slep_007`.
+
+2. For a ``Pipeline`` with only one estimator, slicing will not work and one
+   would need to access the feature names directly::
+
+      pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
+      pipe[:-1].feature_names_in_  # Works
+
+      pipe2 = make_pipeline(LogisticRegression())
+      pipe[:-1].feature_names_in_  # Does not work
+
+   This is because `pipe2[:-1]` raises an error because it will result in
+   a pipeline with no steps. We can work around this by allowing pipelines
+   with no steps.
+
+3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
+   an ndarray.
+
+4. Meta-estimators will delegate the setting and validation of
+   ``feature_names_in_`` to its inner estimators. The meta-estimator will
+   define ``feature_names_in_`` by referencing its inner estimators. For
+   example, the ``Pipeline`` can use ``steps[0].feature_names_in_`` as
+   the input feature names. If the inner estimators do not define
+   ``feature_names_in_`` then the meta-estimator will not defined
+   ``feature_names_in_`` as well.
+
+Backward compatibility
+######################
+
+1. This SLEP is fully backward compatible with previous versions. With the
+   introduction of ``get_feature_names_out``, ``get_feature_names`` will
+   be deprecated. Note that ``get_feature_names_out``'s signature will
+   always contain ``input_features`` which can be used or ignored. This
+   helps standardize the interface for the get feature names method.
+
+2. The inclusion of a ``get_feature_names_out`` method will not introduce any
+   overhead to estimators.
+
+3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
+   estimators because they would store the feature names. Users can remove
+   the attribute by calling ``del est.feature_names_in_`` if they want to
+   remove the feature and disable validation.
+
+Alternatives
+############
+
+There have been many attempts to address this issue:
+
+1. ``array_out`` in keyword parameter in ``transform`` : This approach requires
+   third party estimators to unwrap and wrap array containers in transform,
+   which introduces more burden for third party estimator maintainers.
+   Furthermore, ``array_out`` with sparse data will introduce an overhead when
+   being passed along in a ``Pipeline``. This overhead comes from the
+   construction of the sparse data container that has the feature names.
+
+2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute
+   while this SLEP proposes a ``get_feature_names_out`` method to accomplish
+   the same task. The benefit of the ``get_feature_names_out`` method is that
+   it can be used even if the feature names were not passed in ``fit`` with a
+   dataframe. For example, in a ``Pipeline`` the feature names are not passed
+   through to each step and a ``get_feature_names_out`` method can be used to
+   get the names of each step with slicing.
+
+3. :ref:`slep_012` : The ``InputArray`` was developed to work around the
+   overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
+   introduction of another data structure into the Python Data Ecosystem, would
+   lead to more burden for third party estimator maintainers.
+
+
+References and Footnotes
+########################
+
+.. [1] Each SLEP must either be explicitly labeled as placed in the public
+   domain (see this SLEP as an example) or licensed under the `Open
+   Publication License`_.
+
+.. _Open Publication License: https://www.opencontent.org/openpub/
+
+
+Copyright
+#########
+
+This document has been placed in the public domain. [1]_