Commit 221362b

SLEP015: Feature Names Propagation (#48)
1 parent 25edba4 commit 221362b

2 files changed

+192
-0
lines changed

index.rst

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@
4040
:caption: Rejected
4141

4242
slep014/proposal
43+
slep015/proposal
4344

4445
.. toctree::
4546
:maxdepth: 1

slep015/proposal.rst

Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
1+
.. _slep_015:
2+
3+
==================================
4+
SLEP015: Feature Names Propagation
5+
==================================
6+
7+
:Author: Thomas J Fan
8+
:Status: Rejected
9+
:Type: Standards Track
10+
:Created: 2020-10-03
11+
12+
Abstract
13+
########
14+
15+
This SLEP proposes adding the ``get_feature_names_out`` method to all
16+
transformers and the ``feature_names_in_`` attribute for all estimators.
17+
The ``feature_names_in_`` attribute is set during ``fit`` if the input, ``X``,
18+
contains the feature names.
19+
20+
Motivation
21+
##########
22+
23+
``scikit-learn`` is commonly used as a part of a larger data processing
24+
pipeline. When this pipeline is used to transform data, the result is a
25+
NumPy array, discarding column names. The current workflow for
26+
extracting the feature names requires calling ``get_feature_names`` on the
27+
transformer that created the feature. This interface can be cumbersome when used
28+
together with a pipeline with multiple column names::
29+
30+
X = pd.DataFrame({'letter': ['a', 'b', 'c'],
31+
'pet': ['dog', 'snake', 'dog'],
32+
'distance': [1, 2, 3]})
33+
y = [0, 0, 1]
34+
orig_cat_cols, orig_num_cols = ['letter', 'pet'], ['distance']
35+
36+
ct = ColumnTransformer(
37+
[('cat', OneHotEncoder(), orig_cat_cols),
38+
('num', StandardScaler(), orig_num_cols)])
39+
pipe = make_pipeline(ct, LogisticRegression()).fit(X, y)
40+
41+
cat_names = (pipe['columntransformer']
42+
.named_transformers_['cat']
43+
.get_feature_names(orig_cat_cols))
44+
45+
feature_names = np.r_[cat_names, orig_num_cols]
46+
47+
The ``feature_names`` extracted above corresponds to the features directly
48+
passed into ``LogisticRegression``. As demonstrated above, the process of
49+
extracting ``feature_names`` requires knowing the order of the selected
50+
categories in the ``ColumnTransformer``. Furthermore, if there is feature
51+
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
52+
would need to be used to infer the column names that were selected.
53+
54+
Solution
55+
########
56+
57+
This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
58+
that will extract the feature names of ``X`` during ``fit``. This will also
59+
be used for validation during non-``fit`` methods such as ``transform`` or
60+
``predict``. If ``X`` is not a recognized container with columns, then
61+
``feature_names_in_`` can be undefined. If ``feature_names_in_`` is undefined,
62+
then it will not be validated.
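
The fit-time capture and later validation described above can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn's implementation: ``NamedTable``, ``ToyEstimator`` and ``_check_feature_names`` are hypothetical names, a "recognized container" is assumed to be any object exposing a ``columns`` attribute, and skipping validation when the data passed to ``transform`` has no names is a choice made for this sketch rather than something the SLEP specifies.

```python
class NamedTable:
    """Hypothetical stand-in for a container with column names (e.g. a DataFrame)."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = list(columns)


class ToyEstimator:
    """Hypothetical estimator illustrating the proposed ``feature_names_in_``."""

    def fit(self, X, y=None):
        # Assumption: a container is "recognized" if it exposes ``columns``.
        columns = getattr(X, "columns", None)
        if columns is not None:
            self.feature_names_in_ = list(columns)
        elif hasattr(self, "feature_names_in_"):
            # Refit on unnamed data: drop names left over from an earlier fit.
            del self.feature_names_in_
        return self

    def _check_feature_names(self, X):
        expected = getattr(self, "feature_names_in_", None)
        columns = getattr(X, "columns", None)
        # Validation is skipped when either side carries no names.
        if expected is None or columns is None:
            return
        if list(columns) != expected:
            raise ValueError(
                f"Feature names {list(columns)} do not match "
                f"those seen in fit: {expected}"
            )

    def transform(self, X):
        self._check_feature_names(X)
        # Identity transform: the sketch is only about name handling.
        return X.rows if hasattr(X, "rows") else X
```

With this sketch, fitting on ``NamedTable([[1, 2]], ["a", "b"])`` and then transforming a table whose columns are ``["b", "a"]`` raises ``ValueError``, while an estimator fitted on a plain list never defines ``feature_names_in_`` and so performs no check.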
63+
64+
Secondly, this SLEP proposes adding ``get_feature_names_out(input_names=None)``
65+
to all transformers. By default, the input features will be determined by the
66+
``feature_names_in_`` attribute. The feature names of a pipeline can then be
67+
easily extracted as follows::
68+
69+
pipe[:-1].get_feature_names_out()
70+
# ['cat__letter_a', 'cat__letter_b', 'cat__letter_c',
71+
#  'cat__pet_dog', 'cat__pet_snake', 'num__distance']
72+
73+
Note that ``get_feature_names_out`` does not require ``input_names``
74+
because the feature names were stored in the pipeline itself. These
75+
features will be passed to each step's ``get_feature_names_out`` method to
76+
obtain the output feature names of the ``Pipeline`` itself.
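
The way names are threaded from one step to the next can be sketched without scikit-learn. Everything below is hypothetical and for illustration only: ``ToyOneHot``, ``ToyMaskSelector`` and ``pipeline_feature_names_out`` are made-up names, and the ``<feature>_<category>`` naming is a simplification rather than the scheme the SLEP mandates.

```python
class ToyOneHot:
    """Hypothetical encoder: one output column per (feature, category) pair."""

    def fit(self, rows, feature_names):
        self.feature_names_in_ = list(feature_names)
        # Sorted unique categories seen in each input column.
        self.categories_ = [sorted(set(col)) for col in zip(*rows)]
        return self

    def get_feature_names_out(self, input_names=None):
        # Fall back to the names stored during fit when none are given.
        names = self.feature_names_in_ if input_names is None else list(input_names)
        return [f"{name}_{cat}"
                for name, cats in zip(names, self.categories_)
                for cat in cats]


class ToyMaskSelector:
    """Hypothetical selector that keeps the columns flagged in ``support``."""

    def __init__(self, support):
        self.support = list(support)

    def get_feature_names_out(self, input_names=None):
        if input_names is None:
            # AttributeError here if the selector never saw feature names.
            input_names = self.feature_names_in_
        return [name for name, keep in zip(input_names, self.support) if keep]


def pipeline_feature_names_out(steps, input_names=None):
    """Pass each step's output names to the next step as its input names."""
    names = input_names
    for step in steps:
        names = step.get_feature_names_out(names)
    return names


rows = [("a", "dog"), ("b", "snake"), ("c", "dog")]
encoder = ToyOneHot().fit(rows, ["letter", "pet"])
selector = ToyMaskSelector([True, False, True, True, False])
names = pipeline_feature_names_out([encoder, selector])
print(names)  # ['letter_a', 'letter_c', 'pet_dog']
```

Only the first step needs stored names; every later step receives its input names from the step before it, which is why no ``input_names`` argument is required at the top level.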
77+
78+
Enabling Functionality
79+
######################
80+
81+
The following enhancements are **not** a part of this SLEP. These features are
82+
made possible if this SLEP gets accepted.
83+
84+
1. This SLEP enables us to implement an ``array_out`` keyword argument to
85+
all ``transform`` methods to specify the array container outputted by
86+
``transform``. An implementation of ``array_out`` requires
87+
``feature_names_in_`` to validate that the names in ``fit`` and
88+
``transform`` are consistent. An implementation of ``array_out`` needs
89+
a way to map from the input feature names to output feature names, which is
90+
provided by ``get_feature_names_out``.
91+
92+
2. An alternative to ``array_out``: Transformers in a pipeline may wish to have
93+
feature names passed in as ``X``. This can be enabled by adding a
94+
``array_input`` parameter to ``Pipeline``::
95+
96+
pipe = make_pipeline(ct, MyTransformer(), LogisticRegression(),
97+
array_input='pandas')
98+
99+
In this case, the pipeline will construct a pandas DataFrame to be inputted
100+
into ``MyTransformer`` and ``LogisticRegression``. The feature names
101+
will be constructed by calling ``get_feature_names_out`` as data is passed
102+
through the ``Pipeline``. This feature implies that ``Pipeline`` is
103+
doing the construction of the DataFrame.
104+
105+
Considerations and Limitations
106+
##############################
107+
108+
1. The output of ``get_feature_names_out`` will be constructed using the name generation
109+
specification from :ref:`slep_007`.
110+
111+
2. For a ``Pipeline`` with only one estimator, slicing will not work and one
112+
would need to access the feature names directly::
113+
114+
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
115+
pipe1[:-1].feature_names_in_ # Works
116+
117+
pipe2 = make_pipeline(LogisticRegression())
118+
pipe2[:-1].feature_names_in_ # Does not work
119+
120+
This is because ``pipe2[:-1]`` would produce a pipeline with no steps,
121+
which raises an error. We can work around this by allowing pipelines
122+
with no steps.
123+
124+
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as a list or
125+
an ndarray.
126+
127+
4. Meta-estimators will delegate the setting and validation of
128+
``feature_names_in_`` to their inner estimators. A meta-estimator will
129+
define ``feature_names_in_`` by referencing its inner estimators. For
130+
example, the ``Pipeline`` can use ``steps[0].feature_names_in_`` as
131+
the input feature names. If the inner estimators do not define
132+
``feature_names_in_``, then the meta-estimator will not define
133+
``feature_names_in_`` either.
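
The delegation described in item 4 can be sketched with a read-only property. This is a hedged illustration, not how scikit-learn implements ``Pipeline``: ``ToyInner`` and ``ToyMetaEstimator`` are hypothetical, names are passed to ``fit`` explicitly to keep the sketch free of dependencies, and ``steps`` holds bare estimators as in the ``steps[0]`` shorthand used above.

```python
class ToyInner:
    """Hypothetical inner estimator; stores names only when they are given."""

    def fit(self, X, feature_names=None):
        if feature_names is not None:
            self.feature_names_in_ = list(feature_names)
        return self


class ToyMetaEstimator:
    """Hypothetical meta-estimator that stores no feature names of its own."""

    def __init__(self, steps):
        self.steps = list(steps)

    def fit(self, X, feature_names=None):
        # Setting the names is delegated to the first inner estimator.
        self.steps[0].fit(X, feature_names)
        return self

    @property
    def feature_names_in_(self):
        # Raises AttributeError when the inner estimator has no names, so
        # ``hasattr(meta, "feature_names_in_")`` is then False as well.
        return self.steps[0].feature_names_in_


named = ToyMetaEstimator([ToyInner()]).fit([[1, 2]], ["a", "b"])
unnamed = ToyMetaEstimator([ToyInner()]).fit([[1, 2]])
print(named.feature_names_in_)                  # ['a', 'b']
print(hasattr(unnamed, "feature_names_in_"))    # False
```

Because the property simply forwards to the inner estimator, the meta-estimator is defined or undefined in lockstep with it and never holds a second copy of the names.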
134+
135+
Backward compatibility
136+
######################
137+
138+
1. This SLEP is fully backward compatible with previous versions. With the
139+
introduction of ``get_feature_names_out``, ``get_feature_names`` will
140+
be deprecated. Note that ``get_feature_names_out``'s signature will
141+
always contain ``input_names`` which can be used or ignored. This
142+
helps standardize the interface for the get feature names method.
143+
144+
2. The inclusion of a ``get_feature_names_out`` method will not introduce any
145+
overhead to estimators.
146+
147+
3. The inclusion of a ``feature_names_in_`` attribute will increase the size of
148+
estimators because they would store the feature names. Users can remove
149+
the attribute by calling ``del est.feature_names_in_`` if they want to
150+
drop the stored feature names and disable validation.
151+
152+
Alternatives
153+
############
154+
155+
There have been many attempts to address this issue:
156+
157+
1. ``array_out`` keyword parameter in ``transform``: This approach requires
158+
third party estimators to unwrap and wrap array containers in transform,
159+
which introduces more burden for third party estimator maintainers.
160+
Furthermore, ``array_out`` with sparse data will introduce an overhead when
161+
being passed along in a ``Pipeline``. This overhead comes from the
162+
construction of the sparse data container that has the feature names.
163+
164+
2. :ref:`slep_007` : ``SLEP007`` introduces a ``feature_names_out_`` attribute
165+
while this SLEP proposes a ``get_feature_names_out`` method to accomplish
166+
the same task. The benefit of the ``get_feature_names_out`` method is that
167+
it can be used even if the feature names were not passed in ``fit`` with a
168+
dataframe. For example, in a ``Pipeline`` the feature names are not passed
169+
through to each step and a ``get_feature_names_out`` method can be used to
170+
get the names of each step with slicing.
171+
172+
3. :ref:`slep_012` : The ``InputArray`` was developed to work around the
173+
overhead of using a pandas ``DataFrame`` or an xarray ``DataArray``. The
174+
introduction of another data structure into the Python Data Ecosystem would
175+
lead to more burden for third party estimator maintainers.
176+
177+
178+
References and Footnotes
179+
########################
180+
181+
.. [1] Each SLEP must either be explicitly labeled as placed in the public
182+
domain (see this SLEP as an example) or licensed under the `Open
183+
Publication License`_.
184+
185+
.. _Open Publication License: https://www.opencontent.org/openpub/
186+
187+
188+
Copyright
189+
#########
190+
191+
This document has been placed in the public domain. [1]_
