SLEP015: Feature Names Propagation #48
Conversation
I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?
attribute. The feature names of a pipeline can then be easily extracted as
follows::

    pipe[:-1].get_feature_names_out()
and maybe mention ``pipe[-1].feature_names_in_``?
slep015/proposal.rst
Outdated
In this case, the pipeline will construct a pandas DataFrame to be inputted
into ``MyTransformer`` and ``LogisticRegression``. The feature names
will be constructed by calling ``get_feature_names_out`` as data is passed
through the ``Pipeline``.
This implies it's the pipeline doing it. Or do you mean a pandas DF is passed around as your PR?
I am implying that ``Pipeline`` is doing this. I updated the doc to make this point explicit. I also added more details on how this SLEP relates to ``array_out``.
slep015/proposal.rst
Outdated
This SLEP requires all estimators to store ``feature_names_in_`` for all
estimators, which will increase the size of the estimators. By default, a
``Pipeline`` will only store ``feature_names_in_`` in the first step and
the rest can be computed by slicing the pipeline at different steps. In other
words, the additional space used will be minimal because only the
input feature names from the first step are stored.
Then are we storing feature names in each step or not? This paragraph is a bit confusing to me.
Also, slicing from the middle of the pipeline would lose the feature names.
Or we need to implement a ``__getitem__`` that recreates ``feature_names_in_`` when the ``start`` of the slice is not 0?
> Also, slicing from the middle of the pipeline would lose the feature names.

We can make this work by constructing the ``feature_names_in_`` and attaching it to the new pipeline.
@adrinjalali

> I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?

One of the reasons I wrote this SLEP was scikit-learn/scikit-learn#16772 (comment). It highlighted the performance issues for sparse data, which is related to the

One can consider this SLEP a stepping stone toward passing feature names down the pipeline. Furthermore, this SLEP is a prerequisite for the

This SLEP extends Andy's idea by including the interaction between

Yes, every step in a pipeline would not have the names, but these names can be obtained through slicing, and if the estimator needs the name in

TLDR: I think this SLEP is simpler and resolves some of the pain points we have with feature names.
I updated the SLEP to state that the
@thomasjpfan Thanks for working on this! I've mainly nitpicks.
slep015/proposal.rst
Outdated
be deprecated.

The inclusion of ``get_feature_names_out`` and ``feature_names_in_`` will
not introduce any overhead to ``Pipeline``.
Is it possible to have a slight overhead for very wide data with a lot of columns?
I'm happy to have this merged. And looks good to me for a vote from my side :)
I think this PR is good overall, but it needs to:

- specify the data type of ``feature_names_in_`` and the return type of ``get_feature_names_out``: are they necessarily lists? any Sequence? are arrays permitted/required?
- address concerns about memory

If we are passing around a wide sparse matrix, and generating ``feature_names_in_`` at each step in a pipeline, this could be consuming a lot of memory unnecessarily. Should we be introducing something like the following to avoid unnecessary memory usage, with ``self.feature_names_in_ = DefaultFeatureNames(X.shape[1])``?
import collections.abc

class DefaultFeatureNames(collections.abc.Sequence):
    """Lazily generate default feature names "x0", "x1", ... on demand."""

    def __init__(self, n_features):
        self.n_features = n_features

    def __len__(self):
        return self.n_features

    def __getitem__(self, sl):
        if isinstance(sl, slice):
            # Materialise only the requested slice, not all names
            return [f"x{i}" for i in range(*sl.indices(self.n_features))]
        elif isinstance(sl, tuple):
            raise NotImplementedError
        if sl < 0:
            sl += self.n_features
        if not 0 <= sl < self.n_features:
            raise IndexError(sl)
        return f"x{sl}"

    def __iter__(self):
        return (f"x{i}" for i in range(len(self)))
slep015/proposal.rst
Outdated
extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
would need to be used to select column names that were selected.
Suggested change:
- would need to be used to select column names that were selected.
+ would need to be used to infer the column names that were selected.
slep015/proposal.rst
Outdated
made possible if this SLEP gets accepted.

1. As an alternative to slicing, we can add a
   ``Pipeline.get_feature_names_in_at`` method to get the names at a specific
I find this name unpleasant, and don't see what's so much better than Pipeline[-1].feature_names_in_
This will not work by default because the final step would not have the feature names if there is more than one step:
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
# pipe1[-1].feature_names_in_ does not exist
pipe2 = make_pipeline(LogisticRegression())
# pipe2[-1].feature_names_in_ does exist
This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.
I'll remove this point to make this SLEP shorter.
slep015/proposal.rst
Outdated
all ``transform`` methods to specify the array container outputted by
``transform``. An implementation of ``array_out`` requires
``feature_names_in_`` to validate that the names in ``fit`` and
``transform`` are consistent. With the implementation of ``array_out`` needs
Suggested change:
- ``transform`` are consistent. With the implementation of ``array_out`` needs
+ ``transform`` are consistent. An implementation of ``array_out`` needs
slep015/proposal.rst
Outdated
This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
that will extract the feature names of ``X`` during ``fit``. This will also
be used for validation during non-``fit`` methods such as ``transform`` or
``predict``. If the ``X`` is not a recognized container, then
Suggested change:
- ``predict``. If the ``X`` is not a recognized container, then
+ ``predict``. If the ``X`` is not a recognized container with columns, then
1. The ``get_feature_names_out`` will be constructed using the name generation
   specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing will not work and one
I find this confusing. You're saying slicing will not work, but then showing an example with slicing? Or are you distinguishing slicing from indexing. What does slicing have to do with anything anyway??
I wanted to distinguish between the following two pipelines:
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
pipe1[:-1].get_feature_names_out() # this works
pipe2 = make_pipeline(LogisticRegression())
pipe2[:-1].get_feature_names_out() # does not work
``pipe2[:-1]`` fails because the slicing will produce a pipeline with no steps. Although, we can allow pipelines with no steps to get ``pipe2[:-1].get_feature_names_out()`` to work.
Okay, I agree this is a strange corner case, since it is the only way to construct a fitted empty Pipeline...
> This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.

Oh! I don't think I understood this at all from the SLEP. Isn't that contradicted by "This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators that will extract the feature names of ``X`` during ``fit``."?
I was thinking of the "outer"-most layer of the API. For non-meta estimators,

For metaestimators, such as pipeline, they would have
I don't really get how you define "outer" or would ensure that from an implementation perspective, unless either ``feature_names_in_`` was being set not by an estimator in its fit method, but by the caller of that estimator's fit method; or ``feature_names_in_`` was only set when X was in a format that had names attached. Neither of these limitations is discussed in the SLEP afaict.
Thank you for your thoughts, I'll update the SLEP accordingly. I was using "outer" and "inner" in the context of a metaestimator: the metaestimator would be the "outer" estimator, while all estimators inside the metaestimator are "inner" estimators.
This is the form I was considering.
A metaestimator could have all sorts of estimators that it uses internally. I think having the metaestimator be responsible for setting

This SLEP is trying to propose the bare minimum: If

Metaestimators is a case where I would want it to delegate this responsibility to its inner estimators. This way the metaestimator can construct
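As a rough sketch of the delegation idea (all class and parameter names here are hypothetical stand-ins, not actual scikit-learn API): the metaestimator could leave name extraction to its first inner step and merely expose the result itself:

```python
class ToyTransformer:
    """Toy inner estimator that records input names during fit (hypothetical)."""

    def fit(self, X, feature_names=None):
        if feature_names is not None:
            self.feature_names_in_ = list(feature_names)
        return self


class ToyMetaEstimator:
    """Toy metaestimator that delegates name handling to its first step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        # Only the first step sees the input names; later steps would
        # reconstruct theirs via get_feature_names_out / slicing.
        for i, step in enumerate(self.steps):
            step.fit(X, feature_names=feature_names if i == 0 else None)
        return self

    @property
    def feature_names_in_(self):
        # Expose the names stored by the first inner step
        return self.steps[0].feature_names_in_


meta = ToyMetaEstimator([ToyTransformer(), ToyTransformer()])
meta.fit([[1, 2]], feature_names=["a", "b"])
print(meta.feature_names_in_)  # ['a', 'b']
```

Note that with this pattern only the first inner step carries the attribute, matching the "bare minimum" framing above.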
I was considering any

@jnothman I have updated the SLEP to address your concerns:
@@ -121,7 +121,10 @@ Considerations and Limitations
   a pipeline with no steps. We can work around this by allowing pipelines
   with no steps.

3. Meta-estimators will delegate the setting and validation of
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
But ndarray is not a sequence: numpy/numpy#2776
Maybe "Iterable that returns a string" would be enough.
In our discussions, I think we want to make sure the feature names are strings.
Hmm. We'd better accept Sequences and 1d array-likes whose elements are strings: pd.Index is not a Sequence.
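A minimal sketch of what accepting "Sequences and 1d array-likes whose elements are strings" could look like in validation (the helper name is hypothetical, not an existing scikit-learn function):

```python
def check_feature_names(names):
    """Coerce a 1d array-like of strings (list, tuple, ndarray, pd.Index)
    into a plain list, rejecting non-string entries.

    Hypothetical helper for illustration only.
    """
    names = list(names)  # works for any iterable, including pd.Index
    for name in names:
        if not isinstance(name, str):
            raise TypeError(
                f"Feature names must be strings, got {type(name).__name__!r}"
            )
    return names


print(check_feature_names(("a", "b")))  # ['a', 'b']
```

Accepting anything iterable sidesteps the `ndarray`-is-not-a-`Sequence` problem noted above, while the string check enforces the constraint from the discussion.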
Suggested change:
- 3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
+ 3. ``feature_names_in_`` can be any 1d array-like of strings, such as an list or
So now my understanding (from conversation, not reviewing the SLEP) is that

Let's clarify:
If we want to be consistent with

It would be safer to copy, but I would prefer not to.

Not defined and will not validate.

I think the safest thing to do is to restrict this SLEP to pandas until we get a frame-like protocol. When this frame-like protocol is defined, then we can say we support frame-like objects.

In scikit-learn/scikit-learn#18010, I am pushing for using

I think we should adjust the SLEP to make
But ``_validate_data`` isn't public. Since meta estimators need to handle the case that the attribute is absent or None in any case, making this optional seems reasonable.
Regarding the discussion above, we need to make it clear that support for this is optional outside of the core library... Or we need a new discussion about how to make ``_validate_data``'s capabilities publicly available (either by making it a public function in sklearn.utils, or by defining a "protected" estimator class API).
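To make the shape of that discussion concrete, here is a rough sketch of what a public validation helper could look like, setting ``feature_names_in_`` as a side effect only when the input carries column names (all names and behaviour here are hypothetical, not scikit-learn's actual private ``_validate_data``):

```python
def validate_data(estimator, X):
    """Record feature names on the estimator when X carries them.

    Hypothetical public stand-in for the private _validate_data.
    """
    columns = getattr(X, "columns", None)  # duck-type DataFrame-like inputs
    if columns is not None:
        estimator.feature_names_in_ = [str(c) for c in columns]
    return X


class ToyEstimator:
    def fit(self, X):
        validate_data(self, X)
        return self


class FakeFrame:
    """Stand-in for a pandas DataFrame with named columns."""
    def __init__(self, columns):
        self.columns = columns


est = ToyEstimator().fit(FakeFrame(["a", "b"]))
print(est.feature_names_in_)  # ['a', 'b']

# Plain nested lists have no column names, so the attribute is never set
est2 = ToyEstimator().fit([[1, 2]])
print(hasattr(est2, "feature_names_in_"))  # False
```

This also illustrates why meta estimators must tolerate the attribute being absent: it only exists when the training data had names.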
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::
Suggested change:
- together with a pipeline with multiple column names::
+ together with a Pipeline with multiple column names::
with no steps.

3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
   an ndarray.
It might be worth noting that this allowance can avoid unnecessary memory consumption/copies, with reduced implementation complexity, although it may reduce usability a bit.
Is this PR superseded by acceptance of SLEP007 in #59?

No, that one is about how the feature names are generated; this one is about how they're propagated in a pipeline.

Does this supersede #18, meaning it's an alternative?

I'd say this is yet another effort to do the same thing as #18 wanted to do, but a lot more updated, after having gone down a few paths and not getting anywhere. I would need to read both again to say if this one is an alternative or just supersedes the other one.

Also, I'd be happy to merge both of them and continue discussing on a separate issue.

A bunch of SLEP015 is already included in SLEP007, which has already been accepted. What SLEP015 adds is an API for actually outputting pandas DataFrames. I think #18 (SLEP008) has more or less been superseded by SLEP007.

Back when we were writing SLEPs 7 and 8, I wrote 7 only to talk about how we create the feature names, and 8 was more about the options we have to propagate them. Since then, I worked on a sklearn dataframe kind of object which we decided not to do, then worked on xarray, then Thomas worked on pandas, and the API also evolved over time. I think we could at least focus on this SLEP for now, and then figure out what to do with other containers. We've gone back and forth on which container to use, or whether to use one at all, instead of propagating feature names alongside the data rather than with the data in a container.

This has been superseded by SLEP 18 #72, right?

I updated this PR so that this SLEP is now rejected.

Thanks @thomasjpfan. Then I think we can merge.
This SLEP details how a ``feature_names_in_`` attribute and a ``get_feature_names_out`` method can be used together to obtain feature name propagation. I see there are two main goals for having this feature:

``Pipeline``.

Related to scikit-learn/scikit-learn#18444 - ``get_feature_names_out`` PR
Related to scikit-learn/scikit-learn#16772 - ``array_out`` PR
Related to scikit-learn/scikit-learn#18010 - ``feature_names_out_`` PR

CC @scikit-learn/core-devs