SLEP015: Feature Names Propagation #48
Conversation
I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?
attribute. The feature names of a pipeline can then be easily extracted as
follows::

    pipe[:-1].get_feature_names_out()
and maybe mention ``pipe[-1].feature_names_in_``?
slep015/proposal.rst
Outdated
In this case, the pipeline will construct a pandas DataFrame to be inputted
into ``MyTransformer`` and ``LogisticRegression``. The feature names
will be constructed by calling ``get_feature_names_out`` as data is passed
through the ``Pipeline``.
This implies it's the pipeline doing it. Or do you mean a pandas DF is passed around as your PR?
I am implying that ``Pipeline`` is doing this. I updated the doc to make this point explicit. I also added more details on how this SLEP relates to ``array_out``.
slep015/proposal.rst
Outdated
This SLEP requires all estimators to store ``feature_names_in_`` for all
estimators, which will increase the size of the estimators. By default, a
``Pipeline`` will only store ``feature_names_in_`` in the first step and
the rest can be computed by slicing the pipeline at different steps. In other
words, the additional space used will be minimal because only the
input feature names from the first step are stored.
Then are we storing feature names in each step or not? This paragraph is a bit confusing to me.
Also, slicing from the middle of the pipeline would lose the feature names.
Or we need to implement a ``__getitem__`` that recreates ``feature_names_in_`` when the ``start`` of the slice is not 0?
> Also, slicing from the middle of the pipeline would lose the feature names.

We can make this work by constructing the ``feature_names_in_`` and attaching it to the new pipeline.
@adrinjalali

> I'm a bit confused as to what your proposal really is. This sounds to me much closer to Andy's approach, but that doesn't pass the feature names during fit down the pipeline. Maybe you could expand on those bits a bit?

One of the reasons I wrote this SLEP was scikit-learn/scikit-learn#16772 (comment). It highlighted the performance issues for sparse data, which is related to the

One can consider this SLEP a stepping stone toward passing feature names down the pipeline. Furthermore, this SLEP is a prerequisite for the

This SLEP extends Andy's idea by including the interaction between

Yes, every step in a pipeline would not have the names, but these names can be obtained through slicing, and if the estimator needs the name in

TLDR: I think this SLEP is simpler and resolves some of the pain points we have with feature names.
I updated the SLEP to state that the
@thomasjpfan Thanks for working on this! I've mainly nitpicks.
slep015/proposal.rst
Outdated
be deprecated.

The inclusion of ``get_feature_names_out`` and ``feature_names_in_`` will
not introduce any overhead to ``Pipeline``.
Is it possible to have a slight overhead for very wide data with a lot of columns?
I'm happy to have this merged. And looks good to me for a vote from my side :)
I think this PR is good overall, but it needs to:

- specify the data type of ``feature_names_in_`` and the return type of ``get_feature_names_out``: are they necessarily lists? any Sequence? are arrays permitted/required?
- address concerns about memory

If we are passing around a wide sparse matrix, and generating ``feature_names_in_`` at each step in a pipeline, this could be consuming a lot of memory unnecessarily. Should we be introducing something like the following to avoid unnecessary memory usage, with ``self.feature_names_in_ = DefaultFeatureNames(X.shape[1])``?
import collections.abc

class DefaultFeatureNames(collections.abc.Sequence):
    """Lazily generate default feature names "x0", "x1", ... on demand."""

    def __init__(self, n_features):
        self.n_features = n_features

    def __len__(self):
        return self.n_features

    def __getitem__(self, sl):
        if isinstance(sl, slice):
            # Materialise only the requested slice, not all names
            return [f"x{i}" for i in range(*sl.indices(self.n_features))]
        elif isinstance(sl, tuple):
            raise NotImplementedError
        if sl < 0:
            sl += self.n_features
        if not 0 <= sl < self.n_features:
            raise IndexError(sl)
        return f"x{sl}"

    def __iter__(self):
        return (f"x{i}" for i in range(len(self)))
slep015/proposal.rst
Outdated
extracting ``feature_names`` requires knowing the order of the selected
categories in the ``ColumnTransformer``. Furthermore, if there is feature
selection in the pipeline, such as ``SelectKBest``, the ``get_support`` method
would need to be used to select column names that were selected.
Suggested change:
- would need to be used to select column names that were selected.
+ would need to be used to infer the column names that were selected.
slep015/proposal.rst
Outdated
made possible if this SLEP gets accepted.

1. As an alternative to slicing, we can add a
   ``Pipeline.get_feature_names_in_at`` method to get the names at a specific
I find this name unpleasant, and don't see what's so much better than Pipeline[-1].feature_names_in_
This will not work by default because the final step would not have the feature names if there is more than one step:
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
# pipe1[-1].feature_names_in_ does not exist
pipe2 = make_pipeline(LogisticRegression())
# pipe2[-1].feature_names_in_ does exist
This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.
I'll remove this point to make this SLEP shorter.
slep015/proposal.rst
Outdated
all ``transform`` methods to specify the array container outputted by
``transform``. An implementation of ``array_out`` requires
``feature_names_in_`` to validate that the names in ``fit`` and
``transform`` are consistent. With the implementation of ``array_out`` needs
Suggested change:
- ``transform`` are consistent. With the implementation of ``array_out`` needs
+ ``transform`` are consistent. An implementation of ``array_out`` needs
slep015/proposal.rst
Outdated
This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators
that will extract the feature names of ``X`` during ``fit``. This will also
be used for validation during non-``fit`` methods such as ``transform`` or
``predict``. If the ``X`` is not a recognized container, then
Suggested change:
- ``predict``. If the ``X`` is not a recognized container, then
+ ``predict``. If the ``X`` is not a recognized container with columns, then
1. The ``get_feature_names_out`` will be constructed using the name generation
   specification from :ref:`slep_007`.

2. For a ``Pipeline`` with only one estimator, slicing will not work and one
I find this confusing. You're saying slicing will not work, but then showing an example with slicing? Or are you distinguishing slicing from indexing. What does slicing have to do with anything anyway??
I wanted to distinguish between the following two pipelines:
pipe1 = make_pipeline(StandardScaler(), LogisticRegression())
pipe1[:-1].get_feature_names_out() # this works
pipe2 = make_pipeline(LogisticRegression())
pipe2[:-1].get_feature_names_out() # does not work
``pipe2[:-1]`` fails because the slicing will produce a pipeline with no steps. Although, we can allow pipelines with no steps to get ``pipe2[:-1].get_feature_names_out()`` to work.
Okay, I agree this is a strange corner case, since it is the only way to construct a fitted empty Pipeline...
> This proposal does not actually pass through the names at each step. Only the pipeline and the first step will have access to the input names.

Oh! I don't think I understood this at all from the SLEP. Isn't that contradicted by "This SLEP proposes adding the ``feature_names_in_`` attribute to all estimators that will extract the feature names of ``X`` during ``fit``."?
I was thinking of the "outer"-most layer of the API. For non-meta estimators,

For metaestimators, such as pipeline, they would have
I don't really get how you define "outer" or would ensure that from an implementation perspective, unless either ``feature_names_in_`` was being set not by an estimator in its fit method, but by the caller of that estimator's fit method; or ``feature_names_in_`` was only set when X was in a format that had names attached. Neither of these limitations is discussed in the SLEP afaict.
Thank you for your thoughts, I'll update the SLEP accordingly. I was using "outer" and "inner" in the context of a metaestimator: the metaestimator would be the "outer" estimator, while all estimators inside the metaestimator are "inner" estimators.
This is the form I was considering.
A metaestimator could have all sorts of estimators that it uses internally. I think having the metaestimator be responsible for setting

This SLEP is trying to propose the bare minimum: If

Metaestimators is a case where I would want it to delegate this responsibility to its inner estimators. This way the metaestimator can construct
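As a rough sketch of the delegation idea (all class and parameter names here are hypothetical stand-ins, not actual scikit-learn API): the metaestimator could leave name extraction to its first inner step and merely expose the result itself:

```python
class ToyTransformer:
    """Toy inner estimator that records input names during fit (hypothetical)."""

    def fit(self, X, feature_names=None):
        if feature_names is not None:
            self.feature_names_in_ = list(feature_names)
        return self


class ToyMetaEstimator:
    """Toy metaestimator that delegates name handling to its first step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, feature_names=None):
        # Only the first step sees the input names; later steps would
        # reconstruct theirs via get_feature_names_out / slicing.
        for i, step in enumerate(self.steps):
            step.fit(X, feature_names=feature_names if i == 0 else None)
        return self

    @property
    def feature_names_in_(self):
        # Expose the names stored by the first inner step
        return self.steps[0].feature_names_in_


meta = ToyMetaEstimator([ToyTransformer(), ToyTransformer()])
meta.fit([[1, 2]], feature_names=["a", "b"])
print(meta.feature_names_in_)  # ['a', 'b']
```

Note that with this pattern only the first inner step carries the attribute, matching the "bare minimum" framing above.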
I was considering any

@jnothman I have updated the SLEP to address your concerns:
@@ -121,7 +121,10 @@ Considerations and Limitations
   a pipeline with no steps. We can work around this by allowing pipelines
   with no steps.

3. Meta-estimators will delegate the setting and validation of
3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
But ndarray is not a sequence: numpy/numpy#2776
Maybe "Iterable that returns a string" would be enough.
In our discussions, I think we want to make sure the feature names are strings.
Hmm. We'd better accept Sequences and 1d array-likes whose elements are strings: pd.Index is not a Sequence.
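A minimal sketch of what accepting "Sequences and 1d array-likes whose elements are strings" could look like in validation (the helper name is hypothetical, not an existing scikit-learn function):

```python
def check_feature_names(names):
    """Coerce a 1d array-like of strings (list, tuple, ndarray, pd.Index)
    into a plain list, rejecting non-string entries.

    Hypothetical helper for illustration only.
    """
    names = list(names)  # works for any iterable, including pd.Index
    for name in names:
        if not isinstance(name, str):
            raise TypeError(
                f"Feature names must be strings, got {type(name).__name__!r}"
            )
    return names


print(check_feature_names(("a", "b")))  # ['a', 'b']
```

Accepting anything iterable sidesteps the `ndarray`-is-not-a-`Sequence` problem noted above, while the string check enforces the constraint from the discussion.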
Suggested change:
- 3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
+ 3. ``feature_names_in_`` can be any 1d array-like of strings, such as an list or
So now my understanding (from conversation, not reviewing the SLEP) is that

Let's clarify:
If we want to be consistent with

It would be safer to copy, but I would prefer not to.

Not defined and will not validate.

I think the safest thing to do is to restrict this SLEP to pandas until we get a frame-like protocol. When this frame-like protocol is defined, then we can say we support frame-like objects.

In scikit-learn/scikit-learn#18010, I am pushing for using

I think we should adjust the SLEP to make
But ``_validate_data`` isn't public. Since meta estimators need to handle the case that the attribute is absent or None in any case, making this optional seems reasonable.
Regarding the discussion above, we need to make it clear that support for this is optional outside of the core library... Or we need a new discussion about how to make ``_validate_data``'s capabilities publicly available (either by making it a public function in sklearn.utils, or by defining a "protected" estimator class API).
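To make the shape of that discussion concrete, here is a rough sketch of what a public validation helper could look like, setting ``feature_names_in_`` as a side effect only when the input carries column names (all names and behaviour here are hypothetical, not scikit-learn's actual private ``_validate_data``):

```python
def validate_data(estimator, X):
    """Record feature names on the estimator when X carries them.

    Hypothetical public stand-in for the private _validate_data.
    """
    columns = getattr(X, "columns", None)  # duck-type DataFrame-like inputs
    if columns is not None:
        estimator.feature_names_in_ = [str(c) for c in columns]
    return X


class ToyEstimator:
    def fit(self, X):
        validate_data(self, X)
        return self


class FakeFrame:
    """Stand-in for a pandas DataFrame with named columns."""
    def __init__(self, columns):
        self.columns = columns


est = ToyEstimator().fit(FakeFrame(["a", "b"]))
print(est.feature_names_in_)  # ['a', 'b']

# Plain nested lists have no column names, so the attribute is never set
est2 = ToyEstimator().fit([[1, 2]])
print(hasattr(est2, "feature_names_in_"))  # False
```

This also illustrates why meta estimators must tolerate the attribute being absent: it only exists when the training data had names.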
NumPy array, discarding column names. The current workflow for
extracting the feature names requires calling ``get_feature_names`` on the
transformer that created the feature. This interface can be cumbersome when used
together with a pipeline with multiple column names::
Suggested change:
- together with a pipeline with multiple column names::
+ together with a Pipeline with multiple column names::
with no steps.

3. ``feature_names_in_`` can be any 1-D ``Sequence``, such as an list or
   an ndarray.
It might be worth noting that this allowance can avoid unnecessary memory consumption/copies, with reduced implementation complexity, although it may reduce usability a bit.
Is this PR superseded by acceptance of SLEP007 in #59?

No, that one is about how the feature names are generated; this one is about how they're propagated in a pipeline.

Does this supersede #18, meaning it's an alternative?

I'd say this is yet another effort to do the same thing as #18 wanted to do, but a lot more updated, after having gone down a few paths and not getting anywhere. I would need to read both again to say if this one is an alternative or just supersedes the other one.

Also, I'd be happy to merge both of them and continue discussing on a separate issue.

A bunch of SLEP015 is already included in SLEP007, which has already been accepted. What SLEP015 adds is an API for actually outputting pandas DataFrames. I think #18 (SLEP008) has more or less been superseded by SLEP007.

Back when we were writing SLEPs 7 and 8, I wrote 7 only to talk about how we create the feature names, and 8 was more about the options we have to propagate them. Since then, I worked on a sklearn dataframe kind of object which we decided not to do, then worked on xarray, then Thomas worked on pandas, and the API also evolved over time. I think we could at least focus on this SLEP for now, and then figure out what to do with other containers. We've gone back and forth on which container to use, or whether to use one at all, instead of propagating feature names alongside the data rather than with the data in a container.

This has been superseded by SLEP 18 #72, right?

I updated this PR so that this SLEP is now rejected.

Thanks @thomasjpfan. Then I think we can merge.
This SLEP details how a ``feature_names_in_`` attribute and a ``get_feature_names_out`` method can be used together to obtain feature name propagation. I see there are two main goals for having this feature:

``Pipeline``.

Related to scikit-learn/scikit-learn#18444 - ``get_feature_names_out`` PR
Related to scikit-learn/scikit-learn#16772 - ``array_out`` PR
Related to scikit-learn/scikit-learn#18010 - ``feature_names_out_`` PR

CC @scikit-learn/core-devs