SLEP 014 Pandas in Pandas out #37
Conversation
I think this should be SLEP 14.
Should we consider the new pandas NA (pd.NA) in the considerations section? I think if transformers return pandas DataFrames, it becomes relevant.

We can also talk about what kind of column (and/or row) index we accept, and be explicit about limiting what we accept.

Another alternative (which I personally prefer) is xarray. We probably should have another SLEP covering that. @TomAugspurger may have better insight comparing the two.

One of the reasons I think we should consider xarray is that we can attach arbitrary feature/sample/data props to the data, which may be useful. It's also very lightweight compared to pandas, and we may be able to convince them to remove a hard pandas dependency (correct me if I'm wrong @TomAugspurger).
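To make the pd.NA point above concrete, here is a small illustration of my own (not from the SLEP) of current pandas nullable-dtype behaviour and why it matters once estimators receive DataFrames: forcing such a column through np.asarray yields an object array, which numeric validation cannot handle straightforwardly:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})
    print(df["a"].dtype)              # Int64 -- nullable integer, missing value is pd.NA
    print(np.asarray(df["a"]).dtype)  # object, since pd.NA cannot live in a plain integer array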
Thanks @thomasjpfan
slep014/proposal.rst (Outdated):

    ==============================

    :Author: Thomas J Fan
    :Status: Under Review
If we go for SLEP000, this would be a Draft
3. Allow for extracting the feature names of estimators in meta-estimators::

    from sklearn import set_config
    set_config(store_feature_names_in=True)
We should probably document the default values of these configs somewhere.
The default of pandas_inout is now stated twice, in the Solution and Backward compatibility sections. The default value of store_feature_names_in is stated in its section. I purposefully did not go into too much detail in the Enabling Functionality section, since it serves as "what are the possibilities if this SLEP gets accepted".
slep014/proposal.rst (Outdated):

    Backward compatibility
    ######################

    The ``set_config(pandas_inout=True)`` global configuration flag will be set to
Maybe more like pandas_output?
I had something like that initially, but it feels like "We will always output pandas, even if the input is numpy arrays". (Naming is hard)
Another point is to talk about ...

pandas 1.0 added a ...

In theory, I think it's possible to have a ...

The ... It also seems like DataFrame.attrs is experimental as well. In both cases, I think we should wait for the features to stabilize before designing with them in mind.
What would be the scope of this "pandas in - pandas out"? Only transformers that still return (samples x features) arrays? Or also other places (output of predict, coefs, ...)?
slep014/proposal.rst (Outdated):

    API for extracting feature names would be::

        from sklearn import set_config
        set_config(pandas_inout=True)
Minor nit, but I think pandas_in_out might be more readable.
slep014/proposal.rst (Outdated):

    X_trans.columns.tolist()
    ['letter_a', 'letter_b', 'letter_c', 'pet_dog', 'pet_snake', 'num']

    This introduces a soft dependency on pandas, which is opt-in with the
scikit-learn already has a soft-dependency on pandas for some features, so it wouldn't be new
So maybe word it more like "This feature is only available through a soft dependency on pandas", or something.
slep014/proposal.rst (Outdated):

    Index alignment
    ---------------

    Operations are index aligned when working with DataFrames. Interally,
Suggested change:

    - Operations are index aligned when working with DataFrames. Interally,
    + Operations are index aligned when working with DataFrames. Internally,
    piped into ``LogisticRegression.fit`` which calls ``check_array`` on the
    DataFrame, which may lead to a memory copy in a future version of
    ``pandas``. This leads to unnecessary overhead from piping the data from one
    estimator to another.
Personally, I think some transformers (like StandardScaler) could rather easily work column-wise to avoid such copying overhead.
(Of course, that would give a stronger pandas dependence in the implementation of such transformers.)
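A rough sketch of that column-wise idea (my own illustration, not scikit-learn code): scale one column at a time, so no full-size 2-D copy of the DataFrame is ever materialized:

    import numpy as np
    import pandas as pd

    def columnwise_standard_scale(df: pd.DataFrame) -> pd.DataFrame:
        # Scale each column independently; only per-column temporaries are
        # created, instead of converting the whole frame via np.asarray.
        out = {}
        for name in df.columns:
            col = df[name].to_numpy(dtype=float)
            out[name] = (col - col.mean()) / col.std()
        return pd.DataFrame(out, index=df.index)

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    print(columnwise_standard_scale(df))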
and it could even have the option of being "in-place" :D
We can try to support this. check_array would need to not run asarray on the dataframe, and the transformer would need to operate on the dataframe itself.
Yes, check_array could get another option! Exciting! ;) [I agree with @jorisvandenbossche though that this would be interesting for the future.]

I personally think it's interesting that scikit-learn would consider supporting it, but IMO it is somewhat orthogonal to this SLEP. Also, in the current situation of converting DataFrames to arrays on input to estimators, the question of supporting pd.NA already comes up.
I think the scope is only ...

I think transformers are the most important to start with, so that seems the best scope. But yes, I mainly wanted to indicate that this should be mentioned more clearly in the SLEP then ;)
slep014/proposal.rst (Outdated):

    - Use xarray's Dataset, ``xr.Dataset``: The pandas DataFrame is more widely used
      in Python's Data ecosystem, which means more libraries are built with pandas
      in mind. With xarray support, users will need to convert their DataFrame into
      a ``xr.Dataset``. This converstion process will be lossy when working with
Suggested change:

    - a ``xr.Dataset``. This converstion process will be lossy when working with
    + a ``xr.Dataset``. This conversion process will be lossy when working with
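On the lossiness mentioned in the quoted text, a minimal illustration (my own example, assuming xarray is installed): a pandas categorical column loses its dtype when converted to an xr.Dataset and back:

    import pandas as pd

    df = pd.DataFrame({"pet": pd.Categorical(["dog", "snake", "dog"]),
                       "num": [1.0, 2.0, 3.0]})
    ds = df.to_xarray()              # xarray has no categorical dtype
    roundtrip = ds.to_dataframe()
    print(df.dtypes["pet"])          # category
    print(roundtrip.dtypes["pet"])   # object -- the categorical dtype is lost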
Do we want to have an in-depth discussion of xarray vs pandas in this SLEP? In this issue? In #35? My brief opinion:

- pandas pros: ...
- pandas cons: ...
- xarray pros: ...
- xarray cons: ...

I assume @adrinjalali's pro and con list is different from mine since his preference is xarray ;)
Isn't it possible to have it with an xarray ...?

That's true. I thought we were considering ...
Sorry @amueller, I don't completely follow: we can run the benchmarks for both dense and sparse data? I don't see yet how my comment describes a "very specific worst case". What I'm trying to say is, I'd be happy to vote yes if the benchmarks show that the worst-case scenario isn't that bad after all (as long as we can simulate the actual worst case). For now, IMHO, there is no tangible element on which to base a decision (hypothetical future behavior of pandas + a hypothetical solution for that hypothetical behavior). I'm basically following your own words:

But like I said, if others want to vote, let's vote.
We can certainly run a benchmark for what's there currently, but we also know the outcome: for dense data there is no overhead, and for sparse data there is unacceptably large overhead. What you also proposed is a hypothetical future scenario in which there is a single copy in the case of dense data (and an additional copy in the case of sparse data?). What I tried to say is that, at least theoretically, the future could be even worse than that, and so this benchmark also makes assumptions about the future implementation.

I don't know what the current proposal is for sparse data, so I would like to clarify that first, given the issues that @rth pointed out above.
Memory copies are also a big problem in pandas, and actually one of the reasons to go towards a column store in pandas (to avoid things like "consolidation" of same-dtype columns, etc). So while I personally am pushing for a column store in pandas, I still think there is value in having a kind of homogeneously typed DataFrame backed by a single 2D array (we talked about this briefly in the call last week). So it would be a worthwhile discussion to think about where in the PyData ecosystem that fits. Should people just use xarray for that use case? Or can we have such a single-dtype DataFrame class in pandas that can share some infrastructure with the column-store DataFrame? Those are interesting discussions, but let's have that discussion somewhere else ;) But, as @amueller / @jnothman already argued, this is not really a concern right now, and as @TomAugspurger showed, it seems even possible to hack together a zero-copy roundtrip even if DataFrame were to become a column store.

Yes, that is possible (even for floats with NaNs, if you keep them as NaNs and don't convert to missing values).

@rth I just commented in pandas-dev/pandas#32196 that this can be sped up rather easily. But of course, it will always stay a lot slower, since it is creating many 1D sparse arrays from the 2D array.
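On the zero-copy roundtrip point above, a quick check one can run today (my own snippet; the result depends on pandas' internals and may change under copy-on-write or a future column-store layout):

    import numpy as np
    import pandas as pd

    arr = np.random.rand(5, 3)
    df = pd.DataFrame(arr, copy=False)       # single homogeneous float block
    print(np.shares_memory(arr, df.values))  # True with the current block manager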
Cool, thanks @jorisvandenbossche! That would indeed resolve most of the concerns with sparse DataFrames.

I am not sure that resolves most of the concerns regarding sparse. First, it's only possible in a future release. And moreover, even if we can give that conversion a factor-10 speed-up in pandas, it is still much slower than a CSR->CSC conversion. For sparse output of transformers, it seems still best to output scipy matrices, but I'm not sure how to deal with feature names then.

Well, as far as I know, there is currently no ideal way to have sparse arrays with labels in general. The alternative with xarray is to use ... So given that this is going to be an opt-in feature and only available in ~7 months in 0.24, as far as I understand, saying that you need the latest pandas for best performance with sparse by that time could be reasonable IMO. I think we might want to compare to the cost on a typical workflow rather than to a CSR->CSC conversion. Currently the performance blocker is only with a very large number of features (e.g. 100k). As mentioned in the previous benchmark for ...
This discussion shows that we currently don't have benchmarks for a few typical pipelines, and cannot easily evaluate the performance impact of proposed changes (in general, not specific to this PR). It would be good to merge the ASV benchmarks created by @jeremiedbb into scikit-learn (scikit-learn/scikit-learn#16723), use them as a reference, and extend them as needed. I really like how this is done and documented in pandas. Not a blocker for this SLEP, but it would be nice to have.

I think we should ask ourselves whether we're OK with having a solution that doesn't address sparse. It would certainly mean not covering some important use cases.

Is it not covering some use cases, or is it having some non-trivial overhead when the data is sparse? I think right now there's some overhead which we're not comfortable with, but we could see if pandas could at least improve the conversions, maybe the same way it's done in ... In the meantime, we could develop the solution using pandas, expecting to have these overheads for sparse, and by the time we're ready to release this feature, it could also be improved in pandas.

@jorisvandenbossche and @rth have been doing some work on pandas to make converting between a scipy.sparse matrix and pandas' format faster (hopefully they can report results). But even with these improvements, we're still going to bump against the fact that pandas' SparseArray is 1-D, so when you have a sparse result with ~10,000s of columns, you're creating ~10,000s of Python objects, which is slow even if creating each individual object is cheap. pandas' extension arrays being 1-D is a fundamental limitation that isn't likely to change soon. So there does seem to be a choice between "pandas in / pandas out" and "fast sparse transformed results", which may warrant its own configuration.
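For reference, a small sketch of the scipy/pandas sparse roundtrip being discussed (my own example): DataFrame.sparse.from_spmatrix builds one 1-D SparseArray per column, which is exactly where the per-column overhead for very wide matrices comes from:

    import pandas as pd
    from scipy import sparse

    X = sparse.random(100, 1000, density=0.01, format="csr", random_state=0)

    # One SparseArray object per column -- cheap individually, costly at ~100k columns.
    df = pd.DataFrame.sparse.from_spmatrix(
        X, columns=[f"f{i}" for i in range(X.shape[1])]
    )
    print(df.dtypes.iloc[0])             # a 1-D Sparse[float64, ...] extension dtype

    X_back = df.sparse.to_coo().tocsr()  # back to scipy for downstream estimators
    print(X_back.shape)                  # (100, 1000)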
I am currently prototyping this feature, so we can measure what the overhead looks like for the various combinations of xarray/pandas/sparse configurations. I am planning to get this out this week.

@TomAugspurger I'm a bit hesitant to have two separate configuration options to get some form of feature names. Or are you saying that if we enable pandas output we would also go to pandas for sparse, and then have another option to move to a different format?

I just worry that because pandas' support for very wide sparse datasets is so poor, people will be surprised / annoyed at ...
Let's wait for benchmark results from scikit-learn/scikit-learn#16772 to decide. The question is not so much the performance difference between pandas sparse and scipy sparse for constructors, but how much this would affect real-world pipelines. <10% overhead to get feature names might be OK; x10 in some use cases is definitely not.

I think returning some custom object, or a pydata/sparse array wrapped into an xarray, would result in as much surprise / annoyance for users not familiar with either.

One thing that I didn't realize before [maybe it was discussed above and I just forgot?] is that deciding the output type based on the input type is not possible for CountVectorizer, which will be one of the main use cases. @thomasjpfan implemented the "we always output ..." option, which produces array/dataframe/xarray depending on the configuration setting, independent of the input.
What are the next steps here?

Given the developments in scikit-learn/scikit-learn#16772, this would amount to a new SLEP which describes ...

I understand that this SLEP is aimed at transformers, but why not consider predictors as well? Getting a pandas DataFrame out is convenient, it clarifies the meaning of the output columns, and it preserves the index (row names). For classifiers, we already have the ... For regressors, the output feature names could be the same as the input target names. This would require storing the target names when calling ... Even when a method has a single output feature, the benefits are still 1. convenience, 2. clarity, and 3. preserving the index: ...
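A hypothetical sketch of what that could look like for a classifier (my own example, not an existing scikit-learn API; here the DataFrame is built by hand from predict_proba, classes_, and the input's index):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    X = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0]}, index=["a", "b", "c", "d"])
    y = pd.Series(["cat", "cat", "dog", "dog"], index=X.index)

    clf = LogisticRegression().fit(X, y)

    # What a DataFrame-returning predict_proba could give us: class names as
    # columns, and the original row index preserved.
    proba = pd.DataFrame(clf.predict_proba(X), columns=clf.classes_, index=X.index)
    print(proba)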
This should be moved to rejected as we accepted SLEP 18, right?
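For reference, a minimal sketch of the API that was eventually adopted through SLEP 018 (set_output, available in scikit-learn >= 1.2), which plays the role the pandas_inout flag was proposed for here:

    import pandas as pd
    from sklearn import set_config
    from sklearn.preprocessing import StandardScaler

    set_config(transform_output="pandas")  # global opt-in, analogous to pandas_inout

    X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
    X_trans = StandardScaler().fit_transform(X)
    print(type(X_trans).__name__)      # DataFrame
    print(X_trans.columns.tolist())    # ['a', 'b']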
In ...

Let's merge this then.
This SLEP proposes pandas in pandas out as an alternative to SLEP 012 InputArray.