Commit 23aced5
thomasjpfan, jjerphan, and lesteve authored
VOTE SLEP018 - Pandas Output for Transformers (#72)
Co-authored-by: Julien Jerphanion
Co-authored-by: Loïc Estève
1 parent 9884504 commit 23aced5

2 files changed: +20 -13 lines changed

index.rst

Lines changed: 1 addition & 1 deletion
@@ -14,14 +14,14 @@
     slep007/proposal
     slep009/proposal
     slep010/proposal
+    slep018/proposal

 .. toctree::
     :maxdepth: 1
     :caption: Under review

     slep012/proposal
     slep013/proposal
-    slep018/proposal

 .. toctree::
     :maxdepth: 1

slep018/proposal.rst

Lines changed: 19 additions & 12 deletions
@@ -5,7 +5,7 @@ SLEP018: Pandas Output for Transformers with set_output
 =======================================================

 :Author: Thomas J. Fan
-:Status: Draft
+:Status: Accepted
 :Type: Standards Track
 :Created: 2022-06-22

@@ -22,7 +22,7 @@ Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
 matrices. This SLEP proposes adding a ``set_output`` method to configure a
 transformer to output pandas DataFrames::

-    scalar = StandardScalar().set_output(transform="pandas")
+    scalar = StandardScaler().set_output(transform="pandas")
     scalar.fit(X_df)

     # X_trans_df is a pandas DataFrame
@@ -37,20 +37,26 @@ sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise
 ``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output
 might be the scope of another future SLEP.

-For a pipeline, calling ``set_output`` on the pipeline will configure all steps
-in the pipeline::
+For a pipeline, calling ``set_output`` will configure all inner transformers and
+does not configure non-transformers. This enables the following workflow::

-    num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA())
-    num_preprocessor.set_output(transform="pandas")
+    log_reg = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
+    log_reg.set_output(transform="pandas")
+
+    # All transformers return DataFrames during fit
+    log_reg.fit(X_df, y)

     # X_trans_df is a pandas DataFrame
-    X_trans_df = num_preprocessor.fit_transform(X_df)
+    X_trans_df = log_reg[:-1].transform(X_df)

     # X_trans_df is again a pandas DataFrame
-    X_trans_df = num_preprocessor[0].transform(X_df)
+    X_trans_df = log_reg[0].transform(X_df)
+
+    # The classifier contains the feature names in
+    log_reg[-1].feature_names_in_

 Meta-estimators that support ``set_output`` are required to configure all inner
-transformer by calling ``set_output``. Specifically all fitted and non-fitted
+transformers by calling ``set_output``. Specifically all fitted and non-fitted
 inner transformers must be configured with ``set_output``. This enables
 ``transform``'s output to be a DataFrame before and after the meta-estimator is
 fitted. If an inner transformer does not define ``set_output``, then an error is
@@ -74,7 +80,7 @@ manager::

     from sklearn import config_context
     with config_context(transform_output="pandas"):
-        num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA())
+        num_prep = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
         num_preprocessor.fit_transform(X_df)

 The following specifies the precedence levels for the three ways to configure
@@ -117,8 +123,9 @@ A list of issues discussing Pandas output are: `#14315
 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. This SLEP
 proposes configuring the output to be pandas because it is the DataFrame library
-that is most widely used and requested by users. The ``set_output`` can be
-extended to support support additional DataFrame libraries in the future.
+that is most widely used and requested by users. The ``set_output`` API can be
+extended to support additional DataFrame libraries and sparse data formats in
+the future.

 References and Footnotes
 ------------------------
