@@ -5,7 +5,7 @@ SLEP018: Pandas Output for Transformers with set_output
 =======================================================
 
 :Author: Thomas J. Fan
-:Status: Draft
+:Status: Accepted
 :Type: Standards Track
 :Created: 2022-06-22
 
@@ -22,7 +22,7 @@ Currently, scikit-learn transformers return NumPy ndarrays or SciPy sparse
 matrices. This SLEP proposes adding a ``set_output`` method to configure a
 transformer to output pandas DataFrames::
 
-    scalar = StandardScalar().set_output(transform="pandas")
+    scalar = StandardScaler().set_output(transform="pandas")
     scalar.fit(X_df)
 
     # X_trans_df is a pandas DataFrame
@@ -37,20 +37,26 @@ sparse data, e.g. ``OneHotEncoder(sparse=True)``, then ``transform`` will raise
 ``ValueError`` if ``set_output(transform="pandas")``. Dealing with sparse output
 might be the scope of another future SLEP.
 
-For a pipeline, calling ``set_output`` on the pipeline will configure all steps
-in the pipeline::
+For a pipeline, calling ``set_output`` will configure all inner transformers and
+does not configure non-transformers. This enables the following workflow::
 
-    num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA())
-    num_preprocessor.set_output(transform="pandas")
+    log_reg = make_pipeline(SimpleImputer(), StandardScaler(), LogisticRegression())
+    log_reg.set_output(transform="pandas")
+
+    # All transformers return DataFrames during fit
+    log_reg.fit(X_df, y)
 
     # X_trans_df is a pandas DataFrame
-    X_trans_df = num_preprocessor.fit_transform(X_df)
+    X_trans_df = log_reg[:-1].transform(X_df)
 
     # X_trans_df is again a pandas DataFrame
-    X_trans_df = num_preprocessor[0].transform(X_df)
+    X_trans_df = log_reg[0].transform(X_df)
+
+    # The classifier contains the feature names in
+    log_reg[-1].feature_names_in_
 
 Meta-estimators that support ``set_output`` are required to configure all inner
-transformer by calling ``set_output``. Specifically all fitted and non-fitted
+transformers by calling ``set_output``. Specifically all fitted and non-fitted
 inner transformers must be configured with ``set_output``. This enables
 ``transform``'s output to be a DataFrame before and after the meta-estimator is
 fitted. If an inner transformer does not define ``set_output``, then an error is
@@ -74,7 +80,7 @@ manager::
 
     from sklearn import config_context
     with config_context(transform_output="pandas"):
-        num_prep = make_pipeline(SimpleImputer(), StandardScalar(), PCA())
+        num_prep = make_pipeline(SimpleImputer(), StandardScaler(), PCA())
         num_preprocessor.fit_transform(X_df)
 
 The following specifies the precedence levels for the three ways to configure
@@ -117,8 +123,9 @@ A list of issues discussing Pandas output are: `#14315
 <https://github.com/scikit-learn/scikit-learn/pull/20100>`__, and `#23001
 <https://github.com/scikit-learn/scikit-learn/issueas/23001>`__. This SLEP
 proposes configuring the output to be pandas because it is the DataFrame library
-that is most widely used and requested by users. The ``set_output`` can be
-extended to support support additional DataFrame libraries in the future.
+that is most widely used and requested by users. The ``set_output`` API can be
+extended to support additional DataFrame libraries and sparse data formats in
+the future.
 
 References and Footnotes
 ------------------------