Calling methods on invalid types #13

datapythonista · 2020-06-06T10:50:10Z

This is a follow-up to the discussions in:

Separate object for a dataframe colum? (is Series needed?) #6 (comment)
Reductions #11 (question: pandas has parameters (bool_only, numeric_only) that restrict an operation to columns of certain types. Do we want them?)

See this example:

>>> df[['name', 'population']].mean()
population    2.729748e+07
dtype: float64

Even though the name column is selected, it is silently ignored instead of raising an exception, since the mean of a string column does not make sense.

Many reductions implement a parameter to control this behavior:

>>> df[['name', 'population']].mean(numeric_only=False)
TypeError: could not convert string to float:
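
To make the two behaviors concrete without depending on a pandas version, here is a minimal sketch in plain Python. It assumes columns are non-empty lists held in a dict; `column_means` and its `numeric_only` flag are hypothetical names that only mirror the pandas parameter, not its implementation.

```python
from numbers import Number


def column_means(columns, numeric_only=True):
    """Mean of each column; `columns` maps column name -> non-empty list.

    Hypothetical helper for illustration, not pandas code.
    numeric_only=True skips non-numeric columns, False raises on them.
    """
    result = {}
    for name, values in columns.items():
        is_numeric = all(isinstance(v, Number) for v in values)
        if not is_numeric:
            if numeric_only:
                continue  # "skip": silently drop the column
            raise TypeError(f"cannot compute the mean of non-numeric column {name!r}")
        result[name] = sum(values) / len(values)
    return result


data = {"name": ["Andorra", "Spain"], "population": [80_000, 47_000_000]}

print(column_means(data))  # only 'population' survives
try:
    column_means(data, numeric_only=False)
except TypeError as exc:
    print("TypeError:", exc)
```

The caller cannot tell from the first result alone that 'name' was dropped, which is the "magic" in question.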

If we consider more methods to be applied directly over a dataframe, for example:

>>> df[['first_name', 'last_name']].str.lower()

We may end up with a huge number of string_only, bool_only, numeric_only parameters. All of them mean something similar, but IMO they add a decent amount of complexity and make it difficult to keep the behavior consistent.

My preference would be to always raise, but being a software engineer I'm biased, and I guess many users may want this "magic".

So, I guess implementing an option, for example: pandas.options.mode.invalid_dtype {raise or skip} could make more sense.

The main problem with this approach is probably that it's not as easy to define the behavior for each operation:

(df.mean(numeric_only=True)
   .mean(numeric_only=False))

Personally, I don't see this as an issue. IMO, the behavior depends more on the user than on the operation. I'd say for production code, having to be explicit, and selecting the columns to operate with, makes more sense. While in a notebook, avoiding exceptions with this sort of "magic" seems to be more useful.

I guess for Series/1-column DataFrame (see #6) it always makes sense to raise an exception.

Thoughts?


TomAugspurger · 2020-06-08T13:17:01Z

> So, I guess implementing an option, for example: pandas.options.mode.invalid_dtype {raise or skip} could make more sense.

I don't think anything in this API can rely on global options. One goal is to allow writing code that works against multiple backends, and global options, I think, defeat that.
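
A rough sketch of the concern, in plain Python. Everything here is hypothetical: `GLOBAL_OPTIONS` is modeled on the proposed pandas.options.mode.invalid_dtype (which is not a real pandas option), and the helper functions exist only for illustration.

```python
from numbers import Number

# Hypothetical process-wide setting, modeled on the proposed
# pandas.options.mode.invalid_dtype; not a real pandas option.
GLOBAL_OPTIONS = {"invalid_dtype": "skip"}


def _means(columns, invalid_dtype):
    """Column means; `invalid_dtype` is 'raise' or 'skip' (illustrative only)."""
    out = {}
    for name, values in columns.items():
        if all(isinstance(v, Number) for v in values):
            out[name] = sum(values) / len(values)
        elif invalid_dtype == "raise":
            raise TypeError(f"non-numeric column {name!r}")
    return out


def means_with_global(columns):
    # Behavior depends on state that the calling code does not show.
    return _means(columns, GLOBAL_OPTIONS["invalid_dtype"])


def means_explicit(columns, invalid_dtype="raise"):
    # Behavior is fully determined by the call itself.
    return _means(columns, invalid_dtype)


data = {"name": ["a", "b"], "population": [1, 3]}

print(means_with_global(data))  # works while the option is "skip"
GLOBAL_OPTIONS["invalid_dtype"] = "raise"
# The identical call means_with_global(data) would now raise TypeError.
print(means_explicit(data, invalid_dtype="skip"))  # unaffected by the global
```

Code written against several backends would need each of them to honor the same global setting, whereas the explicit argument travels with the call.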

It's worth noting that in pandas, this is most painful when it comes to object-dtype columns. That's the case where the reduction / method actually needs to be executed on the values to determine the output columns / dtypes. For the remaining dtypes we know ahead of time what the result metadata will be.
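
As a rough illustration of that difference, the sketch below computes which columns a skip-invalid mean would return. It assumes a schema given as a plain dict of dtype labels; `mean_output_columns` is a hypothetical helper, not part of pandas or of any proposed API.

```python
# dtype labels treated as numeric in this sketch (an assumption)
NUMERIC = {"int64", "float64"}


def mean_output_columns(schema, values=None):
    """Names of the columns a skip-invalid mean would return.

    For non-object dtypes the schema alone decides. For 'object'
    columns the values must be inspected, so `values` is required.
    """
    out = []
    for name, dtype in schema.items():
        if dtype in NUMERIC:
            out.append(name)
        elif dtype == "object":
            if values is None:
                raise ValueError(f"column {name!r} is object dtype: data needed")
            if all(isinstance(v, (int, float)) for v in values[name]):
                out.append(name)
    return out


# Typed columns: the result metadata is known without touching any data.
print(mean_output_columns({"population": "int64", "area": "float64"}))

# Object column: the answer depends on the actual values.
schema = {"name": "object", "population": "int64"}
print(mean_output_columns(schema, {"name": ["a", "b"]}))
print(mean_output_columns(schema, {"name": [1, 2.5]}))
```

With only typed columns the output can be planned lazily; with an object column the same schema gives different results for different data.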

amueller · 2020-06-08T16:01:52Z

In sklearn we basically decided not to do something like that, which makes things somewhat nicer for the devs but certainly somewhat annoying for users. For my use cases I often want to distinguish categorical and continuous data, and how those are determined in a pandas context is often less than clear in practice. However, you could argue that it's up to the user to use the type system to ensure columns have the correct type.

jbrockmendel · 2022-08-26T22:48:17Z

FWIW in pandas we have deprecated all of the foo_only=None cases, with the future defaults all being False for these.
