
BUG: String upgraded to complex128 (and cascading to other columns) in df.mean() and df.agg('mean') #36703



Closed
2 of 3 tasks
jtkiley opened this issue Sep 28, 2020 · 6 comments · Fixed by #52281
Labels: Bug · Dtype Conversions (unexpected or buggy dtype conversions) · Reduction Operations (sum, mean, min, max, etc.) · Strings (string extension data type and string data)

Comments

@jtkiley (Contributor) commented Sep 28, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')

# Initial columns are _correct_ types
df.dtypes

# complex128 and across columns
df.mean()

# agg('mean') is the same.
df.agg('mean')

# Happens even when the column at issue is alone.
df2.mean()

# Case of the 'J' doesn't matter.
df3.mean()

# numeric_only=True works as expected.
df.mean(numeric_only=True)

# Two rows work as expected, too.
df4.mean()

# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()

# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))

# Not in a numpy array.
np.mean(df['db'].to_numpy())

# Also happens in a Series.
df['db'].mean()
type(df['db'])

# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])

# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()

# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))

# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))

# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()

# Here's an example of that concatenation. Then, that sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()

With output:

>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
...                     {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are _correct_ types
>>> df.dtypes
db         object
numeric     int64
dtype: object
>>>
>>> # complex128 and across columns
>>> df.mean()
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the column at issue is alone.
>>> df2.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # numeric_only=True works as expected.
>>> df.mean(numeric_only=True)
numeric    123.0
dtype: float64
>>>
>>> # Two rows work as expected, too.
>>> df4.mean()
numeric    289.5
dtype: float64
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'JJ'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
    return self._reduce(
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
    return f(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example of that concatenation. Then, that sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'

Problem description

It seems like df.mean() is being too aggressive about attempting to work with string columns in a one-row dataframe. In my use case, if a query result has one row, I hit this bug. I have lots of these queries, and even one affected query flows through to the concatenated dataframe downstream.

I first noticed this when I got an error that I couldn't write a parquet file with a complex128 column, and tracked it back from there.

Expected Output

Expected output is what we get with numeric_only=True set or more than one row. See above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : None
pytest : 6.0.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@jtkiley jtkiley added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 28, 2020
@rhshadrach rhshadrach added Numeric Operations Arithmetic, Comparison, and Logical operations Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 28, 2020
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Sep 28, 2020
@rhshadrach (Member)

Thanks for the report, PRs to fix are welcome!

@jorisvandenbossche jorisvandenbossche added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 29, 2020
@jtkiley (Contributor, Author) commented Sep 29, 2020

I just updated the code examples with more exploration of where this actually happens. It's in pd.core.nanops._ensure_numeric().

pandas/pandas/core/nanops.py, lines 1409 to 1435 at 2a7d332:

def _ensure_numeric(x):
    if isinstance(x, np.ndarray):
        if is_integer_dtype(x) or is_bool_dtype(x):
            x = x.astype(np.float64)
        elif is_object_dtype(x):
            try:
                x = x.astype(np.complex128)
            except (TypeError, ValueError):
                try:
                    x = x.astype(np.float64)
                except ValueError as err:
                    # GH#29941 we get here with object arrays containing strs
                    raise TypeError(f"Could not convert {x} to numeric") from err
            else:
                if not np.any(np.imag(x)):
                    x = x.real
    elif not (is_float(x) or is_integer(x) or is_complex(x)):
        try:
            x = float(x)
        except ValueError:
            # e.g. "1+1j" or "foo"
            try:
                x = complex(x)
            except ValueError as err:
                # e.g. "foo"
                raise TypeError(f"Could not convert {x} to numeric") from err
    return x

If what gets passed in has object dtype, it tries to convert it to complex128. That works with 'J', but not with 'JJ' (which is what the sum returns in the two-row example; I'm not sure that's expected, either).

It's not exactly clear to me where the best place to fix it would be. The logic for some of the other functions works differently (often making sure that something can be cast to a float). The standard library complex() function performs the same conversion for the string 'J', though there it's an explicit request.
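
For illustration, here is the conversion behavior in isolation (a quick interactive check, not a proposed change; the astype call mirrors the ndarray branch quoted above):

import numpy as np

# The object-dtype ndarray branch tries a complex128 cast first; 'J'/'j' parse
# as the imaginary unit, so a lone 'J' silently becomes 1j.
np.array(['J'], dtype=object).astype(np.complex128)  # array([0.+1.j])

# The scalar branch falls back from float() to complex(), with the same result.
complex('J')  # 1j

try:
    complex('JJ')  # two concatenated 'J's no longer parse
except ValueError as err:
    print(err)     # complex() arg is a malformed string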

I think the problem is that this behavior is implicit, and I wonder whether complex numbers are a common enough use case to justify implicitly casting strings without a warning (cf. the one for datetime means). Because this runs on the result of sum, it often receives a concatenated string that won't convert, so the behavior is inconsistent and becomes less likely to cast as the number of rows increases.

Thoughts on where/how to fix it?

@biddwan09 (Contributor)

take

@jtkiley (Contributor, Author) commented Oct 19, 2020

@biddwan09 I poked around a bit more, and here are a couple of thoughts:

  1. I think the best fix may be to address the issue where sum concatenates the strings. That seems like unexpected behavior; I'd expect a reduction on a dataframe to simply not return a column for that string column. I couldn't quite track down how the sum method is implemented to experiment more, but the behavior I describe above doesn't happen with other functions, so this may be the most limited change available.
  2. An alternative would be to gate that complex cast on something less ambiguous. For example, you might require the string to have a digit and a j, or, more restrictively, something like \d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]. That moves it out of alignment with the standard library, but this is both implicit and pretty deep in the weeds, so I think there's a fair rationale for doing so (see the rough sketch after this list).
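
For concreteness, here is a rough sketch of what option 2 could look like on the scalar path; the helper name and the exact pattern are placeholders for illustration, not a concrete patch against nanops:

import re

# Hypothetical stricter gate before the complex() fallback: only strings that
# actually look like complex literals (per the pattern suggested above) get
# cast, so a bare 'J' raises instead of silently becoming 1j.
_COMPLEX_STR = re.compile(r"^\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]$")


def _ensure_numeric_scalar_sketch(x):
    if isinstance(x, (int, float, complex)):
        return x
    try:
        return float(x)
    except (TypeError, ValueError):
        if isinstance(x, str) and _COMPLEX_STR.match(x.strip()):
            return complex(x)
        raise TypeError(f"Could not convert {x} to numeric")


# _ensure_numeric_scalar_sketch('3+4j')  -> (3+4j)
# _ensure_numeric_scalar_sketch('J')     -> TypeError: Could not convert J to numeric
# _ensure_numeric_scalar_sketch('JJ')    -> TypeError: Could not convert JJ to numeric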

I hope that helps!

@biddwan09 (Contributor)

@jtkiley thanks for the approaches. I will try to understand how the sum method works and put together a fix from there.

@mroeschke mroeschke added the Strings String extension data type and string data label Aug 13, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 13, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021
@mgabs commented Jun 30, 2022

I had a similar issue running on an M1 Mac with macOS 12.4:

File ".venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1629, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err

The issue was resolved when I installed snowflake-connector[pandas], which pulls in pyarrow 6.0.
I can't say I'm sure why this worked around the problem.

I'd welcome feedback from anyone seeing the same issue who managed to work around it the same way.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022