
BUG: String upgraded to complex128 (and cascading to other columns) in df.mean() and df.agg('mean') #36703



Closed
2 of 3 tasks
jtkiley opened this issue Sep 28, 2020 · 6 comments · Fixed by #52281
Labels: Bug · Dtype Conversions (unexpected or buggy dtype conversions) · Reduction Operations (sum, mean, min, max, etc.) · Strings (string extension data type and string data)

Comments

@jtkiley (Contributor) commented Sep 28, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.1.1; latest from conda).

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
df2 = pd.DataFrame([{'db': 'J'}])
df3 = pd.DataFrame([{'db': 'j'}])
df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
                    {'db': 'J', 'numeric': 456}])
df5 = pd.DataFrame([{'db': 'J'}], dtype='string')

# Initial columns are _correct_ types
df.dtypes

# complex128 and across columns
df.mean()

# agg('mean') is the same.
df.agg('mean')

# Happens even when the column at issue is alone.
df2.mean()

# Case of the 'J' doesn't matter.
df3.mean()

# numeric_only=True works as expected.
df.mean(numeric_only=True)

# Two rows work as expected, too.
df4.mean()

# The new StringDtype doesn't appear to matter.
np.mean(df['db'].astype('string').array)
type(np.mean(df['db'].astype('string').array))
df['db'].astype('string').dtype
df5.mean()

# Happens in PandasArray.
np.mean(df['db'].array)
type(np.mean(df['db'].array))

# Not in a numpy array.
np.mean(df['db'].to_numpy())

# Also happens in a Series.
df['db'].mean()
type(df['db'])

# The conversion happens in nanops.nanmean().
pd.core.nanops.nanmean(df['db'])

# Doesn't look like _get_values() is to blame.
pd.core.nanops._get_values(df['db'], True)[0].mean()

# The sum doesn't appear to be it, either.
values = pd.core.nanops._get_values(df['db'], True)[0]
dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
type(values.sum(None, dtype=dtype_sum))

# The conversion happens in pd.core.nanops._ensure_numeric().
pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))

# Interestingly enough, it looks like a sum somewhere is concatenating
# the two 'J's together.
pd.Series(['J', 'J']).astype('complex128').mean()
pd.Series(['J', 'J']).mean()
pd.Series(['J']).mean()

# Here's an example of that concatenation. Then, that sum can't be cast.
pd.core.nanops._get_values(df4['db'], True)[0].sum()

With output:

>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([{'db': 'J', 'numeric': 123}])
>>> df2 = pd.DataFrame([{'db': 'J'}])
>>> df3 = pd.DataFrame([{'db': 'j'}])
>>> df4 = pd.DataFrame([{'db': 'J', 'numeric': 123},
...                     {'db': 'J', 'numeric': 456}])
>>> df5 = pd.DataFrame([{'db': 'J'}], dtype='string')
>>>
>>> # Initial columns are _correct_ types
>>> df.dtypes
db         object
numeric     int64
dtype: object
>>>
>>> # complex128 and across columns
>>> df.mean()
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # agg('mean') is the same.
>>> df.agg('mean')
db           0.000000+1.000000j
numeric    123.000000+0.000000j
dtype: complex128
>>>
>>> # Happens even when the column at issue is alone.
>>> df2.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # Case of the 'J' doesn't matter.
>>> df3.mean()
db    0.000000+1.000000j
dtype: complex128
>>>
>>> # numeric_only=True works as expected.
>>> df.mean(numeric_only=True)
numeric    123.0
dtype: float64
>>>
>>> # Two rows work as expected, too.
>>> df4.mean()
numeric    289.5
dtype: float64
>>> # The new StringDtype doesn't appear to matter.
>>> np.mean(df['db'].astype('string').array)
1j
>>> type(np.mean(df['db'].astype('string').array))
<class 'complex'>
>>> df['db'].astype('string').dtype
StringDtype
>>> df5.mean()
Series([], dtype: float64)
>>>
>>> # Happens in PandasArray.
>>> np.mean(df['db'].array)
1j
>>> type(np.mean(df['db'].array))
<class 'complex'>
>>>
>>> # Not in a numpy array.
>>> np.mean(df['db'].to_numpy())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3372, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # Also happens in a Series.
>>> df['db'].mean()
1j
>>> type(df['db'])
<class 'pandas.core.series.Series'>
>>>
>>> # The conversion happens in nanops.nanmean().
>>> pd.core.nanops.nanmean(df['db'])
1j
>>>
>>> # Doesn't look like _get_values() is to blame.
>>> pd.core.nanops._get_values(df['db'], True)[0].mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/numpy/core/_methods.py", line 172, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'str' and 'int'
>>>
>>> # The sum doesn't appear to be it, either.
>>> values = pd.core.nanops._get_values(df['db'], True)[0]
>>> dtype_sum = pd.core.nanops._get_values(df['db'], True)[2]
>>> type(values.sum(None, dtype=dtype_sum))
<class 'str'>
>>>
>>> # The conversion happens in pd.core.nanops._ensure_numeric().
>>> pd.core.nanops._ensure_numeric(values.sum(None, dtype=dtype_sum))
1j
>>>
>>> # Interestingly enough, it looks like a sum somewhere is concatenating
>>> # the two 'J's together.
>>> pd.Series(['J', 'J']).astype('complex128').mean()
1j
>>> pd.Series(['J', 'J']).mean()
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1427, in _ensure_numeric
    x = float(x)
ValueError: could not convert string to float: 'JJ'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1431, in _ensure_numeric
    x = complex(x)
ValueError: complex() arg is a malformed string

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/generic.py", line 11459, in stat_func
    return self._reduce(
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/series.py", line 4236, in _reduce
    return op(delegate, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 71, in _f
    return f(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 129, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 563, in nanmean
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
  File "/usr/local/Caskroom/miniconda/base/lib/python3.8/site-packages/pandas/core/nanops.py", line 1434, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err
TypeError: Could not convert JJ to numeric
>>> pd.Series(['J']).mean()
1j
>>>
>>> # Here's an example of that concatenation. Then, that sum can't be cast.
>>> pd.core.nanops._get_values(df4['db'], True)[0].sum()
'JJ'

Problem description

It seems like df.mean() is being too aggressive about attempting to work with string columns in a one-row dataframe. In my use case, if a query result has one row, I hit this bug. I have lots of these queries, and even one affected query flows through to the concatenated dataframe downstream.

I first noticed this when I got an error that I couldn't write a parquet file with a complex128 column, and tracked it back from there.

Expected Output

Expected output is what we get with numeric_only=True set or more than one row. See above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : None
pytest : 6.0.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@jtkiley jtkiley added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 28, 2020
@rhshadrach rhshadrach added Numeric Operations Arithmetic, Comparison, and Logical operations Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 28, 2020
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Sep 28, 2020
@rhshadrach (Member)

Thanks for the report, PRs to fix are welcome!

@jorisvandenbossche jorisvandenbossche added Reduction Operations sum, mean, min, max, etc. and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Sep 29, 2020
@jtkiley (Contributor, Author) commented Sep 29, 2020

I just updated the code examples with more exploration of where this actually happens. It's in pd.core.nanops._ensure_numeric().

pandas/pandas/core/nanops.py, lines 1409 to 1435 at 2a7d332:

def _ensure_numeric(x):
    if isinstance(x, np.ndarray):
        if is_integer_dtype(x) or is_bool_dtype(x):
            x = x.astype(np.float64)
        elif is_object_dtype(x):
            try:
                x = x.astype(np.complex128)
            except (TypeError, ValueError):
                try:
                    x = x.astype(np.float64)
                except ValueError as err:
                    # GH#29941 we get here with object arrays containing strs
                    raise TypeError(f"Could not convert {x} to numeric") from err
            else:
                if not np.any(np.imag(x)):
                    x = x.real
    elif not (is_float(x) or is_integer(x) or is_complex(x)):
        try:
            x = float(x)
        except ValueError:
            # e.g. "1+1j" or "foo"
            try:
                x = complex(x)
            except ValueError as err:
                # e.g. "foo"
                raise TypeError(f"Could not convert {x} to numeric") from err
    return x

If what gets passed in has object dtype, it tries to convert it to complex128. That works with 'J', but not with 'JJ' (which is what the sum returns in the two-row example; I'm not sure that's expected, either).

It's not exactly clear to me where the best place to fix it would be. The logic for some of the other functions works differently (often making sure that something can be cast to a float). The standard library complex() function performs the same conversion for the string 'J', though there it's an explicit request.
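
For illustration, here is the conversion behavior in isolation (a quick interactive check, not a proposed change; the astype call mirrors the ndarray branch quoted above):

import numpy as np

# The object-dtype ndarray branch tries a complex128 cast first; 'J'/'j' parse
# as the imaginary unit, so a lone 'J' silently becomes 1j.
np.array(['J'], dtype=object).astype(np.complex128)  # array([0.+1.j])

# The scalar branch falls back from float() to complex(), with the same result.
complex('J')  # 1j

try:
    complex('JJ')  # two concatenated 'J's no longer parse
except ValueError as err:
    print(err)     # complex() arg is a malformed string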

I think the problem is that this behavior is implicit, and I wonder whether complex numbers are a common enough use case to justify implicitly casting strings without a warning (cf. the one for datetime means). Because this runs on the result of sum, it often receives a concatenated string that won't convert, so the behavior is inconsistent and becomes less likely to cast as the number of rows increases.

Thoughts on where/how to fix it?

@biddwan09 (Contributor)

take

@jtkiley (Contributor, Author) commented Oct 19, 2020

@biddwan09 I poked around a bit more, and here are a couple of thoughts:

  1. I think the best fix may be to address the issue where sum concatenates the strings. That seems like unexpected behavior; I'd expect a reduction on a dataframe to simply not return a column for that string column. I couldn't quite track down how the sum method is implemented to experiment more, but the behavior I describe above doesn't happen with other functions, so this may be the most limited change available.
  2. An alternative would be to gate that complex cast on something less ambiguous. For example, you might require the string to have a digit and a j, or, more restrictively, something like \d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]. That moves it out of alignment with the standard library, but this is both implicit and pretty deep in the weeds, so I think there's a fair rationale for doing so (see the rough sketch after this list).
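
For concreteness, here is a rough sketch of what option 2 could look like on the scalar path; the helper name and the exact pattern are placeholders for illustration, not a concrete patch against nanops:

import re

# Hypothetical stricter gate before the complex() fallback: only strings that
# actually look like complex literals (per the pattern suggested above) get
# cast, so a bare 'J' raises instead of silently becoming 1j.
_COMPLEX_STR = re.compile(r"^\d+(?:\.\d+)?[+-]\d+(?:\.\d+)?[jJ]$")


def _ensure_numeric_scalar_sketch(x):
    if isinstance(x, (int, float, complex)):
        return x
    try:
        return float(x)
    except (TypeError, ValueError):
        if isinstance(x, str) and _COMPLEX_STR.match(x.strip()):
            return complex(x)
        raise TypeError(f"Could not convert {x} to numeric")


# _ensure_numeric_scalar_sketch('3+4j')  -> (3+4j)
# _ensure_numeric_scalar_sketch('J')     -> TypeError: Could not convert J to numeric
# _ensure_numeric_scalar_sketch('JJ')    -> TypeError: Could not convert JJ to numeric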

I hope that helps!

@biddwan09 (Contributor)

@jtkiley thanks for the approaches. I will try to understand how the sum method works and put together a fix from there.

@mroeschke mroeschke added the Strings String extension data type and string data label Aug 13, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 13, 2021
@jreback jreback modified the milestones: 1.4, Contributions Welcome Dec 23, 2021
@mgabs commented Jun 30, 2022

I had a similar issue running on an M1 Mac with macOS 12.4:

File ".venv/lib/python3.9/site-packages/pandas/core/nanops.py", line 1629, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric") from err

The issue was resolved when I installed snowflake-connector[pandas], which pulls in pyarrow 6.0.
I can't say I'm sure why this worked around the problem.

I'd welcome feedback from anyone seeing the same issue who managed to work around it the same way.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022