BUG: DataFrame.at setter of categorical DF overwrites entire row #37763

treszkai · 2020-11-11T19:42:40Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

On a DataFrame with categorical dtype, df.at[x,y] = v sets all non-initialized values in row x.

Code Sample, a copy-pastable example

$ python
Python 3.8.0 (default, Oct 28 2019, 16:14:01) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> pd.__version__
'1.2.0.dev0+1137.g50b34a4a8'
>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  foo  foo  foo
2  NaN  NaN  NaN

It doesn't overwrite values that have been set with df.loc:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.loc[1,1] = 'bar'  # not necessary, just for demo
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  foo  bar  foo
2  NaN  NaN  NaN

Problem description

df.at[x, y] = v on a categorical dtype should behave as with other dtypes, and the same as df.loc[x, y] = v.
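The expectation stated above can be written as a small check. This is only a sketch: the helper name `empty_categorical_frame` is made up for illustration, and it assumes a pandas build where the problem is fixed (on an affected build the final comparison should raise an `AssertionError`).

```python
import pandas as pd

dtype = pd.CategoricalDtype(["foo", "bar"])


def empty_categorical_frame():
    # Hypothetical helper: same construction as in the report, no data supplied.
    return pd.DataFrame(index=range(3), columns=range(3), dtype=dtype)


# Set the same single cell through the two accessors.
via_at = empty_categorical_frame()
via_at.at[1, 2] = "foo"

via_loc = empty_categorical_frame()
via_loc.loc[1, 2] = "foo"

# Both frames should hold exactly one non-missing value, in the same place.
pd.testing.assert_frame_equal(via_at, via_loc)
```

`pd.testing.assert_frame_equal` is the public comparison helper; it is used here so that a mismatch reports which column differs.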

Expected Output

The same as what happens with a DF initialized with Nones:

>>> df = pd.DataFrame([[None] * 3] * 3, index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.loc[1,1] = 'bar'
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  NaN  bar  foo
2  NaN  NaN  NaN

Or as with dtype=float:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=float)
>>> df.loc[1,1] = 1
>>> df.at[1,2] = 27
>>> df
    0    1     2
0 NaN  NaN   NaN
1 NaN  1.0  27.0
2 NaN  NaN   NaN

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : 50b34a4
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0.dev0+1137.g50b34a4a8
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 50.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


GYHHAHA · 2020-11-15T01:34:52Z

What's the expected output with dtype=float? I think your example returns the right result. @treszkai

treszkai · 2020-11-15T14:24:03Z

Yes, the problem is only with categorical dtype. The Expected Output section compares it to two scenarios that work well, i.e. a pre-initialized DataFrame or a non-categorical dtype. I could also have said "Expected Output: DataFrame.at setter sets a single item, not the entire row."

treszkai · 2020-11-15T22:38:35Z

Setting with df.at calls DataFrame._set_value, where the series = self._get_item_cache(col) line returns a Series with the exact same _values regardless of whether col is 0 or 2:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.at[0,0] = 'foo'
>>> df[0]._values is df[2]._values
True
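The same observation can be made without the private `_values` attribute. The sketch below is an assumption-laden alternative, not what was run in this thread: the function name `columns_share_codes` is hypothetical, and it relies on `Series.array` returning the `Categorical` and on `np.shares_memory` to compare the integer codes behind two columns.

```python
import numpy as np
import pandas as pd


def columns_share_codes(df, left, right):
    """Return True if two categorical columns are backed by the same codes buffer."""
    left_codes = df[left].array.codes
    right_codes = df[right].array.codes
    return np.shares_memory(left_codes, right_codes)


df = pd.DataFrame(
    index=range(3),
    columns=range(3),
    dtype=pd.CategoricalDtype(["foo", "bar"]),
)

# On an affected build this would be expected to report shared memory;
# on a build with the fix, each column should own its buffer.
shared = columns_share_codes(df, 0, 2)
print(shared)
```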

When setting with df.loc, take_split_path in the _iLocIndexer._setitem_with_indexer method will be True (with categorical dtype), so the entire column is copied on every setitem; that's why loc isn't affected.

# set the item, possibly having a dtype change
ser = ser.copy()
ser._mgr = ser._mgr.setitem(indexer=pi, value=v)
ser._maybe_update_cacher(clear=True)

ma3da · 2020-11-16T09:29:32Z

Yes, I'd say this follows from two things:

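- when your DF is instantiated, only one "nan array" is created, which is associated with all of its columns, as they were provided no data (see https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/construction.py#L271-L272),
- in the case of categorical columns, the block manager keeps one block per column, so the underlying values of every column are the same object (the unique "nan array") (see https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/managers.py#L1771-L1776).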

I'm tempted to say the way to correct this would be to create one "nan array" per empty column, but there may be a memory-efficient solution.
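The aliasing described above can be illustrated outside pandas. The following is an analogy using plain NumPy arrays in a list, not the pandas internals themselves; the variable names are made up for the example.

```python
import numpy as np

n_rows, n_cols = 3, 3

# One array object referenced three times: every "column" is the same buffer.
shared_columns = [np.full(n_rows, np.nan)] * n_cols
shared_columns[2][1] = 1.0  # write to row 1 of "column 2" only
row_shared = [float(col[1]) for col in shared_columns]

# One array per column: the write stays in the column it was aimed at.
separate_columns = [np.full(n_rows, np.nan) for _ in range(n_cols)]
separate_columns[2][1] = 1.0
row_separate = [float(col[1]) for col in separate_columns]

print(row_shared)    # [1.0, 1.0, 1.0]
print(row_separate)  # [nan, nan, 1.0]
```

The second construction is the "one nan array per empty column" idea: it costs one allocation per column up front, in exchange for writes that cannot leak across columns.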

treszkai · 2020-11-16T12:03:51Z

> to create one "nan array" per empty column, but there may be a memory-efficient solution.

In what situation is it necessary to spare memory temporarily, until the elements of the DataFrame are set? When the user asked for a categorical dtype for the whole DF, it's reasonable to assume that all columns will be used as such, no?

ma3da · 2020-11-16T12:19:04Z

@treszkai I do agree with you. But I'm not versed enough in the library's internals to feel confident there are no subtleties I'm missing :)

ma3da · 2020-11-26T08:12:31Z

This issue seems to have been solved by #37355.
On master:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+1320.g28634289c'

In [3]: df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))

In [4]: df.at[1,2] = 'foo'

In [5]: df
Out[5]: 
     0    1    2
0  NaN  NaN  NaN
1  NaN  NaN  foo
2  NaN  NaN  NaN

jreback · 2020-11-26T13:01:17Z

would take a validation test to close this issue
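A validation test along these lines might look like the sketch below. It is not the test that was merged for this issue; the function name and the way the expected frame is built are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def test_at_setter_categorical_sets_single_cell():
    # Hypothetical regression test for the behaviour reported in this issue.
    categories = ["foo", "bar"]
    df = pd.DataFrame(
        index=range(3),
        columns=range(3),
        dtype=pd.CategoricalDtype(categories),
    )

    df.at[1, 2] = "foo"

    # Expected: only the cell at row 1, column 2 is set; everything else is missing.
    expected = pd.DataFrame(
        {
            0: pd.Categorical([np.nan, np.nan, np.nan], categories=categories),
            1: pd.Categorical([np.nan, np.nan, np.nan], categories=categories),
            2: pd.Categorical([np.nan, "foo", np.nan], categories=categories),
        }
    )
    pd.testing.assert_frame_equal(df, expected)


test_at_setter_categorical_sets_single_cell()
```

Each expected column is built as its own `Categorical`, so the expected frame cannot suffer from the shared-array problem it is meant to detect.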

treszkai added the Bug and Needs Triage labels Nov 11, 2020

ma3da added a commit to ma3da/pandas that referenced this issue Nov 26, 2020: TST: DataFrame.at on categorical with missing (pandas-dev#37763), fd175db

ma3da mentioned this issue Nov 26, 2020 in TST : Categorical DataFrame.at overwritting row #38085 (Merged)

jreback added this to the 1.2 milestone Nov 26, 2020

jreback added the Categorical and Indexing labels and removed the Needs Triage label Nov 26, 2020

jreback closed this as completed in #38085 Nov 26, 2020
