
BUG: DataFrame.at setter of categorical DF overwrites entire row #37763


Closed
3 tasks done
treszkai opened this issue Nov 11, 2020 · 8 comments · Fixed by #38085
Labels
Bug; Categorical (Categorical Data Type); Indexing (Related to indexing on series/frames, not to indexes themselves)

@treszkai

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


On a DataFrame with categorical dtype, df.at[x,y] = v sets all non-initialized values in row x.

Code Sample, a copy-pastable example

$ python
Python 3.8.0 (default, Oct 28 2019, 16:14:01) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.2.0.dev0+1137.g50b34a4a8'
>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  foo  foo  foo
2  NaN  NaN  NaN

It doesn't overwrite values that have been set with df.loc:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.loc[1,1] = 'bar'  # not necessary, just for demo
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  foo  bar  foo
2  NaN  NaN  NaN

Problem description

df.at[x, y] = v on a categorical dtype should behave as with other dtypes, and the same as df.loc[x, y] = v.

Expected Output

The same as what happens with a DF initialized with Nones:

>>> df = pd.DataFrame([[None] * 3] * 3, index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.loc[1,1] = 'bar'
>>> df.at[1,2] = 'foo'
>>> df
     0    1    2
0  NaN  NaN  NaN
1  NaN  bar  foo
2  NaN  NaN  NaN

Or as with dtype=float:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=float)
>>> df.loc[1,1] = 1
>>> df.at[1,2] = 27
>>> df
    0    1     2
0 NaN  NaN   NaN
1 NaN  1.0  27.0
2 NaN  NaN   NaN

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
  • commit : 50b34a4
  • python : 3.8.0.final.0
  • python-bits : 64
  • OS : Linux
  • OS-release : 5.4.0-52-generic
  • Version : #57~18.04.1-Ubuntu SMP Thu Oct 15 14:04:49 UTC 2020
  • machine : x86_64
  • processor : x86_64
  • byteorder : little
  • LC_ALL : None
  • LANG : en_US.UTF-8
  • LOCALE : en_US.UTF-8
  • pandas : 1.2.0.dev0+1137.g50b34a4a8
  • numpy : 1.19.4
  • pytz : 2020.4
  • dateutil : 2.8.1
  • pip : 20.2.4
  • setuptools : 50.3.2
  • Cython : None
  • pytest : None
  • hypothesis : None
  • sphinx : None
  • blosc : None
  • feather : None
  • xlsxwriter : None
  • lxml.etree : None
  • html5lib : None
  • pymysql : None
  • psycopg2 : None
  • jinja2 : None
  • IPython : None
  • pandas_datareader: None
  • bs4 : None
  • bottleneck : None
  • fsspec : None
  • fastparquet : None
  • gcsfs : None
  • matplotlib : None
  • numexpr : None
  • odfpy : None
  • openpyxl : None
  • pandas_gbq : None
  • pyarrow : None
  • pyxlsb : None
  • s3fs : None
  • scipy : None
  • sqlalchemy : None
  • tables : None
  • tabulate : None
  • xarray : None
  • xlrd : None
  • xlwt : None
  • numba : None
treszkai added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 11, 2020
@GYHHAHA
Contributor

GYHHAHA commented Nov 15, 2020

What's the expected output with dtype=float? I think your example returns the right result. @treszkai

@treszkai
Author

Yes, the problem is only with categorical dtype. The Expected Output section compares it to two scenarios that work well, i.e. a pre-initialized DataFrame or a non-categorical dtype. I could also have said: "Expected Output: DataFrame.at setter sets a single item, not the entire row."

@treszkai
Author

Setting with df.at calls DataFrame._set_value, which on the series = self._get_item_cache(col) line gets a Series with the exact same _values regardless of whether col is 0 or 2:

>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.at[0,0] = 'foo'
>>> df[0]._values is df[2]._values
True

When setting with df.loc, take_split_path in the _iLocIndexer._setitem_with_indexer method will be true (with categorical dtype), so the entire column is copied on every setitem. That's why loc isn't affected.

# set the item, possibly having a dtype change
ser = ser.copy()
ser._mgr = ser._mgr.setitem(indexer=pi, value=v)
ser._maybe_update_cacher(clear=True)
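
To make the aliasing described above concrete, here is a toy model in plain Python. It is only a sketch and does not use pandas internals: the "frame" is a dict of lists, and the helper names set_in_place and set_with_copy are made up for illustration. It assumes that every empty column starts out as the same underlying object, as observed with _values above.

```python
# Toy model (not pandas internals): a "frame" is a dict mapping column -> list.
N_ROWS, N_COLS = 3, 3


def make_aliased_frame():
    """Build a frame whose columns all point at one shared buffer."""
    shared = [None] * N_ROWS
    return {col: shared for col in range(N_COLS)}


def set_in_place(frame, row, col, value):
    """Roughly what the `at` path does: mutate the cached column's buffer."""
    frame[col][row] = value


def set_with_copy(frame, row, col, value):
    """Roughly what the `loc` split path does: copy the column, then set."""
    column = list(frame[col])  # copy breaks the aliasing for this column
    column[row] = value
    frame[col] = column


# In-place set on an aliased frame: the write shows up in every column.
frame = make_aliased_frame()
set_in_place(frame, 1, 2, "foo")
row_after_at = [frame[col][1] for col in range(N_COLS)]

# Copy-then-set first: column 1 gets its own buffer, columns 0 and 2 still share.
frame = make_aliased_frame()
set_with_copy(frame, 1, 1, "bar")
set_in_place(frame, 1, 2, "foo")
row_after_loc = [frame[col][1] for col in range(N_COLS)]

print(row_after_at)   # ['foo', 'foo', 'foo']
print(row_after_loc)  # ['foo', 'bar', 'foo']
```

The two printed rows match row 1 of the two buggy outputs in the original report.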

@ma3da
Contributor

ma3da commented Nov 16, 2020

Yes, I'd say this follows from two things:

  • when your DF is instantiated, only one "nan array" is created, which will be associated with all its columns, as they were provided no data,
  • in the case of categorical columns, the block manager keeps one block per column, which will result in underlying values being the same object (the unique "nan array").

I'm tempted to say the way to correct this would be to create one "nan array" per empty column, but there may be a memory-efficient solution.
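
The "one nan array per empty column" idea can also be applied from user code. The following is a sketch of a possible workaround, not the fix that went into pandas: it assumes that building each column from its own pd.Categorical gives every column a separate buffer, so a df.at write cannot leak into the neighbours.

```python
import pandas as pd

dtype = pd.CategoricalDtype(["foo", "bar"])

# One independent all-missing Categorical per column, instead of relying on
# DataFrame(index=..., columns=..., dtype=...) to allocate the empty columns.
df = pd.DataFrame(
    {col: pd.Categorical([None] * 3, dtype=dtype) for col in range(3)},
    index=range(3),
)

df.at[1, 2] = "foo"

# Only the targeted cell should be set; the other eight stay missing.
n_missing = int(df.isna().sum().sum())
print(df)
print(n_missing)
```

The cost is one allocation per column, which is the memory trade-off mentioned above.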

@treszkai
Author

treszkai commented Nov 16, 2020 via email

@ma3da
Contributor

ma3da commented Nov 16, 2020

@treszkai I do agree with you. But I'm not versed enough in the library's internals to feel confident there are no subtleties I'm missing :)

@ma3da
Contributor

ma3da commented Nov 26, 2020

This issue seems to have been solved by #37355.
On master:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.2.0.dev0+1320.g28634289c'

In [3]: df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))

In [4]: df.at[1,2] = 'foo'

In [5]: df
Out[5]: 
     0    1    2
0  NaN  NaN  NaN
1  NaN  NaN  foo
2  NaN  NaN  NaN

@jreback
Contributor

jreback commented Nov 26, 2020

would take a validation test to close this issue
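
A validation test along these lines might look like the sketch below. It is not the test that was merged in #38085; the function name is hypothetical, and it uses the public pd.testing.assert_frame_equal helper rather than the pandas test suite's own fixtures.

```python
import pandas as pd


def test_at_setter_categorical_sets_single_cell():
    # Hypothetical test name; reproduces the example from the report.
    dtype = pd.CategoricalDtype(["foo", "bar"])
    df = pd.DataFrame(index=range(3), columns=range(3), dtype=dtype)

    df.at[1, 2] = "foo"

    # Expected: only the cell at row 1, column 2 is set.
    expected = pd.DataFrame(
        [
            [None, None, None],
            [None, None, "foo"],
            [None, None, None],
        ],
        index=range(3),
        columns=range(3),
        dtype=dtype,
    )
    pd.testing.assert_frame_equal(df, expected)
```

On a build that still has the bug, the assertion would fail because row 1 comes back as foo, foo, foo.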

ma3da added a commit to ma3da/pandas that referenced this issue Nov 26, 2020
jreback added this to the 1.2 milestone Nov 26, 2020
jreback added the Categorical (Categorical Data Type) and Indexing (Related to indexing on series/frames, not to indexes themselves) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Nov 26, 2020