BUG: DataFrame.at setter of categorical DF overwrites entire row #37763
Comments
What's the expected output with |
Yes, the problem is only with categorical dtype. The Expected Output section compares it to two scenarios that work well, i.e. pre-initialized or non-dtype. I could also have said: "Expected Output: DataFrame.at setter sets a single item, not the entire row." |
Setting with
>>> df = pd.DataFrame(index=range(3), columns=range(3), dtype=pd.CategoricalDtype(['foo', 'bar']))
>>> df.at[0,0] = 'foo'
>>> df[0]._values is df[2]._values
True
When setting with
# set the item, possibly having a dtype change
ser = ser.copy()
ser._mgr = ser._mgr.setitem(indexer=pi, value=v)
ser._maybe_update_cacher(clear=True) |
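Yes, I'd say this follows from two things:
- when your DF is instantiated, only one "nan array" is created, which will be associated with all its columns, as they were provided no data (here: https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/construction.py#L271-L272),
- in the case of categorical columns, the block manager keeps one block per column, which will result in underlying values being the same object (the unique "nan array") (here: https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/managers.py#L1771-L1776).
I'm tempted to say the way to correct this would be to create one "nan array" per empty column, but there may be a memory-efficient solution. |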
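The aliasing this comment describes can be illustrated without pandas. The following is only a toy model in plain Python: the helper names (make_columns_shared, make_columns_independent, set_item) are hypothetical and do not correspond to anything in the pandas internals. It shows why reusing one placeholder object for every empty column makes a single-cell write appear in the whole row, and why one placeholder per column avoids that.

```python
NAN = float("nan")


def make_columns_shared(n_rows, n_cols):
    # One placeholder list reused for every column, loosely mirroring
    # the single "nan array" shared by all empty columns.
    placeholder = [NAN] * n_rows
    return [placeholder for _ in range(n_cols)]


def make_columns_independent(n_rows, n_cols):
    # One fresh placeholder list per column (the proposed correction).
    return [[NAN] * n_rows for _ in range(n_cols)]


def set_item(columns, row, col, value):
    # In-place write into a single cell, analogous to df.at[row, col] = value.
    columns[col][row] = value


shared = make_columns_shared(3, 3)
set_item(shared, 0, 0, "foo")
row0_shared = [col[0] for col in shared]

independent = make_columns_independent(3, 3)
set_item(independent, 0, 0, "foo")
row0_independent = [col[0] for col in independent]

print(row0_shared)       # the write is visible in every column of row 0
print(row0_independent)  # only column 0 holds the new value
```

The per-column variant costs one extra allocation per empty column, which is the memory trade-off raised above.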
to create one "nan array" per empty column, but there may be a
memory-efficient solution.
In what situation is it necessary to spare memory temporarily, until the
elements of the DataFrame are set? When the user asked for a categorical
dtype for the whole DF, it's reasonable to assume that all columns will be
used as such, no?
Luis Pinto <[email protected]> wrote on Mon, 16 Nov 2020, 10:29:
Yes, I'd say this follows from two things:
- when your DF is instantiated, only one "nan array" is created, which will be associated with all its columns, as they were provided no data (here: https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/construction.py#L271-L272),
- in the case of categorical columns, the block manager keeps one block per column, which will result in underlying values being the same object (the unique "nan array") (here: https://github.com/pandas-dev/pandas/blob/c77fc357efbe9443ad8e32b9ce34599e61bcd3f5/pandas/core/internals/managers.py#L1771-L1776).
I'm tempted to say the way to correct this would be to create one "nan array" per empty column, but there may be a memory-efficient solution.
|
@treszkai I do agree with you. But I'm not versed enough in the library's internals to feel confident there are no subtleties I'm missing :) |
This issue seems to have been solved by #37355.
|
would take a validation test to close this issue |
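A validation test along these lines might look like the sketch below. It is not the test that was added to pandas; the function names are hypothetical, and it assumes a pandas version in which the fix is present. It builds the empty categorical frame from the report, writes one cell with DataFrame.at, and checks that no other cell changed.

```python
import pandas as pd


def build_empty_categorical_frame():
    # Same construction as in the report: no data, categorical dtype for all columns.
    dtype = pd.CategoricalDtype(["foo", "bar"])
    return pd.DataFrame(index=range(3), columns=range(3), dtype=dtype)


def test_at_sets_single_cell():
    df = build_empty_categorical_frame()
    df.at[0, 0] = "foo"

    # The targeted cell holds the new value ...
    assert df.at[0, 0] == "foo"
    # ... and it is the only non-missing cell in the whole frame.
    assert int(df.notna().sum().sum()) == 1
    # In particular, the rest of row 0 is still missing.
    assert df.loc[0, [1, 2]].isna().all()


test_at_sets_single_cell()
```

On a version that still has the bug, the second assertion would be expected to fail, because the write would show up in every column of row 0.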
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
On a DataFrame with categorical dtype, df.at[x, y] = v sets all non-initialized values in row x.
Code Sample, a copy-pastable example
It doesn't overwrite values that have been set with df.loc:
Problem description
df.at[x, y] = v on a categorical dtype should behave as with other dtypes, and the same as df.loc[x, y] = v.
Expected Output
The same as what happens with a DF initialized with Nones:
Or as with dtype=float:
Output of pd.show_versions()
INSTALLED VERSIONS