
Setting value of DataFrame[MultiIndex] via .loc partial indexing fails #22493


Open
holymonson opened this issue Aug 24, 2018 · 8 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@holymonson
Contributor

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

a = pd.DataFrame(np.array(range(8)).reshape(4, 2), pd.MultiIndex.from_product([['a1', 'a2'], ['b1', 'b2']]), ['c1', 'c2'])
a
#        c1  c2
# a1 b1   0   1
#    b2   2   3
# a2 b1   4   5
#    b2   6   7
a.loc['a1']
#     c1  c2
# b1   0   1
# b2   2   3
b = pd.DataFrame(-1, ['b1', 'b2'], ['c1', 'c2'])
b
#     c1  c2
# b1  -1  -1
# b2  -1  -1

a.loc['a1'] = b
a
#         c1   c2
# a1 b1  NaN  NaN
#    b2  NaN  NaN
# a2 b1  4.0  5.0
#    b2  6.0  7.0

# However, setting with ndarray is fine.
a.loc['a1'] = b.values
a
#         c1   c2
# a1 b1 -1.0 -1.0
#    b2 -1.0 -1.0
# a2 b1  4.0  5.0
#    b2  6.0  7.0

Problem description

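Assigning a DataFrame through .loc with partial indexing fills the target rows with NaN instead of the new values, even though the assigned DataFrame has the same columns and index as the selection a.loc['a1'].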

Expected Output

It should work the same as it does with .values.
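
Until label-based assignment works here, one way to get the expected result is to align b by hand and then assign the raw values. This is only a minimal sketch built on the frames from the report; the names target and aligned are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_product([["a1", "a2"], ["b1", "b2"]]),
    columns=["c1", "c2"],
)
b = pd.DataFrame(-1, index=["b1", "b2"], columns=["c1", "c2"])

# Align b against the labels of the selected block ourselves,
# so the positional assignment below cannot silently mix up rows or columns.
target = a.loc["a1"]
aligned = b.reindex(index=target.index, columns=target.columns)

# Assigning a plain ndarray bypasses pandas' own label alignment.
a.loc["a1"] = aligned.to_numpy()
print(a)
```

The reindex step matters only when b is ordered differently from the selection; with identical labels it is a no-op.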

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: zh_CN.UTF-8

pandas: 0.23.4
pytest: 3.7.1
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: 4.2.4
bs4: 4.6.3
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Aug 25, 2018
@gfyoung
Member

gfyoung commented Aug 25, 2018

That indeed does look a little weird!

cc @toobaz

@phofl
Member

phofl commented Nov 9, 2020

I'm not quite sure what is expected here; the alignment does not work properly with this input.

a = pd.DataFrame(np.array(range(8)).reshape(4, 2), pd.MultiIndex.from_product([['a1', 'a2'], ['b1', 'b2']]), ['c1', 'c2'])

print(a.loc['a1'])

b = pd.DataFrame(-1, pd.MultiIndex.from_product([['a1'], ['b1', 'b2']]), ['c1', 'c2'])
print(b)
a.loc['a1'] = b
print(a)

Giving b a matching MultiIndex works:

       c1  c2
a1 b1  -1  -1
   b2  -1  -1
a2 b1   4   5
   b2   6   7

@holymonson
Contributor Author

b = pd.DataFrame(-1, ['b1', 'b2'], ['c1', 'c2'])

b = pd.DataFrame(-1, pd.MultiIndex.from_product([['a1'], ['b1', 'b2']]), ['c1', 'c2'])

@phaebz Here is the difference. a.loc['a1'] is not a MultiIndex DataFrame; it should be viewed as a simple-index DataFrame. So your a.loc['a1'] = b is basically assigning a MultiIndex DataFrame to a simple-index DataFrame view.

Or, from another point of view: a.loc['a1'] has already selected 'a1' on the first index level, so repeating that level in b should be redundant.
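
The distinction can be checked directly by comparing the index of a scalar selection with that of a list selection. A small sketch using the frame from the report; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_product([["a1", "a2"], ["b1", "b2"]]),
    columns=["c1", "c2"],
)

scalar_sel = a.loc["a1"]    # scalar key: the selected level is dropped
list_sel = a.loc[["a1"]]    # list key: the full MultiIndex is kept

print(scalar_sel.index.tolist())
print(list_sel.index.tolist())
```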

@phofl
Member

phofl commented Nov 13, 2020

I get what the difference is. I simply could not find anything in the docs about this, maybe I am missing something. I also got what the problem is. But I am not sure if this is expected.

@holymonson
Contributor Author

But I am not sure if this is expected.

Indeed, the docs don't mention it. The question is whether a.loc['a1'] in this case can be viewed as a simple-index DataFrame. If it can, then assigning one simple-index DataFrame to another should be expected to work.

@toobaz
Member

toobaz commented Nov 14, 2020

@gfyoung and everybody, sorry for replying after more than two years.

I think we can summarize the project stance on assigning by matching on a subset of a MultiIndex's levels as: yes, it should ideally work; no, we are not terribly surprised that, at the moment, it does not.

I'm also pretty sure there was already an open issue (and I was even involved in it), but I can't find it right now. I remember the MultiIndex in question was on the columns rather than on the rows, but the example provided was otherwise pretty similar.

Notice that a.loc[('a1',), 'c1'] = pd.Series({'b1' : -2, 'b2' : -3}) - which is simpler, because it inserts a 1D object - doesn't work either, and ideally should.
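
For the 1D case, a possible stopgap is to select the rows with a boolean mask on the outer level and assign values that were aligned by hand on the inner level. This is only a sketch of a workaround, not the behaviour the issue asks for; rows and inner are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_product([["a1", "a2"], ["b1", "b2"]]),
    columns=["c1", "c2"],
)
s = pd.Series({"b1": -2, "b2": -3})

# Boolean mask over the outer level picks the destination rows.
rows = a.index.get_level_values(0) == "a1"
# Inner-level labels of those rows, in destination order.
inner = a.index[rows].get_level_values(1)

# Reorder s to match the destination, then assign positionally.
a.loc[rows, "c1"] = s.reindex(inner).to_numpy()
print(a)
```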

I suspect that it will be very hard to provide a clean fix for these before fixing #12827, because in some way the code that indexes on subsets of MultiIndex levels should be related to the code that accesses them.

@toobaz toobaz changed the title Setting value of DataFrame[MultiIndex] via .loc with DataFrame failed Setting value of DataFrame[MultiIndex] via .loc partial indexing with DataFrame failed Nov 14, 2020
@toobaz toobaz changed the title Setting value of DataFrame[MultiIndex] via .loc partial indexing with DataFrame failed Setting value of DataFrame[MultiIndex] via .loc partial indexing fails Nov 14, 2020
@toobaz
Member

toobaz commented Nov 14, 2020

The title change is justified by the existence of a relatively easy, although annoying, workaround which does not exploit partial indexing (well, it actually does, but only to determine the destination, not to match indexes):

a = pd.DataFrame(np.array(range(8)).reshape(4, 2), pd.MultiIndex.from_product([['a1', 'a2'], ['b1', 'b2']]), ['c1', 'c2'])
c = pd.DataFrame(-1, pd.MultiIndex.from_product([['a1'], ['b1', 'b2']]), ['c1', 'c2'])
a.loc[['a1']] = c

EDIT: that may seem a more complicated way to do what @phofl did above. But the thing is: that should not work! (oh what a mess) Because a.loc['a1'] is not multiindexed.
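
When the replacement frame only carries the inner labels, as b does in the original report, the outer level can be added with pd.concat and a dict key before using the list-based workaround. A minimal sketch, assuming the list-based assignment behaves as described in this comment; keyed is an illustrative name.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(
    np.arange(8).reshape(4, 2),
    index=pd.MultiIndex.from_product([["a1", "a2"], ["b1", "b2"]]),
    columns=["c1", "c2"],
)
b = pd.DataFrame(-1, index=["b1", "b2"], columns=["c1", "c2"])

# The dict key becomes a new outermost index level, so keyed is
# indexed by ('a1', 'b1') and ('a1', 'b2').
keyed = pd.concat({"a1": b})

# A list key keeps the full MultiIndex on the destination,
# so both sides carry the same labels.
a.loc[["a1"]] = keyed
print(a)
```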

@toobaz
Member

toobaz commented Dec 16, 2020

But the thing is: that should not work!

Another related thing that probably should not work came up today on the mailing list:

In [2]: s = pd.Series(range(9), index=pd.MultiIndex.from_product([list('abc'), list('def')]))

In [3]: mask = s % 2 == 0

In [4]: s.loc['a', mask]
Out[4]: 
a  d    0
   f    2
dtype: int64

... while s.loc['a', mask.loc['a']], which would be more acceptable, raises an error!
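
A two-step form in which the selection and the mask share the same single-level index avoids the ambiguity. This is a sketch of an explicit alternative, not a statement of what .loc should accept; note that the result loses the outer level.

```python
import pandas as pd

s = pd.Series(
    range(9),
    index=pd.MultiIndex.from_product([list("abc"), list("def")]),
)
mask = s % 2 == 0

# Select the 'a' block first; both objects are then indexed by d, e, f.
sub = s.loc["a"]
sub_mask = mask.loc["a"]

# Plain boolean indexing on matching single-level indexes.
result = sub[sub_mask]
print(result)
```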

@mroeschke mroeschke added the Bug label Jun 22, 2021

5 participants