BUG: JSON serialization with orient split fails roundtrip with MultiIndex #50456
Comments
take
A test for this is being added in #50370. Once that PR is merged, you'll need to remove the xfail of the test here.
Sweet. On it!
Okay, a run-down of what I have so far, to make sure I am not missing anything:

Program expectation:
Program behavior:

Unit focus:
- `from_array`
- `to_json`
- `read_json` (`_get_object_parser` under the hood)
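For context, a minimal roundtrip sketch of my own (the names and data are illustrative, not the poster's missing snippets) showing the expectation versus the behavior under discussion:

```python
import io

import pandas as pd

# DataFrame with MultiIndex columns, matching the original report.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_arrays([["2022", "2022"], ["JAN", "FEB"]]),
)

dumped = df.to_json(orient="split")
loaded = pd.read_json(io.StringIO(dumped), orient="split")

# Expectation: the roundtrip preserves the MultiIndex columns.
# Behavior before the fix: the columns come back mangled, so this raises.
pd.testing.assert_frame_equal(df, loaded)
```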
Not exactly. In my opinion we should change `to_json`, so the JSON stores the columns in the format the DataFrame constructor expects. I think this will make everything consistent.
Hi, I have a couple of questions before filing my pull request:
Ideally, we need #50370 merged before you can work on this. Once that PR is merged, the test will be added, and you can remove its xfail. You should just add a bullet point to the release notes.
Sounds good. I will take another issue for now, and will keep checking back until it merges.
@labibdotc #50370 has now been merged. If you merge main into your branch, the new test will be there and you can remove its xfail decorator.
@datapythonista, how exactly does "removing an xfail decorator" work on my part? Does it happen automatically when I run the tests on my modified code?
There is a decorator that allows a pytest test to fail (what is called an xfail). Since the roundtrip JSON serialization is broken for that case, we currently have that decorator, so the failing test doesn't break our test suite and the CI stays green. If you fix the problem, the test will pass, and pytest will complain that the test is marked as xfail but is passing. If you remove the decorator with the xfail, things should then be fine. If you're not familiar with it, the pytest documentation for xfail is a good read.
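For illustration, this is roughly what such an xfail marker looks like; the test name and reason below are made up for the example, not the actual pandas test:

```python
import pytest


@pytest.mark.xfail(reason="split-orient roundtrip loses MultiIndex columns (GH 50456)")
def test_to_json_split_multiindex_roundtrip():
    # Placeholder body: once the bug is fixed, this test starts passing and
    # pytest flags it (xpass), which is the cue to delete the decorator above.
    ...
```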
When I ran
The file is still in main and shouldn't be deleted: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/arrays.pyx
Not sure what the problem can be. What I use is
Was this PR accepted, or can I take it?
You can assign it to yourself; this is still pending.
take
Is someone working on this, or can I take it? Thanks
Go ahead, thanks @tahamukhtar20!
Thanks for the opportunity, @MarcoGorelli!
Can I take this up, @MarcoGorelli? It seems to have been stale for some time now.
Yup, go ahead.
take
Hi @datapythonista, I made these changes. Now:

```
df.to_json(orient="split")
'{"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}'
```

But now, after using `read_json` with `orient="split"`, we get this DataFrame:

```
   (2022, 2022)  (JAN, FEB)
0             1           2
1             3           4
```

So do we make a change in `read_json()` now?
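For what it's worth, here is a hedged sketch of the kind of adjustment being asked about, purely illustrative and not the actual `read_json` internals: once the JSON stores one list per level, the reader has to rebuild the MultiIndex explicitly instead of handing the nested lists straight to the DataFrame constructor.

```python
import pandas as pd

# Columns as emitted by the modified to_json above: one list per level.
raw_columns = [["2022", "2022"], ["JAN", "FEB"]]
data = [[1, 2], [3, 4]]
index = [0, 1]

# Rebuild the MultiIndex explicitly from the per-level lists.
columns = pd.MultiIndex.from_arrays(raw_columns)
df = pd.DataFrame(data, index=index, columns=columns)

print(df.columns.tolist())  # [('2022', 'JAN'), ('2022', 'FEB')]
```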
Any input, @datapythonista?
@rsm-23 are you still working on it, or can I take it?
@adnan2232 I was working on it and needed input from @datapythonista. You can go ahead if you have the full solution in mind, or you can continue on my branch as well.
@datapythonista is any input possible here?
I believe I have fixed the issue with the above PR. My fix was very similar to the previous PRs in this issue; however, `read_json` also needed to be changed to fix the issue @rsm-23 mentioned. Basically, under the hood the columns were transformed into a list of tuples, but they need to stay as a list of lists to build the DataFrame we expect, so adding a check for this and ensuring it is a list of lists when it is a MultiIndex seemed to do the trick. Let me know your thoughts, @datapythonista.
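To make the "list of lists" versus "list of tuples" distinction concrete, a small illustrative sketch (not the pandas internals): the same MultiIndex can be built from per-level lists or per-column tuples, so which layout reaches which constructor is what decides whether the labels come out right.

```python
import pandas as pd

# Per-level lists: one inner list per index level ("list of lists").
from_lists = pd.MultiIndex.from_arrays([["2022", "2022"], ["JAN", "FEB"]])

# Per-column tuples: one tuple per column label ("list of tuples").
from_tuples = pd.MultiIndex.from_tuples([("2022", "JAN"), ("2022", "FEB")])

# Both describe the same index when fed to the matching constructor.
assert from_lists.equals(from_tuples)
```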
When saving a DataFrame to JSON with `orient='split'` and then loading it again, the loaded DataFrame is different from the original if the columns are a MultiIndex.

The problem seems to be that the JSON stores the format as `{"columns":[["2022","JAN"],["2022","FEB"]], ...}`, but when creating the loaded DataFrame the `columns` value is passed as-is, and `DataFrame(data, columns=[["2022","JAN"],["2022","FEB"]])` produces an incorrect result.

We can fix this by either changing how the data is stored in the JSON or how the DataFrame is created. Personally, I think it makes more sense to store the data in the JSON in the way expected by the DataFrame constructor.
CC: @MarcoGorelli
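To spell out the mismatch described in the report, a short sketch; it assumes the DataFrame constructor treats nested lists as per-level arrays (the `MultiIndex.from_arrays` layout), while the split JSON stores one inner list per column label:

```python
import pandas as pd

data = [[1, 2], [3, 4]]

# Layout stored in the split JSON: one inner list per column label.
json_columns = [["2022", "JAN"], ["2022", "FEB"]]

# Handed straight to the constructor, the inner lists are interpreted as
# per-level arrays, so the labels come out transposed:
# [('2022', '2022'), ('JAN', 'FEB')] instead of [('2022', 'JAN'), ('2022', 'FEB')].
df = pd.DataFrame(data, columns=json_columns)
print(df.columns.tolist())
```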