
BUG: JSON serialization with orient split fails roundtrip with MultiIndex #50456


Open
datapythonista opened this issue Dec 28, 2022 · 28 comments
Assignees
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@datapythonista (Member)

When saving a DataFrame to JSON with orient='split' and then loading it again, the loaded DataFrame differs from the original if the columns are a MultiIndex.

>>> df = DataFrame([[1, 2], [3, 4]],
...                columns=pd.MultiIndex.from_arrays([["2022", "2022"], ['JAN', 'FEB']]))
>>> df
  2022    
   JAN FEB
0    1   2
1    3   4

>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
0    1   2
1    3   4

The problem seems to be that the JSON stores the columns as {"columns":[["2022","JAN"],["2022","FEB"]], ...}, but when the loaded DataFrame is created, that value is passed to the constructor as-is, and DataFrame(data, columns=[["2022","JAN"],["2022","FEB"]]) produces the incorrect result.

We can fix this by changing either how the data is stored in the JSON or how the DataFrame is created. Personally, I think it makes more sense to store the data in the JSON in the form expected by the DataFrame constructor.
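The mismatch can be seen directly in the constructor (a minimal sketch using the column values from the example above): a nested list passed to `columns` is interpreted as one inner list per *level*, not one per column, so the per-column form stored in the JSON comes out transposed.

```python
import pandas as pd

# The JSON stores one inner list per column: [["2022","JAN"], ["2022","FEB"]].
# The DataFrame constructor, however, treats a nested list as one inner list
# per level (the MultiIndex.from_arrays convention), so the labels end up
# transposed relative to the original frame.
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=[["2022", "JAN"], ["2022", "FEB"]])
print(list(df.columns))  # [('2022', '2022'), ('JAN', 'FEB')] -- transposed
```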

CC: @MarcoGorelli

@labibdotc

take

@datapythonista (Member, Author)

A test for this is being added in #50370. Once that PR is merged, you'll need to remove the xfail of the test here.

@labibdotc

Sweet. On it!

@labibdotc

Okay, here's a run-down of what I have so far, to make sure I am not missing anything:

Program expectation:

>>> read_json(df.to_json(orient='split'), orient='split')
  2022    
   JAN FEB
0    1   2
1    3   4

Program behavior:

>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
0    1   2
1    3   4

Unit focus

from_arrays

# returns a MultiIndex instance which stores the following:
#     [('2022', 'JAN'),
#      ('2022', 'FEB')]
# That's great!

to_json

# returns a JSON string ready to be read with orient "split". It contains the following:
{"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}
# That's great!
# as it aligns with the read_json example in the docs:
'{"columns":["col 1","col 2"],"index":["row 1","row 2"],"data":[["a","b"],["c","d"]]}'
    >>> pd.read_json(_, orient='split')
          col 1 col 2
    row 1     a     b
    row 2     c     d

read_json (_get_object_parser under the hood)

# This is potentially where things go wrong, as the components leading up to here seem fine so far.
# This is the step where the to_json output becomes tabular:
# read_json() -> JsonReader.read() # when orient is "split" -> returns _get_object_parser(<to_json output from before>)
# The obj produced by calling the DataFrame constructor again with the JSON objects we have is:
  2022 JAN
  2022 FEB
0    1   2
1    3   4

# this is the problem

@datapythonista (Member, Author)

Not exactly. In my opinion we should change to_json, and instead of {"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]} save {"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}.

I think this will make everything consistent, and read_json will work fine as is.
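A quick check (a sketch using the example frame from this issue) confirms that the proposed per-level storage is the form the constructor already understands:

```python
import pandas as pd

# Per-level storage, as proposed: the first inner list is level 0 and the
# second is level 1. The constructor applies the MultiIndex.from_arrays
# convention, so the columns come out as the expected per-column tuples.
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=[["2022", "2022"], ["JAN", "FEB"]])
print(list(df.columns))  # [('2022', 'JAN'), ('2022', 'FEB')]
```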

@labibdotc

Hi, I have a couple of questions before filing my pull request:

  • Where should I add my tests? pandas/tests/io/json/test_pandas.py: result = data.to_json(orient="split", index=False) is what I am looking at.
  • How do I show that I removed the xfail of a test? Do I run pytest pandas and provide the terminal log in my pull request?
  • Also, should I create v2.0.1.rst, or do I update the latest (i.e. v2.0.0.rst) in the docs with my updates for this commit?

@datapythonista (Member, Author)

Ideally, we need #50370 merged before you can work on this. Once that PR is merged, the test will be added, and you can remove its xfail decorator in your PR to show that the test is now passing.

You should just add a bullet point to the v2.0.0.rst file.

@labibdotc

Sounds good. I will take another issue for now, and will keep on checking back until it merges.

@datapythonista (Member, Author)

@labibdotc #50370 has now been merged. If you merge main into your branch (or start a new branch from an updated main), you can work on this issue without trouble now. Let me know if you have any questions.

@labibdotc

labibdotc commented Jan 20, 2023

@datapythonista, how exactly does "removing an xfail decorator" work on my part? Does it happen automatically by running the tests on my modified code?
Also, to confirm: pandas/tests/io/json/test_pandas.py is the only test file I am running?

@datapythonista (Member, Author)

There is a decorator that allows a pytest test to fail (what's called an xfail). Since the roundtrip JSON serialization is broken for that case, we now have that decorator, so the failing test doesn't make our test suite fail and keeps the CI green. If you fix the problem, the test will pass, and pytest will complain that the test is xfailed but passing. If you then remove the decorator with the xfail, things should be fine.

If you're not familiar, the pytest documentation for xfail is a good read.
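For illustration, a minimal xfail usage might look like this (a hypothetical test, not the actual pandas one):

```python
import pytest

# xfail marks a test as "expected to fail": the suite stays green while the
# bug is open. Once the bug is fixed the test starts passing, pytest reports
# it as XPASS (or fails the run under strict=True), which signals that the
# decorator can be removed.
@pytest.mark.xfail(reason="GH 50456: orient='split' roundtrip loses MultiIndex columns")
def test_split_roundtrip_multiindex():
    assert False  # placeholder for the real roundtrip assertion
```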

@labibdotc

labibdotc commented Jan 20, 2023

When I ran git pull --ff-only to get the up-to-date merge from upstream, I got deleted: pandas/_libs/arrays.pyx, etc. That makes my build fail, as apparently some of the code is still looking for pandas/_libs/arrays.pyx. Maybe I should have tried a merge instead of --ff-only? Do you recognize what I did wrong?

@datapythonista (Member, Author)

The file is still in main, shouldn't be deleted: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/arrays.pyx

Not sure what the problem can be. What I use is git fetch upstream and git merge upstream/main. A git pull will in general merge from your own fork and the same branch, which may be quite far behind, unless someone else pushes to your branch.

@lusolorz (Contributor)

lusolorz commented Apr 7, 2023

Was this PR accepted or can I take?

@datapythonista (Member, Author)

You can assign it to yourself; this is still pending.

@lusolorz (Contributor)

take

@tahamukhtar20

Is someone working on this, or can I take it? Thanks

@MarcoGorelli (Member)

go ahead, thanks @tahamukhtar20 !

@tahamukhtar20

Thanks for the opportunity @MarcoGorelli

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

Can I take this up @MarcoGorelli? It seems to have been stale for some time now.

@MarcoGorelli (Member)

yup go ahead

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

take

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

In my opinion we should change to_json, and instead of {"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]} save {"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}.

I think this will make everything consistent, and read_json will work fine as is.

Hi @datapythonista, I applied these changes:

 df.to_json(orient="split")
'{"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}'

But now, after using read_json with orient as "split", we get this DataFrame:

   (2022, 2022)  (JAN, FEB)
0             1           2
1             3           4

So do we make a change in read_json() now?

@rsm-23 (Contributor)

rsm-23 commented Jul 14, 2023

any input @datapythonista ?

@adnan2232

@rsm-23 are you still working on it, or can I take it?

@rsm-23 (Contributor)

rsm-23 commented Jul 21, 2023

@adnan2232 I was working on it and needed inputs from @datapythonista . You can go ahead if you have the full solution in mind. Or you can continue on my branch as well.

@rsm-23 (Contributor)

rsm-23 commented Aug 19, 2023

@datapythonista any input possible here?

@mvernooy3687

I believe I have fixed the issue with the above PR. My fix was very similar to the previous PRs in this issue; however, read_json also needed to be changed to fix the issue @rsm-23 mentioned. Basically, under the hood the columns were transformed into a list of tuples, but they need to stay a list of lists to build the DataFrame we expect, so adding a check for this and ensuring the columns are a list of lists when they form a MultiIndex seemed to do the trick. Let me know your thoughts @datapythonista
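The list-vs-tuple distinction matters because the constructor only applies the per-level interpretation to nested lists. A sketch of the normalization described above (normalize_columns is a hypothetical helper for illustration, not the exact code from the PR):

```python
import pandas as pd

def normalize_columns(cols):
    # Hypothetical helper: if the decoded "columns" entries were turned into
    # tuples somewhere along the way, convert them back to lists so the
    # DataFrame constructor applies the per-level (from_arrays)
    # interpretation and rebuilds the MultiIndex.
    if cols and all(isinstance(c, (list, tuple)) for c in cols):
        return [list(c) for c in cols]
    return cols

# Simulated decoded payload with per-level columns stored as tuples.
decoded = {"columns": [("2022", "2022"), ("JAN", "FEB")],
           "index": [0, 1],
           "data": [[1, 2], [3, 4]]}
df = pd.DataFrame(decoded["data"],
                  index=decoded["index"],
                  columns=normalize_columns(decoded["columns"]))
print(list(df.columns))  # [('2022', 'JAN'), ('2022', 'FEB')]
```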
