
BUG: JSON serialization with orient split fails roundtrip with MultiIndex #50456


Open
datapythonista opened this issue Dec 28, 2022 · 28 comments
Assignees
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@datapythonista (Member)

When saving a DataFrame to JSON with orient='split' and then loading it again, the loaded DataFrame differs from the original if the columns are a MultiIndex.

>>> df = DataFrame([[1, 2], [3, 4]],
...                columns=pd.MultiIndex.from_arrays([["2022", "2022"], ['JAN', 'FEB']]))
>>> df
  2022    
   JAN FEB
0    1   2
1    3   4

>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
0    1   2
1    3   4

The problem seems to be that the JSON stores the columns as {"columns":[["2022","JAN"],["2022","FEB"]], ...}, but when the loaded DataFrame is created, that value is passed to the constructor as-is, and DataFrame(data, columns=[["2022","JAN"],["2022","FEB"]]) produces the incorrect result.

We can fix this by changing either how the data is stored in the JSON or how the DataFrame is created. Personally, I think it makes more sense to store the data in the JSON in the form expected by the DataFrame constructor.
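The mismatch can be seen directly in the constructor (a minimal sketch using the column values from the example above): a nested list passed to `columns` is interpreted as one inner list per *level*, not one per column, so the per-column form stored in the JSON comes out transposed.

```python
import pandas as pd

# The JSON stores one inner list per column: [["2022","JAN"], ["2022","FEB"]].
# The DataFrame constructor, however, treats a nested list as one inner list
# per level (the MultiIndex.from_arrays convention), so the labels end up
# transposed relative to the original frame.
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=[["2022", "JAN"], ["2022", "FEB"]])
print(list(df.columns))  # [('2022', '2022'), ('JAN', 'FEB')] -- transposed
```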

CC: @MarcoGorelli

@labibdotc

take

@datapythonista (Member, Author)

A test for this is being added in #50370. Once that PR is merged, you'll need to remove the xfail of the test here.

@labibdotc

Sweet. On it!

@labibdotc

Okay, here's a run-down of what I have so far, to make sure I am not missing anything:

Program expectation:

>>> read_json(df.to_json(orient='split'), orient='split')
  2022    
   JAN FEB
0    1   2
1    3   4

Program behavior:

>>> read_json(df.to_json(orient='split'), orient='split')
  2022 JAN
  2022 FEB
0    1   2
1    3   4

Unit focus

from_arrays

# returns a MultiIndex instance which stores the following:
#     [('2022', 'JAN'),
#      ('2022', 'FEB')]
# That's great!

to_json

# returns a JSON string ready to be read with orient "split". It contains the following:
{"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}
# That's great!
# as it aligns with the read_json example in the docs:
'{"columns":["col 1","col 2"],"index":["row 1","row 2"],"data":[["a","b"],["c","d"]]}'
    >>> pd.read_json(_, orient='split')
          col 1 col 2
    row 1     a     b
    row 2     c     d

read_json (_get_object_parser under the hood)

# This is potentially where things go wrong, as the components leading up to here seem fine so far.
# This is the step where the to_json output becomes tabular:
# read_json() -> JsonReader.read() # when orient is "split" -> returns _get_object_parser(<to_json output from before>)
# The obj produced by calling the DataFrame constructor again with the JSON objects we have is:
  2022 JAN
  2022 FEB
0    1   2
1    3   4

# this is the problem

@datapythonista (Member, Author)

Not exactly. In my opinion we should change to_json, and instead of {"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]} save {"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}.

I think this will make everything consistent, and read_json will work fine as is.
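A quick check (a sketch using the example frame from this issue) confirms that the proposed per-level storage is the form the constructor already understands:

```python
import pandas as pd

# Per-level storage, as proposed: the first inner list is level 0 and the
# second is level 1. The constructor applies the MultiIndex.from_arrays
# convention, so the columns come out as the expected per-column tuples.
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=[["2022", "2022"], ["JAN", "FEB"]])
print(list(df.columns))  # [('2022', 'JAN'), ('2022', 'FEB')]
```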

@labibdotc

Hi, I have a couple of questions before filing my pull request:

  • Where should I add my tests? pandas/tests/io/json/test_pandas.py: result = data.to_json(orient="split", index=False) is what I am looking at.
  • How do I show that I removed the xfail of a test? Do I run pytest pandas and provide the terminal log in my pull request?
  • Also, should I create v2.0.1.rst, or do I update the latest (i.e. v2.0.0.rst) in the docs with my updates for this commit?

@datapythonista (Member, Author)

Ideally, we need #50370 merged before you can work on this. Once that PR is merged, the test will be added, and you can remove its xfail decorator in your PR to show that the test is now passing.

You should just add a bullet point to the v2.0.0.rst file.

@labibdotc

Sounds good. I will take another issue for now, and will keep on checking back until it merges.

@datapythonista (Member, Author)

@labibdotc #50370 has now been merged. If you merge main into your branch (or start a new branch from an updated main), you can work on this issue without trouble now. Let me know if you have any questions.

@labibdotc

labibdotc commented Jan 20, 2023

@datapythonista, how exactly does "removing an xfail decorator" work on my part? Does it happen automatically by running the tests on my modified code?
Also, to confirm: pandas/tests/io/json/test_pandas.py is the only test file I am running?

@datapythonista (Member, Author)

There is a decorator that allows a pytest test to fail (what's called an xfail). Since the roundtrip JSON serialization is broken for that case, we now have that decorator, so the failing test doesn't make our test suite fail and keeps the CI green. If you fix the problem, the test will pass, and pytest will complain that the test is xfailed but passing. If you then remove the decorator with the xfail, things should be fine.

If you're not familiar, the pytest documentation for xfail is a good read.
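For illustration, a minimal xfail usage might look like this (a hypothetical test, not the actual pandas one):

```python
import pytest

# xfail marks a test as "expected to fail": the suite stays green while the
# bug is open. Once the bug is fixed the test starts passing, pytest reports
# it as XPASS (or fails the run under strict=True), which signals that the
# decorator can be removed.
@pytest.mark.xfail(reason="GH 50456: orient='split' roundtrip loses MultiIndex columns")
def test_split_roundtrip_multiindex():
    assert False  # placeholder for the real roundtrip assertion
```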

@labibdotc

labibdotc commented Jan 20, 2023

When I ran git pull --ff-only to get the up-to-date merge from upstream, I got deleted: pandas/_libs/arrays.pyx, etc. That makes my build fail, as apparently some of the code is still looking for pandas/_libs/arrays.pyx. Maybe I should have tried a merge instead of --ff-only? Do you recognize what I did wrong?

@datapythonista (Member, Author)

The file is still in main, shouldn't be deleted: https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/arrays.pyx

Not sure what the problem can be. What I use is git fetch upstream and git merge upstream/main. A git pull will in general merge from your own fork and the same branch, which may be quite far behind, unless someone else pushes to your branch.

@lusolorz (Contributor)

lusolorz commented Apr 7, 2023

Was this PR accepted or can I take?

@datapythonista (Member, Author)

You can assign it to yourself; this is still pending.

@lusolorz (Contributor)

take

@tahamukhtar20

Is someone working on this, or can I take it? Thanks

@MarcoGorelli (Member)

go ahead, thanks @tahamukhtar20 !

@tahamukhtar20

Thanks for the opportunity @MarcoGorelli

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

Can I take this up @MarcoGorelli? It seems to have been stale for some time now.

@MarcoGorelli (Member)

yup go ahead

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

take

@rsm-23 (Contributor)

rsm-23 commented Jul 1, 2023

In my opinion we should change to_json, and instead of {"columns":[["2022","JAN"],["2022","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]} save {"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}.

I think this will make everything consistent, and read_json will work fine as is.

Hi @datapythonista, I applied these changes:

 df.to_json(orient="split")
'{"columns":[["2022","2022"],["JAN","FEB"]],"index":[0,1],"data":[[1,2],[3,4]]}'

But now, after using read_json with orient as "split", we get this DataFrame:

   (2022, 2022)  (JAN, FEB)
0             1           2
1             3           4

So do we make a change in read_json() now?

@rsm-23 (Contributor)

rsm-23 commented Jul 14, 2023

any input @datapythonista ?

@adnan2232

@rsm-23 are you still working on it, or can I take it?

@rsm-23 (Contributor)

rsm-23 commented Jul 21, 2023

@adnan2232 I was working on it and needed inputs from @datapythonista . You can go ahead if you have the full solution in mind. Or you can continue on my branch as well.

@rsm-23 (Contributor)

rsm-23 commented Aug 19, 2023

@datapythonista any input possible here?

@mvernooy3687

I believe I have fixed the issue with the above PR. My fix was very similar to the previous PRs in this issue; however, read_json also needed to be changed to fix the issue @rsm-23 mentioned. Basically, under the hood the columns were transformed into a list of tuples, but they need to stay a list of lists to build the DataFrame we expect, so adding a check for this and ensuring the columns are a list of lists when they form a MultiIndex seemed to do the trick. Let me know your thoughts @datapythonista
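The list-vs-tuple distinction matters because the constructor only applies the per-level interpretation to nested lists. A sketch of the normalization described above (normalize_columns is a hypothetical helper for illustration, not the exact code from the PR):

```python
import pandas as pd

def normalize_columns(cols):
    # Hypothetical helper: if the decoded "columns" entries were turned into
    # tuples somewhere along the way, convert them back to lists so the
    # DataFrame constructor applies the per-level (from_arrays)
    # interpretation and rebuilds the MultiIndex.
    if cols and all(isinstance(c, (list, tuple)) for c in cols):
        return [list(c) for c in cols]
    return cols

# Simulated decoded payload with per-level columns stored as tuples.
decoded = {"columns": [("2022", "2022"), ("JAN", "FEB")],
           "index": [0, 1],
           "data": [[1, 2], [3, 4]]}
df = pd.DataFrame(decoded["data"],
                  index=decoded["index"],
                  columns=normalize_columns(decoded["columns"]))
print(list(df.columns))  # [('2022', 'JAN'), ('2022', 'FEB')]
```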
