DataFrame.from_dict unexpectedly "flatten" tuples in the dictionary keys #16769

Closed
tkf opened this issue Jun 25, 2017 · 8 comments · Fixed by #29497
Labels
good first issue, Needs Tests (unit test(s) needed to prevent regressions)

@tkf
Contributor

tkf commented Jun 25, 2017

Code Sample, a copy-pastable example if possible

In [1]: import pandas

In [2]: pandas.__version__
Out[2]: '0.20.2'

In [3]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[3]: Index(['a'], dtype='object')

In [4]: pandas.DataFrame.from_dict([{('a',): 1, ('b',): 2}]).columns
Out[4]: Index([('a',), ('b',)], dtype='object')

In [5]: pandas.DataFrame.from_dict([{('a', 'b'): 1}]).columns
Out[5]: Index([('a', 'b')], dtype='object')

Problem description

When (1) dictionaries that all share a single identical key are given to pandas.DataFrame.from_dict and (2) that key is a singleton tuple, the result is a dataframe whose column label is the content of the tuple instead of the tuple itself.

Note that this problem does not happen when (1) is not the case (see In [4]) or (2) is not the case (see In [5]). This makes the combination of (1) and (2) inconsistent with the other cases.

Expected Output

In [3]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[3]: Index([('a',)], dtype='object')

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

It is also reproduced with the current master branch:

INSTALLED VERSIONS
------------------
commit: 1265c27
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+179.g1265c27f4
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@tkf
Contributor Author

tkf commented Jun 25, 2017

Note that it works as expected in pandas 0.19.2.

@tkf
Contributor Author

tkf commented Jun 25, 2017

It seems that the problem lies in the DataFrame class itself. It happens even when there are multiple keys:

In [1]: import pandas

In [2]: pandas.DataFrame({('a',): [1], ('b',): [2]}).columns
Out[2]: Index(['a', 'b'], dtype='object')

In [3]: pandas.DataFrame({('a',): [1], 'b': [2]}).columns
Out[3]: Index([('a',), 'b'], dtype='object')

I expect Out[2]: Index([('a',), ('b',)], dtype='object').

Reproduced in the current master (1265c27):

INSTALLED VERSIONS
------------------
commit: 1265c27
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+179.g1265c27f4
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@tkf
Contributor Author

tkf commented Jun 25, 2017

Another related inconsistency is that assigning tuple(s) to DataFrame.columns gives a different result depending on whether there is only one column or multiple columns:

In [1]: import pandas

In [2]: df1 = pandas.DataFrame({'a': [1]})

In [3]: df1.columns = [('a',)]

In [4]: df1.columns
Out[4]: Index(['a'], dtype='object')

In [5]: df2 = pandas.DataFrame({'a': [1], 'b': [2]})

In [6]: df2.columns = [('a',), ('b',)]

In [7]: df2.columns
Out[7]: Index([('a',), ('b',)], dtype='object')

In [8]: df3 = pandas.DataFrame({('a',): [1], ('b',): [2]})

In [9]: df3.columns
Out[9]: Index(['a', 'b'], dtype='object')

In [10]: df3.columns = [('a',), ('b',)]

In [11]: df3.columns
Out[11]: Index([('a',), ('b',)], dtype='object')

I expect Out[4]: Index([('a',)], dtype='object') (and as I already mentioned, Out[9]: Index([('a',), ('b',)], dtype='object')).

@tkf
Contributor Author

tkf commented Jun 25, 2017

Here is a workaround I've found. You can avoid pandas "de-tupling" a singleton tuple by adding a dummy column:

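import pandas

# df is an existing DataFrame; wrap every non-tuple column label in a 1-tuple.
newcolumns = [c if isinstance(c, tuple) else (c,) for c in df.columns]
dummy = object()  # temporary non-tuple label, so the labels are not all tuples
df[dummy] = pandas.Categorical(0)
df.columns = newcolumns + [dummy]
del df[dummy]  # drop the dummy column; the tuple labels stay as they are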

@jreback
Contributor

jreback commented Jun 28, 2017

@tkf a len-1 tuple as an Index entry is not allowed. This would imply

In [65]: pd.MultiIndex.from_tuples([('a',)])
Out[65]: Index(['a'], dtype='object')

would actually return a 1-level MultiIndex, which is by definition an Index. Tuples indicate levels on index construction. In a Series/DataFrame construction without explicit instructions, these go through _ensure_index, which will try to figure out what you want. If you want to be explicit, it would work.

You can bisect if you want and see when this changed as you said it worked in 0.19.2.
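As an illustration of the "be explicit" route mentioned above, here is a minimal sketch that builds the column Index by hand. It assumes a reasonably recent pandas in which `pandas.Index` accepts `tupleize_cols=False`; the variable names are made up for the example, and behavior on 0.20.x has not been checked here.

```python
import pandas as pd

# tupleize_cols=False asks pandas to keep each tuple as a single label
# instead of interpreting it as a set of MultiIndex levels.
columns = pd.Index([("a",)], dtype=object, tupleize_cols=False)

# Passing a ready-made Index means the constructor does not have to guess
# what the labels are supposed to be.
df = pd.DataFrame([[1], [2]], columns=columns)
print(list(df.columns))
```

Constructing the Index up front sidesteps the inference step entirely, which may be preferable to the dummy-column workaround when the labels are known in advance.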

@tkf
Contributor Author

tkf commented Jul 7, 2017

@jreback I'm not talking about MultiIndex here. I want to use tuples of arbitrary length as keys. This is actually noted as a valid use case in the documentation:

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis
--- http://pandas.pydata.org/pandas-docs/stable/advanced.html#creating-a-multiindex-hierarchical-index-object

At the "atomic" level, I think any hashable has to be accepted as-is. This includes a singleton tuple.
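To make the "any hashable" point concrete, the following plain-Python sketch (no pandas involved, names chosen only for illustration) shows that a singleton tuple and its content are distinct keys, so unwrapping one into the other changes which key is being referred to.

```python
label_tuple = ("a",)
label_str = "a"

# Both objects are hashable, but they are not equal, so a dict keeps them
# as two separate keys.
mapping = {label_tuple: 1, label_str: 2}

print(label_tuple == label_str)  # the tuple is not equal to its content
print(len(mapping))              # number of distinct keys in the dict
```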

@jreback
Contributor

jreback commented Jul 7, 2017

@tkf well, you are fighting pandas here. I'll mark it, and if you can find a change that makes your test work and preserves other behavior, then we would accept it.

@jreback added the Bug, Difficulty Intermediate, MultiIndex, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Jul 7, 2017
@jreback added this to the Next Major Release milestone Jul 7, 2017

@mroeschke
Member

Looks like this is returning the expected result on master. I suppose this could use a test:

In [154]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[154]: Index([('a',)], dtype='object')

In [155]: pd.__version__
Out[155]: '0.26.0.dev0+555.gf7d162b18'
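A regression test for the case in the original report might look roughly like the sketch below. It is only an illustration, not necessarily the test that was eventually merged: the function name is hypothetical, and it assumes a pandas version where the behavior shown above holds and where `pandas.Index` accepts `tupleize_cols=False`.

```python
import pandas as pd
import pandas.testing as tm


def test_from_dict_keeps_singleton_tuple_key():
    # Two records sharing the single key ('a',), as in the original report.
    df = pd.DataFrame.from_dict([{("a",): 1}, {("a",): 2}])

    # Build the expected columns explicitly so the tuple stays one label.
    expected = pd.Index([("a",)], dtype=object, tupleize_cols=False)
    tm.assert_index_equal(df.columns, expected)
```

Comparing against an explicitly constructed Index, rather than against a printed repr, keeps the test independent of how pandas formats its output.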

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Bug, Difficulty Intermediate, MultiIndex, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Oct 14, 2019
@gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 9, 2019