DataFrame.from_dict unexpectedly "flatten" tuples in the dictionary keys #16769

tkf · 2017-06-25T05:01:37Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas

In [2]: pandas.__version__
Out[2]: '0.20.2'

In [3]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[3]: Index(['a'], dtype='object')

In [4]: pandas.DataFrame.from_dict([{('a',): 1, ('b',): 2}]).columns
Out[4]: Index([('a',), ('b',)], dtype='object')

In [5]: pandas.DataFrame.from_dict([{('a', 'b'): 1}]).columns
Out[5]: Index([('a', 'b')], dtype='object')

Problem description

When (1) dictionaries that all share a single identical key are given to pandas.DataFrame.from_dict and (2) that key is a singleton tuple, it returns a dataframe whose column label is the content of the tuple instead of the tuple itself.

Note that this problem does not happen when (1) is not the case (see In [4]) or (2) is not the case (see In [5]). It makes the case (1) & (2) inconsistent with those other cases.
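The three cases above can be checked in one go. This is only a sketch: `column_labels_survive` is a hypothetical helper name, not part of pandas, and no expected output is given because the result depends on the installed pandas version.

```python
import pandas as pd


def column_labels_survive(records):
    """Return True if every dict key comes back unchanged as a column label.

    `records` is a list of dicts, as passed to DataFrame.from_dict in the
    report above. This helper is illustrative only and is not a pandas API.
    """
    df = pd.DataFrame.from_dict(records)
    # Unique keys in first-seen order, which is what we expect as columns.
    expected = list(dict.fromkeys(key for rec in records for key in rec))
    return list(df.columns) == expected


cases = {
    "single singleton-tuple key": [{("a",): 1}, {("a",): 2}],
    "two singleton-tuple keys": [{("a",): 1, ("b",): 2}],
    "one 2-tuple key": [{("a", "b"): 1}],
}
for name, records in cases.items():
    print(name, "->", column_labels_survive(records))
```

On an affected version the first case should report False while the other two report True.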

Expected Output

In [3]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[3]: Index([('a',)], dtype='object')

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.0
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

It is also reproduced with the current master branch:

INSTALLED VERSIONS
------------------
commit: 1265c27
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+179.g1265c27f4
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

tkf · 2017-06-25T05:03:20Z

Note that it works as expected in pandas 0.19.2.

tkf · 2017-06-25T05:41:14Z

It seems that the problem lies in the DataFrame class itself rather than in from_dict. It happens even when there are multiple keys:

In [1]: import pandas

In [2]: pandas.DataFrame({('a',): [1], ('b',): [2]}).columns
Out[2]: Index(['a', 'b'], dtype='object')

In [3]: pandas.DataFrame({('a',): [1], 'b': [2]}).columns
Out[3]: Index([('a',), 'b'], dtype='object')

I expect Out [2]: Index([('a',), ('b',)], dtype='object').

Reproduced in the current master (1265c27):

INSTALLED VERSIONS
------------------
commit: 1265c27
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.47-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0.dev+179.g1265c27f4
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.0
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

tkf · 2017-06-25T05:47:08Z

Another related inconsistency is that assigning tuples to DataFrame.columns gives a different result depending on whether there is one column or several:

In [1]: import pandas

In [2]: df1 = pandas.DataFrame({'a': [1]})

In [3]: df1.columns = [('a',)]

In [4]: df1.columns
Out[4]: Index(['a'], dtype='object')

In [5]: df2 = pandas.DataFrame({'a': [1], 'b': [2]})

In [6]: df2.columns = [('a',), ('b',)]

In [7]: df2.columns
Out[7]: Index([('a',), ('b',)], dtype='object')

In [8]: df3 = pandas.DataFrame({('a',): [1], ('b',): [2]})

In [9]: df3.columns
Out[9]: Index(['a', 'b'], dtype='object')

In [10]: df3.columns = [('a',), ('b',)]

In [11]: df3.columns
Out[11]: Index([('a',), ('b',)], dtype='object')

I expect Out[4]: Index([('a',)], dtype='object') (and as I already mentioned, Out[9]: Index([('a',), ('b',)], dtype='object')).

tkf · 2017-06-25T09:03:35Z

Here is a workaround I've found. You can avoid pandas "de-tupling" a singleton tuple by adding a dummy column:

newcolumns = [c if isinstance(c, tuple) else (c,) for c in df.columns]
dummy = object()
df[dummy] = pandas.Categorical(0)
df.columns = newcolumns + [dummy]
del df[dummy]

jreback · 2017-06-28T10:21:40Z

@tkf a len-1 tuple as an Index entry is not allowed. This would imply

In [65]: pd.MultiIndex.from_tuples([('a',)])
Out[65]: Index(['a'], dtype='object')

would actually return a 1-level MultiIndex, which is by definition an Index. Tuples indicate levels on index construction. In a Series/DataFrame construction without explicit instructions, these go through _ensure_index, which will try to figure out what you want. If you want to be explicit, it would work.

You can bisect if you want and see when this changed as you said it worked in 0.19.2.
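To make the "tuples indicate levels" reading concrete, here is a toy model in plain Python. It is not pandas code and does not reproduce `_ensure_index`; `toy_infer_labels` is a hypothetical name, and the rule it encodes is only an assumption chosen to match the constructor outputs reported earlier in this thread for 0.20.2. It does not cover the from_dict case with two distinct singleton-tuple keys, where the tuples were kept.

```python
def toy_infer_labels(keys):
    """Toy model (NOT pandas internals) of the flattening reported above.

    Assumed rule: if every key is a tuple, the tuples are read as index
    levels, and a single level is collapsed to a flat list of labels.
    """
    keys = list(keys)
    all_tuples = bool(keys) and all(isinstance(k, tuple) for k in keys)
    if not all_tuples:
        # Mixed or plain keys are kept as atomic labels.
        return keys
    if {len(k) for k in keys} == {1}:
        # One level only: unwrap each singleton tuple.
        return [k[0] for k in keys]
    # Two or more levels: keep the tuples.
    return keys


print(toy_infer_labels([("a",), ("b",)]))  # ['a', 'b']
print(toy_infer_labels([("a",), "b"]))     # [('a',), 'b']
print(toy_infer_labels([("a", "b")]))      # [('a', 'b')]
```

Under this model the unwrapping depends on what the other keys look like, which is the inconsistency being reported.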

tkf · 2017-07-07T03:32:46Z

@jreback I'm not talking about MultiIndex here. I want to use tuples of arbitrary length as keys. This is actually noted as a valid use case in the documentation:

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis
--- http://pandas.pydata.org/pandas-docs/stable/advanced.html#creating-a-multiindex-hierarchical-index-object

At the "atomic" levels, I think any hashable has to be accepted as-is. This includes a singleton tuple.
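As a plain-Python illustration of the "any hashable" point (no pandas involved), a singleton tuple is a perfectly good dictionary key and is distinct from the string it wraps:

```python
# A singleton tuple, the bare string, and a 2-tuple are three different keys.
labels = {("a",): 1, "a": 2, ("a", "b"): 3}

print(len(labels))        # 3
print(labels[("a",)])     # 1
print(labels["a"])        # 2
print(("a",) == "a")      # False
```

So unwrapping `('a',)` to `'a'` can merge two labels that were distinct keys in the input dictionary.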

jreback · 2017-07-07T21:41:39Z

@tkf well, you are fighting pandas here. I'll mark it, and if you can find a change that makes your test work and preserves other behavior, then we would accept it.

mroeschke · 2019-10-14T00:45:05Z

Looks like this is returning the expected result on master. I suppose this could use a test:

In [154]: pandas.DataFrame.from_dict([{('a',): 1}, {('a',): 2}]).columns
Out[154]: Index([('a',)], dtype='object')

In [155]: pd.__version__
Out[155]: '0.26.0.dev0+555.gf7d162b18'

jreback added the Bug, Difficulty Intermediate, MultiIndex, and Reshaping labels Jul 7, 2017

jreback added this to the Next Major Release milestone Jul 7, 2017

mroeschke added the good first issue and Needs Tests labels and removed the Bug, Difficulty Intermediate, MultiIndex, and Reshaping labels Oct 14, 2019

ganevgv mentioned this issue Nov 9, 2019: TST: add test for df construction from dict with tuples #29497 (merged)

gfyoung modified the milestones: Contributions Welcome, 1.0 Nov 9, 2019

mroeschke closed this as completed in #29497 Nov 9, 2019
