Creating DataFrame throws: data type "bytes512" not understood #20734

stephenmartindale · 2018-04-18T15:39:28Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

index = pd.Series(name='id', dtype='S24')
df = pd.DataFrame(index=index)
df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
df['b'] = pd.Series(name='b', index=index, dtype='S64')
df['c'] = pd.Series(name='c', index=index, dtype='S64')
df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)

Problem description

The code above, which attempts to create an empty pandas.DataFrame with an index and four typed columns, raises the following error:

[... snip ...]\appdata\local\programs\python\python36\lib\site-packages\pandas\core\internals.py in _vstack(to_stack, dtype)
   4912 
   4913     # work around NumPy 1.6 bug
-> 4914     if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
   4915         new_values = np.vstack([x.view('i8') for x in to_stack])
   4916         return new_values.view(dtype)

TypeError: data type "bytes512" not understood

Why?

Changing the order of the columns works just fine:

index = pd.Series(name='id', dtype='S24')
df = pd.DataFrame(index=index)
df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)
df['b'] = pd.Series(name='b', index=index, dtype='S64')
df['c'] = pd.Series(name='c', index=index, dtype='S64')

In fact, it seems that any Series added after the two S64 series throws an error: I tried with both np.float and np.bool.

Expected Output

I would expect the order in which the Series are added not to matter or, if it does matter, a clearer error message.

I tried with an older version of Python 3.6, NumPy and Pandas and then updated, thinking this was just a bug. The latest version I tested was CPython 3.6.5, NumPy 1.14.2, Pandas 0.22.0.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-04-18T21:14:43Z

I think this issue is back one layer in NumPy:

np.dtype('float') == np.dtype('float').name
True
np.dtype('S64') == np.dtype('S64').name
*** TypeError: data type "bytes512" not understood

Not an expert on dtypes so will see if others chime in, but I have a feeling this will need to be opened as an issue with that project instead of here

jschendel · 2018-04-18T23:09:22Z

See numpy/numpy#5329

jorisvandenbossche · 2018-04-19T15:03:18Z

The issue @jschendel links to is about NumPy refusing to compare a dtype against a string it does not recognize (because np.dtype('unknown_string') raises an error).
However, in this case I think you could argue that np.dtype('bytes512') should work, since it is the name of one of NumPy's own dtypes (although I don't know what guarantees NumPy gives about the .name attribute of dtypes).
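
To see where the string in the error message comes from, the sketch below inspects what NumPy reports for an S64 dtype. It only uses the dtype attributes `name`, `str` and `itemsize`; the helper `can_parse_dtype` is a hypothetical name introduced for illustration. Whether `np.dtype` accepts its own `.name` string back may depend on the NumPy version, so the code checks this at runtime rather than assuming a result.

```python
import numpy as np


def can_parse_dtype(spec):
    """Return True if np.dtype() accepts `spec` (hypothetical helper)."""
    try:
        np.dtype(spec)
    except TypeError:
        return False
    return True


dt = np.dtype('S64')

# A 64-byte string is 512 bits wide, which is where the name in the
# traceback comes from.
print(dt.name, dt.itemsize * 8)

# The array-protocol string round-trips back to the same dtype.
print(dt.str, np.dtype(dt.str) == dt)

# Whether the .name string is accepted is checked, not assumed.
print(can_parse_dtype(dt.name))
```

The same inspection applied to `np.dtype('S24')` shows that the name is derived from the item size in the same way, so the S64 name is not a special case.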

WillAyd · 2018-04-19T16:31:17Z

For sure it's a little bit of a gray area to the issue described, but reading through the comments it doesn't seem like NumPy wants to make any guarantees about str comps. This one is certainly more compelling of an argument to support than a comp to an arbitrary string so I suppose we could open the issue there and see if it gains more traction than the linked issue (happy to open that).

cc @jreback for any input

stephenmartindale · 2018-04-20T08:25:02Z

@jschendel I'd think that that NumPy bug is relevant but, in this case, not the same bug because S64 or bytes512 should be a valid NumPy type and, therefore, should be a fair target for comparison.

Of course, I'd also argue that the example given in that issue (np.dtype('i8') == 'foo') should also work without throwing. I'd say that np.dtype('i8') == 'foo' should result in False. Why? Because the outcome of the comparison, as it is written, is clearly false. Is the comparison of i8 to foo likely a programming error or mistake? Almost certainly. A warning along the lines of "'foo' is not a data type" would notify the user that they've likely made a typo, but the comparison should yield False nonetheless.

Such a design change, had it been effected, would change the story of this issue. We would now be discussing why my DataFrame code was throwing warnings saying that bytes512 is not a data type when it clearly is. That would lead to much more sensible issues being logged upstream in NumPy: "comparison says that bytes512 is not a data type when it clearly is."

Finally: why is S64 being treated differently to S24 in my code? 64-characters is hardly an excessive string. In my use-case, it's just a hash of some data that my source uses to identify that data. (A third-party. Not my design choice.)

jreback · 2018-04-20T10:20:09Z

this is a numpy issue (if that) and not solvable in pandas

jreback · 2018-04-20T10:28:47Z

actually we should do this patch:

In [3]: df.dtypes
Out[3]: 
a    uint32
b      |S64
c      |S64
d     uint8
dtype: object

In [4]: quit()
g(pandas) bash-3.2$ git diff
diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index 37d112964..5e7d37ef8 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -5115,7 +5115,7 @@ def _block_shape(values, ndim=1, shape=None):
 def _vstack(to_stack, dtype):
 
     # work around NumPy 1.6 bug
-    if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
+    if is_dtype_equal(dtype, _NS_DTYPE) or is_dtype_equal(dtype, _TD_DTYPE):
         new_values = np.vstack([x.view('i8') for x in to_stack])
         return new_values.view(dtype)

Though it's still buggy: string types like these should be converted to object, as they are unsupported.
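
The idea behind the patch above, comparing dtypes through a helper that cannot raise, can be sketched with NumPy alone. This is not pandas' `is_dtype_equal` implementation: `dtypes_equal` is a hypothetical stand-in that treats anything `np.dtype` cannot parse as "not equal", and `NS_DTYPE` / `TD_DTYPE` are assumed here to mirror the `_NS_DTYPE` and `_TD_DTYPE` constants in the diff.

```python
import numpy as np

# Assumed equivalents of pandas' _NS_DTYPE and _TD_DTYPE constants.
NS_DTYPE = np.dtype('M8[ns]')
TD_DTYPE = np.dtype('m8[ns]')


def dtypes_equal(left, right):
    """Compare two dtype-likes; unparseable input counts as unequal.

    Hypothetical helper: both sides are converted to np.dtype first,
    so the comparison itself never sees a raw string.
    """
    try:
        return bool(np.dtype(left) == np.dtype(right))
    except TypeError:
        return False


# A real dtype object and a string NumPy may not accept are both
# handled without an exception escaping.
print(dtypes_equal(np.dtype('S64'), NS_DTYPE))
print(dtypes_equal('bytes512', NS_DTYPE))
print(dtypes_equal('datetime64[ns]', NS_DTYPE))
```

Converting both operands up front keeps the failure mode in one place: a bad dtype string yields False instead of propagating a TypeError out of an equality check.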

mroeschke · 2020-04-10T04:22:30Z

Looks like we correctly coerce to object now. Guess this could use a test

In [52]: index = pd.Series(name='id', dtype='S24')
    ...: df = pd.DataFrame(index=index)
    ...: df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
    ...: df['b'] = pd.Series(name='b', index=index, dtype='S64')
    ...: df['c'] = pd.Series(name='c', index=index, dtype='S64')
    ...: df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)

In [53]: df
Out[53]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []

In [54]: df.dtypes
Out[54]:
a    uint32
b    object
c    object
d     uint8
dtype: object

In [55]: pd.__version__
Out[55]: '1.1.0.dev0+1216.gd4d58f960'
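
A regression test along the lines suggested above might look like the following sketch. It is not the test that was eventually merged; the function names are hypothetical, and the object-dtype assertions for `b` and `c` simply encode the output shown in this session, so they assume a pandas version that coerces fixed-width bytes dtypes to object.

```python
import numpy as np
import pandas as pd


def build_frame():
    """Rebuild the empty frame from the original report."""
    index = pd.Series(name='id', dtype='S24')
    df = pd.DataFrame(index=index)
    df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
    df['b'] = pd.Series(name='b', index=index, dtype='S64')
    df['c'] = pd.Series(name='c', index=index, dtype='S64')
    # This assignment is the one that raised in the original report.
    df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)
    return df


def test_setitem_after_bytes_columns():
    # Hypothetical test name; runnable under pytest or called directly.
    df = build_frame()
    assert list(df.columns) == ['a', 'b', 'c', 'd']
    assert len(df) == 0
    assert df['a'].dtype == np.uint32
    assert df['d'].dtype == np.uint8
    # Assumption taken from the session above: bytes columns become object.
    assert df['b'].dtype == object
    assert df['c'].dtype == object
```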

jreback closed this as completed Apr 20, 2018

jreback added this to the won't fix milestone Apr 20, 2018

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 20, 2018

jreback reopened this Apr 20, 2018

jreback removed this from the won't fix milestone Apr 20, 2018

jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate labels Apr 20, 2018

jreback added this to the Next Major Release milestone Apr 20, 2018

WillAyd mentioned this issue Apr 24, 2018

BUG: DataFrame.from_records with empty rec array #20806

Closed

jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Compat pandas objects compatability with Numpy or Python functions Constructors Series/DataFrame/Index/pd.array Constructors Dtype Conversions Unexpected or buggy dtype conversions labels Apr 10, 2020

mroeschke mentioned this issue May 28, 2021

TST: More old issues #41697

Merged


mroeschke modified the milestones: Contributions Welcome, 1.3 May 28, 2021

jreback closed this as completed in #41697 May 28, 2021
