Creating DataFrame throws: data type "bytes512" not understood #20734

Closed
stephenmartindale opened this issue Apr 18, 2018 · 8 comments · Fixed by #41697
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@stephenmartindale

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

index = pd.Series(name='id', dtype='S24')
df = pd.DataFrame(index=index)
df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
df['b'] = pd.Series(name='b', index=index, dtype='S64')
df['c'] = pd.Series(name='c', index=index, dtype='S64')
df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)

Problem description

The code above, which attempts to create an empty pandas.DataFrame with an index and four typed columns, yields the following error:

[... snip ...]\appdata\local\programs\python\python36\lib\site-packages\pandas\core\internals.py in _vstack(to_stack, dtype)
   4912 
   4913     # work around NumPy 1.6 bug
-> 4914     if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
   4915         new_values = np.vstack([x.view('i8') for x in to_stack])
   4916         return new_values.view(dtype)

TypeError: data type "bytes512" not understood

Why?

Changing the order of the columns works just fine:

index = pd.Series(name='id', dtype='S24')
df = pd.DataFrame(index=index)
df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)
df['b'] = pd.Series(name='b', index=index, dtype='S64')
df['c'] = pd.Series(name='c', index=index, dtype='S64')

In fact, it seems that any Series added after the two S64 series throws an error: I tried with both np.float and np.bool.

Expected Output

I would expect the order in which the Series are added not to matter or, if it actually does matter, perhaps a better error message.

I tried with an older version of Python 3.6, NumPy and Pandas and then updated, thinking this was just a bug. The latest version I tested was CPython 3.6.5, NumPy 1.14.2, Pandas 0.22.0.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Apr 18, 2018

I think this issue is back one layer in NumPy:

np.dtype('float') == np.dtype('float').name
True
np.dtype('S64') == np.dtype('S64').name
*** TypeError: data type "bytes512" not understood

Not an expert on dtypes so will see if others chime in, but I have a feeling this will need to be opened as an issue with that project instead of here
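
A minimal sketch of where the name in the error message comes from, assuming only NumPy is installed. The `.name` of a fixed-width bytes dtype is built from its size in bits, so `S64` (64 bytes) reports `bytes512`. The helper `name_round_trips` is hypothetical and written only for this illustration; whether a given name can be parsed back into a dtype may depend on the NumPy version.

```python
import numpy as np


def name_round_trips(dtype_like):
    """Return True if np.dtype(...).name can be parsed back into the same dtype.

    Hypothetical helper for illustration; not part of NumPy or pandas.
    """
    dt = np.dtype(dtype_like)
    try:
        return bool(np.dtype(dt.name) == dt)
    except TypeError:
        # NumPy does not understand the name as a dtype specification.
        return False


s64 = np.dtype("S64")
print(s64.itemsize)  # size in bytes
print(s64.name)      # "bytes" followed by the size in bits (64 * 8)
print(name_round_trips("float64"))
print(name_round_trips("S64"))
```

This matches the comparison above: `np.dtype('float').name` is itself a valid dtype string, so comparing a dtype against it succeeds, while the name of the `S64` dtype is what the traceback reports as not understood.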

@jschendel
Member

See numpy/numpy#5329

@jorisvandenbossche
Member

The issue @jschendel links to is about NumPy refusing to compare a dtype with a string it does not recognise (because np.dtype('unknown_string') raises an error).
However, in this case I think you could argue that np.dtype('bytes512') should work, since it is the name of one of NumPy's own dtypes (although I don't know what guarantees NumPy gives about the .name attribute of dtypes).

@WillAyd
Member

WillAyd commented Apr 19, 2018

For sure it's a little bit of a gray area to the issue described, but reading through the comments it doesn't seem like NumPy wants to make any guarantees about str comps. This one is certainly more compelling of an argument to support than a comp to an arbitrary string so I suppose we could open the issue there and see if it gains more traction than the linked issue (happy to open that).

cc @jreback for any input

@stephenmartindale
Author

stephenmartindale commented Apr 20, 2018

@jschendel I'd think that that NumPy bug is relevant but, in this case, not the same bug because S64 or bytes512 should be a valid NumPy type and, therefore, should be a fair target for comparison.

Of course, I'd also argue that the example given in that issue (np.dtype('i8') == 'foo') should also work without throwing. I'd say that np.dtype('i8') == 'foo' should result in False. Why? Because the outcome of the comparison, as it is written, is clearly false. Is the comparison of i8 to foo likely a programming error or mistake? Almost certainly. A warning along the lines of "'foo' is not a data type" would notify the user that they've likely made a typing error, but the comparison should yield False nonetheless.

Such a design change, had it been effected, would change the story of this issue. We would now be discussing why my DataFrame code was throwing warnings saying that bytes512 is not a data type when it clearly is. That would lead to much more sensible issues being logged upstream in NumPy: "comparison says that bytes512 is not a data type when it clearly is."

Finally: why is S64 being treated differently to S24 in my code? 64-characters is hardly an excessive string. In my use-case, it's just a hash of some data that my source uses to identify that data. (A third-party. Not my design choice.)
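
The warn-and-return-False behaviour proposed in this comment could be sketched as a wrapper like the one below. The function name `dtype_eq_or_false` is hypothetical and not part of NumPy. The sketch assumes the comparison signals an unrecognised string by raising TypeError; on NumPy versions where the comparison does not raise, the wrapper simply passes the result through and no warning is issued.

```python
import warnings

import numpy as np


def dtype_eq_or_false(dtype, other):
    """Compare a NumPy dtype with `other`, warning instead of raising.

    Hypothetical wrapper for illustration; not a NumPy API.
    """
    try:
        return bool(dtype == other)
    except TypeError:
        # `other` could not be interpreted as a dtype: warn, then treat the
        # comparison as unequal rather than propagating the exception.
        warnings.warn(f"{other!r} is not a data type", stacklevel=2)
        return False


print(dtype_eq_or_false(np.dtype("i8"), "i8"))
print(dtype_eq_or_false(np.dtype("i8"), "foo"))
```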

@jreback
Contributor

jreback commented Apr 20, 2018

this is a numpy issue (if that) and not solvable in pandas

@jreback jreback closed this as completed Apr 20, 2018
@jreback jreback added this to the won't fix milestone Apr 20, 2018
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Apr 20, 2018
@jreback jreback reopened this Apr 20, 2018
@jreback
Contributor

jreback commented Apr 20, 2018

actually we should do this patch:

In [3]: df.dtypes
Out[3]: 
a    uint32
b      |S64
c      |S64
d     uint8
dtype: object

In [4]: quit()
g(pandas) bash-3.2$ git diff
diff --git a/pandas/core/internals.py b/pandas/core/internals.py
index 37d112964..5e7d37ef8 100644
--- a/pandas/core/internals.py
+++ b/pandas/core/internals.py
@@ -5115,7 +5115,7 @@ def _block_shape(values, ndim=1, shape=None):
 def _vstack(to_stack, dtype):
 
     # work around NumPy 1.6 bug
-    if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
+    if is_dtype_equal(dtype, _NS_DTYPE) or is_dtype_equal(dtype, _TD_DTYPE):
         new_values = np.vstack([x.view('i8') for x in to_stack])
         return new_values.view(dtype)

though it's still buggy. string types like these should be converted to object as they are unsupported.
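
A rough illustration of the check used in the patch above, via the public `pandas.api.types.is_dtype_equal`. The constants `NS_DTYPE` and `TD_DTYPE` are stand-ins assumed here for the private `_NS_DTYPE` and `_TD_DTYPE`, and `needs_i8_view` is a hypothetical name for the condition in `_vstack`; this is not the pandas implementation itself.

```python
import numpy as np
from pandas.api.types import is_dtype_equal

# Stand-ins (assumed) for the private _NS_DTYPE / _TD_DTYPE constants.
NS_DTYPE = np.dtype("M8[ns]")
TD_DTYPE = np.dtype("m8[ns]")


def needs_i8_view(dtype):
    """Mirror the patched condition: True only for datetime64[ns]/timedelta64[ns]."""
    return is_dtype_equal(dtype, NS_DTYPE) or is_dtype_equal(dtype, TD_DTYPE)


print(needs_i8_view(np.dtype("S64")))
print(needs_i8_view(np.dtype("datetime64[ns]")))
print(needs_i8_view(np.dtype("timedelta64[ns]")))
```

The design point is that `is_dtype_equal` treats an invalid comparison as "not equal" instead of letting the exception escape, which a bare `==` in `_vstack` did not do.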

@jreback jreback removed this from the won't fix milestone Apr 20, 2018
@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate labels Apr 20, 2018
@jreback jreback added this to the Next Major Release milestone Apr 20, 2018
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jul 23, 2019
@mroeschke
Member

Looks like we correctly coerce to object now. Guess this could use a test

In [52]: index = pd.Series(name='id', dtype='S24')
    ...: df = pd.DataFrame(index=index)
    ...: df['a'] = pd.Series(name='a', index=index, dtype=np.uint32)
    ...: df['b'] = pd.Series(name='b', index=index, dtype='S64')
    ...: df['c'] = pd.Series(name='c', index=index, dtype='S64')
    ...: df['d'] = pd.Series(name='d', index=index, dtype=np.uint8)

In [53]: df
Out[53]:
Empty DataFrame
Columns: [a, b, c, d]
Index: []

In [54]: df.dtypes
Out[54]:
a    uint32
b    object
c    object
d     uint8
dtype: object

In [55]: pd.__version__
Out[55]: '1.1.0.dev0+1216.gd4d58f960'
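
A regression test along these lines might look like the sketch below. The names `build_frame` and `test_setitem_after_bytes_columns` are hypothetical, and this is not necessarily the test that was eventually added to pandas. It deliberately asserts nothing about the dtype of the `b` and `c` columns, since the outputs in this thread show that differing between pandas versions.

```python
import numpy as np
import pandas as pd


def build_frame():
    """Build the empty frame from the original report, column by column."""
    index = pd.Series(name="id", dtype="S24")
    df = pd.DataFrame(index=index)
    df["a"] = pd.Series(name="a", index=index, dtype=np.uint32)
    df["b"] = pd.Series(name="b", index=index, dtype="S64")
    df["c"] = pd.Series(name="c", index=index, dtype="S64")
    # Adding a column after the two S64 columns is what raised in pandas 0.22.0.
    df["d"] = pd.Series(name="d", index=index, dtype=np.uint8)
    return df


def test_setitem_after_bytes_columns():
    df = build_frame()  # must not raise TypeError
    assert list(df.columns) == ["a", "b", "c", "d"]
    assert len(df) == 0
    assert df["a"].dtype == np.uint32
    assert df["d"].dtype == np.uint8
```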

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Compat pandas objects compatability with Numpy or Python functions Constructors Series/DataFrame/Index/pd.array Constructors Dtype Conversions Unexpected or buggy dtype conversions labels Apr 10, 2020
@mroeschke mroeschke mentioned this issue May 28, 2021
10 tasks
@mroeschke mroeschke modified the milestones: Contributions Welcome, 1.3 May 28, 2021
7 participants