Separate pd2.NaT for datetime vs timedelta #74

jbrockmendel · 2018-01-09T16:51:59Z

A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.

chris-b1 · 2018-01-09T19:13:19Z

I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.

In [132]: import pyarrow as pa

In [133]: pa.array([1, 2, None])
Out[133]: 
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
  1,
  2,
  NA
]

In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA

In [135]: import datetime

In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]: 
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
  Timestamp('2016-12-31 00:00:00'),
  NA
]

In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA

In [138]: type(_)
Out[138]: pyarrow.lib.NAType
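
To make the "separate bitmap" point concrete, here is a rough pure-Python sketch of an array that stores missing-ness next to the values instead of using a sentinel. The class name `MaskedInt64Array` and its methods are made up for illustration; this is not pyarrow's API, only a toy loosely modeled on the idea of one validity bit per slot.

```python
class MaskedInt64Array:
    """Toy array (hypothetical): values and validity are stored separately."""

    def __init__(self, items):
        self.length = len(items)
        # Missing slots get an arbitrary placeholder; it is never read back.
        self.values = [0 if x is None else x for x in items]
        # One validity bit per slot, packed into a Python int (bit i = slot i).
        self.validity = 0
        for i, x in enumerate(items):
            if x is not None:
                self.validity |= 1 << i

    def is_valid(self, i):
        return bool((self.validity >> i) & 1)

    def __getitem__(self, i):
        if i < 0:
            i += self.length
        if not 0 <= i < self.length:
            raise IndexError("index out of range")
        # The same null marker is returned regardless of the element type.
        return self.values[i] if self.is_valid(i) else None


arr = MaskedInt64Array([1, 2, None])
print(arr[-1])             # missing slot comes back as None
print(bin(arr.validity))   # validity bits for the three slots
```

Because the null marker carries no type of its own here, the array's dtype is what disambiguates it, which is why a single NA scalar can serve every array type.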

jbrockmendel · 2018-01-09T20:01:27Z

@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?

Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
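
A minimal sketch of what two unambiguous scalars could look like, using only the standard library. The names `NaTDatetime` and `NaTTimedelta` are hypothetical and do not exist in pandas or pyarrow; the point is only that each null knows which kind of value it stands in for, so arithmetic has a single well-defined result type.

```python
import datetime as dt


class NaTDatetime:
    """Hypothetical missing datetime."""

    def __repr__(self):
        return "NaTDatetime"

    def __add__(self, other):
        # datetime + timedelta -> datetime; datetime + datetime is an error.
        if isinstance(other, (dt.timedelta, NaTTimedelta)):
            return self
        return NotImplemented

    __radd__ = __add__

    def __sub__(self, other):
        # datetime - datetime -> timedelta; datetime - timedelta -> datetime.
        if isinstance(other, (dt.datetime, NaTDatetime)):
            return NaTTimedelta()
        if isinstance(other, (dt.timedelta, NaTTimedelta)):
            return self
        return NotImplemented

    def __rsub__(self, other):
        if isinstance(other, dt.datetime):
            return NaTTimedelta()
        return NotImplemented


class NaTTimedelta:
    """Hypothetical missing timedelta."""

    def __repr__(self):
        return "NaTTimedelta"

    def __add__(self, other):
        # timedelta + timedelta -> timedelta; timedelta + datetime -> datetime.
        if isinstance(other, (dt.timedelta, NaTTimedelta)):
            return self
        if isinstance(other, (dt.datetime, NaTDatetime)):
            return NaTDatetime()
        return NotImplemented

    __radd__ = __add__


print(dt.datetime(2018, 1, 9) + NaTTimedelta())  # a missing datetime
print(NaTDatetime() - NaTDatetime())             # a missing timedelta
```

With a single dual-purpose NaT neither result type can be determined from the scalar alone, which is the ambiguity described above.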

chris-b1 · 2018-01-09T20:11:57Z

Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:

arrow - base, python agnostic, c++ layer, core memory layout & algos - https://github.com/apache/arrow/tree/master/cpp
pyarrow - python wrapper/access to arrow, https://github.com/apache/arrow/tree/master/python
pandas2 - TBD, wrapper around pyarrow (may be one and the same), more traditional pandas interface. (see also ibis - https://github.com/ibis-project/ibis)

arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues

In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory something like this could be supported:

In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA

TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'

shoyer mentioned this issue Jan 28, 2019

Separate NaT values for Timedelta ("NaTD") and Period? pandas-dev/pandas#24983

Open
