This repository was archived by the owner on Apr 10, 2024. It is now read-only.

Separate pd2.NaT for datetime vs timedelta #74

Open
jbrockmendel opened this issue Jan 9, 2018 · 3 comments

Comments

@jbrockmendel

A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.
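To make the ambiguity concrete, here is a minimal sketch (assuming a reasonably recent pandas and NumPy; it is not taken from the linked issue). Both pandas constructors hand back the same NaT singleton, whereas NumPy, shown only for contrast, keeps two typed not-a-time scalars, which is roughly the kind of split proposed here.

```python
import numpy as np
import pandas as pd

# pandas: the datetime and timedelta constructors both return the single
# NaT singleton, so the scalar carries no datetime-vs-timedelta information.
dt_nat = pd.Timestamp("NaT")
td_nat = pd.Timedelta("NaT")
print(dt_nat is pd.NaT, td_nat is pd.NaT)

# NumPy (for contrast): two distinct, typed not-a-time scalars.
np_dt_nat = np.datetime64("NaT")
np_td_nat = np.timedelta64("NaT")
print(np_dt_nat.dtype.kind, np_td_nat.dtype.kind)  # "M" = datetime, "m" = timedelta
print(np.isnat(np_dt_nat), np.isnat(np_td_nat))
```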

@chris-b1

chris-b1 commented Jan 9, 2018

I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.

In [132]: import pyarrow as pa

In [133]: pa.array([1, 2, None])
Out[133]: 
<pyarrow.lib.Int64Array object at 0x000000000BCBBB88>
[
  1,
  2,
  NA
]

In [134]: pa.array([1, 2, None])[-1]
Out[134]: NA

In [135]: import datetime

In [136]: pa.array([datetime.datetime(2016, 12, 31), None])
Out[136]: 
<pyarrow.lib.TimestampArray object at 0x000000000BD2CB38>
[
  Timestamp('2016-12-31 00:00:00'),
  NA
]

In [137]: pa.array([datetime.datetime(2016, 12, 31), None])[-1]
Out[137]: NA

In [138]: type(_)
Out[138]: pyarrow.lib.NAType

@jbrockmendel
Author

@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development?

Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
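A small sketch of that distinction, assuming a recent pandas (the variable names are illustrative): the containers know whether they hold datetimes or timedeltas, but the missing element pulled out of either one is the same untyped scalar.

```python
import pandas as pd

dti = pd.to_datetime(["2016-12-31", None])   # DatetimeIndex with one missing value
tdi = pd.to_timedelta(["1 days", None])      # TimedeltaIndex with one missing value

# The containers are unambiguous: their dtypes differ.
print(dti.dtype.kind, tdi.dtype.kind)  # "M" = datetime, "m" = timedelta

# The scalars are not: both missing elements are the very same object.
missing_dt = dti[1]
missing_td = tdi[1]
print(missing_dt is pd.NaT, missing_td is pd.NaT, missing_dt is missing_td)
```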

@chris-b1
Copy link

chris-b1 commented Jan 9, 2018

Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:

arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues

In pyarrow, NA is also the scalar type. I'm not sure how this will actually work, since numeric ops etc. are not implemented yet, but in theory something like the following could be supported (it currently raises):

In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA

TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'
