Separate pd2.NaT for datetime vs timedelta #74
Comments
I believe the API in arrow/pandas2 is currently pushing in the opposite direction, using a unified NA scalar (e.g. below). However, it will probably be easier than it sounds because missing-ness is always tracked in a separate bitmap, rather than as special sentinel values.
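To make the bitmap point concrete, here is a minimal pure-Python sketch of an array that tracks missing-ness in a separate validity bitmap instead of sentinel values. `MaskedInt64Array` and its methods are hypothetical names invented for this illustration; they are not part of pyarrow or pandas, and the real Arrow layout is considerably more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MaskedInt64Array:
    """Hypothetical toy container, not a pyarrow or pandas class."""

    values: List[int]  # payload; slots for missing entries hold a filler 0
    valid: int         # bitmap: bit i is set when values[i] is present

    @classmethod
    def from_optional(cls, items: List[Optional[int]]) -> "MaskedInt64Array":
        values, valid = [], 0
        for i, item in enumerate(items):
            if item is None:
                values.append(0)   # filler, never read as data
            else:
                values.append(item)
                valid |= 1 << i    # mark slot i as present
        return cls(values, valid)

    def is_valid(self, i: int) -> bool:
        return bool((self.valid >> i) & 1)

    def add_scalar(self, scalar: Optional[int]) -> "MaskedInt64Array":
        # A missing scalar makes every result slot missing; no typed
        # sentinel is needed because the bitmap carries that information.
        if scalar is None:
            return MaskedInt64Array([0] * len(self.values), 0)
        summed = [
            v + scalar if self.is_valid(i) else 0
            for i, v in enumerate(self.values)
        ]
        return MaskedInt64Array(summed, self.valid)

    def to_list(self) -> List[Optional[int]]:
        return [
            v if self.is_valid(i) else None
            for i, v in enumerate(self.values)
        ]
```

In this sketch the array's dtype lives with the container, so a single untyped missing scalar (`None` here) is unambiguous once it is combined with an array.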
@chris-b1 thanks for filling me in. Is pyarrow the repo to keep an eye on to follow pd2 development? Big if, but IIUC what you’re discussing is how a null is represented inside an array, where the array holds a dtype. I’m talking about a scalar, where NaT + TimedeltaIndex(...) is ambiguous because NaT currently quacks as both a datetime and a timedelta.
Yeah, the vision has evolved over time, but my current (possibly incorrect) understanding is:
Arrow issues are on JIRA, here - https://issues.apache.org/jira/projects/ARROW/issues
In [144]: pa.array([1, 2, 3]) + pa.NA
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-144-09f1116a2f04> in <module>()
----> 1 pa.array([1, 2, 3]) + pa.NA
TypeError: unsupported operand type(s) for +: 'pyarrow.lib.Int64Array' and 'pyarrow.lib.NAType'
A lot of headaches are caused by the fact that pd.NaT is usually a datetime but occasionally a timedelta. In some cases this leads to unavoidable internal inconsistency (pandas-dev/pandas#19124). For pandas2 it is worth considering breaking these into two distinct constants with unambiguous types.
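As a rough illustration of what two distinct constants could look like, here is a pure-Python sketch using only the standard library. The names `NAT_DATETIME` and `NAT_TIMEDELTA` and the arithmetic rules below are assumptions made for this example; they are not an existing or planned pandas API.

```python
from datetime import datetime, timedelta


class _NaTDatetime:
    """Hypothetical missing value that always behaves like a datetime."""

    def __repr__(self):
        return "NaT[datetime]"

    def __add__(self, other):
        # datetime + timedelta -> datetime
        if isinstance(other, (timedelta, _NaTTimedelta)):
            return NAT_DATETIME
        return NotImplemented  # datetime + datetime is invalid

    __radd__ = __add__

    def __sub__(self, other):
        if isinstance(other, (timedelta, _NaTTimedelta)):
            return NAT_DATETIME
        if isinstance(other, (datetime, _NaTDatetime)):
            return NAT_TIMEDELTA  # datetime - datetime -> timedelta
        return NotImplemented

    def __rsub__(self, other):
        if isinstance(other, datetime):
            return NAT_TIMEDELTA
        return NotImplemented  # timedelta - datetime is invalid


class _NaTTimedelta:
    """Hypothetical missing value that always behaves like a timedelta."""

    def __repr__(self):
        return "NaT[timedelta]"

    def __add__(self, other):
        if isinstance(other, (timedelta, _NaTTimedelta)):
            return NAT_TIMEDELTA
        if isinstance(other, (datetime, _NaTDatetime)):
            return NAT_DATETIME
        return NotImplemented

    __radd__ = __add__

    def __sub__(self, other):
        if isinstance(other, (timedelta, _NaTTimedelta)):
            return NAT_TIMEDELTA
        return NotImplemented  # timedelta - datetime is invalid

    def __rsub__(self, other):
        if isinstance(other, timedelta):
            return NAT_TIMEDELTA
        if isinstance(other, datetime):
            return NAT_DATETIME
        return NotImplemented


# Module-level singletons, one per type.
NAT_DATETIME = _NaTDatetime()
NAT_TIMEDELTA = _NaTTimedelta()
```

With typed constants, the result type of every operation follows from the operand types alone, and invalid combinations such as adding two missing datetimes raise `TypeError` rather than silently returning an untyped missing value.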