Skip to content

BUG refactor datetime parsing and fix 8 bugs #50242

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -780,6 +780,7 @@ Performance improvements
- Performance improvement in :func:`merge` when not merging on the index - the new index will now be :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`49478`)
- Performance improvement in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when using any non-object dtypes (:issue:`46470`)
- Performance improvement in :func:`read_html` when there are multiple tables (:issue:`49929`)
- Performance improvement in :func:`to_datetime` when using ``'%Y%m%d'`` format (:issue:`17410`)

.. ---------------------------------------------------------------------------
.. _whatsnew_200.bug_fixes:
Expand Down Expand Up @@ -809,6 +810,13 @@ Datetimelike
- Bug in :meth:`Timestamp.round` when the ``freq`` argument has zero-duration (e.g. "0ns") returning incorrect results instead of raising (:issue:`49737`)
- Bug in :func:`to_datetime` was not raising ``ValueError`` when invalid format was passed and ``errors`` was ``'ignore'`` or ``'coerce'`` (:issue:`50266`)
- Bug in :class:`DateOffset` was throwing ``TypeError`` when constructing with milliseconds and another super-daily argument (:issue:`49897`)
- Bug in :func:`to_datetime` was not raising ``ValueError`` when parsing string with decimal date with format ``'%Y%m%d'`` (:issue:`50051`)
- Bug in :func:`to_datetime` was not converting ``None`` to ``NaT`` when parsing mixed-offset date strings with ISO8601 format (:issue:`50071`)
- Bug in :func:`to_datetime` was not returning input when parsing out-of-bounds date string with ``errors='ignore'`` and ``format='%Y%m%d'`` (:issue:`14487`)
- Bug in :func:`to_datetime` was converting timezone-naive ``datetime.datetime`` to timezone-aware when parsing with timezone-aware strings, ISO8601 format, and ``utc=False`` (:issue:`50254`)
- Bug in :func:`to_datetime` was throwing ``ValueError`` when parsing dates with ISO8601 format where some values were not zero-padded (:issue:`21422`)
- Bug in :func:`to_datetime` was giving incorrect results when using ``format='%Y%m%d'`` and ``errors='ignore'`` (:issue:`26493`)
- Bug in :func:`to_datetime` was failing to parse date strings ``'today'`` and ``'now'`` if ``format`` was not ISO8601 (:issue:`50359`)
-

Timedelta
Expand Down
3 changes: 0 additions & 3 deletions pandas/_libs/tslib.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -23,9 +23,6 @@ def array_to_datetime(
dayfirst: bool = ...,
yearfirst: bool = ...,
utc: bool = ...,
require_iso8601: bool = ...,
format: str | None = ...,
exact: bool = ...,
) -> tuple[np.ndarray, tzinfo | None]: ...

# returned ndarray may be object dtype or datetime64[ns]
Expand Down
64 changes: 3 additions & 61 deletions pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ from pandas._libs.tslibs.np_datetime cimport (
pydatetime_to_dt64,
string_to_dts,
)
from pandas._libs.tslibs.strptime cimport parse_today_now
from pandas._libs.util cimport (
is_datetime64_object,
is_float_object,
Expand Down Expand Up @@ -443,9 +444,6 @@ cpdef array_to_datetime(
bint dayfirst=False,
bint yearfirst=False,
bint utc=False,
bint require_iso8601=False,
format: str | None=None,
bint exact=True,
):
"""
Converts a 1D array of date-like values to a numpy array of either:
Expand All @@ -472,8 +470,6 @@ cpdef array_to_datetime(
yearfirst parsing behavior when encountering datetime strings
utc : bool, default False
indicator whether the dates should be UTC
require_iso8601 : bool, default False
indicator whether the datetime string should be iso8601

Returns
-------
Expand Down Expand Up @@ -544,16 +540,6 @@ cpdef array_to_datetime(
iresult[i] = get_datetime64_nanos(val, NPY_FR_ns)

elif is_integer_object(val) or is_float_object(val):
if require_iso8601:
if is_coerce:
iresult[i] = NPY_NAT
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" doesn't "
f"match format \"{format}\", at position {i}"
)
return values, tz_out
# these must be ns unit by-definition
seen_integer = True

Expand Down Expand Up @@ -584,25 +570,13 @@ cpdef array_to_datetime(

string_to_dts_failed = string_to_dts(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messaging here is a bit confusing to me - looks like string_to_dts is already labeled ?except -1. Is there a reason why Cython doesn't propogate an error before your check of if string_to_dts_failed?

Copy link
Member Author

@MarcoGorelli MarcoGorelli Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because here want_exc is False:

parse_error:
if (want_exc) {
PyErr_Format(PyExc_ValueError,
"Error parsing datetime string \"%s\" at position %d", str,
(int)(substr - str));
}
return -1;

The only place where it's True is

object[::1] res_flat = result.ravel() # should NOT be a copy
cnp.flatiter it = cnp.PyArray_IterNew(values)
if na_rep is None:
na_rep = "NaT"
if tz is None:
# if we don't have a format nor tz, then choose
# a format based on precision
basic_format = format is None
if basic_format:
reso_obj = get_resolution(values, tz=tz, reso=reso)
show_ns = reso_obj == Resolution.RESO_NS
show_us = reso_obj == Resolution.RESO_US
show_ms = reso_obj == Resolution.RESO_MS
elif format == "%Y-%m-%d %H:%M:%S":
# Same format as default, but with hardcoded precision (s)

which is only a testing function. So, perhaps the ?except -1 can just be removed, and the testing function removed (I think it would be better to test to_datetime directly).

I'd keep that to a separate PR anyway, but thanks for catching this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thanks for review. Yea I'd be OK with your suggestion in a separate PR. Always good to clean this up - not sure we've handled consistently in the past

val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False, format, exact
&out_tzoffset, False, None, False
)
if string_to_dts_failed:
# An error at this point is a _parsing_ error
# specifically _not_ OutOfBoundsDatetime
if _parse_today_now(val, &iresult[i], utc):
if parse_today_now(val, &iresult[i], utc):
continue
elif require_iso8601:
# if requiring iso8601 strings, skip trying
# other formats
if is_coerce:
iresult[i] = NPY_NAT
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" doesn't "
f"match format \"{format}\", at position {i}"
)
return values, tz_out

try:
py_dt = parse_datetime_string(val,
Expand Down Expand Up @@ -665,18 +639,6 @@ cpdef array_to_datetime(
if is_coerce:
iresult[i] = NPY_NAT
continue
elif require_iso8601 and isinstance(val, str):
# GH#19382 for just-barely-OutOfBounds falling back to
# dateutil parser will return incorrect result because
# it will ignore nanoseconds
if is_raise:

# Still raise OutOfBoundsDatetime,
# as error message is informative.
raise

assert is_ignore
return values, tz_out
raise

except OutOfBoundsDatetime:
Expand Down Expand Up @@ -835,26 +797,6 @@ cdef _array_to_datetime_object(
return oresult, None


cdef bint _parse_today_now(str val, int64_t* iresult, bint utc):
# We delay this check for as long as possible
# because it catches relatively rare cases

# Multiply by 1000 to convert to nanos, since these methods naturally have
# microsecond resolution
if val == "now":
if utc:
iresult[0] = Timestamp.utcnow().value * 1000
else:
# GH#18705 make sure to_datetime("now") matches Timestamp("now")
# Note using Timestamp.now() is faster than Timestamp("now")
iresult[0] = Timestamp.now().value * 1000
return True
elif val == "today":
iresult[0] = Timestamp.today().value * 1000
return True
return False


def array_to_datetime_with_tz(ndarray values, tzinfo tz):
"""
Vectorized analogue to pd.Timestamp(value, tz=tz)
Expand Down
1 change: 0 additions & 1 deletion pandas/_libs/tslibs/parsing.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,6 @@ def try_parse_datetime_components(
minutes: npt.NDArray[np.object_], # object[:]
seconds: npt.NDArray[np.object_], # object[:]
) -> npt.NDArray[np.object_]: ...
def format_is_iso(f: str) -> bool: ...
def guess_datetime_format(
dt_str,
dayfirst: bool | None = ...,
Expand Down
20 changes: 0 additions & 20 deletions pandas/_libs/tslibs/parsing.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -818,26 +818,6 @@ class _timelex:
_DATEUTIL_LEXER_SPLIT = _timelex.split


def format_is_iso(f: str) -> bint:
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
iso_template = "%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}".format
excluded_formats = ["%Y%m%d", "%Y%m", "%Y"]

for date_sep in [" ", "/", "\\", "-", ".", ""]:
for time_sep in [" ", "T"]:
for micro_or_tz in ["", "%z", ".%f", ".%f%z"]:
if (iso_template(date_sep=date_sep,
time_sep=time_sep,
micro_or_tz=micro_or_tz,
).startswith(f) and f not in excluded_formats):
return True
return False


def guess_datetime_format(dt_str: str, bint dayfirst=False) -> str | None:
"""
Guess the datetime format of a given datetime string.
Expand Down
4 changes: 4 additions & 0 deletions pandas/_libs/tslibs/strptime.pxd
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
from numpy cimport int64_t


cdef bint parse_today_now(str val, int64_t* iresult, bint utc)
74 changes: 74 additions & 0 deletions pandas/_libs/tslibs/strptime.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,14 @@ from pandas._libs.tslibs.nattype cimport (
c_nat_strings as nat_strings,
)
from pandas._libs.tslibs.np_datetime cimport (
NPY_DATETIMEUNIT,
NPY_FR_ns,
check_dts_bounds,
npy_datetimestruct,
npy_datetimestruct_to_datetime,
pydate_to_dt64,
pydatetime_to_dt64,
string_to_dts,
)
from pandas._libs.tslibs.np_datetime import OutOfBoundsDatetime
from pandas._libs.tslibs.timestamps cimport _Timestamp
Expand All @@ -48,9 +50,50 @@ from pandas._libs.util cimport (
is_float_object,
is_integer_object,
)
from pandas._libs.tslibs.timestamps import Timestamp

cnp.import_array()

cdef bint format_is_iso(f: str):
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
excluded_formats = ["%Y%m"]

for date_sep in [" ", "/", "\\", "-", ".", ""]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a loop can you express this as a regular expression? Seems like it would help the performance that way as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind I see this is the way it is currently written - something to consider for another PR though. My guess is can only help

for time_sep in [" ", "T"]:
for micro_or_tz in ["", "%z", ".%f", ".%f%z"]:
iso_fmt = f"%Y{date_sep}%m{date_sep}%d{time_sep}%H:%M:%S{micro_or_tz}"
if iso_fmt.startswith(f) and f not in excluded_formats:
return True
return False


def _test_format_is_iso(f: str) -> bool:
"""Only used in testing."""
return format_is_iso(f)


cdef bint parse_today_now(str val, int64_t* iresult, bint utc):
# We delay this check for as long as possible
# because it catches relatively rare cases

# Multiply by 1000 to convert to nanos, since these methods naturally have
# microsecond resolution
if val == "now":
if utc:
iresult[0] = Timestamp.utcnow().value * 1000
else:
# GH#18705 make sure to_datetime("now") matches Timestamp("now")
# Note using Timestamp.now() is faster than Timestamp("now")
iresult[0] = Timestamp.now().value * 1000
return True
elif val == "today":
iresult[0] = Timestamp.today().value * 1000
return True
return False

cdef dict _parse_code_table = {"y": 0,
"Y": 1,
Expand Down Expand Up @@ -111,6 +154,9 @@ def array_strptime(
bint found_naive = False
bint found_tz = False
tzinfo tz_out = None
bint iso_format = fmt is not None and format_is_iso(fmt)
NPY_DATETIMEUNIT out_bestunit
int out_local = 0, out_tzoffset = 0

assert is_raise or is_ignore or is_coerce

Expand Down Expand Up @@ -232,6 +278,34 @@ def array_strptime(
else:
val = str(val)

if iso_format:
string_to_dts_failed = string_to_dts(
val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False, fmt, exact
)
if not string_to_dts_failed:
# No error reported by string_to_dts, pick back up
# where we left off
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
if out_local == 1:
# Store the out_tzoffset in seconds
# since we store the total_seconds of
# dateutil.tz.tzoffset objects
tz = timezone(timedelta(minutes=out_tzoffset))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the analogous block in tslib we then adjust value using tz_localize_to_utc. do we need to do that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it happens a few levels up, here:

tz_results = np.empty(len(result), dtype=object)
for zone in unique(timezones):
mask = timezones == zone
dta = DatetimeArray(result[mask]).tz_localize(zone)
if utc:
if dta.tzinfo is None:
dta = dta.tz_localize("utc")
else:
dta = dta.tz_convert("utc")
tz_results[mask] = dta

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thanks. would it be viable to use the same pattern so we can share more code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would indeed be good, I'll see what I can do

result_timezone[i] = tz
out_local = 0
out_tzoffset = 0
iresult[i] = value
check_dts_bounds(&dts)
continue

if parse_today_now(val, &iresult[i], utc):
continue

# Some ISO formats can't be parsed by string_to_dts
# For example, 6-digit YYYYMD. So, if there's an error,
# try the string-matching code below.

# exact matching
if exact:
found = format_regex.match(val)
Expand Down
7 changes: 0 additions & 7 deletions pandas/core/arrays/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -2118,10 +2118,7 @@ def objects_to_datetime64ns(
yearfirst,
utc: bool = False,
errors: DateTimeErrorChoices = "raise",
require_iso8601: bool = False,
allow_object: bool = False,
format: str | None = None,
exact: bool = True,
):
"""
Convert data to array of timestamps.
Expand All @@ -2134,7 +2131,6 @@ def objects_to_datetime64ns(
utc : bool, default False
Whether to convert/localize timestamps to UTC.
errors : {'raise', 'ignore', 'coerce'}
require_iso8601 : bool, default False
allow_object : bool
Whether to return an object-dtype ndarray instead of raising if the
data contains more than one timezone.
Expand Down Expand Up @@ -2165,9 +2161,6 @@ def objects_to_datetime64ns(
utc=utc,
dayfirst=dayfirst,
yearfirst=yearfirst,
require_iso8601=require_iso8601,
format=format,
exact=exact,
)
result = result.reshape(data.shape, order=order)
except OverflowError as err:
Expand Down
Loading