PDEP-4: consistent parsing of datetimes #48621

Merged
merged 2 commits into from
Sep 23, 2022
97 changes: 97 additions & 0 deletions web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
@@ -0,0 +1,97 @@
# PDEP-4: Consistent datetime parsing

- Created: 18 September 2022
- Status: Accepted
- Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621)
- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
- Revision: 1

## Abstract

The suggestion is that:
- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input.

Review comment from @datapythonista (Member), Sep 24, 2022:

We should add a blank line before the list; that's standard markdown (GitHub comments allow it, but it's not allowed in the markdown spec)

The format will either be inferred from the first non-NaN element (if `format` is not provided by the user), or from
`format`;
- ``infer_datetime_format`` be deprecated (as a strict version of it will become the default);
- an easy workaround for non-strict parsing be clearly documented.

## Motivation and Scope

Pandas date parsing is very flexible, but arguably too much so - see
https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
much confusion this causes. Pandas can swap format midway, and though this
is documented, it regularly breaks users' expectations.

Simple example:
```ipython
In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
```
The user almost certainly intended the data to be read as "12th of January, 13th of January".
However, it's read as "1st of December, 13th of January". No warning or error is raised.
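
To make the mechanism concrete, here is a minimal standard-library sketch of what per-element flexible parsing amounts to. It is not pandas' actual implementation: the candidate list and the helper `parse_flexibly` are hypothetical, and only `datetime.strptime` is a real API. Each element independently takes the first layout that fits, so the layout can change between rows.

```python
from datetime import datetime

# Hypothetical candidate layouts, tried in this order for every element.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]


def parse_flexibly(value):
    """Return (datetime, format used) for the first candidate that matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt), fmt
        except ValueError:
            continue
    raise ValueError(f"no candidate format matches {value!r}")


values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
results = [parse_flexibly(v) for v in values]
for value, (parsed, fmt) in zip(values, results):
    # The first row fits the month-first layout; the second only fits day-first.
    print(value, "->", parsed.date(), "using", fmt)
```

The two rows end up parsed with different formats, which is exactly the silent switch described above.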

Currently, the only way to ensure consistent parsing is by explicitly passing
``format=``. The argument ``infer_datetime_format``

Review comment (Member):

Minor but related: it would be good to mention that infer_datetime_format should be strict when it is combined with format

In [1]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%Y-%m-%d")
Out[1]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)

# Format doesn't match the input
In [2]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%m-%d-%Y")
Out[2]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)

i.e. it's not great that format != None and infer_datetime_format=True can be passed together, even when format doesn't match the input

isn't strict, can be called together with ``format``, and can still break users' expectations:

```ipython
In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True)
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
```
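
For comparison, the explicit-``format`` route mentioned above can be sketched as follows. This assumes pandas is installed; the day-first format string is an assumption about what the user intended for this data.

```python
import pandas as pd

values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]

# An explicit format applies the same day-first layout to every element,
# so both rows are read as January dates.
parsed = pd.to_datetime(values, format="%d-%m-%Y %H:%M:%S")
print(parsed)
```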

## Detailed Description

Concretely, the suggestion is:
- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row
and parse the rest of the input according to that format. Errors will be handled
according to the ``errors`` argument - there will be no silent switching of format;
- ``infer_datetime_format`` will be deprecated;

Review comment (Contributor):

If you were parsing 11-12-10, would the default be November 12, 2010, or December 11, 2010 or December 10, 2011? I think we should be explicit in this proposal on how dayfirst and yearfirst interacts with this particular behavior, especially since the docs say that those 2 parameters are not strict.

Review comment (Contributor):

I don't know what infer_datetime_format currently does, but is it not very restrictive, and not always possible, to infer from just the first element?

If that were to be the case I support the comment of outlining how to operate with year first and day first. As a European my colleagues and I hate US mm/dd/yy and I think this might also need outlining some specifics.

The advantage of providing a consistent format input is that better inference could be made from multiple samples and this is very useful for structured data.

Reply from the PDEP author (Member):

Thanks both for taking a look!

@Dr-Irv this proposal wouldn't change how dayfirst and yearfirst operate. pandas will still try to guess the format in accordance with these parameters, just as it does on main. The difference is that, with this proposal, the format guessed from the first non-NaN row will be used to parse the rest of the Series

@attack68 in the rare case that it's not possible to guess the format from the first element, then a UserWarning would be raised, check lines 49-55 of this PR

You're very right to bring up mm/dd/yy 👍 - indeed the vast majority of the world doesn't use that format. That's why the current behaviour is so dangerous. For example, suppose your data is in %d-%m-%Y %H:%M format:

On main, the first row's date would be parsed as mm-dd-yyyy, whilst the second one as dd-mm-yyyy. No error, no warning, this is very easy to miss (and I almost did once in a prod setting 😳 ):

In [1]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

With this PDEP, you could just check the format of your first row, and you'd know the rest of the Series was parsed in accordance with it. If it can't be, then with errors='raise' (the default), you'd get an error

ValueError: time data '13-01-2000 00:00' does not match format '%m-%d-%Y %H:%M' (match)

and you'd see that the guessed format wasn't right. You could get around that either by explicitly passing format, or with dayfirst=True:

In [2]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'], dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)

Totally agree on better documenting this, and that inference could be optimised by using multiple samples to guess - first, I just wanted to get agreement that we want to_datetime to parse consistently

- ``dayfirst`` and ``yearfirst`` will continue working as they currently do;
- if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be emitted,
encouraging users to explicitly pass in a format.
Note that this should only happen for invalid inputs such as `'a'`
(which would later raise a ``ParserError`` anyway), or inputs such as ``'00:12:13'``,
which would currently get converted to ``'2022-09-18 00:12:13'``.
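
The rules above can be illustrated with a small standard-library sketch. It is only an approximation under stated assumptions, not the pandas implementation: `guess_format`, `parse_strictly`, the `fmt` parameter and the candidate list are hypothetical names, `None` stands in for NaN, and `dayfirst` is modelled simply as a preference for day-first candidates.

```python
import warnings
from datetime import datetime

# Hypothetical candidates; a real guesser would cover many more layouts.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S"]


def guess_format(value, dayfirst=False):
    """Return the first candidate format that parses `value`, or None."""
    candidates = CANDIDATE_FORMATS
    if dayfirst:
        # Stable sort: day-first layouts move to the front.
        candidates = sorted(candidates, key=lambda f: not f.startswith("%d"))
    for fmt in candidates:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_strictly(values, fmt=None, errors="raise", dayfirst=False):
    """Parse every element with one format, guessed from the first non-null one."""
    if fmt is None:
        first = next((v for v in values if v is not None), None)
        if first is not None:
            fmt = guess_format(first, dayfirst)
            if fmt is None:
                warnings.warn("Could not infer format; pass one explicitly.", UserWarning)
    parsed = []
    for value in values:
        if value is None:
            parsed.append(None)
            continue
        # Without a fixed format (the warning case), guess per element instead.
        element_fmt = fmt if fmt is not None else guess_format(value, dayfirst)
        try:
            if element_fmt is None:
                raise ValueError(f"cannot parse {value!r}")
            parsed.append(datetime.strptime(value, element_fmt))
        except ValueError:
            if errors == "raise":
                raise
            parsed.append(None)  # errors="coerce": invalid entries become missing
    return parsed


values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
try:
    parse_strictly(values)
    message = None
except ValueError as exc:
    message = str(exc)  # the second row does not fit the guessed format

coerced = parse_strictly(values, errors="coerce")
day_first = parse_strictly(values, dayfirst=True)
```

In this sketch the default call fails loudly on the second row instead of switching format, `errors="coerce"` turns that row into a missing value, and `dayfirst=True` changes the guess so both rows parse as January dates.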

If a user has dates in a mixed format, they can still use flexible parsing and accept
the risks that poses, e.g.:
```ipython
In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0 2000-12-01
1 2000-01-13
dtype: datetime64[ns]
```

## Usage and Impact

My expectation is that the impact would be a net positive:
- potentially severe bugs in people's code will be caught early;
- users who actually want mixed formats can still parse them, but now they'd be forced to be
very explicit about it;
- the codebase would be noticeably simplified.

As far as I can tell, there is no chance of _introducing_ bugs.

## Implementation

The whatsnew notes read

> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.

Review comment (Member):

Looks like we don't have a style for blockquotes on our website. I created #48758 to fix it.


I'd suggest making this change as part of the above, because:
- it would only help prevent bugs, not introduce any;
- given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0
would potentially cause a lot of damage.

Note that this wouldn't mean getting rid of ``dateutil.parser``, as that would still be used within ``guess_datetime_format``. With this proposal, however, subsequent rows would be parsed with the guessed format rather than by repeatedly calling ``dateutil.parser`` and risking a silent switch of format.

Finally, the function ``guess_datetime_format`` (currently imported via ``from pandas._libs.tslibs.parsing import guess_datetime_format``) would be made public, under ``pandas.tools``.

## Out of scope

We could make ``guess_datetime_format`` smarter by using a random sample of elements to infer the format.
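
As a rough illustration of that idea, and not a proposed implementation, a sample-based guesser might pick the first candidate format that parses every sampled element. The function name `guess_format_from_sample`, its parameters and the candidate list are all hypothetical.

```python
import random
from datetime import datetime

# Hypothetical candidate layouts, tried in order.
CANDIDATE_FORMATS = ["%m-%d-%Y %H:%M:%S", "%d-%m-%Y %H:%M:%S"]


def guess_format_from_sample(values, sample_size=10, seed=None):
    """Return the first candidate that parses every sampled element, or None."""
    non_null = [v for v in values if v is not None]
    rng = random.Random(seed)
    sample = rng.sample(non_null, min(sample_size, len(non_null)))
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in sample:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one sampled element does not fit this layout
        return fmt
    return None


values = ["12-01-2000 00:00:00", "13-01-2000 00:00:00"]
# Both elements are sampled here, so the month-first layout is ruled out.
guessed = guess_format_from_sample(values, sample_size=2, seed=0)
print(guessed)
```

With more than one element in the sample, the ambiguous first row no longer decides the format on its own.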

### PDEP History

- 18 September 2022: Initial draft