diff --git a/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md new file mode 100644 index 0000000000000..10dc4486b90e9 --- /dev/null +++ b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md @@ -0,0 +1,97 @@ +# PDEP-4: Consistent datetime parsing + +- Created: 18 September 2022 +- Status: Accepted +- Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621) +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) +- Revision: 1 + +## Abstract + +The suggestion is that: +- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input. + The format will either be inferred from the first non-NaN element (if `format` is not provided by the user), or from + `format`; +- ``infer_datetime_format`` be deprecated (as a strict version of it will become the default); +- an easy workaround for non-strict parsing be clearly documented. + +## Motivation and Scope + +Pandas date parsing is very flexible, but arguably too much so - see +https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how +much confusion this causes. Pandas can swap format midway, and though this +is documented, it regularly breaks users' expectations. + +Simple example: +```ipython +In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) +Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None) +``` +The user was almost certainly intending the data to be read as "12th of January, 13th of January". +However, it's read as "1st of December, 13th of January". No warning or error is thrown. + +Currently, the only way to ensure consistent parsing is by explicitly passing +``format=``. The argument ``infer_datetime_format`` +isn't strict, can be called together with ``format``, and can still break users' expectations: + +```ipython +In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True) +Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None) +``` + +## Detailed Description + +Concretely, the suggestion is: +- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row + and parse the rest of the input according to that format. Errors will be handled + according to the ``errors`` argument - there will be no silent switching of format; +- ``infer_datetime_format`` will be deprecated; +- ``dayfirst`` and ``yearfirst`` will continue working as they currently do; +- if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be thrown, + encouraging users to explicitly pass in a format. + Note that this should only happen for invalid inputs such as `'a'` + (which would later throw a ``ParserError`` anyway), or inputs such as ``'00:12:13'``, + which would currently get converted to ``''2022-09-18 00:12:13''``. + +If a user has dates in a mixed format, they can still use flexible parsing and accept +the risks that poses, e.g.: +```ipython +In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime) +Out[3]: +0 2000-12-01 +1 2000-01-13 +dtype: datetime64[ns] +``` + +## Usage and Impact + +My expectation is that the impact would be a net-positive: +- potentially severe bugs in people's code will be caught early; +- users who actually want mixed formats can still parse them, but now they'd be forced to be + very explicit about it; +- the codebase would be noticeably simplified. + +As far as I can tell, there is no chance of _introducing_ bugs. + +## Implementation + +The whatsnew notes read + +> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation. + +I'd suggest making this change as part of the above, because: +- it would only help prevent bugs, not introduce any; +- given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0 + would potentially cause a lot of damage. + +Note that this wouldn't mean getting rid of ``dateutil.parser``, as that would still be used within ``guess_datetime_format``. With this proposal, however, subsequent rows would be parsed with the guessed format rather than repeatedly calling ``dateutil.parser`` and risk having it silently switch format + +Finally, the function ``from pandas._libs.tslibs.parsing import guess_datetime_format`` would be made public, under ``pandas.tools``. + +## Out of scope + +We could make ``guess_datetime_format`` smarter by using a random sample of elements to infer the format. + +### PDEP History + +- 18 September 2022: Initial draft