From e59b6ffd4881ecebb1d3739080d16e162660ae83 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Sun, 18 Sep 2022 14:31:45 +0100
Subject: [PATCH 1/2] pdep-4: initial draft

---
 .../0004-consistent-to-datetime-parsing.md    | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

diff --git a/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
new file mode 100644
index 0000000000000..a5e218a209236
--- /dev/null
+++ b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
@@ -0,0 +1,88 @@
+# PDEP-4: Consistent datetime parsing
+
+- Created: 18 September 2022
+- Status: Under discussion
+- Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621)
+- Author: [Marco Gorelli](https://github.com/MarcoGorelli)
+- Revision: 1
+
+## Abstract
+
+The suggestion is that:
+- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input.
+  The format will either be inferred from the first non-NaN element (if `format` is not provided by the user), or from
+  `format`;
+- ``infer_datetime_format`` be deprecated (as a strict version of it will become the default);
+- an easy workaround for non-strict parsing be clearly documented.
+
+## Motivation and Scope
+
+Pandas date parsing is very flexibible, but arguably too much so - see
+https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
+much confusion this causes. Pandas can swap format midway, and though this
+is document, it regularly breaks users' expectations.
+
+Simple example:
+```ipython
+In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
+Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
+```
+The user was almost certainly intending the data to be read as "12th of January, 13th of January".
+However, it's read as "1st of December, 13th of January". No warning or error is thrown.
+
+Currently, the only way to ensure consistent parsing is by explicitly passing
+``format=``. The argument ``infer_datetime_format``
+isn't strict and can still break users' expectations.
+
+```ipython
+In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True)
+Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
+```
+
+## Detailed Description
+
+Concretely, the suggestion is:
+- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row
+  and parse the rest of the input according to that format. Errors will be handled
+  according to the ``errors`` argument - there will be no silent switching of format;
+- ``infer_datetime_format`` will be deprecated;
+- if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be thrown,
+  encouraging users to explicitly pass in a format.
+  Note that this should only happen for invalid inputs such as `'a'`
+  (which would later throw a ``ParserError`` anyway), or inputs such as ``'00:12:13'``,
+  which would currently get converted to ``'2022-09-18 00:12:13'``.
+
+If a user has dates in a mixed format, they can still use flexible parsing and accept
+the risks that poses, e.g.:
+```ipython
+In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
+Out[3]:
+0   2000-12-01
+1   2000-01-13
+dtype: datetime64[ns]
+```
+
+## Usage and Impact
+
+My expectation is that the impact would be a net-positive:
+- potentially severe bugs in people's code will be caught early;
+- users who actually want mixed formats can still parse them, but now they'd be forced to be
+  very explicit about it;
+- the codebase would be noticeably simplified.
+
+As far as I can tell, there is no chance of _introducing_ bugs.
+
+## Implementation
+
+The whatsnew notes read
+
+> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.
+
+I'd suggest making this change as part of the above, because:
+- it would only help prevent bugs, not introduce any;
+- given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0
+  would potentially cause a lot of damage.
+
+### PDEP History
+
+- 18 September 2022: Initial draft

From 851137db80d84a54cebb755e21c041d9e4ea22b7 Mon Sep 17 00:00:00 2001
From: MarcoGorelli <>
Date: Tue, 20 Sep 2022 10:49:03 +0100
Subject: [PATCH 2/2] note about making guess_datetime_format public, out of
 scope work, dayfirst/yearfirst

---
 .../0004-consistent-to-datetime-parsing.md      | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
index a5e218a209236..10dc4486b90e9 100644
--- a/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
+++ b/web/pandas/pdeps/0004-consistent-to-datetime-parsing.md
@@ -1,7 +1,7 @@
 # PDEP-4: Consistent datetime parsing
 
 - Created: 18 September 2022
-- Status: Under discussion
+- Status: Accepted
 - Discussion: [#48621](https://github.com/pandas-dev/pandas/pull/48621)
 - Author: [Marco Gorelli](https://github.com/MarcoGorelli)
 - Revision: 1
@@ -17,10 +17,10 @@ The suggestion is that:
 
 ## Motivation and Scope
 
-Pandas date parsing is very flexibible, but arguably too much so - see
+Pandas date parsing is very flexible, but arguably too much so - see
 https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
 much confusion this causes. Pandas can swap format midway, and though this
-is document, it regularly breaks users' expectations.
+is documented, it regularly breaks users' expectations.
 
 Simple example:
 ```ipython
@@ -32,7 +32,7 @@ However, it's read as "1st of December, 13th of January". No warning or error is
 
 Currently, the only way to ensure consistent parsing is by explicitly passing
 ``format=``. The argument ``infer_datetime_format``
-isn't strict and can still break users' expectations.
+isn't strict, can be passed together with ``format``, and can still break users' expectations:
 
 ```ipython
 In [2]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'], infer_datetime_format=True)
@@ -46,6 +46,7 @@ Concretely, the suggestion is:
   and parse the rest of the input according to that format. Errors will be handled
   according to the ``errors`` argument - there will be no silent switching of format;
 - ``infer_datetime_format`` will be deprecated;
+- ``dayfirst`` and ``yearfirst`` will continue working as they currently do;
 - if the format cannot be guessed from the first non-NaN row, a ``UserWarning`` will be thrown,
   encouraging users to explicitly pass in a format.
   Note that this should only happen for invalid inputs such as `'a'`
@@ -83,6 +84,14 @@ I'd suggest making this change as part of the above, because:
 - given the severity of bugs that can result from the current behaviour, waiting another 2 years until pandas 3.0.0
   would potentially cause a lot of damage.
 
+Note that this wouldn't mean getting rid of ``dateutil.parser``, as that would still be used within ``guess_datetime_format``. With this proposal, however, subsequent rows would be parsed with the guessed format rather than by repeatedly calling ``dateutil.parser`` and risking a silent switch of format.
+
+Finally, the function ``guess_datetime_format`` (currently imported via ``from pandas._libs.tslibs.parsing import guess_datetime_format``) would be made public, under ``pandas.tools``.
+
+## Out of scope
+
+We could make ``guess_datetime_format`` smarter by using a random sample of elements to infer the format.
+
 ### PDEP History
 
 - 18 September 2022: Initial draft