BUG: incorrect assumption of full-name month format when "May" is initial month #58328
Comments
Thanks for the report! This is a duplicate of #57521, which got closed as "not a bug".
Thanks for the ping @Aloqeely. That issue was closed by its author while I sought the opinion of @MarcoGorelli. I'd like to do that here as well!
thanks for the ping - I wouldn't consider this a bug, this looks intended. If this was auto-inferred as %b, then someone else would complain when it would raise on "April".
How? What do you suggest?
I'd agree in general with @rhshadrach's comments in the linked issue (specifically around the edge cases where say most of the entries are May but one is April etc.). In the meantime, I think a doc update which contains language from the release notes (see below) would be helpful. Maybe add this specific caveat around inferring May which could lead to either %b or %B in the notes section.
That language is present in the current docs under, but the fact that this param is deprecated doesn't make it very clear as to what is the current behavior.
Just to say - I've read the previous issue - but I do think this is a bug in the guess of the initial date format, unless there is, for some reason, an absolute requirement that the guess is not revisited, and cannot be improved by later entries. I mean, I think it is reasonable to consider a risky guess as a bug. Here, when the first date has "May" as the month, the chances of full-month-format ( |
I don't know the code, and if that's an issue - I can take a look. I guess the problem is that there isn't currently a way of updating the guess once an error arises? If not - would that be practical to implement in restricted circumstances with high risk guesses and obvious alternatives, as here?
tbh I wouldn't be opposed to a simple solution which considers more than just the first element - it needs to be simple though, this code's already quite complex
A 2-pass approach might work. If the first pass errors out, then re-run with |
@MarcoGorelli - thanks - I do not know the Pandas code-base very well, but if no-one else is interested to look at this, I could give it a go.
Using the first element that doesn't have May as month. Opened #58336
Another requirement I'd add is that it be predictable - not only in behavior but in performance. It should be easy to describe to a user what pandas is doing to infer, and the procedure should be in the docstring.
Just thinking aloud here, but I wonder whether there is some way of maintaining an "alternative" format that would be triggered by an error. As in:
31-May-2023 -> guessed %B, alternative %b
but
31-May-2023 -> guessed %B, alternative %b
This could also mean automatic correct parsing of the old dd-mm-yyyy issue:
12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
Where:
12-10-2023 -> guessed mm-dd-yyyy, alternative dd-mm-yyyy
Of course this would have some parsing-time cost, but this would only be for a few elements, typically, before the alternative can be excluded.
The general "alternative format" idea sounds great, but there will be an endless number of alternative formats, and I'm not sure it counts as a "simple solution". Maybe you can try tinkering with it and see if it's easily doable.
I'll have a look at the code - thanks. Sketching here, I think it need only catch some common cases, and others could argue for further extensions as they run into them. Then the code might look something like:

    ...
    found_fmt = guess_format(first_entry)
    alternatives = _find_alternatives(found_fmt)
    ...
    for current_entry in entries:
        ...
        if alternatives:
            # Drop alternatives that the current entry rules out.
            alternatives = _prune_alternatives(current_entry, alternatives)
        try:
            out = parse_entry(current_entry, found_fmt)
        except ValueError as e:
            if alternatives:
                # Re-parse everything with the surviving alternative format.
                return pd.to_datetime(series, format=alternatives)
            raise e
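A runnable version of the sketch above, using only the standard library's `datetime.strptime` rather than pandas internals. The `ALTERNATIVES` table and the `parse_with_alternatives` name are hypothetical, chosen for illustration; this is not how pandas is implemented.

```python
from datetime import datetime

# Hypothetical table: an ambiguous guessed format and its fallback formats.
ALTERNATIVES = {
    "%d-%B-%Y": ["%d-%b-%Y"],
    "%m-%d-%Y": ["%d-%m-%Y"],
}


def _parses(text, fmt):
    """Return True if text parses under fmt."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


def parse_with_alternatives(entries, guessed_fmt):
    """Parse with guessed_fmt; on failure, retry with a surviving alternative."""
    alternatives = list(ALTERNATIVES.get(guessed_fmt, []))
    parsed = []
    for entry in entries:
        # Keep only the alternatives that can still parse every entry so far.
        alternatives = [fmt for fmt in alternatives if _parses(entry, fmt)]
        try:
            parsed.append(datetime.strptime(entry, guessed_fmt))
        except ValueError:
            if not alternatives:
                raise
            # Second pass over all entries with the first surviving alternative.
            fallback = alternatives[0]
            return [datetime.strptime(e, fallback) for e in entries]
    return parsed


month_dates = parse_with_alternatives(["31-May-2023", "30-Apr-2023"], "%d-%B-%Y")
day_first_dates = parse_with_alternatives(["12-10-2023", "13-10-2023"], "%m-%d-%Y")
```

The cost is one extra `strptime` call per alternative per entry until the alternatives are exhausted, plus a full second pass when the fallback is triggered.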
thanks - tbh I feel uneasy about special-casing 'may' like this. Maybe just take the first 5 non-null elements, guess the format for each, and take a majority vote? I think this'd be pretty simple and easy to explain. In the end, this "guess datetime format" is just provided for convenience; if you want reliable parsing you should pass a format explicitly.
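A minimal sketch of the majority-vote idea, assuming a toy `guess_format` that tries a fixed list of candidate formats in order. The `CANDIDATES` list and both function names are hypothetical stand-ins, not pandas' actual guessing code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical candidates; the full month name is tried first, which mirrors
# the guess of %B for "May" described in this issue.
CANDIDATES = ["%d-%B-%Y", "%d-%b-%Y"]


def guess_format(text):
    """Return the first candidate format that parses text, or None."""
    for fmt in CANDIDATES:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        return fmt
    return None


def guess_by_majority(values, k=5):
    """Guess a format from the first k non-null values by majority vote."""
    sample = [v for v in values if v is not None][:k]
    votes = Counter(fmt for fmt in map(guess_format, sample) if fmt is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


values = ["31-May-2023", None, "30-Apr-2023", "01-Jun-2023", "15-May-2023", "02-Jul-2023"]
majority_fmt = guess_by_majority(values)
only_may_fmt = guess_by_majority(["31-May-2023"])
```

Here the two "May" entries vote for `%B` and the other three vote for `%b`, so `%b` wins. With only "May" entries in the sample the vote still lands on `%B`.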
I agree, Brett also mentioned (in my PR) that it wouldn't work for other locales' month names.
I'm a big fan of your suggestion and Brett's suggestion and think we should implement both, i.e. take the first 5 non-null elements, for each element see all possible guesses, and take the first format that works for all 5. Is this good? Now I don't know if we can accomplish this without complicating the code, @matthew-brett are you able to do it? If not, I can just implement the idea of checking the first 5 non-null elements and call it a day.
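The combined idea (first format that works for all of the first 5 non-null elements) could look roughly like this. The `CANDIDATES` list and `first_common_format` are hypothetical names for illustration, and the parser is the standard library's `strptime`, not pandas'.

```python
from datetime import datetime

# Hypothetical candidate formats, in order of preference.
CANDIDATES = ["%d-%B-%Y", "%d-%b-%Y", "%m-%d-%Y", "%d-%m-%Y"]


def _parses(text, fmt):
    """Return True if text parses under fmt."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


def first_common_format(values, k=5):
    """Return the first candidate that parses all of the first k non-null values."""
    sample = [v for v in values if v is not None][:k]
    if not sample:
        return None
    for fmt in CANDIDATES:
        if all(_parses(v, fmt) for v in sample):
            return fmt
    return None


abbr_fmt = first_common_format(["31-May-2023", None, "15-May-2023", "30-Apr-2023"])
day_first_fmt = first_common_format(["12-10-2023", "13-10-2023"])
# An abbreviated month outside the first k values is still missed.
late_apr_fmt = first_common_format(["31-May-2023"] * 5 + ["30-Apr-2023"])
```

The last call returns the full-name format even though the sixth value is abbreviated, so any fixed sample size leaves this edge case open.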
I'm not necessarily a fan of using first K because there's nothing fundamentally different from K=1 (which is today's behavior) and you would still run into issues if "April" pops up on the 6th value instead of within the first K.
What do you suggest?
After having this conversation, one thing is clear to me, not only do we have to describe the procedure pandas does to infer (as Mr. Shadrach suggested), we should also highly encourage the use of the |
@Aloqeely - thanks for looking at this - yes - I'm happy to take a go at this, and try and come up with something suitably simple, to make it as likely as possible that the guess will be correct. I have about 2 weeks of heavy work just now, but I will get to it soon afterwards.
Not sure if it is right, but my idea for the fix would be something like below:

    if "May" in data[0]:
        expected_format = get_format(data[0])  # "%B-%d-%Y"
        # Try the full-name format first, then the abbreviated one.
        formats_to_try = [expected_format, expected_format.replace("%B", "%b")]
        for fmt in formats_to_try:
            try:
                return pd.to_datetime(*same_args, format=fmt)
            except ValueError:
                continue
        raise ValueError("Mixed date formats")
Haven't looked at the code here, but if special-casing "May" for a second pass is not favoured, which I can understand, would it be easier to detect the cause of the raise and branch to a better, more informed error message? In this case, just describing what the likely problem is around "May" and how a user can quickly fix it? A targeted error message might also allow a user to monkey patch any dynamic code.
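One way such a targeted message might be produced, sketched with the standard library only. `parse_all` and the wording of the hint are hypothetical; pandas' real error path is not shown here.

```python
from datetime import datetime


def parse_all(entries, fmt):
    """Parse entries with fmt, adding a hint when %B looks like a wrong guess."""
    parsed = []
    for position, entry in enumerate(entries):
        try:
            parsed.append(datetime.strptime(entry, fmt))
        except ValueError as err:
            hint = ""
            if "%B" in fmt:
                abbreviated = fmt.replace("%B", "%b")
                try:
                    datetime.strptime(entry, abbreviated)
                except ValueError:
                    pass  # The abbreviated format does not help either.
                else:
                    hint = (
                        f" The entry parses with {abbreviated!r}; if the format "
                        f"was inferred from a 'May' date, pass "
                        f"format={abbreviated!r} explicitly."
                    )
            raise ValueError(
                f"{entry!r} at position {position} does not match {fmt!r}.{hint}"
            ) from err
    return parsed


try:
    parse_all(["31-May-2023", "30-Apr-2023"], "%d-%B-%Y")
except ValueError as err:
    message = str(err)
else:
    message = ""
```

The extra `strptime` call happens only on the failure path, so the normal parsing cost is unchanged.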
…whether to treat May as an abbreviation or a full month name. See: pandas-dev/pandas#58328
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When the first date entries have "May" as the month, the parser assumes the months are written out as "month as locale’s full name" (format code %B), and then, when it hits an abbreviated month (correct format code %b - abbreviated name), it gives an error:
Expected Behavior
It would be very good if the parser remained agnostic as to whether the month specifier was abbreviated until after it had seen a month other than "May".
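The ambiguity itself can be illustrated without pandas. The sketch below uses the standard library's `datetime.strptime` with English month names (the default C locale); the `matches` helper and the sample dates are assumptions for illustration, not the reporter's original example.

```python
from datetime import datetime


def matches(text, fmt):
    """Return True if text parses under fmt with the standard library parser."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


FULL = "%d-%B-%Y"  # month as full name
ABBR = "%d-%b-%Y"  # month as abbreviated name

# "May" is both the full and the abbreviated name, so it fits either format.
may_full = matches("31-May-2023", FULL)
may_abbr = matches("31-May-2023", ABBR)

# "Apr" is only an abbreviation, so a guess of %B fails on it.
apr_full = matches("30-Apr-2023", FULL)
apr_abbr = matches("30-Apr-2023", ABBR)
```

A first entry of "May" therefore cannot distinguish `%B` from `%b`; only a later month can.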
Installed Versions
INSTALLED VERSIONS
commit : 2fbabb1
python : 3.10.12.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:11:05 PDT 2024; root:xnu-10063.101.17~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+736.g2fbabb11db
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.23.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None