
ENH: Allow callable for on_bad_lines in read_csv when engine="python" #45146

Merged Jan 8, 2022 (24 commits)

mroeschke
Member

@mroeschke mroeschke added Enhancement IO CSV read_csv, to_csv labels Dec 31, 2021
@@ -364,6 +365,12 @@

.. versionadded:: 1.3.0

- callable, function with signature ``(bad_line: list[str]) -> list[str]``
that will process a single bad line. ``bad_line`` is a list of strings
Contributor

Am I right in thinking the output list[str] must be a certain length? If the output were the same as the input, for example, what would happen? I checked the tests, but they seem to cover only valid function cases.

Member Author

read_csv has a precedent of emitting a ParserWarning if a row has more elements than expected and then continuing to parse (it seems to drop the extra elements), so I think if the callable returns a row that is too long, read_csv should also emit a ParserWarning.

Added a test to check this behavior.
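
A minimal sketch of the behavior described in this reply, assuming pandas 1.4 or later (the milestone this change was assigned to). The sample data and the helper name `keep_first_two` are illustrative and not taken from the PR:

```python
import io
import warnings

import pandas as pd

# Illustrative data: the second data row has more fields than the header.
data = "a,b\n1,2\n2,3,4,5,6\n3,4\n"


def keep_first_two(bad_line):
    # Hypothetical handler: trim the bad line to the header width.
    return bad_line[:2]


trimmed = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=keep_first_two)

# A handler that returns the line unchanged still yields too many fields,
# so the parser is expected to fall back to a ParserWarning and drop the extras.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    passthrough = pd.read_csv(
        io.StringIO(data), engine="python", on_bad_lines=lambda bad_line: bad_line
    )

got_parser_warning = any(
    issubclass(w.category, pd.errors.ParserWarning) for w in caught
)
print(trimmed)
print(passthrough)
print(got_parser_warning)
```

Both calls should produce the same two-column frame; the difference is that the pass-through handler leaves the truncation to the parser, which signals it with a warning.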

Member

Technically it can return a list of Hashables; this should not be an issue.

We should document that the fallback behavior is a warning.

@jreback jreback added this to the 1.4 milestone Jan 3, 2022
Contributor

jreback commented Jan 3, 2022

@phofl if you can review

Member

@phofl phofl left a comment

Some comments/questions. I like this!

Could we make this work for the header columns? I think the bad_lines arguments are currently ignored there

Member Author

mroeschke commented Jan 3, 2022

Could we make this work for the header columns? I think the bad_lines arguments are currently ignored there

Looks like that would require a refactor, IIUC, because the Python parser first determines what the correct header should be, and on_bad_lines then takes effect only for rows where len(row) > len(predetermined header).
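
The ordering described above can be modeled with the standard library alone. This is a simplified sketch, not the pandas implementation: `parse_rows`, `record_and_trim`, and the warning text are hypothetical, and only the over-long-row case is handled.

```python
import csv
import io
import warnings


def parse_rows(text, on_bad_line):
    """Toy model: fix the header first, then hand over-long rows to the callable."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # settled before any data row is examined
    rows = []
    for row in reader:
        if len(row) > len(header):
            row = on_bad_line(row)
            if row is None:
                continue  # pandas documents that returning None skips the line
            if len(row) > len(header):
                # Assumed fallback for this sketch: warn and truncate.
                warnings.warn("row still too long; extra fields dropped")
                row = row[: len(header)]
        rows.append(row)
    return header, rows


seen = []


def record_and_trim(line):
    # Record every line the callable receives, then trim it to two fields.
    seen.append(line)
    return line[:2]


header, rows = parse_rows("a,b\n1,2\n3,4,5\n6,7\n", record_and_trim)
print(header, rows, seen)
```

Because the header row is consumed before the loop starts, the callable never sees it, which illustrates why applying on_bad_lines to header columns would need a different structure.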

Member

phofl commented Jan 3, 2022

OK, thought so. Then let's ignore this for now and keep it in mind for the future. I think ignoring bad lines is not useful if they are in the header, but modifying them with a function might be worthwhile.

Contributor

@jreback jreback left a comment

looks fine

@jreback jreback merged commit a8f966b into pandas-dev:master Jan 8, 2022
Contributor

jreback commented Jan 8, 2022

thanks @mroeschke very nice (and @phofl for review!)

Contributor

jreback commented Jan 8, 2022

@meeseeksdev backport 1.4.x


lumberbot-app bot commented Jan 8, 2022

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jan 8, 2022
@mroeschke mroeschke deleted the enh/on_bad_lines_callable branch January 8, 2022 20:19
Labels
Enhancement IO CSV read_csv, to_csv
Development

Successfully merging this pull request may close these issues.

Add ability to process bad lines for read_csv
4 participants