ENH: Allow callable for on_bad_lines in read_csv when engine="python" #45146
mroeschke commented on Dec 31, 2021
- closes #5686 (Add ability to process bad lines for read_csv)
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
pandas/io/parsers/readers.py (outdated)
@@ -364,6 +365,12 @@

    .. versionadded:: 1.3.0

    - callable, function with signature ``(bad_line: list[str]) -> list[str]``
      that will process a single bad line. ``bad_line`` is a list of strings
Am I right in thinking the output ``list[str]`` must be a certain length? If the output were the same as the input, for example, what would happen? I checked the tests, but they seem to cover only cases where the function returns a valid row.
``read_csv`` has a precedent of emitting a ``ParserWarning`` if a row has more elements than expected and then continuing to parse (it seems to drop the extra elements), so I think that if the callable returns a row that is too long, it should also emit a ``ParserWarning``.
Added a test to check this behavior.
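For readers following along, here is a minimal sketch of how a callable could be passed as ``on_bad_lines``. It assumes pandas 1.4 or later, where this PR landed. The names ``keep_expected_fields``, ``read_with_handler`` and ``SAMPLE`` are made up for illustration and are not part of pandas.

```python
import io


def keep_expected_fields(bad_line, n_fields=2):
    """Trim a row with too many fields down to its first ``n_fields`` values.

    ``bad_line`` is the list of strings that the parser hands to the callable.
    """
    return bad_line[:n_fields]


def read_with_handler(csv_text, handler):
    """Parse ``csv_text`` with the Python engine, sending bad lines to ``handler``."""
    # Imported lazily so the handler above can be exercised without pandas.
    import pandas as pd

    # A callable for on_bad_lines is only accepted when engine="python".
    return pd.read_csv(io.StringIO(csv_text), engine="python", on_bad_lines=handler)


# Hypothetical sample data: the third line has one field too many.
SAMPLE = "a,b\n1,2\n3,4,5\n6,7\n"
```

Calling ``read_with_handler(SAMPLE, keep_expected_fields)`` should keep the over-long row with its extra field trimmed. A handler that returned the row unchanged would instead hit the ``ParserWarning`` path discussed above.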
Technically it can return a list of Hashables; this should not be an issue.
We should document that the fallback behavior is a warning.
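To make the fallback concrete, the following is a simplified pure-Python model of the behavior discussed in this thread. It is not the pandas implementation: ``apply_bad_line_handler`` is a hypothetical helper, and ``UserWarning`` stands in for ``ParserWarning``. It also assumes, as the pandas documentation states, that a callable returning ``None`` causes the line to be skipped.

```python
import warnings


def apply_bad_line_handler(rows, n_cols, handler):
    """Model how over-long rows are routed through a user-supplied handler."""
    kept = []
    for row in rows:
        if len(row) > n_cols:
            row = handler(row)
            if row is None:
                # The handler chose to drop this line entirely.
                continue
            if len(row) > n_cols:
                # Fallback: warn and drop the extra elements.
                warnings.warn(
                    f"expected {n_cols} fields, got {len(row)}; dropping extras",
                    UserWarning,
                )
                row = row[:n_cols]
        kept.append(row)
    return kept
```

With an identity handler such as ``lambda row: row``, the over-long row is still truncated and a warning is emitted, which is the fallback this comment asks to have documented.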
@phofl if you can review
Some comments/questions. I like this!
Could we make this work for the header columns? I think the bad_lines arguments are currently ignored there.
Looks like that would require a refactor IIUC because the Python parser first "determines" what the correct header should be and then
OK, thought so. Then let's ignore this for now and keep it in mind for the future. I think ignoring bad lines is not useful if they are in the header, but modifying them with a function might be worthwhile.
looks fine
thanks @mroeschke, very nice (and @phofl for the review!)
@meeseeksdev backport 1.4.x
… read_csv when engine="python"
Something went wrong ... Please have a look at my logs.
…when engine="python" (#45264) Co-authored-by: Matthew Roeschke <[email protected]>