BUG, DOC: Allow custom line terminator with delim_whitespace=True #12939
Conversation
The tokenizers in
no, make smaller / simpler PRs that do 1 thing
they can of course be built upon one another
@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
    Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
    Alternative argument name for sep.
delim_whitespace : boolean, default False
say this is equivalent to \s+ regex
Done.
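As a concrete illustration of the equivalence being documented, here is a minimal sketch (the sample data is made up; note that a regex ``sep`` makes pandas fall back to the Python engine, which may emit a ``ParserWarning`` but parses the same):

```python
from io import StringIO

import pandas as pd

# Made-up sample with mixed runs of spaces and tabs as separators.
data = "a  b\t c\n1 2\t3\n4  5 6\n"

# delim_whitespace=True should parse the same as sep=r'\s+':
# any run of whitespace is treated as a single delimiter.
df1 = pd.read_csv(StringIO(data), delim_whitespace=True)
df2 = pd.read_csv(StringIO(data), sep=r'\s+')

assert df1.equals(df2)
```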
why do you think adding all of this untested code is a good idea? you would basically have to duplicate all of the test suite with a custom terminator in order to validate this. you need to use functions that are already tested
What functions are you talking about? If you look at the other tokenizer functions, they all duplicate the same or have very similar case work. The function I added is in fact a near duplicate of
well then refactor first - we can't add more code like this
@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_table`
    Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
    Alternative argument name for sep.
delim_whitespace : boolean, default False
    Specifies whether or not whitespace (e.g. ``' '`` or ```'\t'``)
One backtick too many before \t
Oh, good catch. Will fix.
Force-pushed from 20b27f1 to 21697bc
Managed to refactor the code so that it only splits based on whether or not whitespace is used as a delimiter, so there are now only two tokenizing functions instead of three (or four after my second commit). Will see if it can be squashed into just one, though it seems less straightforward compared to the custom line terminator split. A rough sketch of the idea is below.
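For illustration only, a rough Python sketch of the consolidation idea described above; the actual change lives in pandas' C tokenizer, and all names below are invented:

```python
# Invented names; sketches how one tokenizing loop can serve both
# delimiter styles by parameterizing the "is this a delimiter?" test,
# instead of keeping near-duplicate per-style tokenizing functions.

def make_field_splitter(delim_whitespace, sep=","):
    if delim_whitespace:
        # str.split() with no argument collapses runs of whitespace,
        # mirroring the \s+ behavior of delim_whitespace=True.
        return lambda line: line.split()
    return lambda line: line.split(sep)


def tokenize(lines, delim_whitespace=False, sep=","):
    split = make_field_splitter(delim_whitespace, sep)
    return [split(line) for line in lines]


rows = tokenize(["1  2\t3", "4 5  6"], delim_whitespace=True)
assert rows == [["1", "2", "3"], ["4", "5", "6"]]
```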
@gfyoung ok much better. pls also run ASVs for csv to make sure there's no degradation. Macros are good because they make the code shorter / more understandable while hopefully not sacrificing perf. More combining of code is better (again, could be done later), but since you are already working on it....
@@ -209,6 +209,11 @@
warn_bad_lines : boolean, default True
    If error_bad_lines is False, and warn_bad_lines is True, a warning for each
    "bad line" will be output. (Only valid with C parser).
delim_whitespace : boolean, default False
    Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used
    as the delimiter. Equivalent to ``'\s+'`` in regex. If this option is set
say (both here and in the docs) that delim_whitespace=True is equivalent to sep='\s+' (not just that it's the same regex)
Done.
Force-pushed from 21697bc to c00d0e0
@jreback : there seems to have been a recent update to python-dateutil
@gfyoung ok! my pip install catches things then, great! (the osx build will pip install python-dateutil to catch things just like this :>), while the 3.5 build uses conda (which is with a stable build). IIRC this has some reverses from previous things, so will have to sort them and see. Maybe we should ban certain versions of dateutil. These can cause weird things to happen.
@jreback : is it possible to put in a stopgap measure FTTB (e.g. change the
see #12944 (I skipped the test for now on master), so rebase.
Addresses the DOC issue in pandas-devgh-12912.
Addresses the BUG issue in pandas-devgh-12912. Closes pandas-devgh-12912.
Force-pushed from c00d0e0 to 78cf922
Whoot! Squashed it down into one function! Hopefully Travis will give the green light this time around. I will also do some timing analysis for
| Benchmark | PR | Master |
| --- | --- | --- |
| io_bench.frame_to_csv.time_frame_to_csv | 188.25ms | 195.29ms |
| io_bench.frame_to_csv2.time_frame_to_csv2 | 312.94ms | 303.08ms |
| ...formatting.time_frame_to_csv_date_formatting | 13.36ms | 12.99ms |
| ...h.frame_to_csv_mixed.time_frame_to_csv_mixed | 218.47ms | 224.39ms |
| ...m.time_read_csv_infer_datetime_format_custom | 15.82ms | 15.53ms |
| ....time_read_csv_infer_datetime_format_iso8601 | 2.64ms | 2.64ms |
| ..._ymd.time_read_csv_infer_datetime_format_ymd | 2.92ms | 2.80ms |
| ...nch.read_csv_skiprows.time_read_csv_skiprows | 16.74ms | 16.04ms |
| ...nch.read_csv_standard.time_read_csv_standard | 14.00ms | 13.23ms |
| ..._dates_iso8601.time_read_parse_dates_iso8601 | 1.93ms | 1.78ms |
| ...h.write_csv_standard.time_write_csv_standard | 50.49ms | 51.60ms |
@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else.
thanks @gfyoung. didn't realize how much duplicative code there was already in tokenizer.c, sheesh 👍
as an aside, adding
Title is self-explanatory. Closes #12912.
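As a sketch of what this fix enables (made-up data; assumes the C parser now accepts a single-character ``lineterminator`` together with ``delim_whitespace=True``):

```python
from io import StringIO

import pandas as pd

# '~' stands in for a custom single-character line terminator.
data = "a b c~1 2 3~4 5 6"

# Before this fix, combining delim_whitespace=True with a custom
# lineterminator was not handled; afterwards this parses as two rows.
df = pd.read_csv(StringIO(data), delim_whitespace=True, lineterminator="~")
print(df)
#    a  b  c
# 0  1  2  3
# 1  4  5  6
```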