BUG, DOC: Allow custom line terminator with delim_whitespace=True #12939


Closed
wants to merge 3 commits

Conversation

gfyoung
Member

@gfyoung gfyoung commented Apr 20, 2016

Title is self-explanatory. Closes #12912.
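For context, a minimal sketch of what the fix enables (the data and the `*` terminator are made up; note that `delim_whitespace` was deprecated in later pandas releases in favor of `sep='\s+'`, hence the warning filter):

```python
from io import StringIO
import warnings

import pandas as pd

# A small whitespace-delimited payload using '*' as the line terminator.
data = "a b*1 2*3 4"

# Before this fix, delim_whitespace=True could not be combined with a
# custom lineterminator in the C parser; with it, both work together.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # delim_whitespace deprecated in pandas 2.2+
    df = pd.read_csv(StringIO(data), delim_whitespace=True, lineterminator="*")

print(df)
```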

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

The tokenizers in tokenizer.c, and the CParser tests in test_parser.py seem to have a lot of repetition, although reworking the latter seems easier to do than the former. Can / should this PR also include refactoring changes to those parts of the code?

@jreback
Contributor

jreback commented Apr 20, 2016

no make smaller / simpler PRs that do 1 thing

@jreback
Contributor

jreback commented Apr 20, 2016

they can of course be built upon one another

@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_tabl
Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Contributor

say this is equivalent to \s+ regex

Member Author

Done.

@jreback
Contributor

jreback commented Apr 20, 2016

why do you think adding all of this untested code is a good idea?

you would basically have to duplicate all of the test suite with a custom terminator in order to validate this

you need to use functions that are already tested
iow each of the cases needs to call a function

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

What functions are you talking about? If you look at the other tokenizer functions, they all duplicate the same, or very similar, case work. The function I added is in fact a near duplicate of tokenize_whitespace, save for replacing the checks for \n and \r with self->lineterminator. That's why I asked about the refactoring initially.
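The duplication being described can be illustrated with a Python sketch (hypothetical names; the real code is the C tokenizer in tokenizer.c): rather than near-duplicate functions that hard-code checks for `\n`/`\r` versus a custom terminator, one routine can take the end-of-line test as a parameter:

```python
def make_whitespace_tokenizer(lineterminator=None):
    """Return a whitespace tokenizer whose end-of-line check is
    parameterized, mirroring the refactor idea: one function instead of
    a near-duplicate per terminator style."""
    if lineterminator is None:
        is_term = lambda c: c in "\r\n"          # default terminators
    else:
        is_term = lambda c: c == lineterminator  # custom single-char terminator

    def tokenize(text):
        rows, row, field = [], [], []
        for ch in text:
            if is_term(ch):
                if field:
                    row.append("".join(field))
                    field = []
                if row:
                    rows.append(row)
                    row = []
            elif ch in " \t":                    # any run of whitespace delimits
                if field:
                    row.append("".join(field))
                    field = []
            else:
                field.append(ch)
        if field:
            row.append("".join(field))
        if row:
            rows.append(row)
        return rows

    return tokenize
```

With `lineterminator='*'`, `tokenize('a b*1 2')` yields `[['a', 'b'], ['1', '2']]`, and the default terminators behave the same on `'a b\n1 2'`.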

@jreback
Contributor

jreback commented Apr 20, 2016

well then refactor first - we can't add more code like this

@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_tabl
Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ```'\t'``)
Member

One backtick too much before \t

Member Author

Oh, good catch. Will fix.

@jreback jreback added the IO CSV read_csv, to_csv label Apr 21, 2016
@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from 20b27f1 to 21697bc Compare April 21, 2016 13:54
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

Managed to refactor code so that it only splits based on whether or not whitespace is used as a delimiter, so there are now only two tokenizing functions instead of three (or four after my second commit). Will see if it can be squashed into just one, though it seems less straightforward compared to the custom line terminator split.

@jreback
Contributor

jreback commented Apr 21, 2016

@gfyoung ok much better. pls also run asv's for csv to make sure there's no degradation.

So macros are good because they make the code shorter / more understandable while hopefully not sacrificing perf. more combining code is better (again could be done later), but since you are already working on it....

@@ -209,6 +209,11 @@
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used
as the delimiter. Equivalent to ``'\s+'`` in regex. If this option is set
Contributor

say (both here and in the docs) that ``delim_whitespace=True`` is equivalent to ``sep='\s+'`` (not just that it's the same regex)

Member Author

Done.
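The equivalence being asked for in the docs can be spot-checked directly; a small sketch (assuming a pandas version where both spellings are still accepted; `delim_whitespace` emits a deprecation warning in recent releases):

```python
from io import StringIO
import warnings

import pandas as pd

data = "a  b\tc\n1 2  3\n4\t5 6\n"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    left = pd.read_csv(StringIO(data), delim_whitespace=True)  # C parser
right = pd.read_csv(StringIO(data), sep=r"\s+")                # python engine

# Both spellings tokenize on any run of whitespace and agree.
pd.testing.assert_frame_equal(left, right)
```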

@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from 21697bc to c00d0e0 Compare April 21, 2016 14:48
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : there seems to have been a recent update to python-dateutil (just today), causing a test failure in pandas.tslib on OSX. I can reproduce this failure on Linux FYI.

@jreback
Contributor

jreback commented Apr 21, 2016

@gfyoung ok! my pip install catches things then, great! (the osx build will pip install python-dateutil to catch things just like this :>), while the 3.5 build uses conda (which has a stable build). IIRC this reverses some previous behavior, so we'll have to sort it out and see. Maybe we should ban certain versions of dateutil; these can cause weird things to happen.

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : is it possible to put in a stopgap measure FTTB (e.g. change the compat condition in that test to exactly 2.5.2) so that investigation can proceed in isolation from other PRs?

@jreback
Contributor

jreback commented Apr 21, 2016

see #12944 (I skipped the test for now on master), so rebase.

@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from c00d0e0 to 78cf922 Compare April 21, 2016 16:36
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

Whoot! Squashed it down into one function! Hopefully Travis will give the green light this time around. I will also do some timing analysis for read_csv and post it in the conversation.

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

For PR

io_bench.frame_to_csv.time_frame_to_csv           188.25ms
io_bench.frame_to_csv2.time_frame_to_csv2         312.94ms
...formatting.time_frame_to_csv_date_formatting    13.36ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   218.47ms
...m.time_read_csv_infer_datetime_format_custom    15.82ms
....time_read_csv_infer_datetime_format_iso8601     2.64ms
..._ymd.time_read_csv_infer_datetime_format_ymd     2.92ms
...nch.read_csv_skiprows.time_read_csv_skiprows    16.74ms
...nch.read_csv_standard.time_read_csv_standard    14.00ms
..._dates_iso8601.time_read_parse_dates_iso8601     1.93ms
...h.write_csv_standard.time_write_csv_standard    50.49ms

For Master

io_bench.frame_to_csv.time_frame_to_csv           195.29ms
io_bench.frame_to_csv2.time_frame_to_csv2         303.08ms
...formatting.time_frame_to_csv_date_formatting    12.99ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   224.39ms
...m.time_read_csv_infer_datetime_format_custom    15.53ms
....time_read_csv_infer_datetime_format_iso8601     2.64ms
..._ymd.time_read_csv_infer_datetime_format_ymd     2.80ms
...nch.read_csv_skiprows.time_read_csv_skiprows    16.04ms
...nch.read_csv_standard.time_read_csv_standard    13.23ms
..._dates_iso8601.time_read_parse_dates_iso8601     1.78ms
...h.write_csv_standard.time_write_csv_standard    51.60ms
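The numbers above come from asv's io_bench suite; a rough stand-in for such a spot check, using timeit (payload size and separator are arbitrary choices here, not the io_bench fixtures):

```python
import timeit
from io import StringIO

import pandas as pd

# Build a modest single-space-delimited payload.
payload = "x y z\n" + "\n".join(f"{i} {i * 2} {i * 3}" for i in range(10_000))

def parse():
    # sep=' ' keeps the fast C engine; delim_whitespace would also work here.
    return pd.read_csv(StringIO(payload), sep=" ")

# Best-of-three timing, three parses per sample, as a crude degradation check.
best = min(timeit.repeat(parse, number=3, repeat=3))
print(f"~{best * 1000 / 3:.2f} ms per parse")
```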

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else.

@jreback jreback added this to the 0.18.1 milestone Apr 21, 2016
@jreback jreback closed this in b3b166a Apr 21, 2016
@jreback
Contributor

jreback commented Apr 21, 2016

thanks @gfyoung didn't realize how much duplicative code there was already in tokenizer.c sheesh 👍

@jreback
Contributor

jreback commented Apr 21, 2016

as an aside, adding delim_whitespace option to python parser is trivial.

@gfyoung gfyoung deleted the delim-whitespace-fix branch April 21, 2016 21:15