BUG, DOC: Allow custom line terminator with delim_whitespace=True #12939


Closed
wants to merge 3 commits

Conversation

gfyoung
Member

@gfyoung gfyoung commented Apr 20, 2016

Title is self-explanatory. Closes #12912.
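For context, a minimal sketch of what the fix enables (the data and the `*` terminator are made up; note that `delim_whitespace` was deprecated in later pandas releases in favor of `sep='\s+'`, hence the warning filter):

```python
from io import StringIO
import warnings

import pandas as pd

# A small whitespace-delimited payload using '*' as the line terminator.
data = "a b*1 2*3 4"

# Before this fix, delim_whitespace=True could not be combined with a
# custom lineterminator in the C parser; with it, both work together.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # delim_whitespace deprecated in pandas 2.2+
    df = pd.read_csv(StringIO(data), delim_whitespace=True, lineterminator="*")

print(df)
```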

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

The tokenizers in tokenizer.c, and the CParser tests in test_parser.py seem to have a lot of repetition, although reworking the latter seems easier to do than the former. Can / should this PR also include refactoring changes to those parts of the code?

@jreback
Contributor

jreback commented Apr 20, 2016

no make smaller / simpler PRs that do 1 thing

@jreback
Contributor

jreback commented Apr 20, 2016

they can of course be built upon one another

@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_tabl
Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Contributor

say this is equivalent to \s+ regex

Member Author

Done.

@jreback
Contributor

jreback commented Apr 20, 2016

why do you think adding all of this untested code is a good idea?

you would basically have to duplicate all of the test suite with a custom terminator in order to validate this

you need to use functions that are already tested
iow each of the cases needs to call a function

@gfyoung
Member Author

gfyoung commented Apr 20, 2016

What functions are you talking about? If you look at the other tokenizer functions, they all duplicate the same, or very similar, case work. The function I added is in fact a near duplicate of tokenize_whitespace, save for replacing the checks for \n and \r with self->lineterminator. That's why I asked about the refactoring initially.
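The duplication being described can be illustrated with a Python sketch (hypothetical names; the real code is the C tokenizer in tokenizer.c): rather than near-duplicate functions that hard-code checks for `\n`/`\r` versus a custom terminator, one routine can take the end-of-line test as a parameter:

```python
def make_whitespace_tokenizer(lineterminator=None):
    """Return a whitespace tokenizer whose end-of-line check is
    parameterized, mirroring the refactor idea: one function instead of
    a near-duplicate per terminator style."""
    if lineterminator is None:
        is_term = lambda c: c in "\r\n"          # default terminators
    else:
        is_term = lambda c: c == lineterminator  # custom single-char terminator

    def tokenize(text):
        rows, row, field = [], [], []
        for ch in text:
            if is_term(ch):
                if field:
                    row.append("".join(field))
                    field = []
                if row:
                    rows.append(row)
                    row = []
            elif ch in " \t":                    # any run of whitespace delimits
                if field:
                    row.append("".join(field))
                    field = []
            else:
                field.append(ch)
        if field:
            row.append("".join(field))
        if row:
            rows.append(row)
        return rows

    return tokenize
```

With `lineterminator='*'`, `tokenize('a b*1 2')` yields `[['a', 'b'], ['1', '2']]`, and the default terminators behave the same on `'a b\n1 2'`.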

@jreback
Contributor

jreback commented Apr 20, 2016

well then refactor first - we can't add more code like this

@@ -97,6 +97,11 @@ sep : str, defaults to ``','`` for :func:`read_csv`, ``\t`` for :func:`read_tabl
Regex example: ``'\\r\\t'``.
delimiter : str, default ``None``
Alternative argument name for sep.
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ```'\t'``)
Member

One backtick too much before \t

Member Author

Oh, good catch. Will fix.

@jreback jreback added the IO CSV read_csv, to_csv label Apr 21, 2016
@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from 20b27f1 to 21697bc Compare April 21, 2016 13:54
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

Managed to refactor code so that it only splits based on whether or not whitespace is used as a delimiter, so there are now only two tokenizing functions instead of three (or four after my second commit). Will see if it can be squashed into just one, though it seems less straightforward compared to the custom line terminator split.

@jreback
Contributor

jreback commented Apr 21, 2016

@gfyoung ok much better. pls also run asv's for csv to make sure there's no degradation.

So macros are good because they make the code shorter / more understandable while hopefully not sacrificing perf. more combining code is better (again could be done later), but since you are already working on it....

@@ -209,6 +209,11 @@
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output. (Only valid with C parser).
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be used
as the delimiter. Equivalent to ``'\s+'`` in regex. If this option is set
Contributor

say (both here and in the docs) that ``delim_whitespace=True`` is equivalent to ``sep='\s+'`` (not just that it's the same regex)

Member Author

Done.
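The equivalence being asked for in the docs can be spot-checked directly; a small sketch (assuming a pandas version where both spellings are still accepted; `delim_whitespace` emits a deprecation warning in recent releases):

```python
from io import StringIO
import warnings

import pandas as pd

data = "a  b\tc\n1 2  3\n4\t5 6\n"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    left = pd.read_csv(StringIO(data), delim_whitespace=True)  # C parser
right = pd.read_csv(StringIO(data), sep=r"\s+")                # python engine

# Both spellings tokenize on any run of whitespace and agree.
pd.testing.assert_frame_equal(left, right)
```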

@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from 21697bc to c00d0e0 Compare April 21, 2016 14:48
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : there seems to have been a recent update to python-dateutil (just today), causing a test failure in pandas.tslib on OSX. I can reproduce this failure on Linux FYI.

@jreback
Contributor

jreback commented Apr 21, 2016

@gfyoung ok! my pip install catches things then, great! (the osx build will pip install python-dateutil to catch things just like this :>), while the 3.5 build uses conda (which has a stable build). IIRC this reverses some previous behavior, so we'll have to sort it out and see. Maybe we should ban certain versions of dateutil; these can cause weird things to happen.

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : is it possible to put in a stopgap measure FTTB (e.g. change the compat condition in that test to exactly 2.5.2) so that investigation can proceed in isolation from other PRs?

@jreback
Contributor

jreback commented Apr 21, 2016

see #12944 (I skipped the test for now on master), so rebase.

@gfyoung gfyoung force-pushed the delim-whitespace-fix branch from c00d0e0 to 78cf922 Compare April 21, 2016 16:36
@gfyoung
Member Author

gfyoung commented Apr 21, 2016

Whoot! Squashed it down into one function! Hopefully Travis will give the green light this time around. I will also do some timing analysis for read_csv and post it in the conversation.

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

For PR

io_bench.frame_to_csv.time_frame_to_csv           188.25ms
io_bench.frame_to_csv2.time_frame_to_csv2         312.94ms
...formatting.time_frame_to_csv_date_formatting    13.36ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   218.47ms
...m.time_read_csv_infer_datetime_format_custom    15.82ms
....time_read_csv_infer_datetime_format_iso8601     2.64ms
..._ymd.time_read_csv_infer_datetime_format_ymd     2.92ms
...nch.read_csv_skiprows.time_read_csv_skiprows    16.74ms
...nch.read_csv_standard.time_read_csv_standard    14.00ms
..._dates_iso8601.time_read_parse_dates_iso8601     1.93ms
...h.write_csv_standard.time_write_csv_standard    50.49ms

For Master

io_bench.frame_to_csv.time_frame_to_csv           195.29ms
io_bench.frame_to_csv2.time_frame_to_csv2         303.08ms
...formatting.time_frame_to_csv_date_formatting    12.99ms
...h.frame_to_csv_mixed.time_frame_to_csv_mixed   224.39ms
...m.time_read_csv_infer_datetime_format_custom    15.53ms
....time_read_csv_infer_datetime_format_iso8601     2.64ms
..._ymd.time_read_csv_infer_datetime_format_ymd     2.80ms
...nch.read_csv_skiprows.time_read_csv_skiprows    16.04ms
...nch.read_csv_standard.time_read_csv_standard    13.23ms
..._dates_iso8601.time_read_parse_dates_iso8601     1.78ms
...h.write_csv_standard.time_write_csv_standard    51.60ms
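The numbers above come from asv's io_bench suite; a rough stand-in for such a spot check, using timeit (payload size and separator are arbitrary choices here, not the io_bench fixtures):

```python
import timeit
from io import StringIO

import pandas as pd

# Build a modest single-space-delimited payload.
payload = "x y z\n" + "\n".join(f"{i} {i * 2} {i * 3}" for i in range(10_000))

def parse():
    # sep=' ' keeps the fast C engine; delim_whitespace would also work here.
    return pd.read_csv(StringIO(payload), sep=" ")

# Best-of-three timing, three parses per sample, as a crude degradation check.
best = min(timeit.repeat(parse, number=3, repeat=3))
print(f"~{best * 1000 / 3:.2f} ms per parse")
```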

@gfyoung
Member Author

gfyoung commented Apr 21, 2016

@jreback : Travis is giving the green light, and I don't see any major timing discrepancies. Ready to merge if there is nothing else.

@jreback jreback added this to the 0.18.1 milestone Apr 21, 2016
@jreback jreback closed this in b3b166a Apr 21, 2016
@jreback
Contributor

jreback commented Apr 21, 2016

thanks @gfyoung didn't realize how much duplicative code there was already in tokenizer.c sheesh 👍

@jreback
Contributor

jreback commented Apr 21, 2016

as an aside, adding delim_whitespace option to python parser is trivial.

@gfyoung gfyoung deleted the delim-whitespace-fix branch April 21, 2016 21:15