ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

chrisaycock · 2016-08-04T14:52:49Z

This version passes existing regression tests but is ultimately wrong
because it requires the "by" column to be a single object. A proper version
would handle int (and possily float) columns through type differentiation.

chrisaycock · 2016-08-04T19:32:01Z

lint failed with

'pandas.types.common._ensure_platform_int' imported but unused

I will fix this in a subsequent update.

codecov-io · 2016-08-04T21:04:51Z

Current coverage is 85.29% (diff: 97.72%)

Merging #13903 into master will decrease coverage by 0.01%

@@             master     #13903   diff @@
==========================================
  Files           139        139          
  Lines         50143      50154    +11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42776      42780     +4   
- Misses         7367       7374     +7   
  Partials          0          0

Powered by Codecov. Last update 142c796...f0d0165

chrisaycock · 2016-08-04T21:06:57Z

First attempt at Tempita. More work needed still -- not ready for merge yet.

chrisaycock · 2016-08-05T18:25:15Z

While Travis is running, here are the latest benchmarks:

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash 2a01e11a:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.............................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                       24.79ms
[ 33.33%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                 41.29ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.63ms
[ 50.00%] · For pandas commit hash e5ee5d2e:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.............................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                      447.26ms
[ 83.33%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                533.65ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           78.02ms
   before     after       ratio
  [e5ee5d2e] [2a01e11a]
-   78.02ms    12.63ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  533.65ms    41.29ms      0.08  join_merge.merge_asof_by_object.time_merge_asof_by_object
-  447.26ms    24.79ms      0.06  join_merge.merge_asof_by_int.time_merge_asof_by_int

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

This can handle on as {int64_t, double} and by as {int64_t, object}. But it only allows one column in the by parameter. So either I'll have to change how I'm doing things, or change the spec of the function such that only one column is permitted.

jreback · 2016-08-05T18:32:45Z

it's fine to limit this for now
we can always expand / fix that later

chrisaycock · 2016-08-05T18:55:38Z

@jreback Alright, I've made it explicit now, both in the documentation and in the validation.

jreback · 2016-08-05T19:31:23Z

pandas/tools/merge.py

+            # choose appropriate function by type
+            on_dtype = left_values.dtype
+            by_dtype = left_by_values.dtype
+            if is_integer_dtype(by_dtype):


need a more generic way of getting these

see core/algorithms.py, basically put tuple of the args in a dict then get it

e.g.

_asof_funcs = { ('int64', 'double') : ....., ('int64', 'int64'): ....}

jreback · 2016-08-06T21:45:20Z

just merged #13925

pls rebase and can merge this

…13902) Uses Tempita for different specializations of "by" and "on" types. These parameters can only take a single label now.

chrisaycock · 2016-08-08T03:41:13Z

@jreback Everything has been rebased and pushed. Thanks!

jreback · 2016-08-08T12:47:12Z

thanks @chrisaycock

very nice!

2 things for future:

can you raise an issue about limitations in using on (e.g. need to expand the dtypes accepted). and show an example
releasing GIL on this is very easy (though not sure it would help very much, but prob won't hurt).

jreback added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Aug 4, 2016

jreback reviewed Aug 5, 2016
View reviewed changes

ENH: Faster merge_asof() performs a single pass when joining tables (#…

f0d0165

…13902) Uses Tempita for different specializations of "by" and "on" types. These parameters can only take a single label now.

jreback added this to the 0.19.0 milestone Aug 8, 2016

jreback closed this in cff1f55 Aug 8, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

chrisaycock commented Aug 4, 2016 •

edited

Loading

chrisaycock commented Aug 4, 2016

codecov-io commented Aug 4, 2016 •

edited

Loading

chrisaycock commented Aug 4, 2016 •

edited

Loading

chrisaycock commented Aug 5, 2016

jreback commented Aug 5, 2016

chrisaycock commented Aug 5, 2016

jreback Aug 5, 2016

jreback Aug 5, 2016

jreback commented Aug 6, 2016

chrisaycock commented Aug 8, 2016

jreback commented Aug 8, 2016

ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

Conversation

chrisaycock commented Aug 4, 2016 • edited Loading

chrisaycock commented Aug 4, 2016

codecov-io commented Aug 4, 2016 • edited Loading

Current coverage is 85.29% (diff: 97.72%)

chrisaycock commented Aug 4, 2016 • edited Loading

chrisaycock commented Aug 5, 2016

jreback commented Aug 5, 2016

chrisaycock commented Aug 5, 2016

jreback Aug 5, 2016

Choose a reason for hiding this comment

jreback Aug 5, 2016

Choose a reason for hiding this comment

jreback commented Aug 6, 2016

chrisaycock commented Aug 8, 2016

jreback commented Aug 8, 2016

chrisaycock commented Aug 4, 2016 •

edited

Loading

codecov-io commented Aug 4, 2016 •

edited

Loading

chrisaycock commented Aug 4, 2016 •

edited

Loading