Skip to content

ENH: Faster merge_asof() performs a single pass when joining tables (#13902) #13903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

chrisaycock
Copy link
Contributor

@chrisaycock chrisaycock commented Aug 4, 2016

This version passes existing regression tests but is ultimately wrong
because it requires the "by" column to be a single object. A proper version
would handle int (and possily float) columns through type differentiation.

@chrisaycock
Copy link
Contributor Author

lint failed with

'pandas.types.common._ensure_platform_int' imported but unused

I will fix this in a subsequent update.

@jreback jreback added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Aug 4, 2016
@codecov-io
Copy link

codecov-io commented Aug 4, 2016

Current coverage is 85.29% (diff: 97.72%)

Merging #13903 into master will decrease coverage by 0.01%

@@             master     #13903   diff @@
==========================================
  Files           139        139          
  Lines         50143      50154    +11   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42776      42780     +4   
- Misses         7367       7374     +7   
  Partials          0          0          

Powered by Codecov. Last update 142c796...f0d0165

@chrisaycock
Copy link
Contributor Author

chrisaycock commented Aug 4, 2016

First attempt at Tempita. More work needed still -- not ready for merge yet.

@chrisaycock
Copy link
Contributor Author

While Travis is running, here are the latest benchmarks:

$ asv continuous master -b "join_merge.merge_asof_*"
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash 2a01e11a:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.............................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 16.67%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                       24.79ms
[ 33.33%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                 41.29ms
[ 50.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           12.63ms
[ 50.00%] · For pandas commit hash e5ee5d2e:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.............................
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 66.67%] ··· Running join_merge.merge_asof_by_int.time_merge_asof_by_int                                      447.26ms
[ 83.33%] ··· Running join_merge.merge_asof_by_object.time_merge_asof_by_object                                533.65ms
[100.00%] ··· Running join_merge.merge_asof_noby.time_merge_asof_noby                                           78.02ms
   before     after       ratio
  [e5ee5d2e] [2a01e11a]
-   78.02ms    12.63ms      0.16  join_merge.merge_asof_noby.time_merge_asof_noby
-  533.65ms    41.29ms      0.08  join_merge.merge_asof_by_object.time_merge_asof_by_object
-  447.26ms    24.79ms      0.06  join_merge.merge_asof_by_int.time_merge_asof_by_int

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

This can handle on as {int64_t, double} and by as {int64_t, object}. But it only allows one column in the by parameter. So either I'll have to change how I'm doing things, or change the spec of the function such that only one column is permitted.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2016

it's fine to limit this for now
we can always expand / fix that later

@chrisaycock
Copy link
Contributor Author

@jreback Alright, I've made it explicit now, both in the documentation and in the validation.

# choose appropriate function by type
on_dtype = left_values.dtype
by_dtype = left_by_values.dtype
if is_integer_dtype(by_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a more generic way of getting these

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see core/algorithms.py, basically put tuple of the args in a dict then get it

e.g.

_asof_funcs = { ('int64', 'double') : .....,
                         ('int64', 'int64'): ....}

@jreback
Copy link
Contributor

jreback commented Aug 6, 2016

just merged #13925

pls rebase and can merge this

…13902)

Uses Tempita for different specializations of "by" and "on" types.
These parameters can only take a single label now.
@chrisaycock
Copy link
Contributor Author

@jreback Everything has been rebased and pushed. Thanks!

@jreback jreback added this to the 0.19.0 milestone Aug 8, 2016
@jreback jreback closed this in cff1f55 Aug 8, 2016
@jreback
Copy link
Contributor

jreback commented Aug 8, 2016

thanks @chrisaycock

very nice!

2 things for future:

  1. can you raise an issue about limitations in using on (e.g. need to expand the dtypes accepted). and show an example
  2. releasing GIL on this is very easy (though not sure it would help very much, but prob won't hurt).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants