ENH: add data hashing routines #14729
Conversation
cc @mikegraham added |
With this broader goal in mind, it might be worth considering exactly what the applications are. This current implementation uses the
There was a big discussion for the dask application about whether to handle the case where the dtype was object but the objects weren't strings, and whether to handle strings in a way that was stable between runs. Part of what shaped that discussion was dask's refraining from introducing C extensions. |
@mikegraham yes a stable string hash would be nice. (esp between runs). I don't think performance is an issue ATM (this is pretty fast). But can certainly deal with that if it comes up. Do you have a 'better' way of hashing strings? |
should we be using |
I have seen this done. It's stable across runs & compatible between Python versions. |
we could also always stringify |
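A minimal sketch of what a run-stable string hash could look like, assuming hashlib is acceptable here (the helper name and the choice of md5 are mine, not from this PR):

import hashlib

import numpy as np


def stable_string_hash(values):
    # Hash each value with hashlib.md5 (stable across runs and Python
    # versions, unlike the builtin hash()) and keep the first 8 bytes
    # of the digest as an unsigned 64-bit integer.
    out = np.empty(len(values), dtype='uint64')
    for i, val in enumerate(values):
        digest = hashlib.md5(str(val).encode('utf8')).digest()
        out[i] = np.frombuffer(digest[:8], dtype='uint64')[0]
    return out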
Current coverage is 85.28% (diff: 98.14%)

@@            master    #14729     diff @@
==========================================
  Files          143       144       +1
  Lines        50849     50903      +54
  Methods          0         0
  Messages         0         0
  Branches         0         0
==========================================
+ Hits         43343     43414      +71
+ Misses        7506      7489      -17
  Partials         0         0
|
note: about 1/2 the time is spent in hashlib. but can work on that later. |
It will be slow as heck. I have a Cython siphash implementation I can contribute if you want. This should also be able to be GIL-releasing.
You mean |
@mikegraham ok, using your cython code we now get about a 6x speedup. so 120ms for 500k rows.
|
ok, fixed the consistency issue. key has to be 16 bytes long! haha |
and latest push makes this nogil! |
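For reference, this routine later surfaced publicly as pandas.util.hash_pandas_object; a hedged usage sketch with an explicit 16-character key (the key shown is the library's documented default, not something fixed in this thread):

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})

# hash_key must be a 16-character string; this one is the library default
hashes = hash_pandas_object(df, index=True, hash_key='0123456789123456')
print(hashes.dtype)  # uint64, one hash per row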
Examples
--------
>>> pd.Index([1, 2, 3]).hash()
array([6238072747940578789, 15839785061582574730,
note: have to regen these (from old way of hashing)
Force-pushed from 23e4257 to ca2a329
ok, added tests for mixed dtypes, empty objects. Timing is not bad
We can do even better by factorizing first & then hashing (so we wouldn't even need the object hashing code).
Is this a valid thing to do? Actually it's not very hard to:
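A minimal sketch of the factorize-then-hash idea (the helper names and the sentinel hash for missing values are mine, not from the PR):

import numpy as np
import pandas as pd


def hash_via_factorize(values, hash_uniques):
    # Factorize first so the (slower) object hashing runs only once per
    # distinct value, then broadcast the hashes back through the codes.
    codes, uniques = pd.factorize(values, sort=False)
    result = np.zeros(len(codes), dtype='uint64')   # sentinel hash for missing values
    mask = codes >= 0
    if len(uniques):
        hashed_uniques = hash_uniques(np.asarray(uniques, dtype=object))
        result[mask] = hashed_uniques[codes[mask]]
    return result

# e.g. hash_via_factorize(np.array(['a', 'b', 'a', None], dtype=object), stable_string_hash)
# using the stable_string_hash sketch above as the per-unique hasher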
|
are repeated hashes likely?
hmm, something funny going on........
|
nvm, was incorrectly assigning the pointer.
|
That will make the hash only valid for one given series/df, I think, not between different dfs. I don't know what all the uses are for hashing, but for the dask uses, I think it wouldn't work, since the hash values for similar values wouldn't agree between partitions. |
now for dask I don't actually think this is an issue. as you always know the meta-data anyhow.
|
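For what it's worth, the behaviour that eventually shipped is value-based, so equal rows hash the same across separate objects; a small illustration using the later public pandas.util.hash_pandas_object (not part of this PR's API surface at the time):

import pandas as pd
from pandas.util import hash_pandas_object

s1 = pd.Series(['a', 'b', 'c'])
s2 = pd.Series(['a', 'b', 'c'])   # a distinct object with the same values and index

# hashes depend only on values (and optionally the index), not object identity
assert hash_pandas_object(s1).equals(hash_pandas_object(s2))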
I meant to ask what the specific use cases for a whole-dataframe hash were. |
We use hashing for other things in dask as well. For example we use whole-dataframe hashing to determine keys if they are not given.

In [1]: from dask.base import tokenize

In [2]: import pandas as pd

In [3]: tokenize(pd.DataFrame({'x': [1, 2, 3]}))
Out[3]: '873595b6236c19b206f2e28992546e10'

https://github.com/dask/dask/blob/master/dask/base.py#L292-L315 |
Since this is a row-based hash, it may make sense to wrap the results back in a
I'm not 100% clear on the use-case - I would have guessed the |
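A minimal sketch of that suggestion (purely illustrative; the hash values below are placeholders, not real output):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
raw = np.array([11, 22, 33], dtype='uint64')   # stand-in for the per-row hashes

# wrapping the row hashes in a Series keeps them aligned with the original index
hashes = pd.Series(raw, index=df.index)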
@chris-b1 addressed your comments. |
Force-pushed from 7605b68 to 6154cd5
No comments directly about the implementation, but regarding the API: is it needed to have this as a method on the DataFrame/Series/Index object itself? I think this is not a functionality that will be used a lot directly by users, where interactive tab completion and method chaining is important, but rather used by library developers (eg dask)? Is that correct? In that case, I don't think it is that much of an inconvenience for dask to eg have to use pd.tools.hash(df) instead of df.hash()? The reason I raise this point is because we already have so many methods on dataframe, we should weigh the added value of having it as a method. Also not that fond of having it in 0.19.2, but I understand you want to have this quickly :-) |
I'm quite happy to call an internal method within Dask. It's all the same to me.
|
ok, removed the public API. should be just along for the ride now. We can think about when (or if) to expose this publicly. |
Thought about using this to override |
@MaximilianR I don't think that's appropriate -- the |
@MaximilianR in theory this could provide |
Even if you could detect stale cached values, I'm not sure it would be appropriate to define
|
@mikegraham we r leaving 'hash' out for now. For Python structures this is non-trivial and requires knowing when objects IN a container change. This could be quite non-performant for, say, a python list (where u may have to arbitrarily descend the object tree); a dataframe is more limited in how it can change and generally doesn't hold arbitrary other objects, so this is feasible. Whether it is in practice really useful is another matter. |
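To make the staleness concern concrete, here is a small illustration (using the later public pandas.util.hash_pandas_object; the point, not the exact API, is what matters):

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'x': [1, 2, 3]})
before = hash_pandas_object(df)

df.loc[0, 'x'] = 99              # mutate the frame in place
after = hash_pandas_object(df)

# row 0's hash changed, so any cached __hash__ value would now be stale
assert not before.equals(after)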
any further comments? |
@jreback Thanks for relating, I agree 100%, I was just saying that it seems like |
strings, tuples, nulls xref pandas-dev#14729
strings, nulls xref pandas-dev#14729
xref #14729 Author: Jeff Reback <[email protected]> Closes #14767 from jreback/hashing_object and squashes the following commits: 9a5a5d4 [Jeff Reback] ERR: raise on python in object hashing, only supporting strings, nulls
xref dask/dask#1807 (cherry picked from commit 06f26b5)
…ting strings, nulls xref #14729 Author: Jeff Reback <[email protected]> Closes #14767 from jreback/hashing_object and squashes the following commits: 9a5a5d4 [Jeff Reback] ERR: raise on python in object hashing, only supporting strings, nulls (cherry picked from commit de1132d)
xref dask/dask#1807