ENH: add data hashing routines #14729

Merged
merged 1 commit into from Nov 28, 2016

Conversation

jreback
Contributor

@jreback jreback commented Nov 24, 2016

@jreback jreback added the Enhancement and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Nov 24, 2016
@jreback jreback added this to the 0.19.2 milestone Nov 24, 2016
@jreback
Contributor Author

jreback commented Nov 24, 2016

cc @mikegraham
cc @jcrist
cc @mrocklin

Added an index=True keyword to hash the index by default.

@mikegraham
Contributor

With this broader goal in mind, it might be worth considering exactly what the applications are. This current implementation uses the hash builtin, so it will have hash randomization for strings for some Pythons. I don't know what the right move is there, but it's something to keep in mind.

There was a big discussion for the dask application about whether to handle the case where dtype was object but the objects weren't strings, and whether to handle strings in a way that was stable between runs. Part of what shaped that discussion was dask's refraining from introducing C extensions.
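
For reference, a minimal sketch of the randomization concern (assuming a Python 3.3+ interpreter with PYTHONHASHSEED unset, the default): str hashes are salted per process, so the builtin hash() is not stable across runs.

import subprocess, sys

# run the same one-liner in two fresh interpreters
cmd = [sys.executable, "-c", "print(hash('foo'))"]
run1 = subprocess.check_output(cmd)
run2 = subprocess.check_output(cmd)
print(run1 == run2)  # almost always False unless PYTHONHASHSEED is fixed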

@jreback
Contributor Author

jreback commented Nov 24, 2016

@mikegraham yes, a stable string hash would be nice (especially between runs). I don't think performance is an issue at the moment (this is pretty fast), but we can certainly deal with that if it comes up. Do you have a 'better' way of hashing strings?

@jreback
Contributor Author

jreback commented Nov 24, 2016

should we be using hashlib? https://www.peterbe.com/plog/best-hashing-function-in-python

@jreback
Contributor Author

jreback commented Nov 24, 2016

(pandas) [Thu Nov 24 11:26:03 ~/pandas]$ python -c "import hashlib; print(int(hashlib.md5('foo'.encode('utf8')).hexdigest(), 16) % (10 ** 8))"
985560

I have seen this done. It's stable across runs and compatible between Python versions.

@jreback
Contributor Author

jreback commented Nov 24, 2016

we could also always stringify object dtype data.

@codecov-io

codecov-io commented Nov 24, 2016

Current coverage is 85.28% (diff: 98.14%)

Merging #14729 into master will increase coverage by 0.04%

@@             master     #14729   diff @@
==========================================
  Files           143        144     +1   
  Lines         50849      50903    +54   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43343      43414    +71   
+ Misses         7506       7489    -17   
  Partials          0          0          

Powered by Codecov. Last update 837db72...f5e05a7

@jreback
Contributor Author

jreback commented Nov 24, 2016

In [2]: df = tm.makeMixedDataFrame()

In [3]: df = pd.concat([df]*100000)

In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 4
Data columns (total 4 columns):
A    500000 non-null float64
B    500000 non-null float64
C    500000 non-null object
D    500000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 19.1+ MB

In [5]: df.hash()
Out[5]: 
array([ 7765496462383768478,  9003743364327424712, 10642375279821279843,
       ..., 10642375279821279843, 15490030136262591728,
        1988273298325137642], dtype=uint64)

In [6]: %timeit df.hash()
1 loop, best of 3: 995 ms per loop

Not great: about half the time is spent in hashlib, but we can work on that later.

@mikegraham
Contributor

should we be using hashlib?

It will be slow as heck. I have a Cython siphash implementation I can contribute if you want. This should also be able to release the GIL.

we could also always stringify object dtype data.

You mean str(x)? I don't think that's super valid -- it will often contain unstable data (like the object's id) and will often not carry real information about the object.
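
A quick sketch of that instability, using a throwaway class with the default repr:

class Thing:
    pass

# the default str/repr embeds the object's id (memory address), which
# differs between runs and between otherwise-identical objects
print(str(Thing()))  # e.g. '<__main__.Thing object at 0x7f...>'
print(str(Thing()))  # a different address, so a different "hash input"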

@jreback
Contributor Author

jreback commented Nov 24, 2016

@mikegraham OK, using your Cython code we now get about a 6x speedup, so 120ms for 500k rows.
A couple of issues:

  • have to handle things like tuples (e.g. from a MultiIndex) / non-strings
  • the hash does not appear consistent. Maybe I am doing something wrong; can you have a look?

@jreback
Contributor Author

jreback commented Nov 24, 2016

OK, fixed the consistency issue: the key has to be 16 bytes long! haha

@jreback
Contributor Author

jreback commented Nov 24, 2016

And the latest push makes this nogil!

Examples
--------
>>> pd.Index([1, 2, 3]).hash()
array([6238072747940578789, 15839785061582574730,
@jreback
Contributor Author

Note: have to regenerate these (they're from the old way of hashing).

@jreback jreback force-pushed the hashing branch 2 times, most recently from 23e4257 to ca2a329 on November 25, 2016 01:22
@jreback
Contributor Author

jreback commented Nov 25, 2016

OK, added tests for mixed dtypes and empty objects, plus support for encoding and different hash_keys.
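
A sketch of those knobs as they later surfaced publicly (at the time of this PR the function lived in pandas.tools.hashing; later pandas exposes it as pandas.util.hash_pandas_object, and the key for the siphash step must be exactly 16 characters):

import pandas as pd
from pandas.util import hash_pandas_object

s = pd.Series(['foo', 'bar', 'baz'])
h1 = hash_pandas_object(s, encoding='utf8', hash_key='0123456789123456')
h2 = hash_pandas_object(s, encoding='utf8', hash_key='6543210987654321')
# different 16-byte keys produce different keyed hashes
print((h1 == h2).any())  # almost surely False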

Timing is not bad

In [33]: len(s2)
Out[33]: 500000

In [34]: %timeit s2.hash()
10 loops, best of 3: 126 ms per loop

We can do even better by factorizing first & then hashing (so we wouldn't even need the object hashing code).

In [35]: %timeit Series(s2.factorize()[0]).hash()
10 loops, best of 3: 28 ms per loop

Is this a valid thing to do?

Actually it's not very hard to (see the sketch after this list):

  • categorize
  • hash the categories
  • broadcast back to the original.
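
A minimal sketch of that categorize / hash / broadcast idea; hash_object_uniques is a hypothetical stand-in for the vectorized object hasher from this PR:

import numpy as np
import pandas as pd

def hash_via_factorize(values, hash_object_uniques):
    # map each element to the position of its unique value
    codes, uniques = pd.factorize(values)  # codes are int64; -1 marks NaN
    # hash only the (typically few) unique values...
    hashed = hash_object_uniques(np.asarray(uniques, dtype=object))
    # ...then broadcast the hashes back to the original positions
    # (NaN handling is omitted: codes of -1 would need masking first)
    return hashed.take(codes)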

@jreback
Contributor Author

jreback commented Nov 25, 2016

are repeated hashes likely?

In [21]: N=1000000; s = pd.Series(tm.makeStringIndex(10).take(np.random.randint(0, 10, size=N)))

In [22]: c = s.astype('category')

In [23]: c.cat.categories
Out[23]: Index(['56bjQF4AsK', '8ZAYAT8GPT', 'AR3qPSViwT', 'I7RtoX1MVN', 'IlKrENATh1', 'QWnGtQeaPG', 'YhkZCbpQHJ', 'a7N5Sb78rE', 'fE08yrQbJ7', 'jVm54MmpWD'], dtype='object')

In [24]: c.cat.categories.hash()
Out[24]: 
array([16874492830844911956, 17129770899522711342, 16874492830844911956,
       17129770899522711342, 16874492830844911956, 17129770899522711342,
       16874492830844911956, 17129770899522711342, 16874492830844911956,
       17129770899522711342], dtype=uint64)

Hmm, something funny going on...

In [27]: Index(['foo', 'bar', 'baz']).hash()
Out[27]: array([ 477881037637427054, 1374399572096150070,  477881037637427054], dtype=uint64)

@jreback
Contributor Author

jreback commented Nov 25, 2016

Never mind, I was incorrectly assigning the pointer.

In [1]: Index(['foo', 'bar', 'baz']).hash()
Out[1]: array([3600424527151052760, 1374399572096150070,  477881037637427054], dtype=uint64)

@mikegraham
Contributor

We can do even better by factorizing first & then hashing (so we wouldn't even need the object hashing code).

That will make the hash only valid for one given Series/DataFrame, I think, not between different DataFrames. I don't know what all the uses for hashing are, but for the dask uses I think it wouldn't work, since the hash values for similar values wouldn't agree between partitions.

@jreback
Contributor Author

jreback commented Nov 25, 2016

Now for dask I don't actually think this is an issue, as you always know the metadata anyhow.

In [10]: pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'baz']]).hash()
Out[10]: array([11603696091789712533,  5345384428795788641,    46691607209239364], dtype=uint64)

In [11]: pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'baz']], columns=list('AB')).hash()
Out[11]: array([11603696091789712533,  5345384428795788641,    46691607209239364], dtype=uint64)

@mikegraham
Contributor

I meant to ask what the specific use cases for a whole-dataframe hash were.

@mrocklin
Contributor

We use hashing for other things in dask as well. For example we use whole-dataframe hashing to determine keys if they are not given.

In [1]: from dask.base import tokenize

In [2]: import pandas as pd

In [3]: tokenize(pd.DataFrame({'x': [1, 2, 3]}))
Out[3]: '873595b6236c19b206f2e28992546e10'

https://github.com/dask/dask/blob/master/dask/base.py#L292-L315

@chris-b1
Contributor

Since this is a row-based hash, it may make sense to wrap the results back in a Series with the original row index?

I'm not 100% clear on the use-case - I would have guessed the DataFrame one is column based, so I also think the docstring could use some stronger language to make it clear it is row-based.
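
For what it's worth, this is what the function eventually shipped as (shown here via its later public home, pandas.util.hash_pandas_object): a uint64 Series aligned to the original row index.

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
hashes = hash_pandas_object(df)       # one uint64 per row
print(hashes.index.equals(df.index))  # True: aligned to the row index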

@jreback
Contributor Author

jreback commented Nov 25, 2016

@chris-b1 addressed your comments.

@jreback jreback force-pushed the hashing branch 2 times, most recently from 7605b68 to 6154cd5 on November 25, 2016 19:05
@jorisvandenbossche
Member

No comments directly about the implementation, but regarding the API: is it needed to have this as a method on the DataFrame/Series/Index object itself?
I think this is not functionality that will be used a lot directly by users, for whom interactive tab completion and method chaining are important, but rather by library developers (e.g. dask)? Is that correct? In that case, I don't think it is that much of an inconvenience for dask to e.g. have to use pd.tools.hash(df) instead of df.hash()?

The reason I raise this point is that we already have so many methods on DataFrame; we should weigh the added value of having it as a method.

Also not that fond of having it in 0.19.2, but I understand you want to have this quickly :-)

@mrocklin
Contributor

mrocklin commented Nov 25, 2016 via email

@jreback
Contributor Author

jreback commented Nov 25, 2016

OK, removed the public API; it should be just along for the ride now. We can think about when (or if) to expose this publicly.

@max-sixty
Contributor

Have you thought about using this to override __hash__(), and so making the pandas objects hashable?

@mikegraham
Contributor

@MaximilianR I don't think that's appropriate -- the __eq__ of a pandas object is subject to change.

@jreback
Contributor Author

jreback commented Nov 27, 2016

@MaximilianR in theory this could provide __hash__, but it's not cheap to produce, and it couldn't be cached (unless we then track whether the data has actually changed, in other words, mark it dirty). It's mostly useful when a frame is serialized/deserialized anyhow, which is not the typical scenario for __hash__.

@mikegraham
Contributor

mikegraham commented Nov 28, 2016

Even if you could detect stale cached values, I'm not sure it would be appropriate to define __hash__ for pandas objects. According to the Python Language Reference:

If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__()
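
The hazard the language reference is guarding against, in a small sketch with a hypothetical mutable class:

class Box:
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, Box) and self.x == other.x
    def __hash__(self):  # depends on mutable state: the problem
        return hash(self.x)

b = Box(1)
d = {b: 'value'}
b.x = 2        # mutate after insertion into the dict
print(b in d)  # False: the entry is now unreachable under the new hash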

@jreback
Contributor Author

jreback commented Nov 28, 2016

@mikegraham we are leaving __hash__ out for now, but this IS a valid use case: providing a data hash that does change when the data is changed. Pandas objects are mutable, but occasionally you do want to compare versions of them (there are explicit ways of doing this now, namely comparing the elements).

For Python structures this is non-trivial and requires knowing when objects IN a container change; this could be quite non-performant for, say, a Python list (where you may have to arbitrarily descend the object tree). A DataFrame is more limited in how it can change and generally doesn't hold arbitrary other objects, so this is feasible. Whether it is really useful in practice is another matter.
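
A sketch of that compare-versions use case, using the function this PR introduced (shown via its later public location in pandas.util):

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [1, 2, 3]})
before = hash_pandas_object(df).values

df.iloc[0, 0] = 99  # mutate the underlying data
after = hash_pandas_object(df).values

print((before == after).all())  # False: the data hash tracks the data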

@jreback
Contributor Author

jreback commented Nov 28, 2016

any further comments?

@jreback jreback merged commit 06f26b5 into pandas-dev:master Nov 28, 2016
@mikegraham
Contributor

mikegraham commented Nov 28, 2016

@jreback Thanks for relating, I agree 100%, I was just saying that it seems like __hash__ isn't the right name for the class of use cases you seem to be describing.

jreback added a commit to jreback/pandas that referenced this pull request Nov 29, 2016
jreback added a commit to jreback/pandas that referenced this pull request Nov 29, 2016
jreback added a commit to jreback/pandas that referenced this pull request Nov 30, 2016
jreback added a commit that referenced this pull request Nov 30, 2016
xref #14729

Author: Jeff Reback <[email protected]>

Closes #14767 from jreback/hashing_object and squashes the following commits:

9a5a5d4 [Jeff Reback] ERR: raise on python in object hashing, only supporting strings, nulls
jreback added a commit to jreback/pandas that referenced this pull request Dec 1, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
…ting strings, nulls

(cherry picked from commit de1132d)
Labels
Enhancement, Numeric Operations (Arithmetic, Comparison, and Logical operations)
7 participants