ENH: add data hashing routines #14729

Merged
merged 1 commit into from Nov 28, 2016

Conversation

jreback
Contributor

@jreback jreback commented Nov 24, 2016

@jreback jreback added the Enhancement and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Nov 24, 2016
@jreback jreback added this to the 0.19.2 milestone Nov 24, 2016
@jreback
Contributor Author

jreback commented Nov 24, 2016

cc @mikegraham
cc @jcrist
cc @mrocklin

Added an index=True keyword to hash the index by default.

@mikegraham
Contributor

With this broader goal in mind, it might be worth considering exactly what the applications are. This current implementation uses the hash builtin, so it will have hash randomization for strings for some Pythons. I don't know what the right move is there, but it's something to keep in mind.

There was a big discussion for the dask application about whether to handle the case where dtype was object but the objects weren't strings, and whether to handle strings in a way that was stable between runs. Part of what shaped that discussion was dask's refraining from introducing C extensions.
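
For reference, a minimal sketch of the randomization concern (assuming a Python 3.3+ interpreter with PYTHONHASHSEED unset, the default): str hashes are salted per process, so the builtin hash() is not stable across runs.

import subprocess, sys

# run the same one-liner in two fresh interpreters
cmd = [sys.executable, "-c", "print(hash('foo'))"]
run1 = subprocess.check_output(cmd)
run2 = subprocess.check_output(cmd)
print(run1 == run2)  # almost always False unless PYTHONHASHSEED is fixed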

@jreback
Contributor Author

jreback commented Nov 24, 2016

@mikegraham yes, a stable string hash would be nice (especially between runs). I don't think performance is an issue at the moment (this is pretty fast), but we can certainly deal with that if it comes up. Do you have a 'better' way of hashing strings?

@jreback
Contributor Author

jreback commented Nov 24, 2016

should we be using hashlib? https://www.peterbe.com/plog/best-hashing-function-in-python

@jreback
Contributor Author

jreback commented Nov 24, 2016

(pandas) [Thu Nov 24 11:26:03 ~/pandas]$ python -c "import hashlib; print(int(hashlib.md5('foo'.encode('utf8')).hexdigest(), 16) % (10 ** 8))"
985560

I have seen this done. It's stable across runs and compatible between Python versions.

@jreback
Contributor Author

jreback commented Nov 24, 2016

we could also always stringify object dtype data.

@codecov-io

codecov-io commented Nov 24, 2016

Current coverage is 85.28% (diff: 98.14%)

Merging #14729 into master will increase coverage by 0.04%

@@             master     #14729   diff @@
==========================================
  Files           143        144     +1   
  Lines         50849      50903    +54   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43343      43414    +71   
+ Misses         7506       7489    -17   
  Partials          0          0          

Powered by Codecov. Last update 837db72...f5e05a7

@jreback
Contributor Author

jreback commented Nov 24, 2016

In [2]: df = tm.makeMixedDataFrame()

In [3]: df = pd.concat([df]*100000)

In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500000 entries, 0 to 4
Data columns (total 4 columns):
A    500000 non-null float64
B    500000 non-null float64
C    500000 non-null object
D    500000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 19.1+ MB

In [5]: df.hash()
Out[5]: 
array([ 7765496462383768478,  9003743364327424712, 10642375279821279843,
       ..., 10642375279821279843, 15490030136262591728,
        1988273298325137642], dtype=uint64)

In [6]: %timeit df.hash()
1 loop, best of 3: 995 ms per loop

Not great: about half the time is spent in hashlib, but we can work on that later.

@mikegraham
Contributor

should we be using hashlib?

It will be slow as heck. I have a Cython siphash implementation I can contribute if you want. This should also be able to release the GIL.

we could also always stringify object dtype data.

You mean str(x)? I don't think that's super valid -- it will often contain unstable data (like the object's id) and will often not carry real information about the object.
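
A quick sketch of that instability, using a throwaway class with the default repr:

class Thing:
    pass

# the default str/repr embeds the object's id (memory address), which
# differs between runs and between otherwise-identical objects
print(str(Thing()))  # e.g. '<__main__.Thing object at 0x7f...>'
print(str(Thing()))  # a different address, so a different "hash input"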

@jreback
Contributor Author

jreback commented Nov 24, 2016

@mikegraham OK, using your Cython code we now get about a 6x speedup, so 120ms for 500k rows.
A couple of issues:

  • have to handle things like tuples (e.g. from a MultiIndex) / non-strings
  • the hash does not appear consistent. Maybe I am doing something wrong; can you have a look?

@jreback
Contributor Author

jreback commented Nov 24, 2016

OK, fixed the consistency issue: the key has to be 16 bytes long! haha

@jreback
Contributor Author

jreback commented Nov 24, 2016

And the latest push makes this nogil!

Examples
--------
>>> pd.Index([1, 2, 3]).hash()
array([6238072747940578789, 15839785061582574730,
@jreback
Contributor Author

Note: have to regenerate these (they're from the old way of hashing).

@jreback jreback force-pushed the hashing branch 2 times, most recently from 23e4257 to ca2a329 on November 25, 2016 01:22
@jreback
Contributor Author

jreback commented Nov 25, 2016

OK, added tests for mixed dtypes and empty objects, plus support for encoding and different hash_keys.
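
A sketch of those knobs as they later surfaced publicly (at the time of this PR the function lived in pandas.tools.hashing; later pandas exposes it as pandas.util.hash_pandas_object, and the key for the siphash step must be exactly 16 characters):

import pandas as pd
from pandas.util import hash_pandas_object

s = pd.Series(['foo', 'bar', 'baz'])
h1 = hash_pandas_object(s, encoding='utf8', hash_key='0123456789123456')
h2 = hash_pandas_object(s, encoding='utf8', hash_key='6543210987654321')
# different 16-byte keys produce different keyed hashes
print((h1 == h2).any())  # almost surely False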

Timing is not bad

In [33]: len(s2)
Out[33]: 500000

In [34]: %timeit s2.hash()
10 loops, best of 3: 126 ms per loop

We can do even better by factorizing first & then hashing (so we wouldn't even need the object hashing code).

In [35]: %timeit Series(s2.factorize()[0]).hash()
10 loops, best of 3: 28 ms per loop

Is this a valid thing to do?

Actually it's not very hard to (see the sketch after this list):

  • categorize
  • hash the categories
  • broadcast back to the original.
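
A minimal sketch of that categorize / hash / broadcast idea; hash_object_uniques is a hypothetical stand-in for the vectorized object hasher from this PR:

import numpy as np
import pandas as pd

def hash_via_factorize(values, hash_object_uniques):
    # map each element to the position of its unique value
    codes, uniques = pd.factorize(values)  # codes are int64; -1 marks NaN
    # hash only the (typically few) unique values...
    hashed = hash_object_uniques(np.asarray(uniques, dtype=object))
    # ...then broadcast the hashes back to the original positions
    # (NaN handling is omitted: codes of -1 would need masking first)
    return hashed.take(codes)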

@jreback
Contributor Author

jreback commented Nov 25, 2016

are repeated hashes likely?

In [21]: N=1000000; s = pd.Series(tm.makeStringIndex(10).take(np.random.randint(0, 10, size=N)))

In [22]: c = s.astype('category')

In [23]: c.cat.categories
Out[23]: Index(['56bjQF4AsK', '8ZAYAT8GPT', 'AR3qPSViwT', 'I7RtoX1MVN', 'IlKrENATh1', 'QWnGtQeaPG', 'YhkZCbpQHJ', 'a7N5Sb78rE', 'fE08yrQbJ7', 'jVm54MmpWD'], dtype='object')

In [24]: c.cat.categories.hash()
Out[24]: 
array([16874492830844911956, 17129770899522711342, 16874492830844911956,
       17129770899522711342, 16874492830844911956, 17129770899522711342,
       16874492830844911956, 17129770899522711342, 16874492830844911956,
       17129770899522711342], dtype=uint64)

Hmm, something funny going on...

In [27]: Index(['foo', 'bar', 'baz']).hash()
Out[27]: array([ 477881037637427054, 1374399572096150070,  477881037637427054], dtype=uint64)

@jreback
Contributor Author

jreback commented Nov 25, 2016

Never mind, I was incorrectly assigning the pointer.

In [1]: Index(['foo', 'bar', 'baz']).hash()
Out[1]: array([3600424527151052760, 1374399572096150070,  477881037637427054], dtype=uint64)

@mikegraham
Contributor

We can do even better by factorizing first & then hashing (so we wouldn't even need the object hashing code).

That will make the hash only valid for one given Series/DataFrame, I think, not between different DataFrames. I don't know what all the uses for hashing are, but for the dask uses I think it wouldn't work, since the hash values for similar values wouldn't agree between partitions.

@jreback
Contributor Author

jreback commented Nov 25, 2016

Now for dask I don't actually think this is an issue, as you always know the metadata anyhow.

In [10]: pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'baz']]).hash()
Out[10]: array([11603696091789712533,  5345384428795788641,    46691607209239364], dtype=uint64)

In [11]: pd.DataFrame([[1, 'foo'], [2, 'bar'], [3, 'baz']], columns=list('AB')).hash()
Out[11]: array([11603696091789712533,  5345384428795788641,    46691607209239364], dtype=uint64)

@mikegraham
Contributor

I meant to ask what the specific use cases for a whole-dataframe hash were.

@mrocklin
Contributor

We use hashing for other things in dask as well. For example we use whole-dataframe hashing to determine keys if they are not given.

In [1]: from dask.base import tokenize

In [2]: import pandas as pd

In [3]: tokenize(pd.DataFrame({'x': [1, 2, 3]}))
Out[3]: '873595b6236c19b206f2e28992546e10'

https://github.com/dask/dask/blob/master/dask/base.py#L292-L315

@chris-b1
Contributor

Since this is a row-based hash, it may make sense to wrap the results back in a Series with the original row index?

I'm not 100% clear on the use-case - I would have guessed the DataFrame one is column based, so I also think the docstring could use some stronger language to make it clear it is row-based.
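
For what it's worth, this is what the function eventually shipped as (shown here via its later public home, pandas.util.hash_pandas_object): a uint64 Series aligned to the original row index.

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
hashes = hash_pandas_object(df)       # one uint64 per row
print(hashes.index.equals(df.index))  # True: aligned to the row index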

@jreback
Contributor Author

jreback commented Nov 25, 2016

@chris-b1 addressed your comments.

@jreback jreback force-pushed the hashing branch 2 times, most recently from 7605b68 to 6154cd5 on November 25, 2016 19:05
@jorisvandenbossche
Member

No comments directly about the implementation, but regarding the API: is it needed to have this as a method on the DataFrame/Series/Index object itself?
I think this is not functionality that will be used a lot directly by users, for whom interactive tab completion and method chaining are important, but rather by library developers (e.g. dask)? Is that correct? In that case, I don't think it is that much of an inconvenience for dask to e.g. have to use pd.tools.hash(df) instead of df.hash()?

The reason I raise this point is that we already have so many methods on DataFrame; we should weigh the added value of having it as a method.

Also not that fond of having it in 0.19.2, but I understand you want to have this quickly :-)

@mrocklin
Contributor

mrocklin commented Nov 25, 2016 via email

@jreback
Contributor Author

jreback commented Nov 25, 2016

OK, removed the public API; it should be just along for the ride now. We can think about when (or if) to expose this publicly.

@max-sixty
Contributor

Have you thought about using this to override __hash__(), and so making the pandas objects hashable?

@mikegraham
Contributor

@MaximilianR I don't think that's appropriate -- the __eq__ of a pandas object is subject to change.

@jreback
Contributor Author

jreback commented Nov 27, 2016

@MaximilianR in theory this could provide __hash__, but it's not cheap to produce, and it couldn't be cached (unless we then track whether the data has actually changed, in other words, mark it dirty). It's mostly useful when a frame is serialized/deserialized anyhow, which is not the typical scenario for __hash__.

@mikegraham
Contributor

mikegraham commented Nov 28, 2016

Even if you could detect stale cached values, I'm not sure it would be appropriate to define __hash__ for pandas objects. According to the Python Language Reference:

If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__()
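
The hazard the language reference is guarding against, in a small sketch with a hypothetical mutable class:

class Box:
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, Box) and self.x == other.x
    def __hash__(self):  # depends on mutable state: the problem
        return hash(self.x)

b = Box(1)
d = {b: 'value'}
b.x = 2        # mutate after insertion into the dict
print(b in d)  # False: the entry is now unreachable under the new hash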

@jreback
Contributor Author

jreback commented Nov 28, 2016

@mikegraham we are leaving __hash__ out for now, but this IS a valid use case: providing a data hash that does change when the data is changed. Pandas objects are mutable, but occasionally you do want to compare versions of them (there are explicit ways of doing this now, namely comparing the elements).

For Python structures this is non-trivial and requires knowing when objects IN a container change; this could be quite non-performant for, say, a Python list (where you may have to arbitrarily descend the object tree). A DataFrame is more limited in how it can change and generally doesn't hold arbitrary other objects, so this is feasible. Whether it is really useful in practice is another matter.
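
A sketch of that compare-versions use case, using the function this PR introduced (shown via its later public location in pandas.util):

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.DataFrame({'A': [1, 2, 3]})
before = hash_pandas_object(df).values

df.iloc[0, 0] = 99  # mutate the underlying data
after = hash_pandas_object(df).values

print((before == after).all())  # False: the data hash tracks the data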

@jreback
Contributor Author

jreback commented Nov 28, 2016

any further comments?

@jreback jreback merged commit 06f26b5 into pandas-dev:master Nov 28, 2016
@mikegraham
Contributor

mikegraham commented Nov 28, 2016

@jreback Thanks for relating, I agree 100%, I was just saying that it seems like __hash__ isn't the right name for the class of use cases you seem to be describing.

jreback added a commit to jreback/pandas that referenced this pull request Nov 29, 2016
jreback added a commit to jreback/pandas that referenced this pull request Nov 29, 2016
jreback added a commit to jreback/pandas that referenced this pull request Nov 30, 2016
jreback added a commit that referenced this pull request Nov 30, 2016
xref #14729

Author: Jeff Reback <[email protected]>

Closes #14767 from jreback/hashing_object and squashes the following commits:

9a5a5d4 [Jeff Reback] ERR: raise on python in object hashing, only supporting strings, nulls
jreback added a commit to jreback/pandas that referenced this pull request Dec 1, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
jorisvandenbossche pushed a commit that referenced this pull request Dec 15, 2016
…ting strings, nulls

(cherry picked from commit de1132d)
Labels
Enhancement, Numeric Operations (Arithmetic, Comparison, and Logical operations)
7 participants