
Categoricals hash consistently #15143

Closed
jcrist wants to merge 2 commits from jcrist:categories_hash_consistently

Conversation

@jcrist jcrist commented Jan 16, 2017

Previously categorical values were hashed using just their codes. This
meant that the hash value depended on the ordering of the categories,
rather than on the values the series represented. This caused problems
in dask, where different partitions might have different categorical
mappings.

This PR makes the hashing dependent on the values the categorical
represents, rather than on the codes. The categories are first hashed,
and then the codes are remapped to the hashed values. This is slightly
slower than before (still need to hash the categories, where we didn't
before), but allows for more consistent hashing.

Related to this work in dask: dask/dask#1877.
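The inconsistency is visible directly in the codes. A minimal sketch against the public pandas API (not part of this diff):

```python
import pandas as pd

# Two categoricals holding the same values, but with the categories
# declared in a different order.
c1 = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
c2 = pd.Categorical(['a', 'b', 'a'], categories=['b', 'a'])

print(list(c1.codes))  # [0, 1, 0]
print(list(c2.codes))  # [1, 0, 1]
# Hashing the codes alone therefore yields different hashes for
# identical values, which breaks hash-based partitioning in dask.
```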

    if is_categorical_dtype(vals.dtype):
        vals = vals.codes
        cat_hashed = hash_array(vals.categories.values, encoding, hash_key,
Contributor:

do this like below instead, rather than adding a keyword (where we do basically the same thing)

Contributor Author:

I'm not sure I follow - we need to recurse through in case the values aren't objects. So we need to be able to handle vals.categories.values of any dtype. However, we don't want to categorize the categories again, since we already know they're unique. This seemed to me the simplest and cleanest way of implementing this.

Contributor:

maybe show what the problem is
hash_array can hash values JUST like we do below

this code is too complicated

Contributor Author:

Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.

For categoricals, what we want to do is:

  • Get a hash for the categories. This should match what hash_pandas_object(series, index=False) would return for the un-categorized data. Meaning hash_pandas_object(object_series) == hash_pandas_object(object_series.astype('category')).
  • Remap the category hashes based on the codes.

vals.categories.values can be of any dtype, so we need to recurse through hash_array again to get the hashes of the categories. However, we also know that the categories are already unique, so we don't want to call factorize again. As such, we set categorize=False to skip that. We can't just call hash_object_array, as the categories may not be objects. And we need to do the remapping, so we can't just set vals = something and fall through like we did before.

I don't think adding this extra keyword overly complicates things, and do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here - perhaps if you could explain a bit better I might get it.
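The invariant described above can be checked directly. A sketch using the public hashing entry point (exposed as `pandas.util.hash_pandas_object` in later releases; at the time of this PR it lived in `pandas.tools.hashing`):

```python
import pandas as pd
from pandas.util import hash_pandas_object

s = pd.Series(['a', 'b', 'c', 'd'])

# Hashing is based on the values a categorical represents, so
# converting to category dtype must not change the hashes.
h_obj = hash_pandas_object(s, index=False)
h_cat = hash_pandas_object(s.astype('category'), index=False)
assert (h_obj.values == h_cat.values).all()
```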

Contributor:

I looked again. This is reasonable. Ideally we don't want to repeat the factorize-hash-remap logic, as right now this code looks very similar to what you have. Maybe pull this out to a common helper function?

I get that the problem is now hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

That said, not averse to allowing a caller to do this (as dask may have more information than pandas, e.g. the cardinality of the set, or even the uniques already?)

Contributor Author:

I get that the problem is now hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

Not quite. The categorize parameter only indicates whether to categorize object arrays before hashing. Other arrays are treated the same as before. The reason for this is that when we hash the categories attribute on a categorical (and then map the codes to the hashes of the categories), we already know that it is unique, so we don't need to re-categorize before hashing (which would happen for object categoricals, but not for others). This might be more clear now that the _hash_categorical helper is pulled out.
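Roughly, the helper hashes the unique categories once (with `categorize=False`) and then remaps each code to its category's hash. A sketch using the public `pandas.util.hash_array`; the missing-value sentinel and default hash key here are assumptions, not taken from this diff:

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

def hash_categorical(cat, encoding='utf8', hash_key='0123456789123456'):
    # The categories are already unique, so skip the factorize step
    # that hash_array would otherwise apply to object arrays.
    hashed_cats = hash_array(np.asarray(cat.categories), encoding,
                             hash_key, categorize=False)
    if len(hashed_cats):
        # Remap each code to the hash of its category.  take() tolerates
        # the -1 code for missing values; those slots are fixed up below.
        result = hashed_cats.take(cat.codes)
    else:
        result = np.zeros(len(cat), dtype='uint64')
    # Give missing values (code == -1) a fixed sentinel hash.
    result[cat.codes == -1] = np.uint64(np.iinfo(np.uint64).max)
    return result
```

Because only the category values feed the hash, two categoricals with the same values but differently ordered categories hash identically.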

@@ -80,53 +80,56 @@ def hash_array(vals, encoding='utf8', hash_key=None):
    encoding : string, default 'utf8'
        encoding for data & key when strings
    hash_key : string key to encode, default to _default_hash_key
    categorize : bool, default True
        Whether to first categorize object arrays before hashing. This is more
        efficient when the array contains duplicate values.
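Since `categorize` is only a performance knob, it must not change the result. A small check against the public API (exposed as `pandas.util.hash_array` in later releases):

```python
import numpy as np
from pandas.util import hash_array

vals = np.array(['x', 'y', 'x', 'y', 'x'], dtype=object)

# categorize=True factorizes the duplicated object values and hashes
# only the uniques; the resulting hashes must be identical either way.
h_fast = hash_array(vals, categorize=True)
h_slow = hash_array(vals, categorize=False)
assert (h_fast == h_slow).all()
```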

Contributor:

add a versionadded tag here (0.20.0)

Contributor:

I think also add categorize to hash_pandas_object for consistency as well?

@jreback jreback added Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions labels Jan 17, 2017
@@ -309,6 +309,7 @@ Bug Fixes
- Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
- Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
- Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
- Bug in ``pandas.tools.hashing.hash_pandas_object`` in which hashing of categoricals depended on the ordering of categories, instead of just their values.
Contributor:

put a () as this is a function call, and pandas -> pd

Contributor:

add this PR number as the issue number

    def test_categorical_consistency(self):
        # Check that categoricals hash consistent with their values, not codes
        # This should work for categoricals of any dtype
        for data in [['a', 'b', 'c', 'd'], [1000, 2000, 3000, 4000]]:
Contributor:

can you test with datetimes as well
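A standalone version of such a test, extended with a datetime case as requested (a sketch against the public API; `set_categories` with a permuted order changes the codes but not the values):

```python
import pandas as pd
from pandas.util import hash_pandas_object

for data in [['a', 'b', 'c', 'd'],
             [1000, 2000, 3000, 4000],
             pd.date_range('2017-01-01', periods=4).tolist()]:
    s = pd.Series(data).astype('category')
    # Reverse the category order: codes change, represented values don't.
    s_rev = s.cat.set_categories(s.cat.categories[::-1])
    h1 = hash_pandas_object(s, index=False)
    h2 = hash_pandas_object(s_rev, index=False)
    assert (h1.values == h2.values).all()
```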

- Add `categorize` parameter to `hash_pandas_object`
- Update test
- Update whatsnew
@jcrist (Contributor Author) commented Jan 17, 2017

Thanks for the review. Addressed all comments.

@jreback jreback added this to the 0.20.0 milestone Jan 17, 2017
@jreback (Contributor) commented Jan 17, 2017

lgtm.

ping on green. (note appveyor is currently recovering, so may take a while) :<

@jcrist (Contributor Author) commented Jan 18, 2017

Windows tests passed, travis failures look unrelated?

@jreback jreback closed this in 3772cc7 Jan 18, 2017
@jreback (Contributor) commented Jan 18, 2017

thanks @jcrist

@jcrist jcrist deleted the categories_hash_consistently branch January 18, 2017 16:55
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

Author: Jim Crist <crist042@umn.edu>

Closes pandas-dev#15143 from jcrist/categories_hash_consistently and squashes the following commits:

f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently
Labels
Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions
2 participants