
Categoricals hash consistently #15143

Closed
jcrist wants to merge 2 commits from jcrist:categories_hash_consistently

Conversation

@jcrist jcrist commented Jan 16, 2017

Previously categorical values were hashed using just their codes. This
meant that the hash value depended on the ordering of the categories,
rather than on the values the series represented. This caused problems
in dask, where different partitions might have different categorical
mappings.

This PR makes the hashing dependent on the values the categorical
represents, rather than on the codes. The categories are first hashed,
and then the codes are remapped to the hashed values. This is slightly
slower than before (still need to hash the categories, where we didn't
before), but allows for more consistent hashing.

Related to this work in dask: dask/dask#1877.
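The inconsistency is visible directly in the codes. A minimal sketch against the public pandas API (not part of this diff):

```python
import pandas as pd

# Two categoricals holding the same values, but with the categories
# declared in a different order.
c1 = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b'])
c2 = pd.Categorical(['a', 'b', 'a'], categories=['b', 'a'])

print(list(c1.codes))  # [0, 1, 0]
print(list(c2.codes))  # [1, 0, 1]
# Hashing the codes alone therefore yields different hashes for
# identical values, which breaks hash-based partitioning in dask.
```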

    if is_categorical_dtype(vals.dtype):
        vals = vals.codes
        cat_hashed = hash_array(vals.categories.values, encoding, hash_key,
Contributor:

do this like below instead, rather than adding a keyword (where we do basically the same thing)

Contributor Author:

I'm not sure I follow - we need to recurse through in case the values aren't objects. So we need to be able to handle vals.categories.values of any dtype. However, we don't want to categorize the categories again, since we already know they're unique. This seemed to me the simplest and cleanest way of implementing this.

Contributor:

maybe show what the problem is
hash_array can hash values JUST like we do below

this code is too complicated

Contributor Author:

Can you explain where you mean "like we do below"? I'm assuming you mean lines 125-130, where we handle object dtype.

For categoricals, what we want to do is:

  • Get a hash for the categories. This should match what hash_pandas_object(series, index=False) would return for the un-categorized data. Meaning hash_pandas_object(object_series) == hash_pandas_object(object_series.astype('category')).
  • Remap the category hashes based on the codes.

vals.categories.values can be of any dtype, so we need to recurse through hash_array again to get the hashes of the categories. However, we also know that the categories are already unique, so we don't want to call factorize again. As such, we set categorize=False to skip that. We can't just call hash_object_array, as the categories may not be objects. And we need to do the remapping, so we can't just set vals = something and fall through like we did before.

I don't think adding this extra keyword overly complicates things, and do think this is the simplest way to do this. I may not be understanding what you're trying to suggest here - perhaps if you could explain a bit better I might get it.
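The invariant described above can be checked directly. A sketch using the public hashing entry point (exposed as `pandas.util.hash_pandas_object` in later releases; at the time of this PR it lived in `pandas.tools.hashing`):

```python
import pandas as pd
from pandas.util import hash_pandas_object

s = pd.Series(['a', 'b', 'c', 'd'])

# Hashing is based on the values a categorical represents, so
# converting to category dtype must not change the hashes.
h_obj = hash_pandas_object(s, index=False)
h_cat = hash_pandas_object(s.astype('category'), index=False)
assert (h_obj.values == h_cat.values).all()
```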

Contributor:

I looked again. This is reasonable. Ideally we don't want to repeat the factorize-hash-remap logic, as right now this code looks very similar to what you have. Maybe pull this out to a common helper function?

I get that the problem is now hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

That said, not averse to allowing a caller to do this (as dask may have more information than pandas, e.g. the cardinality of the set, or even the uniques already?)

Contributor Author:

I get that the problem is now hash_object_array will only de-duplicate object types, when of course we could have any dtype. Though in practice it is generally not a big deal to simply hash non-objects even with very many duplicates (as simply finding the duplicates can be somewhat expensive). In my tests, only for object does this make a huge difference.

Not quite. The categorize parameter only indicates whether to categorize object arrays before hashing. Other arrays are treated the same as before. The reason for this is that when we hash the categories attribute on a categorical (and then map the codes to the hashes of the categories), we already know that it is unique, so we don't need to re-categorize before hashing (which would happen for object categoricals, but not for others). This might be more clear now that the _hash_categorical helper is pulled out.
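Roughly, the helper hashes the unique categories once (with `categorize=False`) and then remaps each code to its category's hash. A sketch using the public `pandas.util.hash_array`; the missing-value sentinel and default hash key here are assumptions, not taken from this diff:

```python
import numpy as np
import pandas as pd
from pandas.util import hash_array

def hash_categorical(cat, encoding='utf8', hash_key='0123456789123456'):
    # The categories are already unique, so skip the factorize step
    # that hash_array would otherwise apply to object arrays.
    hashed_cats = hash_array(np.asarray(cat.categories), encoding,
                             hash_key, categorize=False)
    if len(hashed_cats):
        # Remap each code to the hash of its category.  take() tolerates
        # the -1 code for missing values; those slots are fixed up below.
        result = hashed_cats.take(cat.codes)
    else:
        result = np.zeros(len(cat), dtype='uint64')
    # Give missing values (code == -1) a fixed sentinel hash.
    result[cat.codes == -1] = np.uint64(np.iinfo(np.uint64).max)
    return result
```

Because only the category values feed the hash, two categoricals with the same values but differently ordered categories hash identically.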

@@ -80,53 +80,56 @@ def hash_array(vals, encoding='utf8', hash_key=None):
    encoding : string, default 'utf8'
        encoding for data & key when strings
    hash_key : string key to encode, default to _default_hash_key
    categorize : bool, default True
        Whether to first categorize object arrays before hashing. This is more
        efficient when the array contains duplicate values.
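Since `categorize` is only a performance knob, it must not change the result. A small check against the public API (exposed as `pandas.util.hash_array` in later releases):

```python
import numpy as np
from pandas.util import hash_array

vals = np.array(['x', 'y', 'x', 'y', 'x'], dtype=object)

# categorize=True factorizes the duplicated object values and hashes
# only the uniques; the resulting hashes must be identical either way.
h_fast = hash_array(vals, categorize=True)
h_slow = hash_array(vals, categorize=False)
assert (h_fast == h_slow).all()
```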

Contributor:

add a versionadded tag here (0.20.0)

Contributor:

I think also add categorize to hash_pandas_object for consistency as well?

@jreback jreback added Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions labels Jan 17, 2017
@@ -309,6 +309,7 @@ Bug Fixes
- Bug in ``pd.read_csv()`` in which the ``dialect`` parameter was not being verified before processing (:issue:`14898`)
- Bug in ``pd.read_fwf`` where the skiprows parameter was not being respected during column width inference (:issue:`11256`)
- Bug in ``pd.read_csv()`` in which missing data was being improperly handled with ``usecols`` (:issue:`6710`)
- Bug in ``pandas.tools.hashing.hash_pandas_object`` in which hashing of categoricals depended on the ordering of categories, instead of just their values.
Contributor:

put a () as this is a function call, and pandas -> pd

Contributor:

add this PR number as the issue number

    def test_categorical_consistency(self):
        # Check that categoricals hash consistent with their values, not codes
        # This should work for categoricals of any dtype
        for data in [['a', 'b', 'c', 'd'], [1000, 2000, 3000, 4000]]:
Contributor:

can you test with datetimes as well
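A standalone version of such a test, extended with a datetime case as requested (a sketch against the public API; `set_categories` with a permuted order changes the codes but not the values):

```python
import pandas as pd
from pandas.util import hash_pandas_object

for data in [['a', 'b', 'c', 'd'],
             [1000, 2000, 3000, 4000],
             pd.date_range('2017-01-01', periods=4).tolist()]:
    s = pd.Series(data).astype('category')
    # Reverse the category order: codes change, represented values don't.
    s_rev = s.cat.set_categories(s.cat.categories[::-1])
    h1 = hash_pandas_object(s, index=False)
    h2 = hash_pandas_object(s_rev, index=False)
    assert (h1.values == h2.values).all()
```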

- Add `categorize` parameter to `hash_pandas_object`
- Update test
- Update whatsnew
@jcrist (Contributor Author) commented Jan 17, 2017

Thanks for the review. Addressed all comments.

@jreback jreback added this to the 0.20.0 milestone Jan 17, 2017
@jreback (Contributor) commented Jan 17, 2017

lgtm.

ping on green. (note appveyor is currently recovering, so may take a while) :<

@jcrist (Contributor Author) commented Jan 18, 2017

Windows tests passed, travis failures look unrelated?

@jreback jreback closed this in 3772cc7 Jan 18, 2017
@jreback (Contributor) commented Jan 18, 2017

thanks @jcrist

@jcrist jcrist deleted the categories_hash_consistently branch January 18, 2017 16:55
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

Author: Jim Crist <crist042@umn.edu>

Closes pandas-dev#15143 from jcrist/categories_hash_consistently and squashes the following commits:

f1aea13 [Jim Crist] Address comments
7878c55 [Jim Crist] Categoricals hash consistently
Labels
Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions
2 participants