Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of equality comparison between a simple Index and a MultiIndex #29134

Merged
merged 13 commits into from
Dec 5, 2019
Merged

Conversation

tlaytongoogle
Copy link
Contributor

@tlaytongoogle tlaytongoogle commented Oct 21, 2019

Previously, Index.equals(MultiIndex) and MultiIndex.equals(Index) both involved converting the MultiIndex into an ndarray. Since this conversion requires resolving MultiIndex's by-reference structure (i.e. its codes + levels) into ndarray's by-value structure, it can be substantially computationally expensive.

However, it is only possible for a simple Index to equal a MultiIndex in two cases:

  • the MultiIndex has only 1 level
  • the MultiIndex has d levels, and the Index is an object index of size-d sequences (e.g. d-tuples)

Thus, if the Index is not object-typed, and its nlevels differs from that of the MultiIndex, then the two are determined to be unequal without ndarray conversion.

MWE:

import pandas as pd

long_cheap_index = pd.RangeIndex(1000000)
short_expensive_index = pd.IntervalIndex(
    [pd.Interval(pd.Timestamp(2018, 10, 1), pd.Timestamp(2018, 10, 2))])

large_expensive_multiindex = pd.MultiIndex.from_product(
    [long_cheap_index, short_expensive_index])
trivial_simple_index = pd.Int64Index([])

# These operations no longer convert large_expensive_multiindex into an ndarray
# Previously, each took ~10 s; now, ~0.02 ms
large_expensive_multiindex.equals(trivial_simple_index)
trivial_simple_index.equals(large_expensive_multiindex)
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an asv for this and a whatnew note

@gfyoung gfyoung added MultiIndex Performance Memory or execution speed performance labels Oct 22, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a note to the whatsnew section (performance)

@jreback
Copy link
Contributor

jreback commented Dec 1, 2019

@tlaytongoogle can you update, this looks good.

@pep8speaks
Copy link

pep8speaks commented Dec 3, 2019

Hello @tlaytongoogle! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-05 16:56:31 UTC

@jreback jreback added this to the 1.0 milestone Dec 5, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. comment on the benchmark. can you update, ping on green.

asv_bench/benchmarks/index_object.py Outdated Show resolved Hide resolved
asv_bench/benchmarks/multiindex_object.py Outdated Show resolved Hide resolved
@jreback jreback merged commit c0f6428 into pandas-dev:master Dec 5, 2019
@jreback
Copy link
Contributor

jreback commented Dec 5, 2019

thanks @tlaytongoogle very nice!

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants