PERF: MultiIndex.get_indexer #43370

Merged
merged 2 commits into from
Sep 5, 2021

jbrockmendel
Member

Avoids expensive construction of MultiIndex.values, followed by expensive casting of sequence-of-tuples back to tuple-of-Indexes.
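To make the cost concrete, here is a small sketch using only public pandas API (it does not touch the engine internals changed in this PR). It shows that `MultiIndex.values` materializes one Python tuple per row, that going back from tuples rebuilds the per-level arrays, and that the same information is already available as integer codes per level. The index used here is a toy example, not the benchmark case.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([range(3), range(2)])

# .values materializes one Python tuple per row in an object-dtype array.
vals = mi.values
assert vals.dtype == object
assert isinstance(vals[0], tuple)

# Going back from tuples to a MultiIndex has to rebuild every level.
roundtrip = pd.MultiIndex.from_tuples(vals)
assert roundtrip.equals(mi)

# The same information is already stored as integer codes per level.
for level, codes in zip(mi.levels, mi.codes):
    print(level.tolist(), np.asarray(codes).tolist())
```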

import pandas as pd

mi = pd.MultiIndex.from_product([range(100), range(100), range(100)])

%timeit mi.get_indexer(mi[:-1])
698 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- master
157 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- PR
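The `%timeit` lines above are IPython magic. A rough plain-script equivalent, assuming the standard-library `timeit` module and a smaller index so it finishes quickly, could look like this. The absolute numbers will differ from the ones quoted above and depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 100 x 100 x 100 case above so the script runs quickly.
mi = pd.MultiIndex.from_product([range(20), range(20), range(20)])
target = mi[:-1]

# Every entry of the target is present in mi at the same position.
indexer = mi.get_indexer(target)
assert np.array_equal(indexer, np.arange(len(mi) - 1))

# Best of a few repeats, in seconds per call.
best = min(timeit.repeat(lambda: mi.get_indexer(target), number=5, repeat=3)) / 5
print(f"get_indexer: {best * 1e3:.2f} ms per call")
```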

Follow-up will move _extract_level_codes out of cython so that Engine._get_indexer and Engine._get_indexer_non_unique are restored to having consistent signatures.

@jbrockmendel jbrockmendel added the Indexing, MultiIndex, and Performance labels Sep 2, 2021
@jreback jreback added this to the 1.4 milestone Sep 5, 2021
Contributor

@jreback jreback left a comment

lgtm, typing requests are for a follow-on

@@ -603,35 +603,34 @@ cdef class BaseMultiIndexCodesEngine:
     def _codes_to_ints(self, ndarray[uint64_t] codes) -> np.ndarray:
         raise NotImplementedError("Implemented by subclass")

-    def _extract_level_codes(self, ndarray[object] target) -> np.ndarray:
+    def _extract_level_codes(self, target) -> np.ndarray:
Contributor

can you type in followon?

Member Author

yah mentioned in OP plans for follow-on to restore common signature/typing

         return self._codes_to_ints(np.array(level_codes, dtype='uint64').T)

-    def get_indexer(self, ndarray[object] target) -> np.ndarray:
+    def get_indexer(self, target) -> np.ndarray:
Contributor

can you type in followon
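For readers unfamiliar with the `_codes_to_ints(np.array(level_codes, dtype='uint64').T)` line in the hunk above: the idea is to combine one code per level into a single integer per row. The following is a minimal sketch of one way such packing can work, assuming a bit-shift scheme. `level_offsets`, `pack_codes`, and the level sizes are hypothetical names and values for illustration, and this is not the Cython implementation in the PR.

```python
import itertools

import numpy as np


def level_offsets(level_sizes):
    """Bit offset per level so packed codes from different levels never overlap."""
    bits = np.ceil(np.log2(np.asarray(level_sizes, dtype="float64"))).astype("uint64")
    # Level i is shifted left by the total bit width of all levels after it.
    after = np.cumsum(bits[::-1])[::-1][1:]
    return np.concatenate([after, [0]]).astype("uint64")


def pack_codes(level_codes, offsets):
    # Same layout as in the hunk above: one row per entry, one column per level.
    codes = np.array(level_codes, dtype="uint64").T
    return np.bitwise_or.reduce(codes << offsets, axis=1)


# Hypothetical level sizes; enumerate every combination of codes.
sizes = [5, 3, 4]
rows = list(itertools.product(*(range(n) for n in sizes)))
level_codes = [list(col) for col in zip(*rows)]

packed = pack_codes(level_codes, level_offsets(sizes))

# Distinct rows map to distinct integers.
assert len(np.unique(packed)) == len(rows)
```

Once each row is a single `uint64`, a lookup can use an ordinary integer hash table instead of hashing tuples.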

-    def get_indexer_non_unique(self, ndarray[object] target):
+    def get_indexer_non_unique(self, target):
+        # target: MultiIndex
Contributor

type if possible

@@ -3619,7 +3619,12 @@ def _get_indexer(
         elif method == "nearest":
             indexer = self._get_nearest_indexer(target, limit, tolerance)
         else:
-            indexer = self._engine.get_indexer(target._get_engine_target())
+            tgt_values = target._get_engine_target()
+            if target._is_multi and self._is_multi:
Contributor

might be worth a comment here and L5453 that this is for perf as not super obvious
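Since the motivation for the multi/multi branch is not obvious from the diff, here is a rough illustration of the two routes using only public API. It is a sketch under stated assumptions, not the engine code: the tuple route builds one Python tuple per row, while the level-wise route translates each target level into the codes of the corresponding level of the index. The final position formula only holds because this toy index is a full, sorted product.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([range(4), range(3)])
target = pd.MultiIndex.from_tuples([(3, 2), (0, 1), (9, 9)])

# Tuple-based route: one Python tuple per row, looked up in a dict.
positions = {tup: i for i, tup in enumerate(mi.values)}
via_tuples = np.array([positions.get(tup, -1) for tup in target.values])

# Level-wise route: translate each target level into the codes of mi's levels.
level_codes = np.array(
    [
        mi.levels[i].get_indexer(target.get_level_values(i))
        for i in range(mi.nlevels)
    ]
)  # shape (n_levels, n_targets); -1 marks a value missing from that level

missing = (level_codes == -1).any(axis=0)
# mi is a full product in sorted order, so position = code0 * n1 + code1.
via_codes = np.where(
    missing, -1, level_codes[0] * len(mi.levels[1]) + level_codes[1]
)

expected = mi.get_indexer(target)
assert np.array_equal(via_tuples, expected)
assert np.array_equal(via_codes, expected)
```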

@jreback
Contributor

jreback commented Sep 5, 2021

lgtm but if you can add some comments, ping on green

@jreback
Contributor

jreback commented Sep 5, 2021

ok fair enough all of this is fine for a followon.

@jreback jreback merged commit 6a683a2 into pandas-dev:master Sep 5, 2021
@jreback
Contributor

jreback commented Sep 5, 2021

also since this is a huge speedup this might improve some concat / join asvs for multi-multi (if so that is worth a whatsnew note)

@jbrockmendel jbrockmendel deleted the perf-mi-get_indexer branch September 5, 2021 17:59
jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Sep 5, 2021
jreback pushed a commit that referenced this pull request Sep 6, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
2 participants