PERF: groupby.quantile #43469

Merged: 3 commits, Sep 9, 2021

Conversation

jbrockmendel
Member

Nearly 10x faster on this benchmark (771 ms on master vs. 81.2 ms with this PR):

import pandas as pd
import numpy as np

np.random.seed(536445246)
arr = np.random.randn(10**5, 5)
df = pd.DataFrame(arr, columns=["A", "B", "C", "D", "E"])

gb = df.groupby(df.index % 7)

qs = np.arange(0, 1, 0.1)

%timeit res = gb.quantile(qs)
771 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- master
81.2 ms ± 8.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <- PR
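For readers without IPython, the same measurement can be reproduced with the standard `timeit` module. This is only a sketch of the benchmark above: the `timeit.repeat` settings are an assumption, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the benchmark in the PR description.
np.random.seed(536445246)
arr = np.random.randn(10**5, 5)
df = pd.DataFrame(arr, columns=["A", "B", "C", "D", "E"])

gb = df.groupby(df.index % 7)  # 7 groups
qs = np.arange(0, 1, 0.1)      # 10 quantiles: 0.0, 0.1, ..., 0.9

# One row per (group, quantile) pair, one column per original column.
res = gb.quantile(qs)

# Best of three single runs; repeat count is an arbitrary choice for this sketch.
elapsed = min(timeit.repeat(lambda: gb.quantile(qs), number=1, repeat=3))
print(res.shape, f"{elapsed * 1e3:.1f} ms")
```

The result carries a two-level index (group key, quantile), which is why the PR also touches the helper that inserts the quantile level.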

jreback added this to the 1.4 milestone Sep 9, 2021
jreback added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and Performance (memory or execution speed performance) labels Sep 9, 2021
Contributor

@jreback jreback left a comment

looks good!

cc @mzeitlin11

@@ -770,25 +770,25 @@ def group_ohlc(floating[:, ::1] out,

 @cython.boundscheck(False)
 @cython.wraparound(False)
-def group_quantile(ndarray[float64_t] out,
+def group_quantile(ndarray[float64_t, ndim=2] out,
Contributor

i think we name these _2d (or similar), thinking of rank of these are 2d based?

Member Author

not in this file

Contributor

kk, prob want to make a pass over the cython code to unify the names (e.g. append 1d or 2d) where appropriate

res_mgr = mgr.grouped_reduce(blk_func, ignore_failures=True)
if len(res_mgr.items) != len(mgr.items):
    warnings.warn(
        "Dropping invalid columns in "
Contributor

tests hit this?

Member Author

yep
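The check discussed in this thread follows a general pattern: reduce each column, skip the ones that fail, and warn if the result has fewer columns than the input. A minimal plain-Python sketch of that pattern is below. `reduce_columns` and its warning text are hypothetical names chosen for illustration; this is not the pandas implementation and not its actual message.

```python
import warnings


def reduce_columns(columns, func):
    """Apply ``func`` to each column, dropping columns where it raises TypeError."""
    results = {}
    for name, values in columns.items():
        try:
            results[name] = func(values)
        except TypeError:
            # Analogous in spirit to ignore_failures=True: skip this column.
            continue

    # Fewer results than inputs means at least one column was dropped.
    if len(results) != len(columns):
        dropped = [name for name in columns if name not in results]
        warnings.warn(
            f"columns {dropped} could not be reduced and were dropped",
            FutureWarning,
            stacklevel=2,
        )
    return results
```

Calling `reduce_columns({"a": [1, 2], "b": ["x", 1]}, sum)` keeps only `"a"` and emits one warning, since summing a string with an integer raises `TypeError`.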

)

def blk_func(values: ArrayLike) -> ArrayLike:
    mask = isna(values)
Contributor

0-dim don't enter here, right?

Member Author

correct
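To illustrate the block-wise idea behind `blk_func` and the 2D `out` argument, here is a rough NumPy sketch that computes group quantiles for a whole 2D block at once, using a missing-value mask. The function name, the `(ncols, nrows)` layout, and the use of `np.quantile` are assumptions made for this example; the actual kernel is the Cython `group_quantile` and differs in its details.

```python
import numpy as np


def blockwise_group_quantile(values, labels, ngroups, qs):
    """
    Quantiles per group for every row of a 2D block.

    values : float array of shape (ncols, nrows), one row per frame column
    labels : int array of shape (nrows,), group code of each observation
    ngroups : number of distinct groups
    qs : 1D array of quantiles in [0, 1]

    Returns an array of shape (ncols, ngroups, len(qs)); NaN where a group
    has no non-missing values.
    """
    values = np.asarray(values, dtype=np.float64)
    labels = np.asarray(labels)
    qs = np.asarray(qs, dtype=np.float64)

    mask = np.isnan(values)  # missing-value mask for the whole block
    out = np.full((values.shape[0], ngroups, len(qs)), np.nan)

    for g in range(ngroups):
        in_group = labels == g
        for i in range(values.shape[0]):
            vals = values[i, in_group & ~mask[i]]
            if vals.size:
                out[i, g] = np.quantile(vals, qs)
    return out
```

The point of the sketch is that the group membership (`labels == g`) is derived once per group and reused for every column in the block, rather than being recomputed column by column.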

def _insert_quantile_level(idx: Index, qs: npt.NDArray[np.float64]) -> MultiIndex:
    """
    Insert the sequence 'qs' of quantiles as the inner-most level of a MultiIndex.

Contributor

can you add Parameters / Returns

Member Author

will do
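As an illustration of what inserting the quantiles as the inner-most level means, and of the numpydoc-style Parameters / Returns sections requested above, here is a sketch built only on public pandas API. `insert_quantile_level_sketch` is a hypothetical stand-in; it is not the private `_insert_quantile_level` helper from the PR and may differ from it in behavior and performance.

```python
import numpy as np
import pandas as pd


def insert_quantile_level_sketch(idx: pd.Index, qs: np.ndarray) -> pd.MultiIndex:
    """
    Insert the sequence 'qs' of quantiles as the inner-most level of a MultiIndex.

    Parameters
    ----------
    idx : pd.Index
        Group keys, one entry per group.
    qs : np.ndarray
        Quantiles to pair with every group key.

    Returns
    -------
    pd.MultiIndex
        Each key in ``idx`` repeated ``len(qs)`` times, with ``qs`` cycling
        through the last (unnamed) level.
    """
    nqs = len(qs)
    if isinstance(idx, pd.MultiIndex):
        # Repeat every existing level so each key appears once per quantile.
        arrays = [idx.get_level_values(i).repeat(nqs) for i in range(idx.nlevels)]
        names = list(idx.names)
    else:
        arrays = [idx.repeat(nqs)]
        names = [idx.name]

    # Quantiles vary fastest: q0, q1, ..., q0, q1, ...
    arrays.append(np.tile(qs, len(idx)))
    names.append(None)
    return pd.MultiIndex.from_arrays(arrays, names=names)
```

For example, keys `["a", "b"]` with quantiles `[0.25, 0.75]` produce the four entries `("a", 0.25)`, `("a", 0.75)`, `("b", 0.25)`, `("b", 0.75)`.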

@jreback
Contributor

jreback commented Sep 9, 2021

prob worth a whatsnew note as large change in perf

@jbrockmendel
Member Author

whatsnew added + green

Member

@mzeitlin11 mzeitlin11 left a comment

LGTM! Might make a followup with couple things I noticed while reading through (but not expressly related to the pr, so no reason to block merging)

jreback merged commit c03f49c into pandas-dev:master Sep 9, 2021
jbrockmendel deleted the perf-gb-quantile branch September 9, 2021 21:46