PERF: Merge empty frame #45838

Merged
merged 6 commits into from
Feb 9, 2022
lukemanley
Member

@lukemanley lukemanley commented Feb 5, 2022

Faster merge when left or right is empty.

One ASV added:

       before           after         ratio
     [7651c082]       [4f451520]
     <main>           <merge-empty-frame>
-      12.0±0.3ms      1.42±0.09ms     0.12  join_merge.Merge.time_merge_dataframe_empty(False)
-        24.3±2ms      1.29±0.01ms     0.05  join_merge.Merge.time_merge_dataframe_empty(True)
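
The ratio column is the "after" time divided by the "before" time, so values below 1 mean the PR is faster. A quick check of that arithmetic, using only the mean timings from the table above and ignoring the ± spread:

```python
# Mean timings in milliseconds, copied from the ASV table above: (before, after).
timings_ms = {
    "time_merge_dataframe_empty(False)": (12.0, 1.42),
    "time_merge_dataframe_empty(True)": (24.3, 1.29),
}

ratios = {}
for name, (before, after) in timings_ms.items():
    ratio = after / before  # ASV "ratio" column: after / before
    ratios[name] = round(ratio, 2)
    print(f"{name}: ratio={ratio:.2f}, speedup={before / after:.1f}x")
```

The rounded ratios match the 0.12 and 0.05 reported in the table.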

Some additional examples:

import numpy as np
import pandas as pd

N = 10_000_000

df1 = pd.DataFrame(
    np.random.randint(0, 100, (N, 2)),
    columns=['A', 'B'],
)

df2 = df1.set_index('A')

df_empty = pd.DataFrame(columns=['A', 'C'], dtype='int64')

# Run in IPython; timings below compare main against this PR.
%timeit df1.merge(df_empty, how="right")
# main: 242 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# PR:   1.12 ms ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df2.merge(df_empty, how="left", left_index=True, right_on='A')
# main: 1.07 s ± 47.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# PR:   241 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
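
For context on why an empty operand leaves room for a shortcut: the shape of the result can be predicted without matching any keys. The sketch below uses a small hypothetical frame (`left`, `empty`, and the values are made up for illustration) and only shows the expected results of such merges; it is not the PR's implementation.

```python
import pandas as pd

# Small stand-ins for df1 and df_empty from the examples above.
left = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
empty = pd.DataFrame(columns=["A", "C"], dtype="int64")

# how="right" keeps only keys from the (empty) right frame: no rows survive.
right_join = left.merge(empty, how="right")

# how="left" keeps every left row; the right-only column C has no matches,
# so it is filled with missing values.
left_join = left.merge(empty, how="left")

print(right_join.shape)
print(left_join)
```

In both cases the join is on the shared column `A`, and no key from `left` can match anything in `empty`.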

@lukemanley lukemanley added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 5, 2022
@@ -216,6 +216,9 @@ def time_merge_dataframe_integer_2key(self, sort):
    def time_merge_dataframe_integer_key(self, sort):
        merge(self.df, self.df2, on="key1", sort=sort)

    def time_merge_dataframe_empty(self, sort):
Contributor

Can you add the reverse as well (e.g. left empty)?

Member Author

reverse added here
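
A pair of benchmarks covering both directions could look roughly like the sketch below. It follows the ASV convention of a `setup` method plus `time_*` methods, but the class name, method names, frame sizes, and column names are all hypothetical and are not taken from the PR's diff.

```python
import timeit

import numpy as np
import pandas as pd


class MergeEmpty:
    """ASV-style sketch; every name here is hypothetical, not the PR's code."""

    params = [True, False]
    param_names = ["sort"]

    def setup(self, sort):
        n = 10_000
        rng = np.random.default_rng(0)
        self.left = pd.DataFrame(
            {"key": rng.integers(0, 100, n), "value": rng.standard_normal(n)}
        )
        self.right = pd.DataFrame(
            {"key": np.arange(100), "value2": rng.standard_normal(100)}
        )

    def time_merge_empty_right(self, sort):
        # Right operand sliced down to zero rows, columns kept.
        pd.merge(self.left, self.right.iloc[:0], sort=sort)

    def time_merge_empty_left(self, sort):
        # The reverse case: left operand is empty.
        pd.merge(self.left.iloc[:0], self.right, sort=sort)


# Outside of ASV, the methods can be timed by hand with timeit.
bench = MergeEmpty()
bench.setup(sort=False)
t_right = timeit.timeit(lambda: bench.time_merge_empty_right(False), number=10)
t_left = timeit.timeit(lambda: bench.time_merge_empty_left(False), number=10)
print(f"empty right: {t_right:.4f}s, empty left: {t_left:.4f}s for 10 runs")
```

Slicing with `.iloc[:0]` keeps the columns and dtypes of the original frame, so the empty operand still shares the `key` column with the other side.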

@@ -206,6 +206,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.duplicated` when subset consists of only one column (:issue:`45236`)
- Performance improvement in :meth:`.GroupBy.transform` when broadcasting values for user-defined functions (:issue:`45708`)
- Performance improvement in :meth:`.GroupBy.transform` for user-defined functions when only a single group exists (:issue:`44977`)
- Performance improvement in :meth:`DataFrame.merge` when left and/or right are empty (:issue:`45838`)
Member

@phofl phofl Feb 6, 2022

I think we normally use :func:`merge` instead of DataFrame.merge?

Member Author

Updated to :func:`merge`. Thanks.

@jreback jreback added this to the 1.5 milestone Feb 9, 2022
Contributor

jreback commented Feb 9, 2022

lgtm, will merge on green.

Tip to avoid conflicts in the release notes: add your note somewhere in the middle rather than at the bottom.

Member

phofl commented Feb 9, 2022

Greenish

@phofl phofl merged commit c51c2a7 into pandas-dev:main Feb 9, 2022
Member

phofl commented Feb 9, 2022

Thx @lukemanley

phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
* faster merge with empty frame

* whatsnew

* docs, tests, asvs

* fix whatsnew

Co-authored-by: Jeff Reback <jeff@reback.net>
@lukemanley lukemanley mentioned this pull request Feb 16, 2022
3 tasks
@lukemanley lukemanley deleted the merge-empty-frame branch March 2, 2022 01:13
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022