ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

jorisvandenbossche · 2022-11-02T12:30:25Z

With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:

df2 = df.rename(columns=str.lower)

By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but one that points to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated. This works through the CoW mechanism: the data in df2 is only copied when actually needed upon mutation, i.e. a delayed or lazy copy.
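To make the rename() behaviour concrete, here is a minimal sketch. It assumes a pandas version that provides the mode.copy_on_write option and in which rename() already uses the lazy copy; np.shares_memory on the column arrays is just one convenient way to observe whether the two objects still point at the same buffer.

```python
import numpy as np
import pandas as pd

# Enable Copy-on-Write (assumes a pandas version that provides this option).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"A": [1, 2, 3]})
df2 = df.rename(columns=str.lower)  # new object; column "A" becomes "a"

# Before any mutation, both objects are expected to view the same buffer.
shared_before = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

# Writing into df2 triggers the delayed copy; df keeps its original values.
df2.iloc[0, 0] = 100
shared_after = np.shares_memory(df["A"].to_numpy(), df2["a"].to_numpy())

print(shared_before, shared_after)
print(df["A"].tolist(), df2["a"].tolist())
```

The second shares_memory call is what distinguishes a lazy copy from a plain view: after the write, the two objects no longer share the buffer and df is untouched.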

The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:

pandas/pandas/core/frame.py

Lines 6246 to 6249 in 7bf8d6b

    
    if inplace:
        new_obj = self
    else:
        new_obj = self.copy(deep=None)
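The dispatch behind copy(deep=None) can be illustrated with a toy model in plain Python. Everything in this sketch is hypothetical (ToyFrame, COPY_ON_WRITE, relabel, set_value and the shared counter are illustration-only names), and it is not how pandas tracks references internally; it only shows the idea that deep=None resolves to a real copy without CoW and to a shared buffer with copy-before-write when CoW is on.

```python
COPY_ON_WRITE = True  # hypothetical flag standing in for pd.options.mode.copy_on_write


class ToyFrame:
    """Toy container for illustration; not the pandas implementation."""

    def __init__(self, values, sharers=None):
        self.values = values
        # One-element list shared by every object that views the same buffer;
        # it counts how many objects currently reference that buffer.
        self._sharers = sharers if sharers is not None else [0]
        self._sharers[0] += 1

    def copy(self, deep=None):
        if deep is None:
            # deep=None: real copy without CoW, lazy copy with CoW.
            deep = not COPY_ON_WRITE
        if deep:
            return ToyFrame(list(self.values))
        return ToyFrame(self.values, sharers=self._sharers)

    def relabel(self, inplace=False):
        # Same shape as the frame.py snippet above.
        if inplace:
            new_obj = self
        else:
            new_obj = self.copy(deep=None)
        return new_obj

    def set_value(self, position, value):
        if self._sharers[0] > 1:
            # Buffer is shared: detach with a private copy before writing.
            self._sharers[0] -= 1
            self.values = list(self.values)
            self._sharers = [1]
        self.values[position] = value


parent = ToyFrame([1, 2, 3])
child = parent.relabel()
shared_before = child.values is parent.values
child.set_value(0, 99)
print(shared_before, parent.values, child.values)  # True [1, 2, 3] [99, 2, 3]
```

The point of the pattern is that the method body does not need to know whether CoW is enabled; the decision is made once, inside copy().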

The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to serve as the overview issue to summarize and keep track of the progress on this front.

There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.

Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about, have to look into what those exactly do to be sure, but left them here to keep track of those, can remove from the list once we know more):

Top-level functions:

pd.concat -> ENH: Add lazy copy to concat and round #50501
pd.merge et al? -> ENH: enable lazy copy in merge() for CoW #51297, ENH: Avoid copy when possible in merge #51327
- add tests for join

Want to contribute to this issue?

Pull requests tackling one of the bullet points above are certainly welcome!

Pick one of the methods above (best to stick to one method per PR)
Update the method to make use of a lazy copy (in many cases this might mean using copy(deep=None) somewhere, but for some methods it will be more involved)
Add a test for it in /pandas/tests/copy_view/test_methods.py (you can mimic one of the existing ones, e.g. test_select_dtypes)
- You can run the test with PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
- The tests make use of a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
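As a rough idea of what such a test can look like, here is a self-contained sketch. It is not copied from test_methods.py: column_array is a hypothetical local helper, and the function takes using_copy_on_write as a plain argument, whereas in the real test suite that value would come from the fixture and the function would be named test_*.

```python
import numpy as np
import pandas as pd


def column_array(df, col):
    # Hypothetical helper: the array backing one column, used for memory checks.
    return df[col].to_numpy()


def check_rename_lazy_copy(using_copy_on_write):
    df = pd.DataFrame({"A": [1, 2, 3], "B": [0.1, 0.2, 0.3]})
    df_orig = df.copy()
    df2 = df.rename(columns=str.lower)

    if using_copy_on_write:
        # Lazy copy: the result still shares memory with the parent.
        assert np.shares_memory(column_array(df2, "a"), column_array(df, "A"))
    else:
        # Without CoW the method is expected to have made a real copy.
        assert not np.shares_memory(column_array(df2, "a"), column_array(df, "A"))

    # Mutating the result must never propagate to the parent.
    df2.iloc[0, 0] = 0
    assert not np.shares_memory(column_array(df2, "a"), column_array(df, "A"))
    pd.testing.assert_frame_equal(df, df_orig)


# Stand-in for the fixture: enable CoW and run the check once.
pd.options.mode.copy_on_write = True
check_rename_lazy_copy(using_copy_on_write=True)
```

The final assert_frame_equal against df_orig is the part that encodes the guaranteed behaviour; the shares_memory checks only verify that the optimization is actually used.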


ntachukwu · 2022-11-06T05:19:43Z

...

ntachukwu · 2022-11-06T15:34:04Z

[] set_index

This PR 49557 is an attempt to handle the set_index method.

seljaks · 2022-11-09T20:11:17Z

Hi, I'll take a look at drop and see how it goes.

seljaks · 2022-11-20T16:21:38Z

Took a look at head and tail. CoW is already implemented for these because they're just .iloc under the hood. I think they can be ticked off as is, but I can submit a PR with explicit tests if needed.

seljaks · 2022-11-20T20:57:19Z

Found an inconsistency while looking at squeeze. When a dataframe has two or more columns, df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc. Example:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [0.4, 0.5]})
df2 = df.iloc[slice(None), slice(None)]  # or [:, :]

print("before:", df, df2, sep="\n")
df2.iloc[0, 0] = 0  # does not trigger copy_on_write
print("after:", df, df2, sep="\n")  # both df and df2 are 0 at [0, 0]

@jorisvandenbossche This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series). Link to squeeze docs.

jorisvandenbossche · 2022-11-28T13:42:51Z

> Took a look at head and tail. CoW is already implemented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.

Specifically for head and tail, I think we could consider changing those to actually return a hard (non-lazy) copy by default. Because those methods are typically used for interactive inspection of your data (seeing the repr of df.head()), it would be nice to avoid each of those to trigger CoW from just looking at the data.
But that for sure needs a separate issue to discuss first.

A PR to just add tests to confirm that right now they use CoW is certainly welcome.
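A test along these lines could look like the sketch below. It only asserts the part that holds whether head() and tail() return a lazy or a hard copy: mutating the result does not change the parent. It assumes a pandas version that provides the mode.copy_on_write option.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [0.1, 0.2, 0.3, 0.4]})
df_orig = df.copy()

top = df.head(2)
bottom = df.tail(2)

# Mutate both subsets; with CoW neither write may reach the parent.
top.iloc[0, 0] = 100
bottom.iloc[0, 0] = 200

pd.testing.assert_frame_equal(df, df_orig)
```

Whether the subsets share memory with df before the write is deliberately left unchecked here, since that is exactly the detail under discussion.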

jorisvandenbossche · 2022-11-28T13:47:01Z

> When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc. ... This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sorry for the slow reply here. That's indeed an inconsistency in iloc with full slice. I also noticed that a while ago and had already opened a PR to fix this: #49469

So when that is merged, the squeeze behaviour should also be fixed.
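The behaviour expected once the full-slice fix is in place can be checked with a sketch like this one. It assumes a pandas version that already includes that fix and that provides the mode.copy_on_write option; on a version without the fix, the final assertion would be expected to fail.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [0.4, 0.5]})
df_orig = df.copy()

# Full slice through iloc: a new object that must not write through to df.
df2 = df.iloc[:, :]
df2.iloc[0, 0] = 0

# squeeze() on a two-column frame has nothing to squeeze and returns the frame as is.
df3 = df.squeeze()
df3.iloc[0, 0] = 0

pd.testing.assert_frame_equal(df, df_orig)
```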

> Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series)

I think for the df -> series case you can test this with the common pattern of "mutate subset (series in this case) -> ensure parent is not mutated"? (it's OK to leave out the shares_memory checks if those are difficult to express for a certain case. In the end it is the "mutations-don't-propagate" behaviour that is the actual documented / guaranteed behaviour we certainly want to have tested)
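For the df -> series case, the "mutate subset, ensure parent is not mutated" pattern could be sketched as follows, without any shares_memory check. It assumes CoW is available and enabled through the mode.copy_on_write option.

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3]})
df_orig = df.copy()

# A single-column frame squeezes down to a Series.
ser = df.squeeze()
assert isinstance(ser, pd.Series)

# Mutating the Series must leave the parent DataFrame untouched.
ser.iloc[0] = 0
pd.testing.assert_frame_equal(df, df_orig)
```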

For df -> scalar, I think that scalars are in general not mutable, so this might not need to be covered?

In numpy, squeeze returns a 0-dim array, and those are mutable (at least those are views, so if you mutate the parent, the 0-dim "scalar" also gets mutated):

>>> arr = np.array([1])
>>> scalar = arr.squeeze()
>>> scalar
array(1)
>>> arr[0] = 10
>>> scalar
array(10)

but an actual scalar is not:

>>> scalar = arr[0]
>>> scalar
10
>>> type(scalar)
numpy.int64
>>> arr[0] = 100
>>> scalar
10

In the case of pandas, iloc used under the hood by squeeze will return a numpy scalar (and not 0-dim array):

>>> type(pd.Series(arr).iloc[0])
numpy.int64

which I think means that the df / series -> scalar case is fine (we can still test it though, by mutating the parent and ensuring the scalar is still identical).
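Such a scalar test could be as small as the following sketch, which mutates the parent after squeezing and checks that the previously obtained scalar keeps its value. It does not depend on CoW being enabled.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.array([1]))
scalar_from_series = ser.squeeze()  # a NumPy scalar, not a 0-dim array

df = pd.DataFrame({"a": [1]})
scalar_from_frame = df.squeeze()

# Mutate the parents afterwards.
ser.iloc[0] = 10
df.iloc[0, 0] = 10

# The scalars were detached values, so they are unaffected.
print(scalar_from_series, scalar_from_frame)  # 1 1
print(ser.iloc[0], df.iloc[0, 0])  # 10 10
```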

andrewchen1216 · 2022-11-28T22:26:26Z

I will take a look at add_prefix and add_suffix.

phofl · 2022-12-24T10:52:19Z

It looks like reindex is already handled for the index as well (where possible). It would be good to double check that I did not miss anything.

ntachukwu · 2022-12-29T20:25:53Z

I will take a look at truncate

Edit: @phofl has already covered truncate as mentioned below. I will take a look at squeeze instead.

phofl · 2022-12-29T20:26:38Z

Hm forgot to link, already opened a PR, will go ahead and link all of them now

Edit: Done

jorisvandenbossche · 2023-02-10T22:03:48Z

I removed query from the list above. In theory it takes any kind of expression (and so you could do df.query("1") to select the first row), but any documented and tested use case is for a boolean expression, which is then passed to self.loc[mask] or self[mask], and boolean indexing is never a view.
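The claim that boolean indexing never returns a view can be observed with a small sketch like this one. It uses np.shares_memory on the column arrays as an illustrative check, and enables CoW only so that the later write is handled the same way as in the rest of this issue.

```python
import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})
df_orig = df.copy()

by_query = df.query("a > 1")  # boolean expression, as in the documented use cases
by_mask = df[df["a"] > 1]     # the equivalent explicit boolean indexing

# Neither result shares its buffer with the parent, so there is nothing to make lazy.
shares = [
    np.shares_memory(df["a"].to_numpy(), result["a"].to_numpy())
    for result in (by_query, by_mask)
]
print(shares)  # [False, False]

by_query.iloc[0, 0] = 100
pd.testing.assert_frame_equal(df, df_orig)
```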

jorisvandenbossche added the Master Tracker (High level tracker for similar issues) and Copy / view semantics labels Nov 2, 2022

jorisvandenbossche mentioned this issue Nov 2, 2022

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

This was referenced Nov 6, 2022

ENH/CoW: use lazy copy on replace method #49555

Closed

ENH/CoW: use lazy copy in set_index method #49557

Merged

ntachukwu mentioned this issue Nov 9, 2022

ENH/CoW: use lazy copy in set_axis method #49600

Merged

2 tasks

seljaks mentioned this issue Nov 14, 2022

ENH: Add copy-on-write to DataFrame.drop #49689

Merged

2 tasks

seljaks mentioned this issue Nov 29, 2022

TST/CoW: copy-on-write tests for df.head and df.tail #49963

Merged

2 tasks

jorisvandenbossche mentioned this issue Nov 30, 2022

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate noatamir/pyladies-berlin-sprints#11

Closed

6 tasks

andrewchen1216 mentioned this issue Dec 1, 2022

TST/CoW: copy-on-write tests for add_prefix and add_suffix #49991

Merged

This was referenced Dec 2, 2022

ENH/TST: expand copy-on-write to assign() method #50010

Merged

ENH: add copy on write for df reorder_levels GH49473 #50016

Merged

davidleon123 mentioned this issue Dec 20, 2022

Cow drop level #50365

Closed

2 tasks

This was referenced Dec 24, 2022

ENH: Use cow for reindex_like #50426

Merged

ENH: Use lazy copy in infer objects #50428

Merged

ENH: Use lazy copy for dropna #50429

Merged

ENH: Add lazy copy for drop duplicates #50431

Merged

ENH: Add lazy copy to align #50432

Merged

phofl mentioned this issue Dec 29, 2022

ENH: Add lazy copy to swaplevel #50478

Merged

5 tasks

This was referenced Jan 23, 2023

ENH: Add lazy copy to astype #50802

Merged

ENH: Make shallow copy for align nocopy with CoW #50917

Merged

lithomas1 mentioned this issue Feb 3, 2023

BUG: interpolate not respecting copy_on_write #51126

Closed

3 tasks

This was referenced Feb 8, 2023

ENH: Add CoW optimization to interpolate #51249

Merged

ENH: Implement CoW for convert_dtypes #51265

Merged

ENH: Add CoW optimization for fillna #51279

Merged

This was referenced Feb 10, 2023

TST: add CoW tests for xs() and get() #51292

Merged

TST: CoW with df.isetitem() #50692

Merged

TST: add CoW test for setitem with Series being set #51296

Merged

ENH: enable lazy copy in merge() for CoW #51297

Merged

jorisvandenbossche mentioned this issue Feb 16, 2023

TST: add CoW test for update() #51426

Merged

This was referenced Feb 16, 2023

BUG: transpose not respecting CoW #51430

Merged

TST: Add tests for clip with CoW #51492

Merged

ENH: Series.fillna with empty argument making deep copy #51568

Merged

ENH: DataFrame.fillna making deep copy for dict arg #51571

Merged

This was referenced Feb 26, 2023

ENH: Improve replace lazy copy handling #51658

Merged

ENH: Improve replace lazy copy for categoricals #51659

Merged

ENH: Add CoW mechanism to replace_regex #51669

Merged

jorisvandenbossche mentioned this issue Jun 19, 2023

TST: add test for reindexing rows with matching index uses shallow copy with CoW #53723

Merged

This was referenced Jun 20, 2023

TST / CoW: Add test for mask #53745

Merged

ENH / CoW: Add lazy copy to eval #53746

Merged

BUG / CoW: Series.transform not respecting CoW #53747

Merged

hkad98 mentioned this issue Aug 2, 2023

PERF: high memory consumption for unstack #54373

Closed

3 tasks

phofl closed this as completed Nov 15, 2023
