Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH / CoW: Use the "lazy copy" (with Copy-on-Write) optimization in more methods where appropriate #49473

Closed
72 of 73 tasks
jorisvandenbossche opened this issue Nov 2, 2022 · 13 comments
Labels
Copy / view semantics Master Tracker High level tracker for similar issues

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 2, 2022

With the Copy-on-Write implementation (see #36195 / proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit, and overview follow up issue #48998), we can avoid doing an actual copy of the data in DataFrame and Series methods that typically return a copy / new object.
A typical example is the following:

df2 = df.rename(columns=str.lower)

By default, the rename() method returns a new object (DataFrame) with a copy of the data of the original DataFrame (and thus, mutating values in df2 never mutates df). With CoW enabled (pd.options.mode.copy_on_write = True), we can still return a new object, but now pointing to the same data under the hood (avoiding an initial copy), while preserving the observed behaviour of df2 being a copy / not mutating df when df2 is mutated (though the CoW mechanism, only copying the data in df2 when actually needed upon mutation, i.e. a delayed or lazy copy).

The way this is done in practice for a method like rename() or reset_index() is by using the fact that copy(deep=None) will mean a true deep copy (current default behaviour) if CoW is not enabled, and this "lazy" copy when CoW is enabled. For example:

pandas/pandas/core/frame.py

Lines 6246 to 6249 in 7bf8d6b

if inplace:
new_obj = self
else:
new_obj = self.copy(deep=None)

The initial CoW implementation in #46958 only added this logic to a few methods (to ensure this mechanism was working): rename, reset_index, reindex (when reindexing the columns), select_dtypes, to_frame and copy itself.
But there are more methods that can make use of this mechanism, and this issue is meant to as the overview issue to summarize and keep track of the progress on this front.

There is a class of methods that perform an actual operation on the data and return newly calculated data (eg typically reductions or the methods wrapping binary operators) that don't have to be considered here. It's only methods that can (potentially, in certain cases) return the original data that could make use of this optimization.

Series / DataFrame methods to update (I added a ? for the ones I wasn't directly sure about, have to look into what those exactly do to be sure, but left them here to keep track of those, can remove from the list once we know more):

Top-level functions:

Want to contribute to this issue?

Pull requests tackling one of the bullet points above are certainly welcome!

  • Pick one of the methods above (best to stick to one method per PR)
  • Update the method to make use of a lazy copy (in many cases this might mean using copy(deep=None) somewhere, but for some methods it will be more involved)
  • Add a test for it in /pandas/tests/copy_view/test_methods.py (you can mimick on of the existing ones, eg test_select_dtypes)
    • You can run the test with PANDAS_COPY_ON_WRITE=1 pytest pandas/tests/copy_view/test_methods.py to test it with CoW enabled (pandas will check that environment variable). The test needs to pass with both CoW disabled and enabled.
    • The tests make use of a using_copy_on_write fixture that can be used within the test function to test different expected results depending on whether CoW is enabled or not.
@ntachukwu
Copy link
Contributor

ntachukwu commented Nov 6, 2022

...

@ntachukwu
Copy link
Contributor

ntachukwu commented Nov 6, 2022

[] set_index

This PR 49557 is an attempt to handle the set_index method.

@seljaks
Copy link
Contributor

seljaks commented Nov 9, 2022

Hi, I'll take a look at drop and see how it goes.

@seljaks
Copy link
Contributor

seljaks commented Nov 20, 2022

Took at look at head and tail. CoW is already implented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.

@seljaks
Copy link
Contributor

seljaks commented Nov 20, 2022

Found an inconsistency while looking at squeeze. When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc . Example:

import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2], "b": [0.4, 0.5]})
df2 = df.iloc[slice(None), slice(None)]  # or [:, :]

print("before:", df, df2, sep="\n")
df2.iloc[0, 0] = 0  # does not trigger copy_on_write
print("after:", df, df2, sep="\n")  # both df and df2 are 0 at [0, 0]

@jorisvandenbossche This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series). Link to squeeze docs.

@jorisvandenbossche
Copy link
Member Author

Took at look at head and tail. CoW is already implented for these because they're just .iloc under the hood. I think they can be ticked off as is, but can submit a PR with explicit tests if needed.

Specifically for head and tail, I think we could consider changing those to actually return a hard (non-lazy) copy by default. Because those methods are typically used for interactive inspection of your data (seeing the repr of df.head()), it would be nice to avoid each of those to trigger CoW from just looking at the data.
But that for sure needs a separate issue to discuss first.

A PR to just add tests to confirm that right now they use CoW is certainly welcome.

@jorisvandenbossche
Copy link
Member Author

When a dataframe has two or more columns df.squeeze returns the dataframe unchanged. This is done by returning df.iloc[slice(None), slice(None)]. So the inconsistency is in iloc . .... This doesn't seem like the desired result. Should I open a separate issue for this or submit a PR linking here when I have a fix?

Sorry for the slow reply here. That's indeed an inconsistency in iloc with full slice. I also noticed that a while ago and had already opened a PR to fix this: #49469

So when that is merged, the squeeze behaviour should also be fixed.

Sidenote: not sure how to test squeeze behavior when going df -> series or df -> scalar without writing a util function like get_array(series)

I think for the df -> series case you can test this with the common pattern of "mutate subset (series in this case) -> ensure parent is not mutated"? (it's OK to leave out the shares_memory checks if those are difficult to express for a certain case. In the end it is the "mutations-don't-propagate" behaviour that is the actual documented / guaranteed behaviour we certainly want to have tested)

For df -> scalar, I think that scalars are in general not mutable, so this might not need to be covered?

In numpy, squeeze returns a 0-dim array, and those are mutable (at least those are views, so if you mutate the parent, the 0-dim "scalar" also gets mutated):

>>> arr = np.array([1])
>>> scalar = arr.squeeze()
>>> scalar
array(1)
>>> arr[0] = 10
>>> scalar
array(10)

but an actual scalar is not:

>>> scalar = arr[0]
>>> scalar
10
>>> type(scalar)
numpy.int64
>>> arr[0] = 100
scalar
10

In the case of pandas, iloc used under the hood by squeeze will return a numpy scalar (and not 0-dim array):

>>> type(pd.Series(arr).iloc[0])
numpy.int64

which I think means that the df / series -> scalar case is fine (we can still test it though, with mutation the parent and ensuring the scalar is still identical)

@andrewchen1216
Copy link
Contributor

I will take a look at add_prefix and add_suffix.

@phofl
Copy link
Member

phofl commented Dec 24, 2022

It looks like reindex is already handled for index as well (if possible). Would be good to double check so that I did not miss anything

@ntachukwu
Copy link
Contributor

ntachukwu commented Dec 29, 2022

I will take a look at truncate

Edit: @phofl has already covered truncate as mentioned below. I will take a look at squeeze instead.

@phofl
Copy link
Member

phofl commented Dec 29, 2022

Hm forgot to link, already opened a PR, will go ahead and link all of them now

Edit: Done

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Feb 10, 2023

I removed query from the list above. In theory it takes any kind of expression (and so you could do df.query("1") to select the first row), but any documented and tested use case is for a boolean expression, which is then passed to self.loc[mask] or self[mask], and boolean indexing is never a view.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Master Tracker High level tracker for similar issues
Projects
None yet
Development

No branches or pull requests

6 participants