Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Handle extension arrays in algorithms.diff #31025

Merged
merged 23 commits into from
Jan 23, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jan 15, 2020

Closes #30889
Closes #30967

This still needs a few things

  1. Additional tests. It's hard to write thorough tests in extensions/base, since we don't know what the dtype will be after the .diff() (subtracting doesn't necessarily return the same dtype).
  2. Decide what we want for BooleanArray.diff. Should that return an IntegerArray probably?
  3. (edit) Verify that there's no perf regression for datetime, datetimetz, timedelta.

Are we comfortable doing this for 1.0.0?

@TomAugspurger TomAugspurger added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 15, 2020
@@ -578,6 +578,11 @@ def dropna(self):
"""
return self[~self.isna()]

def diff(self, periods: int = 1):
if hasattr(self, "__sub__"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not thrilled about this, but it may be unavoidable. Happy to hear of an alternative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if it doesn't support -, the operation will also raise a TypeError? What you are doing below yourself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I meant the need for the hasattr just to please the type checker. If you do a diff on an EA that doens't implement __sub__ you'll get the TypeError either way.

@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
result = result.copy()
return cls(result)

def diff(self, periods: int = 1):
result = algorithms.diff(com.values_from_object(self._ndarray), periods)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

com.values_from_object was the old behavior of Series.diff. So this should be identical to the old behavior, just with an extra function call to go through self.array.diff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to override the base class version? Doesn't that work as well for numpy?

@@ -1962,6 +1962,14 @@ class ObjectValuesExtensionBlock(ExtensionBlock):
Series[T].values is an ndarray of objects.
"""

def diff(self, n: int, axis: int = 1) -> List["Block"]:
# Block.shape vs. Block.values.shape mismatch
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More technical debt from the 1D arrays inside 2D blocks :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and this is on ObjectValuesExtensionBlock, but is only useful for PeriodArray. IntervalArray is the only other array to use this, and doesn't implement __sub__. Is it worth creating a PeriodBlock just for this (IMO no)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this only needed for ObjectValuesExtensionBlock, and not for ExtensionBlock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that in principle, we can hit this from ExtensionBlock.

We hit the problem when going from a NonConsolidatable block type (like period) to a consolidatable one (like object). In that case, the values passed to make_block will be 1d, but the block is expecting them to be 2d.

In practice, I think that for most EAs, ExtensionArray.__sub__ will return another extension array. So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it

It certainly seems fine to ignore this corner case for now.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 15, 2020

Nice, thanks, this is much better as it also fixes the DataFrame case - should I close #30894 then?

Just a question about the tests: if I run

~/pandas$ pytest pandas/tests/extension/base/methods.py 

I get

============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.7.3, pytest-5.3.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-4.55.4, cov-2.8.1, xdist-1.30.0
collected 0 items

I haven't had this issue with testing before, and am able to run other tests - for example,

~/pandas$ pytest pandas/tests/extension/test_categorical.py 

collects and runs 331 items.

@TomAugspurger
Copy link
Contributor Author

The tests in extension/base aren't meant to be run (they don't start with Test, so pytest ignores them). Instead, they're inherited and run for all our EAs: https://dev.pandas.io/docs/development/extending.html#testing-extension-arrays

@jorisvandenbossche
Copy link
Member

Decide what we want for BooleanArray.diff. Should that return an IntegerArray probably?

Numpy keeps boolean dtype:

In [11]: np.diff(np.array([True, False, True, True]))                                                                                                                                                              
Out[11]: array([ True,  True, False])

(but not sure how useful it is. In any way, it is easy to switch between bool and int, if as a user you want one of both)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not fully convinced of the approach here.

I would consider keeping the implementation just in algorithms.diff, and have there a specific implementation that works for EAs. In principle one should be able to define it in terms of shift and subtract?

For example, our diff is different from np.diff's behaviour, which might be a reason to avoid having it on our "array"

@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
result = result.copy()
return cls(result)

def diff(self, periods: int = 1):
result = algorithms.diff(com.values_from_object(self._ndarray), periods)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to override the base class version? Doesn't that work as well for numpy?

@TomAugspurger
Copy link
Contributor Author

I’ll take a closer look next week, but I think the shift & subtract implementation uses more memory than the implementation in algorithms.

@jorisvandenbossche
Copy link
Member

Didn't look very close, but I see something like this:

out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

which seems similar?

@TomAugspurger
Copy link
Contributor Author

which seems similar?

Yes indeed. I thought we had our own iterative cython version, but I was mistaken.

I am not fully convinced of the approach here.

To confirm: you're unsure of the need to add .diff to the EA interface, correct? You'd prefer we just implement the shift & sub in algorithms? That sounds OK to me.

@TomAugspurger
Copy link
Contributor Author

Hmm, one problem with BooleanArray though. We don't allow - between two BooleanArrays

In [4]: pd.Series([True, False, True, None, True], dtype='boolean').diff()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-69fe2099d77d> in <module>
----> 1 pd.Series([True, False, True, None, True], dtype='boolean').diff()

~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
   2317         dtype: float64
   2318         """
-> 2319         result = algorithms.diff(self.array, periods)
   2320         return self._constructor(result, index=self.index).__finalize__(self)
   2321

~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
   1836
   1837     if is_extension_array_dtype(dtype):
-> 1838         return arr - arr.shift(n)
   1839
   1840     is_timedelta = False

~/sandbox/pandas/pandas/core/arrays/boolean.py in boolean_arithmetic_method(self, other)
    750
    751             with np.errstate(all="ignore"):
--> 752                 result = op(self._data, other)
    753
    754             # divmod returns a tuple

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

We do something similar with bool dtype, doing an xor instead.

        elif is_bool:
            out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]

So perhaps we check dtype.kind and do xor for boolean kinds. I think that's reasonable.

@jorisvandenbossche
Copy link
Member

To confirm: you're unsure of the need to add .diff to the EA interface, correct? You'd prefer we just implement the shift & sub in algorithms? That sounds OK to me.

Yes, that would be my personal preference for now.

@TomAugspurger
Copy link
Contributor Author

K, doing that.

This will change the behaviore of Series[SparseArray[bool]].diff. Previously, that went to an Series[object], which is not at all usefully so I'm OK breaking / fixing that.

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche can you take another look? I

  1. Removed ExtensionArray.diff
  2. Changed algos.diff to use the self - self.shift(n) for EAs
  3. Changed algos.diff to use ^ for all bool dtypes (we already did it for bool / object, just added it for EAs)

@jorisvandenbossche
Copy link
Member

Will look later in more detail, but generally looks good to me

@TomAugspurger TomAugspurger changed the title Dispatch NDFrame.diff to EAs Handle extension arrays in algorithms.diff Jan 17, 2020
@jorisvandenbossche jorisvandenbossche changed the title Handle extension arrays in algorithms.diff ENH: Handle extension arrays in algorithms.diff Jan 20, 2020
@TomAugspurger TomAugspurger modified the milestones: 1.0.0, 1.1 Jan 21, 2020
@TomAugspurger
Copy link
Contributor Author

I'm having second thoughts about the 1.0.0 milestone. I didn't realize before, but this is a behavior change for EAs that don't implement __sub__. Previously, these cast to an ndarray before doing the diff, so the dtype was lost. e.g. categorical, on master

In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
Out[2]:
0    NaN
1    1.0
2    1.0
dtype: float64

On this branch

In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c632daaa2286> in <module>
----> 1 pd.Series(pd.Categorical([1, 2, 3])).diff()

~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
   2283         dtype: float64
   2284         """
-> 2285         result = algorithms.diff(self.array, periods)
   2286         return self._constructor(result, index=self.index).__finalize__(self)
   2287

~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
   1848
   1849     if is_extension_array_dtype(dtype):
-> 1850         return op(arr, arr.shift(n))
   1851
   1852     is_timedelta = False

TypeError: unsupported operand type(s) for -: 'Categorical' and 'Categorical'

In general, I find the new behavior more useful. Users can cast before doing the operation if that's what they want. Is this too large a change for 1.0?

@jorisvandenbossche
Copy link
Member

We could also only take the new path for a subset of extension types that are "safe" (eg nullable integer, for which there is otherwise a regression, and nullable integer, which is new in 1.0.0). And then we can still change the behaviour for existing dtypes later.

However, for categorical, you could also interpret it as a breaking change, which in that sense might be nice to include in 1.0 instead of 1.1.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 21, 2020

How about a middle ground. In algorithms.diff we check

if is_extension_array_dtype(dtype):
    if hasattr(arr, '__sub__'):
         return arr - arr.shift(n)
    else:
        warnings.warn('losing dtype... will raise in the future..., convert before diffing...')
        # take the old path

that I think preserves existing behavior for people relying on the dtype being lost while letting us change the behavior in the future.

@jorisvandenbossche
Copy link
Member

That sounds good to me as well.
It's of course possible that EAs actually have a __sub__ but which raises an error .. (anyway, that seems a cornercase that we can ignore for now)

* deprecate old behavior
pandas/core/algorithms.py Show resolved Hide resolved
@@ -1860,6 +1863,12 @@ def interpolate(
placement=self.mgr_locs,
)

def diff(self, n: int, axis: int = 1) -> List["Block"]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is on ExtensionBlock now. The deprecation for EAs not implementing __sub__ required a similar block / axis handling, so I've moved it up here. The _block_shape is all the way up in Block.diff.

@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy):
# Fails creating expected
super().test_repeat(data, repeats, as_series, use_numpy)

@pytest.mark.skip(reason="algorithms.diff skips PandasArray")
def test_diff(self, data, periods):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either a bug or not implemented behavior in PandasArray. PandasArray.shift() doesn't allow changing the dtype from int to float, so diff (which uses shift) would fail.

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.1, 1.0.0 Jan 22, 2020
pandas/core/algorithms.py Outdated Show resolved Hide resolved
@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy):
# Fails creating expected
super().test_repeat(data, repeats, as_series, use_numpy)

@pytest.mark.skip(reason="algorithms.diff skips PandasArray")
def test_diff(self, data, periods):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe xfail it then?

@TomAugspurger
Copy link
Contributor Author

All green.

FYI, I rewrote the base extension test a bit so that it works for boolean dtypes as well, so it's no longer skipped for BooleanArray.

@jorisvandenbossche jorisvandenbossche merged commit 283a882 into pandas-dev:master Jan 23, 2020
@jorisvandenbossche
Copy link
Member

@TomAugspurger Thanks!

@jorisvandenbossche
Copy link
Member

@meeseeksdev backport to 1.0.x

@jorisvandenbossche
Copy link
Member

It seems that the backports don't get automatically picked from merging PRs with the correct milestone at the moment .. Should probably check other PRs that were merged recently

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.diff fails with nullable integers Series lose nullable integer dtype after call to .diff()
5 participants