ENH: Handle extension arrays in algorithms.diff #31025

TomAugspurger · 2020-01-15T00:26:03Z

This still needs a few things

Additional tests. It's hard to write thorough tests in extensions/base, since we don't know what the dtype will be after the .diff() (subtracting doesn't necessarily return the same dtype).
Decide what we want for BooleanArray.diff. Should that probably return an IntegerArray?
(edit) Verify that there's no perf regression for datetime, datetimetz, timedelta.

Are we comfortable doing this for 1.0.0?

Closes pandas-dev#30889 Closes pandas-dev#30967

TomAugspurger · 2020-01-15T00:27:34Z

pandas/core/arrays/base.py

@@ -578,6 +578,11 @@ def dropna(self):
        """
        return self[~self.isna()]

+    def diff(self, periods: int = 1):
+        if hasattr(self, "__sub__"):


I'm not thrilled about this, but it may be unavoidable. Happy to hear of an alternative.

But if it doesn't support -, the operation will raise a TypeError anyway, which is what you are doing below yourself?

Right. I meant the need for the hasattr just to please the type checker. If you do a diff on an EA that doesn't implement __sub__ you'll get the TypeError either way.

TomAugspurger · 2020-01-15T00:28:18Z

pandas/core/arrays/numpy_.py

@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
            result = result.copy()
        return cls(result)

+    def diff(self, periods: int = 1):
+        result = algorithms.diff(com.values_from_object(self._ndarray), periods)


com.values_from_object was the old behavior of Series.diff. So this should be identical to the old behavior, just with an extra function call to go through self.array.diff.

Why do you need to override the base class version? Doesn't that work as well for numpy?

TomAugspurger · 2020-01-15T00:29:08Z

pandas/core/internals/blocks.py

@@ -1962,6 +1962,14 @@ class ObjectValuesExtensionBlock(ExtensionBlock):
    Series[T].values is an ndarray of objects.
    """

+    def diff(self, n: int, axis: int = 1) -> List["Block"]:
+        # Block.shape vs. Block.values.shape mismatch


More technical debt from the 1D arrays inside 2D blocks :(

Oh, and this is on ObjectValuesExtensionBlock, but is only useful for PeriodArray. IntervalArray is the only other array to use this, and doesn't implement __sub__. Is it worth creating a PeriodBlock just for this (IMO no)?

Why is this only needed for ObjectValuesExtensionBlock, and not for ExtensionBlock?

I suppose that in principle, we can hit this from ExtensionBlock.

We hit the problem when going from a NonConsolidatable block type (like period) to a consolidatable one (like object). In that case, the values passed to make_block will be 1d, but the block is expecting them to be 2d.

In practice, I think that for most EAs, ExtensionArray.__sub__ will return another extension array. So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it. What do you think?

> So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it

It certainly seems fine to ignore this corner case for now.
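The 1d-versus-2d mismatch discussed in this thread can be illustrated with plain NumPy. This is only a sketch: it assumes that a consolidatable block stores its values as (n_columns, n_rows), and the variable names are made up for the example.

```python
import numpy as np

# A 1D result, such as what a diff on a 1D extension array might produce.
values_1d = np.array([0.0, 1.0, 1.0])

# Assumption: a consolidatable block expects (n_columns, n_rows),
# so a single column of three rows needs shape (1, 3).
values_2d = values_1d.reshape(1, -1)

print(values_1d.shape, values_2d.shape)
```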

MarcoGorelli · 2020-01-15T11:03:27Z

Nice, thanks, this is much better as it also fixes the DataFrame case - should I close #30894 then?

Just a question about the tests: if I run

~/pandas$ pytest pandas/tests/extension/base/methods.py

I get

============================================================================================== test session starts ==============================================================================================
platform linux -- Python 3.7.3, pytest-5.3.2, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-4.55.4, cov-2.8.1, xdist-1.30.0
collected 0 items

I haven't had this issue with testing before, and am able to run other tests - for example,

~/pandas$ pytest pandas/tests/extension/test_categorical.py

collects and runs 331 items.

TomAugspurger · 2020-01-15T11:55:56Z

The tests in extension/base aren't meant to be run (they don't start with Test, so pytest ignores them). Instead, they're inherited and run for all our EAs: https://dev.pandas.io/docs/development/extending.html#testing-extension-arrays
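A simplified sketch of that inheritance pattern, not the actual pandas test suite: `BaseDiffTests`, `TestListDiff`, and the `data` method are hypothetical names, and the real tests supply data through pytest fixtures. By default pytest only collects classes whose names start with `Test`, so the base class is skipped while the subclass is collected.

```python
class BaseDiffTests:
    """Reusable checks. No 'Test' prefix, so pytest does not collect this class."""

    def data(self):
        raise NotImplementedError

    def test_diff_length(self):
        values = self.data()
        diffs = [b - a for a, b in zip(values, values[1:])]
        assert len(diffs) == len(values) - 1


class TestListDiff(BaseDiffTests):
    """Concrete subclass. pytest collects it and runs the inherited test."""

    def data(self):
        return [1, 3, 6]
```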

jorisvandenbossche · 2020-01-15T21:30:18Z

> Decide what we want for BooleanArray.diff. Should that return an IntegerArray probably?

Numpy keeps boolean dtype:

In [11]: np.diff(np.array([True, False, True, True]))                                                                                                                                                              
Out[11]: array([ True,  True, False])

(but I'm not sure how useful that is. In any case, it is easy to switch between bool and int if, as a user, you want one or the other)

jorisvandenbossche

I am not fully convinced of the approach here.

I would consider keeping the implementation just in algorithms.diff, and have there a specific implementation that works for EAs. In principle one should be able to define it in terms of shift and subtract?

For example, our diff is different from np.diff's behaviour, which might be a reason to avoid having it on our "array"
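A small sketch of both points, assuming a pandas version in which nullable integer arrays support `shift` and `-`. Shift-and-subtract keeps the original length and marks the first position as missing, whereas `np.diff` returns a shorter array.

```python
import numpy as np
import pandas as pd

arr = pd.array([1, 3, 6], dtype="Int64")

# Shift-and-subtract: same length as the input, first slot is missing.
shift_sub = arr - arr.shift(1)

# np.diff: drops the first position instead of filling it.
np_result = np.diff(np.array([1, 3, 6]))

print(len(shift_sub), len(np_result))
```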

jorisvandenbossche · 2020-01-15T21:37:47Z

pandas/core/arrays/numpy_.py

@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
            result = result.copy()
        return cls(result)

+    def diff(self, periods: int = 1):
+        result = algorithms.diff(com.values_from_object(self._ndarray), periods)


Why do you need to override the base class version? Doesn't that work as well for numpy?

TomAugspurger · 2020-01-15T22:01:03Z

I’ll take a closer look next week, but I think the shift & subtract implementation uses more memory than the implementation in algorithms.

jorisvandenbossche · 2020-01-16T10:15:55Z

Didn't look very closely, but I see something like this:

pandas/pandas/core/algorithms.py

Line 1893 in 5d49730

out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]

which seems similar?
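For reference, the indexer-based assignment quoted above can be sketched for the 1D float case as follows. `diff_with_indexers` is a hypothetical helper written for this illustration, not the function in pandas/core/algorithms.py.

```python
import numpy as np


def diff_with_indexers(arr, n=1):
    """1D sketch: subtract the element n positions back, NaN where none exists."""
    arr = np.asarray(arr, dtype="float64")
    out_arr = np.full(arr.shape, np.nan)

    # Positions in the result that have a partner n steps away.
    res_indexer = slice(n, None) if n >= 0 else slice(None, n)
    # The partner positions themselves.
    lag_indexer = slice(None, -n) if n > 0 else slice(-n, None)

    out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
    return out_arr
```

Both sides of the subtraction are slices of the same array, so this is the same shift-and-subtract idea expressed with indexers.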

TomAugspurger · 2020-01-16T15:17:56Z

> which seems similar?

Yes indeed. I thought we had our own iterative cython version, but I was mistaken.

> I am not fully convinced of the approach here.

To confirm: you're unsure of the need to add .diff to the EA interface, correct? You'd prefer we just implement the shift & sub in algorithms? That sounds OK to me.

TomAugspurger · 2020-01-16T15:23:33Z

Hmm, one problem with BooleanArray though. We don't allow - between two BooleanArrays

In [4]: pd.Series([True, False, True, None, True], dtype='boolean').diff()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-69fe2099d77d> in <module>
----> 1 pd.Series([True, False, True, None, True], dtype='boolean').diff()

~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
   2317         dtype: float64
   2318         """
-> 2319         result = algorithms.diff(self.array, periods)
   2320         return self._constructor(result, index=self.index).__finalize__(self)
   2321

~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
   1836
   1837     if is_extension_array_dtype(dtype):
-> 1838         return arr - arr.shift(n)
   1839
   1840     is_timedelta = False

~/sandbox/pandas/pandas/core/arrays/boolean.py in boolean_arithmetic_method(self, other)
    750
    751             with np.errstate(all="ignore"):
--> 752                 result = op(self._data, other)
    753
    754             # divmod returns a tuple

TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

We do something similar with bool dtype, doing an xor instead.

        elif is_bool:
            out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]

So perhaps we check dtype.kind and do xor for boolean kinds. I think that's reasonable.
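A minimal sketch of that idea on plain NumPy arrays: pick the operator from `dtype.kind`, using xor for booleans and subtraction otherwise. `diff_op_for` is a hypothetical helper name, and the example only computes the positions that have a lagged partner.

```python
import operator

import numpy as np


def diff_op_for(dtype):
    """Return xor for boolean kinds and subtraction for everything else."""
    return operator.xor if np.dtype(dtype).kind == "b" else operator.sub


values = np.array([True, False, True, True])

# NumPy refuses to subtract boolean arrays.
try:
    values[1:] - values[:-1]
    subtraction_works = True
except TypeError:
    subtraction_works = False

# xor marks the positions where the value changed.
op = diff_op_for(values.dtype)
changed = op(values[1:], values[:-1])
```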

jorisvandenbossche · 2020-01-16T15:43:22Z

> To confirm: you're unsure of the need to add .diff to the EA interface, correct? You'd prefer we just implement the shift & sub in algorithms? That sounds OK to me.

Yes, that would be my personal preference for now.

TomAugspurger · 2020-01-17T16:01:04Z

K, doing that.

This will change the behavior of Series[SparseArray[bool]].diff. Previously, that went to a Series[object], which is not at all useful, so I'm OK breaking / fixing that.

TomAugspurger · 2020-01-17T16:40:21Z

@jorisvandenbossche can you take another look? I

Removed ExtensionArray.diff
Changed algos.diff to use self - self.shift(n) for EAs
Changed algos.diff to use ^ for all bool dtypes (we already did it for bool / object, just added it for EAs)

jorisvandenbossche · 2020-01-17T16:51:06Z

Will look later in more detail, but generally looks good to me

TomAugspurger · 2020-01-21T14:27:01Z

I'm having second thoughts about the 1.0.0 milestone. I didn't realize before, but this is a behavior change for EAs that don't implement __sub__. Previously, these cast to an ndarray before doing the diff, so the dtype was lost. e.g. categorical, on master

In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
Out[2]:
0    NaN
1    1.0
2    1.0
dtype: float64

On this branch

In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-c632daaa2286> in <module>
----> 1 pd.Series(pd.Categorical([1, 2, 3])).diff()

~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
   2283         dtype: float64
   2284         """
-> 2285         result = algorithms.diff(self.array, periods)
   2286         return self._constructor(result, index=self.index).__finalize__(self)
   2287

~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
   1848
   1849     if is_extension_array_dtype(dtype):
-> 1850         return op(arr, arr.shift(n))
   1851
   1852     is_timedelta = False

TypeError: unsupported operand type(s) for -: 'Categorical' and 'Categorical'

In general, I find the new behavior more useful. Users can cast before doing the operation if that's what they want. Is this too large a change for 1.0?
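For users who do want the old numeric result, casting first might look like the sketch below. It assumes the categories are integers, so that `astype("int64")` is valid.

```python
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3]))

# Cast to a numeric dtype explicitly, then diff.
result = cat.astype("int64").diff()
```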

jorisvandenbossche · 2020-01-21T14:33:35Z

We could also only take the new path for a subset of extension types that are "safe" (e.g. nullable integer, for which there is otherwise a regression, and nullable boolean, which is new in 1.0.0). And then we can still change the behaviour for existing dtypes later.

However, for categorical, you could also interpret it as a breaking change, which in that sense might be nice to include in 1.0 instead of 1.1.

TomAugspurger · 2020-01-21T15:01:55Z

How about a middle ground. In algorithms.diff we check

if is_extension_array_dtype(dtype):
    if hasattr(arr, '__sub__'):
        return arr - arr.shift(n)
    else:
        warnings.warn('losing dtype... will raise in the future..., convert before diffing...')
        # take the old path

That, I think, preserves existing behavior for people relying on the dtype being lost, while letting us change the behavior in the future.
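A runnable sketch of this middle ground, using toy classes rather than real extension arrays. `ToyArray`, `ToySubArray`, and `diff_with_fallback` are hypothetical names, the warning text is a placeholder, and only positive `n` is handled.

```python
import warnings

import numpy as np


class ToyArray:
    """Hypothetical array-like that supports shift() but not subtraction."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype="float64")

    def __array__(self, dtype=None, copy=None):
        return self._values if dtype is None else self._values.astype(dtype)

    def shift(self, n):
        out = np.full(self._values.shape, np.nan)
        out[n:] = self._values[:-n]
        return type(self)(out)


class ToySubArray(ToyArray):
    """The same toy array, but with __sub__ implemented."""

    def __sub__(self, other):
        return type(self)(self._values - other._values)


def diff_with_fallback(arr, n=1):
    """Sketch for n >= 1 only."""
    if n < 1:
        raise ValueError("this sketch only handles n >= 1")
    if hasattr(arr, "__sub__"):
        # New path: the array type decides what the result looks like.
        return arr - arr.shift(n)
    warnings.warn(
        "dtype is lost in diff; convert before diffing",
        FutureWarning,
        stacklevel=2,
    )
    # Old path: cast to a plain ndarray and diff that.
    values = np.asarray(arr, dtype="float64")
    out = np.full(values.shape, np.nan)
    out[n:] = values[n:] - values[:-n]
    return out
```

With this shape, `diff_with_fallback(ToySubArray([1, 2, 3]))` returns a `ToySubArray`, while `diff_with_fallback(ToyArray([1, 2, 3]))` warns and returns a plain ndarray.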

jorisvandenbossche · 2020-01-21T15:19:55Z

That sounds good to me as well.
It's of course possible that an EA actually has a __sub__ that raises an error (but that seems a corner case that we can ignore for now).

TomAugspurger · 2020-01-21T15:33:46Z

pandas/core/internals/blocks.py

@@ -1860,6 +1863,12 @@ def interpolate(
            placement=self.mgr_locs,
        )

+    def diff(self, n: int, axis: int = 1) -> List["Block"]:


Note this is on ExtensionBlock now. The deprecation for EAs not implementing __sub__ required a similar block / axis handling, so I've moved it up here. The _block_shape is all the way up in Block.diff.

TomAugspurger · 2020-01-21T15:34:38Z

pandas/tests/extension/test_numpy.py

@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy):
        # Fails creating expected
        super().test_repeat(data, repeats, as_series, use_numpy)

+    @pytest.mark.skip(reason="algorithms.diff skips PandasArray")
+    def test_diff(self, data, periods):


Either a bug or not implemented behavior in PandasArray. PandasArray.shift() doesn't allow changing the dtype from int to float, so diff (which uses shift) would fail.
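The underlying constraint can be shown with plain NumPy: an integer array cannot hold NaN, so a shift that introduces a missing value has to upcast to float first. This sketch does not use PandasArray itself.

```python
import numpy as np

ints = np.array([1, 2, 3])

# An integer array cannot store NaN.
try:
    ints[0] = np.nan
    stored = True
except ValueError:
    stored = False

# Upcasting to float first makes room for the missing value.
floats = ints.astype("float64")
shifted = np.empty_like(floats)
shifted[0] = np.nan
shifted[1:] = floats[:-1]
```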

jreback · 2020-01-23T14:20:38Z

pandas/tests/extension/test_numpy.py

@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy):
        # Fails creating expected
        super().test_repeat(data, repeats, as_series, use_numpy)

+    @pytest.mark.skip(reason="algorithms.diff skips PandasArray")
+    def test_diff(self, data, periods):


maybe xfail it then?

TomAugspurger · 2020-01-23T15:59:52Z

All green.

FYI, I rewrote the base extension test a bit so that it works for boolean dtypes as well, so it's no longer skipped for BooleanArray.

jorisvandenbossche · 2020-01-23T19:00:19Z

@TomAugspurger Thanks!

jorisvandenbossche · 2020-01-23T19:02:59Z

@meeseeksdev backport to 1.0.x

jorisvandenbossche · 2020-01-23T19:03:41Z

It seems that backports are not being picked up automatically at the moment when PRs with the correct milestone are merged. We should probably check other PRs that were merged recently.

Dispatch NDFrame.diff to EAs

fcde96b

Closes pandas-dev#30889 Closes pandas-dev#30967

TomAugspurger added the Algos and ExtensionArray labels Jan 15, 2020

TomAugspurger mentioned this pull request Jan 15, 2020

[BUG] Series.diff was always setting the dtype to object #30894

Closed

TomAugspurger commented Jan 15, 2020

jorisvandenbossche reviewed Jan 15, 2020

Merge remote-tracking branch 'upstream/master' into 30889-diff

7c5e6f7

Merge remote-tracking branch 'upstream/master' into 30889-diff

3cc7c11

wip

5017912

TomAugspurger added 4 commits January 17, 2020 10:33

xor

dfea6a5

Merge remote-tracking branch 'upstream/master' into 30889-diff

38fe40c

doc

fc6eef0

cleanup

84e5e93

TomAugspurger changed the title ~~Dispatch NDFrame.diff to EAs~~ Handle extension arrays in algorithms.diff Jan 17, 2020

TomAugspurger added 4 commits January 17, 2020 13:06

skip pandasraray

4183b5b

Merge remote-tracking branch 'upstream/master' into 30889-diff

ab9b23f

docstrings

2f5d55f

skpi

e0ce8be

jorisvandenbossche changed the title ~~Handle extension arrays in algorithms.diff~~ ENH: Handle extension arrays in algorithms.diff Jan 20, 2020

TomAugspurger modified the milestones: 1.0.0, 1.1 Jan 21, 2020

TomAugspurger added 2 commits January 21, 2020 09:27

Updates

f3af8f5

* deprecate old behavior

localize

4d0c5cf

TomAugspurger commented Jan 21, 2020

missing backtick

6843e2b

jorisvandenbossche modified the milestones: 1.1, 1.0.0 Jan 22, 2020

jorisvandenbossche reviewed Jan 22, 2020

TomAugspurger added 3 commits January 22, 2020 08:22

fixup

bd6c157

Merge remote-tracking branch 'upstream/master' into 30889-diff

7861f57

Merge remote-tracking branch 'upstream/master' into 30889-diff

a496f13

jorisvandenbossche approved these changes Jan 23, 2020

jreback reviewed Jan 23, 2020

TomAugspurger added 3 commits January 23, 2020 08:28

Merge remote-tracking branch 'upstream/master' into 30889-diff

869ce96

asarray

8fa2836

xfails, boolean

d34ffe3

jorisvandenbossche merged commit 283a882 into pandas-dev:master Jan 23, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 23, 2020

Backport PR pandas-dev#31025: ENH: Handle extension arrays in algorithms.diff

33235be

meeseeksmachine mentioned this pull request Jan 23, 2020

Backport PR #31025 on branch 1.0.x (ENH: Handle extension arrays in algorithms.diff) #31255

Merged

TomAugspurger deleted the 30889-diff branch January 23, 2020 19:29

WillAyd pushed a commit that referenced this pull request Jan 23, 2020

Backport PR #31025: ENH: Handle extension arrays in algorithms.diff (#31255)

01f9742
