-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Handle extension arrays in algorithms.diff #31025
ENH: Handle extension arrays in algorithms.diff #31025
Conversation
Closes pandas-dev#30889 Closes pandas-dev#30967
pandas/core/arrays/base.py
Outdated
@@ -578,6 +578,11 @@ def dropna(self): | |||
""" | |||
return self[~self.isna()] | |||
|
|||
def diff(self, periods: int = 1): | |||
if hasattr(self, "__sub__"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not thrilled about this, but it may be unavoidable. Happy to hear of an alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if it doesn't support -
, the operation will also raise a TypeError? What you are doing below yourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I meant the need for the hasattr just to please the type checker. If you do a diff on an EA that doens't implement __sub__
you'll get the TypeError either way.
pandas/core/arrays/numpy_.py
Outdated
@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |||
result = result.copy() | |||
return cls(result) | |||
|
|||
def diff(self, periods: int = 1): | |||
result = algorithms.diff(com.values_from_object(self._ndarray), periods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
com.values_from_object
was the old behavior of Series.diff. So this should be identical to the old behavior, just with an extra function call to go through self.array.diff
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to override the base class version? Doesn't that work as well for numpy?
pandas/core/internals/blocks.py
Outdated
@@ -1962,6 +1962,14 @@ class ObjectValuesExtensionBlock(ExtensionBlock): | |||
Series[T].values is an ndarray of objects. | |||
""" | |||
|
|||
def diff(self, n: int, axis: int = 1) -> List["Block"]: | |||
# Block.shape vs. Block.values.shape mismatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More technical debt from the 1D arrays inside 2D blocks :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and this is on ObjectValuesExtensionBlock, but is only useful for PeriodArray. IntervalArray is the only other array to use this, and doesn't implement __sub__
. Is it worth creating a PeriodBlock
just for this (IMO no)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this only needed for ObjectValuesExtensionBlock, and not for ExtensionBlock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that in principle, we can hit this from ExtensionBlock.
We hit the problem when going from a NonConsolidatable block type (like period) to a consolidatable one (like object). In that case, the values passed to make_block
will be 1d, but the block is expecting them to be 2d.
In practice, I think that for most EAs, ExtensionArray.__sub__
will return another extension array. So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it
It certainly seems fine to ignore this corner case for now.
Nice, thanks, this is much better as it also fixes the DataFrame case - should I close #30894 then? Just a question about the tests: if I run
I get
I haven't had this issue with testing before, and am able to run other tests - for example,
collects and runs 331 items. |
The tests in |
Numpy keeps boolean dtype:
(but not sure how useful it is. In any way, it is easy to switch between bool and int, if as a user you want one of both) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not fully convinced of the approach here.
I would consider keeping the implementation just in algorithms.diff
, and have there a specific implementation that works for EAs. In principle one should be able to define it in terms of shift and subtract?
For example, our diff
is different from np.diff
's behaviour, which might be a reason to avoid having it on our "array"
pandas/core/arrays/numpy_.py
Outdated
@@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |||
result = result.copy() | |||
return cls(result) | |||
|
|||
def diff(self, periods: int = 1): | |||
result = algorithms.diff(com.values_from_object(self._ndarray), periods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to override the base class version? Doesn't that work as well for numpy?
I’ll take a closer look next week, but I think the shift & subtract implementation uses more memory than the implementation in algorithms. |
Didn't look very close, but I see something like this: pandas/pandas/core/algorithms.py Line 1893 in 5d49730
which seems similar? |
Yes indeed. I thought we had our own iterative cython version, but I was mistaken.
To confirm: you're unsure of the need to add |
Hmm, one problem with BooleanArray though. We don't allow In [4]: pd.Series([True, False, True, None, True], dtype='boolean').diff()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-69fe2099d77d> in <module>
----> 1 pd.Series([True, False, True, None, True], dtype='boolean').diff()
~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
2317 dtype: float64
2318 """
-> 2319 result = algorithms.diff(self.array, periods)
2320 return self._constructor(result, index=self.index).__finalize__(self)
2321
~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
1836
1837 if is_extension_array_dtype(dtype):
-> 1838 return arr - arr.shift(n)
1839
1840 is_timedelta = False
~/sandbox/pandas/pandas/core/arrays/boolean.py in boolean_arithmetic_method(self, other)
750
751 with np.errstate(all="ignore"):
--> 752 result = op(self._data, other)
753
754 # divmod returns a tuple
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead. We do something similar with bool dtype, doing an xor instead. elif is_bool:
out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer] So perhaps we check |
Yes, that would be my personal preference for now. |
K, doing that. This will change the behaviore of Series[SparseArray[bool]].diff. Previously, that went to an Series[object], which is not at all usefully so I'm OK breaking / fixing that. |
@jorisvandenbossche can you take another look? I
|
Will look later in more detail, but generally looks good to me |
I'm having second thoughts about the 1.0.0 milestone. I didn't realize before, but this is a behavior change for EAs that don't implement In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
Out[2]:
0 NaN
1 1.0
2 1.0
dtype: float64 On this branch In [2]: pd.Series(pd.Categorical([1, 2, 3])).diff()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-c632daaa2286> in <module>
----> 1 pd.Series(pd.Categorical([1, 2, 3])).diff()
~/sandbox/pandas/pandas/core/series.py in diff(self, periods)
2283 dtype: float64
2284 """
-> 2285 result = algorithms.diff(self.array, periods)
2286 return self._constructor(result, index=self.index).__finalize__(self)
2287
~/sandbox/pandas/pandas/core/algorithms.py in diff(arr, n, axis)
1848
1849 if is_extension_array_dtype(dtype):
-> 1850 return op(arr, arr.shift(n))
1851
1852 is_timedelta = False
TypeError: unsupported operand type(s) for -: 'Categorical' and 'Categorical' In general, I find the new behavior more useful. Users can cast before doing the operation if that's what they want. Is this too large a change for 1.0? |
We could also only take the new path for a subset of extension types that are "safe" (eg nullable integer, for which there is otherwise a regression, and nullable integer, which is new in 1.0.0). And then we can still change the behaviour for existing dtypes later. However, for categorical, you could also interpret it as a breaking change, which in that sense might be nice to include in 1.0 instead of 1.1. |
How about a middle ground. In algorithms.diff we check if is_extension_array_dtype(dtype):
if hasattr(arr, '__sub__'):
return arr - arr.shift(n)
else:
warnings.warn('losing dtype... will raise in the future..., convert before diffing...')
# take the old path that I think preserves existing behavior for people relying on the dtype being lost while letting us change the behavior in the future. |
That sounds good to me as well. |
@@ -1860,6 +1863,12 @@ def interpolate( | |||
placement=self.mgr_locs, | |||
) | |||
|
|||
def diff(self, n: int, axis: int = 1) -> List["Block"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this is on ExtensionBlock now. The deprecation for EAs not implementing __sub__
required a similar block / axis handling, so I've moved it up here. The _block_shape
is all the way up in Block.diff.
@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy): | |||
# Fails creating expected | |||
super().test_repeat(data, repeats, as_series, use_numpy) | |||
|
|||
@pytest.mark.skip(reason="algorithms.diff skips PandasArray") | |||
def test_diff(self, data, periods): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either a bug or not implemented behavior in PandasArray. PandasArray.shift()
doesn't allow changing the dtype from int to float, so diff
(which uses shift) would fail.
@@ -248,6 +248,10 @@ def test_repeat(self, data, repeats, as_series, use_numpy): | |||
# Fails creating expected | |||
super().test_repeat(data, repeats, as_series, use_numpy) | |||
|
|||
@pytest.mark.skip(reason="algorithms.diff skips PandasArray") | |||
def test_diff(self, data, periods): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe xfail it then?
All green. FYI, I rewrote the base extension test a bit so that it works for boolean dtypes as well, so it's no longer skipped for BooleanArray. |
@TomAugspurger Thanks! |
@meeseeksdev backport to 1.0.x |
It seems that the backports don't get automatically picked from merging PRs with the correct milestone at the moment .. Should probably check other PRs that were merged recently |
Closes #30889
Closes #30967
This still needs a few things
extensions/base
, since we don't know what the dtype will be after the.diff()
(subtracting doesn't necessarily return the same dtype).Are we comfortable doing this for 1.0.0?