BUG: DataFrame reductions with object dtype and axis=1 #49603

rhshadrach · 2022-11-09T20:10:45Z

In pandas 1.5.x and earlier, the result dtype of a reduction on object-dtype data with axis=1 depends on numeric_only:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

print(df.sum(axis=1, numeric_only=None))
# 0    5.0
# 1    7.0
# 2    9.0
# dtype: float64

print(df.sum(axis=1, numeric_only=False))
# 0    5
# 1    7
# 2    9
# dtype: object

When using axis=0, both examples above result in object dtype. #49551 changed the numeric_only=False case to return float64 in order to preserve the default behavior. However, it seems better for both examples above to return object dtype instead.


rhshadrach · 2022-12-21T14:03:21Z

I'm now thinking that the result should be int instead.

Here are the results with the BlockManager:

import pandas as pd

pd.set_option('data_manager', 'block')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: object

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

And with the ArrayManager:

pd.set_option('data_manager', 'array')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: int64

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

For df.sum(axis=0, numeric_only=False):

With the ArrayManager, the 1-dim reduction results in a Python object (an integer), which is then inferred to be of integer type when the resulting array is constructed. With the BlockManager, the 2-dim reduction results in a 1-dim NumPy array of object type (which remains object type).

In other words, the ArrayManager naturally needs to infer a dtype from Python objects, whereas the BlockManager does not. I think these should be consistent, but I could see arguments either way. I would prefer that the BlockManager attempt to convert from object rather than have the ArrayManager cast to object.
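The difference described above can be reproduced without switching data managers. The following is only a sketch of the mechanism, not the managers' actual code path: it assumes that building a result from a Python integer stands in for the 1-dim reduction and that building it from an object-dtype NumPy array stands in for the 2-dim reduction.

```python
import numpy as np
import pandas as pd

# A Python int, standing in for the result of reducing one 1-dim object column.
from_python_object = pd.Series([1], index=["a"])

# An object-dtype NumPy array, standing in for the result of a 2-dim reduction.
from_object_array = pd.Series(np.array([1], dtype=object), index=["a"])

print(from_python_object.dtype)  # int64: inferred from the Python object
print(from_object_array.dtype)   # object: the array's dtype is kept as-is

# Opting in to inference converts the object result to an integer dtype.
inferred = from_object_array.infer_objects()
print(inferred.dtype)            # int64
```

Making the two managers consistent amounts to choosing between these: either infer in both cases, or keep object in both.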

For df.sum(axis=1, numeric_only=False):

This result being float looks like a bug to me; we explicitly try to cast to float at the end of DataFrame._reduce, but only when numeric_only=False and axis=1.

cc @jorisvandenbossche @jbrockmendel for any thoughts

jbrockmendel · 2022-12-21T18:56:13Z

The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.
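As a small illustration of this analogy (the frame below reuses the example from the original report; the variable names are only for illustration), adding object-dtype columns one by one keeps object dtype:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype=object)

# Add the columns one by one, as in df[0] + df[1] + ... + df[N].
total = df["a"] + df["b"]

print(total.tolist())  # [5, 7, 9]
print(total.dtype)     # object
```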

rhshadrach · 2022-12-21T19:12:17Z

> The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

In the linked issue, the dtype starts as Int64, so it makes sense that object is wrong. Here, however, the starting dtype is object; does it make sense for the output to be int as opposed to object?

> My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.

Thanks, I find this compelling. What about for something like min that can't be expressed in this fashion?

rhshadrach · 2023-02-11T21:35:07Z

In line with #51205 for groupby, I think object dtypes should remain object across all aggregations.
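To check how an installed pandas version behaves against this proposal, a quick sketch like the following lists the result dtype of a few reductions along each axis. The choice of reductions and the `dtypes` dictionary are only illustrative, and the printed dtypes depend on the pandas version, so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype=object)

# Collect the result dtype of each reduction along each axis.
dtypes = {
    (name, axis): getattr(df, name)(axis=axis).dtype
    for name in ("sum", "min", "max")
    for axis in (0, 1)
}

for (name, axis), dtype in dtypes.items():
    print(f"{name}(axis={axis}) -> {dtype}")
```

Under the proposal, every line would report object.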

rhshadrach added the Bug, Dtype Conversions, DataFrame, and Reduction Operations labels Nov 9, 2022

rhshadrach self-assigned this Nov 9, 2022

This was referenced Nov 9, 2022

DEPR: Enforce deprecation of numeric_only=None in DataFrame aggregations #49551

Merged

BUG: DataFrame reductions with object dtype and axis=1 #49616

Closed

BUG: DataFrame reductions with object dtype and axis=1 #49619

Closed

rhshadrach mentioned this issue Dec 13, 2022

BUG: DataFrame reductions with object dtype and axis=1 #50224

Closed

brobr mentioned this issue Dec 26, 2022

BUG: Nullable integer type ("Int64") lost after summing along columns-index [df.sum(axis=1) #50438

Closed

rhshadrach mentioned this issue Feb 11, 2023

BUG: DataFrame reductions dtypes on object input #51335

Merged

mroeschke closed this as completed in #51335 Feb 18, 2023
