BUG: DataFrame reductions with object dtype and axis=1 #49603

Closed
rhshadrach opened this issue Nov 9, 2022 · 4 comments · Fixed by #51335
Labels: Bug, DataFrame, Dtype Conversions, Reduction Operations

@rhshadrach
Member

In pandas 1.5.x and earlier, the result dtype with axis=1 depends on whether numeric_only is None or False:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)

print(df.sum(axis=1, numeric_only=None))
# 0    5.0
# 1    7.0
# 2    9.0
# dtype: float64

print(df.sum(axis=1, numeric_only=False))
# 0    5
# 1    7
# 2    9
# dtype: object

When using axis=0, both the examples above result in object dtype. #49551 changed the numeric_only=False case to return float64 in order to preserve the default behavior. However, it seems better to have the result be object dtype for both examples above instead.

@rhshadrach
Member Author

rhshadrach commented Dec 21, 2022

I'm now thinking that the result should be int instead.

Here are the results with the BlockManager:

pd.set_option('data_manager', 'block')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: object

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

and with the ArrayManager:

pd.set_option('data_manager', 'array')
df = pd.DataFrame({'a': [1]}, dtype=object)

print(df.sum(axis=0, numeric_only=False))
# a    1
# dtype: int64

print(df.sum(axis=0, numeric_only=True))
# Series([], dtype: float64)

print(df.sum(axis=1, numeric_only=False))
# 0    1.0
# dtype: float64

print(df.sum(axis=1, numeric_only=True))
# 0    0.0
# dtype: float64

For df.sum(axis=0, numeric_only=False):

With the ArrayManager, the 1-dim reduction results in a Python object (an integer), which is then inferred to be of integer type when the resulting array is constructed. With the BlockManager, the 2-dim reduction results in a 1-dim NumPy array of object type (which remains object type).

In other words, the ArrayManager naturally needs to infer from Python objects whereas the BlockManager does not. I think these should be consistent, but could see arguments either way. I think I would prefer the BlockManager to attempt to convert from object rather than the ArrayManager casting to object.
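
The inference difference described above can be reproduced outside the reduction code path. The sketch below is not how either manager is implemented; it only assumes that the ArrayManager path ends up building the result from plain Python scalars, while the BlockManager path wraps an existing object-dtype NumPy array.

```python
import numpy as np
import pandas as pd

# Constructing from Python scalars: pandas infers the dtype from the values.
from_scalars = pd.Series({"a": 1})
print(from_scalars.dtype)  # int64

# Wrapping an existing object-dtype ndarray: the dtype is kept as-is.
from_object_array = pd.Series(np.array([1], dtype=object), index=["a"])
print(from_object_array.dtype)  # object

# Explicit inference converts the object result to a numeric dtype.
converted = from_object_array.infer_objects()
print(converted.dtype)  # int64
```

Making the two managers agree would mean either adding an inference step like `infer_objects()` on the BlockManager side or casting back to object on the ArrayManager side.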

For df.sum(axis=1, numeric_only=False):

This result being float looks like a bug to me; we explicitly try to cast to float at the end of DataFrame._reduce, but only when numeric_only=False and axis=1.
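
To illustrate why a "try to cast to float" step turns an object result of integers into float64, here is a minimal sketch. The helper `try_cast_to_float` is hypothetical and is not the code in DataFrame._reduce; it only mimics the general pattern of attempting a float cast and falling back to the original values.

```python
import numpy as np


def try_cast_to_float(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: cast to float64 if possible, else return unchanged."""
    try:
        return values.astype(np.float64)
    except (ValueError, TypeError):
        # Values that cannot be interpreted as floats keep their dtype.
        return values


# Object array holding Python ints: the cast succeeds, so ints become floats.
ints_as_object = np.array([5, 7, 9], dtype=object)
print(try_cast_to_float(ints_as_object).dtype)  # float64

# Object array with a non-numeric entry: the cast fails and object is kept.
mixed = np.array([1, "x"], dtype=object)
print(try_cast_to_float(mixed).dtype)  # object
```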

cc @jorisvandenbossche @jbrockmendel for any thoughts

@jbrockmendel
Member

The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.
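
The analogy can be checked directly. This is a small sketch using the example frame from the top of the issue; it adds the columns one at a time with `functools.reduce` rather than calling `df.sum`, so it shows what column-wise addition yields, not what `df.sum(axis=1)` returns in any particular pandas version.

```python
import operator
from functools import reduce

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype=object)

# Add the columns one by one, as in df[0] + df[1] + ... + df[N].
columnwise_sum = reduce(operator.add, (df[col] for col in df.columns))

print(columnwise_sum.dtype)     # object
print(columnwise_sum.tolist())  # [5, 7, 9]
```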

@rhshadrach
Member Author

rhshadrach commented Dec 21, 2022

> The 1-dim reduction in AM is tricky (also shows up in 1D EAs, e.g. #42895).

In the linked issue, the dtype starts as Int64, so it makes sense that object is wrong. Here, however, the starting dtype is object; does it make sense for the output to be int rather than object?

> My inclination here is to treat df.sum(axis=1) like df[0] + df[1] + ... + df[N], which we'd expect to retain object dtype in these cases.

Thanks, I find this compelling. What about for something like min that can't be expressed in this fashion?

@rhshadrach
Member Author

In line with #51205 for groupby, I think object dtypes should remain object across all aggregations.
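
To see how a given pandas installation behaves, one could tabulate the result dtype for several reductions on both axes. The helper `reduction_dtypes` below is hypothetical and only for illustration. Under the proposal above every entry would be object, but the actual output depends on the pandas version, so it is not shown here.

```python
import pandas as pd


def reduction_dtypes(df, ops=("sum", "min", "max")):
    """Hypothetical helper: map (op, axis) to the name of the result dtype."""
    return {
        (op, axis): str(getattr(df, op)(axis=axis, numeric_only=False).dtype)
        for op in ops
        for axis in (0, 1)
    }


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype=object)
for key, dtype in reduction_dtypes(df).items():
    print(key, dtype)
```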
