BUG: DataFrame.select_dtypes(include='number') includes BooleanDtype columns #46870

Closed
2 of 3 tasks
macks22 opened this issue Apr 25, 2022 · 6 comments · Fixed by #48169
Assignees
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@macks22

macks22 commented Apr 25, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
})
print(df.select_dtypes(include='number').columns)
# Output: Index(['a', 'b'], dtype='object')

Issue Description

The DataFrame.select_dtypes method includes columns with dtype of BooleanDtype when include='number' is given as a kwarg. This is different than the behavior for columns with dtype of bool and seems like a bug.

Expected Behavior

The code above should output: Index(['a'], dtype='object')
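Until select_dtypes behaves as expected, one possible workaround is to filter columns by dtype directly. The sketch below is only an illustration and not part of the original report: the helper name numeric_non_bool_columns is hypothetical, and it assumes the public pandas.api.types.is_numeric_dtype and is_bool_dtype checks are acceptable for the use case.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def numeric_non_bool_columns(df: pd.DataFrame) -> list:
    """Return labels of numeric columns, skipping both bool and boolean dtypes.

    Hypothetical helper for illustration; not a pandas API.
    """
    return [
        col
        for col, dtype in df.dtypes.items()
        # is_numeric_dtype also accepts boolean dtypes, so exclude them explicitly.
        if is_numeric_dtype(dtype) and not is_bool_dtype(dtype)
    ]


df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
})
numeric_cols = numeric_non_bool_columns(df)
print(df[numeric_cols].columns)
```

Because the filter rejects any dtype that is_bool_dtype recognizes, it treats NumPy-backed bool and BooleanDtype columns the same way.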

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Feb 22 21:10:41 PST 2022; root:xnu-7195.141.26~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 59.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.2.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.4
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

macks22 added the Bug and Needs Triage labels on Apr 25, 2022
@MichaelRocchio

take

@MichaelRocchio

MichaelRocchio commented Apr 28, 2022

Hey @macks22. I was able to reproduce this issue on a different OS as well. I am going to start looking into the source code, specifically the df.select_dtypes method, to see if I can find what's up.

@MichaelRocchio

MichaelRocchio commented Apr 29, 2022

@macks22
I agree that this method has a problem differentiating between what it considers to be bool vs boolean datatypes, which could be a larger issue altogether.

Here are my examples:

In [1]

import pandas as pd

df = pd.DataFrame({
'a': [1, 2, 3],
'b': [True, False, True],
'c': [1.0, 2.0,3.0]})

print(df.select_dtypes(include='number').columns)
print('')
print(df.info())

Out [1]

Index(['a', 'c'], dtype='object')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      bool
 2   c       3 non-null      float64
dtypes: bool(1), float64(1), int64(1)
memory usage: 179.0 bytes
None

In [2]

import pandas as pd

df = pd.DataFrame({
'a': [1, 2, 3],
'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
'c': [1.0, 2.0,3.0]})

print(df.select_dtypes(include='number').columns)
print('')
print(df.info())

Out [2]

Index(['a', 'b', 'c'], dtype='object')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      boolean
 2   c       3 non-null      float64
dtypes: boolean(1), float64(1), int64(1)
memory usage: 182.0 bytes
None

I forked pandas and created the following branch for my debugging:
https://github.com/MichaelRocchio/pandas/tree/issue_46870

Courses of action:

  • Check the source documentation for how bool and boolean are differentiated, and whether they should be.

  • Make .select_dtypes handle boolean dtypes correctly.

  • Check whether the problem is specific to columns created with pandas.BooleanDtype().

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 5, 2022
@simonjayhawkins
Member

Thanks @macks22 for the report.

The DataFrame.select_dtypes method includes columns with dtype of BooleanDtype when include='number' is given as a kwarg.

In 1.2.5, EAs were not considered, so the BooleanDtype column was not included in the result. I will therefore label this as a regression.

This is different than the behavior for columns with dtype of bool and seems like a bug.

Agreed. NumPy-backed bool and EA-backed boolean should be consistent.

first bad commit: [deaf138] BUG: Allow numeric ExtensionDtypes in DataFrame.select_dtypes (#38246)
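To illustrate how a numeric flag on an extension dtype could pull a boolean column into a 'number' selection, here is a simplified model. It is an assumption-laden sketch, not the pandas source: FakeDtype and matches are hypothetical names, and the only detail taken from this thread is that the dtype exposes an _is_numeric property.

```python
import numpy as np


class FakeDtype:
    """Hypothetical stand-in for a dtype object; not a pandas class."""

    def __init__(self, scalar_type, is_numeric=None):
        self.type = scalar_type              # scalar type the dtype wraps
        if is_numeric is not None:
            self._is_numeric = is_numeric    # flag an extension dtype might report


def matches(dtype, wanted):
    """Simplified model: match by scalar-type subclassing, or by the numeric flag."""
    return issubclass(dtype.type, tuple(wanted)) or (
        np.number in wanted and getattr(dtype, "_is_numeric", False)
    )


wanted = {np.number}
plain_bool = FakeDtype(np.bool_)                 # models a NumPy-backed bool column
flagged_bool = FakeDtype(np.bool_, True)         # models a dtype reporting itself numeric
unflagged_bool = FakeDtype(np.bool_, False)      # same dtype with the flag turned off

print(matches(plain_bool, wanted))       # False: np.bool_ is not a subclass of np.number
print(matches(flagged_bool, wanted))     # True: admitted only through the numeric flag
print(matches(unflagged_bool, wanted))   # False
```

In this model the two kinds of boolean column diverge purely because of the flag, which is the inconsistency described above.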

simonjayhawkins added the Regression and ExtensionArray labels and removed the Needs Triage label on May 5, 2022
simonjayhawkins added this to the 1.4.3 milestone on May 5, 2022
@MichaelRocchio

MichaelRocchio commented May 6, 2022

@simonjayhawkins
I was able to solve this problem, but I am not 100% sure how to proceed. For now I will create a pull request from my fork and compare it to main.

The bug was in the following location:
pandas/core/arrays/boolean.py

My change is pushed to https://github.com/MichaelRocchio/pandas/tree/issue_46870

It was pretty small:

    @property
    def _is_numeric(self) -> bool:
        return True

changed to:

    @property
    def _is_numeric(self) -> bool:
        return False

Additionally, I will need help with how to test that my fix didn't cause other issues.
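One way to check the reported behavior itself is a small regression test in the pytest style, sketched below. The test name is hypothetical and the data mirrors the example from the report. It only covers this one case, so running the existing pandas test suite would still be needed to catch side effects elsewhere.

```python
import pandas as pd


def test_select_dtypes_number_excludes_boolean():
    # Hypothetical test name; the frame mirrors the example in the report.
    df = pd.DataFrame({
        'a': [1, 2, 3],
        'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
    })
    result = df.select_dtypes(include='number')
    expected = df[['a']]
    # Passes only if the boolean column is left out of the numeric selection.
    pd.testing.assert_frame_equal(result, expected)
```

On a pandas build that still has the bug, this test is expected to fail because column 'b' is included in the result.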

@simonjayhawkins
Member

moving to 1.4.4

3 participants