BUG: DataFrame.select_dtypes(include='number') includes BooleanDtype columns #46870

macks22 · 2022-04-25T19:37:48Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
})
print(df.select_dtypes(include='number').columns)
# Output: Index(['a', 'b'], dtype='object')

Issue Description

The DataFrame.select_dtypes method includes columns with dtype of BooleanDtype when include='number' is given as a kwarg. This is different than the behavior for columns with dtype of bool and seems like a bug.

Expected Behavior

The code above should output: Index(['a'], dtype='object')
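
Until the behavior is fixed, one possible workaround is to post-filter the selection. This is a minimal sketch, not part of the original report: `select_numeric_non_bool` is a hypothetical helper name, and it assumes `pandas.api.types.is_bool_dtype` reports True for both NumPy `bool` and the nullable `boolean` dtype.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype


def select_numeric_non_bool(df: pd.DataFrame) -> pd.DataFrame:
    """Return numeric columns, dropping boolean-like ones (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # Drop any column whose dtype is boolean-like, whether NumPy- or
    # extension-backed, regardless of what select_dtypes returned.
    keep = [col for col in numeric.columns if not is_bool_dtype(numeric[col].dtype)]
    return numeric[keep]


df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": pd.Series([True, False, True], dtype=pd.BooleanDtype()),
})
result_columns = list(select_numeric_non_bool(df).columns)
print(result_columns)
```

The filter is redundant on versions where select_dtypes already excludes the column, so it should be safe to leave in place across upgrades.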

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Feb 22 21:10:41 PST 2022; root:xnu-7195.141.26~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 59.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.2.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.4
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


MichaelRocchio · 2022-04-28T07:48:34Z

take

MichaelRocchio · 2022-04-28T07:50:38Z

Hey @macks22. I was able to reproduce this issue on a different OS as well. I am going to start looking into the source code, specifically the df.select_dtypes function, to see if I can find what's up.

MichaelRocchio · 2022-04-29T17:46:28Z

@macks22
I agree that this function has trouble differentiating between what it considers to be bool vs boolean datatypes, which could be a larger issue altogether.

Here are my examples:

In [1]

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [True, False, True],
    'c': [1.0, 2.0, 3.0],
})

print(df.select_dtypes(include='number').columns)
print('')
print(df.info())

Out [1]

Index(['a', 'c'], dtype='object')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      bool
 2   c       3 non-null      float64
dtypes: bool(1), float64(1), int64(1)
memory usage: 179.0 bytes
None

In [2]

import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': pd.Series([True, False, True], dtype=pd.BooleanDtype()),
    'c': [1.0, 2.0, 3.0],
})

print(df.select_dtypes(include='number').columns)
print('')
print(df.info())

Out [2]

Index(['a', 'b', 'c'], dtype='object')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      boolean
 2   c       3 non-null      float64
dtypes: boolean(1), float64(1), int64(1)
memory usage: 182.0 bytes
None

I forked pandas and created the following branch for my debugging:
https://github.com/MichaelRocchio/pandas/tree/issue_46870

Courses of action:
Check the source documentation for whether bool and boolean are meant to be treated differently.
Make .select_dtypes handle boolean dtypes consistently with bool.
Check whether the problem is specific to columns created with pandas.BooleanDtype().
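
The last item above could be checked along these lines. This is a rough sketch under the assumption that the listed constructions are the relevant ways to create a nullable boolean column; the labels in `candidates` are illustrative only, and no particular select_dtypes output is claimed here since it depends on the pandas version.

```python
import pandas as pd

values = [True, False, True]

# Several ways to build a nullable boolean column.
candidates = {
    "dtype=pd.BooleanDtype()": pd.Series(values, dtype=pd.BooleanDtype()),
    "dtype='boolean'": pd.Series(values, dtype="boolean"),
    "astype('boolean')": pd.Series(values).astype("boolean"),
    "pd.array(...)": pd.Series(pd.array(values, dtype="boolean")),
}

for label, series in candidates.items():
    df = pd.DataFrame({"a": [1, 2, 3], "b": series})
    selected = list(df.select_dtypes(include="number").columns)
    print(f"{label:26} dtype={series.dtype} selected={selected}")

# Every construction should end up with the same extension dtype.
all_same_dtype = all(s.dtype == pd.BooleanDtype() for s in candidates.values())
print(all_same_dtype)
```

If every row prints the same dtype and the same selection, the behavior is tied to the dtype itself rather than to how the column was created.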

simonjayhawkins · 2022-05-05T13:41:00Z

Thanks @macks22 for the report.

The DataFrame.select_dtypes method includes columns with dtype of BooleanDtype when include='number' is given as a kwarg.

In 1.2.5, EAs were not considered, and therefore the BooleanDtype column was not included in the result, so I will label this as a regression.

This is different than the behavior for columns with dtype of bool and seems like a bug.

Agreed. NumPy-backed bool and EA-backed boolean should be consistent.

first bad commit: [deaf138] BUG: Allow numeric ExtensionDtypes in DataFrame.select_dtypes (#38246)

MichaelRocchio · 2022-05-06T02:13:23Z

@simonjayhawkins
I was able to solve this problem, but I am not 100% sure how to proceed. For now I will create a pull request from my fork and compare it to main.

The bug was in the following location:
pandas/core/arrays/boolean.py

my change is pushed to https://github.com/MichaelRocchio/pandas/tree/issue_46870

It was pretty small:

    @property
    def _is_numeric(self) -> bool:
        return True

changed to:

    @property
    def _is_numeric(self) -> bool:
        return False

Additionally, I will need help testing whether my fix causes other issues.
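
One way to start testing is a small consistency check that can be run before and after the change. This is only a sketch, not the test from the pandas suite: `numeric_columns` and `test_select_dtypes_bool_consistency` are hypothetical names, and the extra `Int64` column is an assumption about what else should be watched, since nullable integers should stay selected on versions that include #38246.

```python
import pandas as pd


def numeric_columns(bool_dtype) -> list:
    """Columns picked by select_dtypes(include='number') for a given boolean dtype."""
    df = pd.DataFrame({
        "a": [1, 2, 3],
        "b": pd.Series([True, False, True], dtype=bool_dtype),
        "c": [1.0, 2.0, 3.0],
        # Nullable integer column, included to catch collateral changes.
        "d": pd.Series([1, 2, 3], dtype="Int64"),
    })
    return list(df.select_dtypes(include="number").columns)


def test_select_dtypes_bool_consistency():
    # NumPy-backed bool and extension-backed boolean should select the same columns.
    assert numeric_columns("bool") == numeric_columns("boolean")


if __name__ == "__main__":
    print("bool    ->", numeric_columns("bool"))
    print("boolean ->", numeric_columns("boolean"))
```

The check is expected to fail on a version that still has the bug and pass once bool and boolean are treated alike. It does not cover other code paths that read `_is_numeric`, so running the existing pandas test suite would still be needed.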

simonjayhawkins · 2022-06-22T13:58:10Z

moving to 1.4.4

macks22 added the Bug and Needs Triage labels Apr 25, 2022

github-actions bot assigned MichaelRocchio Apr 28, 2022

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 5, 2022

code sample for pandas-dev#46870

d31a4c6

simonjayhawkins added the Regression and ExtensionArray labels and removed the Needs Triage label May 5, 2022

simonjayhawkins added this to the 1.4.3 milestone May 5, 2022

MichaelRocchio mentioned this issue May 6, 2022

Fix for issue #46870 DataFrame.select_dtypes(include='number') includes BooleanDtype #46951

Closed

4 tasks

simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 22, 2022

mroeschke mentioned this issue Aug 19, 2022

BUG: DataFrame.select_dtypes(include=number) still included BooleanDtype #48169

Merged

4 tasks

jorisvandenbossche closed this as completed in #48169 Aug 20, 2022
