BUG: Boolean 2d array for setitem with DataFrame incorrect results #18582

dhirschfeld · 2017-12-01T01:47:36Z

According to my understanding of how boolean indexing should work the below test should pass:

import numpy as np
from numpy.random import randn
import pandas as pd

df = pd.DataFrame(randn(10, 5))
df1 = df.copy(); df2 = df.copy()
INVALID = (df < -1)|(df > 1)
df1[INVALID.values] = NaN
df2[INVALID] = NaN
assert df1.equals(df2)

i.e. it shouldn't matter whether or not the boolean mask is a DataFrame or a plain ndarray

Problem description

As can be seen in the picture below, when indexing a DataFrame with a boolean ndarray, rather than
setting the individual True elements entire rows are being set if any value in the row is True.

i.e. what is happening is the result I would expect from df[INVALID.any(axis=1)] = NaN rather than df[INVALID] = NaN

This is different from how indexing an ndarray works and to me at least is very surprising and entirely unexpected. This behaviour was the cause of a very subtle bug in my code :(

Output of `pd.show_versions()`


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.7.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: 2.5.0b1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-12-01T11:18:06Z

the 2d ndarray input hits a path where a 1d array is expected, so this is a bug

Idiomatically and what the DataFrame input hits is .where()

In [1]: df = pd.DataFrame(np.random.randn(10, 5))
   ...: INVALID = (df < -1)|(df > 1)

In [3]: df.where(INVALID.values)
Out[3]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

In [4]: df.where(INVALID)
Out[4]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

Here's a patch to implement the correct conversion (note might fail some existing tests). Alternatively we could raise

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index d3561f8a0..82d9375e5 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2524,10 +2524,10 @@ class DataFrame(NDFrame):
         if indexer is not None:
             return self._setitem_slice(indexer, value)
 
-        if isinstance(key, (Series, np.ndarray, list, Index)):
-            self._setitem_array(key, value)
-        elif isinstance(key, DataFrame):
+        if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
             self._setitem_frame(key, value)
+        elif isinstance(key, (Series, np.ndarray, list, Index)):
+            self._setitem_array(key, value)
         else:
             # set column
             self._set_item(key, value)
@@ -2560,8 +2560,16 @@ class DataFrame(NDFrame):
     def _setitem_frame(self, key, value):
         # support boolean setting with DataFrame input, e.g.
         # df[df > df2] = 0
+
+        if isinstance(key, np.ndarray):
+            if key.shape != self.shape:
+                raise ValueError('Array conditional must be same shape as '
+                                 'self')
+            key = self._constructor(key, **self._construct_axes_dict())
+
         if key.values.size and not is_bool_dtype(key.values):
-            raise TypeError('Must pass DataFrame with boolean values only')
+            raise TypeError('Must pass DataFrame or 2-d ndarray'
+                            'with boolean values only')
 
         self._check_inplace_setting(value)
         self._check_setitem_copy()

jreback · 2017-12-01T11:18:14Z

would love a PR!

dhirschfeld · 2017-12-01T11:27:13Z

would love a PR!

Ha! Seems to be my weekend for getting stuck into pandas so I'll give it a crack...

jreback · 2017-12-01T11:37:53Z

thanks @dhirschfeld

Fixes pandas-dev#18582

jreback added Bug Difficulty Advanced Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 1, 2017

jreback added this to the Next Major Release milestone Dec 1, 2017

jreback changed the title ~~BUG: Boolean ndarray indexing gives incorrect results?~~ BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results Dec 1, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017

Fixed handling of boolean indexing with 2-d ndarrays

bdc08c9

Fixes pandas-dev#18582

dhirschfeld mentioned this issue Dec 5, 2017

BUG: Fixed handling of boolean indexing with 2-d ndarrays #18645

Merged

4 tasks

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Fixed handling of boolean indexing with 2-d ndarrays

18e7a50

Fixes pandas-dev#18582

jreback modified the milestones: Next Major Release, 0.22.0, 0.21.1 Dec 9, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017

Fixed handling of boolean indexing with 2-d ndarrays

d95b7bf

Fixes pandas-dev#18582

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017

Fixed handling of boolean indexing with 2-d ndarrays

fc4849a

Fixes pandas-dev#18582

jreback closed this as completed in #18645 Dec 30, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Boolean 2d array for setitem with DataFrame incorrect results #18582

BUG: Boolean 2d array for setitem with DataFrame incorrect results #18582

dhirschfeld commented Dec 1, 2017

jreback commented Dec 1, 2017

jreback commented Dec 1, 2017

dhirschfeld commented Dec 1, 2017

jreback commented Dec 1, 2017

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

Comments

dhirschfeld commented Dec 1, 2017

Problem description

Output of pd.show_versions()

jreback commented Dec 1, 2017

jreback commented Dec 1, 2017

dhirschfeld commented Dec 1, 2017

jreback commented Dec 1, 2017

BUG: Boolean 2d array for setitem with DataFrame incorrect results #18582

BUG: Boolean 2d array for setitem with DataFrame incorrect results #18582

Output of `pd.show_versions()`