Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

Closed
dhirschfeld opened this issue Dec 1, 2017 · 4 comments · Fixed by #18645
Closed

BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results #18582

dhirschfeld opened this issue Dec 1, 2017 · 4 comments · Fixed by #18645
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@dhirschfeld
Copy link
Contributor

According to my understanding of how boolean indexing should work the below test should pass:

import numpy as np
from numpy.random import randn
import pandas as pd

df = pd.DataFrame(randn(10, 5))
df1 = df.copy(); df2 = df.copy()
INVALID = (df < -1)|(df > 1)
df1[INVALID.values] = NaN
df2[INVALID] = NaN
assert df1.equals(df2)

i.e. it shouldn't matter whether or not the boolean mask is a DataFrame or a plain ndarray

Problem description

As can be seen in the picture below, when indexing a DataFrame with a boolean ndarray, rather than
setting the individual True elements entire rows are being set if any value in the row is True.

i.e. what is happening is the result I would expect from df[INVALID.any(axis=1)] = NaN rather than df[INVALID] = NaN

image

This is different from how indexing an ndarray works and to me at least is very surprising and entirely unexpected. This behaviour was the cause of a very subtle bug in my code :(

Output of pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.7.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: 2.5.0b1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

the 2d ndarray input hits a path where a 1d array is expected, so this is a bug

Idiomatically and what the DataFrame input hits is .where()

In [1]: df = pd.DataFrame(np.random.randn(10, 5))
   ...: INVALID = (df < -1)|(df > 1)

In [3]: df.where(INVALID.values)
Out[3]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

In [4]: df.where(INVALID)
Out[4]: 
          0         1         2         3         4
0  1.444995 -1.274442       NaN  1.252285  1.744645
1       NaN       NaN  1.780958       NaN -1.790128
2       NaN       NaN       NaN  1.737279       NaN
3       NaN  1.854267       NaN  1.742598 -1.475852
4       NaN       NaN  1.284623       NaN  1.002319
5       NaN       NaN       NaN       NaN       NaN
6  1.182883       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN
8       NaN       NaN -2.538289       NaN  1.767622
9 -1.057514  2.161599       NaN       NaN       NaN

Here's a patch to implement the correct conversion (note might fail some existing tests). Alternatively we could raise

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index d3561f8a0..82d9375e5 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -2524,10 +2524,10 @@ class DataFrame(NDFrame):
         if indexer is not None:
             return self._setitem_slice(indexer, value)
 
-        if isinstance(key, (Series, np.ndarray, list, Index)):
-            self._setitem_array(key, value)
-        elif isinstance(key, DataFrame):
+        if isinstance(key, DataFrame) or getattr(key, 'ndim', None) == 2:
             self._setitem_frame(key, value)
+        elif isinstance(key, (Series, np.ndarray, list, Index)):
+            self._setitem_array(key, value)
         else:
             # set column
             self._set_item(key, value)
@@ -2560,8 +2560,16 @@ class DataFrame(NDFrame):
     def _setitem_frame(self, key, value):
         # support boolean setting with DataFrame input, e.g.
         # df[df > df2] = 0
+
+        if isinstance(key, np.ndarray):
+            if key.shape != self.shape:
+                raise ValueError('Array conditional must be same shape as '
+                                 'self')
+            key = self._constructor(key, **self._construct_axes_dict())
+
         if key.values.size and not is_bool_dtype(key.values):
-            raise TypeError('Must pass DataFrame with boolean values only')
+            raise TypeError('Must pass DataFrame or 2-d ndarray'
+                            'with boolean values only')
 
         self._check_inplace_setting(value)
         self._check_setitem_copy()

@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

would love a PR!

@jreback jreback added Bug Difficulty Advanced Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Dec 1, 2017
@jreback jreback added this to the Next Major Release milestone Dec 1, 2017
@jreback jreback changed the title BUG: Boolean ndarray indexing gives incorrect results? BUG: Boolean 2d array for __setitem__ with DataFrame incorrect results Dec 1, 2017
@dhirschfeld
Copy link
Contributor Author

would love a PR!

Ha! Seems to be my weekend for getting stuck into pandas so I'll give it a crack...

@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

thanks @dhirschfeld

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017
@jreback jreback modified the milestones: Next Major Release, 0.22.0, 0.21.1 Dec 9, 2017
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants