BUG: Can't use iloc to set a subset of a dataframe to a two-dimensional categorical data array. #44703

Closed
3 tasks done
mvashishtha opened this issue Dec 1, 2021 · 5 comments · Fixed by #44714
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves

@mvashishtha

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
from numpy import array
df = pd.DataFrame({"status": ["a", "b", "c"]},  dtype="category")
df.iloc[array([0, 1]), array([0])] = array([['a'], ['a']])

Issue Description

I can't use iloc to set a subset of a DataFrame to a two-dimensional categorical data array. The reproducible example produces this stack trace:

TypeError                                 Traceback (most recent call last)
<ipython-input-10-d6ae260d158c> in <module>
    2 from numpy import array
    3 df = pd.DataFrame({"status": ["a", "b", "c"]},  dtype="category")
----> 4 df.iloc[array([0, 1]), array([0])] = array([['a'], ['a']])

~/pandas/pandas/core/indexing.py in __setitem__(self, key, value)
  708
  709         iloc = self if self.name == "iloc" else self.obj.iloc
--> 710         iloc._setitem_with_indexer(indexer, value, self.name)
  711
  712     def _validate_key(self, key, axis: int):

~/pandas/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
 1671             self._setitem_with_indexer_split_path(indexer, value, name)
 1672         else:
-> 1673             self._setitem_single_block(indexer, value, name)
 1674
 1675     def _setitem_with_indexer_split_path(self, indexer, value, name: str):

~/pandas/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
 1919
 1920         # actually do the set
-> 1921         self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
 1922         self.obj._maybe_update_cacher(clear=True, inplace=True)
 1923

~/pandas/pandas/core/internals/managers.py in setitem(self, indexer, value)
  335         For SingleBlockManager, this backs s[indexer] = value
  336         """
--> 337         return self.apply("setitem", indexer=indexer, value=value)
  338
  339     def putmask(self, mask, new, align: bool = True):

~/pandas/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
  302                     applied = b.apply(f, **kwargs)
  303                 else:
--> 304                     applied = getattr(b, f)(**kwargs)
  305             except (TypeError, NotImplementedError):
  306                 if not ignore_failures:

~/pandas/pandas/core/internals/blocks.py in setitem(self, indexer, value)
 1513
 1514         check_setitem_lengths(indexer, value, self.values)
-> 1515         self.values[indexer] = value
 1516         return self
 1517

~/pandas/pandas/core/arrays/_mixins.py in __setitem__(self, key, value)
  249     def __setitem__(self, key, value):
  250         key = check_array_indexer(self, key)
--> 251         value = self._validate_setitem_value(value)
  252         self._ndarray[key] = value
  253

~/pandas/pandas/core/arrays/categorical.py in _validate_setitem_value(self, value)
 1445         if not is_hashable(value):
 1446             # wrap scalars and hashable-listlikes in list
-> 1447             return self._validate_listlike(value)
 1448         else:
 1449             return self._validate_scalar(value)

~/pandas/pandas/core/arrays/categorical.py in _validate_listlike(self, value)
 2080         # something to np.nan
 2081         if len(to_add) and not isna(to_add).all():
-> 2082             raise TypeError(
 2083                 "Cannot setitem on a Categorical with a new "
 2084                 "category, set the categories first"

TypeError: Cannot setitem on a Categorical with a new category, set the categories first

The problem seems to come up only when the DataFrame has a single column. If I add another column to the DataFrame, the same assignment works:

import pandas as pd
from numpy import array
df = pd.DataFrame({"status": ["a", "b", "c"], "status2": ["d", "e", "f"]},  dtype="category")
df.iloc[array([0, 1]), array([0])] = array([['a'], ['a']])

Expected Behavior

I expect to see that the dataframe has the status column set to "a" for the first two rows.

  status
0      a
1      a
2      c
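
The expected behavior can also be written as a check. This is a minimal sketch that assumes a pandas build that already includes the fix (#44714); on the affected commit the assignment raises the TypeError shown above instead. The names `df` and `expected` are only illustrative.

```python
import numpy as np
import pandas as pd

# Single-column categorical frame, as in the reproducible example.
df = pd.DataFrame({"status": ["a", "b", "c"]}, dtype="category")

# Set the first two rows of column 0 from a 2D (2, 1) array.
df.iloc[np.array([0, 1]), np.array([0])] = np.array([["a"], ["a"]])

# Expected result: same categorical dtype, first two rows set to "a".
expected = pd.DataFrame({"status": ["a", "a", "c"]}, dtype=df["status"].dtype)
pd.testing.assert_frame_equal(df, expected)
```

Comparing against a frame built with the original column's dtype also checks that the column stays categorical rather than being cast to object.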

Installed Versions

INSTALLED VERSIONS

commit : 4e139f6
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0.dev0+1273.g4e139f69ff
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.4.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.29.0
sphinx : 4.3.1
blosc : None
feather : None
xlsxwriter : 3.0.2
lxml.etree : 4.6.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.11.0
fastparquet : 0.7.2
gcsfs : 2021.11.0
matplotlib : 3.5.0
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : 2021.11.0
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@mvashishtha mvashishtha added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 1, 2021
@mvashishtha mvashishtha changed the title BUG: Can't use iloc set a subset of a dataframe to a two-dimensional categorical data array. BUG: Can't use iloc to set a subset of a dataframe to a two-dimensional categorical data array. Dec 1, 2021
@jbrockmendel
Member

This would be fixed by having 2D EAs. Until then I think we can kludge together a patch in ExtensionBlock.setitem
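
To illustrate the idea at the array level: a Categorical is one-dimensional, so an (n, 1) value has to be reduced to length n before it can be set. The sketch below is not the actual patch to ExtensionBlock.setitem; `squeeze_single_column` is a hypothetical helper, and the example works on a bare `pd.Categorical` rather than on the block internals.

```python
import numpy as np
import pandas as pd


def squeeze_single_column(value):
    """Hypothetical helper: turn an (n, 1) array into a 1D array of length n."""
    value = np.asarray(value)
    if value.ndim == 2:
        if value.shape[1] != 1:
            raise ValueError("expected a value with exactly one column")
        return value[:, 0]
    return value


cat = pd.Categorical(["a", "b", "c"])
value_2d = np.array([["a"], ["a"]])

# Setting the squeezed 1D value only uses existing categories,
# so no "new category" TypeError is raised.
cat[np.array([0, 1])] = squeeze_single_column(value_2d)
result = list(cat)
```

The same squeeze would not be needed if extension arrays could be two-dimensional, which is the longer-term fix mentioned above.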

@jbrockmendel
Member

Yes, can fix this following #44514

@jreback jreback added this to the 1.4 milestone Dec 22, 2021
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 22, 2021
@mvashishtha
Author

mvashishtha commented Jan 3, 2022

@jreback Thank you for fixing this! Once the fix is in the pandas version that Modin requires, I'll clean up the workaround modin-project/modin#3801.

@jreback
Contributor

jreback commented Jan 3, 2022

hah @jbrockmendel actually made the patch here

@mvashishtha
Author

Thanks @jbrockmendel !

3 participants