
BUG: groupby dropna=False with nan value in groupby causes ValueError when apply() #35889

Closed
2 of 3 tasks
cwkwong opened this issue Aug 25, 2020 · 5 comments · Fixed by #35951
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Milestone

Comments


cwkwong commented Aug 25, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.1.1).

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))
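A possible workaround until this is fixed (a sketch only; the '<missing>' sentinel is an arbitrary choice of mine) is to replace the NaN keys with a sentinel before grouping, so no group key is NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
# Fill NaN keys with a sentinel so every row keeps a (non-NaN) group
df['groups'] = df['groups'].fillna('<missing>')
dfg = df.groupby('groups')
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))
```

This sidesteps the concat step that chokes on the NaN key, at the cost of a sentinel leaking into the result index.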

Problem description

A ValueError is raised on dfg.apply(lambda grp: ...), with the following stack trace:

../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:859: in apply
    result = self._python_apply_general(f, self._selected_obj)
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:894: in _python_apply_general
    return self._wrap_applied_output(
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1230: in _wrap_applied_output
    return self._concat_objects(keys, values, not_indexed_same=not_indexed_same)
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1145: in _concat_objects
    result = concat(
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:274: in concat
    op = _Concatenator(
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:454: in __init__
    self.new_axes = self._get_new_axes()
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:519: in _get_new_axes
    return [
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:520: in <listcomp>
    self._get_concat_axis() if i == self.bm_axis else self._get_comb_axis(i)
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:576: in _get_concat_axis
    concat_axis = _make_concat_multiindex(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

indexes = [RangeIndex(start=0, stop=2, step=1), RangeIndex(start=0, stop=1, step=1), RangeIndex(start=0, stop=1, step=1)]
keys = Index(['a', 'b', nan], dtype='object', name='groups')
levels = [Index(['a', 'b', nan], dtype='object', name='groups')]
names = ['groups']

    def _make_concat_multiindex(indexes, keys, levels=None, names=None) -> MultiIndex:
    
        if (levels is None and isinstance(keys[0], tuple)) or (
            levels is not None and len(levels) > 1
        ):
            zipped = list(zip(*keys))
            if names is None:
                names = [None] * len(zipped)
    
            if levels is None:
                _, levels = factorize_from_iterables(zipped)
            else:
                levels = [ensure_index(x) for x in levels]
        else:
            zipped = [keys]
            if names is None:
                names = [None]
    
            if levels is None:
                levels = [ensure_index(keys)]
            else:
                levels = [ensure_index(x) for x in levels]
    
        if not all_indexes_same(indexes):
            codes_list = []
    
            # things are potentially different sizes, so compute the exact codes
            # for each level and pass those to MultiIndex.from_arrays
    
            for hlevel, level in zip(zipped, levels):
                to_concat = []
                for key, index in zip(hlevel, indexes):
                    mask = level == key
                    if not mask.any():
>                       raise ValueError(f"Key {key} not in level {level}")
E                       ValueError: Key nan not in level Index(['a', 'b', nan], dtype='object', name='groups')

../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:631: ValueError
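The root cause appears to be that _make_concat_multiindex locates each key in the level with a plain equality test (mask = level == key), and NaN never compares equal to anything, including itself. A small demonstration of the mismatch:

```python
import numpy as np
import pandas as pd

level = pd.Index(['a', 'b', np.nan], dtype='object', name='groups')
key = np.nan

# Plain equality never matches NaN, so mask.any() is False and the
# "Key ... not in level ..." ValueError fires even though NaN is in the level:
print((level == key).any())  # False

# Treating NaN positions on both sides as equal does find the key:
mask = (pd.isnull(level) & pd.isnull(key)) | (level == key)
print(mask)  # [False False  True]
```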

Expected Output

No error should be raised. For comparison, if I omit the NaN:

df = pd.DataFrame({
    'groups': ['a', 'a', 'b'],
    'tonnages': [10, 10, 20]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))

Then it works successfully with rv being:

          values
groups          
a      0       0
       1       1
b      0       0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.1.1
numpy : 1.18.4
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : 0.29.17
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@cwkwong cwkwong added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2020

cwkwong commented Aug 25, 2020

I've done some local tweaking in my copy of pandas, in the file pandas/core/reshape/concat.py, starting from line 618:

...
    if not all_indexes_same(indexes):
        codes_list = []

        # things are potentially different sizes, so compute the exact codes
        # for each level and pass those to MultiIndex.from_arrays

        for hlevel, level in zip(zipped, levels):
            to_concat = []
            for key, index in zip(hlevel, indexes):
                # mask = level == key                                         # <-- original
                import pandas as pd                                           # <-- added
                mask = (pd.isnull(level) & pd.isnull(key)) | (level == key)   # <-- added
                if not mask.any():
                    raise ValueError(f"Key {key} not in level {level}")
                # i = np.nonzero(level == key)[0][0]                          # <-- original
                i = np.nonzero(mask)[0][0]                                    # <-- added

                to_concat.append(np.repeat(i, len(index)))
            codes_list.append(np.concatenate(to_concat))

        concat_index = _concat_indexes(indexes)
...

then the ValueError is not raised. With the original example above, rv is:

          values
groups          
a      0       0
       1       1
b      0       0
NaN    0       0
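For reference, the fix landed via #35951 with the 1.2 milestone, so on pandas 1.2 and later the original example can be checked directly. A minimal sanity check (my own sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))

# All four input rows survive, and the NaN group appears in the result index
print(len(rv))                                           # 4
print(rv.index.get_level_values('groups').isna().any())  # True
```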

@charlesdong1991 charlesdong1991 added Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2020
@charlesdong1991

thanks, @cwkwong do you wanna make a PR?

@charlesdong1991 charlesdong1991 added the Apply Apply, Aggregate, Transform, Map label Aug 25, 2020

cwkwong commented Aug 26, 2020

Sure @charlesdong1991 , I'll give it a go. Cheers.


cwkwong commented Aug 28, 2020

WIP, will submit PR by end of today hopefully (note: I'm in Melb, AU timezone)

cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
…5889)

* tests should still fail.

* test dropna=True|False with no np.nan in groupings.

* fix expected outputs, declare expected MultiIndex in resulting
  dataframe after df.group().apply()
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* nans at same positions in `level` and `key` compares as equal.
@charlesdong1991

there is no rush at all, just take your time! ^^ @cwkwong

cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* this makes test pass.

* follow existing style where we create MultiIndex,
  then `set_levels` to reinsert nan for case when
  `dropna=False`, and groups has nan grouping.
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* black pandas

* git diff upstream/master -u -- "*.py" | flake8 --diff
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 29, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 29, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Sep 2, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Sep 5, 2020
@jreback jreback added this to the 1.2 milestone Sep 5, 2020