
BUG: groupby dropna=False with nan value in groupby causes ValueError when apply() #35889

Closed
2 of 3 tasks
cwkwong opened this issue Aug 25, 2020 · 5 comments · Fixed by #35951
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Milestone

Comments


cwkwong commented Aug 25, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.1.1).

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))
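A possible workaround until this is fixed (a sketch only; the '<missing>' sentinel is an arbitrary choice of mine) is to replace the NaN keys with a sentinel before grouping, so no group key is NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
# Fill NaN keys with a sentinel so every row keeps a (non-NaN) group
df['groups'] = df['groups'].fillna('<missing>')
dfg = df.groupby('groups')
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))
```

This sidesteps the concat step that chokes on the NaN key, at the cost of a sentinel leaking into the result index.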

Problem description

A ValueError is raised on dfg.apply(lambda grp: ...), with the following stack trace:

../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:859: in apply
    result = self._python_apply_general(f, self._selected_obj)
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:894: in _python_apply_general
    return self._wrap_applied_output(
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1230: in _wrap_applied_output
    return self._concat_objects(keys, values, not_indexed_same=not_indexed_same)
../../env3.8/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1145: in _concat_objects
    result = concat(
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:274: in concat
    op = _Concatenator(
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:454: in __init__
    self.new_axes = self._get_new_axes()
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:519: in _get_new_axes
    return [
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:520: in <listcomp>
    self._get_concat_axis() if i == self.bm_axis else self._get_comb_axis(i)
../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:576: in _get_concat_axis
    concat_axis = _make_concat_multiindex(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

indexes = [RangeIndex(start=0, stop=2, step=1), RangeIndex(start=0, stop=1, step=1), RangeIndex(start=0, stop=1, step=1)]
keys = Index(['a', 'b', nan], dtype='object', name='groups')
levels = [Index(['a', 'b', nan], dtype='object', name='groups')]
names = ['groups']

    def _make_concat_multiindex(indexes, keys, levels=None, names=None) -> MultiIndex:
    
        if (levels is None and isinstance(keys[0], tuple)) or (
            levels is not None and len(levels) > 1
        ):
            zipped = list(zip(*keys))
            if names is None:
                names = [None] * len(zipped)
    
            if levels is None:
                _, levels = factorize_from_iterables(zipped)
            else:
                levels = [ensure_index(x) for x in levels]
        else:
            zipped = [keys]
            if names is None:
                names = [None]
    
            if levels is None:
                levels = [ensure_index(keys)]
            else:
                levels = [ensure_index(x) for x in levels]
    
        if not all_indexes_same(indexes):
            codes_list = []
    
            # things are potentially different sizes, so compute the exact codes
            # for each level and pass those to MultiIndex.from_arrays
    
            for hlevel, level in zip(zipped, levels):
                to_concat = []
                for key, index in zip(hlevel, indexes):
                    mask = level == key
                    if not mask.any():
>                       raise ValueError(f"Key {key} not in level {level}")
E                       ValueError: Key nan not in level Index(['a', 'b', nan], dtype='object', name='groups')

../../env3.8/lib/python3.8/site-packages/pandas/core/reshape/concat.py:631: ValueError
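The root cause appears to be that _make_concat_multiindex locates each key in the level with a plain equality test (mask = level == key), and NaN never compares equal to anything, including itself. A small demonstration of the mismatch:

```python
import numpy as np
import pandas as pd

level = pd.Index(['a', 'b', np.nan], dtype='object', name='groups')
key = np.nan

# Plain equality never matches NaN, so mask.any() is False and the
# "Key ... not in level ..." ValueError fires even though NaN is in the level:
print((level == key).any())  # False

# Treating NaN positions on both sides as equal does find the key:
mask = (pd.isnull(level) & pd.isnull(key)) | (level == key)
print(mask)  # [False False  True]
```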

Expected Output

No error should be raised. For comparison, if I omit the NaN:

df = pd.DataFrame({
    'groups': ['a', 'a', 'b'],
    'tonnages': [10, 10, 20]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))

Then it works successfully with rv being:

          values
groups          
a      0       0
       1       1
b      0       0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2ca0a2
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.1.1
numpy : 1.18.4
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : 0.29.17
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@cwkwong cwkwong added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2020

cwkwong commented Aug 25, 2020

I've done some local tweaking in my copy of pandas, in the file pandas/core/reshape/concat.py, starting from line 618:

...
    if not all_indexes_same(indexes):
        codes_list = []

        # things are potentially different sizes, so compute the exact codes
        # for each level and pass those to MultiIndex.from_arrays

        for hlevel, level in zip(zipped, levels):
            to_concat = []
            for key, index in zip(hlevel, indexes):
                # mask = level == key                                         # <-- original
                import pandas as pd                                           # <-- added
                mask = (pd.isnull(level) & pd.isnull(key)) | (level == key)   # <-- added
                if not mask.any():
                    raise ValueError(f"Key {key} not in level {level}")
                # i = np.nonzero(level == key)[0][0]                          # <-- original
                i = np.nonzero(mask)[0][0]                                    # <-- added

                to_concat.append(np.repeat(i, len(index)))
            codes_list.append(np.concatenate(to_concat))

        concat_index = _concat_indexes(indexes)
...

then the ValueError is not raised. With the original example above, rv is:

          values
groups          
a      0       0
       1       1
b      0       0
NaN    0       0
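For reference, the fix landed via #35951 with the 1.2 milestone, so on pandas 1.2 and later the original example can be checked directly. A minimal sanity check (my own sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'groups': ['a', 'a', 'b', np.nan],
    'tonnages': [10, 10, 20, 30]
})
dfg = df.groupby('groups', dropna=False)
rv = dfg.apply(lambda grp: pd.DataFrame({'values': list(range(len(grp)))}))

# All four input rows survive, and the NaN group appears in the result index
print(len(rv))                                           # 4
print(rv.index.get_level_values('groups').isna().any())  # True
```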

@charlesdong1991 charlesdong1991 added Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2020
@charlesdong1991

thanks, @cwkwong do you wanna make a PR?

@charlesdong1991 charlesdong1991 added the Apply Apply, Aggregate, Transform, Map label Aug 25, 2020

cwkwong commented Aug 26, 2020

Sure @charlesdong1991 , I'll give it a go. Cheers.


cwkwong commented Aug 28, 2020

WIP, will submit PR by end of today hopefully (note: I'm in Melb, AU timezone)

cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
…5889)

* tests should still fail.

* test dropna=True|False with no np.nan in groupings.

* fix expected outputs, declare expected MultiIndex in resulting
  dataframe after df.group().apply()
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* nans at same positions in `level` and `key` compares as equal.
@charlesdong1991

there is no rush at all, just take your time! ^^ @cwkwong

cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* this makes test pass.

* follow existing style where we create MultiIndex,
  then `set_levels` to reinsert nan for case when
  `dropna=False`, and groups has nan grouping.
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
* black pandas

* git diff upstream/master -u -- "*.py" | flake8 --diff
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 28, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 29, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Aug 29, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Sep 2, 2020
cwkwong added a commit to cwkwong/pandas that referenced this issue Sep 5, 2020
@jreback jreback added this to the 1.2 milestone Sep 5, 2020