BUG: groupby dropna=False with nan value in groupby causes ValueError when apply() #35889
Comments
I've done some local tweaking in my pandas in the file ...
    if not all_indexes_same(indexes):
        codes_list = []
        # things are potentially different sizes, so compute the exact codes
        # for each level and pass those to MultiIndex.from_arrays
        for hlevel, level in zip(zipped, levels):
            to_concat = []
            for key, index in zip(hlevel, indexes):
                # mask = level == key  # <---------- original
                import pandas as pd  # <---------- added
                mask = ((pd.isnull(level) & pd.isnull(key)) | (level == key))  # <---------- added
                if not mask.any():
                    raise ValueError(f"Key {key} not in level {level}")
                # i = np.nonzero(level == key)[0][0]  # <---------- original
                i = np.nonzero(mask)[0][0]  # <---------- added
                to_concat.append(np.repeat(i, len(index)))
            codes_list.append(np.concatenate(to_concat))
        concat_index = _concat_indexes(indexes)
... then the values
groups
a 0 0
1 1
b 0 0
NaN 0 0
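The idea behind the tweak above can be illustrated without pandas: NaN never compares equal to NaN, so a plain `level == key` mask is all False when the key is NaN, and the "Key ... not in level ..." error is raised. The following is only a plain-Python sketch of that comparison logic, not the pandas implementation. The helper names `is_missing`, `naive_position` and `nan_aware_position` are hypothetical and stand in for `pd.isnull` and the mask/`np.nonzero` lookup.

```python
import math


def is_missing(value):
    """Rough scalar stand-in for pd.isnull: True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def naive_position(level, key):
    """Find key in level with plain equality, mirroring the original mask."""
    mask = [item == key for item in level]
    if not any(mask):
        raise ValueError(f"Key {key} not in level {level}")
    return mask.index(True)


def nan_aware_position(level, key):
    """Find key in level, treating a missing key as equal to a missing entry."""
    mask = [(is_missing(item) and is_missing(key)) or item == key for item in level]
    if not any(mask):
        raise ValueError(f"Key {key} not in level {level}")
    return mask.index(True)


level = ["a", "b", float("nan")]
key = float("nan")

# Plain equality cannot locate the NaN key, because NaN != NaN.
try:
    naive_position(level, key)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed)
print(nan_aware_position(level, key))
```

With the NaN-aware mask, the NaN key resolves to the position of the NaN entry in the level instead of raising.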
thanks, @cwkwong do you wanna make a PR?
Sure @charlesdong1991, I'll give it a go. Cheers.
WIP, will submit PR by end of today hopefully (note: I'm in Melb, AU timezone)
…5889)
* tests should still fail.
* test dropna=True|False with no np.nan in groupings.
* fix expected outputs, declare expected MultiIndex in resulting dataframe after df.group().apply()
* nans at same positions in `level` and `key` compares as equal.
there is no rush at all, just take your time! ^^ @cwkwong
* this makes test pass.
* follow existing style where we create MultiIndex, then `set_levels` to reinsert nan for case when `dropna=False`, and groups has nan grouping.
* black pandas
* git diff upstream/master -u -- "*.py" | flake8 --diff
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas, version 1.1.1.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
ValueError raised on dfg.apply(lambda grp: ...), with the following stacktrace:
Expected Output
No error should be raised. With the above, if I omit the nan:
Then it works successfully with rv being:
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f2ca0a2
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.1.1
numpy : 1.18.4
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : 0.29.17
pytest : 5.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.1
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 1.8.6
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None