Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistent replace #35376

Closed
3 tasks done
qlieumontadv opened this issue Jul 22, 2020 · 21 comments · Fixed by #36444
Closed
3 tasks done

BUG: inconsistent replace #35376

qlieumontadv opened this issue Jul 22, 2020 · 21 comments · Fixed by #36444
Labels
Bug Regression Functionality that used to work in a prior pandas version replace replace method
Milestone

Comments

@qlieumontadv
Copy link

qlieumontadv commented Jul 22, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Problem description

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Problem description

Maybe I don't understand somethink or this is just non-sens

Expected Output

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  1.0
1  2  2.0

Or

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-62-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.2
lxml.etree : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.50.1

@qlieumontadv qlieumontadv added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 22, 2020
@erfannariman
Copy link
Member

Can confirm this occurs on master as well.

@asishm
Copy link
Contributor

asishm commented Jul 22, 2020

I'm guessing this has something to do with downcasting? floats can be downcasted to ints which is why the replace using floats works for both cases but for ints it only replaces ints

@simonjayhawkins simonjayhawkins added replace replace method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 22, 2020
@simonjayhawkins
Copy link
Member

Thanks @asishm for the report. This definitely looks like a bug. from https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.replace.html

numeric: numeric values equal to to_replace will be replaced with value

so the definition of equal needs clarification.

but if we consider pandas' notion of equals, just like python, 1 == 1.0

>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1
       0      1
0   True   True
1  False  False
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1.0
       0      1
0   True   True
1  False  False
>>>

I would therefore expect, for consistency, the result to be

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

@simonjayhawkins
Copy link
Member

0.25.3 was giving the expected output, so marking as regression for now pending further invesitigation on whether the change was intentional

>>> pd.__version__
'0.25.3'
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0
>>>

@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jul 22, 2020
@simonjayhawkins
Copy link
Member

the behaviour changed with #27768, so not intentional.

01f90c1 is the first bad commit
commit 01f90c1
Author: jbrockmendel jbrockmendel@gmail.com
Date: Mon Aug 12 11:58:42 2019 -0700

CLN: short-circuit case in Block.replace (#27768)

cc @jbrockmendel

@qlieumontadv
Copy link
Author

After a few checks, I've find that this lines are at the bug source (from @simonjayhawkins ) :

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

With this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

Without this lines, I have :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

@qlieumontadv
Copy link
Author

In fact with the 4 lines, I can force changes by placing 1.0 between brackets :

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace([1.0], 5)
   0    1
0  5  5.0
1  2  2.0

@QuentinN42
Copy link
Contributor

Edit: qlieumontadv was my company account, I will do a PR with my personal account (this one :)

@QuentinN42
Copy link
Contributor

@jbrockmendel can we remove this lines without affecting the results ?

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

I have the impression that it allows to increase performance by not testing some cases as explained in the comment that was deleted during this commit :

# TODO: we should be able to infer at this point that there is
#  nothing to replace

@jbrockmendel
Copy link
Member

can we remove this lines without affecting the results ?

The line before the ones you quoted is if not self._can_hold_element(to_replace):, so we should only get here if the array doesn't contain to_replace. If we are getting here with the example from the OP, that suggests a problem in _can_hold_element

@QuentinN42
Copy link
Contributor

QuentinN42 commented Sep 17, 2020

After some tests, I may have found the problem:
In the pandas/core/internals/blocks.py file, in the IntBlock class, if the maybe_infer_dtype_type function returns None (which is normal), it will return is_integer(element).
By replacing it with is_integer(element) or is_float(element) this case is covered.

I'm not very satisfied with this fix, I feel it's more of a DIY fix.
May I open a PR ?

@QuentinN42
Copy link
Contributor

Here is the changes I made :
is_float added to IntBlock._can_hold_element gist

@jbrockmendel
Copy link
Member

By replacing it with is_integer(element) or is_float(element) this case is covered.

is_float(element) is a start, but don't we only want to include floats that are int-like? so (is_float(element) and element.is_integer()). to be even more careful we should check that casting to the target dtype is lossless

QuentinN42 added a commit to QuentinN42/pandas that referenced this issue Sep 18, 2020
Added return is_integer(element) or is_float(element) to the IntBlock._can_hold_element method because an Block of ints can be replaced from int.
Read pandas-dev#35376 for more info
QuentinN42 added a commit to QuentinN42/pandas that referenced this issue Sep 18, 2020
As @jbrockmendel said in pandas-dev#35376, you can replace an int by a float in an IntBlock only if the element is an integer.
@QuentinN42
Copy link
Contributor

I've create a PR for this.
I've NOT run any tests or code style checks !

@QuentinN42
Copy link
Contributor

It might be usefull to add some unit tests for this.

@simonjayhawkins
Copy link
Member

It might be usefull to add some unit tests for this.

This is required as part of any PR. use the example in the OP as a test.

@simonjayhawkins simonjayhawkins added this to the 1.1.3 milestone Sep 18, 2020
@QuentinN42
Copy link
Contributor

Sorry but what is an OP ? 😞

@simonjayhawkins
Copy link
Member

it normally means 'original poster'. I actually meant the first post. so use the code sample in #35376 (comment)

@QuentinN42
Copy link
Contributor

Ok, thx 😄

@QuentinN42
Copy link
Contributor

I'm working on the test for the PR and I want to put # GH xxx in comment, is it the issue or the PR number ?
Aka # GH 35376 or # GH 36444 ?

@simonjayhawkins
Copy link
Member

the issue number, i.e. 35376

QuentinN42 added a commit to QuentinN42/pandas that referenced this issue Sep 18, 2020
Added the pandas-dev#35376 error as a test.
@jreback jreback changed the title BUG: inconsistant replace BUG: inconsistent replace Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Regression Functionality that used to work in a prior pandas version replace replace method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants