BUG: inconsistent replace #35376

qlieumontadv · 2020-07-22T09:37:32Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Problem description

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Problem description

Maybe I don't understand something, or this is just nonsense.

Expected Output

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  1.0
1  2  2.0

Or

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-62-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : 3.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : 0.6.2
lxml.etree : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.50.1


erfannariman · 2020-07-22T13:21:06Z

Can confirm this occurs on master as well.

asishm · 2020-07-22T13:28:01Z

I'm guessing this has something to do with downcasting? Floats can be downcast to ints, which is why the replace using floats works for both cases, but for ints it only replaces ints.

simonjayhawkins · 2020-07-22T15:11:59Z

Thanks @asishm for the report. This definitely looks like a bug. From https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.replace.html:

numeric: numeric values equal to to_replace will be replaced with value

So the definition of "equal" needs clarification.

But if we consider pandas' notion of equality, then just as in Python, 1 == 1.0:

>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1
       0      1
0   True   True
1  False  False
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]) == 1.0
       0      1
0   True   True
1  False  False
>>>

I would therefore expect, for consistency, the result to be

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0
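
The consistency argument can be made concrete with a small pure-Python model. This is only an illustrative sketch, not how pandas implements `replace`: `replace_by_equality` is a hypothetical helper that replaces every cell comparing equal (`==`) to `to_replace`, and casting the new value to the cell's own type is an assumption chosen to mirror the int and float columns above.

```python
def replace_by_equality(rows, to_replace, value):
    """Replace cells equal to `to_replace`, keeping each cell's own type.

    Hypothetical helper for illustration only; not part of pandas.
    """
    return [
        [type(cell)(value) if cell == to_replace else cell for cell in row]
        for row in rows
    ]


rows = [[1, 1.0], [2, 2.0]]

# Python treats 1 and 1.0 as equal, so both spellings select the same cells.
assert replace_by_equality(rows, 1, 5) == replace_by_equality(rows, 1.0, 5)
print(replace_by_equality(rows, 1.0, 5))  # [[5, 5.0], [2, 2.0]]
```

Under this equality-based reading, the spelling of `to_replace` (1 or 1.0) cannot change which cells are replaced.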

simonjayhawkins · 2020-07-22T15:20:46Z

0.25.3 was giving the expected output, so marking as a regression for now, pending further investigation into whether the change was intentional.

>>> pd.__version__
'0.25.3'
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0
>>>
>>> pd.DataFrame([[1, 1.0], [2, 2.0]]).replace(1, 5)
   0    1
0  5  5.0
1  2  2.0
>>>

simonjayhawkins · 2020-07-22T19:30:58Z

the behaviour changed with #27768, so not intentional.

01f90c1 is the first bad commit
commit 01f90c1
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Mon Aug 12 11:58:42 2019 -0700

CLN: short-circuit case in Block.replace (#27768)

cc @jbrockmendel

qlieumontadv · 2020-08-17T13:38:26Z

After a few checks, I've found that these lines are the source of the bug (from the commit @simonjayhawkins pointed to):

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

With these lines, I get:

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  1  5.0
1  2  2.0

Without these lines, I get:

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace(1.0, 5)
   0    1
0  5  5.0
1  2  2.0

qlieumontadv · 2020-08-17T13:42:03Z

In fact, with the 4 lines in place, I can force the replacement by wrapping 1.0 in a list:

>>> pd.DataFrame([[1,1.0],[2,2.0]]).replace([1.0], 5)
   0    1
0  5  5.0
1  2  2.0

QuentinN42 · 2020-09-09T05:58:39Z

Edit: qlieumontadv was my company account; I will open a PR with my personal account (this one :)

QuentinN42 · 2020-09-09T07:08:14Z

@jbrockmendel can we remove these lines without affecting the results?

if not isinstance(to_replace, list):
    if inplace:
        return [self]
    return [self.copy()]

My impression is that they improve performance by skipping some cases, as explained in the comment that was deleted in this commit:

# TODO: we should be able to infer at this point that there is
#  nothing to replace

jbrockmendel · 2020-09-09T19:26:21Z

can we remove these lines without affecting the results?

The line before the ones you quoted is `if not self._can_hold_element(to_replace):`, so we should only get here if the array doesn't contain `to_replace`. If we are getting here with the example from the OP, that suggests a problem in `_can_hold_element`.

QuentinN42 · 2020-09-17T20:04:37Z

After some tests, I may have found the problem:
In the pandas/core/internals/blocks.py file, in the IntBlock class, when the `maybe_infer_dtype_type` function returns None (which is normal), `_can_hold_element` returns `is_integer(element)`.
Replacing that with `is_integer(element) or is_float(element)` covers this case.

I'm not very satisfied with this fix; it feels more like a workaround.
May I open a PR?
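
The diagnosis above can be reproduced with a toy model. This is a minimal sketch and not the pandas source: `ToyIntBlock` is a hypothetical class, and its strict integer-only `_can_hold_element` and the early return in `replace` are simplified stand-ins for the behaviour described in this thread.

```python
class ToyIntBlock:
    """Toy stand-in for an integer block; NOT the pandas implementation."""

    def __init__(self, values):
        self.values = list(values)

    def _can_hold_element(self, element):
        # Strict check modelled on the diagnosis in the thread:
        # only true integers are accepted (bool excluded for simplicity).
        return isinstance(element, int) and not isinstance(element, bool)

    def replace(self, to_replace, value):
        if not self._can_hold_element(to_replace):
            # Short-circuit: assume nothing can match and return a copy.
            return ToyIntBlock(self.values)
        return ToyIntBlock(value if v == to_replace else v for v in self.values)


block = ToyIntBlock([1, 2])
print(block.replace(1, 5).values)    # [5, 2]
print(block.replace(1.0, 5).values)  # [1, 2]  (1.0 is rejected, so nothing changes)
```

Although `1 == 1.0`, the float never reaches the comparison, because the guard rejects it first. That matches the symptom in the first post, where `replace(1.0, 5)` left the integer column untouched.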

QuentinN42 · 2020-09-17T20:13:19Z

Here are the changes I made:
is_float added to IntBlock._can_hold_element (gist)

jbrockmendel · 2020-09-18T01:54:58Z

By replacing it with is_integer(element) or is_float(element) this case is covered.

`is_float(element)` is a start, but don't we only want to include floats that are int-like? So `(is_float(element) and element.is_integer())`. To be even more careful, we should check that casting to the target dtype is lossless.
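
The stricter check suggested here might look like the following pure-Python sketch. `can_hold_in_int_block` and its `bits` parameter are hypothetical names introduced for illustration; this is not the code that was merged into pandas, and rejecting booleans is an assumption made to keep the example simple.

```python
def can_hold_in_int_block(element, bits=64):
    """Return True if `element` could be stored losslessly as a signed integer.

    Hypothetical helper sketching the suggestion above; not pandas code.
    """
    if isinstance(element, bool):
        return False  # assumption: booleans are not treated as integers here
    if isinstance(element, float):
        if not element.is_integer():  # rejects 1.5, nan, and inf
            return False
        element = int(element)
    elif not isinstance(element, int):
        return False
    # Lossless-cast check: the value must fit in the target integer width.
    return -(2 ** (bits - 1)) <= element < 2 ** (bits - 1)


print(can_hold_in_int_block(1.0))   # True
print(can_hold_in_int_block(1.5))   # False
```

The range check is what separates "int-like" from "losslessly castable": `1e300` is an int-like float, but it does not fit in a 64-bit integer.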

Added return is_integer(element) or is_float(element) to the IntBlock._can_hold_element method because an Block of ints can be replaced from int. Read pandas-dev#35376 for more info

@jbrockmendel

As @jbrockmendel said in pandas-dev#35376, you can replace an int by a float in an IntBlock only if the element is an integer.

QuentinN42 · 2020-09-18T06:56:23Z

I've created a PR for this.
I have NOT run any tests or code style checks!

QuentinN42 · 2020-09-18T06:57:32Z

It might be useful to add some unit tests for this.

simonjayhawkins · 2020-09-18T08:29:45Z

It might be useful to add some unit tests for this.

This is required as part of any PR. Use the example in the OP as a test.

QuentinN42 · 2020-09-18T08:41:33Z

Sorry, but what is an OP? 😞

simonjayhawkins · 2020-09-18T08:47:19Z

It normally means 'original poster'. I actually meant the first post, so use the code sample in #35376 (comment).

QuentinN42 · 2020-09-18T08:48:56Z

Ok, thx 😄

QuentinN42 · 2020-09-18T12:37:22Z

I'm working on the test for the PR and I want to put # GH xxx in a comment. Is it the issue number or the PR number?
That is, # GH 35376 or # GH 36444?

simonjayhawkins · 2020-09-18T12:57:23Z

the issue number, i.e. 35376
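
A regression test built from the first post could look roughly like this. It is a sketch under assumptions, not the test that was merged: the function name is hypothetical, it uses the public `pandas.testing` module, and it expects the 0.25.3 output quoted earlier in this thread.

```python
import pandas as pd
import pandas.testing as tm


def test_replace_int_like_float_matches_int_column():
    # GH 35376
    df = pd.DataFrame([[1, 1.0], [2, 2.0]])
    expected = pd.DataFrame([[5, 5.0], [2, 2.0]])

    # 1 and 1.0 compare equal, so both spellings should give the same result.
    tm.assert_frame_equal(df.replace(1, 5), expected)
    tm.assert_frame_equal(df.replace(1.0, 5), expected)
```

On a pandas version affected by the regression, the second assertion is the one expected to fail.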

Added the pandas-dev#35376 error as a test.

qlieumontadv added the Bug and Needs Triage labels Jul 22, 2020

simonjayhawkins added the replace label and removed the Needs Triage label Jul 22, 2020

simonjayhawkins added the Regression label Jul 22, 2020

QuentinN42 added a commit to QuentinN42/pandas that referenced this issue Sep 18, 2020

BUG: is_float(e) and e.is_integer() added to IntBlock._can_hold_element

3222d60

As @jbrockmendel said in pandas-dev#35376, you can replace an int by a float in an IntBlock only if the element is an integer.

QuentinN42 mentioned this issue Sep 18, 2020

BUG: inconsistent replace #36444

Merged

5 tasks

simonjayhawkins added this to the 1.1.3 milestone Sep 18, 2020

QuentinN42 added a commit to QuentinN42/pandas that referenced this issue Sep 18, 2020

replace test added

ff63b40

Added the pandas-dev#35376 error as a test.

simonjayhawkins mentioned this issue Sep 19, 2020

BUG: Fix replace for different dtype equal value #34878

Closed

5 tasks

jreback changed the title ~~BUG: inconsistant replace~~ BUG: inconsistent replace Sep 22, 2020

jreback closed this as completed in #36444 Sep 26, 2020
