Using df.at() for assignment changes views of certain columns, but not all of them #22372

Closed
liesb opened this issue Aug 15, 2018 · 9 comments · Fixed by #45287
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@liesb

liesb commented Aug 15, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

def f2(seed=789, niter=10):
    rng = np.random.RandomState(seed=seed)
    df = pd.DataFrame(index=np.arange(5))
    df['x'] = rng.uniform(0, 2, size=5)
    df['y'] = rng.uniform(0, 2, size=5)
    df['cost'] = rng.uniform(0, 10, size=5)
    # print the initial df
    print(df)
    # for loop: replace the worst param combo and evaluate
    for i in range(niter):
        # idx of param combo with highest cost
        worst_idx = np.argmax(df.cost.values)
        # IF line below is commented out, I don't get the discrepancy
        centroid = df.loc[rng.choice(np.arange(5), size=1)].mean(axis=0)
        for p in ['x', 'y']:
            new_param = 4
            # THIS IS THE OFFENDING CALL to df.at!
            # Change it to .loc and it doesn't give the same issue
            df.at[worst_idx, p] = new_param
            # df.loc[worst_idx, p] = new_param
        # The at below is not the offending at
        df.at[worst_idx, 'cost'] = rng.uniform(low=0, high=10)
        # df.loc[worst_idx, 'cost'] = rng.uniform(low=0, high=10, size=(1,))
    return df

test = f2()
test
test.cost
test[['cost']]
test.iloc[4]

Problem description

The cost column is shown incorrectly when we display test. It is correct in test.cost, but not in test[['cost']] or test.iloc[4]. The other columns seem to be displayed correctly.
This doesn't happen when we use .loc to assign new_param to the x and y columns.
It also doesn't happen when we comment out centroid = df.loc[rng.choice(np.arange(5), size=1)].mean(axis=0).

The current behaviour is problematic because different ways of viewing the same data yield different answers, which shouldn't be the case.
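
The .loc variant described above can be sketched on a one-row frame. This is only an illustration: the frame contents, the value 789 and the variable names are made up for the example, and whether it sidesteps the problem on a given pandas version is something to check rather than assume.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame, built column by column as in the report.
df = pd.DataFrame(index=[0])
df['x'] = 1
df['cost'] = 2

idx = np.argmax(df.cost.values)   # position of the highest cost
centroid = df.loc[[0]]            # list-based .loc read, as in the report

# Label-based assignment with .loc instead of .at
df.loc[idx, 'x'] = 4
df.loc[idx, 'cost'] = 789

print(df['cost'])     # column read as a Series
print(df[['cost']])   # column read as a one-column frame
```

If both prints show the same value, the two ways of reading the column agree for this assignment path.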

Expected Output

  • The cost column in test would correspond to test.cost.
  • test[['cost']] and test.cost give the same results
  • test.iloc[4] displays the element of test.cost at position 4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 36.4.0
Cython: None
numpy: 1.15.0
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd added Usage Question Needs Info Clarification about behavior needed to assess issue labels Aug 15, 2018
@WillAyd
Member

WillAyd commented Aug 15, 2018

Quite a few things are being thrown out there, but unfortunately it's not explicit enough which issue(s) you are trying to address. Also, the expectation that test[['cost']] and test.cost are the same is incorrect, as the former returns a frame whereas the latter returns a series.

Please refer to the contributing guide on how to open bug reports. The biggest thing here is to post a minimally reproducible example:

https://pandas.pydata.org/pandas-docs/stable/contributing.html#bug-reports-and-enhancement-requests

If you can provide the minimally reproducible example and clarify what exactly the issue is, feel free to reopen.

@WillAyd WillAyd closed this as completed Aug 15, 2018
@liesb
Author

liesb commented Aug 15, 2018 via email

@jreback
Contributor

jreback commented Aug 15, 2018

@liesb please post a minimal example; you have lots of unrelated things going on.
It's actually very hard to tell if anything is amiss.

@liesb
Author

liesb commented Aug 15, 2018

Minimal example: I'm changing the value in the cost column of df from 2 to 789. However, I don't see this when I use df, df[['cost']] or df.iloc[0]; it is visible when using df.cost.

import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0])
df['x'] = 1
df['cost'] = 2
print(df)
idx = np.argmax(df.cost.values)
centroid = df.loc[[0]]
new_param = 4
df.at[idx, 'x'] = new_param
df.at[idx, 'cost'] = 789
print(df)
df.cost 
df[['cost']]
df.iloc[0]

@liesb
Author

liesb commented Aug 15, 2018

I'm adding the pd.show_versions() output again as I used a different machine:

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd WillAyd reopened this Aug 15, 2018
@WillAyd WillAyd added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Info Clarification about behavior needed to assess issue Usage Question labels Aug 15, 2018
@WillAyd
Member

WillAyd commented Aug 15, 2018

OK thanks for the example. Just to remove unnecessary prints and make it more concise:

df = pd.DataFrame(index=[0])
df['x'] = 1
df['cost'] = 2
idx = np.argmax(df.cost.values)
centroid = df.loc[[0]]
new_param = 4
df.at[idx, 'x'] = new_param
df.at[idx, 'cost'] = 789

print(df['cost'])  # Prints 789
print(df[['cost']])  # Prints 2

It is strange that the frame shows the original value before the .at assignment.
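
A discrepancy like this can be checked programmatically instead of by eye. The helper below is a minimal sketch: `column_views_agree` is a hypothetical name, not part of pandas, and it only compares three read paths for a single column.

```python
import numpy as np
import pandas as pd

def column_views_agree(df, col):
    """Return True if three ways of reading `col` give the same values."""
    as_series = df[col].to_numpy()            # df['cost'] style access
    as_frame = df[[col]].to_numpy()[:, 0]     # df[['cost']] style access
    pos = df.columns.get_loc(col)             # assumes unique column labels
    by_position = df.iloc[:, pos].to_numpy()  # positional access
    return bool(
        np.array_equal(as_series, as_frame)
        and np.array_equal(as_series, by_position)
    )

# Same construction and assignments as the snippet above.
df = pd.DataFrame(index=[0])
df['x'] = 1
df['cost'] = 2
idx = np.argmax(df.cost.values)
centroid = df.loc[[0]]
df.at[idx, 'x'] = 4
df.at[idx, 'cost'] = 789

print(column_views_agree(df, 'cost'))
```

On a version affected by this issue the check would be expected to report a mismatch for 'cost'; on a version with the fix it should report agreement.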

Note that almost any change that makes the construction more idiomatic or removes elements makes the problem go away. For example, even doing:

df = pd.DataFrame([[1, 2]], columns=['x', 'cost'], index=[0])
idx = np.argmax(df.cost.values)
centroid = df.loc[[0]]
new_param = 4
df.at[idx, 'x'] = new_param
df.at[idx, 'cost'] = 789

print(df['cost'])  # Prints 789
print(df[['cost']])  # Prints 789

My guess is there is some wonky state management going on. Investigation and PRs are always welcome

@WillAyd WillAyd added Bug and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 15, 2018
@mroeschke mroeschke added the Indexing Related to indexing on series/frames, not to indexes themselves label Nov 2, 2019
@jbrockmendel
Member

df._item_cache needs to be cleared, not sure exactly where

@jbrockmendel
Member

Looks like the centroid = df.loc[[0]] line is triggering a "consolidate_inplace" without clearing the _item_cache.

@jbrockmendel
Member

_getitem_iterable calls self.obj._reindex_with_indexers which calls mgr.reindex_indexer which calls _consolidate_inplace
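
The chain above suggests how a cached column can go stale. The following is a toy model only, not pandas internals: `ToyFrame`, its methods and the `clear_cache` flag are hypothetical names, and the model assumes that consolidation copies the column data into a new array while a scalar setter writes through the cached column.

```python
import numpy as np

class ToyFrame:
    """Toy column store with a per-column cache (not pandas internals)."""

    def __init__(self, columns):
        # One separate array ("block") per column.
        self._blocks = {name: np.array(vals) for name, vals in columns.items()}
        self._item_cache = {}

    def get_column(self, name):
        # The cached object shares memory with the block it was taken from.
        if name not in self._item_cache:
            self._item_cache[name] = self._blocks[name]
        return self._item_cache[name]

    def consolidate(self, clear_cache):
        # Copy all columns into one new 2-D array; blocks become rows of it.
        names = list(self._blocks)
        merged = np.vstack([self._blocks[n] for n in names])
        self._blocks = {n: merged[i] for i, n in enumerate(names)}
        if clear_cache:
            self._item_cache.clear()

    def set_at(self, row, name, value):
        # Scalar setter that writes through the cached column.
        self.get_column(name)[row] = value

    def as_table(self):
        return {n: arr.tolist() for n, arr in self._blocks.items()}

def demo(clear_cache):
    frame = ToyFrame({"x": [1], "cost": [2]})
    frame.get_column("cost")        # populate the cache
    frame.consolidate(clear_cache)  # data moves to a new array
    frame.set_at(0, "cost", 789)
    return frame.get_column("cost").tolist(), frame.as_table()["cost"]

print(demo(clear_cache=False))  # ([789], [2]): cached column and table disagree
print(demo(clear_cache=True))   # ([789], [789]): both agree
```

In this model, clearing the cache at the point of consolidation is enough to keep the two read paths consistent, which matches the direction of the comments above.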

5 participants