
"import pandas" changes NumPy error-handling settings #13109

Closed
mdickinson opened this issue May 7, 2016 · 13 comments
Labels
API Design Compat pandas objects compatability with Numpy or Python functions
mdickinson commented May 7, 2016

A simple import pandas apparently does a np.seterr(all='ignore'): https://github.com/pydata/pandas/blob/23eb483d17ce41c0fe41b0bfa72c90df82151598/pandas/compat/numpy/__init__.py#L8-L9

This is a problem when using Pandas as a library, particularly during testing: a test (completely unrelated to Pandas) that should have produced warnings can fail to do so just because some earlier test happened to import a library that imported a library that imported something from Pandas (for example). Or a test runner that's done a np.seterr(all='raise') specifically to catch potential issues can end up catching nothing, because some Pandas import part-way through the test run turned the error handling off again.
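
The clobbering described above can be reproduced without pandas itself; in the sketch below, an explicit np.seterr(all='ignore') stands in for what the pandas import did at the time:

```python
# A sketch of the clobbering described above; np.seterr(all='ignore')
# stands in for the import-time side effect of "import pandas".
import warnings
import numpy as np

np.seterr(all='warn')  # e.g. a test runner opting in to FP warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.array([1.0]) / np.array([0.0])  # emits RuntimeWarning as expected
assert len(caught) == 1

np.seterr(all='ignore')  # the import-time side effect, simulated

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.array([1.0]) / np.array([0.0])  # same code, warning silently gone
assert len(caught) == 0
```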

I'm working around this right now by wrapping every pandas import in a with np.errstate(): block. For example:

with np.errstate():
    from pandas import DataFrame

But that's (a) ugly, and (b) awkward if the pandas import is in a third-party library out of your control.
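
For the third-party case, another workaround (a sketch, not an official recommendation) is to snapshot and restore the global error state around the import. The function below is a hypothetical stand-in for the offending import, so the sketch runs without pandas installed:

```python
import numpy as np

def import_that_clobbers_seterr():
    # hypothetical stand-in for "import pandas" (which called np.seterr)
    np.seterr(all='ignore')

saved = np.geterr()              # remember the application's settings
import_that_clobbers_seterr()    # the import you don't control
np.seterr(**saved)               # put things back
assert np.geterr() == saved
```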

Please consider removing this feature!

Code Sample, a copy-pastable example if possible

Enthought Canopy Python 2.7.11 | 64-bit | (default, Mar 18 2016, 14:49:17) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy, pandas
>>> 1.0 / numpy.arange(10)
array([        inf,  1.        ,  0.5       ,  0.33333333,  0.25      ,
        0.2       ,  0.16666667,  0.14285714,  0.125     ,  0.11111111])

Expected Output

I expected to see a RuntimeWarning from the division by zero above, as in:

Enthought Canopy Python 2.7.11 | 64-bit | (default, Mar 18 2016, 14:49:17) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> 1.0 / numpy.arange(10)
__main__:1: RuntimeWarning: divide by zero encountered in divide
array([        inf,  1.        ,  0.5       ,  0.33333333,  0.25      ,
        0.2       ,  0.16666667,  0.14285714,  0.125     ,  0.11111111])

output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.6.7
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
@TomAugspurger (Contributor)

This came up somewhat recently in #12464, but I don't think anything came out of it.

jreback (Contributor) commented May 7, 2016

@mdickinson this is very, very common to do. The entire point of pandas is to turn things into NaN and propagate them.

It is possible to turn this off by having practically every operation deferred to numpy wrap the call in a context manager (rather than setting the global state).

It wouldn't be that invasive (it could be done with a decorator), but it adds complexity, may be somewhat non-performant, and it's a fair amount of work.

If you want to give this a stab, then I'll reopen.

jreback closed this as completed May 7, 2016
jreback added this to the No action milestone May 7, 2016
jreback added the Compat pandas objects compatability with Numpy or Python functions label May 7, 2016
@mdickinson (Author)

The entire point of pandas is to turn things to NaN AND propogate them.

Sure, and that's fine when you're using Pandas to do data analysis at a command line. It's a long way from fine when you're using Pandas as a library for a couple of small tasks (in this particular case, flexible reading from .csv files) within a larger application. Then you're in the situation where one of your application dependencies has unilaterally and silently turned off NumPy warnings for the entire application, oblivious to the needs of that application or any of its many other libraries. That's just rude. :-) Deciding to turn off NumPy warnings (or any warnings, for that matter) globally is a decision that should be made at application level, not at the level of one particular library.

It's interesting to compare with the seaborn library, where import seaborn changes several matplotlib settings. Again, that's a useful thing to do for visual exploration from the command line, but you want to avoid that import-time side effect when using seaborn as a library in a larger application. The seaborn folks have provided a seaborn.apionly module for exactly this purpose: you do import seaborn.apionly as sns, and any matplotlib settings made elsewhere in your app are left untouched. Would something like this be possible for Pandas?

jreback (Contributor) commented May 8, 2016

Well, this setting has existed since pandas' inception. As I laid out above, I don't see much gain in changing this. If you would like to make an attempt, by all means.

Sure, and that's fine when you're using Pandas to do data analysis at a command line. It's a long way from fine when you're using Pandas as a library for a couple of small tasks (in this particular case, flexible reading from .csv files) within a larger application.

Your statement is very odd. Virtually the entire user base uses pandas for data analysis. Wanting to drop back to numpy's exception handling without using pandas seems dubious at best.

anntzer (Contributor) commented May 11, 2016

I agree that this behavior is very annoying. Most of the time, my only use of pandas is for its (great) implementation of groupby, and I do not want other errors to pass silently. Likewise, because seaborn depends on pandas, it is easy to end up with this change even without explicitly importing pandas.

Two possible solutions may be

  • Provide pandas.apionly, similar to seaborn.apionly. The problem is that this is going to take a while to propagate down to other packages that depend on pandas.
  • Only change the errstate if at an interactive prompt (as defined by the presence of sys.ps1, not sys.flags.interactive, as the latter is unset when in IPython).
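
The second option can be sketched in a few lines; sys.ps1 only exists at an interactive prompt (IPython sets it too), so the global change would be limited to interactive use:

```python
import sys
import numpy as np

before = np.geterr()             # the caller's current FP error settings
if hasattr(sys, 'ps1'):          # interactive session only
    np.seterr(all='ignore')
# in scripts and test runs, the caller's settings are untouched
assert hasattr(sys, 'ps1') or np.geterr() == before
```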

rkern (Contributor) commented May 11, 2016

I do not believe that unilaterally modifying numpy's error-handling is a common practice. If you proposed that over at numpy-discussion, I think you would get some pushback. Importing numpy.ma used to do this for basically the same reasons, and we eventually received enough complaints about it to fix it. If you would like to propose that pandas' default settings are better than numpy's (and I am rather more sympathetic to that idea now than I was in the past), then let's change numpy's defaults so that we have consistent, documented behavior across the board.

Part of the reason that you have seen only a few complaints about this behavior since pandas' inception is that it changes behavior silently. It took a couple of years after the same behavior's inception in numpy.ma for us to notice it and fix it too.

Having an application that has both data analysis (where NaN may mean missing data) and moderately-complicated numerical algorithms (where NaNs can appear from perfectly-clean inputs due to domain violations and other numerical errors) is not at all dubious. Most nontrivial data analysis tasks have both. Cleaning missing data first before doing the heavy numerical work of data analysis/statistics/machine learning is quite common. That's why numpy has the ability to control the FPE-handling at a pretty fine grain.
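
The fine-grained control mentioned above: each FPE category (divide, invalid, over, under) can be handled independently, and the change can be scoped with a context manager rather than applied globally:

```python
import numpy as np

# handle division-by-zero and invalid-operation errors differently,
# only within this block
with np.errstate(divide='ignore', invalid='raise'):
    inner = np.geterr()          # settings in force inside the block
assert inner['divide'] == 'ignore'
assert inner['invalid'] == 'raise'
```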

jreback (Contributor) commented May 11, 2016

Well @rkern, this has been in place since before 2012, AFAICT. This is possible, but the performance of the context managers would need to be considered, and, much more importantly, someone needs to do it; are you volunteering?

rkern (Contributor) commented May 11, 2016

In progress.

In so doing, I found another reason to avoid the fire-and-forget np.seterr() in favor of more fine-grained np.errstate(). pandas assumes that no one else has changed the settings. If someone uses with np.errstate(invalid='raise'): then binary comparisons will be silently wrecked.

[~]
|1> import numpy as np; import pandas

[~]
|2> df = pandas.DataFrame(dict(x=[np.nan, 1, 2]))

[~]
|3> df
     x
0  NaN
1  1.0
2  2.0

[~]
|4> df < 0
       x
0  False
1  False
2  False

[~]
|5> df > 0
       x
0  False
1   True
2   True

[~]
|6> with np.errstate(invalid='raise'): print df < 0
      x
0  True
1  True
2  True

[~]
|7> with np.errstate(invalid='raise'): print df > 0
      x
0  True
1  True
2  True
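
The seterr-versus-errstate contrast can be shown in miniature (a sketch, not pandas code): the context manager restores whatever settings the caller had, while a fire-and-forget np.seterr assumes nobody else has touched them.

```python
import numpy as np

np.seterr(invalid='raise')            # a caller's deliberate choice
with np.errstate(all='ignore'):       # a library scoping its own needs
    np.sqrt(np.array([-1.0]))         # invalid op suppressed in here
assert np.geterr()['invalid'] == 'raise'   # the caller's choice survives
```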

jreback removed this from the No action milestone May 11, 2016
jreback reopened this May 11, 2016
jreback (Contributor) commented May 11, 2016

ok thanks @rkern that would be awesome!

yeah I have seen that inverse case as well.

jreback added this to the Next Major Release milestone May 11, 2016
@TomAugspurger (Contributor)

@jreback how do we envision this interacting with the internals refactor? This is kind of a gray zone in the defined behavior right now; I don't think we would want to commit to anything that could change with the refactor.
As an example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
with np.errstate(invalid='raise'):
    s.reindex(range(4)) > 1

That casts to float and uses np.nan as the missing-value marker, so it raises. Post internals-refactor that should still be an int dtype that doesn't use np.nan for missing values, so presumably it wouldn't raise unless pandas explicitly does so. Should we give Wes a heads-up?
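
The cast in question, shown in isolation (requires pandas): reindexing an int64 Series past its length introduces missing values, which forces a float64 result with np.nan as the marker.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
r = s.reindex(range(4))           # index 3 has no value
assert s.dtype == np.int64
assert r.dtype == np.float64      # cast to float to hold the NaN marker
assert np.isnan(r.iloc[3])
```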

jreback (Contributor) commented May 11, 2016

No @TomAugspurger, I don't think it will have much of an effect. If it's done, then computation that is deferred to numpy will use a context manager to wrap the error state; that's it, much as is done now. Ideally one would have a function like:

def compute_numpy(name, arr, *args, **kwargs):
    # scope the error-state change to the single deferred numpy call
    with np.errstate(all='ignore'):
        return getattr(arr, name)(*args, **kwargs)

something like that. There would have to be some logic in this function to handle some of this, but it's not a big deal. It will mostly be isolated.
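
A runnable version of that sketch (compute_numpy is illustrative, not a real pandas function): the error-state change applies only around the deferred call and is reverted on exit, leaving the process-global settings untouched.

```python
import numpy as np

def compute_numpy(name, arr, *args, **kwargs):
    # local to this call; np.errstate reverts the settings on exit
    with np.errstate(all='ignore'):
        return getattr(arr, name)(*args, **kwargs)

before = np.geterr()
out = compute_numpy('__truediv__',
                    np.array([1.0, 0.0]), np.array([0.0, 0.0]))
assert np.geterr() == before                 # global settings untouched
assert np.isinf(out[0]) and np.isnan(out[1])  # 1/0 -> inf, 0/0 -> nan
```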


mcg1969 commented Aug 10, 2016

Just hit this problem myself; I'm definitely in favor of a fix. In this case, pandas was a dependency of another package I was using (indeed, I wasn't even using the pandas-dependent functionality), but I was bitten by this anyway. Thanks to @rkern for working on this.


mcg1969 commented Aug 21, 2016

This is great. @rkern thank you for the hard work!
