
"import pandas" changes NumPy error-handling settings #13109

Closed
mdickinson opened this issue May 7, 2016 · 13 comments
Labels
API Design Compat pandas objects compatability with Numpy or Python functions
mdickinson commented May 7, 2016

A simple import pandas apparently does a np.seterr(all='ignore'): https://github.com/pydata/pandas/blob/23eb483d17ce41c0fe41b0bfa72c90df82151598/pandas/compat/numpy/__init__.py#L8-L9

This is a problem when using Pandas as a library, particularly during testing: a test (completely unrelated to Pandas) that should have produced warnings can fail to do so just because some earlier test happened to import a library that imported a library that imported something from Pandas (for example). Or a test runner that's done a np.seterr(all='raise') specifically to catch potential issues can end up catching nothing, because some Pandas import part-way through the test run turned the error handling off again.
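
The clobbering described above can be reproduced without pandas itself; in the sketch below, an explicit np.seterr(all='ignore') stands in for what the pandas import did at the time:

```python
# A sketch of the clobbering described above; np.seterr(all='ignore')
# stands in for the import-time side effect of "import pandas".
import warnings
import numpy as np

np.seterr(all='warn')  # e.g. a test runner opting in to FP warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.array([1.0]) / np.array([0.0])  # emits RuntimeWarning as expected
assert len(caught) == 1

np.seterr(all='ignore')  # the import-time side effect, simulated

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    np.array([1.0]) / np.array([0.0])  # same code, warning silently gone
assert len(caught) == 0
```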

I'm working around this right now by wrapping every pandas import in a with np.errstate(): block. For example:

with np.errstate():
    from pandas import DataFrame

But that's (a) ugly, and (b) awkward if the pandas import is in a third-party library out of your control.
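
For the third-party case, another workaround (a sketch, not an official recommendation) is to snapshot and restore the global error state around the import. The function below is a hypothetical stand-in for the offending import, so the sketch runs without pandas installed:

```python
import numpy as np

def import_that_clobbers_seterr():
    # hypothetical stand-in for "import pandas" (which called np.seterr)
    np.seterr(all='ignore')

saved = np.geterr()              # remember the application's settings
import_that_clobbers_seterr()    # the import you don't control
np.seterr(**saved)               # put things back
assert np.geterr() == saved
```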

Please consider removing this feature!

Code Sample, a copy-pastable example if possible

Enthought Canopy Python 2.7.11 | 64-bit | (default, Mar 18 2016, 14:49:17) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy, pandas
>>> 1.0 / numpy.arange(10)
array([        inf,  1.        ,  0.5       ,  0.33333333,  0.25      ,
        0.2       ,  0.16666667,  0.14285714,  0.125     ,  0.11111111])

Expected Output

I expected to see a RuntimeWarning from the division by zero above, as in:

Enthought Canopy Python 2.7.11 | 64-bit | (default, Mar 18 2016, 14:49:17) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> 1.0 / numpy.arange(10)
__main__:1: RuntimeWarning: divide by zero encountered in divide
array([        inf,  1.        ,  0.5       ,  0.33333333,  0.25      ,
        0.2       ,  0.16666667,  0.14285714,  0.125     ,  0.11111111])

output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.6.7
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: None
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
@TomAugspurger (Contributor)

This came up somewhat recently in #12464, but I don't think anything came out of it.

jreback (Contributor) commented May 7, 2016

@mdickinson this is very, very common to do. The entire point of pandas is to turn things into NaN and propagate them.

It is possible to turn this off by having practically every operation deferred to numpy wrap the call in a context manager (rather than setting the global state).

It wouldn't be that invasive (it could be done with a decorator), but it adds complexity, may be somewhat non-performant, and it's a fair amount of work.

If you want to give this a stab, then I'll reopen.

jreback closed this as completed May 7, 2016
jreback added this to the No action milestone May 7, 2016
jreback added the Compat pandas objects compatability with Numpy or Python functions label May 7, 2016
@mdickinson (Author)

The entire point of pandas is to turn things to NaN AND propogate them.

Sure, and that's fine when you're using Pandas to do data analysis at a command line. It's a long way from fine when you're using Pandas as a library for a couple of small tasks (in this particular case, flexible reading from .csv files) within a larger application. Then you're in the situation where one of your application dependencies has unilaterally and silently turned off NumPy warnings for the entire application, oblivious to the needs of that application or any of its many other libraries. That's just rude. :-) Deciding to turn off NumPy warnings (or any warnings, for that matter) globally is a decision that should be made at application level, not at the level of one particular library.

It's interesting to compare with the seaborn library, where import seaborn changes several matplotlib settings. Again, that's a useful thing to do for visual exploration from the command line, but you want to avoid that import-time side effect when using seaborn as a library in a larger application. The seaborn folks have provided a seaborn.apionly module for exactly this purpose: you do import seaborn.apionly as sns, and any matplotlib settings made elsewhere in your app are left untouched. Would something like this be possible for Pandas?

jreback (Contributor) commented May 8, 2016

Well, this setting has existed since pandas' inception. As I laid out above, I don't see much gain in changing this. If you would like to make an attempt, by all means.

Sure, and that's fine when you're using Pandas to do data analysis at a command line. It's a long way from fine when you're using Pandas as a library for a couple of small tasks (in this particular case, flexible reading from .csv files) within a larger application.

Your statement is very odd. Virtually the entire user base uses pandas for data analysis. Wanting to drop back to numpy's exception handling without using pandas seems dubious at best.

anntzer (Contributor) commented May 11, 2016

I agree that this behavior is very annoying. Most of the time, my only use of pandas is for its (great) implementation of groupby, and I do not want other errors to pass silently. Likewise, because seaborn depends on pandas, it is easy to end up with this change even without explicitly importing pandas.

Two possible solutions may be

  • Provide pandas.apionly, similar to seaborn.apionly. The problem is that this is going to take a while to propagate down to other packages that depend on pandas.
  • Only change the errstate if at an interactive prompt (as defined by the presence of sys.ps1, not sys.flags.interactive, as the latter is unset when in IPython).
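
The second option can be sketched in a few lines; sys.ps1 only exists at an interactive prompt (IPython sets it too), so the global change would be limited to interactive use:

```python
import sys
import numpy as np

before = np.geterr()             # the caller's current FP error settings
if hasattr(sys, 'ps1'):          # interactive session only
    np.seterr(all='ignore')
# in scripts and test runs, the caller's settings are untouched
assert hasattr(sys, 'ps1') or np.geterr() == before
```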

rkern (Contributor) commented May 11, 2016

I do not believe that unilaterally modifying numpy's error-handling is a common practice. If you proposed that over at numpy-discussion, I think you would get some pushback. Importing numpy.ma used to do this for basically the same reasons, and we eventually received enough complaints about it to fix it. If you would like to propose that pandas' default settings are better than numpy's (and I am rather more sympathetic to that idea now than I was in the past), then let's change numpy's defaults so that we have consistent, documented behavior across the board.

Part of the reason that you have seen only a few complaints about this behavior since pandas' inception is that it changes behavior silently. It took a couple of years after the same behavior's inception in numpy.ma for us to notice it and fix it too.

Having an application that has both data analysis (where NaN may mean missing data) and moderately-complicated numerical algorithms (where NaNs can appear from perfectly-clean inputs due to domain violations and other numerical errors) is not at all dubious. Most nontrivial data analysis tasks have both. Cleaning missing data first before doing the heavy numerical work of data analysis/statistics/machine learning is quite common. That's why numpy has the ability to control the FPE-handling at a pretty fine grain.
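
The fine-grained control mentioned above: each FPE category (divide, invalid, over, under) can be handled independently, and the change can be scoped with a context manager rather than applied globally:

```python
import numpy as np

# handle division-by-zero and invalid-operation errors differently,
# only within this block
with np.errstate(divide='ignore', invalid='raise'):
    inner = np.geterr()          # settings in force inside the block
assert inner['divide'] == 'ignore'
assert inner['invalid'] == 'raise'
```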

jreback (Contributor) commented May 11, 2016

Well @rkern, this has been in place since before 2012, AFAICT. This is possible, but the performance of the context managers would need to be considered, and, much more importantly, someone needs to do it; are you volunteering?

rkern (Contributor) commented May 11, 2016

In progress.

In so doing, I found another reason to avoid the fire-and-forget np.seterr() in favor of more fine-grained np.errstate(). pandas assumes that no one else has changed the settings. If someone uses with np.errstate(invalid='raise'): then binary comparisons will be silently wrecked.

[~]
|1> import numpy as np; import pandas

[~]
|2> df = pandas.DataFrame(dict(x=[np.nan, 1, 2]))

[~]
|3> df
     x
0  NaN
1  1.0
2  2.0

[~]
|4> df < 0
       x
0  False
1  False
2  False

[~]
|5> df > 0
       x
0  False
1   True
2   True

[~]
|6> with np.errstate(invalid='raise'): print df < 0
      x
0  True
1  True
2  True

[~]
|7> with np.errstate(invalid='raise'): print df > 0
      x
0  True
1  True
2  True
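
The seterr-versus-errstate contrast can be shown in miniature (a sketch, not pandas code): the context manager restores whatever settings the caller had, while a fire-and-forget np.seterr assumes nobody else has touched them.

```python
import numpy as np

np.seterr(invalid='raise')            # a caller's deliberate choice
with np.errstate(all='ignore'):       # a library scoping its own needs
    np.sqrt(np.array([-1.0]))         # invalid op suppressed in here
assert np.geterr()['invalid'] == 'raise'   # the caller's choice survives
```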

jreback removed this from the No action milestone May 11, 2016
jreback reopened this May 11, 2016
jreback (Contributor) commented May 11, 2016

ok thanks @rkern that would be awesome!

yeah I have seen that inverse case as well.

jreback added this to the Next Major Release milestone May 11, 2016
@TomAugspurger (Contributor)

@jreback how do we envision this interacting with the internals refactor? This is kind of a gray zone in the defined behavior right now; I don't think we would want to commit to anything that could change with the refactor.
As an example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
with np.errstate(invalid='raise'):
    s.reindex(range(4)) > 1

That casts to float and uses np.nan as the missing-value marker, so it raises. Post internals-refactor that should still be an int dtype that doesn't use np.nan for missing values, so presumably it wouldn't raise unless pandas explicitly does so. Should we give Wes a heads-up?
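
The cast in question, shown in isolation (requires pandas): reindexing an int64 Series past its length introduces missing values, which forces a float64 result with np.nan as the marker.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
r = s.reindex(range(4))           # index 3 has no value
assert s.dtype == np.int64
assert r.dtype == np.float64      # cast to float to hold the NaN marker
assert np.isnan(r.iloc[3])
```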

jreback (Contributor) commented May 11, 2016

No @TomAugspurger, I don't think it will have much of an effect. If it's done, then computation that is deferred to numpy will use a context manager to wrap the error state; that's it, much as is done now. Ideally one would have a function like:

def compute_numpy(name, arr, *args, **kwargs):
    # scope the error-state change to the single deferred numpy call
    with np.errstate(all='ignore'):
        return getattr(arr, name)(*args, **kwargs)

something like that. There would have to be some logic in this function to handle some of this, but it's not a big deal. It will mostly be isolated.
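
A runnable version of that sketch (compute_numpy is illustrative, not a real pandas function): the error-state change applies only around the deferred call and is reverted on exit, leaving the process-global settings untouched.

```python
import numpy as np

def compute_numpy(name, arr, *args, **kwargs):
    # local to this call; np.errstate reverts the settings on exit
    with np.errstate(all='ignore'):
        return getattr(arr, name)(*args, **kwargs)

before = np.geterr()
out = compute_numpy('__truediv__',
                    np.array([1.0, 0.0]), np.array([0.0, 0.0]))
assert np.geterr() == before                 # global settings untouched
assert np.isinf(out[0]) and np.isnan(out[1])  # 1/0 -> inf, 0/0 -> nan
```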


mcg1969 commented Aug 10, 2016

Just hit this problem myself; I'm definitely in favor of a fix. In this case, pandas was a dependency of another package I was using (indeed, I wasn't even using the pandas-dependent functionality), but I was bitten by this anyway. Thanks to @rkern for working on this.


mcg1969 commented Aug 21, 2016

This is great. @rkern thank you for the hard work!
