Plotting Int64 columns with nulled integers (NAType) fails #32073

Closed
Khris777 opened this issue Feb 18, 2020 · 19 comments · Fixed by #38014
Labels: Bug, good first issue, NA - MaskedArrays, Visualization

@Khris777

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [7, 5, np.nan, 3, 2]})
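df.plot(x='A', y='B')  # works: column B is float64 and holds NaN

df = df.astype('Int64')  # nullable integer dtype: NaN becomes pd.NA
df.plot(x='A', y='B')  # raises the TypeError shown below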

Problem description

The first plotting command works, the second throws the error message

TypeError: float() argument must be a string or a number, not 'NAType'

Expected Output

NAType should be treated the same way as numpy nan in plotting. Maybe transformed on the fly?

(I'm unsure whether this is a pandas, a numpy, or a matplotlib issue, so I'm starting here.)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200209
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : 2.4.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.7
numba : 0.48.0

@TomAugspurger
Contributor

Can you post the full traceback?

@TomAugspurger added the Needs Info and Visualization labels on Feb 18, 2020
@Khris777
Author

Gladly:

Traceback (most recent call last):

  File "C:\Users\My.Name\Documents\Python_Projects\GIT\miscellaneous\temp.py", line 14, in <module>
    df.plot(x='A', y='B')

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_core.py", line 847, in __call__
    return plot_backend.plot(data, kind=kind, **kwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\__init__.py", line 61, in plot
    plot_obj.generate()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 263, in generate
    self._make_plot()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 1085, in _make_plot
    **kwds,

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 1104, in _plot
    lines = MPLPlot._plot(ax, x, y_values, style=style, **kwds)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\converter.py", line 66, in wrapper
    return func(*args, **kwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\plotting\_matplotlib\core.py", line 656, in _plot
    return ax.plot(*args, **kwds)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\axes\_axes.py", line 1667, in plot
    self.add_line(line)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\axes\_base.py", line 1902, in add_line
    self._update_line_limits(line)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\axes\_base.py", line 1924, in _update_line_limits
    path = line.get_path()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\lines.py", line 1027, in get_path
    self.recache()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\lines.py", line 675, in recache
    y = _to_unmasked_float_array(yconv).ravel()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\matplotlib\cbook\__init__.py", line 1388, in _to_unmasked_float_array
    return np.ma.asarray(x, float).filled(np.nan)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\ma\core.py", line 7849, in asarray
    subok=False, order=order)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\ma\core.py", line 2795, in __new__
    order=order, subok=True, ndmin=ndmin)

TypeError: float() argument must be a string or a number, not 'NAType'
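
The last frames of the traceback show matplotlib and numpy coercing the y values to a float array. A minimal sketch of why that step fails, assuming the values reach numpy as an object array containing pd.NA (an assumption, the traceback does not show the array itself):

```python
import numpy as np
import pandas as pd

# pd.NA does not support conversion to float, so float() raises TypeError.
scalar_error = None
try:
    float(pd.NA)
except TypeError as exc:
    scalar_error = exc

# An object array holding pd.NA fails the same way when numpy casts it to float.
values = np.array([7, 5, pd.NA, 3, 2], dtype=object)
array_error = None
try:
    np.asarray(values, dtype=float)
except TypeError as exc:
    array_error = exc

print(type(scalar_error).__name__, type(array_error).__name__)
```

The exact wording of the message depends on the Python version, but the exception type matches the one reported above.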

@jorisvandenbossche
Member

We need to convert to floats with NaNs before passing the data to matplotlib, I suppose.

@jorisvandenbossche added the Bug and NA - MaskedArrays labels and removed the Needs Info label on Feb 19, 2020
@jorisvandenbossche added this to the Contributions Welcome milestone on Feb 19, 2020
@jeandersonbc
Contributor

Hi! Can I make an attempt on this one? I'm looking for issues that can be useful to the community and help me to understand better how pandas works in depth :)

@jorisvandenbossche
Member

@jeandersonbc yes, that's welcome!
If you have any questions, don't hesitate to ask

@AnnaDaglis
Contributor

take

@jorisvandenbossche
Member

@AnnaDaglis someone else (@jeandersonbc) just commented that he would like to look at it, so I would first give him some time before taking this issue

@jeandersonbc
Contributor

Thanks @jorisvandenbossche, I haven't had time to work on the issue since my last post, which is why I didn't reply earlier.
As I understand your suggestion, there should be verification for NA values before calling plot_backend.plot(data, kind=kind, **kwargs), right? I'll submit a PR later today with some tests to see how it goes.

@jorisvandenbossche
Member

there should be verification for NA values before calling plot_backend.plot(data, kind=kind, **kwargs)

Yes, I think so. But the check might need to be done lower in the stack, as when calling plot_backend.plot(), data is still a dataframe. I think we can focus here on the matplotlib backend included in pandas, and thus we need to ensure converting to float dtype (with to_numpy(float, na_value=np.nan) if it is nullable integer dtype) before passing it to matplotlib's plot().
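
The conversion mentioned above can be sketched on a standalone Series as follows. This is only an illustration of the to_numpy call, not the code that ended up in pandas, and the variable names are made up for the example:

```python
import numpy as np
import pandas as pd

# A nullable integer Series with one missing value (stored as pd.NA).
s = pd.Series([7, 5, None, 3, 2], dtype="Int64")

# Convert to a plain float64 ndarray, replacing pd.NA with NaN.
y_values = s.to_numpy(dtype=float, na_value=np.nan)

print(y_values.dtype)
print(y_values)
```

The result is an ordinary float array, which is the same kind of input matplotlib receives in the working float64 case from the original report.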

@AnnaDaglis removed their assignment on Feb 21, 2020
@jeandersonbc
Contributor

take

@jeandersonbc
Contributor

jeandersonbc commented Mar 2, 2020

So, I approached the problem by checking the numerical data in _compute_plot_data in the matplotlib backend. All tests pass, but I wonder if more tests should be added (e.g., in "pandas/tests/plotting/test_frame.py"). Any thoughts? Thanks in advance!
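
For readers who hit this on an affected version, a user-side workaround in the same spirit can be sketched as below. The helper nullable_to_float is hypothetical (it is not part of pandas and is not the change made in the PR); it assumes unique column names and only touches nullable integer and float columns:

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_extension_array_dtype,
    is_float_dtype,
    is_integer_dtype,
)


def nullable_to_float(df):
    """Hypothetical helper: copy df with nullable numeric columns cast to float64."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        is_nullable_numeric = is_extension_array_dtype(dtype) and (
            is_integer_dtype(dtype) or is_float_dtype(dtype)
        )
        if is_nullable_numeric:
            # pd.NA becomes NaN, which plotting code handles as a gap.
            out[col] = out[col].to_numpy(dtype="float64", na_value=np.nan)
    return out


df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [7, 5, np.nan, 3, 2]}).astype("Int64")
plottable = nullable_to_float(df)
print(plottable.dtypes)
```

Calling plottable.plot(x='A', y='B') should then behave like the float64 case from the original report, while df itself keeps its Int64 columns.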

@MarcoGorelli
Member

Hi @AnnaDaglis - looks like the linked PR went stale (unfortunately), are you still interested in working on this?

@cvanweelden
Contributor

take

@cvanweelden removed their assignment on Jul 7, 2020
@cvanweelden
Contributor

I'm no longer working on this, as my minimal fix wasn't accepted and this might need a more structural solution for 3rd party lib operations on nullable types.

@vetedde

vetedde commented Jul 13, 2020

take

@rkc007
Contributor

rkc007 commented Oct 19, 2020

It's been 2 months since nobody is working on it. I am interested to work on this issue.

@rkc007
Contributor

rkc007 commented Oct 19, 2020

take

@MarcoGorelli
Member

It's been 2 months since nobody is working on it. I am interested to work on this issue.

Awesome! Let us know if you want/need help

@jreback modified the milestones: Contributions Welcome, 1.2 on Dec 4, 2020
@jeandersonbc removed their assignment on Dec 8, 2020
@jankaWIS

jankaWIS commented Jul 2, 2022

Hi, I have a comment on this issue. It happened to me in v1.2.4, and it seems to have been fixed by now (v1.4.3), but I have not found where or when that happened. I believe it's connected and could save time for someone who encounters this error while plotting with matplotlib:

ValueError: values must be a 1D array

If one runs the following code:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'x' : [1,2,3,4,5],
    'y' : [1.,2.,3.,4.1,5]
})

print(df.dtypes)
# x      int64
# y    float64
# dtype: object

# plot
plt.plot(df["x"],df["y"])

# convert types
df = df.convert_dtypes()

print(df.dtypes)
# x      Int64
# y    Float64
# dtype: object

# plot
plt.plot(df["x"],df["y"])

That is, once the dtypes are converted to the pandas nullable dtypes, plotting with matplotlib suddenly fails. The code above draws the first plot, but the second call fails after the conversion. Note that it doesn't matter whether the variable is Float64 or Int64: plotting plt.plot(df["x"],df["x"]) or plt.plot(df["y"],df["y"]) yields the same error.

The full output is below:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_values(self, indexer)
936 try:
--> 937 return self._constructor(self._mgr.get_slice(indexer)).__finalize__(self)
938 except ValueError:

~/anaconda3/lib/python3.8/site-packages/pandas/core/internals/managers.py in get_slice(self, slobj, axis)
1606 blk = self._block
-> 1607 array = blk._slice(slobj)
1608 block = blk.make_block_same_class(array, placement=slice(0, len(array)))

~/anaconda3/lib/python3.8/site-packages/pandas/core/internals/blocks.py in _slice(self, slicer)
1923
-> 1924 return self.values[slicer]
1925

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/masked.py in __getitem__(self, item)
114
--> 115 return type(self)(self._data[item], self._mask[item])
116

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __init__(self, values, mask, copy)
347 )
--> 348 super().__init__(values, mask, copy=copy)
349

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/masked.py in __init__(self, values, mask, copy)
89 if values.ndim != 1:
---> 90 raise ValueError("values must be a 1D array")
91 if mask.ndim != 1:

ValueError: values must be a 1D array

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
/var/folders/bx/tb4883l53hdd3zp2y0nyy_4m0000gp/T/ipykernel_92470/2350386431.py in
16 print(df.dtypes)
17 # plot
---> 18 plt.plot(df["x"],df["y"])

~/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py in plot(scalex, scaley, data, *args, **kwargs)
3017 @_copy_docstring_and_deprecators(Axes.plot)
3018 def plot(*args, scalex=True, scaley=True, data=None, **kwargs):
-> 3019 return gca().plot(
3020 *args, scalex=scalex, scaley=scaley,
3021 **({"data": data} if data is not None else {}), **kwargs)

~/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_axes.py in plot(self, scalex, scaley, data, *args, **kwargs)
1603 """
1604 kwargs = cbook.normalize_kwargs(kwargs, mlines.Line2D)
-> 1605 lines = [*self._get_lines(*args, data=data, **kwargs)]
1606 for line in lines:
1607 self.add_line(line)

~/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py in __call__(self, data, *args, **kwargs)
313 this += args[0],
314 args = args[1:]
--> 315 yield from self._plot_args(this, kwargs)
316
317 def get_next_color(self):

~/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py in _plot_args(self, tup, kwargs, return_kwargs)
488
489 if len(xy) == 2:
--> 490 x = _check_1d(xy[0])
491 y = _check_1d(xy[1])
492 else:

~/anaconda3/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_1d(x)
1360 message='Support for multi-dimensional indexing')
1361
-> 1362 ndim = x[:, None].ndim
1363 # we have definitely hit a pandas index or series object
1364 # cast to a numpy array.

~/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
875 return self._get_values(key)
876
--> 877 return self._get_with(key)
878
879 def _get_with(self, key):

~/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_with(self, key)
890 )
891 elif isinstance(key, tuple):
--> 892 return self._get_values_tuple(key)
893
894 elif not is_list_like(key):

~/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_values_tuple(self, key)
920 # mpl hackaround
921 if com.any_none(*key):
--> 922 result = self._get_values(key)
923 deprecate_ndim_indexing(result, stacklevel=5)
924 return result

~/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_values(self, indexer)
940 # see tests.series.timeseries.test_mpl_compat_hack
941 # the asarray is needed to avoid returning a 2D DatetimeArray
--> 942 return np.asarray(self._values[indexer])
943
944 def _get_value(self, label, takeable: bool = False):

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/masked.py in __getitem__(self, item)
113 item = check_array_indexer(self, item)
114
--> 115 return type(self)(self._data[item], self._mask[item])
116
117 def _coerce_to_array(self, values) -> Tuple[np.ndarray, np.ndarray]:

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/integer.py in __init__(self, values, mask, copy)
346 "the 'pd.array' function instead"
347 )
--> 348 super().__init__(values, mask, copy=copy)
349
350 def __neg__(self):

~/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/masked.py in __init__(self, values, mask, copy)
88 )
89 if values.ndim != 1:
---> 90 raise ValueError("values must be a 1D array")
91 if mask.ndim != 1:
92 raise ValueError("mask must be a 1D array")

ValueError: values must be a 1D array
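
On a version that still shows this error, one possible workaround is to hand matplotlib plain NumPy arrays instead of the nullable Series. The sketch below assumes the failure comes from the 2-D indexing (x[:, None]) that _check_1d attempts in the traceback above. It stops short of calling matplotlib so that it runs without a plotting backend:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1.0, 2.0, 3.0, 4.1, 5.0],
}).convert_dtypes()

# Convert the nullable columns to plain float64 ndarrays; any pd.NA would become NaN.
x = df["x"].to_numpy(dtype="float64", na_value=np.nan)
y = df["y"].to_numpy(dtype="float64", na_value=np.nan)

# A plain ndarray supports the 2-D indexing that fails on the masked array.
x_2d = x[:, None]
print(x_2d.shape)

# The arrays can then be plotted, e.g. plt.plot(x, y) with matplotlib.pyplot imported as plt.
```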

In case it was relevant, this was my installation at that time:

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.4
numpy : 1.22.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.5
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.26.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1
