Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fixes plotting with nullable integers (#32073) #34896

Closed
wants to merge 8 commits into from

Conversation

cvanweelden
Copy link
Contributor

@cvanweelden cvanweelden commented Jun 20, 2020

Similar to #32410 this tries to fix plotting with nullable integer extension data types by casting them to float before passing on to matplotlib.

@cvanweelden
Copy link
Contributor Author

@MarcoGorelli I'm not sure where to put the whatsnew entry. doc/source/whatsnew/v1.1.0.rst? Under "Other enhancements"?

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Visualization plotting labels Jun 20, 2020
pandas/plotting/_matplotlib/core.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member

@MarcoGorelli I'm not sure where to put the whatsnew entry. doc/source/whatsnew/v1.1.0.rst? Under "Other enhancements"?

Thanks for your PR! I think this should go under the plotting section of the bug fixes

@cvanweelden
Copy link
Contributor Author

I've added the safe_convert_to_ndarray that seems to work for both GroupBy.quantile, but I've got some doubts about intended behavior:

  • GroupBy.quantile doesn't seem to handle datetime64tz dtypes, given that is_datetime64_dtype returns False for these types, see
    elif is_datetime64_dtype(vals.dtype):
    , I'm not sure if it should? If so it seems to be missing tests.
  • Previously datetime (non-extension type) was converted to float in GroupBy.quantile, now they are just represented as datetime64 array but that doesn't seem to break any tests. I think I should figure out if that is just because of missing tests, or if it is actually working like this, but I'm not sure how to define the quantile of a set of dates.
  • There is a case in the if statement in safe_convert_to_ndarray where the input dtype is an extension dtype, but not one of the known ones. Can this happen? If so, should it raise something like NotImplementedError, or should it try to convert like now with the possibility that you get an array of object dtype back?

Also any feedback on the docstring and typehints would be appreciated. Because of the many types that could be the input I wasn't sure what types to add there so I only added a hint for output type.

@MarcoGorelli MarcoGorelli self-requested a review June 26, 2020 16:14
@cvanweelden
Copy link
Contributor Author

Ah I don't pass my own test it seems, was sure I reran it after the last change. 🤦 The problem here is that for a Series object it'd need to call Series.dt.tz_localize or it will be called on the index instead of on the values, while for a DatetimeIndex it is DatetimeIndex.tz_localize.

Perhaps we can simplify it by making safe_convert_to_ndarray take only pandas.array input and have whatever calls it take care of extracting the data it wants to convert?

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
pandas/core/dtypes/cast.py Show resolved Hide resolved
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just making a few comments (still need to address some of your questions though, but will get round to it)

pandas/core/dtypes/cast.py Outdated Show resolved Hide resolved
@cvanweelden
Copy link
Contributor Author

Mypy gives the following error:

pandas/core/dtypes/cast.py:1777: error: Item "ExtensionArray" of "Union[ExtensionArray, Index]" has no attribute "tz_localize"
pandas/core/dtypes/cast.py:1777: error: Item "Index" of "Union[ExtensionArray, Index]" has no attribute "tz_localize"
pandas/core/dtypes/cast.py:1777: error: Item "Index" of "Union[Any, Index]" has no attribute "tz_localize"

I could fix this by defering this to another function that only takes care of the datetime64tz conversion, but how do I type that for specific dtype?

@cvanweelden
Copy link
Contributor Author

cvanweelden commented Jun 27, 2020

The version of safe_convert_to_ndarray I've implemented now seems to work for the intended plotting bugfix, and for the GroupBy.quantile function. However, the more I worked on this the more I think that this shouldn't be it's own function:

  • The real way to safely convert to ndarray should be the to_numpy method.
  • GroupBy.quantile and MPLPlot._compute_plot_data have different motivations to subvert this default path and handle extension dtypes in an unsafe way (i.e. changing the data such as bool -> float).
  • cast.py already seems to have quite some near-duplicate functionality and adding this function feels like adding more noise on top of that.

How I see it there are 2 options:

  • A: we fix the type issue with the current code, maybe add a warning to docstring to use to_numpy instead, and merge this.
  • B: we revert to the fix where we only touch MPLPlot._compute_plot_data. This has my preference given the above.

Regarding the dates quantile; in both this version and master quantile of [2001, NaT, 2003] is 1839-05-13 12:06:21.572612096. I'll see if there is already an issue open for how GroupBy.quantile handles date series with NaT or else I'll open a new one for that.

EDIT: the quantile issue seems too close to #33168 and #29485 to open a new issue for it.

"A": [1, 2, 3, 4, 5],
"B": [7, 5, np.nan, 3, 2],
"C": pd.to_datetime(dates, format="%Y"),
"D": pd.to_datetime(dates, format="%Y", utc=True),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli This is the test case that hits the elif is_datetime64tz_dtype(values.dtype) branch in safe_convert_to_ndarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cvanweelden thanks for pointing me towards this.

Are you sure? I just tried running it and didn't hit it. Perhaps it was true before you changed

safe_convert_to_ndarray(numeric_data[col])

to

safe_convert_to_ndarray(numeric_data[col].values)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, you are right. I wasn't aware that the underlying values were actually an ndarray with non-extended dtype, so I didn't know this change would have that effect. Thanks!

@cvanweelden
Copy link
Contributor Author

Given that the requested direction to have the same code path shared between the plotting function and GroupBy.quantile doesn't sit well by me, I'm going to close this PR and unassign myself from the issue so that somebody else can pick it up that can either retry a minimal fix for the bug or that wants to do a bigger rewrite to clean up some of the issues with operations on nullable types.

@cvanweelden cvanweelden closed this Jul 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Visualization plotting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plotting Int64 columns with nulled integers (NAType) fails
3 participants