Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Period and PeriodIndex support of combined offsets aliases #13874

Closed
wants to merge 1 commit into from
Closed

Conversation

agraboso
Copy link
Contributor

@agraboso agraboso commented Aug 1, 2016

Essentially, makes sure any freq string passed to a Period or PeriodIndex goes through _Period._maybe_convert_freq(), which calls to_offset(), which is where the logic for combining aliases is.

All the examples in #13730 result in the correct output, and all existing tests pass. I have not written any new ones yet — I first wanted to get the opinion of the maintainers.

This PR builds on #13868 (without it, some existing tests fail).

@jreback
Copy link
Contributor

jreback commented Aug 1, 2016

this seems to be on top of #13868 was that intentional? (usually we try to do independent PR's). if they are not its ok, but then should just put in one / indicate at the top that is the case.

@sinhrks

@jreback jreback added the Frequency DateOffsets label Aug 1, 2016
@jreback
Copy link
Contributor

jreback commented Aug 1, 2016

can you run an asv on this, make sure looks good for perf.

@jreback
Copy link
Contributor

jreback commented Aug 1, 2016

also would need to add tests from the issue. I dont' think we are actually testing this (and that's why it doesn't fail).

@@ -1203,6 +1208,7 @@ class Period(_Period):

def _ordinal_from_fields(year, month, quarter, day,
hour, minute, second, freq):
freq = Period._maybe_convert_freq(freq)
Copy link
Member

@sinhrks sinhrks Aug 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't needed, as _ordinal_from_field is only called from here (after 1130th line)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. I was being overly cautious (maybe some other part of the code might need to call this in the future?) but I can remove it.

@sinhrks sinhrks added the Period Period data type label Aug 1, 2016
@agraboso
Copy link
Contributor Author

agraboso commented Aug 2, 2016

I ran the asv benchmarks. Here are the results:

    before     after       ratio
  [59f25575] [01c55da7]
+    3.73ms    16.80ms      4.51  reindex.frame_drop_dup_na_inplace.time_frame_drop_dup_na_inplace
+  898.72ms      3.12s      3.47  plotting.plot_andrews_curves.time_plot_andrews_curves
-   45.25ms    13.87ms      0.31  period.create_period_index_from_date_range.time_period_index

I have no experience whatsoever with these, so I don't know how to interpret the results above, but I find it perplexing that the two tests that run slower have absolutely nothing to do with time series...

@sinhrks
Copy link
Member

sinhrks commented Aug 2, 2016

Thanks for the test. The slowness looks to be caused by regex search. Maybe we should try dict lookup first like get_offset does (related to #4205, merging to_offset and get_offset to make fast/simple impl).

@agraboso
Copy link
Contributor Author

agraboso commented Aug 2, 2016

@sinhrks I'm confused. The asv test that has to do with periods is faster with this PR...

@sinhrks
Copy link
Member

sinhrks commented Aug 2, 2016

Ah sorry, I looked first 2 ratios (looks unrelated to the period). Pls ignore my comment.

@codecov-io
Copy link

codecov-io commented Aug 2, 2016

Current coverage is 85.31% (diff: 96.29%)

Merging #13874 into master will increase coverage by <.01%

@@             master     #13874   diff @@
==========================================
  Files           139        139          
  Lines         50143      50146     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42777      42780     +3   
  Misses         7366       7366          
  Partials          0          0          

Powered by Codecov. Last update 7e15923...49a3783

@jreback jreback modified the milestone: 0.19.0 Aug 2, 2016
@jreback
Copy link
Contributor

jreback commented Aug 2, 2016

pls rebase. Need tests to close #13730 ; and add to whatsnew.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 2, 2016

Rebased. I have also added tests for Period, Period.asfreq, PeriodIndex, PeriodIndex.asfreq, PeriodIndex.to_datetime and period_range with combined aliases.

There is one standing issue: the function get_standard_freq() does not accepted combined aliases as of now. Apart from tests, it is only called from three places in the whole codebase:

  • Lines period.py#L609 and period.pyx#L742 both read

    freqstr = frequencies.get_standard_freq(other)
    

    They can — and should — be changed to

    freqstr = other.rule_code
    
  • period.py#L488-L489

    freq = Period._maybe_convert_freq(freq)
    freq = frequencies.get_standard_freq(freq)
    

    can — and should — be changed to

    freq = Period._maybe_convert_freq(freq).freqstr
    

In view of this, I propose deprecating get_standard_freq() (as is already done with get_offset_name()) — or removing it completely if possible.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 2, 2016

I implemented the changes and deprecation I proposed in my last comment in the last commit, including tests for the deprecation warning anywhere get_standard_freq() was tested.

@@ -864,3 +865,4 @@ Bug Fixes
- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has multiple offsets (:issue:`13874`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"combined offsets" or "combined aliases" should be better.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 2, 2016

@sinhrks Is "combined offset aliases" okay?

@agraboso agraboso changed the title BUG: Fix Period and PeriodIndex support of combined alias offsets BUG: Fix Period and PeriodIndex support of combined offsets aliases Aug 2, 2016
@sinhrks
Copy link
Member

sinhrks commented Aug 2, 2016

@agraboso Thx, it sounds ok.

@@ -478,7 +485,7 @@ def asfreq(self, freq=None, how='E'):
"""
how = _validate_end_alias(how)

freq = frequencies.get_standard_freq(freq)
freq = Period._maybe_convert_freq(freq).freqstr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean stripping the .freqstr at the end? Sure, that works (tests pass)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thx to confirm.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 3, 2016

@jreback How does it look?


Parameters
----------
freq : str or tuple or datetime.timedelta or DateOffset or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use commas rather than or

@agraboso
Copy link
Contributor Author

agraboso commented Aug 3, 2016

@jreback Done.


See Also
--------
DateOffset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this pandas.DateOffset

@jreback
Copy link
Contributor

jreback commented Aug 3, 2016

you are going to have a flake8 minor issue (long lines). otherwise lgtm.

can you run an asv vs master (just -b period) is fine, and show the results.

@jreback jreback added this to the 0.19.0 milestone Aug 3, 2016
@agraboso
Copy link
Contributor Author

agraboso commented Aug 3, 2016

you are going to have a flake8 minor issue (long lines). otherwise lgtm.

flake8 does not complain. Which are the long lines you refer to? Found and fixed them.

can you run an asv vs master (just -b period) is fine, and show the results.

    before     after       ratio
  [768bf495] [81262a57]
-   54.20ms    14.96ms      0.28  period.create_period_index_from_date_range.time_period_index

@jreback
Copy link
Contributor

jreback commented Aug 3, 2016

are u just showing the single change in asv?
there are more period benchmarks

@agraboso
Copy link
Contributor Author

agraboso commented Aug 3, 2016

Here is the whole output:

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 24 total benchmarks (2 commits * 1 environments * 12 benchmarks)
[  0.00%] · For pandas commit hash 81262a57:
[  0.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
[  0.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  4.17%] ··· Running gil.nogil_datetime_fields.time_datetime_to_period                                                                                                    18.29ms
[  8.33%] ··· Running gil.nogil_datetime_fields.time_period_to_datetime                                                                                                    55.16ms
[ 12.50%] ··· Running period.create_period_index_from_date_range.time_period_index                                                                                         14.96ms
[ 16.67%] ··· Running period.period_algorithm.time_period_index_drop_duplicates                                                                                            30.93μs
[ 20.83%] ··· Running period.period_algorithm.time_period_index_value_counts                                                                                                1.12ms
[ 25.00%] ··· Running period.period_algorithm.time_period_series_drop_duplicates                                                                                           16.60ms
[ 29.17%] ··· Running period.period_algorithm.time_period_series_value_counts                                                                                               6.94ms
[ 33.33%] ··· Running period.period_setitem.time_period_setitem                                                                                                             failed
[ 37.50%] ··· Running plotting.plot_timeseries_period.time_plot_timeseries_period                                                                                         308.25ms
[ 41.67%] ··· Running timeseries.timeseries_iter_periodindex.time_timeseries_iter_periodindex                                                                                6.57s
[ 45.83%] ··· Running timeseries.timeseries_iter_periodindex_preexit.time_timeseries_iter_periodindex_preexit                                                              55.71ms
[ 50.00%] ··· Running timeseries.timeseries_period_downsample_mean.time_timeseries_period_downsample_mean                                                                  26.72ms
[ 50.00%] · For pandas commit hash 768bf495:
[ 50.00%] ·· Building for conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 50.00%] ·· Benchmarking conda-py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 54.17%] ··· Running gil.nogil_datetime_fields.time_datetime_to_period                                                                                                    19.20ms
[ 58.33%] ··· Running gil.nogil_datetime_fields.time_period_to_datetime                                                                                                    51.12ms
[ 62.50%] ··· Running period.create_period_index_from_date_range.time_period_index                                                                                         54.20ms
[ 66.67%] ··· Running period.period_algorithm.time_period_index_drop_duplicates                                                                                            33.60μs
[ 70.83%] ··· Running period.period_algorithm.time_period_index_value_counts                                                                                                1.07ms
[ 75.00%] ··· Running period.period_algorithm.time_period_series_drop_duplicates                                                                                           16.89ms
[ 79.17%] ··· Running period.period_algorithm.time_period_series_value_counts                                                                                               6.42ms
[ 83.33%] ··· Running period.period_setitem.time_period_setitem                                                                                                             failed
[ 87.50%] ··· Running plotting.plot_timeseries_period.time_plot_timeseries_period                                                                                         294.17ms
[ 91.67%] ··· Running timeseries.timeseries_iter_periodindex.time_timeseries_iter_periodindex                                                                                4.01s
[ 95.83%] ··· Running timeseries.timeseries_iter_periodindex_preexit.time_timeseries_iter_periodindex_preexit                                                              40.75ms
[100.00%] ··· Running timeseries.timeseries_period_downsample_mean.time_timeseries_period_downsample_mean                                                                  24.24ms
    before     after       ratio
  [768bf495] [81262a57]
-   54.20ms    14.96ms      0.28  period.create_period_index_from_date_range.time_period_index

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

By the way, I just realized that period.period_setitem.time_period_setitem fails even on master.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 8, 2016

@jreback

@jreback jreback closed this in 81819b7 Aug 8, 2016
@jreback
Copy link
Contributor

jreback commented Aug 8, 2016

thanks!

@agraboso agraboso deleted the fix-13730 branch August 9, 2016 19:36
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 1, 2018
* Remove frequencies.get_standard_freq
* Drop the "freqstr" keyword from frequencies.to_offset

Deprecated in v0.19.0

xref pandas-devgh-13874
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 1, 2018
* Remove frequencies.get_standard_freq
* Drop the "freqstr" keyword from frequencies.to_offset

Deprecated in v0.19.0

xref pandas-devgh-13874
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 1, 2018
* Remove frequencies.get_standard_freq
* Drop the "freqstr" keyword from frequencies.to_offset

Deprecated in v0.19.0

xref pandas-devgh-13874
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 1, 2018
* Remove frequencies.get_standard_freq
* Drop the "freqstr" keyword from frequencies.to_offset

Deprecated in v0.19.0

xref pandas-devgh-13874
jreback pushed a commit that referenced this pull request Jan 2, 2018
* Remove frequencies.get_standard_freq
* Drop the "freqstr" keyword from frequencies.to_offset

Deprecated in v0.19.0

xref gh-13874
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

period_range creates wrong dates when freq has multiple offsets
4 participants