Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: fast inf checking in to_excel #11352

Merged
merged 1 commit into from
Oct 17, 2015
Merged

Conversation

chris-b1
Copy link
Contributor

Adds new functions to check for infinity rather than calling np.isposinf and np.isneginf, which were (surprising to me) a significant drag on to_excel.

I also added xlsxwriter to the asv build configuration. openpyxl is still failing, I'm assuming something do with the specific version on conda?

    before     after       ratio
  [472e6e0e] [8002555d]
   156.36ms   154.16ms      0.99  packers.packers_read_excel.time_packers_read_excel
     failed     failed       n/a  packers.packers_write_excel_openpyxl.time_packers_write_excel_openpyxl
   469.69ms   357.49ms      0.76  packers.packers_write_excel_xlsxwriter.time_packers_write_excel_xlsxwriter
   368.96ms   270.60ms      0.73  packers.packers_write_excel_xlwt.time_packers_write_excel_xlwt

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance IO Excel read_excel, to_excel labels Oct 17, 2015
@jorisvandenbossche jorisvandenbossche added this to the 0.17.1 milestone Oct 17, 2015
@jorisvandenbossche
Copy link
Member

Looks good to me!

@jreback
Copy link
Contributor

jreback commented Oct 17, 2015

I'd rather make a bigger change to how the xls writing is done, to make it much more performant. The idea is to copy what is done for csv writing.

You chunk write, then convert the columns by blocks with to_native_types() (you would have to expand this to for example handle the inf conversions, it already handles the nan and float_format for FloatBlock). these already return object arrays.

Then you simply iterate and yield the ExcelCells, but no conversion is done (as it was all done before), so should be faster/simpler.

@chris-b1
Copy link
Contributor Author

Yeah, I had actually started looking into that - this particular change was low hanging fruit, so I thought it might make sense to do first.

There probably is some performance to be picked up still, but an awful lot of the time is spent in the actual writing steps now.

In [1]: df = pd.DataFrame({'b': np.linspace(0, 1000, 100000), 'c': np.linspace(0, 100, 100000)})

In [2]: %prun df.to_excel('temp.xlsx')


   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.707    0.707    5.212    5.212 excel.py:1410(write_cells)
   300000    0.699    0.000    1.773    0.000 xmlwriter.py:129(_xml_number_element)
     1842    0.612    0.000    0.612    0.000 {built-in method compress}
   300001    0.568    0.000    0.685    0.000 format.py:1817(_format_regular_rows)
   300000    0.543    0.000    1.092    0.000 worksheet.py:482(write_number)
   300002    0.542    0.000    2.490    0.000 worksheet.py:5085(_write_cell)
        1    0.512    0.512    3.912    3.912 worksheet.py:4911(_write_rows)
600004/300002    0.471    0.000    2.439    0.000 worksheet.py:51(cell_wrapper)
   300002    0.430    0.000    2.175    0.000 worksheet.py:349(write)
   300002    0.400    0.000    0.400    0.000 {pandas.json.dumps}
  1500255    0.398    0.000    0.398    0.000 {isinstance}
   502071    0.390    0.000    0.390    0.000 {method 'write' of 'file' objects}
   500157    0.285    0.000    0.923    0.000 codecs.py:353(write)
   300002    0.267    0.000    0.267    0.000 worksheet.py:3459(_check_dimensions)
       11    0.267    0.024    0.267    0.024 {method 'close' of 'file' objects}
   300002    0.265    0.000    0.388    0.000 format.py:1707(_format_value)
   500157    0.252    0.000    0.252    0.000 {_codecs.utf_8_encode}
   400143    0.238    0.000    0.365    0.000 xmlwriter.py:180(_escape_attributes)
   300003    0.207    0.000    1.280    0.000 format.py:1931(get_formatted_cells)
   500157    0.188    0.000    1.111    0.000 codecs.py:692(write)
   300002    0.184    0.000    0.225    0.000 excel.py:494(_conv_value)
   300002    0.157    0.000    0.302    0.000 utility.py:604(supported_datetime)
   300002    0.136    0.000    0.136    0.000 utility.py:37(xl_rowcol_to_cell_fast)
   100001    0.134    0.000    0.507    0.000 worksheet.py:5038(_write_row)
   100001    0.132    0.000    0.354    0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
   400167    0.127    0.000    0.127    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   300002    0.116    0.000    0.116    0.000 format.py:1644(__init__)
        1    0.116    0.116    0.136    0.136 worksheet.py:4989(_calculate_spans)
   300000    0.111    0.000    0.189    0.000 <string>:8(__new__)
   300004    0.077    0.000    0.077    0.000 {built-in method __new__ of type object at 0x000000001E2A0670}
   100037    0.059    0.000    0.238    0.000 xmlwriter.py:70(_xml_end_tag)
   300000    0.054    0.000    0.054    0.000 worksheet.py:4515(_isinf)

@jreback
Copy link
Contributor

jreback commented Oct 17, 2015

@chris-b1 agreed. Ok rebase this (prob a whatsnew conflict). The can look at other improvements if desired.

@chris-b1
Copy link
Contributor Author

@jreback - rebased

@chris-b1 chris-b1 mentioned this pull request Oct 17, 2015
2 tasks
jreback added a commit that referenced this pull request Oct 17, 2015
PERF: fast inf checking in to_excel
@jreback jreback merged commit 2431425 into pandas-dev:master Oct 17, 2015
@chris-b1 chris-b1 deleted the excel-infcheck branch October 21, 2015 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants