PERF: fast inf checking in to_excel #11352

chris-b1 · 2015-10-17T01:20:27Z

Adds new functions to check for infinity rather than calling np.isposinf and np.isneginf, which were (surprising to me) a significant drag on to_excel.

I also added xlsxwriter to the asv build configuration. openpyxl is still failing, I'm assuming something do with the specific version on conda?

    before     after       ratio
  [472e6e0e] [8002555d]
   156.36ms   154.16ms      0.99  packers.packers_read_excel.time_packers_read_excel
     failed     failed       n/a  packers.packers_write_excel_openpyxl.time_packers_write_excel_openpyxl
   469.69ms   357.49ms      0.76  packers.packers_write_excel_xlsxwriter.time_packers_write_excel_xlsxwriter
   368.96ms   270.60ms      0.73  packers.packers_write_excel_xlwt.time_packers_write_excel_xlwt

jorisvandenbossche · 2015-10-17T10:26:43Z

Looks good to me!

jreback · 2015-10-17T14:39:35Z

I'd rather make a bigger change to how the xls writing is done, to make it much more performant. The idea is to copy what is done for csv writing.

You chunk write, then convert the columns by blocks with to_native_types() (you would have to expand this to for example handle the inf conversions, it already handles the nan and float_format for FloatBlock). these already return object arrays.

Then you simply iterate and yield the ExcelCells, but no conversion is done (as it was all done before), so should be faster/simpler.

chris-b1 · 2015-10-17T14:52:21Z

Yeah, I had actually started looking into that - this particular change was low hanging fruit, so I thought it might make sense to do first.

There probably is some performance to be picked up still, but an awful lot of the time is spent in the actual writing steps now.

In [1]: df = pd.DataFrame({'b': np.linspace(0, 1000, 100000), 'c': np.linspace(0, 100, 100000)})

In [2]: %prun df.to_excel('temp.xlsx')


   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.707    0.707    5.212    5.212 excel.py:1410(write_cells)
   300000    0.699    0.000    1.773    0.000 xmlwriter.py:129(_xml_number_element)
     1842    0.612    0.000    0.612    0.000 {built-in method compress}
   300001    0.568    0.000    0.685    0.000 format.py:1817(_format_regular_rows)
   300000    0.543    0.000    1.092    0.000 worksheet.py:482(write_number)
   300002    0.542    0.000    2.490    0.000 worksheet.py:5085(_write_cell)
        1    0.512    0.512    3.912    3.912 worksheet.py:4911(_write_rows)
600004/300002    0.471    0.000    2.439    0.000 worksheet.py:51(cell_wrapper)
   300002    0.430    0.000    2.175    0.000 worksheet.py:349(write)
   300002    0.400    0.000    0.400    0.000 {pandas.json.dumps}
  1500255    0.398    0.000    0.398    0.000 {isinstance}
   502071    0.390    0.000    0.390    0.000 {method 'write' of 'file' objects}
   500157    0.285    0.000    0.923    0.000 codecs.py:353(write)
   300002    0.267    0.000    0.267    0.000 worksheet.py:3459(_check_dimensions)
       11    0.267    0.024    0.267    0.024 {method 'close' of 'file' objects}
   300002    0.265    0.000    0.388    0.000 format.py:1707(_format_value)
   500157    0.252    0.000    0.252    0.000 {_codecs.utf_8_encode}
   400143    0.238    0.000    0.365    0.000 xmlwriter.py:180(_escape_attributes)
   300003    0.207    0.000    1.280    0.000 format.py:1931(get_formatted_cells)
   500157    0.188    0.000    1.111    0.000 codecs.py:692(write)
   300002    0.184    0.000    0.225    0.000 excel.py:494(_conv_value)
   300002    0.157    0.000    0.302    0.000 utility.py:604(supported_datetime)
   300002    0.136    0.000    0.136    0.000 utility.py:37(xl_rowcol_to_cell_fast)
   100001    0.134    0.000    0.507    0.000 worksheet.py:5038(_write_row)
   100001    0.132    0.000    0.354    0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
   400167    0.127    0.000    0.127    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   300002    0.116    0.000    0.116    0.000 format.py:1644(__init__)
        1    0.116    0.116    0.136    0.136 worksheet.py:4989(_calculate_spans)
   300000    0.111    0.000    0.189    0.000 <string>:8(__new__)
   300004    0.077    0.000    0.077    0.000 {built-in method __new__ of type object at 0x000000001E2A0670}
   100037    0.059    0.000    0.238    0.000 xmlwriter.py:70(_xml_end_tag)
   300000    0.054    0.000    0.054    0.000 worksheet.py:4515(_isinf)

jreback · 2015-10-17T14:54:38Z

@chris-b1 agreed. Ok rebase this (prob a whatsnew conflict). The can look at other improvements if desired.

chris-b1 · 2015-10-17T14:58:22Z

@jreback - rebased

PERF: fast inf checking in to_excel

jorisvandenbossche added Performance Memory or execution speed performance IO Excel read_excel, to_excel labels Oct 17, 2015

jorisvandenbossche added this to the 0.17.1 milestone Oct 17, 2015

PERF: fast inf checking in to_excel

c5c44e8

chris-b1 force-pushed the excel-infcheck branch from 3a6eaf8 to c5c44e8 Compare October 17, 2015 14:57

chris-b1 mentioned this pull request Oct 17, 2015

PERF: refactor ExcelFormatter #11355

Closed

2 tasks

jreback added a commit that referenced this pull request Oct 17, 2015

Merge pull request #11352 from chris-b1/excel-infcheck

2431425

PERF: fast inf checking in to_excel

jreback merged commit 2431425 into pandas-dev:master Oct 17, 2015

chris-b1 deleted the excel-infcheck branch October 21, 2015 22:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: fast inf checking in to_excel #11352

PERF: fast inf checking in to_excel #11352

chris-b1 commented Oct 17, 2015

jorisvandenbossche commented Oct 17, 2015

jreback commented Oct 17, 2015

chris-b1 commented Oct 17, 2015

jreback commented Oct 17, 2015

chris-b1 commented Oct 17, 2015

PERF: fast inf checking in to_excel #11352

PERF: fast inf checking in to_excel #11352

Conversation

chris-b1 commented Oct 17, 2015

jorisvandenbossche commented Oct 17, 2015

jreback commented Oct 17, 2015

chris-b1 commented Oct 17, 2015

jreback commented Oct 17, 2015

chris-b1 commented Oct 17, 2015