
infer_dtype() function slower in latest version #28814

Closed
borisaltanov opened this issue Oct 7, 2019 · 2 comments · Fixed by #30202
Labels
Performance Memory or execution speed performance
Comments

@borisaltanov

Code Sample

from datetime import datetime

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

print(pd.__version__)

RUN_COUNT = 5

df = pd.DataFrame(np.ones((100000, 1000)))

avg_times = []

for _ in range(RUN_COUNT):

    start = datetime.now()

    for col in df.columns:
        infer_dtype(df[col], skipna=True)

    avg_times.append(datetime.now() - start)

print('Average time: ', np.mean(avg_times))

Problem description

When I run the above code on my machine, there is a major performance difference between versions. I ran the code in fresh conda environments, each created with Python 3.7 and the specific version of pandas. The table below shows the execution times.

Pandas version    Execution time
0.23.4            0:00:00.013999
0.25.1            0:00:15.623905

The code above is only an example. In most cases I need this function to get a more specific type for a column than the type returned by the .dtype attribute (usually when working with mixed data).

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Oct 27, 2019
@jreback
Contributor

jreback commented Nov 1, 2019

for completeness
0.23.4

In [7]: %timeit infer_dtype(df[0], skipna=False)                                                                                                        
6.86 µs ± 74.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit infer_dtype(df[0], skipna=True)                                                                                                         
6.92 µs ± 94.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

master

In [28]: %timeit infer_dtype(df[0], skipna=True)                                                                                                         
16.7 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %timeit infer_dtype(df[0], skipna=False)                                                                                                        
1.29 ms ± 4.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

we should be short-circuiting on a numpy dtype; maybe this was changed

@borisaltanov if you want to take a look would be appreciated.
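A sketch of the dtype short-circuit suggested above (the function name and mapping here are hypothetical illustrations, not pandas internals): for a well-typed numpy array, the dtype alone already determines the answer, so no per-element scan is needed, and only object-dtype arrays would fall through to element-wise inference.

```python
import numpy as np

# Hypothetical fast path: map numpy dtype kind codes straight to
# an inferred-type string without touching individual elements.
_DTYPE_KIND_MAP = {
    "f": "floating",
    "i": "integer",
    "u": "integer",
    "b": "boolean",
    "M": "datetime64",
    "m": "timedelta64",
}

def infer_dtype_fast_path(values):
    """Return the inferred kind from dtype alone, or None when the
    array is object dtype and needs element-wise inference."""
    return _DTYPE_KIND_MAP.get(values.dtype.kind)

print(infer_dtype_fast_path(np.ones(100_000)))  # floating
```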

@groutr
Contributor

groutr commented Dec 11, 2019

The block of code that seems to be responsible is:

pandas/pandas/_libs/lib.pyx

Lines 1262 to 1263 in ad0539b

if skipna:
values = values[~isnaobj(values)]

That code was introduced in this PR: #23422

Profiling master with the test case above:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000   29.204    0.029   89.633    0.090 missing.pyx:91(isnaobj)
100001000   28.189    0.000   38.350    0.000 nattype.pyx:767(is_null_datetimelike)
100001000   22.086    0.000   60.436    0.000 missing.pyx:26(checknull)
100001000   10.160    0.000   10.160    0.000 util.pxd:78(is_float_object)
     1000    2.365    0.002   92.055    0.092 lib.pyx:1097(infer_dtype)

The call to isnaobj to build the NaN mask turns out to be expensive.
EDIT: isnaobj also converts every value in the column to a Python object before calling checknull on it.
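To illustrate the cost being described: the mask is built by visiting each element as a Python object, whereas for a numeric dtype a vectorized np.isnan mask does the same job in one pass over the raw buffer. A rough pure-Python comparison (this mirrors the shape of the problem, not the actual Cython code; timings will vary by machine):

```python
import numpy as np
from timeit import timeit

values = np.ones(100_000)

def mask_objectwise(values):
    # Mirrors the slow pattern: visit each element as a Python
    # object and test it for null individually (NaN != NaN is True).
    return np.fromiter(
        (v != v for v in values.tolist()), dtype=bool, count=len(values)
    )

def mask_vectorized(values):
    # What a dtype-aware path could do for float arrays.
    return np.isnan(values)

print("object-wise:", timeit(lambda: values[~mask_objectwise(values)], number=10))
print("vectorized: ", timeit(lambda: values[~mask_vectorized(values)], number=10))
```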
