
infer_dtype() function slower in latest version #28814

Closed
borisaltanov opened this issue Oct 7, 2019 · 2 comments · Fixed by #30202
Labels
Performance Memory or execution speed performance
Comments

@borisaltanov

Code Sample

from datetime import datetime

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

print(pd.__version__)

RUN_COUNT = 5

df = pd.DataFrame(np.ones((100000, 1000)))

avg_times = []

for _ in range(RUN_COUNT):

    start = datetime.now()

    for col in df.columns:
        infer_dtype(df[col], skipna=True)

    avg_times.append(datetime.now() - start)

print('Average time: ', np.mean(avg_times))

Problem description

When I run the above code on my machine, there is a major performance difference between versions. I ran the code in fresh conda environments, each created with Python 3.7 and the specific version of pandas. The table below shows the execution times.

Pandas version    Execution time
0.23.4            0:00:00.013999
0.25.1            0:00:15.623905

The code above is only an example. In most cases I need this function to get a more specific type for a column than the type returned by the .dtype attribute (usually when working with mixed data).

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Oct 27, 2019
@jreback
Contributor

jreback commented Nov 1, 2019

for completeness
0.23.4

In [7]: %timeit infer_dtype(df[0], skipna=False)                                                                                                        
6.86 µs ± 74.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: %timeit infer_dtype(df[0], skipna=True)                                                                                                         
6.92 µs ± 94.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

master

In [28]: %timeit infer_dtype(df[0], skipna=True)                                                                                                         
16.7 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [29]: %timeit infer_dtype(df[0], skipna=False)                                                                                                        
1.29 ms ± 4.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

we should be short-circuiting on a numpy dtype; maybe this was changed

@borisaltanov if you want to take a look would be appreciated.
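A sketch of the dtype short-circuit suggested above (the function name and mapping here are hypothetical illustrations, not pandas internals): for a well-typed numpy array, the dtype alone already determines the answer, so no per-element scan is needed, and only object-dtype arrays would fall through to element-wise inference.

```python
import numpy as np

# Hypothetical fast path: map numpy dtype kind codes straight to
# an inferred-type string without touching individual elements.
_DTYPE_KIND_MAP = {
    "f": "floating",
    "i": "integer",
    "u": "integer",
    "b": "boolean",
    "M": "datetime64",
    "m": "timedelta64",
}

def infer_dtype_fast_path(values):
    """Return the inferred kind from dtype alone, or None when the
    array is object dtype and needs element-wise inference."""
    return _DTYPE_KIND_MAP.get(values.dtype.kind)

print(infer_dtype_fast_path(np.ones(100_000)))  # floating
```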

@groutr
Contributor

groutr commented Dec 11, 2019

The block of code that seems to be responsible is:

pandas/pandas/_libs/lib.pyx

Lines 1262 to 1263 in ad0539b

if skipna:
values = values[~isnaobj(values)]

That code was introduced in this PR: #23422

Profiling master with the test case above:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000   29.204    0.029   89.633    0.090 missing.pyx:91(isnaobj)
100001000   28.189    0.000   38.350    0.000 nattype.pyx:767(is_null_datetimelike)
100001000   22.086    0.000   60.436    0.000 missing.pyx:26(checknull)
100001000   10.160    0.000   10.160    0.000 util.pxd:78(is_float_object)
     1000    2.365    0.002   92.055    0.092 lib.pyx:1097(infer_dtype)

The call to isnaobj to build the NaN mask turns out to be expensive.
EDIT: isnaobj also converts every value in the column to a Python object before calling checknull on it.
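To illustrate the cost being described: the mask is built by visiting each element as a Python object, whereas for a numeric dtype a vectorized np.isnan mask does the same job in one pass over the raw buffer. A rough pure-Python comparison (this mirrors the shape of the problem, not the actual Cython code; timings will vary by machine):

```python
import numpy as np
from timeit import timeit

values = np.ones(100_000)

def mask_objectwise(values):
    # Mirrors the slow pattern: visit each element as a Python
    # object and test it for null individually (NaN != NaN is True).
    return np.fromiter(
        (v != v for v in values.tolist()), dtype=bool, count=len(values)
    )

def mask_vectorized(values):
    # What a dtype-aware path could do for float arrays.
    return np.isnan(values)

print("object-wise:", timeit(lambda: values[~mask_objectwise(values)], number=10))
print("vectorized: ", timeit(lambda: values[~mask_vectorized(values)], number=10))
```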
