performance of Series label-indexing with a list of labels #16285

Closed

jhrmnn opened this issue May 8, 2017 · 5 comments · Fixed by #16295
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)
Milestone: 0.20.2

Comments

@jhrmnn commented May 8, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import random
dct = dict(zip(range(1000), random.randint(1000, size=1000)))
keys = random.randint(1000, size=1000000).tolist()
%timeit [dct[k] for k in keys]
# 86.2 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
sdct = pd.Series(dct)
%timeit sdct[keys]
# 673 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

I would expect the Series performance to be comparable to, if not faster than, the Python comprehension.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

@chris-b1 (Contributor) commented May 8, 2017

Yeah, this is slow. Note that it is much faster if the keys are wrapped in an array.

In [11]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 79.5 ms per loop

In [12]: %timeit sdct[keys]
1 loop, best of 3: 645 ms per loop

In [13]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 65.1 ms per loop

The bottleneck seems to be in clean_index_list.
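A quick way to confirm this locally (a sketch using the standard-library profiler; the exact internal function names and module paths vary across pandas versions):

import cProfile

# Profile the slow lookup in the same session where sdct and keys
# are defined; sort by cumulative time to find the dominant call.
cProfile.run('sdct[keys]', sort='cumtime')
# The cumulative-time column should show the index/list-conversion
# machinery dominating, rather than the hash lookups themselves.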

@chris-b1 added the Performance and Indexing labels on May 8, 2017
@jhrmnn (Author) commented May 8, 2017

Ah, I see, good catch, could have tried that.

This makes it comparable in speed to the comprehension. Shouldn't it be significantly faster, though? I assume the comprehension is interpreted, whereas the Series lookup is a single call into a C extension. Or does it boil down to the relative efficiency of pandas's and CPython's hash tables?

@chris-b1 (Contributor) commented May 8, 2017

The bulk of the time in this operation is actually in placing the new values, not the hash-table lookups. Below I skip the hash table in both cases:

In [154]: vals = list(dct.values())

# python
In [157]: %timeit [vals[x] for x in keys]
10 loops, best of 3: 64.4 ms per loop

In [158]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 80.5 ms per loop
# 16.1 ms of "overhead"

# pandas / numpy
In [160]: %timeit sdct.values.take(keys)
10 loops, best of 3: 52.2 ms per loop

In [159]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 62.3 ms per loop
# 10.1 ms of "overhead"

You are right that these are C operations that avoid Python overhead, but the comprehension versions are also basic Python ops on optimized data structures, so there's not as much pickup as you might guess.
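For reference, a rough decomposition of what the array-keyed lookup does, using only public API (a sketch; the actual pandas code path differs in detail):

import numpy as np

arr_keys = np.asarray(keys)

# Step 1: the hash-table lookups - map each label to its integer
# position in the index.
indexer = sdct.index.get_indexer(arr_keys)

# Step 2: placing the values - gather from the underlying array
# by position.
result = sdct.values.take(indexer)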

@jorisvandenbossche (Member) commented

> Bottleneck seems to be in clean_index_list

This is indeed the case:

In [29]: %timeit pd._libs.lib.clean_index_list(keys)
1 loop, best of 3: 532 ms per loop

@chris-b1 do you understand the purpose of _ensure_index / clean_index_list in this case? (or @jreback?)
I mean, why is it checking, for each item in the list, whether that item is itself a list?
This check seems to have been introduced in 27e34a4 (at that point still in Python), but that was for setting as an axis, not getting. If that is the only use case, we could certainly split this off from the current _ensure_index so it is not used for getting.
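For context, a pure-Python sketch of the kind of per-item scan being discussed (illustrative only; the real clean_index_list is Cython and handles more cases):

import numpy as np

def clean_index_list_sketch(values):
    # Scan every element to decide whether the input is a flat list of
    # scalars or contains nested lists (which would need different,
    # MultiIndex-style handling). For a list of a million keys, this
    # O(n) Python-level type check alone is expensive.
    if all(not isinstance(v, (list, tuple)) for v in values):
        return np.asarray(values)
    return values  # leave nested input for the caller to handle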

jreback added a commit to jreback/pandas that referenced this issue May 9, 2017
@jreback added this to the 0.20.2 milestone May 9, 2017
@jreback (Contributor) commented May 9, 2017

So you can look at #16295, but this is actually quite subtle. You cannot simply np.asarray things, otherwise NumPy converts things oddly in some cases. This is why np.array works above: it's exactly what we do when we can convert the list cleanly (e.g. a list of integers).
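Two small examples of the odd conversions being referred to (plain NumPy behavior, not pandas-specific):

import numpy as np

np.asarray([1, 'a'])
# -> array(['1', 'a'], dtype='<U21' on most platforms): the integer
#    is silently coerced to a string

np.asarray([(1, 2), (3, 4)])
# -> a 2-D array of shape (2, 2), not a 1-D array of tuple labels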

jreback added a commit that referenced this issue May 9, 2017
pawroman pushed a commit to pawroman/pandas that referenced this issue May 9, 2017
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
TomAugspurger pushed a commit that referenced this issue May 30, 2017
closes #16285
(cherry picked from commit ce4eef3)
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017