performance of Series label-indexing with a list of labels #16285

Closed

jhrmnn opened this issue May 8, 2017 · 5 comments · Fixed by #16295
Labels: Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)
Milestone: 0.20.2

Comments

@jhrmnn commented May 8, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import random
dct = dict(zip(range(1000), random.randint(1000, size=1000)))
keys = random.randint(1000, size=1000000).tolist()
%timeit [dct[k] for k in keys]
# 86.2 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
sdct = pd.Series(dct)
%timeit sdct[keys]
# 673 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

I would expect the Series performance to be comparable to, if not faster than, the Python comprehension.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

@chris-b1 (Contributor) commented May 8, 2017

Yeah, this is slow. Note that it is much faster if the keys are wrapped in an array.

In [11]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 79.5 ms per loop

In [12]: %timeit sdct[keys]
1 loop, best of 3: 645 ms per loop

In [13]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 65.1 ms per loop

The bottleneck seems to be in clean_index_list.
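A quick way to confirm this locally (a sketch using the standard-library profiler; the exact internal function names and module paths vary across pandas versions):

import cProfile

# Profile the slow lookup in the same session where sdct and keys
# are defined; sort by cumulative time to find the dominant call.
cProfile.run('sdct[keys]', sort='cumtime')
# The cumulative-time column should show the index/list-conversion
# machinery dominating, rather than the hash lookups themselves.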

@chris-b1 added the Performance and Indexing labels on May 8, 2017
@jhrmnn (Author) commented May 8, 2017

Ah, I see, good catch, could have tried that.

This makes it comparable in speed to the comprehension. Shouldn't it be significantly faster, though? I assume the comprehension is interpreted, whereas the Series lookup is a single call into a C extension. Or does it boil down to the relative efficiency of pandas's and CPython's hash tables?

@chris-b1 (Contributor) commented May 8, 2017

The bulk of the time in this operation is actually in placing the new values, not the hash-table lookups. Below I skip the hash table in both cases:

In [154]: vals = list(dct.values())

# python
In [157]: %timeit [vals[x] for x in keys]
10 loops, best of 3: 64.4 ms per loop

In [158]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 80.5 ms per loop
# 16.1 ms of "overhead"

# pandas / numpy
In [160]: %timeit sdct.values.take(keys)
10 loops, best of 3: 52.2 ms per loop

In [159]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 62.3 ms per loop
# 10.1 ms of "overhead"

You are right that these are C operations that avoid Python overhead, but the comprehension versions are also basic Python ops on optimized data structures, so there's not as much pickup as you might guess.
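For reference, a rough decomposition of what the array-keyed lookup does, using only public API (a sketch; the actual pandas code path differs in detail):

import numpy as np

arr_keys = np.asarray(keys)

# Step 1: the hash-table lookups - map each label to its integer
# position in the index.
indexer = sdct.index.get_indexer(arr_keys)

# Step 2: placing the values - gather from the underlying array
# by position.
result = sdct.values.take(indexer)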

@jorisvandenbossche (Member) commented

> Bottleneck seems to be in clean_index_list

This is indeed the case:

In [29]: %timeit pd._libs.lib.clean_index_list(keys)
1 loop, best of 3: 532 ms per loop

@chris-b1 do you understand the purpose of _ensure_index / clean_index_list in this case? (or @jreback?)
I mean, why is it checking, for each item in the list, whether that item is itself a list?
This check seems to have been introduced in 27e34a4 (at that point still in Python), but that was for setting as an axis, not getting. If that is the only use case, we could certainly split this off from the current _ensure_index so it is not used for getting.
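For context, a pure-Python sketch of the kind of per-item scan being discussed (illustrative only; the real clean_index_list is Cython and handles more cases):

import numpy as np

def clean_index_list_sketch(values):
    # Scan every element to decide whether the input is a flat list of
    # scalars or contains nested lists (which would need different,
    # MultiIndex-style handling). For a list of a million keys, this
    # O(n) Python-level type check alone is expensive.
    if all(not isinstance(v, (list, tuple)) for v in values):
        return np.asarray(values)
    return values  # leave nested input for the caller to handle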

jreback added a commit to jreback/pandas that referenced this issue May 9, 2017
@jreback added this to the 0.20.2 milestone May 9, 2017
@jreback (Contributor) commented May 9, 2017

So you can look at #16295, but this is actually quite subtle. You cannot simply np.asarray things, otherwise NumPy converts things oddly in some cases. This is why np.array works above: it's exactly what we do when we can convert the list cleanly (e.g. a list of integers).
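Two small examples of the odd conversions being referred to (plain NumPy behavior, not pandas-specific):

import numpy as np

np.asarray([1, 'a'])
# -> array(['1', 'a'], dtype='<U21' on most platforms): the integer
#    is silently coerced to a string

np.asarray([(1, 2), (3, 4)])
# -> a 2-D array of shape (2, 2), not a 1-D array of tuple labels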

jreback added a commit that referenced this issue May 9, 2017
pawroman pushed a commit to pawroman/pandas that referenced this issue May 9, 2017
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
TomAugspurger pushed a commit that referenced this issue May 30, 2017
closes #16285
(cherry picked from commit ce4eef3)
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017