Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Mergesort is unstable when ascending=False #6399

Closed
jdreaver opened this issue Feb 18, 2014 · 2 comments
Closed

BUG: Mergesort is unstable when ascending=False #6399

jdreaver opened this issue Feb 18, 2014 · 2 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@jdreaver
Copy link

The Issue

When using DataFrame.sort_by(kind='mergesort'), sorting is supposed to be stable. Unfortunately, that is not the case when ascending=False.

# Create a DataFrame where the first column is already in descending order.
In [96]: df = pd.DataFrame([[2, 'first'], [2, 'second'], [1, 'a'], [1, 'b']], columns=['sort_col', 'order'])

In [97]: df
Out[97]: 
   sort_col   order
0         2   first
1         2  second
2         1       a
3         1       b

[4 rows x 2 columns]

# Look at the 'order' column. Clearly not stable.
In [98]: df.sort_index(by='sort_col', kind='mergesort', ascending=False)
Out[98]: 
   sort_col   order
1         2  second
0         2   first
3         1       b
2         1       a

[4 rows x 2 columns]

More Info

Inside the sort_by() source code, argsort() is called on the sorted column. Then, ascending is checked and if it is False, the indexes are simply reversed. Here is the relevant code snippet in pandas/core/frame.py:

k = self[by].values
...
indexer = k.argsort(kind=kind)
...
if not ascending:
    indexer = indexer[::-1]

Since numpy always sorts in ascending order, this actually guarantees sorting is always unstable! Check this out:

In [110]: df['sort_col'].values.argsort(kind='mergesort')[::-1]
Out[110]: array([1, 0, 3, 2])

Clearly, simply reversing the indices doesn't work. We need to sort a reversed k, reverse the indices, and then subtract the indices from the highest index so they correspond to the original k:

In [112]: 3 - df['sort_col'].values[::-1].argsort(kind='mergesort')[::-1]
Out[112]: array([0, 1, 2, 3])

The workaround in my code is to stable sort descending is reverse the DataFrame, sort ascending, and reverse again.

What is the best way to fix this? This is probably an easy fix, but I've never contributed to pandas, so I need to set up my fork and make sure I can run tests before working on a pull request.

My Versions

Here are my versions:

In [75]: show_versions()

INSTALLED VERSIONS
------------------
Python: 3.3.3.final.0
OS: Linux
Release: 3.12.9-2-ARCH
Processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.0
Cython: 0.20
Numpy: 1.8.0
Scipy: 0.13.0
statsmodels: Not installed
    patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.2
pytz: 2013.9
bottleneck: Not installed
PyTables: Not Installed
    numexpr: Not Installed
matplotlib: 1.3.1
openpyxl: 1.8.2
xlrd: 0.9.2
xlwt: Not installed
xlsxwriter: Not installed
sqlalchemy: Not installed
lxml: Not installed
bs4: Not installed
html5lib: Not installed
bigquery: Not installed
apiclient: Not installed
@jreback jreback added this to the 0.14.0 milestone Feb 18, 2014
@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

thanks for bringing this up.
See here for some howtos:
https://github.com/pydata/pandas/wiki

essentially create a test that fails on the current codebase (e.g. your example above),
and stick it in say tests/test_frame.py

then put in your fix and see if it fixes your test and DOESN't break anything else. if it does you have to either fix your code or possibly change a test if it is incorrect.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2014

closed via 8517c05

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants