python3: regression for unique on dtype=object arrays with varying items types (Trac #2188) #641
Any fresh ideas on this issue?
The only options for implementing unique on arbitrary objects are sorting, hashing, and all-against-all comparison. Only the sorting and hashing strategies have reasonable speed, and only the sorting and all-against-all strategies have reasonable memory overhead for large arrays. So I guess we could add fallback options to unique, where if sorting doesn't work then it tries one of the other strategies? But OTOH it's not nice to have a function that sometimes suddenly takes massively more CPU or memory depending on what input you give it. I guess I'd be +1 on a patch that adds a strategy={"sort", "hash", "bruteforce"} option to np.unique, so users with weird data can decide what makes sense for their situation. If you want to write such a thing :-)
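For illustration, here is a hedged sketch of what such a strategy= option might look like. `unique_by_strategy` and its keyword are hypothetical names invented for this sketch, not a real NumPy API:

```python
import numpy as np

def unique_by_strategy(a, strategy="sort"):
    """Hypothetical sketch of a strategy option for unique.

    "sort":       O(n log n); needs a total order (what np.unique does today).
    "hash":       ~O(n) time and memory; needs hashable elements.
    "bruteforce": O(n**2) all-against-all; needs only == comparisons.
    """
    a = np.asarray(a, dtype=object).ravel()
    if strategy == "sort":
        return np.unique(a)
    if strategy == "hash":
        seen = set()
        # keep first-seen order; set membership uses hash() and ==
        out = [x for x in a if not (x in seen or seen.add(x))]
        return np.array(out, dtype=object)
    if strategy == "bruteforce":
        out = []
        for x in a:
            # assumes == on the elements returns a plain bool
            if not any(x == y for y in out):
                out.append(x)
        return np.array(out, dtype=object)
    raise ValueError("unknown strategy %r" % strategy)
```

Note that "hash" and "bruteforce" can disagree with "sort" about order, and all three share Python's == semantics (so 1, 1.0, and True collapse to one value).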
At first I wondered if it could be sorting, plus a hash table for the unsortable items (I didn't check whether cmp of the elements is used when sorting a dtype=object array) -- so sorting first, with hashing as the fallback.
Just for those who might need a workaround, here is how I do it via 'hashing' through the built-in sets for my case:
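The snippet itself didn't survive the formatting; a minimal reconstruction of the set-based idea (an assumption on my part, requiring all elements to be hashable, and note that sets do not preserve order):

```python
import numpy as np

def set_unique(a):
    # Hashing-based unique: works for any hashable objects and never
    # uses '<', but the result order is arbitrary (sets are unordered).
    return np.array(list(set(a)), dtype=object)

u = set_unique(np.array([1, 2, None, "str", 2], dtype=object))
```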
Not sure what you just said :-) But on further thought, sorting is really not reliable for dtype=object. If the objects are hashable then you can just do set(arr) to get the unique values.
ugh... ok -- a crude description in Python:

import numpy as np

def bucketed_unique(a):
    # group elements into per-type buckets so each bucket is sortable
    buckets = {}
    for x in a:
        t = type(x)
        if t not in buckets:
            buckets[t] = bucket = []
        else:
            bucket = buckets[t]
        bucket.append(x)
    out = []
    for bucket in buckets.values():  # itervalues() was Python 2 only
        # here could actually be a set of conditions instead of a blind try/except
        try:
            out.append(np.unique(bucket))
        except TypeError:
            out.append(np.array(list(set(bucket)), dtype=object))
    return np.hstack(out)
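As a quick demonstration that the bucketing idea works on the kind of array that makes np.unique fail, here is a condensed, self-contained Python 3 paraphrase of that sketch (bucketed_unique is just this thread's helper, not a NumPy function):

```python
import numpy as np

def bucketed_unique(a):
    # Bucket elements by type so each bucket sorts cleanly, then
    # take the per-bucket unique values and concatenate them.
    buckets = {}
    for x in a:
        buckets.setdefault(type(x), []).append(x)
    out = []
    for bucket in buckets.values():
        try:
            out.append(np.unique(bucket))
        except TypeError:  # unsortable even within one type bucket
            out.append(np.array(list(set(bucket)), dtype=object))
    # concatenation promotes the mixed per-bucket dtypes to object
    return np.hstack(out)

mixed = np.array([1, 2, None, "str", 2], dtype=object)
print(bucketed_unique(mixed))  # one entry per distinct value
```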
Sure thing -- no 'bucketing' should be done for non-object ndarrays.
That algorithm doesn't use == as its definition of uniqueness. Objects of different types can still compare equal (e.g. 1 == 1.0), so bucketing by type changes the semantics.
Indeed! Not sure, but maybe a post-hoc analysis across buckets would make sense... BTW, ATM the problem reveals itself also for comparison with complex numbers:
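Concretely, a minimal illustration of the cross-type comparison behaviour in plain Python:

```python
# Equality holds across Python's numeric types, but ordering does not,
# so any sort-based unique breaks down once a complex number appears.
assert 1 == 1.0 == (1 + 0j)   # cross-type equality works

try:
    sorted([1 + 0j, 1])        # '<' is not defined for complex
    raised = False
except TypeError:
    raised = True
assert raised
```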
Although on second thought -- what should the dtype of the 'unique' values then be, among all the available choices (int/float/complex)? With a non-object array it is clear; with a heterogeneous object array, not so -- maybe the different dtypes should even be maintained as such...
Here's the way I solved it: order the ints before the strings in object dtypes. It uses the pandas hashtable implementation, but could easily be swapped out / adapted to C code, I think.
Anyone want to take a swing at this? Not sure what to do about record dtypes.
Any updates on this? I encountered this bug when trying to use scikit-learn's LabelEncoder on pandas DataFrame columns with dtype "object" and missing values.
This one is really old. Is it still relevant?
Seems to be the case, at least with 1.15.4 in Debian: $> python3 --version
Python 3.6.5
$> PYTHONPATH=.. python3 -c 'import numpy as np; print(np.__version__); print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
1.15.4
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 233, in unique
ret = _unique1d(ar, return_index, return_inverse, return_counts)
File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Definitely still relevant. Just came across this, trying to call np.unique on an object array. Regarding the question of how to make this work when sorting is undefined: I'd much prefer a slow algorithm over the status quo of raising an error. (In my experience, oftentimes, if you need performant algorithms, you shouldn't be using an object array to begin with.)
I think this is a feature request, not a bug. The docs clearly state that unique returns the *sorted* unique elements of an array. For the case of an array like [1, 2, None, "str"], a sorted result is simply not defined.
It would be nice to have an option to not return a sorted array; it would allow some optimizations.
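For instance, a minimal non-sorted unique can be had in pure Python (a sketch assuming hashable elements; unsorted_unique is a hypothetical name, not a NumPy API):

```python
def unsorted_unique(a):
    # First-seen-order unique without any '<' comparisons.
    # Relies on dicts being insertion-ordered (Python 3.7+) and on the
    # elements being hashable; 1 and 1.0 collapse, matching == semantics.
    return list(dict.fromkeys(a))
```

Because no sorting happens, mixed object arrays containing None, strings, or complex numbers all work, and the single pass is O(n) instead of O(n log n).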
This was the most bothersome regression in porting about 150K lines from Py2.7 to 3.8. Navigating around this led to a lot of ugly workarounds and slower code. I see that this is caused by Guido's decision in Python, but a fast, non-sorted option for unique would be really helpful.
Same words as @mcgfeller. I would even take a slowish non-sorted version.
Original ticket http://projects.scipy.org/numpy/ticket/2188 on 2012-07-23 by @yarikoptic, assigned to unknown.
Tested against current master (present in 1.6.2 as well). With the python2.x series it works OK, without puking:
NB I will report a bug on repr here separately if not yet filed
It fails with python3.x altogether:
IMHO it must operate correctly -- the semantics of the unique() operation should not imply the ability to sort the elements.