Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: is_string_dtype is not strict #15585

Closed
jreback opened this issue Mar 5, 2017 · 5 comments · Fixed by #49378
Closed

API: is_string_dtype is not strict #15585

jreback opened this issue Mar 5, 2017 · 5 comments · Fixed by #49378
Labels
Bug Strings String extension data type and string data

Comments

@jreback
Copy link
Contributor

jreback commented Mar 5, 2017

#15533 (comment)

pandas.types.common.is_string_dtype is not strict, just checking if its a fixed-width numpy string/unicode type or object dtype. It could be made strict via something like this.

def is_string_dtype(arr_or_dtype):
    dtype = _get_dtype(arr_or_dtype)
    if isinstance(arr_or_dtype, np.ndarray):
        if not lib.infer_dtype(arr_or_dtype) in ['string', 'unicode']:
            return False

    return dtype.kind in ('O', 'S', 'U') and not is_period_dtype(dtype)

this would need performance checking to see if its a problem. Further this then changes the API a tiny bit (as we allow an object OR a dtype to be passed in). Which is probably ok as long as its documented a bit.

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Mar 5, 2017
@jreback jreback added this to the 0.20.0 milestone Mar 5, 2017
@jreback
Copy link
Contributor Author

jreback commented Mar 5, 2017

xref @ResidentMario

@ResidentMario
Copy link
Contributor

There definitely needs to be a strict implementation somewhere in types, I have a working-ish solution for #15533 that depends on this.

Maybe also consider leaving this method as-is and throwing in a new is_string_dtype_strict method with this signature, if performance is a problem.

How to test performance? Will timeit do, or need to add something to asv_bench?

@jreback
Copy link
Contributor Author

jreback commented Mar 5, 2017

asv is the place for things like this.

@jreback
Copy link
Contributor Author

jreback commented Mar 5, 2017

@ResidentMario you shouldn't need this fyi

@jreback jreback modified the milestones: 0.20.0, 0.21.0 Mar 23, 2017
@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017
@ivirshup
Copy link
Contributor

ivirshup commented Feb 6, 2019

Any updates on this? It seems like behavior changed in v0.24.0, but not necessarily for the better.

import pandas as pd
from pandas.api.types import is_string_dtype

is_string_dtype(pd.Series(pd.Categorical([1,2,3])))
# False in v0.23.4
# True in v0.24.1

# Still:
is_string_dtype(pd.Series([(0,1), (1,1)]))
# True in v0.23.4
# True in v0.24.1

This was causing issues in a project where we were expecting the results of this check not to change between versions. On further investigation, it seems like this function is unreliable (tuples aren't strings, categoricals as strings feels very R).

@mroeschke mroeschke added the Strings String extension data type and string data label Oct 27, 2019
@mroeschke mroeschke added Bug and removed Dtype Conversions Unexpected or buggy dtype conversions labels May 3, 2020
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 25, 2021
As documented in [this pandas issue](pandas-dev/pandas#15585), `is_string_type` for pandas is not strict and will characterize a whole bunch of things as strings that aren't. For our purposes, this is problematic because basically all subclasses of `ExtensionDType` will be classified as strings by that function. This is definitely not appropriate, so I modified our version of `is_string_dtype` to explicitly reject all of our extension dtypes (previously it was only excluding categorical types). I'm not 100% confident that no other parts of the code base rely on the current (erroneous) behavior, but the cudf tests all passed for me locally and my attempt to trace all calls of `utils.is_string_dtype` all look to be places where the change gives more correct behavior, so I think our best bet is to just move forward with this change. Any problems that result from this change in the future due to other code relying on the current behavior should probably be characterized as bugs in the calling code and fixed there. The same goes for for external codes that relied on this behavior; this change is potentially breaking for them as well, but again is something that they should be addressing.

Authors:
  - Vyas Ramasubramani (@vyasr)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7710
alexreg added a commit to alexreg/pandas that referenced this issue Oct 10, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants