
ENH: str.extractall for several matches #11386

Closed

Conversation

tdhock
Contributor

@tdhock tdhock commented Oct 20, 2015

For a series S, the excellent S.str.extract method returns the first match in each subject of the series:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text steve@gmail.com',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<user>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> S.str.extract(pattern, re.VERBOSE)
          user  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>> 

That's great, but sometimes we want to extract all matches in each element of the series. You can do that with S.str.findall but its result does not include the names specified in the capturing groups of the regular expression:

>>> S.str.findall(pattern, re.VERBOSE)
Dave                           [(dave, google, com)]
multiple    [(rob, gmail, com), (steve, gmail, com)]
none                                             NaN
dtype: object
>>> 

I propose the S.str.extractall method which returns a Series the same length as the subject S. Each element of the series is a DataFrame with a row for each match and a column for each group:

>>> result = S.str.extractall(pattern, re.VERBOSE)
>>> result[0]
   user  domain  tld
0  dave  google  com
>>> result[1]
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
>>> result[2]
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> 

Before I write any more testing code, can we start a discussion about whether or not this is an acceptable design choice, in relation to the other functionality of pandas? @sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103

Also do you have any ideas about how to get the result (a Series of DataFrames) to print more nicely? With my current fork we have

>>> result
Dave                   user  domain  tld
0  dave  google  com
multiple        user domain  tld
0    rob  gmail  com
1  s...
none        Empty DataFrame
Columns: [user, domain, tld]
I...
dtype: object
>>> 

In R the equivalent functionality is provided by the https://github.com/tdhock/namedCapture package (str_match_all_named returns a list of data.frames), and the resulting printout is readable because of the way that R prints lists:

> library(namedCapture)
> S <- c(
+   Dave='dave@google.com',
+   multiple='rob@gmail.com some text steve@gmail.com',
+   none=NA)
> pattern <- paste0(
+   "(?P<user>[a-z]+)",
+   "@",
+   "(?P<domain>[a-z]+)",
+   "[.]",
+   "(?P<tld>[a-z]{2,4})")
> str_match_all_named(S, pattern)
$Dave
     user   domain   tld  
[1,] "dave" "google" "com"

$multiple
     user    domain  tld  
[1,] "rob"   "gmail" "com"
[2,] "steve" "gmail" "com"

$none
<0 x 0 matrix>

> 

@jreback
Contributor

jreback commented Oct 20, 2015

never embed DataFrames within a Series!

instead you could simply return a multi-indexed frame, e.g. the first level is the original index of the Series, the 2nd level is the match number, with the columns as you have them

@jreback
Contributor

jreback commented Oct 20, 2015

In [13]: DataFrame({'user' : ['dave','rob','steve',np.nan],
                              'domain' : ['google.com','gmail.com','gmail.com',np.nan]},
                              index=pd.MultiIndex.from_tuples([('Dave',0),('multiple',0),('multiple',1),('none',0)]))
Out[13]: 
                domain   user
Dave     0  google.com   dave
multiple 0   gmail.com    rob
         1   gmail.com  steve
none     0         NaN    NaN
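
The suggested shape makes per-subject lookup a plain `.loc` call. A minimal sketch, rebuilding the frame above by hand (`idx` and `df` are illustrative names, not part of the PR):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the multi-indexed result shown above.
idx = pd.MultiIndex.from_tuples(
    [('Dave', 0), ('multiple', 0), ('multiple', 1), ('none', 0)])
df = pd.DataFrame(
    {'domain': ['google.com', 'gmail.com', 'gmail.com', np.nan],
     'user': ['dave', 'rob', 'steve', np.nan]},
    index=idx)

# All matches for one subject come back as an ordinary DataFrame,
# indexed by match number:
print(df.loc['multiple'])
```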

@jreback
Contributor

jreback commented Oct 20, 2015

@tdhock further you would probably add a parameter for how to handle partial matches (e.g. the some and text fields): would you be greedy and then NaN-fill, or skip unless all groups are filled (what you are doing now)?
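
The two options can be illustrated with an optional group. A sketch of the greedy-then-NaN-fill behavior, using the `extractall` API as it eventually shipped (assumes pandas >= 0.18):

```python
import pandas as pd

s = pd.Series(['a1', '2'])

# With an optional letter group, the greedy option keeps the partial
# match for '2' and NaN-fills the missing group instead of skipping it.
out = s.str.extractall(r'(?P<letter>[ab])?(?P<digit>\d)')
print(out)
```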

@jreback jreback added Enhancement Strings String extension data type and string data labels Oct 20, 2015
@tdhock
Contributor Author

tdhock commented Oct 21, 2015

Thanks for the suggestions @jreback. Now extractall returns a DataFrame with a MultiIndex:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data_dict = {
...     'single': {
...         "Dave":'dave@google.com',
...         "Toby":'tdhock5@gmail.com',
...         "Maude":'maudelaperriere@gmail.com',
...         },
...     'multiple': {
...         "robAndSteve": 'rob@gmail.com some text steve@gmail.com',
...         "abcdef": 'a@b.com some text c@d.com and e@f.com',
...         },
...     'none': {
...         "missing":np.nan,
...         "empty":"",
...         },
...     }
>>> tuple_list = []
>>> subject_list = []
>>> for k1, d in data_dict.items():
...     for k2, subject in d.items():
...         k = (k1, k2)
...         tuple_list.append(k)
...         subject_list.append(subject)
... 
>>> index = pd.MultiIndex.from_tuples(tuple_list, names=("matches", "subject"))
>>> Si = pd.Series(subject_list, index)
>>> named_pattern = r'''
... (?P<user>[a-z0-9]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> iresult = Si.str.extractall(named_pattern, re.VERBOSE)
>>> iresult
                                 user  domain  tld
matches  subject                                  
single   Dave                    dave  google  com
         Maude        maudelaperriere   gmail  com
         Toby                 tdhock5   gmail  com
multiple robAndSteve              rob   gmail  com
         robAndSteve            steve   gmail  com
         abcdef                     a       b  com
         abcdef                     c       d  com
         abcdef                     e       f  com
>>> S = pd.Series(subject_list)
>>> result = S.str.extractall(named_pattern, re.VERBOSE)
>>> result
              user  domain  tld
0             dave  google  com
1  maudelaperriere   gmail  com
2          tdhock5   gmail  com
3              rob   gmail  com
3            steve   gmail  com
4                a       b  com
4                c       d  com
4                e       f  com
>>>

So then we can access all the matches for the subject string with rob and steve via:

>>> iresult.loc["multiple", "robAndSteve"]
                       user domain  tld
matches  subject                       
multiple robAndSteve    rob  gmail  com
         robAndSteve  steve  gmail  com
>>> result.loc[3]
    user domain  tld
3    rob  gmail  com
3  steve  gmail  com

For subjects that have 0 matches, I think it would be more consistent and user-friendly if the following returned a DataFrame with 0 rows rather than raising an exception. Is that possible using some index options?

>>> result.loc[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1198, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "pandas/core/indexing.py", line 1342, in _getitem_axis
    self._has_valid_type(key, axis)
  File "pandas/core/indexing.py", line 1304, in _has_valid_type
    error()
  File "pandas/core/indexing.py", line 1291, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [5] is not in the [index]'
>>> iresult.loc["none", "empty"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1196, in __getitem__
    return self._getitem_tuple(key)
  File "pandas/core/indexing.py", line 709, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "pandas/core/indexing.py", line 822, in _getitem_lowerdim
    result = self._handle_lowerdim_multi_index_axis0(tup)
  File "pandas/core/indexing.py", line 804, in _handle_lowerdim_multi_index_axis0
    raise e1
KeyError: 'none'
>>> 
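
Without special index options, the exception can be avoided by filtering on the level values instead of using `.loc` directly. A workaround sketch on a hand-built frame (illustrative names; not something the PR adds):

```python
import pandas as pd

df = pd.DataFrame(
    {'user': ['rob', 'steve'], 'domain': ['gmail', 'gmail']},
    index=pd.MultiIndex.from_tuples([(3, 0), (3, 1)]))

# .loc[5] raises KeyError, but a boolean mask on the first index level
# returns an empty DataFrame that keeps the columns:
missing = df[df.index.get_level_values(0) == 5]
print(missing)
```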

@tdhock
Contributor Author

tdhock commented Nov 1, 2015

I further propose the extractiter method which returns an iterator over DataFrames, one for each subject: one row for each match, one column for each group:

>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print(df)
... 
   user  domain  tld
0  dave  google  com
              user domain  tld
0  maudelaperriere  gmail  com
      user domain  tld
0  tdhock5  gmail  com
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
  user domain  tld
0    a      b  com
1    c      d  com
2    e      f  com
Empty DataFrame
Columns: [user, domain, tld]
Index: []
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print(df["domain"])
... 
0    google
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
1    gmail
Name: domain, dtype: object
0    b
1    d
2    f
Name: domain, dtype: object
Series([], Name: domain, dtype: object)
Series([], Name: domain, dtype: object)
>>> 
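
`extractiter` did not make it into pandas, but the same per-subject iteration can be approximated by grouping the `extractall` result on its first index level. A sketch assuming the shipped `extractall` (pandas >= 0.18):

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1'], index=['A', 'B'])
result = s.str.extractall(r'(?P<letter>[ab])(?P<digit>\d)')

# One DataFrame per subject, much like the proposed extractiter:
for subject, df in result.groupby(level=0):
    print(subject)
    print(df)
```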

@jreback
Contributor

jreback commented Nov 1, 2015

@tdhock let's just keep it straightforward for now
we don't support iterators like this so it would be a big API change

@tdhock
Contributor Author

tdhock commented Nov 1, 2015

OK, in that case I deleted extractiter, and added some more tests and examples for extractall.


>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
Contributor

why would this NOT be a multi-index here? Having a duplicated index is not at all convenient.

Contributor Author

not sure what you mean. Can you please clarify?

When the input Series is not multi-indexed, there is no reason the output DataFrame should be. This is the same as the behavior of the standard extract method:

>>> from pandas import Series
>>> S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
A      a     2
B      b     1
>>> S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
B      b     1
C    NaN   NaN
>>> e_df = S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
>>> e_df.index
Index([u'A', u'B', u'C'], dtype='object')
>>> e_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> ea_df = S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
>>> ea_df.index
Index([u'A', u'A', u'B'], dtype='object')
>>> ea_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> 
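
For experimenting with lookups, the duplicated-index `ea_df` shape above can be reproduced by hand (a sketch; the `extractall` that eventually shipped returns a MultiIndex instead):

```python
import pandas as pd

# Hand-built copy of the flat, duplicated index shown above.
flat = pd.DataFrame({'letter': ['a', 'a', 'b'], 'digit': ['1', '2', '1']},
                    index=['A', 'A', 'B'])

# Selecting one subject returns both rows, but nothing in the index
# distinguishes the first match from the second:
print(flat.loc['A'])
```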

Contributor

.extract returns a like-index object to the original.

the proposed .extractall by definition will have duplicates of some of the index elements. This is very different. This by its very nature should return a MultiIndex (or, if the input already has a MultiIndex, add a level).

@jreback
Contributor

jreback commented Nov 25, 2015

can you rebase / squash and i'll take a look

@tdhock tdhock force-pushed the extractall branch 2 times, most recently from 4140521 to 21bc58f Compare November 25, 2015 16:22
@tdhock
Contributor Author

tdhock commented Nov 25, 2015

OK @jreback this is my first time doing a rebase / squash. Did I do it correctly?

@jreback
Contributor

jreback commented Nov 25, 2015

yes that looks right

@jreback
Contributor

jreback commented Nov 25, 2015

  • add to api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • update text.rst
  • update test in test_categorical if needed (this implements .str as well)


>>> S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
        letter digit
  match
Contributor

I think this should have the Index name from the original Series (could be None), for the first level, the 2nd level is ok as match. pls add tests for that as well.

Contributor Author

In fact the name of the index is taken from the original Series, and it could be None.

In [3]: S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [5]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[5]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0        NaN     1

In [6]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)").index
Out[6]: 
MultiIndex(levels=[[u'A', u'B', u'C'], [0, 1]],
           labels=[[0, 0, 1, 2], [0, 1, 0, 0]],
           names=[None, u'match'])

In [7]: Sn = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [10]: Sn.index.name = "capital"

In [12]: Sn.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[12]: 
              letter digit
capital match             
A       0          a     1
        1          a     2
B       0          b     1
C       0        NaN     1

Do you think I should add a name to the Series used in the docstring?

Contributor

not necessary, just make sure you have a test for the name. Checking names is now the default (on series/frame comparisons)

Contributor Author

OK, indeed there is already a test for a subject Series with a named index.

@tdhock
Contributor Author

tdhock commented Nov 26, 2015

TODO update docstrings

@tdhock
Contributor Author

tdhock commented Nov 26, 2015

TODOs

  • double check api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • double check text.rst
  • add extractall tests to test_categorical? (this implements .str as well) Is this needed? There are no tests for extract in test_categorical.

@jreback
Contributor

jreback commented Nov 26, 2015

.str functions are all tested in test_categorical - only the ones that need args are special cased

S = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
S.str.extract("[ab](?P<digit>\d)")

the ``extractall`` method (introduced in version 0.18)
Contributor

use a versionadded tag here

Contributor Author

OK

@tdhock tdhock force-pushed the extractall branch 3 times, most recently from d32db93 to 89be755 Compare January 29, 2016 16:57
@tdhock
Contributor Author

tdhock commented Jan 29, 2016

I rebased with master and removed the duplications in the whatsnew file.

I ran the tests on my machine but I am getting many errors that are unrelated to my PR

thocking@silene:~/pandas(extractall)$ nosetests pandas/tests/test_strings.py
.E..EE.EEE....E........E..EEE..EE..E......EEEEE....EEE...EEE.EE.E..E..EE......E.EE...
======================================================================
ERROR: test_capitalize (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------

======================================================================
ERROR: test_title (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/thocking/pandas/pandas/tests/test_strings.py", line 288, in test_title
    tm.assert_almost_equal(mixed, exp)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas/src/testing.c:3342)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas/src/testing.c:2287)
  File "/home/thocking/pandas/pandas/core/series.py", line 558, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/thocking/pandas/pandas/indexes/base.py", line 1920, in get_value
    raise IndexError(key)
IndexError: 0

----------------------------------------------------------------------
Ran 85 tests in 0.369s

FAILED (errors=34)

@jreback
Contributor

jreback commented Jan 29, 2016

did you rebuild the extensions, e.g. make?

@tdhock
Contributor Author

tdhock commented Jan 29, 2016

thanks for the tip. After rebuilding the extensions all tests pass on my machine.

@@ -201,9 +207,106 @@ and optional groups like

.. ipython:: python

pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
Contributor

make extract and extractall sub-sections (I think you might have to use ^^^^) as the sub-headings

Contributor Author

OK

@jreback
Contributor

jreback commented Feb 1, 2016

@tdhock just a couple of doc changes. ping when pushed and green and we'll merge.

groups_or_na = _groups_or_na_fun(regex)

if regex.groups == 1:
result = np.array([groups_or_na(val)[0] for val in arr], dtype=object)
Contributor Author

groups_or_na(subject) should be easier to understand than f(subject)

@tdhock
Contributor Author

tdhock commented Feb 9, 2016

OK @jreback I think I have addressed all your concerns.

@jreback jreback closed this in 67730dd Feb 9, 2016
@jreback
Contributor

jreback commented Feb 9, 2016

@tdhock thanks! great PR! and you put up with our comments!

only last thing: http://pandas-docs.github.io/pandas-docs-travis/ will have the built docs (may take a bit of time as Travis is sometimes queued). This builds all docs & doc-strings. Have a look and pls issue a followup-PR if anything needs clarification / formatting.

jreback pushed a commit that referenced this pull request Feb 10, 2016
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes #12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications
cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#11386 from tdhock/extractall and squashes the following commits:

0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall
cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
pandas-dev#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications