
ENH: str.extractall for several matches #11386

Closed

Conversation

tdhock
Contributor

@tdhock tdhock commented Oct 20, 2015

For a series S, the excellent S.str.extract method returns the first match in each subject of the series:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data = {
...     'Dave': 'dave@google.com',
...     'multiple': 'rob@gmail.com some text steve@gmail.com',
...     'none': np.nan,
...     }
>>> pattern = r'''
... (?P<user>[a-z]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> S = pd.Series(data)
>>> S.str.extract(pattern, re.VERBOSE)
          user  domain  tld
Dave      dave  google  com
multiple   rob   gmail  com
none       NaN     NaN  NaN
>>> 

That's great, but sometimes we want to extract all matches in each element of the series. You can do that with S.str.findall but its result does not include the names specified in the capturing groups of the regular expression:

>>> S.str.findall(pattern, re.VERBOSE)
Dave                           [(dave, google, com)]
multiple    [(rob, gmail, com), (steve, gmail, com)]
none                                             NaN
dtype: object
>>> 

I propose the S.str.extractall method which returns a Series the same length as the subject S. Each element of the series is a DataFrame with a row for each match and a column for each group:

>>> result = S.str.extractall(pattern, re.VERBOSE)
>>> result[0]
   user  domain  tld
0  dave  google  com
>>> result[1]
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
>>> result[2]
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> 

Before I write any more testing code, can we start a discussion about whether or not this is an acceptable design choice, in relation to the other functionality of pandas? @sinhrks @jorisvandenbossche @jreback @mortada since you seem to be discussing extract in #10103

Also do you have any ideas about how to get the result (a Series of DataFrames) to print more nicely? With my current fork we have

>>> result
Dave                   user  domain  tld
0  dave  google  com
multiple        user domain  tld
0    rob  gmail  com
1  s...
none        Empty DataFrame
Columns: [user, domain, tld]
I...
dtype: object
>>> 

In R the equivalent functionality is provided by the https://github.com/tdhock/namedCapture package (str_match_all_named returns a list of data.frames), and the resulting printout is readable because of the way that R prints lists:

> library(namedCapture)
> S <- c(
+   Dave='dave@google.com',
+   multiple='rob@gmail.com some text steve@gmail.com',
+   none=NA)
> pattern <- paste0(
+   "(?P<user>[a-z]+)",
+   "@",
+   "(?P<domain>[a-z]+)",
+   "[.]",
+   "(?P<tld>[a-z]{2,4})")
> str_match_all_named(S, pattern)
$Dave
     user   domain   tld  
[1,] "dave" "google" "com"

$multiple
     user    domain  tld  
[1,] "rob"   "gmail" "com"
[2,] "steve" "gmail" "com"

$none
<0 x 0 matrix>

> 

@jreback
Contributor

jreback commented Oct 20, 2015

never embed DataFrames within a Series!

instead you could simply return a multi-indexed frame, e.g. the first level is the original index of the Series, the 2nd level is the match number, with the columns as you have them

@jreback
Contributor

jreback commented Oct 20, 2015

In [13]: DataFrame({'user' : ['dave','rob','steve',np.nan],
                              'domain' : ['google.com','gmail.com','gmail.com',np.nan]},
                              index=pd.MultiIndex.from_tuples([('Dave',0),('multiple',0),('multiple',1),('none',0)]))
Out[13]: 
                domain   user
Dave     0  google.com   dave
multiple 0   gmail.com    rob
         1   gmail.com  steve
none     0         NaN    NaN
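
The suggested shape makes per-subject lookup a plain `.loc` call. A minimal sketch, rebuilding the frame above by hand (`idx` and `df` are illustrative names, not part of the PR):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the multi-indexed result shown above.
idx = pd.MultiIndex.from_tuples(
    [('Dave', 0), ('multiple', 0), ('multiple', 1), ('none', 0)])
df = pd.DataFrame(
    {'domain': ['google.com', 'gmail.com', 'gmail.com', np.nan],
     'user': ['dave', 'rob', 'steve', np.nan]},
    index=idx)

# All matches for one subject come back as an ordinary DataFrame,
# indexed by match number:
print(df.loc['multiple'])
```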

@jreback
Contributor

jreback commented Oct 20, 2015

@tdhock further you would probably add a parameter for how to handle partial matches (e.g. the some and text fields): would you be greedy and then NaN-fill, or skip unless all groups are filled (what you are doing now)?
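
The two options can be illustrated with an optional group. A sketch of the greedy-then-NaN-fill behavior, using the `extractall` API as it eventually shipped (assumes pandas >= 0.18):

```python
import pandas as pd

s = pd.Series(['a1', '2'])

# With an optional letter group, the greedy option keeps the partial
# match for '2' and NaN-fills the missing group instead of skipping it.
out = s.str.extractall(r'(?P<letter>[ab])?(?P<digit>\d)')
print(out)
```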

@jreback jreback added Enhancement Strings String extension data type and string data labels Oct 20, 2015
@tdhock
Contributor Author

tdhock commented Oct 21, 2015

Thanks for the suggestions @jreback. Now extractall returns a DataFrame with a MultiIndex:

>>> import re
>>> import pandas as pd
>>> import numpy as np
>>> data_dict = {
...     'single': {
...         "Dave":'dave@google.com',
...         "Toby":'tdhock5@gmail.com',
...         "Maude":'maudelaperriere@gmail.com',
...         },
...     'multiple': {
...         "robAndSteve": 'rob@gmail.com some text steve@gmail.com',
...         "abcdef": 'a@b.com some text c@d.com and e@f.com',
...         },
...     'none': {
...         "missing":np.nan,
...         "empty":"",
...         },
...     }
>>> tuple_list = []
>>> subject_list = []
>>> for k1, d in data_dict.items():
...     for k2, subject in d.items():
...         k = (k1, k2)
...         tuple_list.append(k)
...         subject_list.append(subject)
... 
>>> index = pd.MultiIndex.from_tuples(tuple_list, names=("matches", "subject"))
>>> Si = pd.Series(subject_list, index)
>>> named_pattern = r'''
... (?P<user>[a-z0-9]+)
... @
... (?P<domain>[a-z]+)
... \.
... (?P<tld>[a-z]{2,4})
... '''
>>> iresult = Si.str.extractall(named_pattern, re.VERBOSE)
>>> iresult
                                 user  domain  tld
matches  subject                                  
single   Dave                    dave  google  com
         Maude        maudelaperriere   gmail  com
         Toby                 tdhock5   gmail  com
multiple robAndSteve              rob   gmail  com
         robAndSteve            steve   gmail  com
         abcdef                     a       b  com
         abcdef                     c       d  com
         abcdef                     e       f  com
>>> S = pd.Series(subject_list)
>>> result = S.str.extractall(named_pattern, re.VERBOSE)
>>> result
              user  domain  tld
0             dave  google  com
1  maudelaperriere   gmail  com
2          tdhock5   gmail  com
3              rob   gmail  com
3            steve   gmail  com
4                a       b  com
4                c       d  com
4                e       f  com
>>>

So then we can access all the matches for the subject string with rob and steve via:

>>> iresult.loc["multiple", "robAndSteve"]
                       user domain  tld
matches  subject                       
multiple robAndSteve    rob  gmail  com
         robAndSteve  steve  gmail  com
>>> result.loc[3]
    user domain  tld
3    rob  gmail  com
3  steve  gmail  com

For subjects that have 0 matches, I think it would be more consistent and user-friendly if the following returned a DataFrame with 0 rows rather than raising an exception. Is that possible using some index options?

>>> result.loc[5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1198, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "pandas/core/indexing.py", line 1342, in _getitem_axis
    self._has_valid_type(key, axis)
  File "pandas/core/indexing.py", line 1304, in _has_valid_type
    error()
  File "pandas/core/indexing.py", line 1291, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [5] is not in the [index]'
>>> iresult.loc["none", "empty"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexing.py", line 1196, in __getitem__
    return self._getitem_tuple(key)
  File "pandas/core/indexing.py", line 709, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "pandas/core/indexing.py", line 822, in _getitem_lowerdim
    result = self._handle_lowerdim_multi_index_axis0(tup)
  File "pandas/core/indexing.py", line 804, in _handle_lowerdim_multi_index_axis0
    raise e1
KeyError: 'none'
>>> 
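
Without special index options, the exception can be avoided by filtering on the level values instead of using `.loc` directly. A workaround sketch on a hand-built frame (illustrative names; not something the PR adds):

```python
import pandas as pd

df = pd.DataFrame(
    {'user': ['rob', 'steve'], 'domain': ['gmail', 'gmail']},
    index=pd.MultiIndex.from_tuples([(3, 0), (3, 1)]))

# .loc[5] raises KeyError, but a boolean mask on the first index level
# returns an empty DataFrame that keeps the columns:
missing = df[df.index.get_level_values(0) == 5]
print(missing)
```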

@tdhock
Contributor Author

tdhock commented Nov 1, 2015

I further propose the extractiter method which returns an iterator over DataFrames, one for each subject: one row for each match, one column for each group:

>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print(df)
... 
   user  domain  tld
0  dave  google  com
              user domain  tld
0  maudelaperriere  gmail  com
      user domain  tld
0  tdhock5  gmail  com
    user domain  tld
0    rob  gmail  com
1  steve  gmail  com
  user domain  tld
0    a      b  com
1    c      d  com
2    e      f  com
Empty DataFrame
Columns: [user, domain, tld]
Index: []
Empty DataFrame
Columns: [user, domain, tld]
Index: []
>>> for df in Si.str.extractiter(named_pattern, re.VERBOSE):
...   print(df["domain"])
... 
0    google
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
Name: domain, dtype: object
0    gmail
1    gmail
Name: domain, dtype: object
0    b
1    d
2    f
Name: domain, dtype: object
Series([], Name: domain, dtype: object)
Series([], Name: domain, dtype: object)
>>> 
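
`extractiter` did not make it into pandas, but the same per-subject iteration can be approximated by grouping the `extractall` result on its first index level. A sketch assuming the shipped `extractall` (pandas >= 0.18):

```python
import pandas as pd

s = pd.Series(['a1a2', 'b1'], index=['A', 'B'])
result = s.str.extractall(r'(?P<letter>[ab])(?P<digit>\d)')

# One DataFrame per subject, much like the proposed extractiter:
for subject, df in result.groupby(level=0):
    print(subject)
    print(df)
```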

@jreback
Contributor

jreback commented Nov 1, 2015

@tdhock let's just keep it straightforward for now
we don't support iterators like this so it would be a big API change

@tdhock
Contributor Author

tdhock commented Nov 1, 2015

OK, in that case I deleted extractiter, and added some more tests and examples for extractall.


>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
Contributor

why would this NOT be a multi-index here? Having a duplicated index is not at all convenient.

Contributor Author

not sure what you mean. Can you please clarify?

When the input Series is not multi-indexed, there is no reason the output DataFrame should be. This is the same as the behavior of the standard extract method:

>>> from pandas import Series
>>> S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
>>> S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
A      a     2
B      b     1
>>> S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
  letter digit
A      a     1
B      b     1
C    NaN   NaN
>>> e_df = S.str.extract("(?P<letter>[ab])(?P<digit>\d)")
>>> e_df.index
Index([u'A', u'B', u'C'], dtype='object')
>>> e_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> ea_df = S.str.extractall("(?P<letter>[ab])(?P<digit>\d)")
>>> ea_df.index
Index([u'A', u'A', u'B'], dtype='object')
>>> ea_df.keys()
Index([u'letter', u'digit'], dtype='object')
>>> 
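
For experimenting with lookups, the duplicated-index `ea_df` shape above can be reproduced by hand (a sketch; the `extractall` that eventually shipped returns a MultiIndex instead):

```python
import pandas as pd

# Hand-built copy of the flat, duplicated index shown above.
flat = pd.DataFrame({'letter': ['a', 'a', 'b'], 'digit': ['1', '2', '1']},
                    index=['A', 'A', 'B'])

# Selecting one subject returns both rows, but nothing in the index
# distinguishes the first match from the second:
print(flat.loc['A'])
```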

Contributor

.extract returns a like-index object to the original.

the proposed .extractall by definition will have duplicates of some of the index elements. This is very different. This by its very nature should return a MultiIndex (or, if the input already has a MultiIndex, add a level).

@jreback
Contributor

jreback commented Nov 25, 2015

can you rebase / squash and i'll take a look

@tdhock tdhock force-pushed the extractall branch 2 times, most recently from 4140521 to 21bc58f Compare November 25, 2015 16:22
@tdhock
Contributor Author

tdhock commented Nov 25, 2015

OK @jreback this is my first time doing a rebase / squash. Did I do it correctly?

@jreback
Contributor

jreback commented Nov 25, 2015

yes that looks right

@jreback
Contributor

jreback commented Nov 25, 2015

  • add to api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • update text.rst
  • update test in test_categorical if needed (this implements .str as well)


>>> S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
        letter digit
  match
Contributor

I think this should have the Index name from the original Series (could be None), for the first level, the 2nd level is ok as match. pls add tests for that as well.

Contributor Author

In fact the name of the index is taken from the original Series, and it could be None.

In [3]: S = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [5]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[5]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0        NaN     1

In [6]: S.str.extractall("(?P<letter>[ab])?(?P<digit>\d)").index
Out[6]: 
MultiIndex(levels=[[u'A', u'B', u'C'], [0, 1]],
           labels=[[0, 0, 1, 2], [0, 1, 0, 0]],
           names=[None, u'match'])

In [7]: Sn = Series(["a1a2", "b1", "c1"], ["A", "B", "C"])

In [10]: Sn.index.name = "capital"

In [12]: Sn.str.extractall("(?P<letter>[ab])?(?P<digit>\d)")
Out[12]: 
              letter digit
capital match             
A       0          a     1
        1          a     2
B       0          b     1
C       0        NaN     1

Do you think I should add a name to the Series used in the docstring?

Contributor

not necessary, just make sure you have a test for the name. Checking names is now the default (on series/frame comparisons)

Contributor Author

OK, indeed there is already a test for a subject Series with a named index.

@tdhock
Contributor Author

tdhock commented Nov 26, 2015

TODO update docstrings

@tdhock
Contributor Author

tdhock commented Nov 26, 2015

TODOs

  • double check api.rst
  • add a note in v0.18.0.txt (use this PR number as the issue number), in Enhancement
  • double check text.rst
  • add extractall tests to test_categorical? (this implements .str as well) Is this needed? There are no tests for extract in test_categorical.

@jreback
Contributor

jreback commented Nov 26, 2015

.str functions are all tested in test_categorical - only the ones that need args are special cased

S = pd.Series(["a1a2", "b1", "c1"], ["A", "B", "C"])
S.str.extract("[ab](?P<digit>\d)")

the ``extractall`` method (introduced in version 0.18)
Contributor

use a versionadded tag here

Contributor Author

OK

@tdhock tdhock force-pushed the extractall branch 3 times, most recently from d32db93 to 89be755 Compare January 29, 2016 16:57
@tdhock
Contributor Author

tdhock commented Jan 29, 2016

I rebased with master and removed the duplications in the whatsnew file.

I ran the tests on my machine but I am getting many errors that are unrelated to my PR

thocking@silene:~/pandas(extractall)$ nosetests pandas/tests/test_strings.py
.E..EE.EEE....E........E..EEE..EE..E......EEEEE....EEE...EEE.EE.E..E..EE......E.EE...
======================================================================
ERROR: test_capitalize (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------

======================================================================
ERROR: test_title (pandas.tests.test_strings.TestStringMethods)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/thocking/pandas/pandas/tests/test_strings.py", line 288, in test_title
    tm.assert_almost_equal(mixed, exp)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas/src/testing.c:3342)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas/src/testing.c:2287)
  File "/home/thocking/pandas/pandas/core/series.py", line 558, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/thocking/pandas/pandas/indexes/base.py", line 1920, in get_value
    raise IndexError(key)
IndexError: 0

----------------------------------------------------------------------
Ran 85 tests in 0.369s

FAILED (errors=34)

@jreback
Contributor

jreback commented Jan 29, 2016

did you rebuild the extensions, e.g. make?

@tdhock
Contributor Author

tdhock commented Jan 29, 2016

thanks for the tip. After rebuilding the extensions all tests pass on my machine.

@@ -201,9 +207,106 @@ and optional groups like

.. ipython:: python

pd.Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
pd.Series(['a1', 'b2', '3']).str.extract('([ab])?(\d)')
Contributor

make extract and extractall sub-sections (I think you might have to use ^^^^) as the sub-headings

Contributor Author

OK

@jreback
Contributor

jreback commented Feb 1, 2016

@tdhock just a couple of doc changes. ping when pushed and green and we'll merge.

groups_or_na = _groups_or_na_fun(regex)

if regex.groups == 1:
result = np.array([groups_or_na(val)[0] for val in arr], dtype=object)
Contributor Author

groups_or_na(subject) should be easier to understand than f(subject)

@tdhock
Contributor Author

tdhock commented Feb 9, 2016

OK @jreback I think I have addressed all your concerns.

@jreback jreback closed this in 67730dd Feb 9, 2016
@jreback
Contributor

jreback commented Feb 9, 2016

@tdhock thanks! great PR! and you put up with our comments!

only last thing: http://pandas-docs.github.io/pandas-docs-travis/ will have the built docs (may take a bit of time as Travis is sometimes queued). This builds all docs & doc-strings. Have a look and pls issue a followup-PR if anything needs clarification / formatting.

jreback pushed a commit that referenced this pull request Feb 10, 2016
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes #12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications
cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#11386 from tdhock/extractall and squashes the following commits:

0c1c3d1 [Toby Dylan Hocking] ENH: extract(expand), extractall
cldy pushed a commit to cldy/pandas that referenced this pull request Feb 11, 2016
This PR clarifies the new documentation for extract and extractall.
It was requested by @jreback in
pandas-dev#11386 (comment)

Author: Toby Dylan Hocking <tdhock5@gmail.com>

Closes pandas-dev#12281 from tdhock/extract-docs and squashes the following commits:

2019d1b [Toby Dylan Hocking] DOC: extract/extractall clarifications