4.2 Grouping

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are also often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ":". This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header's value.

Groups are marked by the "(", ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. You can then repeat the contents of a group with a repeating quantifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".

>>> import re
>>> p = re.compile('(ab)*')
>>> print(p.match('ababababab').span())
(0, 10)

Groups indicated with "(", ")" also capture the starting and ending index of the text that they match; these can be retrieved by passing a group number to the match object methods group(), start(), end(), and span(). (Later we'll see how to express groups that don't capture the span of text that they match.) Groups are numbered starting with 0. Group 0 is always present; it's the whole RE, so these methods all have group 0 as their default argument.

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
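
start(), end(), and span() accept a group number in the same way as group(). The following is a minimal sketch; the pattern '(ab)(cd)' is a throwaway example chosen only to give two groups with different positions.

```python
import re

# Illustrative pattern with two capturing groups.
m = re.match('(ab)(cd)', 'abcd')

whole = m.span()      # group 0: the whole match, (0, 4)
first = m.span(1)     # where group 1 ('ab') matched, (0, 2)
second = m.span(2)    # where group 2 ('cd') matched, (2, 4)

# start() and end() return the two halves of span() separately.
second_start = m.start(2)   # 2
second_end = m.end(2)       # 4

print(whole, first, second, second_start, second_end)
```

Slicing the original string with a group's span gives the same text as group() for that number.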

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups.

>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

>>> m.groups()
('abc', 'b')
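
Returning to the header line mentioned at the start of this section, groups() makes it easy to pull out both parts at once. This is only a sketch: the name header_re and the pattern are assumptions made for illustration, and the pattern is a deliberate simplification that ignores details of the real RFC-822 syntax such as continuation lines.

```python
import re

# Hypothetical, simplified pattern: group 1 is everything up to the
# first colon (the header name), group 2 is the rest (the value).
header_re = re.compile(r'([^:]+):\s*(.*)')

m = header_re.match('From: author@example.com')

# groups() returns the subgroups from 1 upward as a tuple.
name, value = m.groups()
print(name)    # From
print(value)   # author@example.com
```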

Backreferences allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.
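
The difference between the two kinds of string literal can be checked directly. In this small sketch (the variable names are illustrative), the ordinary literal turns \1 into a single control character, while the raw literal keeps the backslash and the digit that the RE engine needs to see.

```python
# Ordinary literal: backslash plus digits is an octal escape,
# so this is the single character with code 1.
plain = '\1'

# Raw literal: the backslash is kept, giving two characters.
raw = r'\1'

print(len(plain))   # 1
print(len(raw))     # 2
```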

For example, the following RE detects doubled words in a string.

>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this aren't often useful for just searching through a string, since few text formats repeat data in this way, but you'll soon find out that they're very useful when performing string substitutions.
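
As a brief preview, here is a minimal sketch that reuses the doubled-word RE from above with the sub() method, which has not been introduced yet in this section. The replacement string r'\1' stands for the text captured by group 1, so each doubled word is replaced by a single copy.

```python
import re

# Same doubled-word pattern as in the example above.
p = re.compile(r'(\b\w+)\s+\1')

# In the replacement, \1 refers to whatever group 1 captured.
fixed = p.sub(r'\1', 'Paris in the the spring')
print(fixed)   # Paris in the spring
```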