4.3 Non-capturing, and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we'll look at that first.

Perl 5 added several additional features to standard regular expressions, and the Python re module supports most of them. It would have been difficult to choose new single-keystroke metacharacters or new special sequences beginning with "\" to represent the new features, without making Perl's regular expressions confusingly different from standard REs. If you chose "&" as a new metacharacter, for example, old expressions would be assuming that "&" was a regular character and wouldn't have escaped it by writing \& or [&]. The solution chosen was to use (?...) as the extension syntax. "?" immediately after a parenthesis was a syntax error, because the "?" would have nothing to repeat, so this doesn't introduce any compatibility problems. The characters immediately after the "?" indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a "P", you know that it's a extension that's specific to Python. Currently there are two such extensions: (?P<name>...) defines a named group, and (?P=name) is a backreference to a named group. If future versions of Perl 5 add similar features using a different syntax, the re module will be changed to support the new syntax, while preserving the Python-specific syntax for compatibility's sake.

Now that we've looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs. Since groups are numbered from left to right, and a complex expression may use many groups, it can become difficult to keep track of the correct numbering, and modifying such a complex RE is annoying. Insert a new group near the beginning, and you change the numbers of everything that follows it.

First, sometimes you'll want to use a group to collect a part of a regular expression, but aren't interested in retrieving the group's contents. You can make this fact explicit by using a non-capturing group: (?:...), where you can put any other regular expression inside the parentheses.

>>> m = re.match("([abc])+'', "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

Except for the fact that you can't retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as "*", and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing group, since you can add new groups without changing how all the other groups are numbered. It should be mentioned that there's no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.

The second, and more significant, feature, is named groups; instead of referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Except for associating a name with a group, named groups also behave identically to capturing groups. The MatchObject methods that deal with capturing groups all accept either integers, to refer to groups by number, or a string containing the group name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here's an example RE from the imaplib module:

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
	r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')
It's obviously much easier to retrieve m.group('zonem'), instead of having to remember to retrieve group 9.

Since the syntax for backreferences refers to the number of the group, in an expression like (...)\1, there's naturally a variant that uses the group name instead of the number. This is also a Python extension: (?P=name) indicates that the contents of the group called name should again be found at the current point. The regular expression for finding doubled words, (\b\w+)\s+\1 can also be written as (?P<word>\b\w+)\s+(?P=word):

>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'