6.3 Greedy versus Non-Greedy

When repeating a regular expression, as in a*, the engine consumes as much of the string as possible. This often bites you when you're trying to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naïve pattern for matching a single HTML tag doesn't work because of the greedy nature of .*.

>>> import re
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the "<" in "<html>", and the .* consumes the rest of the string. There's still more left in the RE, though, and the ">" can't match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the ">". The final match extends from the "<" in "<html>" to the ">" in "</title>", which isn't what you want.
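
One way to check the backtracking behaviour described here is to compare where the greedy match ends with the position of the last ">" in the string. The following is a minimal sketch, assuming Python 3; the names m, last_gt, and greedy_end are only illustrative.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy .* first runs to the end of the string, then backs off
# one character at a time until the final ">" of the pattern can match.
m = re.match('<.*>', s)

last_gt = s.rfind('>')    # index of the last ">" in s
greedy_end = m.end()      # index just past the end of the match

# The greedy match stops right after the last ">", not the first one.
print(greedy_end == last_gt + 1)
```

Because the engine only gives back as many characters as it must, the match always ends at the last possible ">" rather than the nearest one.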

In this case, the solution is to use the non-greedy quantifiers *?, +?, ??, or {m,n}?, which match as little text as possible. With the pattern <.*?>, the ">" is tried immediately after the first "<" matches, and when it fails, the engine advances a character at a time, retrying the ">" at every step. This produces just the right result:

>>> print(re.match('<.*?>', s).group())
<html>
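
The difference is easier to see when every match in the string is collected. The sketch below assumes Python 3 and uses re.findall on the same string; the short 'aaaa' test string and the variable names are made up for illustration, to show the other non-greedy forms.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: a single match running from the first "<" to the last ">".
greedy_tags = re.findall('<.*>', s)

# Non-greedy: each match stops at the nearest ">".
lazy_tags = re.findall('<.*?>', s)

print(greedy_tags)  # ['<html><head><title>Title</title>']
print(lazy_tags)    # ['<html>', '<head>', '<title>', '</title>']

# The other non-greedy forms, tried on a run of "a" characters.
run = 'aaaa'
plus_lazy = re.match('a+?', run).group()        # 'a'  (one repetition is enough)
opt_lazy = re.match('a??', run).group()         # ''   (zero repetitions preferred)
range_lazy = re.match('a{2,3}?', run).group()   # 'aa' (the minimum of the range)
```

In each case the non-greedy form takes the fewest repetitions that still let the overall pattern succeed.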