2.2 Repeating Things

Being able to match varying sets of characters is the first thing regular expressions can do that isn't already possible with Python's string module. However, if that was the only additional capability of regexes, they wouldn't be much of an advance over string. The second capability is that you can specify that portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at it is *. * doesn't match the literal character "*"; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.

For example, ca*t will match "ct" (0 "a"characters), "cat" (1 "a"), "caaat" (3 "a"characters), and so forth. The RE engine has various internal limitations, stemming from the size of C's int type, that will prevent it from matching over 2 billion "a" characters; you probably don't have enough memory to construct a string that large, so you shouldn't run into that limit.

Repetitions such as * are greedy; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with few repetitions.

A step-by-step example will make this more obvious. Let's consider the expression a[bcd]*b. This matches the letter "a", zero or more letters from the class [bcd], and finally ends with a "b". Now imagine matching this RE against the string "abcbd".

Step   Matched   Explanation  
1 a The a in the RE matches. 
2 abcbd The engine matches [bcd]*, going as far as it can, which is to the end of the string. 
3 Failure The engine tries to match b, but the current position is at the end of the string, so it fails. 
4 abcb Back up, so that [bcd]* matches one less character. 
5 Failure Try b again, but the current position is at the last character, which is a "d". 
6 abc Back up again, so that [bcd]* is only matching "bc". 
6 abcb Try b again. This time but the character at the current position is "b", so it succeeds. 

The end of the RE has now been reached, and it has matched "abcb". This demonstrates how the matching engine goes as far as it can at first, and if no match is found, it will then progressively back up and retry the rest of the RE again and again. It will back up until it's tried zero matches for [bcd]*, and if that subsequently fails, it will conclude that the string doesn't match the RE at all.

Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and
regexp+; * matches zero or more times, so whatever's being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match "cat" (1 "a"), "caaat" (3 "a"'s), but won't match "ct".

There are two more repeating qualifiers. The question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional. For example, home-?brew matches either "homebrew" or "home-brew".

The most complicated repeated qualifier is {m,n}, where m and n are decimal numbers. This means there must be at least m repetitions, and at most n. For example, a/{1,3}b will match "a/b", "a//b", and "a///b". It won't match "ab", which has no slashes, or "a////b", which has four.

You can omit either m or n; in that case, a reasonable value is assumed for the missing value. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity -- actually, the 2 billion limit mentioned earlier, but that might as well be infinity.

Readers of a reductionist bent may notice that the 3 other qualifiers can all be expressed using this notation. {0,} is the same as *, {1,} is equivalent to +, and {0,1} is the same as ?. It's better to use *, +, or ?, simply because they're shorter and easier to read.