4.1 More Metacharacters

There are some metacharacters that we haven't covered yet. Most of them will be covered in this section.

Some of the remaining metacharacters to be discussed are zero-width assertions. They don't cause the engine to advance through the string; they consume no characters, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn't changed by the \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.
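As a small illustration of what "consuming no characters" means, the following sketch looks at the span of a match. The sample string and variable names are only illustrative.

```python
import re

# \b on its own succeeds at the first word boundary and matches
# the empty string: the start and end of the match are the same.
m = re.search(r'\b', 'hello world')
boundary_span = m.span()
boundary_text = m.group()

# Because the assertion doesn't move the position, the text right
# after the boundary is still there for the rest of the pattern.
m2 = re.search(r'\bworld', 'hello world')
word_span = m2.span()

print(boundary_span)
print(word_span)
```

The first span has equal start and end positions, while the second covers only the five characters of "world".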

|

Alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence, in order to make it work reasonably when you're alternating multi-character strings. Crow|Servo will match either "Crow" or "Servo", not "Cro", a "w" or an "S", and "ervo".

To match a literal "|", use \|, or enclose it inside a character class, as in [|].
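The following sketch tries out the alternation and the two ways of writing a literal "|". The sample strings are made up for the example.

```python
import re

# | has low precedence, so the alternatives are the two whole words.
p = re.compile(r'Crow|Servo')
names = p.findall('Crow and Tom Servo')
partial = p.match('Cro')          # "Cro" alone is not one of the alternatives

# Two ways to match a literal "|".
escaped = re.search(r'a\|b', 'x a|b y')
in_class = re.search(r'a[|]b', 'x a|b y')

print(names)
print(partial)
print(escaped.group())
print(in_class.group())
```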

^

Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.

For example, if you wish to match the word "From" only at the beginning of a line, the RE to use is ^From.

>>> print re.match('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.match('^From', 'Reciting From Memory')
None

To match a literal "^", use \^, or enclose it inside a character class, as in [\^].
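To see the effect of the MULTILINE flag on ^, here is a short sketch that counts matches with and without it. The three-line sample text is invented for the example.

```python
import re

# Made-up sample text with "From" at the start of two lines.
text = 'From: alice\nSubject: hi\nFrom here on'

# Without MULTILINE, ^ matches only at the start of the string.
default_hits = re.findall(r'^From', text)

# With MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r'^From', text, re.MULTILINE)

# A literal "^" can be escaped or placed in a character class.
caret_escaped = re.search(r'\^', '2^10')
caret_in_class = re.search(r'[\^]', '2^10')

print(len(default_hits))
print(len(multiline_hits))
```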

$

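Matches at the end of a line. Unless the MULTILINE flag has been set, this means either the end of the string, or the position just before a newline that ends the string. In MULTILINE mode, it also matches at any location followed by a newline character.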

>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>

To match a literal "$", use \$, or enclose it inside a character class, as in [$].
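The sketch below records where }$ matches in a two-line string, first without and then with the MULTILINE flag. The sample text is made up for the example.

```python
import re

# Two lines, each ending in "}" followed by a newline.
text = 'first}\nsecond}\n'

# Without MULTILINE, only the "}" just before the final newline matches.
default_starts = [m.start() for m in re.finditer(r'}$', text)]

# With MULTILINE, the "}" before every newline matches.
multiline_starts = [m.start() for m in re.finditer(r'}$', text, re.MULTILINE)]

# A literal "$" can be escaped or placed in a character class.
dollar_escaped = re.search(r'\$', 'cost: $5')
dollar_in_class = re.search(r'[$]', 'cost: $5')

print(default_starts)
print(multiline_starts)
```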

\A

Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, however, they're different; \A still matches only at the beginning of the string, but ^ may match at several locations inside the string (anywhere following a newline character).

\Z

Matches only at the end of the string.
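As a rough comparison of \A with ^ and of \Z with $, the following sketch uses a short string that ends with a newline. The sample text is invented for the example.

```python
import re

text = 'one\ntwo\n'

# In MULTILINE mode ^ matches at the start of every line,
# while \A still matches only at the start of the string.
caret_words = re.findall(r'^\w+', text, re.MULTILINE)
start_words = re.findall(r'\A\w+', text, re.MULTILINE)

# $ can match just before a trailing newline; \Z cannot.
dollar_match = re.search(r'two$', text)
z_match = re.search(r'two\Z', text)
z_with_newline = re.search(r'two\n\Z', text)

print(caret_words)
print(start_words)
print(z_match)
```

If you need to be sure that nothing at all follows the match, not even a final newline, \Z is the stricter choice.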

\b

Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches "class" only when it's a complete word; it won't match when it's contained inside another word.

>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None

There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python's string literals and regular expression sequences. In Python's string literals, "\b" is the backspace character, ASCII value 8. If you're not using raw strings, then Python will convert the "\b" to a backspace, and your RE won't match as you expect it to. The following example looks the same as our previous RE, but omits the "r" in front of the RE string.

>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')  
<re.MatchObject instance at 80c3ee0>

Second, inside a character class, where there's no use for this assertion, \b represents the backspace character, for compatibility with Python's string literals.
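Both subtleties can be checked directly. The sketch below compares the plain and raw string literals, and then uses [\b] to search for a backspace character; the test strings are made up for the example.

```python
import re

# Without the "r" prefix, each \b becomes a single backspace character.
plain = '\bclass\b'
raw = r'\bclass\b'
lengths = (len(plain), len(raw))
starts_with_backspace = plain[0] == '\x08'

# Doubling the backslashes gives the same string as the raw literal.
same_as_doubled = raw == '\\bclass\\b'

# Inside a character class, \b stands for the backspace character.
backspace_class = re.compile(r'[\b]')
hit = backspace_class.search('a\x08c')
miss = backspace_class.search('no class at all')

print(lengths)
print(same_as_doubled)
```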

\B

Another zero-width assertion, this is the opposite of \b, only matching when the current position is not at a word boundary.
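For comparison with the \b example, this sketch runs \Bclass\B over the same three test strings. It matches only where "class" has word characters on both sides.

```python
import re

p = re.compile(r'\Bclass\B')

# "class" sits in the middle of "declassified", so neither end
# of the match is at a word boundary.
inside = p.search('the declassified algorithm')

# Here "class" is a whole word, so both ends are word boundaries.
whole_word = p.search('no class at all')

# In "subclass" the start is inside a word, but the end is a boundary.
suffix = p.search('one subclass is')

print(inside.span())
print(whole_word)
print(suffix)
```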