3.2 The Backslash Plague

As stated earlier, regular expressions use the backslash character ("\") to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflictts with Python's usage of the same character for the same purpose in string literals.

Let's say you want to write a RE that matches the string "\section", which might be found in a L^ATEX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string "\\section". The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.

Characters Stage

\section Text string to be matched
\\section Escaped backslash for re.compile
"\\\\section" Escaped backslashes for a string literal

Characters	Stage
`\section`	Text string to be matched
`\\section`	Escaped backslash for `re.compile`
`"\\\\section"`	Escaped backslashes for a string literal

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be "\\", and each backslash must be expressed as "\\" inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes which makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with "r", so r"\n" is a two-character string containing "\" and "n", while "\n" is a one-character string containing a newline. Frequently regular expressions will be expressed in Python code using this raw string notation.

Regular String Raw string

"ab*" r"ab*"
"\\\\section" r"\\section"
"\\w+\\s+\\1" r"\w+\s+\1"

Regular String	Raw string
`"ab*"`	`r"ab*"`
`"\\\\section"`	`r"\\section"`
`"\\w+\\s+\\1"`	`r"\w+\s+\1"`