4.4 Other Assertions

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:

(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)
Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn't match at the current position in the string.
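As a minimal sketch of the two forms, assuming Python's re module and made-up sample strings ("foobar" and "foobaz" are illustrative only), the following shows that a lookahead checks what follows without consuming it:

```python
import re

# Positive lookahead: "foo" matches only if "bar" follows,
# but "bar" itself is not part of the match.
pos = re.search(r"foo(?=bar)", "foobar")
pos_text = pos.group()   # the matched text excludes "bar"
pos_end = pos.end()      # the match ends right after "foo"

# Negative lookahead: "foo" matches only if "bar" does NOT follow.
neg_blocked = re.search(r"foo(?!bar)", "foobar")   # no match
neg_allowed = re.search(r"foo(?!bar)", "foobaz")   # matches "foo"
```

Because the assertion is zero-width, the end of the positive match sits immediately after "foo", and the rest of a longer pattern would continue from there.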

An example will help make this concrete, and will demonstrate a case where a lookahead is useful. Consider a simple pattern to match a filename, and split it apart into a base name and an extension, separated by a ".". For example, in "news.rc", "news" is the base name, and "rc" is the filename's extension.

The pattern to match this is quite simple: .*[.].*$. (Notice that the "." needs to be treated specially because it's a metacharacter; I've put it inside a character class. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension.) This regular expression matches "foo.bar" and "autoexec.bat" and "sendmail.cf" and "printers.conf".
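To try this out in Python, one possible sketch follows. The pattern in basic is the one from the text; the capturing groups in grouped and the helper split_name are assumptions added here to show how the base name and extension could be pulled apart.

```python
import re

# The pattern exactly as given in the text.
basic = re.compile(r".*[.].*$")

# Assumption: the same pattern with capturing groups added,
# so the two halves can be retrieved separately.
grouped = re.compile(r"(.*)[.](.*)$")

def split_name(filename):
    """Return (base, extension), or None if there is no '.' in the name."""
    m = grouped.match(filename)
    return m.groups() if m else None

matches_all = all(
    basic.match(name)
    for name in ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf"]
)
```

With this helper, split_name("news.rc") gives the base name and extension as a pair, and a name without a "." gives None.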

Now, consider complicating the problem a bit; what if you want to match filenames where the extension is not "bat"? Some incorrect attempts:

.*[.][^b].*$

First attempt: Exclude "bat" by requiring that the first character of the extension is not a "b". This is wrong, because it also rejects "foo.bar", whose extension merely begins with "b".

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring that one of the following cases match: the first character of the extension isn't "b"; the second character isn't "a"; or the third character isn't "t". This accepts "foo.bar" and rejects "autoexec.bat", but it requires a three-letter extension, and doesn't accept "sendmail.cf". That's another bug, so we'll complicate the pattern again in an effort to fix it.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, the second and third letters are both made optional in order to allow matching extensions shorter than three characters, such as "sendmail.cf".
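The behaviour of the three attempts can be compared side by side. This is only a sketch: the dictionary attempts and the helper accepted are names invented here, and the test filenames are the ones already used in the discussion.

```python
import re

# The three attempted patterns, copied from the text.
attempts = {
    "first": r".*[.][^b].*$",
    "second": r".*[.]([^b]..|.[^a].|..[^t])$",
    "third": r".*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$",
}

names = ["foo.bar", "autoexec.bat", "sendmail.cf"]

def accepted(pattern):
    """Return the sample filenames that the given pattern matches."""
    return [name for name in names if re.match(pattern, name)]

results = {label: accepted(pattern) for label, pattern in attempts.items()}
```

Here results["first"] contains only "sendmail.cf", results["second"] contains only "foo.bar", and results["third"] contains both, while "autoexec.bat" is rejected by all three.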

The pattern's getting really complicated now, which makes it hard to read and understand. When you write a regular expression, ask yourself: if you encountered this expression in a program, how hard would it be to figure out what the expression was intended to do? Worse, this solution doesn't scale well; if the problem changes, and you want to exclude both "bat" and "exe" as extensions, the pattern would get still more complicated and confusing.

A negative lookahead cuts through all this. Go back to the original pattern, and, before the .* which matches the extension, insert (?!bat$). This means: if the expression bat$ doesn't match at this point, try the rest of the pattern; if bat$ does match, the whole pattern will fail. (The trailing $ is required to ensure that something like "sample.batch", where the extension only starts with "bat", will be allowed.)

After this modification, the whole pattern is .*[.](?!bat$).*$. Excluding another filename extension is now easy; simply add it as an alternative inside the assertion. .*[.](?!bat$|exe$).*$ excludes both "bat" and "exe".
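A short sketch of the lookahead versions in Python follows. The variable names not_bat and not_bat_or_exe, and the filename "setup.exe", are made up for this illustration; the patterns themselves are the ones given above.

```python
import re

# Reject only the "bat" extension.
not_bat = re.compile(r".*[.](?!bat$).*$")

# Reject both "bat" and "exe" by adding an alternative inside the assertion.
not_bat_or_exe = re.compile(r".*[.](?!bat$|exe$).*$")

for name in ["foo.bar", "autoexec.bat", "sample.batch", "sendmail.cf", "setup.exe"]:
    verdict_one = "match" if not_bat.match(name) else "no match"
    verdict_two = "match" if not_bat_or_exe.match(name) else "no match"
    print(f"{name:14} not_bat: {verdict_one:9} not_bat_or_exe: {verdict_two}")
```

Note that "sample.batch" is still accepted, because the $ inside the assertion requires the extension to be exactly "bat" before the match is refused.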