Russ Cox: ‘Regular Expression Matching Can Be Simple and Fast’

Interesting low-level white paper on the design of regular expression engines. The problem, as Cox sees it, is that most popular regex engines, while they run quickly in nearly all cases, are susceptible to certain patterns which can take extraordinarily long times to execute. (In practical terms, they never finish.)

However, there’s quite a bit of hand-waving in Cox’s proposed solution, which is a modified version of a regular expression engine devised by Unix co-creator Ken Thompson 40 years ago. Cox’s implementation doesn’t support backreferences or any modern regex syntax — I’d be a lot more intrigued if he hadn’t left these features as exercises for the reader. And I think Cox makes too much of the fact that what we now call “regular expressions” no longer fit the formal mathematical definition of regular expressions. I’m with Jeffrey Friedl here — this just means “regular expression” has taken on a new meaning, not that we should abandon the non-regular syntax.

(Thanks to Jim Correia.)

Tuesday, 13 February 2007