Alan Storm Explains My URL Regex Pattern

Great follow-up from Alan Storm to my piece yesterday on the regex for matching URLs. Storm does what I was too lazy to do: dissect the regex and explain how it works. I’m a big fan of the “/x” format for regex patterns, which ignores whitespace in the pattern and allows inline comments; it’s a real aid to readability and maintainability.
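
For anyone who hasn’t used the “/x” format, here is a minimal sketch of the idea in Python, where the equivalent flag is re.VERBOSE. The pattern is a deliberately simplified stand-in, not the pattern from yesterday’s piece, and the names SIMPLE_URL and find_urls are made up for illustration.

```python
import re

# Hypothetical, deliberately simplified URL pattern (NOT the pattern discussed
# in the post). Verbose mode lets each piece sit on its own line with a comment.
SIMPLE_URL = re.compile(
    r"""
    \b
    https?://      # protocol: http:// or https://
    [^\s<>]+       # everything up to the next whitespace or angle bracket
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the simplified pattern matches."""
    return SIMPLE_URL.findall(text)


print(find_urls("See http://example.com/page and https://foo.org/x?y=1 now"))
```

Because whitespace in the pattern is ignored, a literal space would have to be escaped or written as \s; that is the main trade-off of the format.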

As for Storm’s list of exceptions:

  • I didn’t bother trying to make the pattern match anything other than a single pair of balanced parentheses in a URL because I’ve never actually seen a real-life URL that contains parentheses in any other form. I was ready to get all Friedl on this pattern’s ass and let it match nested parens, but thought better of it. Why bother complicating the pattern to match something that doesn’t seem to occur in the wild? That’s what I mean about the pattern being practical.

  • As for web site addresses that lack both a protocol (like “http:”) and a “www.” prefix, it didn’t seem worth the effort for my own purposes. In the places where I use this pattern personally (I’ll write more about that soon), there’s almost always a protocol for the URLs. I special-cased “www.” because it was easy and obvious. It’s hard to match something like “example.com” without also matching something like “example.txt” unless you use a list of known TLDs, and that’s a direction I didn’t want to go.
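
Both trade-offs above can be illustrated with a rough sketch in Python. This is an assumption-laden simplification, not the actual pattern: it requires either a protocol or a “www.” prefix, and it allows balanced parentheses only one level deep. The names SKETCH and find_urls_sketch are hypothetical.

```python
import re

# Simplified illustration only; NOT the pattern from the post.
SKETCH = re.compile(
    r"""
    \b
    (?: https?:// | www\. )     # a protocol, or the special-cased www. prefix
    (?:
        [^\s()<>]+              # ordinary URL characters
      |
        \( [^\s()<>]* \)        # one balanced, non-nested pair of parentheses
    )+
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_urls_sketch(text):
    """Return every substring of text that the sketch pattern matches."""
    return [m.group(0) for m in SKETCH.finditer(text)]


sample = "(see http://en.wikipedia.org/wiki/Rust_(programming_language))"
print(find_urls_sketch(sample))
print(find_urls_sketch("visit www.example.com or example.com or example.txt"))
```

In this sketch the closing parenthesis that belongs to the surrounding sentence is left out of the match, while the balanced pair inside the URL is kept. A bare “example.com” is skipped for the same reason “example.txt” is: without a protocol, a “www.” prefix, or a TLD list, nothing tells the two apart. A nested pair such as “a(b(c))” simply ends the match before the first open paren.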

(I got a bunch of other great feedback and suggested tweaks via email; still going through them, but will post a follow-up.)

Saturday, 28 November 2009