Regular Expressions: The Wart on the Bum of Every Language in Existence



Regular expression syntax has historically been elusive and strange. Mastery of its syntax requires many hours of experimentation in a silent room, by yourself. It’s a grueling exercise in patience, practice and even intuition when documentation fails or simply does not exist.

Like some bizarre, lone shamanic journey, I emerged from regular expression ‘study’, proclaiming that “I have found it, and now I understand”. That was when my family told me that I need to get out into the real world more often, and step away from the computer. Not everyone will appreciate the time and energy placed into regular expression mastery, which made it seem like the cool secret only hard core UNIX geeks knew fifteen years ago. Now it’s no secret, it’s extremely useful, yet still a huge pain in the bum.

Much of the regular expression syntax used today evolved from the ed/sed and awk UNIX commands. They had some odd syntax, but were not so bad to work with.

ed/sed were (and still are) search and replace commands. The syntax usually involved a ‘from’ pattern and a ‘to’ pattern, looking something like this:

sed 's/FROM/TO/' filename

For deletions, it only required a ‘from’ pattern, similar to this:

sed '/REMOVE_THIS_LINE/d'

It was very powerful in its day, and still is useful in certain contexts. An excellent sed reference can be found here:

http://www.student.northpark.edu/pemente/sed/sed1line.txt
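For readers who think more easily in a scripting language, here is a rough sketch of what the two sed commands above do, written with Python’s ‘re’ module. The function names substitute_first and delete_matching are made up for this illustration, and Python’s pattern dialect is not identical to sed’s, so treat it as an analogy rather than a drop-in replacement.

```python
import re

def substitute_first(lines, pattern, replacement):
    """Rough analogue of sed 's/FROM/TO/': replace the first match on each line."""
    return [re.sub(pattern, replacement, line, count=1) for line in lines]

def delete_matching(lines, pattern):
    """Rough analogue of sed '/PATTERN/d': drop every line containing a match."""
    return [line for line in lines if not re.search(pattern, line)]

sample = ["FROM here, FROM there", "REMOVE_THIS_LINE please", "nothing to do"]

# Only the first FROM on each line changes, because the sed command has no 'g' flag.
print(substitute_first(sample, r"FROM", "TO"))

# The second line disappears; the other two pass through untouched.
print(delete_matching(sample, r"REMOVE_THIS_LINE"))
```

Like sed, both helpers look at one line at a time, which is exactly the limitation discussed next.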

Patterns had to exist entirely on one line for sed to work correctly. This was a huge limitation when trying to process what constitutes a record for your data set, or data spanning multiple lines. Then along came ‘awk’, which allowed you to change the field delimiter and record separator to almost anything you wish (not just ‘ ‘ and ‘\n’), refer to matched text as numbered groups, and do some clever advanced pattern matching and substitution.

Here is a decent awk document:

http://www.vectorsite.net/tsawk.html
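To make the record separator and field delimiter idea concrete, here is a small Python sketch that mimics it. The helper parse_records and the sample data are invented for this example; awk itself does far more, so this only shows the shape of the idea: a blank line ends a record, a colon ends a field, and a record is free to span several lines.

```python
def parse_records(text, record_sep="\n\n", field_sep=":"):
    """Split text into records, then split each record into stripped fields."""
    records = []
    for raw in text.split(record_sep):
        raw = raw.strip()
        if raw:  # skip empty chunks left over from trailing separators
            records.append([field.strip() for field in raw.split(field_sep)])
    return records

# Each record spans two lines, which line-at-a-time sed handles poorly.
data = "alice:admin:\n/home/alice\n\nbob:user:\n/home/bob\n"

for fields in parse_records(data):
    # fields[0] plays the role of awk's $1, fields[2] the role of $3
    print(fields[0], fields[2])
```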

As time went on, regular expression syntax, lovingly referred to as regex, matured. It grew hair, then it stopped shaving, and things got messy again.

Regex modules in various languages support many of the basic constructs of sed and awk, with only slight variations in syntax. Conceptually, it would not hurt to learn sed and awk basics, then move up to more recent regex modules, because it will broaden the mind and help you learn the overall zeitgeist of regex syntax and concepts.

Perl became known to me because of its mature, ‘early adulthood’ regular expression syntax. Perl gave us a syntax we spent a lot of time learning, but it came with handy metacharacters: ‘\s’ for whitespace, ‘\w’ for alphanumeric characters including underscore, ‘\S’ for non-whitespace, and the list goes on. Perl also introduced a slightly different pattern matching group syntax than awk, which turned out to be conceptually easier to grok. I personally wrote to Larry Wall, the creator of Perl, and thanked him profusely for saving me so much time and energy.

Here’s a decent Perl regular expression primer:

http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
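The metacharacters mentioned above carried over into most later regex flavors, so they can be tried out without installing Perl. Here is a quick sketch using Python’s ‘re’ module; the sample string is invented, and Perl would hand the groups back as $1 and $2 rather than through a match object.

```python
import re

text = "user_42  logged in\tat 10:15"

# \w+ grabs runs of letters, digits and underscore
words = re.findall(r"\w+", text)
print(words)      # ['user_42', 'logged', 'in', 'at', '10', '15']

# \S+ grabs runs of anything that is not whitespace, so '10:15' stays whole
tokens = re.findall(r"\S+", text)
print(tokens)     # ['user_42', 'logged', 'in', 'at', '10:15']

# \s+ matches runs of spaces and tabs, handy for normalizing spacing
collapsed = re.sub(r"\s+", " ", text)
print(collapsed)  # user_42 logged in at 10:15

# Parentheses create numbered groups, counted from the left
m = re.match(r"(\w+)\s+(\w+)", text)
print(m.group(1), m.group(2))  # user_42 logged
```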

Python’s regular expression syntax is even more robust and painful, allowing you to create functions which are called with every group match, and all sorts of convoluted loveliness. It is the equivalent of taking a quality Angus beef burger and grilling it with cardamom and curry. It’s complex, not what people expect, and an acquired taste is necessary to really appreciate it. It’s not a good place to start learning regular expression syntax, because of its complexity and lack of good documentation. There is one great document that I have permanently affixed to my desk, written by Andrew Kuchling, here:

http://www.amk.ca/python/howto/regex/

Nifty features such as finditer() let you match and iterate over the results in one step. Patterns can be compiled for efficiency, and a function can be passed in to be called on every match. Groups can be named or numbered, and reused in the same regex pattern (not really new, since the & operator existed in sed, and numbered groups existed in awk, but still very handy). Python regex is useful for solving simple to moderately difficult pattern substitution problems. But extremely difficult problems require a once hidden feature called ‘scanner’. The scanner used to be in a module called ‘sre’, only known and understood by Fredrik Lundh. He remains the guru of this syntax, and has written pages about how to use scanning and grouping for complex pattern matching:

http://effbot.org/zone/xml-scanner.htm
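Before getting to the scanner, here is a small sketch of the everyday ‘re’ features just mentioned: a compiled pattern, finditer(), named groups, a function called on every match, and a group reused inside its own pattern. The pattern names PAIR and REPEAT, the helper double, and the sample strings are all made up for illustration.

```python
import re

# Compile once, reuse many times; the groups are named 'key' and 'value'
PAIR = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")

text = "width=10 height=20 depth=5"

# finditer() yields one match object per match, in order
pairs = [(m.group("key"), int(m.group("value"))) for m in PAIR.finditer(text)]
print(pairs)    # [('width', 10), ('height', 20), ('depth', 5)]

def double(match):
    """Called once per match by sub(); its return value replaces the match."""
    return "{}={}".format(match.group("key"), int(match.group("value")) * 2)

doubled = PAIR.sub(double, text)
print(doubled)  # width=20 height=40 depth=10

# (?P=word) refers back to the named group earlier in the same pattern
REPEAT = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
deduped = REPEAT.sub(r"\g<word>", "the the quick brown fox fox")
print(deduped)  # the quick brown fox
```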

Essentially, the scanner allows you to use the pattern grouping method combined with the ability to call a function upon a group match, across a huge ‘OR’ed pattern. So your matching pattern looks something like this:

match_one_of_these_HTML_parts = re.compile("|&'\"=\s]+)|(\/\s*>)|(\s+)|(.)")

Each alternative between the OR operators ‘|’ is wrapped in parentheses, which makes it a numbered group. The scanner returns a last index which tells you which group was matched. Clever, but unsightly for the beginner. I still recommend starting with sed, awk and Perl, and acquiring the taste for the ‘re’ module.
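Here is a deliberately tiny sketch of the same trick, using an ordinary compiled pattern with finditer() and the match object’s lastindex attribute rather than the scanner object itself. The token pattern, the KINDS table and the tokenize helper are invented for this example and have nothing to do with HTML; they only show how one big OR’ed pattern plus a last index can drive a tokenizer.

```python
import re

# Four alternatives, each in its own parentheses, so each is a numbered group.
# Order matters: the catch-all (.) must come last.
TOKEN = re.compile(r"(\d+)|([A-Za-z_]\w*)|(\s+)|(.)")

# Map the group number reported by lastindex to a readable label
KINDS = {1: "number", 2: "name", 3: "space", 4: "other"}

def tokenize(text):
    """Return (kind, text) pairs, skipping whitespace."""
    tokens = []
    for m in TOKEN.finditer(text):
        kind = KINDS[m.lastindex]  # which alternative matched
        if kind != "space":
            tokens.append((kind, m.group()))
    return tokens

print(tokenize("x1 = 42+y"))
# [('name', 'x1'), ('other', '='), ('number', '42'), ('other', '+'), ('name', 'y')]
```

The lookup table could just as easily hold functions instead of labels, which gets you the “call a function upon a group match” behavior described above.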

I did not cover Java or Ruby regular expression syntax because I don’t know it. If one of the other DevChix could contribute these pieces, I’d be so thrilled.

Do you have a challenging regular expression, and need help with it, or want to brag about how you solved it? Please do share.

~G~