Regular Expressions: The Wart on the Bum of Every Language in Existence
May 25th, 2007
Regular expression syntax has historically been elusive and strange. Mastering it requires many hours of experimentation in a silent room, by yourself. It’s a grueling exercise in patience, practice and even intuition when documentation fails or simply does not exist.
Like some bizarre, lone shamanic journey, I emerged from regular expression ‘study’, proclaiming that “I have found it, and now I understand”. That was when my family told me that I need to get out into the real world more often, and step away from the computer. Not everyone will appreciate the time and energy placed into regular expression mastery, which made it seem like the cool secret only hard-core UNIX geeks knew fifteen years ago. Now it’s no secret, it’s extremely useful, yet still a huge pain in the bum.
Much of the regular expression syntax used today evolved from the ed/sed and awk UNIX commands. They had some odd syntax, but were not so bad to work with.
ed/sed were (and still are) search and replace commands. The syntax usually involved a ‘from’ and a ‘to’ pattern, looking something like this:
sed 's/FROM/TO/' filename
For deletions, it only required a ‘from’ pattern, similar to this:
sed '/REMOVE_THIS_LINE/d'
It was very powerful in its day, and still is useful in certain contexts. An excellent sed reference can be found here:
http://www.student.northpark.edu/pemente/sed/sed1line.txt
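For readers who think more easily in a scripting language, here is a rough Python sketch of what the two sed commands above do, line by line. The sample text and the function names `substitute_first` and `delete_matching` are made up for illustration. sed itself uses POSIX basic regular expressions, so anything beyond a plain literal pattern may need translating.

```python
import re

def substitute_first(lines, pattern, replacement):
    """Like sed 's/FROM/TO/': replace the first match on each line."""
    return [re.sub(pattern, replacement, line, count=1) for line in lines]

def delete_matching(lines, pattern):
    """Like sed '/PATTERN/d': drop every line that contains a match."""
    return [line for line in lines if not re.search(pattern, line)]

# Hypothetical input, one string per line of the file
sample = ["FROM here FROM there", "REMOVE_THIS_LINE please", "nothing to do"]

print(substitute_first(sample, r"FROM", "TO"))
print(delete_matching(sample, r"REMOVE_THIS_LINE"))
```

Note that, as in sed without the g flag, only the first FROM on each line is replaced.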
Patterns had to exist entirely on one line for sed to work correctly. This was a huge limitation when trying to process what constitutes a record for your data set, or data spanning multiple lines. Then along came ‘awk’, which allowed you to change the field delimiter and record separator to almost anything you wish (not just ‘ ‘ and ‘\n’), refer to matched text as numbered groups, and do some clever advanced pattern matching and substitution.
Here is a decent awk document:
http://www.vectorsite.net/tsawk.html
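The record and field separator idea (awk calls them RS and FS) can be sketched in Python as two nested splits. This is only an illustration of the concept, not a replacement for awk. The function name `parse_records` and the sample address data are assumptions.

```python
def parse_records(text, record_sep="\n\n", field_sep="\n"):
    """Split text into records, then each record into fields (awk's RS and FS)."""
    records = []
    for record in text.split(record_sep):
        if record.strip():  # skip empty records
            records.append(record.strip().split(field_sep))
    return records

# Hypothetical data: records separated by a blank line, one field per line
address_book = "Ada Lovelace\nLondon\n\nGrace Hopper\nArlington\n"

for fields in parse_records(address_book):
    # fields[0] plays the role of awk's $1, fields[1] of $2, and so on
    print(fields[0], "lives in", fields[1])
```

Here a record spans several lines, which is exactly the case that was awkward in plain sed.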
As time went on, regular expression syntax, lovingly referred to as regex, matured. It grew hair, then it stopped shaving, and things got messy again.
Regex modules in various languages support many of the basic constructs of sed and awk, and only the syntax varies slightly. It would not hurt to learn sed and awk basics first, then move up to more recent regex modules, because it will broaden the mind and help you learn the overall zeitgeist of regex syntax and concepts.
Perl became known to me because of its mature, ‘early adulthood’ regular expression syntax. Perl kept the syntax we had spent a lot of time learning, and added metacharacters representing whitespace: ‘\s’, alphanumeric characters including underscore: ‘\w’, non-whitespace: ‘\S’, and the list goes on. Perl also introduced a slightly different pattern matching group syntax than awk, which turned out to be conceptually easier to grok. I personally wrote to Larry Wall, the creator of Perl, and thanked him profusely for saving me so much time and energy.
Here’s a decent Perl regular expression primer:
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
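Python’s re module accepts the same shorthand classes mentioned above, so they can be tried out without a Perl interpreter. A minimal sketch, where the sample line is made up:

```python
import re

# Hypothetical log line with mixed spaces and a tab
line = "user_42   logged in\tat 09:15"

words = re.findall(r"\w+", line)     # runs of letters, digits and underscore
fields = re.split(r"\s+", line)      # split on runs of whitespace
chunks = re.findall(r"\S+", line)    # runs of non-whitespace

print(words)
print(fields)
print(chunks)
```

The colon in 09:15 is not a word character, so ‘\w+’ breaks the time into two pieces while ‘\S+’ keeps it whole.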
Python’s regular expression syntax is even more robust and painful, allowing you to create functions which are called with every group match, and all sorts of convoluted loveliness. It is the equivalent of taking a quality Angus beef burger and grilling it with cardamom and curry. It’s complex, not what people expect, and an acquired taste is necessary to really appreciate it. It’s not a good place to start learning regular expression syntax, because of its complexity and lack of good documentation. There is one great document that I have permanently affixed to my desk, written by Andrew Kuchling, here:
http://www.amk.ca/python/howto/regex/
Nifty features such as finditer() do both iteration over matches and matching simultaneously. Patterns can be compiled for efficiency, and substitution accepts a function that is called upon every match. Groups can be named or numbered, and reused in the same regex pattern (not really new, since the & operator existed in sed, and numbered groups existed in awk, but still very handy). Python regex is useful for solving simple to moderately difficult pattern substitution problems. But extremely difficult problems require a once hidden feature called ‘scanner’. The scanner used to be in a module called ‘sre’, only known and understood by Fredrik Lundh. He remains the guru of this syntax, and has written pages about how to use scanning and grouping for complex pattern matching:
http://effbot.org/zone/xml-scanner.htm
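The features listed above (compiled patterns, finditer(), named groups, a function called on every match, and reusing a group inside the same pattern) fit in one short sketch. The key=value text and the names `pair`, `shout` and `doubled` are invented for the example.

```python
import re

# Compile once, reuse many times. The group names are illustrative.
pair = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")

text = "color=red size=large"

# finditer() yields one match object per match, left to right.
found = [(m.group("key"), m.group("value")) for m in pair.finditer(text)]
print(found)

def shout(match):
    """Called once per match; the return value replaces the matched text."""
    return match.group("key") + "=" + match.group("value").upper()

replaced = pair.sub(shout, text)
print(replaced)

# A backreference reuses an earlier group inside the same pattern.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
repeated = doubled.search("it is is here").group("word")
print(repeated)
```

Passing a function to sub() is what makes the ‘function on every match’ style possible. The function receives the match object, so it can look at any group before deciding what to return.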
Essentially, the scanner allows you to use the pattern grouping method combined with the ability to call a function upon a group match, across a huge ‘OR’ed pattern. So your matching pattern looks something like this:
match_one_of_these_HTML_parts = re.compile(”|&’\”=\s]+)|(\/\s*>)|(\s+)|(.)”)
Every OR operator ‘|’ separates one alternative from the next, and the parentheses around each alternative make it a numbered group. The match object’s lastindex attribute tells you which group was matched. Clever, but unsightly for the beginner. I still recommend starting with sed, awk and Perl, and acquiring the taste for the ‘re’ module.
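Here is a minimal sketch of the big ‘OR’ed pattern technique using only the documented re module. It is not the HTML pattern from this post. The token types, the `names` table and the `tokenize` function are assumptions chosen to keep the example small.

```python
import re

# One alternative per token type; each alternative is its own numbered group.
token = re.compile(r"(\d+)|([A-Za-z_]\w*)|(\s+)|(.)")

# Hypothetical labels for groups 1 to 4
names = {1: "NUMBER", 2: "NAME", 3: "SPACE", 4: "OTHER"}

def tokenize(text):
    """Return (kind, text) pairs, skipping whitespace."""
    tokens = []
    for m in token.finditer(text):
        kind = names[m.lastindex]  # number of the group that matched
        if kind != "SPACE":
            tokens.append((kind, m.group(m.lastindex)))
    return tokens

print(tokenize("x1 = 42;"))
```

The catch-all ‘(.)’ at the end plays the same role as in the pattern above: anything the earlier alternatives do not recognize still produces a match, so the scan never stalls on ordinary characters.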
I did not cover Java or Ruby regular expression syntax because I don’t know them. If one of the other DevChix could contribute these pieces, I’d be so thrilled.
Do you have a challenging regular expression, and need help with it, or want to brag about how you solved it? Please do share.
~G~


May 26th, 2007 at 11:19 pm
It might be worth turning those URLs into actual <a> elements for easy clicking.
I learned regular expressions in the JavaScript world and didn’t have much trouble at all getting used to the syntax. Then again Perl’s various $ operators are impossibly opaque for me, so I guess everyone has their own aptitudes.
May 27th, 2007 at 10:46 am
Done, thanks.
Perl’s operators take some getting used to. Because they have evolved from the UNIX world, people not familiar with UNIX shell or FSF tools will have a hard time grokking them at first. Perl’s $_ (current record), @_ (current argument array), and $$ (current process ID) variables are obscure enough. Learning that in a Perl regex ‘$’ means end of line is extremely obscure, I know.
My copy of the O’Reilly Perl Pocket Reference is dog-eared and worn out for this reason. I STILL occasionally have to glance at it:
http://www.amazon.com/Perl-Pocket-Reference-Johan-Vromans/dp/0596003749
But it does help.
~G~
June 7th, 2007 at 8:33 am
Rebol is a nice cross-platform language which relies on grammars instead of regexes. In many cases, grammars can be easier to read than regexes.
Rebol’s parse function follows the basic form:
PARSE [
(code executed on match)
(code executed on match)
...
]
See if you can follow this code, which extracts all hyperlinks and their labels from a page:
links: []
html: read http://www.devchix.com
parse/all html [
any [
thru {" copy label to
(repend links [trim/lines label url])
]
to end
]
foreach [label link] links [print [label link]]
More info on this approach here:
http://www.rebol.com/docs/core23/rebolcore-15.html
June 7th, 2007 at 9:56 am
If you’re totally new to Regex, here’s a good starting point: http://www.regular-expressions.info/
June 7th, 2007 at 2:12 pm
Formatting got messed up on my previous post.
The pattern would be:
PARSE "string…." [
RULE (code executed on match)
RULE (code executed on match)
…
]
Hopefully character entities are permitted. With a little luck the code would then be:
parse/all html [
any [
thru {<a href="} copy url to {"}
thru {>} copy label to </a>
(repend links [trim/lines label url])
]
to end
]
December 19th, 2007 at 7:03 pm
I’m having the hardest time learning how to use sed to look for matches within a file and report only on those matches instead of the whole line. It’s great to be able to find/replace, or just replace, but what about just finding? Here’s an example of what I’m talking about: finding a server name in each line of a file and reporting just that server’s name.
Here’s what a line might look like:
System registry: sysreg_srv@att-stageapp01
I just want my sed command to report the "sysreg_srv@att-stageapp01" portion of this line.
Naming for a server is very loose, so I thought I was onto something with sed 's/[^ ]*@[^ ]*//', but couldn’t figure out how to negate the match, so running
echo "System registry: sysreg_srv@att-stagevmjciapp01" | sed 's/[^ ]*@[^ ]*//'
only reported on what I didn’t want to match.
If anyone feels like they could lead me to a solution for this type of operation, I would be happy to follow.
December 20th, 2007 at 8:01 am
Here’s your solution:
echo "System registry: sysreg_srv@att-stagevmjciapp01" | sed 's/.* *: *//'
sed is doing this:
Match zero or more of any character (.*), then zero or more spaces ( *), then a “:”, then zero or more spaces again ( *). Replace everything matched with nothing (//), which leaves only the text after the colon. Because .* is greedy, the match extends to the last colon on the line.
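If the goal is to report only the token containing the ‘@’, wherever it sits on the line, the same idea can be sketched in Python by searching for the pattern and printing the match instead of deleting around it. This is a sketch using the pattern from the question. The function name `extract_server` is made up.

```python
import re

def extract_server(line):
    """Return the space-delimited token containing '@', or None if absent."""
    match = re.search(r"[^ ]*@[^ ]*", line)
    return match.group(0) if match else None

print(extract_server("System registry: sysreg_srv@att-stageapp01"))
```

Unlike the delete-the-prefix approach, this does not depend on the colon being present, and a line without an ‘@’ yields None rather than the whole line.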
Enjoy,
Gloria