The old adage on regular expressions
Posted By Michael on September 5, 2011
A co-worker is fond of saying, “You have a problem so you solve it with a regular expression. Now you have two problems.”
Regular expressions are powerful tools, but they can easily cause unforeseen complications. The script I’ve been trying to improve contains a section for stripping HTML. It uses HTML::Strip to do this but does a lot of pre- and post-processing with a series of regular expressions. Running the script with the stripping feature enabled added an extra 21 minutes to the running time. You can imagine how quickly that adds up.
Looking at the regexes, I was able to easily tweak a bunch of them. Some I was able to combine through the use of character classes; from others I removed unnecessary groupings. In the end I’d reduced the processing time by about 8 1/2 minutes. I’m not completely celebrating yet as I’m still working on verifying the new data, but currently all the expected IDs are in place and things look good.
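To give a flavor of the kinds of tweaks described above, here is a minimal sketch. It is written in Python rather than the script's Perl so it can be run on its own, and every pattern and function name in it is hypothetical: the actual regexes from the script aren't reproduced in this post. The same two constructs, the character class `[...]` and the non-capturing group `(?:...)`, work the same way in Perl.

```python
import re

# Hypothetical clean-up patterns; NOT the regexes from the real script.

# Before: three separate substitutions, so three passes over the text.
SEPARATE = [re.compile(r"\t"), re.compile(r"\r"), re.compile(r"\n")]

# After: one character class does the same work in a single pass.
COMBINED = re.compile(r"[\t\r\n]")


def clean_separate(text):
    """Replace tabs, carriage returns and newlines one pattern at a time."""
    for pattern in SEPARATE:
        text = pattern.sub(" ", text)
    return text


def clean_combined(text):
    """Same result as clean_separate, using a single combined pattern."""
    return COMBINED.sub(" ", text)


# A capturing group whose captured text is never used afterwards...
CAPTURING = re.compile(r"(&nbsp;|&#160;)+")
# ...can become a non-capturing group that matches exactly the same text.
NON_CAPTURING = re.compile(r"(?:&nbsp;|&#160;)+")

sample = "one\ttwo\r\nthree&nbsp;&#160;four"
print(clean_combined(NON_CAPTURING.sub(" ", sample)))
```

Both versions of each pattern produce identical output on the sample; whether the combined forms are actually faster depends on the data and the regex engine, so it is worth timing them against real input before and after the change.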
Roughly 13 minutes is still a lengthy processing time, but it’s a big improvement and I think I may still be able to whittle it down. While I spotted numerous issues with the old regexes, I did actually learn a couple of things while doing this. Regular expressions are far more complex than they seem at first, and it’s very worthwhile to familiarize yourself with how they work. The perlre section at perldoc.perl.org is a wonderful resource, but before trying to absorb it all I’d look for a primer. Robert’s Perl Tutorial has an excellent section on regular expressions. For a more pointed document, Steve Litt’s Perl Regular Expressions is the top Google result and looks quite good.