Whenever possible: Do it once
Posted By Michael on September 22, 2011
We’ve been trying to deliver a custom report for a client using another system that has long been neglected. Of course, we were also trying to do it in a way the original designers did not plan for. In a good state the report took 3.5 hours to run; in a bad one, up to 22. The client had a deadline and was getting anxious. Time to dive under the hood and see what could be done.
There were a few steps I took to improve things, including moving the process to a new server and removing unnecessary filters (i.e. filters that allowed every possible case and so excluded nothing). The new server obviously helped a great deal, but we were still looking at a process of 45 minutes to an hour, and it would still take weeks to deliver all the data.
In the end I got it down to 6 minutes. What made the final big difference? Fixing the regexes.
This time it was a bit of an experiment and I really wasn’t sure it would work. We had a large number of values to filter* for, and a row should be retrieved as long as at least one of them appeared in it. Because of the way the original designers implemented the filtering, it ran a separate regex check for each value. It returned as soon as one matched, but that was still an incredible number of regexes. I rewrote it so that all the values were combined into a single regex using alternation (one big OR). Suddenly the report ran smooth as silk.
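To make the rewrite described above concrete, here is a minimal sketch in Python. It is not the code from the system in question; the function names, the sample rows and the assumption that the values should be matched literally are all mine. It shows the per-value approach next to the single combined regex.

```python
import re


def filter_rows_per_value(rows, values):
    """Original style (as I understand it): one regex per value, tried in turn."""
    patterns = [re.compile(re.escape(v)) for v in values]
    # any() stops at the first match, but a non-matching row still
    # costs one regex search per value.
    return [row for row in rows if any(p.search(row) for p in patterns)]


def filter_rows_combined(rows, values):
    """Rewritten style: all values joined into one alternation, searched once per row."""
    if not values:
        # An empty alternation would match every row, so bail out early.
        return []
    # re.escape is an assumption: it treats each value as literal text.
    combined = re.compile("|".join(re.escape(v) for v in values))
    return [row for row in rows if combined.search(row)]


if __name__ == "__main__":
    rows = ["alpha,1", "beta,2", "gamma,3"]   # hypothetical data
    values = ["alpha", "gamma", "a.c"]        # hypothetical filter values
    print(filter_rows_combined(rows, values))
```

Both functions return the same rows; the difference is that the combined version runs one search per row instead of up to one per value.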
It’s a maxim I always try to observe, and recently I have seen its usefulness time and time again. Whenever possible: do it once. Write to disk once, search once, calculate once, and loop as little as possible. If you could do it this way and you aren’t, then you’re wasting cycles.
In this particular case there is a possible issue of too many cases within the single regex, and that was my concern. Luckily, so far it hasn’t been a problem.
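If the number of cases ever did become a problem, one possible fallback would be to split the values into a handful of moderately sized alternations rather than going back to one regex per value. This is only a sketch of that idea in Python, not something we have needed; the function names and the chunk size of 1000 are arbitrary assumptions.

```python
import re


def build_chunked_patterns(values, chunk_size=1000):
    """Compile the values into several alternations of at most chunk_size each."""
    values = list(values)
    return [
        re.compile("|".join(re.escape(v) for v in values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def row_matches(row, patterns):
    """True if any chunk matches; still far fewer searches than one per value."""
    return any(p.search(row) for p in patterns)
```

The patterns are built once up front and reused for every row, which keeps the sketch in the spirit of doing it once.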
* I suppose I should mention that our current setup for this particular data does not involve a database. Currently I’m trying to look into Hadoop and Hive to solve many problems.