Let’s take the shortcut

Posted on September 5, 2011

In my continuing adventures with legacy code I’ve come across a couple of mistakes that I can only assume were made by someone who didn’t really know the language.

I have to extract information from several large gzipped files. The files are filtered for specific IDs and the results are kept in a new file. It’s a pretty basic process, but I had long suspected it was one of the bottlenecks in our daily reporting. Acting on that instinct, I took a closer look, and the issues quickly became apparent.

Example 1:

system ("zcat $gzippedFile > $file");

It’s easy to see what’s going on here: the gzipped file is decompressed in full and written out to a new file. We want to retain the archive for future reporting, so we avoid gzip -d. On its own it’s not a terrible command, but the next thing we do is this:

open (FILE, "<:utf8", $file) or die "FAILED to open $file! $!\n";

The system command was taking about 22 seconds to run on a ~500 MB file. We can combine the two steps into one and make the process much faster:

open (PROC, "-|:utf8", "zcat $gzippedFile") or die "FAILED to start proc: $!\n";

Instead of working with a handle on a regular file, we work with a handle on a process. With the three-or-more-argument form of open(), the -| in the mode parameter tells Perl that we are starting a command whose output will be piped to us. Doing it this way removed the 22-second up-front cost of decompressing the file. The while() loop that followed took longer, but only by 4 seconds. That is a net saving of 18 seconds, or roughly 82% of the original decompression time.
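For illustration, here is a minimal sketch of the whole read-and-filter loop built on a piped open(). The helper name filter_gzipped, the assumption that the ID is the first tab-separated field, and the hash of wanted IDs are all made up for this example and are not from the real script. It also swaps in gzip -dc for zcat (the two are equivalent on most Linux systems, but zcat behaves differently on some platforms) and uses the list form of open() so the file name never passes through a shell.

```perl
use strict;
use warnings;

# Hypothetical helper: stream a gzipped file through a pipe and copy every
# line whose first tab-separated field is a wanted ID into $outFile.
# The line format (ID in the first column) is an assumption for this sketch.
sub filter_gzipped {
    my ($gzippedFile, $outFile, $wantedIds) = @_;

    # List form of open(): the command and its arguments are passed
    # separately, so the file name is never interpreted by a shell.
    open(my $proc, "-|:utf8", "gzip", "-dc", $gzippedFile)
        or die "FAILED to start proc: $!\n";
    open(my $out, ">:utf8", $outFile)
        or die "FAILED to open $outFile! $!\n";

    my $kept = 0;
    while (my $line = <$proc>) {
        my ($id) = split /\t/, $line, 2;
        chomp $id;    # only matters when the line has no tab at all
        next unless exists $wantedIds->{$id};
        print {$out} $line;
        $kept++;
    }

    # close() on a pipe waits for the child; its exit status lands in $?.
    close($proc) or die "gzip exited abnormally: status $?\n";
    close($out)  or die "FAILED to close $outFile! $!\n";
    return $kept;
}
```

Called as filter_gzipped($gzippedFile, $file, { a1 => 1 }), it returns the number of lines kept. Checking the return value of close() on the pipe matters here because a failed decompression would otherwise look like a short but valid input.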

Example 2:

Near the end of this process I found the following code operating on the file we have been using for output:

system ("cat $finalFile > $finalFile.ext");
unlink $file;

Yes, this copies the contents of the file into a new file and then deletes the original. I honestly have no idea why it was done this way and cannot explain it. The purpose seems to be to get the specific extension that we use to label our reports, but nothing between file creation and completion prevents us from using that extension from the start. The copy was taking more than a minute to complete. All of that time was removed simply by giving the output file the correct file name in the first place.
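As a rough sketch of the fix, the output can simply be opened under its final name. The helper names write_report and finalize_report are hypothetical and not part of the real script. The second helper shows an alternative I did not need here: if a working name really were required, Perl's built-in rename() changes the name in place instead of copying the contents, though it generally only works when both names live on the same filesystem.

```perl
use strict;
use warnings;

# Hypothetical helper: write the report lines straight to the final name,
# so no copy step is needed afterwards.
sub write_report {
    my ($reportFile, $lines) = @_;
    open(my $out, ">:utf8", $reportFile)
        or die "FAILED to open $reportFile! $!\n";
    print {$out} @$lines;
    close($out) or die "FAILED to close $reportFile! $!\n";
    return $reportFile;
}

# Hypothetical helper for the case where a working name is unavoidable:
# rename() relabels the file rather than copying its contents.
sub finalize_report {
    my ($workFile, $reportFile) = @_;
    rename($workFile, $reportFile)
        or die "FAILED to rename $workFile to $reportFile: $!\n";
    return $reportFile;
}
```

With the first approach the call is just write_report("$finalFile.ext", \@lines), and the cat and unlink steps disappear entirely.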

This had an odd side effect that took me a while to track down. As a final step we output some cksum information, and suddenly the time of the cksum operation skyrocketed. Eventually I realized that the open() for the output file never had a corresponding close(). The open file handle seemed to be slowing down the cksum. The previous method, ridiculous as it was, got around the file handle issue by creating a new file. I added the close() and the cksum time returned to normal. The close() has a noticeable time cost of its own, but it is still much quicker than the system() method.

I’ve managed to remove about a minute of processing time from this process, which should have a significant impact on our daily reporting.
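The ordering that matters can be sketched like this: finish writing, close() the handle, and only then run cksum on the file. The helper name write_and_checksum is hypothetical, and running cksum through a piped open() is my choice for the example rather than what the real script does. I can't say for certain why the open handle slowed things down, but Perl does buffer output, and close() is what guarantees the buffered data has been flushed to the file before another program reads it.

```perl
use strict;
use warnings;

# Hypothetical helper: write the report, close it, then checksum it.
# Returns the line printed by cksum: "<checksum> <byte count> <file name>".
sub write_and_checksum {
    my ($reportFile, $lines) = @_;

    open(my $out, ">:utf8", $reportFile)
        or die "FAILED to open $reportFile! $!\n";
    print {$out} @$lines;

    # Close before anything else reads the file, so all buffered output
    # has been flushed and the handle is released.
    close($out) or die "FAILED to close $reportFile! $!\n";

    # Read the cksum output through a pipe, using the list form of open().
    open(my $ck, "-|", "cksum", $reportFile)
        or die "FAILED to start cksum: $!\n";
    my $result = <$ck>;
    close($ck) or die "cksum exited abnormally: status $?\n";

    chomp $result;
    return $result;
}
```

Checking the return value of close() on the output file is worth the extra line, since a write error on a full disk may only surface at that point.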
