Pair Programming: It’s In the Planning

Posted By on March 3, 2012

Preamble

Pair programming, in case you’re unfamiliar with the term, is exactly what it sounds like: two programmers working together on the same problem at the same time. The pair shares a single keyboard, usually with one person doing the typing (a.k.a. the driver), while they collaboratively work out a solution to the task. The idea is that the resulting quality is higher because a second pair of eyes sanity checks everything that is done. When it works right it can actually increase productivity over two programmers working independently on distinct tasks.

On to the goods

TechCrunch.com published an article titled Pair Programming Considered Harmful? about the pros and cons of pair programming and, more generally, what balance of privacy versus collaboration works best for software developers. It got me thinking about my experiences with pair programming, and specifically my experiences in the last week. I’ll get to those in a separate post, but first I’m going to talk about the more traditional experiences I’ve had.

I work for a company that has seen a fair amount of growth and a transition from small to medium-sized business in the last few years. This creates a much stronger focus on the maintainability and scalability of technology. Honestly, it’s not just about the longevity of the business either. It’s a simple fact that as more people work on a project, it needs to be easier for more people to work on it. In turn this leads to a lot of refactoring, improving our code standards, and exploring ways of working that we hadn’t previously. Pair programming was one of the earliest things we tried. I wouldn’t say it’s always been successful, but we continue to use it and it’s viewed positively in our environment. (A look at the comments on the TechCrunch article suggests other shops have had a lot more trouble.)

Our first attempts involved some stumbling. The most immediate discovery was that some tasks simply suited it better than others. This is difficult to quantify with a blanket rule, but it quickly led to people identifying, during planning sessions, tasks they felt would work well for pair programming. We were then able to earmark these tasks and plan around them. These decisions were based on knowledge of our code base and our expectations of what each task would need.

The next “obvious” lesson was that the dynamics of pairing could differ dramatically depending on the task and the people involved. I remember working on a task involving our ETL system, which is written entirely in Perl. The developer I was paired with had never worked in Perl. He was a smart guy, but I was working in a language I knew well and a code base I was intimately familiar with, whereas he had never seen this code or used the language, and he was quickly lost. It meant that not only did I need to explain what each function was trying to accomplish, but also how it could be done in the language. This quickly went from a collaborative process to a training process, and I think it was even less effective because I was the driver. Overall it felt frustrating for both of us. It did, however, make us more aware of the goals behind pairing on a task and how those goals shape its requirements.

Alluded to in that last story, but perhaps not quite as obvious, is the impact of selecting the driver. In a balanced dynamic – skill sets, knowledge regarding the task at hand, experience, and confidence are all relatively equal – the choice makes little difference. As those factors diverge it has a much bigger impact. I’ve seen highly confident developers sit back and ask their partners leading questions just to get the results they desire. I’ve seen less experienced developers lose interest when stuck in the observer role. In the earlier example I tended to drive ahead without regard to how well my partner was keeping up. In most cases the “driver’s seat” is an empowering place, but with certain combinations it can leave the observer acting more as a critic than a collaborator.

Based on my experience, I think there are a few key points to address that help make for a positive pair programming experience. They are:

  1. What is the goal? – What task are we trying to complete?
  2. What is the goal? – What are the developers expected to gain from this?
  3. What is the goal? – What is the team/department/company expected to gain?

The first point is important for any story a developer is going to be working on, but in clarifying that question we’ve found it’s easy to identify how well a task will suit pair programming.

The second point helps the developers identify what their roles will be in this pair. Along with the third it helps contextualize the dynamic. Examples of developer goals include assisting with unraveling an obscure bug, learning a new system, cross-training, exposure to new languages, exposure to new problem solving techniques, or simply team integration. Larger goals might include distributing domain knowledge, improving team dynamic or efficiency, or simply elevating the quality of the output.

A pair of developers working together on the same goals can lead to great experiences. A pair of programmers at odds with their expected goals only leads to frustration.

StackOverflow

Posted By on January 17, 2012

If you’ve ever Googled for help with a programming problem then you’ve probably come across StackOverflow.com. If you haven’t, it’s a site where you can ask questions of other developers and receive help on an incredibly wide variety of topics. There are over 30,000 different subject tags and very knowledgeable people answering. Questions, comments, and answers can all be voted on, and from what I’ve seen the votes do reflect quality and accuracy.

I’d been aware of the site for several years and had often read threads on it when trying to resolve an issue, but it’s only recently that I registered. I started answering questions and this week I’m in the top 2% for reputation gains! (find me here: http://stackoverflow.com/users/1138252/ilion ) While that is fun to crow about, it’s also been really rewarding to help others out. It’s also really interesting to see how others approach problems and just how they are using various tools. I highly recommend getting involved!

Don’t be afraid of the tool box

Posted By on January 8, 2012

Last night I was browsing through questions on Stack Overflow when I came across the statement “I don’t want to use any modules, I want to learn.” I would counter with: learn to use modules.

Reading this actually took me back to my early days of programming, when I fell into a very similar trap. It’s true that using modules means depending on someone else’s code, but installing Perl modules is extremely simple, and they can be installed and used without root or admin access on a system, so they should not be a roadblock to making code useful. The one area of caution would be using a module that does a lot of work for something very simple. If you simply want to extract a list of file extensions from an array of file names, you probably don’t need File::PathInfo, although you could use it.
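
For something that small, a couple of lines of core Perl will do. Here’s a rough sketch (the file list and the regex are purely illustrative):

use strict;
use warnings;

my @files = ('report.csv', 'notes.txt', 'archive.tar.gz', 'README');

# Capture whatever follows the last dot; names with no extension are skipped.
my @extensions = map { /\.([^.\/]+)$/ ? $1 : () } @files;

print "$_\n" for @extensions;   # prints: csv, txt, gz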

If you’re curious about how a module works, open it up and look at the code. Most of the best things I’ve learned about programming have come from looking at how other people do things. If you find something that could be improved, don’t abandon the module. Instead, fix it and offer the patch to the author. If it’s merged in then you’ve contributed back to the Perl community and everyone benefits!

But you shouldn’t be afraid to make use of modules. Avoiding them means ignoring some of the best tools in your tool box. At the least it’s like shunning an electric screwdriver in favour of a hand-powered one. Sure, you could do it, but it will probably cost you a lot of time and effort.

All Quiet

Posted By on October 16, 2011

It’s been a while since I updated here, which is unfortunate. The reason is pretty simple: I was pushed into working on a pile of PHP code. PHP4 code at that. It was unpleasant. (I don’t normally have anything against PHP, and for web dev it’s been my go-to language. But this was a bad scene.)

This past week I’ve been looking into Datastax’s Brisk product. It pairs Hive/Hadoop with Cassandra and appears to be an excellent solution for some of the problems we need to solve quickly. I actually did the proof-of-concept work in Python, but I hope to go forward with more Perl-based solutions. Net::Cassandra and Net::Cassandra::Easy previously seemed to be the modules to use, but it looks like they haven’t been updated in quite some time and broke with Cassandra 0.7. Cassandra::Lite comes to the rescue.

Currently I’m working with CentOS 6, which led to some trouble installing Brisk because the JNA package for CentOS 6 is out of date. I got past this by manually installing JNA and then doing a binary install of Brisk. It looks like using the JNA RPM for Fedora 15 will solve the issue, though, and next time I’ll likely install Brisk from the RPM. I’ve encountered some odd behaviour with Hive failing on simple queries, and I’d like to know it has nothing to do with me screwing up something in the install.

Whenever possible: Do it once

Posted By on September 22, 2011

We’ve been trying to deliver a custom report for a client using another system that has long been neglected. Of course we were also trying to do it in a way the original designers did not plan for. In a good state it took 3.5 hours to run; in a bad one, up to 22. The client had a deadline and was getting anxious. Time to dive under the hood and see what could be done.

There were a few steps I took to improve things, including moving the process to a new server and removing unnecessary filters (i.e. filters that allowed all possible cases anyway). The new server obviously helped a great deal, but we were still looking at a 45-minute to 1-hour process, and it would still take weeks to deliver all the data.

In the end I got it down to 6 minutes. What made the final big difference? Fixing the regexes.

This time it was a bit of an experiment and I really wasn’t sure it would work. We had a large number of values to filter* for, but as long as any one of them appeared in a row of data, the row should be retrieved. Due to the way the original designers implemented the filtering, it was running a separate regex check for each value. It returned as soon as one was found, but in this case that was still an incredible number of regexes. I rewrote it so all the values were combined into a single regex as one big alternation. Suddenly the report ran smooth as silk.
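
To illustrate the kind of change (a simplified sketch, not the actual report code): instead of looping over the filter values and running a separate match for each, the values can be joined into one alternation that is compiled once.

use strict;
use warnings;

my @filter_values = ('alpha', 'beta', 'gamma');   # illustrative values

# Old approach: one regex check per value, for every row.
sub row_matches_slow {
    my ($row) = @_;
    for my $value (@filter_values) {
        return 1 if $row =~ /\Q$value\E/;
    }
    return 0;
}

# New approach: build a single alternation once and reuse it for every row.
my $pattern   = join '|', map { quotemeta } @filter_values;
my $filter_re = qr/$pattern/;

sub row_matches_fast {
    my ($row) = @_;
    return $row =~ $filter_re ? 1 : 0;
}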

It’s a maxim I always try to observe, and recently I’ve seen its usefulness time and time again – whenever possible: do it once. Write to disk once, search once, calculate once, and loop as little as possible. If you can do it this way and you aren’t, then you’re wasting cycles.

My concern in this particular case was that there might be too many alternatives packed into the single regex. Luckily, so far it hasn’t been a problem.

 

* I suppose I should mention that our current setup for this particular data does not involve a database. I’m currently looking into Hadoop and Hive to solve many of these problems.

The old adage on regular expressions

Posted By on September 5, 2011

A co-worker is fond of saying, “You have a problem so you solve it with a regular expression. Now you have two problems.”

Regular expressions are powerful tools but they can easily cause unforeseen complications. The script I’ve been trying to improve contains a section for stripping HTML. It uses HTML::Strip to do this, but it also does a lot of pre- and post-processing with a series of regular expressions. Running the script with the stripping feature enabled added an extra 21 minutes to the running time. You can imagine how quickly that adds up.

Looking at the regexes, I was able to easily tweak a bunch of them. Some I combined through the use of character classes; from others I removed unnecessary groupings. In the end I’d reduced the processing time by about 8 1/2 minutes. I’m not completely celebrating yet as I’m still verifying the new data, but currently all the expected IDs are in place and things look good.
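
I can’t share the actual production regexes, but the flavour of the tweaks looks something like this hypothetical before-and-after:

use strict;
use warnings;

my $text = "some\tHTML&nbsp;fragment\r\n";   # placeholder input

# Before: three separate substitutions, each with an unnecessary capture group.
#   $text =~ s/(\r)//g;
#   $text =~ s/(\n)//g;
#   $text =~ s/(\t)//g;
# After: one pass with a character class and no capturing.
$text =~ s/[\r\n\t]//g;

# Before: an alternation wrapped in a capturing group whose capture is never used.
#   $text =~ s/(&nbsp;|&amp;)/ /g;
# After: a non-capturing group avoids the cost of the capture.
$text =~ s/(?:&nbsp;|&amp;)/ /g;

print "$text\n";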

13 minutes is still a lengthy processing time, but it’s a big improvement and I think I may still be able to whittle it down. While I spotted numerous issues with the old regexes, I did actually learn a couple of things while doing this. Regular expressions are far more complex than they first seem, and it’s very worthwhile to familiarize yourself with how they work. The perlre section at perldoc.perl.org is a wonderful resource, but before trying to absorb it all I’d look for a primer. Robert’s Perl Tutorial has an excellent section on regular expressions. For a more pointed document, Steve Litt’s Perl Regular Expressions is the top Google result and looks quite good.

Let’s take the shortcut

Posted By on September 5, 2011

In my continuing adventures with legacy code I’ve come across a couple mistakes that I can only figure came from someone who didn’t really know the language.

I have to extract information from several large gzipped files. These files are filtered for specific IDs and the results kept in a new file. A pretty basic process, but I also had a feeling that this must be one of the bottlenecks in our daily reporting. Based on that instinct I decided to take a closer look, and issues were quickly apparent.

Example 1:

system ("zcat $gzipedFile > $file");

It’s easy to see what’s going on here: the gzipped file is being decompressed into a new file, which is then read line by line. We want to retain the archive for future reporting, so we avoid gzip -d (which would remove the original archive). On its own, it’s not a terrible command, but the next thing we do is this:

open (FILE, "<:utf8", $file) or die "FAILED to open $file! $!\n";

The system command was taking about 22 seconds to run on a ~500 MB file. We can combine these two steps into one and make the process much faster:

open (PROC, "-|:utf8", "zcat $gzippedFile") or die "FAILED to start proc: $!\n";

Instead of working with a file handle, we work with a process handle. Using the three-argument version of open(), the -| in the mode parameter tells Perl to launch a process that pipes its output to us. Doing it this way removed the 22-second startup cost of decompressing the archive. The while() loop that followed took 4 seconds longer, but that is a net saving of roughly 18 of those 22 seconds, a gain of over 80%.
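
Put together, the new version looks roughly like this (the file names and the ID filtering are simplified for illustration):

use strict;
use warnings;

my $gzippedFile = 'data/report.log.gz';                 # illustrative path
my $outFile     = 'data/filtered.log';
my %wanted_ids  = map { $_ => 1 } (1001, 1002, 1003);   # illustrative IDs

# Read straight from zcat instead of unpacking to a temporary file first.
open(my $proc, "-|:utf8", "zcat $gzippedFile")
    or die "FAILED to start proc: $!\n";
open(my $out, ">:utf8", $outFile)
    or die "FAILED to open $outFile! $!\n";

while (my $line = <$proc>) {
    my ($id) = split /\t/, $line;        # assume a tab-delimited ID column
    print {$out} $line if defined $id && $wanted_ids{$id};
}

close($proc);
close($out) or die "FAILED to close $outFile: $!\n";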

Example 2:

Near the end of this process I found the following code operating on the file we have been using for output:

system ("cat $finalFile > $finalFile.ext");
unlink $finalFile;

Yes, this copies the contents of the file into a new file and then deletes the original. I honestly have no idea why it was done this way and cannot explain it. The purpose seems to be to get a specific extension that we use to label our reports, but there’s nothing between file creation and completion that prevents us from using that extension from the start. This was taking more than a minute to complete. The entire cost was removed simply by giving the output file the correct name in the first place.

This had an odd side effect that took me a while to track down. As a final step we output some cksum information, and suddenly the time of the cksum operation skyrocketed. Eventually I realized that the open() for the output file never had a corresponding close(). The still-open file handle seemed to be slowing down the cksum. The previous method, ridiculous as it was, sidestepped the file handle issue by creating a new file. I added the close() and the cksum returned to normal. The close() has a noticeable time cost, but it is still much quicker than the system() method.
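
The fix was essentially a one-liner; in simplified form it looks like this (file and handle names are illustrative):

use strict;
use warnings;

my $finalFile = 'daily_report';   # illustrative base name

open(my $out, ">", "$finalFile.ext") or die "FAILED to open $finalFile.ext! $!\n";
print {$out} "report data\n";

# Explicitly close (and flush) the handle before cksum runs, rather than
# relying on the handle being cleaned up at program exit.
close($out) or die "FAILED to close $finalFile.ext: $!\n";

system("cksum", "$finalFile.ext");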

I’ve managed to remove a minute of processing time from this job, which will have a significant impact.

Bad slipup with ternary operators

Posted By on September 1, 2011

I’m a fan of ternary operators. I’ve seen some people complain that they make code less readable, but I don’t agree. I suppose it’s fair to say they’re not as easy to scan, but I think any decent programmer should understand them. The other day, however, I came across a legacy one in our ETL process that was not doing its job correctly.

One section of our ETL loads in a bunch of values that, if null, are supposed to be set to hyphens. (There’s some reason for this absurdity that relates to third parties, but we won’t get into that.) These values are sent to a Perl function that validates and assigns them with the following ternary operation:

push @values, $value ? $value : "-";

Can you guess the problem?

In the majority of cases this works perfectly. A null or undefined value will evaluate as false and the hyphen will be returned. Unfortunately, sometimes this value can be 0, and Perl, being the way it is, will evaluate that as false too!

It was a pretty obvious bug when I found it. In essence this is not asking the right question. We don’t want to know if $value is true or false, we want to know if it’s defined. The fix is extremely simple:

push @values, defined($value) ? $value : "-";
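
As a side note (not part of the original fix), on Perl 5.10 and later the defined-or operator expresses the same intent even more compactly:

push @values, $value // "-";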

DBD::Mock

Posted By on August 31, 2011

Within about the last year, unit testing became a big thing at the company I work for. Although the company has been around for a while, most of its growth has come in the last year or two, which means a lot of process has been evolving to ensure quality standards on a larger team. Of course this also means we’re trying to retrofit a lot of legacy code with unit tests. It’s a joy, I tell you.

Now I’m not quite one of the proselytizers of unit testing, but I’ve been forced to admit it has helped me out, so it’s a tool I’ll continue to use. (On the other hand, just today someone on our team hit the limits of unit testing: all his unit tests passed but his program didn’t work.) For our Perl code our main testing tool is the great Test::More module, which uses TAP (the Test Anything Protocol). With a small make script we’ve integrated these tests with Gerrit and Jenkins so they run on every commit and must pass before code can be merged.

When we first started this I found database-related code the most irritating to test. Trying to mock it out with hand-rolled mock objects was too much of a pain. This is where DBD::Mock came to the rescue. If your code is written so that database handles can easily be passed to objects, then DBD::Mock takes all the pain out of mocking a database. When that doesn’t work, we use some homegrown code to mock specific subs and return the DBD::Mock database handle instead of the regular dbh.

The module installs just like any other DBD and you access it with a regular DBI->connect:

my $dbh_mock = DBI->connect("DBI:Mock:", "", "");

From there you can execute any database statement you normally would against the handle. That in itself is pretty helpful as it removes any need for further mocking or actual database connectivity. It also ensures you’re not dirtying a database with unit test artifacts. But it’s far more powerful than that. To begin with you can populate the object with some result sets. It’s as easy as:

my @res = (
  ['col1','col2','col3'],
  ['foo','bar','baz'],
  ['bee','baa','boo'],
);
$dbh_mock->{mock_add_resultset} = \@res;

The first row functions as the column headers; the rows that follow are the result rows, which can be iterated over with $sth->fetchrow_array or similar. Multiple result sets can be added: the first statement requiring results will draw from the first set, the second from the second set, and so on. But what if you want a result set returned for a specific statement?

As a basic example, we have some functions built around determining free space on a server so we know if we need to drop old data before loading new data. There are a few things that happen on the way to the statement that actually affects what we want to test, so we want to make sure the result set is returned only for our specific database-size query (you’ll see I’m using Postgres here):

$dbh_mock->{mock_add_resultset} = {
  sql => "SELECT pg_database_size('dbname')",
  results => [["size"], [10000]]
};

Other queries can return from the general result set pool in the object, but only the size query will return this set.

Another feature I’ve found quite useful is the ability to access the mock object’s history. The history stores all the SQL statements that have been executed against the handle and can easily be accessed as an array ref:

my $history = $dbh_mock->{mock_all_history};

Basic checks can be performed, such as testing the size of the array ( is(scalar(@{$history}), 5) ) to make sure the expected number of statements were run, searching the array for specific queries, or running is_deeply tests against the bound parameters:

is_deeply($history->[0]->{bound_params}, ['foo',5555,'2099-01-01',11]);
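
Putting it together, a minimal Test::More sketch might look like this (the db_under_limit sub and the size threshold are hypothetical stand-ins for our real code):

use strict;
use warnings;
use Test::More tests => 2;
use DBI;

# Hypothetical sub under test: returns 1 if the database is under the limit.
sub db_under_limit {
    my ($dbh, $limit) = @_;
    my ($size) = $dbh->selectrow_array("SELECT pg_database_size('dbname')");
    return $size < $limit ? 1 : 0;
}

my $dbh_mock = DBI->connect("DBI:Mock:", "", "");

$dbh_mock->{mock_add_resultset} = {
    sql     => "SELECT pg_database_size('dbname')",
    results => [["size"], [10000]],
};

ok(db_under_limit($dbh_mock, 50000), 'database reported under the limit');

my $history = $dbh_mock->{mock_all_history};
is(scalar(@{$history}), 1, 'exactly one statement was executed');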

There’s a lot more DBD::Mock can do and I highly recommend it for anyone trying to unit test objects that deal with databases.