serialized.net

A study in fascination burnout

Testing Perl Code That Runs Commands

At the September Los Angeles Perl Mongers meetup, Tommy Stanton presented on an in-progress bit of code he’s working on, App::Git::HomeSync. (Presentation)

As you’d hope and expect from something headed CPAN-ward, he’s got lots of tests. As you might have guessed from the name, this module needs to run git quite a bit, with different command line arguments. Tommy’s approach to testing this is good – ship the module with some “fixtures” (a directory in a known state which gets unpacked into a temp directory) and then run the command line app in that directory.

There is another way to approach this, and I realized I didn’t have any “open sourceable” code which demonstrates this technique. I got a lot of the way through writing this before realizing I have blogged a more basic version of this idea before, but this is a new-and-improved take on things, with much deeper examples.

tl;dr

  • Use IPC::Run to run your command line apps
  • In your test code, intercept the calls to IPC::Run::run and return your own data, based on the command line used
  • Store sample command line output inside the test file using Data::Section

I’ve stashed a full, functional example of this idea in my Acme::System repository on GitHub.

Code Walkthrough

Using IPC::Run in your code

First, there’s the module code itself. This module does 2 very stupid things.

  • It returns the sum of all the PID’s (Process ID’s) on the system, and it calls ps to get this information.
  • It returns the value of one of the columns from the vmstat tool.

The important thing is to use IPC::Run::run to actually run the command, instead of a blind system() call. Because it’s a module call, and has a very simple interface, it’s much easier to mock it (“override the functionality with ‘fake’ functionality”) for testing.

This is in lib/Acme/System.pm. Here’s the pidsum method (the vmstat_col method is basically the same, check out the full code if you’re curious):

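    # needed near the top of the module
    use IPC::Run ();
    use List::Util qw(sum);

    =method pidsum
        Return the sum of all the PID's on a system
    =cut

    sub pidsum {
        # ps prints one process per line: PID first, then the command
        my @ps_cmd = ("ps", "-Ao", "pid,cmd");
        # only stdout is wanted, but run() still expects all three refs
        my ($stdin, $stderr) = (undef, undef);
        my $ps_output;
        IPC::Run::run(\@ps_cmd, \$stdin, \$ps_output, \$stderr);
        # take the leading number on each line; the header line counts as 0
        return sum(
            map { /^\s*(\d+)/ ? $1 : 0 }
            split(/\n/, $ps_output)
        );
    }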

So, as I said, stupid code. I only care about getting the output, so I pass in undef for the other values – IPC::Run::run wants you to supply them anyway.

However, that little sum/map/split thing sure looks fancy. How much would you be willing to wager that it’s bug-free? Probably not much. How would you even test code like that?

So, let’s cheat – run ps just the once, stash the results, and use those for testing from there on out.
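Here’s a minimal sketch of what “test against stashed output” buys you. It assumes a hypothetical helper, pidsum_from_text, which is not in Acme::System; it just lifts the same sum/map/split logic out so it can be fed a string. The sample lines are borrowed from the canned ps output used later in the test file.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not part of Acme::System): the same parsing as
# pidsum, but taking the ps output as a string instead of running ps.
sub pidsum_from_text {
    my ($ps_output) = @_;
    return sum(
        0,    # seed so that empty input sums to 0 rather than undef
        map { /^\s*(\d+)/ ? $1 : 0 }
        split(/\n/, $ps_output)
    );
}

# Canned sample: a header line (no leading digits) and three processes.
my $canned = join("\n",
    "  PID CMD",
    "    1 init [2]",
    "    7 [khelper]",
    " 1887 /usr/sbin/acpid",
) . "\n";

print pidsum_from_text($canned), "\n";    # 1 + 7 + 1887 = 1895
```

With the input pinned down like this, the expected answer can be worked out by hand, which is the whole point.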

Storing the command line output results

Check out the whole test file to see what’s going on overall.

There are a few bits of weirdness, for sure. Down at the end you’ll see:

    __DATA__
    __[ ps -Ao pid,cmd ]__
    PID CMD
        1 init [2]
        7 [khelper]
    ....
    1887 /usr/sbin/acpid
    __[ vmstat -n 1 1 ]__
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
    0  0    648  85932 194240 119124    0    0     0     5    2   51  0  0 100  0
    __END__

This is the format that Data::Section wants. You just need to stash each command you run between the underscored square brackets, and follow it with some sample command line data.
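To make the layout concrete, here’s a hand-rolled illustration of how text in that shape breaks down into named sections. This is not Data::Section’s implementation, just a small stand-in written for this example (parse_sections is a made-up name), and it only understands the exact `__[ name ]__` header style shown above.

```perl
use strict;
use warnings;

# Hand-rolled illustration (NOT Data::Section itself): split text into
# sections keyed by whatever sits between "__[" and "]__".
sub parse_sections {
    my ($text) = @_;
    my (%section, $current);
    for my $line (split /^/, $text) {    # keeps each line's newline
        if ($line =~ /\A__\[\s*(.+?)\s*\]__\s*\z/) {
            $current = $1;
            $section{$current} = '';
        }
        elsif (defined $current) {
            $section{$current} .= $line;
        }
    }
    return \%section;
}

my $sections = parse_sections(<<'END_SAMPLE');
__[ ps -Ao pid,cmd ]__
  PID CMD
    1 init [2]
__[ vmstat -n 1 1 ]__
r  b   swpd
0  0    648
END_SAMPLE

# Each section name is exactly the command line that would have been run.
print "section: $_\n" for sort keys %$sections;
print $sections->{'ps -Ao pid,cmd'};
```

The useful property is that the section name and the command line are the same string, so looking up canned output is a single join.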

The tricky thing is this – Data::Section was built to be used with Modules, not “plain jane .t files.” So to make that work, you need to:

  • Give your test a package name. I chose to just stick a Test:: in front of the module name.
  • Give Data::Section an instance of that package to stick its methods into.

    # magic; Data::Section wants this to be a module, not a test file.
    # trick it into thinking this hashref is an instance of Test::Acme::System
    # (one-argument bless uses the current package)
    my $data = {}; bless $data;

This is the only part of this whole technique that feels truly hacky. If anyone has suggestions for better ways to manage this, let me know.

Overriding IPC::Run::run in a test

There are lots of ways to override a module’s methods; I have had good experiences with Test::MockModule. It’s pretty easy.

Write a callback that emulates IPC::Run::run

Here’s the code that makes that evil hack above worthwhile. Here’s all you have to do to recover those canned program execution results, using the section_data method provided by Data::Section (and that hacky $data reference):

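    # stand-in for IPC::Run::run: look up canned output by command line
    sub mock_ipc_run {
        my ($cmd, $stdin, $stdout, $stderr) = @_;
        # section names in __DATA__ are the command joined with spaces
        $$stdout = ${ $data->section_data(join(" ", @$cmd)) };
        return 1;    # the real run() returns true on success
    }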
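One variation worth considering is a callback that fails loudly when the code under test runs a command you have no canned output for. Here’s a sketch under some assumptions: it uses a plain hash (%canned, made up for this example) rather than Data::Section so that it runs with core Perl only, and mock_run_strict is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical canned outputs, keyed by the space-joined command line.
my %canned = (
    'ps -Ao pid,cmd' => "  PID CMD\n    1 init [2]\n",
);

# Same calling convention as IPC::Run::run: a command arrayref, then
# scalar refs for stdin, stdout and stderr.
sub mock_run_strict {
    my ($cmd, $stdin, $stdout, $stderr) = @_;
    my $key = join(" ", @$cmd);
    die "no canned output for '$key'\n" unless exists $canned{$key};
    $$stdout = $canned{$key};
    return 1;
}

my $out;
mock_run_strict([ "ps", "-Ao", "pid,cmd" ], \undef, \$out, \undef);
print $out;

# An unexpected command dies instead of quietly returning nothing.
my $ok = eval { mock_run_strict([ "rm", "-rf", "/tmp/x" ], \undef, \$out, \undef); 1 };
print "rejected: $@" unless $ok;
```

That way a typo in a command line, or a new command added to the module, shows up as a test failure rather than as an empty string being parsed.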

“Mock” that in place

    # override the real run function with one that will use the __DATA__ block
    # (keep $module in scope: the mock is removed when it is destroyed)
    my $module = Test::MockModule->new('IPC::Run');
    $module->mock('run', \&mock_ipc_run);
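If you’d rather not add a test dependency, the same kind of override can be sketched with core Perl by localizing the function’s glob. To be clear, this is a different technique from Test::MockModule, and My::Runner and My::App below are made-up packages standing in for IPC::Run and the code under test.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that shells out.
package My::Runner;
sub run { die "would run a real command here\n" }

package My::App;
# Calls the fully qualified function, the way pidsum calls IPC::Run::run.
sub report {
    my $out;
    My::Runner::run([ "uptime" ], \undef, \$out, \undef);
    return "got: $out";
}

package main;

{
    no warnings 'redefine';
    # local puts the original sub back when this block exits.
    local *My::Runner::run = sub {
        my ($cmd, $stdin, $stdout, $stderr) = @_;
        $$stdout = "canned output for @$cmd";
        return 1;
    };
    print My::App::report(), "\n";
}

# Outside the block the real function is back, so this call dies.
print "restored\n" unless eval { My::App::report(); 1 };
```

This only works because the calling code goes through the package-qualified name at run time; a function imported into another package before the override would need its own glob replaced.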

Actually doing the testing

At this point, test away!

    # Actually "run the tests", using the canned results from the __DATA__ block
    cmp_ok(Acme::System::pidsum(), "==", 8278);

    cmp_ok(Acme::System::vmstat_col("buff"), "==", 194240);
    cmp_ok(Acme::System::vmstat_col("si"), "==", 0);
    cmp_ok(Acme::System::vmstat_col("cs"), "==", 51);
    cmp_ok(Acme::System::vmstat_col("swpd"), "==", 648);

    done_testing;

So I can (independently) calculate what the results should have been, given the arbitrary data I’ve saved in the __DATA__ block, and test based on those values. Awesome.

    jbarratt@dev:~/work/Acme-System$ prove -l t/00-fakerun.t
    t/00-fakerun.t .. ok
    All tests successful.
    Files=1, Tests=5,  0 wallclock secs ( 0.01 usr  0.03 sys +  0.00 cusr  0.10 csys =  0.14 CPU)
    Result: PASS

Trust me, the normal code will still actually call the system

Just for fun, I threw in a script that actually uses this module to get live data:

    use Acme::System;

    print "Sum of all system PID's: " . Acme::System::pidsum() . "\n";

    print "Current CPU user time: " . Acme::System::vmstat_col("us") . "\n";
    print "Current Free Mem: " . Acme::System::vmstat_col("free") . "\n";

And sure enough, if you run it, the data is getting updated live. IPC::Run really is working on the live system.

    jbarratt@dev:~/work/Acme-System/lib$ ../bin/live
    Sum of all system PID's: 367273
    Current CPU user time: 0
    Current Free Mem: 70448
    jbarratt@dev:~/work/Acme-System/lib$ ../bin/live
    Sum of all system PID's: 367291
    Current CPU user time: 0
    Current Free Mem: 70556

Wrapping it all up

Other than the hackish trick to get Data::Section working in what’s not really a module, this code is really clean, readable, and easy to maintain. It works well for pretty much any module you might care to use instead of IPC::Run – there are lots of options, but as long as you use one of the module-ized ones, you can hook the module name and go from there.

Especially if you write lots of sysadmin tools, and especially if they have costs or risks associated with running them (fsck? rm -rf?), this technique can be a lifesaver. It’s only as good as the inputs you give it, though. The first time I worked out this workflow, I made the mistake of grabbing sample output that didn’t match real life: the live output ended up with more whitespace than I’d accounted for, because the counters had gotten bigger between when I snagged my “output to test with” and when the code was running on “live output.”
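One way to guard against that kind of surprise is to keep more than one sample around and to parse by header name rather than by character position. The sketch below assumes a hypothetical helper, vmstat_col_from_text; the real vmstat_col isn’t shown in this post, so treat this as one possible shape for it rather than the actual code. The “wide” sample is invented to show columns shifting as counters grow.

```perl
use strict;
use warnings;

# Hypothetical helper; the real vmstat_col is not shown in this post.
# Finds a column by header name and splits on runs of whitespace, so
# wider numbers do not shift the result.
sub vmstat_col_from_text {
    my ($text, $col) = @_;
    my @lines  = grep { /\S/ } split /\n/, $text;
    my @header = split ' ', $lines[-2];    # second-to-last line: names
    my @values = split ' ', $lines[-1];    # last line: numbers
    for my $i (0 .. $#header) {
        return $values[$i] if $header[$i] eq $col;
    }
    return;    # unknown column
}

my $narrow = <<'END';
procs -----------memory----------
 r  b   swpd   free   buff  cache
 0  0    648  85932 194240 119124
END

# Same columns after the counters have grown (invented values).
my $wide = <<'END';
procs -----------memory----------
 r  b   swpd   free   buff  cache
 0  0 1234648 98585932 194240 119124
END

print vmstat_col_from_text($narrow, "buff"), "\n";
print vmstat_col_from_text($wide,   "buff"), "\n";
```

Both calls return the same buff value even though the numbers no longer line up under their headers.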

I hope it helps, and if you have any suggestions about how to improve the technique, let me know (or send a pull request!)

Up-Up-Updated ‘Inbox Zero With Mail.app’ Technique

In March and December of 2009 I described how to do the Inbox Zero method with Mac Mail, often known as Mail.app.

Since publishing those articles I’d been fed up with the performance of all the methods I’d been using, and finally switched over to a true beast of a mail processing machine, Mail Act-On. The native plugin performance was great, and I was pretty happy with this solution.

The only thing I wasn’t happy with was that I was still only using a single feature – the ability to, with a hotkey, move mail into an archive folder. And for this, I’d happily paid.

However, I was recently contacted by the developer of the perfect plugin for me – Archive, a button for Apple Mail. It does Archive and nothing else. It’s a native plugin so it’s fast, fast, fast. You don’t have to jump through any of the hoops of AppleScripting and Service menus and Quicksilver.

I’ve been using it for a few weeks now and it’s been perfect – clean, simple, and exactly what I wanted all along.

The new 'Archive' feature

It’s free, and you can download it from the author’s site.

Why Using Packages Makes Sense in a Configuration Management World

I woke up this morning to a discussion on twitter between two of my favorite internet people, Andrew Shafer and R.I. Pienaar.

Andrew I know from his previous job with Reductive (now Puppet Labs) and I love what he has to say. (I really liked his DevOps Cafe episode – in particular making me change my opinion about “commitments” in Agile contexts.)

R.I. is a force of nature – his blog is great, and full of years of hard-earned wisdom, and his mcollective project is something I can’t wait to roll out.

The discussion centered around the question that I’ll paraphrase:

In the era of configuration management tools like Chef and Puppet, what value do packages provide? What are the pros and cons of packaging?

RI's question

littleidea question

It was clear that both of them were feeling the pinch of expressing themselves in 140 characters. It’s a topic I’m pretty passionate about, after 15+ years fighting to keep systems under control, so I figured I’d write up my take on it.

Packages vs wget && tar

What are the pros and cons at the utility level?

(As dogma-free and objective as I can be, of course…)

The cost of building packages

Let’s start with the only downside I can think of to having to build packages – it’s an extra step, and takes some time.

Packaging your own code is easy – you solve it once, and then have something like Hudson or BuildBot take it from there. However, packaging upstream code that’s not in your distro is a pain in the butt. That’s a given.

Both of these get worse if (like me) you’re stuck running multiple distributions. Right now we have to build .rpm, .deb and Solaris packages.

Depending on what language you’re using, there might be some tools that help package things the right way. For Debian+Perl, for example, dh-make-perl is getting to the point of being awesome and very usable.

One way to get packages for upstream stuff that’s not very painful is with a tool like CheckInstall – you do the equivalent of a make install one time in a sandbox, and that gives you a package you can install at will and get all the benefits I’ll elaborate below.

No matter what, it’s an extra step.

@AshBerlin points out that, if you’re managing some upstream software, this is a cost you have to take on for every version they release, not just a 1 time cost.

There is no question that it’s ‘easier’ to do a make install than to build a package (every time the software is updated) and get that installed.

So here’s why that’s worth putting up with.

The value of building packages

Redundant, Version-aware Repositories

What can go wrong with code that looks like this? (Arbitrary link I happened to have in my .bash_history)

    $ wget http://yuilibrary.com/downloads/yuicompressor/yuicompressor-2.4.2.zip

  1. What happens when the version changes? You have to update your configuration. This can be a good or bad thing, but in some cases you really want $latest to be installed, rather than the hard-coded version someone supplied the last time they edited the configuration manifest.
  2. What happens when the yuilibrary folks change their (arbitrary) download URI’s?
  3. What happens if http://yuilibrary.com is down the next time you want to do a build?
  4. What if you want to be shipping yuicompressor-2.4.1.zip? Is that still available for download?

When you want to install a package, the more reliable you can make the process, the better. Upgrading all the servers in a cluster? You want all the servers to be upgraded. Trying to bring a new server online? You want to be able to do that with a very low probability of anything going wrong. (The more you can trust your deploys, the closer to “Just in Time” you can bring servers up, which saves you money, especially in the age of automated infrastructure.)

The “repo” model provided by at least the Red Hat and Debian packaging systems handles all of these cases really, really well.

  • You can provide a list of repositories, and an attempt to install a package will try each in turn until it finds one that works
  • It’s a trivial sysadmin task to have several repos with the same content available. Each one doesn’t even have to be fancy and “HA”.
  • It’s 100% predictable (and an implementation detail you don’t have to worry about) what will happen when you say you either want a specific version of a package, or the latest version.

Built-in “tripwire”-like functionality

The package databases behind Apt and Yum (dpkg and RPM) both keep checksums of the files installed by packages. So at any time you can ask “have any files that are supposed to be managed by packages been modified?”

This is a useful thing to know for security reasons, of course. However, it’s even more important for helping people adapt their behaviors.

In an environment that’s moving from “not configuration managed” to “configuration managed”, where the status quo has been “modify the files on production servers”, it’s great to be able to get a nagios alert that one of the servers is now out of configuration, check the sums, and find out exactly which files were modified, and when.

(If you couple that with a nice ‘everyone logs in as their user and sudo’s when needed’ policy, you can find out exactly who and when, as well.)

The package manager knows how files got on your system

What files got spewed into your system from your average make install is usually pretty predictable, but certainly not always. A package manager records it exactly.

This is useful for 2 cases:

Troubleshooting, and knowing where to make changes when you find the problem

The server keeps throwing 500 errors. Why? Ah, an untrapped exception in /some/file/deep/on/the/system. Ok, I can fix that. Where do I need to go fix that? dpkg -S <filename> tells me the exact package responsible for that file.

    $ dpkg -S /usr/bin/factor
    coreutils: /usr/bin/factor
    $ dpkg -S /usr/bin/facter
    facter: /usr/bin/facter

That’s one area in particular where configuration management systems can add an extra layer of value. R.I. actually wrote a tool that helps you discover which puppet module is responsible for configuring a given resource on the server. (localconfig.yaml parser) So if the answer is, rather than “it’s something shipped with a package”, “it’s a config file that puppet wrote with your module called my_module”, you can easily find and fix it.

Uninstalling

I’ve had some concrete problems without this, where I did an upgrade and cruft left over from the previous version interfered with the new version. For dynamic languages which build up large library trees, this can be particularly nasty, since default search paths might end up including remnants of an old version.

When the package manager knows the location of every file, it can rip them out as happily as it put them in.

Dependencies are built in

At the level of configuration management, I really care about the application I’m configuring.

I want to run lighttpd. I tell the configuration manager to install it. I don’t want to have to do a research project to find all the supporting libraries required for it. Also, I really don’t want to track down the -dev versions of all of those libraries by hand.

This is especially important for upgrades – if an application starts using a new library after an upgrade (or depends on a newer version of a library when upgrading) that’s all handled (and expressible) at the package layer.

Discovery of available updates is built in

This is one of the most compelling reasons.

If you’ve got a system with 9 tarball’d packages,

  • What versions are currently installed?
  • What software has updates available?

It’s bad enough if you’re installing software you know about, but I assume we’re going to be using a distribution at some point. You also REALLY want to know about updates your distribution is providing, right? Especially when it’s things like critical kernel issues, openssl problems – anything that can be remotely triggered, at the very least.

Knowing what versions are available can be easily automated. You can use things like

to handle this, and even do things like automatically install security patches if you’d like.

Since you need to solve this ‘sysadmin problem’ at large anyway, why not leverage the tools and practices you build around this to learn about your own software, as well? It’s just as valuable to know that there’s a new build of apache2 available from the core Debian repo as it is to know that our_custom_app, which we expected to be at version 2.6 everywhere, is pinging because host25 is still running 2.5.
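As a toy illustration of that last check, here’s a sketch in Perl. Everything in it is made up for the example: the inventory hash, the host names, and the hosts_not_at helper. In practice the versions would come from querying the package manager on each host rather than from a hard-coded hash.

```perl
use strict;
use warnings;

# Hypothetical inventory: host => installed version of our_custom_app.
my %installed = (
    host24 => "2.6",
    host25 => "2.5",
    host26 => "2.6",
);

# Hypothetical helper: list the hosts whose version differs from the
# one we expect to be deployed everywhere.
sub hosts_not_at {
    my ($expected, %inventory) = @_;
    my @stale = grep { $inventory{$_} ne $expected } keys %inventory;
    return sort @stale;
}

print "not at 2.6: $_\n" for hosts_not_at("2.6", %installed);
```

The comparison is trivial precisely because every install is a named package with a version; with tarballs there is nothing reliable to compare.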

Cryptographic signatures are built in

Packages give you a way to trust that the bits you’re about to install are the ones you should be installing.

In this time of huge malware infestations, attempts to trojan even things like the linux kernel, and a large black market for “owned boxes”, you have every reason to suspect that the software you download might be compromised in some way.

Most places you can download tarballs also post checksums of what those files should be. There are two problems with this:

  1. If you can’t trust the place the download is hosted, why would you trust the sum?
  2. That adds a lot of complexity to the automated installer.

You go from

  • Download the package
  • Decompress, configure, make

To

  • Download the package
  • Scrape for the latest checksums
  • Verify the checksums
  • THEN install the packages
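For a sense of what the “verify the checksums” step means for an installer script, here’s a minimal sketch using the core Digest::SHA module. The checksum_ok helper and the temp file standing in for a downloaded tarball are assumptions made for the example, and the “published” sum is computed locally here, where a real installer would have to fetch it from somewhere.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the file's SHA-256 matches the expected
# hex digest.
sub checksum_ok {
    my ($path, $expected_hex) = @_;
    my $actual = Digest::SHA->new(256)->addfile($path, "b")->hexdigest;
    return lc($expected_hex) eq $actual;
}

# Stand-in for a downloaded tarball.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "pretend this is a tarball\n";
close $fh;

# Stand-in for the checksum scraped from the download page.
my $published = Digest::SHA::sha256_hex("pretend this is a tarball\n");

print checksum_ok($path, $published) ? "install\n" : "refuse\n";
print checksum_ok($path, "0" x 64)   ? "install\n" : "refuse\n";
```

Note that this only covers the mechanics. It does nothing about the first problem: if the sum comes from the same place as the download, a match proves very little.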

Packages solve both problems.

  1. They distribute a public key separately from the hosting of the packages. Unless you are owned already, you can verify that a package was signed by the person who holds the matching private key.
  2. Checking signatures is built in. Your package install will fail if something goes wrong.

Binary-identical code on every server

There are lots of things that can go wrong when you try and build a package from source.

  • You may not have some development libraries you need. Great, now you’re stuck managing those explicitly in configuration management as well. This also means that every server has to have the full stack of software needed to be able to build your software
  • Compilation may fail, especially if the package was updated from the last time you tried to build it.
  • Things that varied from build-to-build need to be accounted for when troubleshooting. “Huh, redis on host25 keeps crashing. Do we need to rebuild it?”

In general, it’s very nice to be able to completely decouple the tasks of

  • Creating a ‘build’ of your software and stack
  • Deploying a ‘build’ of your software and stack

Environment Support

One reason we like packages and repos is that it lets us define a configuration as:

  • A set of packages
  • A set of configurations of those packages

Then when we’re working in a topic or feature branch, we can create a repository just for that branch (this is also automated), and the repo configuration is the only thing that needs to be modified in the configuration management code.

Also, because a repo only needs to contain a subset of all the packages you need, this lets us “stack” them.

  • Prefer my project repo (which only has my project-sensitive packages getting built into it, the stuff I’m modifying)
  • Fall back to the production repo for anything else you need

The End

Packages may add some pain and complexity up front to the install process, but they add a tremendous amount of value to the “lifecycle management” of your applications. Most of the hell we go through as people running servers doesn’t happen the first month we set them up; it’s months 6, 18, and 24 that are the real problem. And those are the problems (“servers are graveyards of state”, etc.) that make Configuration Management the right thing to do in the first place.

Use them. Love them.