serialized.net

A study in fascination burnout

Using R and `ggplot2` to Contemplate Relocation

My wife and I are entertaining a still heavily “what-if” idea: migrating away from Southern California into the greener (and, it turns out, greyer) lands of the Pacific Northwest. (Portland seems particularly “us”.)

One big problem (and there are plenty of other problems that come up when you’re thinking about a major move) is the weather. Having both lived our lives calling Central and Southern California “home”, we’ve gotten spoiled by sunny, warm days being the norm.

I’ve been using R quite a bit at work, so answering the question of “how bad is the weather, really?” seemed like the perfect place to apply it.

Problem 1: Get some data.

I searched for a few minutes and found that the Weather Underground has historical information online, with handy links to getting it in CSV format. The only problem was that it seemed to only work a year at a time, so I bodged a quick perl script to download what was needed. (Cleaner links to all the code is available in my R github repo)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
my ($day, $month, $year) = (localtime(time))[3,4,5];
$month += 1; $year += 1900;

my $uri_a = "http://www.wunderground.com/history/airport/$code";
my $uri_b = "1/1/CustomHistory.html?dayend=$day&monthend=$month&yearend=$year&req_city=NA&req_state=NA&req_statename=NA&format=1";

my %seen = ();

for my $i (1991 .. $year) {
  my $csv = get("$uri_a/$i/$uri_b");
  for my $line (split(/\n/, $csv)) {
    $line =~ s/<br\s+\/>//;
    my $first = (split(/,/, $line))[0];
    next unless defined($first);
    if(!defined($seen{$first})) {
      $seen{$first}++;
      print "$line\n";
    }
  }
}

The closest airport (which seems to be how the historical data is stored) to us is Van Nuys, so I pulled that CSV as well.

1
2
$ ./fetch_weather KPDX > KPDX.csv
$ ./fetch_weather KVNY > KVNY.csv

Problem 2: Import the data into R

Thankfully, the CSV data from the Underground is pretty clean. (A few NA’s, but nothing major.) So, basic loading of the file is cake.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
citydiff <- function(city1_name, city1_logfile, city2_name, city2_logfile) {

  first_city <- read.csv(city1_logfile);
  second_city <- read.csv(city2_logfile);

  # Attach the name to each table
  first_city$city <- city1_name;
  second_city$city <- city2_name;

  # join the tables together in holy matrimony
  weather <- rbind(first_city, second_city);

  # ... and make the city name an unordered factor
  weather$city <- factor(weather$city, levels=c(city1_name, city2_name))

  # Fix up precipitation (it came in as a factor)
  weather$PrecipitationIn <- as.numeric(weather$PrecipitationIn)

  # and parse the date field
  weather$PST <- as.Date(weather$PST, "%Y-%m-%d")

  .....
}

(It’s throwing a warning about a factor that I don’t understand, but it seems to be working overall, so I’m not really sure what’s going on there. I even broke out R’s awesome debugger to no avail.)

Let’s look at the data.

Basic Temperature distribution

The first thing to do is just toss all the temperatures (there are 14,000+ of them since 1991) into a probability distribution, just to see where they fall. This is good use for qplot:

1
qplot(Max.TemperatureF, color=city, data = weather, geom="density")

The curves are similar, obviously offset from each other by about 20 degrees F; but the Portland curve has a sharper back slope – meaning that temperatures do fall in the 30-50 range, but more often hit 50-90.

Ok, temps are interesting, but let’s see how they fall across the year. We could look month by month, or day by day, but that feels “too coarse” and “too granular” respectively. I chose to break things out by week of the year. The trick here is to use as.POSIXlt() to interpret our Date, then modulo the day of year by 7. The sapply is just to fix the case (leap year?) where a single date would end up in the 53rd week of the year – this just clamps them all to 52.

1
weather$weeknum <- sapply(1 + as.POSIXlt(weather$PST)$yday %/% 7, function(x) { min(x, 52)})

So, I really fell in love (all over again) with ggplot2 while making this graph. Check it out:

And the code:

1
2
3
4
5
6
7
8
ggplot(weather, aes(x=weeknum, Max.TemperatureF, colour=Max.TemperatureF))
  + facet_grid(city~.)
  + geom_point(alpha=1/6, position=position_jitter(width=3))
  + scale_colour_gradient(low="blue", high="red")
  + geom_hline(yintercept=90) + geom_hline(yintercept=60)
  + geom_smooth(method="loess", size=1)
  + scale_x_continuous(formatter=function(x) format(strptime(paste("1990 1 ", x), format="%Y %w %U"), "%B"))
  + xlab("Month") + ylab("Maximum Temperature (F)")

Breaking it down:

  • using the weather data frame we made, with weeknum on the x axis, and Max.TemperatureF on the y axis and keying the colorization
  • Facet the graph (make multiple sections), one for each city in our data set
  • Draw the data with points, at opacity set to 1/6, so points that get plotted a lot will end up darker, and jittered a bit to smooth things out
  • Color the data points with a gradient, from blue to red. (Nice for intuitive hot/cold display)
  • Put horizontal lines at 60 and 90 degrees (The points at which it starts to feel “chilly” and “hot”)
  • Run a smoothed line through the whole thing to make the center more obvious
  • Convert the x axis from being hard-to-comprehend “week numbers” back to months
  • Then put good labels on the X and Y axis

It’s really readable, beauitful, and was actually shockingly simple to make. (Other than remapping the X axis. That was fun. The trick is to take a week number, tag it into an arbitrary year with an arbitrary day (so strptime can parse it), then reformat that date as %B, being the full date string. There’s got to be a better way…. but that worked!)

So what does the image actually tell us about the weather? Hopefully, it’s obvious. In Portland, December to February are pretty consistently chilly. In the summer, it occasionally gets hot, but between May and October it’s pretty perfect.

In Van Nuys, it’s rare to be chilly, but in June to October, get ready to run your AC.

It’s not just about the temperature

So, it’s nice not to need a sweatshirt for the temperature, but what about the gray? It can be warm and still depressingly dark outside. Thankfully, the data set includes a cloud cover metric!

Cloud Cover is a number from 0-8, representing the amount of the sky which is covered…. by cloud.

Let’s take a quick peek at the distribution with qplot, sending a smooth line (and leaving the raw week numbers this time) through the data:

1
qplot(weeknum, CloudCover, data = weather, geom="smooth", color=city, span=1)

So, the cities have similar curves, also offset (by about 3, or 37% cloud cover). In the middle of the summer, Van Nuys typically has around 18% cover, and Portland 43%. So: Portland is Cloudy. Myth Confirmed.

Breaking out ggplot2, we can get a much clearer picture of how cloudy it is:

1
2
3
4
5
6
7
ggplot(weather, aes(x=weeknum, CloudCover, colour=CloudCover))
  + facet_grid(city~.)
  + geom_point(position=position_jitter(width=0.5, height=0.5), alpha=I(1/6))
  + geom_smooth(size=1)
  + scale_colour_gradient(low="blue", high="black")
  + scale_x_continuous(formatter=function(x) format(strptime(paste("1990 1 ", x), format="%Y %w %U"), "%B"))
  + xlab("Month") + ylab("Cloud Cover")

It’s using basically the same method as for the temperature graph, except I modeled the cloud cover as a color range from blue to black. (Awww, cute.)

You can see something interesting when you compare it back to the temperature graphs: the cold days correlate with the cloudy days pretty well! So, the winter is sucky-cold and sucky-gray all at the same time. At least you get it out of the way all at once (while the days are short, too!)

So how nice is the spring and summer?

It’s just a perfect day

What might a perfect day look like?

  • Comfortable temperature (60-90 degrees)
  • Not too cloudy (Less than 50% cloud cover)

It turns out that’s an easy query to make against the data frame.

1
2
3
4
5
weather$ideal <- F
weather <- within(weather, {
    ideal[Max.TemperatureF >= 60 & Max.TemperatureF <= 90 & CloudCover <= 4] <- T
  })
weather$ideal <- as.logical(weather$ideal)

Make a new column called ideal and fill it with F’s, then change the value to T wherever those conditions are met – then convert the whole thing to a logical (boolean) type. Easy!

So now we can see where “ideal” versus “not-ideal” days fall across the year. Week-by-week would probably be too noisy for this, so I added a column for month to the data frame as well.

1
weather$month <- factor(format(weather$PST, "%B"), order=TRUE, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

A much simpler incarnation of ggplot this time:

1
2
3
4
ggplot(weather, aes(x=month, fill=ideal))
  + geom_histogram()
  + facet_grid(city~.)
  + opts(axis.text.x  = theme_text(angle=90, size=10)))

And the results are again interesting; I would not have bet on November having the most ideal days here in Southern California, but it makes sense. And, July-September, over 1/2 the Portland days are ideal ones! (cough, like basically every month in Van Nuys.)

(Of course, anyone who’s been to Van Nuys can attest to the fact that, no matter how great the weather is, you’re very unlikely to have a perfect day in general. Because you’re in Van Nuys.)

So just to quickly summarize the data:

1
2
ideal_days <- weather[weather$ideal==TRUE,]
qplot(city, data=ideal_days)

Portland has about 42% of the “ideal days” that Van Nuys does. Which, given everything else that Portland’s got going for it is actually still a pretty good result.

That gives me an idea… I bet I can point R at some Crime, Traffic, School, and Cost of Living statistics to get a more well-rounded picture :)

Screencasting Presentations With Screenflow

I’ve been really enjoying screencasting my presentations, and have come up with a nice workflow using the (commercial) Mac app ScreenFlow.

Watch the video for the full details, but the basic process is:

  • Record the screen while you give the presentation in “presenter mode” (so notes are available)
  • Also record video from your computer’s camera
  • Crop down to just the slide portion of the presenter view
  • Overlay the video on the slides and make it fully transparent (Opacity 0%)
  • Skim through the video, watching for slide changes as cues
  • Insert video actions (⌘K) and flip the Opacity between (100%/0%) to show the slide or the video.

At this point, I can come up with a very nice ‘canned version’ of a presentation in only slightly more than the time it takes to simply give the presentation. It’s also a great way to rehearse!

In the video I also recommend the Samson Go Mic, ($40) which I’ve been nothing but happy with.

(The full list of Screenflow Keyboard Shortcuts is good to keep handy.)

A technical side note: the video quality in the beginng of the video is a little off. The actual “presentation” was in 4:3, but the application demo was widescreen. This meant a bit of ffmpeg magic, (attaching ‘letterboxes’ to the sides to make it effectively widescreen) so I could glue the two videos together. When you actually use this method to simply give a presentation, it ends up looking even nicer.

1
$ ffmpeg -i Presentation.mov -s 560x416 -r 30000/1001 -padleft 46 -padright 48 -vcodec libx264 -vpre hq -acodec libf@c -ac 2 -ar 48000 -ab 192k PresentationLetterbox.mov

Going Fast With GMail’s Keyboard Shortcuts

I’m frequently surprised by how many people don’t know about the awesome GMail keyboard shortcuts that are built in to the web application. (And yes, this is the 1 billionth google hit on this topic, but the battle against ignorance is won post by post!)

I use them constantly, and they really help churn through the inbox efficiently and effectively.

First, you need to turn them on:

  • Settings (Gear icon) -> General -> Keyboard Shortcuts On

Click the link above to see the overall list, but the basics I always use:

  • x => select a message
  • e => archive a message
  • j/k => Move to next/previous conversation

Just those 4 keys will let you BLAST through conversations that don’t apply to you. I often end up doing (x,j,x,j,x,j,e) (select 3 conversations, then archive the lot of them in one swoop.) If it’s spam, or something I know I’ll never need again, replace ’e’ with ’#’ (delete)

  • # => delete selected conversation

Ok, so sometimes you actually need to read a conversation:

  • o => open the conversation under the cursor for reading

Our good friend ’x’ will pop you right back to the main view, with the current message selected. So if there’s a lot of messages I need to read, (o,x,j,o,x,j,…. e) will be reading through them all, then archiving the bunch.

If it’s one of those conversations that you think will go on for a while and isn’t relevant, the wonderful m (mute) will pre-archive the rest of the conversation, without you having to deal with it popping back up later! (I have to imagine there are some monster and interminable threads that happen in the googleplex.)

  • m => mute the conversation

And then there’s replying! Using shortcuts gets you another gorgeous side effect, way lower risk of accidentally reply-all’ing.

When you have a message selected

  • r => reply
  • a => reply-all
  • f => forward

Boom. Tasty efficiency.

Update: @naydichev reminds me of the most important shortcut of all:

  • ? => help (show all commands.)