Jonathan Dobres

child's play and a bit about data visualization

The books have closed on Child’s Play 2010, and this year’s total is a truly awe-inspiring 2.3 million dollars. With 2010 in the total, this means that cumulatively, Child’s Play has raised just shy of nine million dollars. Nine million dollars over the last eight years, every single cent of it helping to improve the life of a sick kid. If that’s not something to be proud of, I don’t know what is.

But I didn’t sit down in front of the computer today to talk to you about that. I do that enough. Instead, we’re going to talk about this sweet chart I made. Incidentally, at the time of this writing, Googling “sweet chart I made” returns this as the first result. Who am I to disagree?

Last year’s chart was put together with Numbers. I’m generally very happy with Numbers—certainly much happier than I ever was with the sluggish, bloated, obtuse mess that is Microsoft Excel—but the chart I produced last year has some problems. The spacing on the x-axis looks weird, and that’s a poor way to format a date anyway. Since the key shows the annual totals, it kind of defeats the point of the chart. And why did I go with a filled line chart? Because every year has many missing data points, and a filled chart was the only way to get Numbers to draw each year as a connected line.

This year’s chart was put together with R and ggplot2. Here’s what I like about it, and what I don’t.

What I Like

  • R and ggplot2. I can’t recommend R highly enough. It’s fast, flexible, powerful, and oh yeah, free. I now use it for all my data analysis needs. I intentionally gave myself a hellish, badly formatted CSV file to work with here, just to see if R could beat it into shape. No sweat. As for ggplot2, it’s overkill for some situations, but a great solution for most. Maps, anyone?
  • The date axis. It’s nicely labeled, with every major gridline representing exactly one week. Look closely, and you’ll see that the minor gridlines split the weeks into days.
  • I ditched the legend, and instead placed year labels at the ends of their respective lines. Extra special audience challenge: do this in Excel without killing yourself after fifteen minutes.
  • Cumulative total is computed on the fly and automatically added to the plot’s title.
  • In fact, the whole plot is defined programmatically, even the year labels, so adding in 2011’s data should be a cinch.
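To make the last three bullets concrete, here is a minimal sketch of the bookkeeping behind them, written in Python rather than the R I actually used. The data values, `label_anchors`, and `plot_title` are all hypothetical and only illustrate the idea: find the last point of each year's line (where its label goes), then sum those final values to build the title.

```python
from datetime import date

# Hypothetical running totals (date, dollars raised so far) per year.
# These are made-up placeholder values, not real Child's Play figures.
observations = {
    2009: [(date(2009, 11, 1), 100), (date(2009, 12, 1), 400), (date(2010, 1, 5), 900)],
    2010: [(date(2010, 11, 1), 250), (date(2010, 12, 15), 1200)],
}

def label_anchors(obs):
    """Return {year: (last_date, last_total)}, the point where each year label sits."""
    # Tuples compare by date first, so max() picks the latest observation.
    return {year: max(points) for year, points in obs.items()}

def plot_title(obs):
    """Build a title from the sum of each year's final running total."""
    grand_total = sum(total for _, total in label_anchors(obs).values())
    return f"Child's Play donations, cumulative total: ${grand_total:,}"

anchors = label_anchors(observations)
title = plot_title(observations)
print(anchors)
print(title)
```

Because both the labels and the title are derived from the data, appending another year to `observations` updates them without touching anything else, which is the whole appeal of defining the plot in code.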

What I Don’t Like

  • There are too many colors on this thing. ggplot2 computes those colors by picking equally spaced hues around the color wheel, so as more colors are added, the difference between neighboring colors gets smaller. I’m using these colors to keep each line visually separate, but why? Do you need to see every data point of every year? One alternative would be to color in the current year and the previous year, and turn all others a shade of dark gray.
  • The larger problem, though, is that this plot doesn’t serve much of a purpose. I don’t have enough data to get an accurate sense of how quickly Child’s Play accumulates funds. Look at where the lines start. The early years start at $0, but across all years the starting values range up to nearly $500,000. Is that variability a reflection of larger corporate donations kicking off the fundraiser, or is it that Child’s Play runs year-round, and the charity is taking in more money during the non-holiday months of the year? In short, the only reliable data points in the plot are the totals, in which case a simple table could tell you just as much.
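The color complaint is easy to put numbers on. The sketch below is Python, not ggplot2 code, and assumes the hues are spread evenly over the full 360-degree wheel. `hue_gap`, `highlight_palette`, and the hex colors are hypothetical choices for illustration, and the 2003 to 2010 range is inferred from the eight years mentioned above.

```python
def hue_gap(n_series):
    """Degrees between neighboring hues when n hues are evenly spaced on a 360-degree wheel."""
    return 360 / n_series

def highlight_palette(years, highlight="#D55E00", previous="#0072B2", muted="#4D4D4D"):
    """Color the latest year and the one before it; turn every other year dark gray."""
    ordered = sorted(years)
    colors = {year: muted for year in ordered}
    if ordered:
        colors[ordered[-1]] = highlight
    if len(ordered) > 1:
        colors[ordered[-2]] = previous
    return colors

# Assumed year range: eight years ending in 2010.
years = range(2003, 2011)

# With 3 lines the hues sit 120 degrees apart; with 8 they are only 45 apart.
print(hue_gap(3), hue_gap(8))

palette = highlight_palette(years)
print(palette)
```

The gap shrinks with every year added, while the highlight palette stays readable no matter how many gray lines pile up behind the two colored ones.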

Still, it’s a fun exercise. I certainly learned a lot about R while working on this, and that’ll pay off in the future. Maybe I’ll tackle Boston’s weather data instead.