Autumn at Mount Auburn Cemetery

If I have a Happy Place™, it takes the form of a forest at the height of autumn. I like to think of autumn as the time of year when all the trees stop pretending and show us who they really are. Last week I had the pleasure of overdosing on autumnal splendor with my friend, Paul. Together we went on a long walk through Cambridge’s Mount Auburn Cemetery. Dedicated in 1831, Mount Auburn was the United States’s first “garden cemetery”—meant to be a place of beauty and contemplation as much as it was (and still is) a burial site.

We couldn’t have asked for a more beautiful autumn day at the height of foliage season, and it resulted in some pretty amazing photos. Facebook’s compression algorithms hit the tiny details in these particular photos really hard. So I took it upon myself to build a little gallery on my own humble site. Click any of the thumbnails for a bigger view, or hit the download button for the even larger originals.

algorithmic art: random circle packing

The algorithm above is a variant of one that currently runs on my front page. The heart of it is a circle packing algorithm, which you can find at my GitHub. All the code is there, so I won’t bother to explain every line of it here. There are, however, a few interesting bits I’d like to highlight.

Circle Packing

Circle packing is a well-established topic in mathematics. The central question is: how best can you cram non-overlapping circles into some available space? There are lots of variations on the idea, and lots of ways to implement solutions. My version works by proposing random locations for circles of smaller and smaller sizes, keeping only the circles that do not overlap with any circles already on the canvas. In pseudo-code, the simplest method looks like this:

let existing_circles = [ ]
let radius = 100

while (radius >= 0.5) {
  trying 1000 times {
	let new_circle = circle(random_x, random_y, radius)
	if (new_circle does not overlap existing_circles) {
	  add new_circle to existing_circles
	  restart trying
	}
  }

  // If we made it through all 1,000 tries without placing a circle, shrink the radius by 1%.
  // The program stops if the radius drops below 0.5 pixels.
  radius = radius * 0.99
}

The above is pseudo-code, and it hides a lot of complexity. In particular, line 7—new_circle does not overlap existing_circles—glosses over the most interesting part of the algorithm. What, exactly, is the most efficient way of checking to see if a new circle does not overlap any of the ones that have already been placed?

Notice that existing_circles is just a list, and it stores circles in the order in which they were created. Every time a new circle is proposed, its position is compared to the first circle in the list, then the next, and so on, up until a comparison fails or the list ends. On the one hand, the list structure has an implicit optimization: larger circles were created first and will be checked first, so areas on the canvas with the highest potential for overlap get checked early. On the other hand, if our maximum circle size is smaller, this advantage disappears. And in either case, successfully placing a new circle means that every existing circle needs to be checked, every time. The more circles we’ve placed, the more will need to be checked. This is like looking up the definition of the word packing by opening up the dictionary to the first entry and dutifully reading every word from aardvark to pace.
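
For concreteness, here’s roughly what that naive approach looks like in R (the real thing runs in JavaScript via P5, so treat these names as illustrative):

# Naive circle packing: every proposed circle is checked against every
# circle already placed. Idiomatic R vectorizes the check, but conceptually
# this is the same front-to-back scan through the list.
overlaps.any <- function(x, y, r, circles) {
  if (nrow(circles) == 0) return(FALSE)
  # Two circles overlap when the distance between their centers is smaller
  # than the sum of their radii; compare squared distances to skip the sqrt.
  any((circles$x - x)^2 + (circles$y - y)^2 < (circles$r + r)^2)
}

pack.circles <- function(width = 400, height = 400,
                         start.radius = 100, min.radius = 0.5, tries = 1000) {
  circles <- data.frame(x = numeric(0), y = numeric(0), r = numeric(0))
  radius <- start.radius
  while (radius >= min.radius) {
    placed <- FALSE
    for (i in seq_len(tries)) {
      x <- runif(1, 0, width)
      y <- runif(1, 0, height)
      if (!overlaps.any(x, y, radius, circles)) {
        circles <- rbind(circles, data.frame(x = x, y = y, r = radius))
        placed <- TRUE
        break   # success: restart the tries at the same radius
      }
    }
    if (!placed) radius <- radius * 0.99   # all tries failed: shrink by 1%
  }
  circles
}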

For the purposes of my website, that’s not acceptable—I need something that fills the canvas pretty quickly.1 What we really need is a data structure that organizes circles by their position, at least vaguely. That would make it easy to restrict our comparison only to circles that are near to the proposed new circle. Enter quadtrees.

Quadtrees

Of course, if you were looking for a specific word in a dictionary, you wouldn’t read it through from the start. Searching for packing, you’d probably grab a big chunk of the pages and flip open to somewhere near marzipan. Knowing that p comes after m, you flip forward by another chunk, landing at quintessence. Having gone too far, you browse backwards, perhaps more carefully, until you arrive at packing. This method of searching the dictionary is, roughly, a binary search. Rather than starting at the beginning of the data, you jump straight to the middle. If that’s not the object you’re looking for, go halfway to the beginning or end, depending on whether you’ve overshot or undershot. Continue taking these ever-smaller half-steps until you’ve narrowed your search to the thing you’re looking for.

This method of searching is a little less efficient for words near the beginning of the dictionary, but much more efficient, on average, across all possible entries.2
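
In code, the dictionary lookup becomes the familiar binary search. A toy version in R, searching a sorted vector of words:

# Binary search over a sorted character vector: jump to the middle, then
# keep halving the search window until the target is found.
binary.search <- function(sorted.words, target) {
  lo <- 1L
  hi <- length(sorted.words)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (sorted.words[mid] == target) return(mid)
    if (sorted.words[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_   # not found
}

binary.search(c('aardvark', 'marzipan', 'pace', 'packing', 'quintessence', 'zebra'), 'packing')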

A quadtree generalizes this idea to 2D space, like our canvas. The quadtree works by dividing the canvas into four quadrants. Every time we propose a new circle, the quadtree checks to see where the circle landed. If the circle lands in an empty cell, great, make a note of that. If the circle lands in a cell that already has a circle, the cell splits into four smaller cells, repeating until each cell once again holds at most one circle. The image above shows a quadtree storing twenty circles, with a twenty-first proposed circle in blue. When checking for possible overlaps, we can quickly determine that three of the four biggest quadrants don’t overlap the proposed circle and one does—four quick checks and we’ve eliminated 75% of the canvas, no need to check the seventeen circles contained within. The proposed circle does overlap the upper-right quadrant, so its sub-quads will need to be checked—four more checks. After about eight comparisons, as opposed to twenty, we’ve determined that the proposed circle is valid. As more circles are added and the grid becomes finer, the number of checks becomes roughly proportional to the potential area of overlap.3
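
To make the idea concrete, here’s a bare-bones quadtree sketch in R (the real implementation lives in the JavaScript sketch; this version stores only circle centers, as points, and every name is purely illustrative):

# Each node covers a rectangle, holds at most one point, and splits into
# four children the moment a second point arrives.
qt.node <- function(x, y, w, h) {
  node <- new.env()
  node$x <- x; node$y <- y; node$w <- w; node$h <- h
  node$point <- NULL      # the single point stored directly in this cell
  node$children <- NULL   # four sub-cells, created only when needed
  node
}

qt.contains <- function(node, px, py) {
  px >= node$x && px < node$x + node$w &&
    py >= node$y && py < node$y + node$h
}

qt.split <- function(node) {
  hw <- node$w / 2; hh <- node$h / 2
  node$children <- list(
    qt.node(node$x,      node$y,      hw, hh),
    qt.node(node$x + hw, node$y,      hw, hh),
    qt.node(node$x,      node$y + hh, hw, hh),
    qt.node(node$x + hw, node$y + hh, hw, hh)
  )
}

qt.insert <- function(node, px, py) {
  if (!qt.contains(node, px, py)) return(FALSE)
  if (is.null(node$children)) {
    if (is.null(node$point)) {         # empty leaf: store the point here
      node$point <- c(px, py)
      return(TRUE)
    }
    qt.split(node)                     # occupied leaf: split into quadrants
    old <- node$point; node$point <- NULL
    for (child in node$children) if (qt.insert(child, old[1], old[2])) break
  }
  for (child in node$children) if (qt.insert(child, px, py)) return(TRUE)
  FALSE
}

# Collect every stored point whose cell intersects a query rectangle,
# pruning whole branches that cannot possibly matter.
qt.query <- function(node, qx, qy, qw, qh) {
  outside <- node$x > qx + qw || node$x + node$w < qx ||
             node$y > qy + qh || node$y + node$h < qy
  if (outside) return(list())
  found <- if (is.null(node$point)) list() else list(node$point)
  if (!is.null(node$children)) {
    for (child in node$children) {
      found <- c(found, qt.query(child, qx, qy, qw, qh))
    }
  }
  found
}

Checking a proposed circle then means querying a rectangle around it and comparing against only the handful of points that come back, rather than every circle on the canvas.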

There are many fine implementations of quadtrees, and I certainly don’t claim that mine is the best, but it was very satisfying to learn it myself.4

Into the Quadforest

I’m glossing over one other thing. We don’t just want to compare the circle positions, but rather, their areas. Using a single quadtree, we have to assume that the potential search area is a square with side lengths equal to the diameter of the proposed new circle, plus the maximum circle diameter stored in the tree. In practice, as the circles get smaller, we’ll be checking a much bigger area, and many more circles, than is necessary. The solution is to use multiple quadtrees, the first of which stores and searches against the largest circles, the next smaller ones, and so on. In this way, we can ensure that larger circles are checked first, restoring the advantage from the naive method.
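
Building on the quadtree sketch above, the “quadforest” is little more than a list of trees, one per band of circle sizes, queried largest-first with a search rectangle padded by that band’s maximum radius (again, purely illustrative):

# One quadtree per size band; the bands and canvas size here are arbitrary.
forest <- list(
  list(max.r = 100, tree = qt.node(0, 0, 400, 400)),
  list(max.r = 20,  tree = qt.node(0, 0, 400, 400)),
  list(max.r = 5,   tree = qt.node(0, 0, 400, 400))
)

# Candidate neighbors for a proposed circle at (px, py) with radius r:
# each tree is searched with a square whose side is the proposed diameter
# plus the largest diameter stored in that tree.
candidates.for <- function(forest, px, py, r) {
  unlist(lapply(forest, function(tier) {
    pad <- r + tier$max.r
    qt.query(tier$tree, px - pad, py - pad, 2 * pad, 2 * pad)
  }), recursive = FALSE)
}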

All these optimizations really do make a difference. When the maximum circle size is large (say 200 pixels in diameter), the quadforest search is about three times faster than the naive list search. When the maximum circle size is smaller (say 20 pixels in diameter), the quadforest is about thirty times faster. Code that pits the two methods against each other in a race is available in the repo.

Happy Accidents

Having gotten the algorithm working nicely, I wanted some visual interest, or as the kids say, “art”. My first thought was to have the circles pop into existence with an animation. I’m not clearing the canvas on each new frame, which saves me from having to redraw every circle for every frame. And because I was testing my quadtree, the drawing commands were set to draw unfilled, outline shapes. So when I specified a five-frame “pop” animation, what I ended up with was five concentric rings for each circle—the ghosts of the pop animation. These appear thin and delicate for the larger circles, but thicker and almost solid for the smaller ones. The unexpected patterning produces a variety of texture and brightness that I really enjoy.

And there you have it. Efficient random circle packing, done fast, live, and pretty. The code for this algorithm is available on GitHub.

  1. If I were doing this in Processing proper, it wouldn’t really matter. A lightly optimized version of this basic algorithm can pack a large sketch almost completely in a few seconds. But this is P5, which means JavaScript in a web browser. More optimizations are necessary. 

  2. Assuming aardvark is the first entry in a 128 word dictionary, you can get to it in 7 steps using the binary search method, as opposed to 1 step using the start-to-finish naive method. If zebra is the last entry in the dictionary, it also requires 7 steps with binary search, as opposed to 128 with the naive method. 

  3. More or less. As the canvas fills up, it’ll get harder to find new valid locations, and a very large, very dense, or very lopsided tree would introduce some inefficiencies. Wikipedia, as always, has more.

  4. The code for this sketch is mostly the quadtree implementation, with a few specialized functions for handling the circle stuff. Maybe I should retitle this post. 

compound inequalities in R

Suppose you’re writing some code and you want to check to see if a variable, say x, is between two other numbers. The standard way to do this comparison in most languages, including R, would be something like:

x >= 10 & x <= 20

Not too shabby, all things considered. Of course, the repetition of the variable name is unfortunate, especially if you’re checking a.very.long.variable.name. You also have to flip either the second comparison operator or the order of its arguments, which can be a little confusing to read. Utility functions like dplyr::between give you the ability to write:

between(x, 10, 20)

That’s more convenient, but between assumes that the comparison should be inclusive on both sides; it implements >= and <=, but not > or <. We could write our own version of between to allow flags for whether each side of the comparison should be inclusive, but at that point the syntax gets more cumbersome than the thing it’s meant to replace.
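
To see what I mean, here’s a hypothetical flag-based version:

# A between() with inclusivity flags. It works, but the call is clunkier
# than the comparison it replaces.
between.flags <- function(x, lower, upper,
                          include.lower = TRUE, include.upper = TRUE) {
  lower.ok <- if (include.lower) x >= lower else x > lower
  upper.ok <- if (include.upper) x <= upper else x < upper
  lower.ok & upper.ok
}

between.flags(15, 10, 20, include.lower = FALSE)   # i.e., 10 < 15 <= 20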

What we really want is the ability to write the comparison as a compound inequality, which you might remember from grade school. They look like this:

10 < x <= 20

This succinctly expresses that x should be greater than 10 and less than or equal to 20. Expressing exclusive or inclusive comparisons on either side is as easy as changing the operators, and you type the compared variables and values only once.

It turns out that it’s possible to implement custom compound inequality operators in R, and it’s not hard. The main hurdle is that the custom operator needs to do one of two things: either evaluate its comparison, attach the result to the data, and return that data (so it can be passed along to the second comparison), or evaluate its comparison and combine it with a result that has already been attached. Since R allows us to set arbitrary attributes on data primitives, we can easily meet these requirements.

'%<<%' <- function(lhs, rhs) {
  
  if (is.null(attr(lhs, 'compound-inequality-partial'))) {
    out <- rhs
    attr(out, 'compound-inequality-partial') <- lhs < rhs
  } else {
    out <- lhs < rhs & attr(lhs, 'compound-inequality-partial')
  }
  
  return(out)
}

That’s all we need to implement compound “less-than” comparisons (we’ll also need similar definitions for %<<=%, %>>%, and %>>=%, but more on that in a minute). I’m using %<<% instead of just %<% because the complementary %>% operator would conflict with the widely used pipe operator. The doubled symbol also neatly reinforces that this operator is meant to evaluate two comparisons.

Whenever the custom %<<% is encountered, it checks to see if the argument on the left-hand side has an attribute called compound-inequality-partial. If it doesn’t, the result of the comparison is attached to the original data as an attribute with that name, and the modified data are returned. If the attribute does exist, the function checks to see whether both the comparison stored in compound-inequality-partial and the second comparison are true. That’s all there is to it.
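
You can watch the mechanism at work by running the pieces separately:

step.one <- 2 %<<% 4
step.one                                        # prints 4, along with the attached attribute
attr(step.one, 'compound-inequality-partial')   # TRUE, the stored result of 2 < 4
step.one %<<% 6                                 # (4 < 6) & TRUE, which is TRUE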

Implementing this functionality for the remaining three comparisons is pretty repetitive, so let’s instead write a generalized comparison function. To do this, we’ll take advantage of the fact that in R, 1 + 2 can also be invoked as '+'(1, 2) or do.call('+', list(1, 2)). Here’s our generalized helper:

compound.inequality <- function(lhs, rhs, comparison) {
  if (is.null(attr(lhs, 'compound-inequality-partial'))) {
    out <- rhs
    attr(out, 'compound-inequality-partial') <- do.call(comparison, list(lhs, rhs))
  } else {
    out <- do.call(comparison, list(lhs, rhs)) & attr(lhs, 'compound-inequality-partial')
  }
  
  return(out)
}

And then the definitions for all four operators become:

'%<<%' <- function(lhs, rhs) {
  return(compound.inequality(lhs, rhs, '<'))
}

'%<<=%' <- function(lhs, rhs) {
  return(compound.inequality(lhs, rhs, '<='))
}

'%>>%' <- function(lhs, rhs) {
  return(compound.inequality(lhs, rhs, '>'))
}

'%>>=%' <- function(lhs, rhs) {
  return(compound.inequality(lhs, rhs, '>='))
}

The comparison 2 %<<% 4 %<<% 6 is true. If you only wrote the first comparison, you’d get back 4, with TRUE attached as an attribute. Now it’s easy, and very readable, to specify compound “between” comparisons that are inclusive or exclusive on either side.
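
Mixing the operators works just as you’d expect:

x <- 15
10 %<<% x %<<=% 20    # TRUE: 10 < 15 and 15 <= 20
10 %<<% 25 %<<=% 20   # FALSE: 25 is not less than or equal to 20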

I’m not sure that I’d actually use this in “real” scripts, since these types of comparisons are relatively rare for me. But it was a fun problem to think about, and useful enough to share. This surprisingly simple solution demonstrates the flexibility of the R language. Code for this project is also available on GitHub.

the pleasures and pitfalls of data visualization with maps

I’ve been thinking more about data visualization, if the previous post wasn’t enough of an indication. I recently stumbled upon an interesting post by Sean Lorenz at the Domino Data Lab blog on how to use R’s ggmap package to put stylish map tiles into your visualizations.

In the midst of his engaging and accessible tour of the package, Lorenz offers the following visualization of fatal motor vehicle collisions by state (or at any rate, a version very similar to the one I have here). Click to embiggen:

Lorenz says that this is a nice visualization of “the biggest offenders: California, Florida, Texas and New York.” But something about that list of states in that particular order seemed somehow familiar, and caught my attention.

Disclaimer

Before I continue, I’d like to state that the following is an informal analysis. It is intended for entertainment and educational purposes only, and is not scientifically rigorous, exhaustive, or in any way conclusive.

Data and R code for this project are available in my GitHub repository.

What Were We Talking About?

So, back to that map. The one where California, Florida, Texas, and New York were the “biggest offenders”. This struck me as suspicious, as California, Florida, Texas, and New York are coincidentally the most populous states in the country (in that order). A close reading of Lorenz’s post shows that he is, indeed, working with the raw number of collisions per state, and is not normalizing by the states’ populations. That’s important because, as you can see, there’s a very strong relationship between the number of people in a state and the number of accidents:

With a relationship that strong (even removing the Big Four outliers, R² = 0.80), a plot of raw motor vehicle accidents is essentially just a map of population. Dividing the number of collisions per state by that state’s population yields a very different map:

As you can see, many of the bubbles are now roughly the same size. Florida still stands out from the crowd, and there’s a state somewhere in the Northeast with a very high accident rate, though it’s hard to tell which one, exactly. This map is no longer useful as a visualization, because it’s no longer clarifying the relationships between the data points. This is the main problem with bubble maps: they often visualize data that do not actually have a meaningful spatial relationship, and they do so in a way that would be hard to examine even without the superfluous geography.

So how else might we visualize this information? Well, how about a choropleth map, with each state simply colored by accident rate:

The brighter the state, the higher the rate.1 Here, it’s more obvious that Delaware, New Mexico, Louisiana, and Florida have high accident rates, but which has the highest? And what about that vast undifferentiated middle? Maybe the problem is that all those subtle shades of blue are hard to tell apart. Maybe we should simplify the coloring and bin the data into quantiles?

Colorful, and simpler, but not necessarily more useful. From this map, you’d get the impression that Florida, Delaware, Texas, California, and many other states all have comparable accident rates, which isn’t necessarily true. How about we ditch our commitment to maps and just plot the data points?

Ah, clarity at last. Here I’m still coloring the points by quantile so that you can see how much of the data’s variability was hidden in the previous map. Now it’s immediately clear that Delaware, New Mexico, Florida, Louisiana, and South Carolina all have unusually high accident rates (arguably, so do North Carolina and Arizona). Beyond that, almost every other state clusters within one standard deviation of the national average, with South Dakota having a notably low accident rate.
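
If you’d like to follow along, the transformations behind these plots boil down to a few lines. Here’s a rough sketch in base R, assuming a data frame with columns named state, collisions, and population (the actual column names in the repository may differ):

# Normalize by population, convert to z-scores, and bin into quartiles.
accidents$rate   <- accidents$collisions / accidents$population
accidents$rate.z <- as.numeric(scale(accidents$rate))
accidents$bin    <- cut(accidents$rate,
                        breaks = quantile(accidents$rate, probs = seq(0, 1, 0.25)),
                        include.lowest = TRUE)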

Of course, Lorenz’s point wasn’t really about the accident data, it was about how nifty maps can be. Having just spent the last few paragraphs demonstrating the ways in which maps can fool us, you might get the impression that I’m down on maps. I’m really not. Maps can be beautiful and informative, but by definition, they need to show us data with some spatial relevance. Take, for instance, this map of the Boston marathon.

The lovely tiles are pulled from Stamen Maps, courtesy of the ggmap package, and I really can’t emphasize enough how much work ggmap is doing for you here.2 From there, it was pretty easy to overlay some custom data. The red line traces the path of the Boston Marathon, and the blue line shows Boston’s city limits. Curiously, very little of the Boston Marathon actually happens in Boston. Starting way out west in Hopkinton, the marathon doesn’t touch Boston proper until it briefly treads through parts of Brighton. Then the route passes through Brookline (a separate township from Boston, and quite ardent about it), before re-entering the city limits near Boston’s Fenway neighborhood, just a couple of miles from the finish line.
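
For the curious, the basic ggmap recipe looks something like the sketch below. The route and boundary data frames are placeholders here (see the second footnote for how much of a pain they were to produce), and tile sources have shifted over the years, so treat this as illustrative:

library(ggmap)

# Fetch Stamen tiles for the Boston area and overlay two paths.
boston.map <- get_map(location = 'Boston, MA', zoom = 11,
                      source = 'stamen', maptype = 'toner-lite')
ggmap(boston.map) +
  geom_path(data = marathon.route, aes(x = lon, y = lat), color = 'red') +
  geom_path(data = boston.limits,  aes(x = lon, y = lat), color = 'blue')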

  1. Here I’m converting the accident rate (essentially, accidents per person) to a z-score. Z-scores have a variety of useful properties, one of which is that for any normally distributed set of observations (as the accident rates are, more or less), roughly 99.7% of the data should fall between values of -3 and 3. This is more useful than the teensy percentage values, which are hard to interpret as a meaningful unit. Put another way, the simple percentage values tell me about the accident rate’s relationship to its state (not very useful), while the z-score tells me about the accident rates relative to each other (much more useful). 

  2. Really. Adding in that clean outline of Boston’s city limits was a nightmare. You have no idea how difficult it was to a) find good shape data, b) read that data into R correctly, c) project the data into latitude/longitude coordinates, and d) plot it neatly. That ggmap can put so much beautifully formatted data into a plot so quickly is a real marvel. 

understanding ggplot: an example

If you use R, odds are good that you’ve also come across ggplot2, by Hadley Wickham. This wildly popular plotting package is one of the most useful tools in the entire R ecosystem. If it didn’t exist, I would have had to invent it (or a significantly inferior version of it). ggplot’s elegant, powerful syntax can turn you into a data vis wizard, if you know what you’re doing. It’s getting to “I know what I’m doing” that can be tough.

Earlier today, I came across a Bloomberg article on how more older drivers are staying on the road and becoming the auto industry’s fastest-growing demographic. It’s an interesting read, but what caught my attention was this figure:

From a data visualization standpoint, this plot is actually several flavors of weird and bad (more on that later), but suppose you find yourself enamored of its fun colors and jaunty angles, and maybe you want to see how much of this you can replicate purely with ggplot.

First things first, download the data, which has just three columns: age.group, year, and value. Next we’ll calculate a marker for whether the number of drivers in an age group is increasing, as well as the actual percentage change:1

library(plyr);     # provides ddply() and mutate()
library(ggplot2);  # loaded up front; every plot below uses it

data.bloom <- ddply(data.bloom, .(age.group), mutate, 
                    is.increasing = value[year == 2013] >= value[year == 2003],
                    percent = (value[year == 2013] - value[year == 2003]) / value[year == 2003] * 100
);

And now we’re ready to plot! Let’s get the basics going:

plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
  geom_ribbon(aes(ymin = 0, ymax = value)) +
  geom_point(aes(y = value)) +
  facet_wrap(facets = ~age.group, nrow = 2);
print(plot.bloom);

ggplot’s syntax often mystifies new users, so let’s go through it.

  • Line 1 is the main call to ggplot. It’s what tells R that we’re building a new ggplot object. The ggplot() function really only needs two arguments: a dataset (self-explanatory) and a set of aesthetics, defined in the mysterious aes() function called inside of ggplot(). ggplot will attempt to go through each individual row of the dataset and put it on our plot. The aesthetics defined in aes() tell ggplot which columns define which properties of the plot. In this main call to aes(), I’ve only specified that “year” should go on the x-axis. Notice that I can refer to the “year” column by name alone; no quotes, no reference to its parent data frame. The aes() function implicitly knows about its associated dataset. This aesthetic mapping will then trickle down to every subsequent piece of the plot (unless I change the aesthetic mapping later on down the line).
  • Line 2 is our first geom, which defines a graphical object. In this case, I’m using geom_ribbon(), which is useful for building polygonal shapes such as shaded regions, error bars, highlights, and so on. The geom_ribbon() function is pretty smart; it wants you to define “ymin”, “ymax”, “xmin”, and “xmax”, but if you give it just “x”, it’ll use that value for the min and max values. Since we already defined “x” up in ggplot(), we just have to give it “ymin” and “ymax”. Here I’m saying that I want “ymin” to always be 0, and “ymax” to take on the data values.
  • Line 3 calls geom_point(), and it does what it says on the tin. It already has “x” from the call to ggplot(), and it needs a “y”. In this case, that’s just “value” again.
  • Line 4 calls facet_wrap(). Faceting, an implementation of the small multiples technique, is perhaps the most powerful feature of ggplot. Faceting makes it possible to take a plot and split it into smaller sub-plots defined by some other factor or combination of factors. Our call to facet_wrap() simply says to split the plot up by “age.group”, and arrange the resulting set of sub-plots into exactly 2 rows.2

Four lines of code3, and we’re 90% of the way there. This is the point at which you’d call over your boss to show him your new, exciting, preliminary finding. But you wouldn’t call over your boss’s boss. The plot needs more work first. The labeling on the x-axis is screwy and there’s a distinct lack of eye-catching color here. Let’s fix that.

plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
  geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
  geom_point(aes(y = value)) +
  facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
  scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
  scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);

Here are the changes I’ve made:

  • Line 2 now adds fill = is.increasing to the aesthetics. This will color the polygons according to whether “is.increasing” is TRUE or FALSE.
  • Line 6’s call to scale_fill_manual() tells ggplot that the “fill” aesthetic is going to be controlled manually. While there are a number of built-in color scales, such as scale_fill_brewer() or scale_fill_grey(), scale_fill_manual() simply allows us to define our own colors. In this case, I’ve picked the specific colors from the Bloomberg figure, and mapped them to the two possible values in the “is.increasing” column.
  • Line 5 calls scale_x_continuous(). Since our “year” column is numeric data, ggplot assumes that 2003 and 2013 are just two points on a line (you can see lots of intermediate values in the plot above, as ggplot attempted to create neatly spaced tick marks). One way to clear up the labels would be to convert the year column to a factor, and then ggplot would label only the values that exist in the dataset. Another way would be to convert the year column to Date objects, and use scale_x_date(), but that’s overkill here. Instead, I’ve simply used the “breaks” argument to restrict labeling to the two values found in the dataset. The two-element vector being sent to “expand” tells ggplot how much padding to add to the ends of the x-axis. The first value defines a percentage value (16% here), while the second would add some raw value (here, some number of years). The default is c(0.04, 0), so I’ve essentially just increased the padding a little.
  • Lastly, Line 4 adds scales = 'free' to facet_wrap(). In the previous plot, notice that each facet shared the same axes, and axis values were only written on the far left and bottom of the plot. This is efficient and readable, but it’s not how the Bloomberg plot does things. For the sake of replication, we need x-axis labels on each facet.

We’re getting closer, but these changes have produced some undesirable effects. The use of a fill aesthetic has caused ggplot to helpfully add a legend, which we don’t want, while the use of scales = 'free' has caused each facet to plot the data against its own set of y-axis limits, losing the nice comparable scaling we had before. We can keep the individual axes and the unified scaling with one addition:

plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
  geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
  geom_point(aes(y = value)) +
  facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
  scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
  scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
  scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);

Line 6 establishes explicit limits for the y-axes (ranging from 0 to 50), and knocks out the padding entirely, so that the data touches the y-axis, as in the Bloomberg plot. But wait! The Bloomberg figure also prints the value of each data point, and includes the written percentage as well. Two calls to geom_text() can get that done:

plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
  geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
  geom_point(aes(y = value)) +
  geom_text(aes(label = sprintf('%0.1f', value), y = value), vjust = -1, size=3) +
  geom_text(subset=.(year == 2013), aes(label = sprintf('%+0.1f%%', percent)), x = 2008, y = 0, vjust = -1, fontface = 'bold', size=3) +
  facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
  scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
  scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
  scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);

Lines 4 and 5 accomplish similar things, so let’s just focus on line 5:

  • Notice that the “label” argument inside aes() invokes sprintf(), which is a very handy way of ensuring that the percentage values are all neatly rounded to one decimal place and attached to a percent symbol. It’s possible to do these types of transformations within ggplot, thus saving you the trouble of having to create transforms within the dataset itself.
  • “x” is set to 2008, which is halfway between 2003 and 2013, thus ensuring that the label appears at the horizontal center of the plot. Note that “x” is set outside of aes(). You can do that if you’re setting an aesthetic to a single value.
  • Likewise, “y” is always 0, the baseline of the plot, and a “vjust” of -1 will set the text slightly above that baseline.
  • The “fontface” and “size” attributes should be fairly self-explanatory.
  • Lastly, remember when I said that ggplot wants to plot every row in your dataset? This means that ggplot would print the percentage values twice, right on top of each other, since each value appears in two rows (once for each year within an age group). The “subset” parameter ensures that I’m only pulling the labels once.

Finally, the rest is cosmetic:

plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
  geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
  geom_point(color='white', size=3, aes(y = value)) +
  geom_point(color='black', size=2, aes(y = value)) +
  geom_text(aes(label = sprintf('%0.1f', value), y = value), vjust = -1, size=3) +
  geom_text(subset=.(year == 2013), aes(label = sprintf('%+0.1f%%', percent)), x = 2008, y = 0, vjust = -1, fontface = 'bold', size=3) +
  facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
  scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
  scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
  scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595')) +
  labs(x=NULL, y=NULL) +
  theme_classic() + theme(legend.position='none',
                          axis.line.y=element_blank(),
                          axis.ticks.y=element_blank(),
                          axis.text.y=element_blank(),
                          axis.text.x=element_text(color='#aaaaaa'),
                          strip.text=element_text(face='bold'),
                          strip.background=element_rect(fill='#eeeeee', color=NA),
                          panel.margin.x = unit(0.25, 'in'),
                          panel.margin.y = unit(0.25, 'in')
  );
print(plot.bloom);

  • Lines 3 and 4 create two duplicate sets of points, in black and white, thus creating a white outline around the points.
  • Line 11 turns off the axis labeling.
  • Line 12 tells ggplot to use its built-in “classic” theme, which features a white background and no gridlines, thus saving us some typing in theme().
  • Lines 12-20 are just one big adventure in the aforementioned theme(), which allows us to control the style of the plot. ggplot builds everything out of some basic theme elements, including element_rect(), element_text(), and element_line(). Setting any parameter of the theme to element_blank() removes that element from the plot itself, automatically re-arranging the remaining elements to use the available space.4

As far as I’m aware, this is as close as you can get to reproducing the Bloomberg plot without resorting to image editing software. ggplot can’t color facet labels individually (notice that in the Bloomberg version, the labels for the two oldest age groups are a different color than the rest). While I could muck around with arrows and label positioning for the percentage values, it would involve a lot of finicky trial and error, and generally isn’t worth the trouble.

So, we’ve done it. We’ve replicated this plot, which is horrible for the following reasons:

  • The primary data are redundantly represented by three different elements: the colored polygons, the data points, and the data labels.
  • Percentage change is represented by two separate elements: the color of the polygon and the printed label.
  • The repetition of the axis labels is unnecessary.
  • Splitting the pairs of points into a series of eight sub-plots obscures the relationship between age group and the number of active drivers. There’s a lot of empty space between the meaningful data, and the arrangement of sub-plots into two rows makes the age relationship harder to examine.

Using the same dataset, one could write this code:

plot.line <- ggplot(data=data.bloom, aes(x = age.group, y = value, color = as.factor(year))) +
    geom_line(aes(group = year)) +
    geom_point(size = 5.5, color = 'white') +
    geom_point(size = 4) +
    geom_text(aes(label=sprintf('%+0.1f%%', percent), color = is.increasing), y = -Inf, vjust = -1, size=3, show_guides=F) +
    geom_text(data=data.frame(year=c('2003', '\n2013')), aes(label=year), x=Inf, y = Inf, hjust=1, vjust=1) +
    scale_y_continuous(limits = c(0, 45)) +
    labs(x = 'Age Group', y = 'Number of Licensed Drivers (millions)') +
    scale_color_manual(values = c('FALSE'='#f00e48', 'TRUE'='#24a57c', '2003'='lightblue', '2013'='orange', '\n2013'='orange')) +
    theme_classic() + theme(legend.position = 'none',
                            axis.line = element_line(color='#999999'),
                            axis.ticks = element_line(color='#999999'),
                            axis.text.x = element_text(angle=45, hjust=1, vjust=1)
                        );
print(plot.line)

And produce this plot:

You could argue with some of the choices I’ve made here. For example, I’m still using points and lines to encode the same data. But I tend to subscribe to Robert Kosara’s definition of chart junk, which states that, “chart junk is any element of a chart that does not contribute to clarifying the intended message.” Here, the lines serve to remind the audience that the data are on a rough age continuum, while the points clarify which parts are real data and which are interpolation. I’ve also added color to the percentage change values, as I feel it’s important to highlight the one age group that is experiencing a decrease in licensure. Eagle-eyed readers will probably notice that I’m abusing ggplot a little, in that I’ve suppressed the automated legend and am using geom_text() to create a custom legend of colored text (and the tricks I had to pull in scale_color_manual() are just weird).5

So what have we learned about ggplot? Look at how much explaining I had to do after each piece of code. This tells you that ggplot packs an awful lot of power into a few commands. A lot of that power comes from the things ggplot does implicitly, like establishing uniform axis limits across facets and building sensible axis labels based on the type of data used for the “x” and “y” aesthetics. Notice that ggplot even arranged the facets in the correct order.6 At the same time, ggplot allows for a lot of explicit control through the various scale() commands or theme(). The key to mastering ggplot is understanding how the system “thinks” about your data, and what it will try to do for you. Lastly, notice that I was able to cook up a basic visualization of the data with just a few succinct lines of code, and gradually build the visualization’s complexity, iterating its design until I arrived at the presentation I wanted. This is a powerful feature, and one ideal for data exploration.

Feel free to grab the data and R code for this project.

  1. We will, of course, be using plyr, another indispensable tool from the mind of Dr. Wickham. In this case, we use mutate() in the call to ddply() to add two new columns to the dataset (which I’m calling “data.bloom”). For each age group, calculate whether the value increases from 2003 to 2013 (“is.increasing”), and the percentage change (“percent”). 

  2. Note that facet_wrap() and facet_grid() use an elegant formula interface for defining facets, of the form rows ~ columns. In this case, I only need to define either the rows or the columns. But in another dataset I could facet, say, age.group + income ~ gender.

  3. Technically everything before print() is a single line of code, and I’m inserting line breaks for readability (notice the + signs at the end of each line, which connect all of the ggplot commands). 

  4. ggplot’s sensible way of handling the automatic layout and scaling of plot elements is half the reason I prefer it over the base plotting system. 

  5. And notice my use of y = -Inf in the call to geom_text(). This tells ggplot to place the text at the minimum of the y-axis, whatever that might turn out to be, a very useful feature that is documented nowhere.

  6. Really, we just got lucky there. When you are faceting based on character data (as we are here), ggplot attempts to arrange facets alphabetically, which works to our favor in this case. If you needed to place the facets in an arbitrary order, your best bet would be to convert your faceting columns into ordered factors, in which case ggplot will place the facets using the underlying order. 

the first 90%

I work with lots of data. Not what you’d call “big data”, at least not technically, but maybe “biggish”. More than enough that Excel would crash just trying to open the dataset, assuming you were foolish enough to try. The amount of data is voluminous enough, and the relationship between the raw data and what you’re trying to analyze complex enough, that you need pretty decent data management chops to even access it correctly. But let’s say you have accessed it correctly. Now you can proceed to perform your analysis, make data visualizations, and be a sorcerer of the digital age. Right?

Wrong. You left out the most important step: getting your data into the right format, making sure each data point has all the right labels in the right places to allow you to proceed to the real science. Cleaning data—that is, pulling in the unprocessed data, transforming it, rearranging it, relabeling it, discarding garbage, and otherwise getting it into a format that will play nicely with your analysis tools—is easily 90% of the job.

Let me give you an example.

This is a plot of the hours of daylight (sunrise time subtracted from sunset time) that Boston, Massachusetts received throughout 2014. I got the data from the US Naval Observatory after reading this post about the merits of Daylight Savings Time. Request a data file for any location in the US, and you’ll find it looks like this (scroll the box rightward to see the whole thing):

             o  ,    o  ,                                BOSTON, MASSACHUSETTS                         Astronomical Applications Dept.
Location: W071 05, N42 19                          Rise and Set for the Sun for 2014                   U. S. Naval Observatory        
                                                                                                       Washington, DC  20392-5420     
                                                         Eastern Standard Time                                                        
                                                                                                                                      
                                                                                                                                      
       Jan.       Feb.       Mar.       Apr.       May        June       July       Aug.       Sept.      Oct.       Nov.       Dec.  
Day Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set
     h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m
01  0713 1623  0658 1659  0620 1734  0527 1810  0440 1844  0410 1914  0411 1925  0437 1904  0510 1818  0541 1726  0618 1638  0654 1613
02  0714 1624  0657 1700  0618 1735  0525 1811  0438 1845  0410 1915  0412 1925  0438 1903  0511 1817  0543 1724  0619 1637  0655 1613
03  0714 1624  0656 1701  0616 1737  0524 1812  0437 1846  0409 1916  0413 1924  0439 1901  0512 1815  0544 1722  0620 1635  0656 1612
04  0714 1625  0655 1702  0615 1738  0522 1814  0436 1847  0409 1917  0413 1924  0440 1900  0513 1813  0545 1721  0621 1634  0657 1612
05  0713 1626  0653 1704  0613 1739  0520 1815  0435 1848  0409 1917  0414 1924  0441 1859  0514 1811  0546 1719  0623 1633  0658 1612
06  0713 1627  0652 1705  0612 1740  0518 1816  0433 1849  0408 1918  0414 1924  0442 1858  0515 1810  0547 1717  0624 1632  0659 1612
07  0713 1628  0651 1706  0610 1741  0517 1817  0432 1850  0408 1919  0415 1923  0443 1856  0516 1808  0548 1716  0625 1631  0700 1612
08  0713 1629  0650 1708  0608 1743  0515 1818  0431 1852  0408 1919  0416 1923  0444 1855  0517 1806  0549 1714  0626 1629  0701 1612
09  0713 1630  0649 1709  0607 1744  0513 1819  0430 1853  0408 1920  0416 1922  0445 1854  0518 1805  0550 1712  0628 1628  0702 1612
10  0713 1632  0647 1710  0605 1745  0512 1820  0428 1854  0407 1920  0417 1922  0446 1852  0519 1803  0551 1711  0629 1627  0702 1612
11  0712 1633  0646 1712  0603 1746  0510 1821  0427 1855  0407 1921  0418 1921  0447 1851  0520 1801  0553 1709  0630 1626  0703 1612
12  0712 1634  0645 1713  0601 1747  0508 1823  0426 1856  0407 1921  0419 1921  0448 1850  0521 1759  0554 1707  0631 1625  0704 1612
13  0712 1635  0644 1714  0600 1749  0507 1824  0425 1857  0407 1922  0419 1920  0450 1848  0522 1758  0555 1706  0633 1624  0705 1612
14  0711 1636  0642 1715  0558 1750  0505 1825  0424 1858  0407 1922  0420 1920  0451 1847  0523 1756  0556 1704  0634 1623  0706 1612
15  0711 1637  0641 1717  0556 1751  0504 1826  0423 1859  0407 1923  0421 1919  0452 1845  0524 1754  0557 1702  0635 1622  0706 1613
16  0710 1638  0639 1718  0555 1752  0502 1827  0422 1900  0407 1923  0422 1919  0453 1844  0525 1752  0558 1701  0636 1622  0707 1613
17  0710 1640  0638 1719  0553 1753  0500 1828  0421 1901  0407 1923  0423 1918  0454 1842  0526 1751  0559 1659  0637 1621  0708 1613
18  0709 1641  0637 1721  0551 1754  0459 1829  0420 1902  0407 1924  0424 1917  0455 1841  0527 1749  0601 1658  0639 1620  0708 1614
19  0709 1642  0635 1722  0549 1755  0457 1830  0419 1903  0407 1924  0424 1916  0456 1839  0529 1747  0602 1656  0640 1619  0709 1614
20  0708 1643  0634 1723  0548 1757  0456 1832  0418 1904  0408 1924  0425 1916  0457 1838  0530 1745  0603 1655  0641 1618  0709 1614
21  0707 1644  0632 1724  0546 1758  0454 1833  0418 1905  0408 1925  0426 1915  0458 1836  0531 1743  0604 1653  0642 1618  0710 1615
22  0707 1646  0631 1726  0544 1759  0453 1834  0417 1906  0408 1925  0427 1914  0459 1835  0532 1742  0605 1652  0644 1617  0711 1615
23  0706 1647  0629 1727  0543 1800  0451 1835  0416 1907  0408 1925  0428 1913  0500 1833  0533 1740  0607 1650  0645 1617  0711 1616
24  0705 1648  0628 1728  0541 1801  0450 1836  0415 1908  0409 1925  0429 1912  0501 1832  0534 1738  0608 1649  0646 1616  0711 1617
25  0704 1650  0626 1729  0539 1802  0448 1837  0414 1909  0409 1925  0430 1911  0502 1830  0535 1736  0609 1647  0647 1615  0712 1617
26  0704 1651  0624 1731  0537 1803  0447 1838  0414 1910  0409 1925  0431 1910  0503 1828  0536 1735  0610 1646  0648 1615  0712 1618
27  0703 1652  0623 1732  0536 1805  0445 1839  0413 1910  0410 1925  0432 1909  0504 1827  0537 1733  0611 1644  0649 1614  0712 1619
28  0702 1653  0621 1733  0534 1806  0444 1841  0412 1911  0410 1925  0433 1908  0505 1825  0538 1731  0613 1643  0650 1614  0713 1619
29  0701 1655             0532 1807  0443 1842  0412 1912  0411 1925  0434 1907  0506 1823  0539 1729  0614 1642  0652 1614  0713 1620
30  0700 1656             0530 1808  0441 1843  0411 1913  0411 1925  0435 1906  0507 1822  0540 1728  0615 1640  0653 1613  0713 1621
31  0659 1657             0529 1809             0411 1914             0436 1905  0509 1820             0616 1639             0713 1622

                                             Add one hour for daylight time, if and when in use.

Getting that plot out of this data turns out to be a little tricky, and most of the trick is in the import and cleanup phases. Right now, the data are arranged such that the day of the month is on the rows, while the month, hour, minute, and sunrise/sunset label are on the columns. This is often called “wide” data, which is easy to look at, but usually hard to work with. Our goal is to create a “long” dataset in which each row holds a single timestamp corresponding to one day’s sunrise or sunset (essentially, two rows per day). I’m going to show you how to do it using R. You’ll also need the following R packages: reshape2, plyr, and lubridate.

First things first, we need to import the data, ideally so that each meaningful number (the hours and minutes for each day of the year) ends up in a neat column. While the double-nested headers are unfortunate (hour and minute are nested within sunrise/sunset, which are nested within month), at least the data follow a nice fixed-width format, with each column ending after a predictable number of characters. R happens to have a handy read.fwf function, which is specialized for reading in these types of files.

data.raw <- read.fwf(
  file='Boston Daylight Data 2014.txt', 
  skip=9, 
  nrows=31,
  colClasses='numeric', 
  strip.white=T,
  widths=c(2, rep(c(4, 2, 3, 2), 12))
);

The read.fwf command accomplishes a lot, so I’ve spread its arguments out over several lines. I’m telling the function to read in the file, skip its first nine rows (none of which contain data), read exactly the next 31, make sure to import all the columns as numbers (not text strings), strip out any extra whitespace, and lastly, how many characters wide each column should be. This produces a dataset that looks like this (I’m cutting out a lot of the data, but there are a total of 49 columns and 31 rows):

 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
  1  7 13 16 23  6 58 16 59   6  20  17  34   5  27  18  10   4  40  18
  2  7 14 16 24  6 57 17  0   6  18  17  35   5  25  18  11   4  38  18
  3  7 14 16 24  6 56 17  1   6  16  17  37   5  24  18  12   4  37  18
  4  7 14 16 25  6 55 17  2   6  15  17  38   5  22  18  14   4  36  18
  5  7 13 16 26  6 53 17  4   6  13  17  39   5  20  18  15   4  35  18
  6  7 13 16 27  6 52 17  5   6  12  17  40   5  18  18  16   4  33  18

Now we just need to name the columns:

colnames(data.raw) <- c('day', paste(rep(month.abb, each=4), rep(c('rise', 'set'), each=2), c('hour', 'minute')));

This is a somewhat confusing use of the paste function, but basically I’m creating a vector of names: the first one is “day”, followed by names that follow the convention “Month rise/set hour/minute” (for example, “Jan rise hour”). Creating the labels at this stage saves us the trouble of having to extract them later.1 Our next step is to melt the dataset.

data.daylight <- melt(data.raw, id.vars='day');

By default, melt wants to reduce a dataset to just two columns: “variable” and “value” (“variable” becomes a column containing the dataset’s former column names, and “value” stores their corresponding values). The columns specified in id.vars are preserved in the new data frame, and are not melted into the “variable” column. So now our dataset looks like this:

 day      variable value
   1 Jan rise hour     7
   2 Jan rise hour     7
   3 Jan rise hour     7
   4 Jan rise hour     7
   5 Jan rise hour     7
   6 Jan rise hour     7

Now I want to take my “variable” column and split it into three new columns: month, event (sunrise/sunset), and time (hour/minute). This is easily done with colsplit. Note that I’m combining it with cbind, so that I can attach the new columns to my dataset without creating a temporary variable.

data.daylight <- cbind(data.daylight, colsplit(data.daylight$variable, ' ', c('month', 'event', 'time')));

Which makes the data look like this:

 day      variable value month event time
   1 Jan rise hour     7   Jan  rise hour
   2 Jan rise hour     7   Jan  rise hour
   3 Jan rise hour     7   Jan  rise hour
   4 Jan rise hour     7   Jan  rise hour
   5 Jan rise hour     7   Jan  rise hour
   6 Jan rise hour     7   Jan  rise hour

We’re nearly there. All that’s left is to get each event’s hour and minute into the same row. As near as I can tell, there’s no better way to do it than with the handy dcast function. With this function, I’m saying that “month”, “day”, and “event” should define the rows, while the different values stored in “time” should form new columns.

data.daylight <- dcast(data.daylight, month + day + event ~ time);
 month day event hour minute
   Apr   1  rise    5     27
   Apr   1   set   18     10
   Apr   2  rise    5     25
   Apr   2   set   18     11
   Apr   3  rise    5     24
   Apr   3   set   18     12

From importing the data to this use of dcast, I’ve only written five lines of code. Now would be a great time to scroll back up and remember how the data looked originally. I’ll wait.

And that’s what I call “the first 90%”. The data are now in a highly flexible “long” format, and can be used with ease. For example, say we wanted to a) convert the “month” and “day” columns into proper Date data, which will make plotting much easier, and b) calculate the minute of the day at which the sunrise/sunset event occurred. Enter mutate, the easiest way to do this kind of transformation (with a call to lubridate’s ymd function to turn strings of numbers into Dates):

data.daylight <- mutate(data.daylight,
                      date=ymd(paste('2014', month, day)),
                      event.minute=hour * 60 + minute);
 month day event hour minute       date event.minute
   Apr   1  rise    5     27 2014-04-01          327
   Apr   1   set   18     10 2014-04-01         1090
   Apr   2  rise    5     25 2014-04-02          325
   Apr   2   set   18     11 2014-04-02         1091
   Apr   3  rise    5     24 2014-04-03          324
   Apr   3   set   18     12 2014-04-03         1092
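
Just to connect back to the plot at the top of the post, here’s one way to get from this format to hours of daylight per day (a sketch using ggplot2 for the plot, not necessarily the exact code behind that figure):

library(ggplot2)

# One row per date, with sunrise and sunset minutes side by side.
daylight.hours <- dcast(data.daylight, date ~ event, value.var = 'event.minute');
daylight.hours$hours <- (daylight.hours$set - daylight.hours$rise) / 60;

ggplot(daylight.hours, aes(x = date, y = hours)) +
  geom_line() +
  labs(x = NULL, y = 'Hours of daylight');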

Think about how tedious and error-prone it would have been to create the equivalent of the “date” and “event.minute” columns with the data as originally formatted. But now we’re getting into what I call “the other 90%”, which is another story for another time.

  1. There are a lot of different ways to skin a cat in R, and therefore lots of different ways you might have generated and assigned these labels. In fact, there are lots of ways to do almost anything in R. Before I knew about read.fwf, I used readLines and some clever regular expression magic to separate out the time values. Trust me, read.fwf is much easier.