posted November 4 2018
Autumn at Mount Auburn Cemetery
If I have a Happy Place™, it takes the form of a forest at the height of autumn. I like to think of autumn as the time of year when all the trees stop pretending and show us who they really are. Last week I had the pleasure of overdosing on autumnal splendor with my friend, Paul. Together we went on a long walk through Cambridge’s Mount Auburn Cemetery. Dedicated in 1831, Mount Auburn was the United States’s first “garden cemetery”—meant to be a place of beauty and contemplation as much as it was (and still is) a burial site.
We couldn’t have asked for a more beautiful autumn day at the height of foliage season, and it resulted in some pretty amazing photos. Facebook’s compression algorithms hit the tiny details in these particular photos really hard. So I took it upon myself to build a little gallery on my own humble site. Click any of the thumbnails for a bigger view, or hit the download button for the even larger originals.
posted April 1 2018
algorithmic art: random circle packing
The algorithm above is a variant of one that currently runs on my front page. The heart of it is a circle packing algorithm, which you can find at my GitHub. All the code is there, so I won’t bother to explain every line of it here. There are, however, a few interesting bits I’d like to highlight.
Circle Packing
Circle packing is a well-established topic in mathematics. The central question is: how best can you cram non-overlapping circles into some available space? There are lots of variations on the idea, and lots of ways to implement solutions. My version works by proposing random locations for circles of smaller and smaller sizes, keeping only the circles that do not overlap with any circles already on the canvas. In pseudo-code, the simplest method looks like this:
let existing_circles = [ ]
let radius = 100
while (radius >= 0.5) {
    trying 1000 times {
        let new_circle = circle(random_x, random_y, radius)
        if (new_circle does not overlap existing_circles) {
            add new_circle to existing_circles
            restart trying
        }
    }
    // If we made it through all 1,000 tries without placing a circle, shrink the radius by 1%.
    // The program stops if the radius drops below 0.5 pixels.
    radius = radius * 0.99
}
The above is pseudo-code, and it hides a lot of complexity. In particular, the check new_circle does not overlap existing_circles glosses over the most interesting part of the algorithm. What, exactly, is the most efficient way of determining whether a new circle overlaps any of the circles that have already been placed?
Notice that existing_circles is just a list, and it stores circles in the order in which they were created. Every time a new circle is proposed, its position is compared to the first circle in the list, then the next, and so on, up until a comparison fails or the list ends. On the one hand, the list structure has an implicit optimization: larger circles were created first and will be checked first, so areas on the canvas with the highest potential for overlap get checked early. On the other hand, if our maximum circle size is smaller, this advantage disappears. And in either case, successfully placing a new circle means that every existing circle needs to be checked, every time. The more circles we’ve placed, the more will need to be checked. This is like looking up the definition of the word packing by opening up the dictionary to the first entry and dutifully reading every word from aardvark to pace.
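Concretely, the overlap test is just a distance comparison, and the naive version runs it against the whole list. Here is a minimal sketch of that check, translated into R purely for illustration (the real code is the JavaScript/P5 version in the repo, and these function and field names are mine):

overlaps <- function(a, b) {
    # Two circles overlap when the distance between their centres is less
    # than the sum of their radii.
    sqrt((a$x - b$x)^2 + (a$y - b$y)^2) < a$r + b$r
}

can_place <- function(new_circle, existing_circles) {
    for (other in existing_circles) {
        if (overlaps(new_circle, other)) return(FALSE)   # bail out at the first hit
    }
    TRUE   # checked every existing circle and found no overlap
}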
For the purposes of my website, that’s not acceptable—I need something that fills the canvas pretty quickly.1 What we really need is a data structure that organizes circles by their position, at least vaguely. That would make it easy to restrict our comparison only to circles that are near to the proposed new circle. Enter quadtrees.
Quadtrees
Of course, if you were looking for a specific word in a dictionary, you wouldn’t read it through from the start. Searching for packing, you’d probably grab a big chunk of the pages and flip open to somewhere near marzipan. Knowing that p comes after m, you flip forward by another chunk, landing at quintessence. Having gone too far, you browse backwards, perhaps more carefully, until you arrive at packing. This method of searching the dictionary is, roughly, a binary search. Rather than starting at the beginning of the data, you jump straight to the middle. If that’s not the object you’re looking for, go halfway to the beginning or end, depending on whether you’ve overshot or undershot. Continue taking these ever-smaller half-steps until you’ve narrowed your search to the thing you’re looking for.
This method of searching is a little less efficient for words near the beginning of the dictionary, but much more efficient, on average, across all possible entries.2
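If it helps to see the halving strategy spelled out, here is a quick binary search over a sorted vector, again in R and only as an illustration (the names are mine):

binary_search <- function(sorted, target) {
    lo <- 1
    hi <- length(sorted)
    while (lo <= hi) {
        mid <- (lo + hi) %/% 2              # jump to the middle of what's left
        if (sorted[mid] == target) return(mid)
        if (sorted[mid] < target) {
            lo <- mid + 1                   # target comes later: keep the right half
        } else {
            hi <- mid - 1                   # target comes earlier: keep the left half
        }
    }
    NA                                      # not in the dictionary
}

words <- sort(c("aardvark", "marzipan", "pace", "packing", "quintessence", "zebra"))
binary_search(words, "packing")             # position of "packing" in the sorted vector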
A quadtree generalizes this idea to 2D space, like our canvas. The quadtree works by dividing the canvas into four quadrants. Every time we propose a new circle, the quadtree checks to see where the circle landed. If the circle lands in an empty cell, great, make a note of that. If the circle lands in a cell that already has a circle, the cell splits into four smaller cells, repeating until each cell once again holds one circle. The image above shows a quadtree storing twenty circles, with a twenty-first proposed circle in blue. When checking for possible overlaps, we can quickly determine that three of the four biggest quadrants don’t overlap the proposed circle and one does—four quick checks and we’ve eliminated 75% of the canvas, no need to check the seventeen circles contained within. The proposed circle does overlap the upper-right quadrant, so its sub-quads will need to be checked—four more checks. After about eight comparisons, as opposed to twenty, we’ve determined that the proposed circle is valid. As more circles are added and the grid becomes finer, the number of checks becomes roughly proportional to the potential area of overlap.3
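The shape of that structure can be sketched in a few dozen lines. The real implementation is the JavaScript one in the repo; the R below is only an illustration (every name is mine), with each node holding at most one circle and splitting into four children when a second one arrives:

make_node <- function(x, y, size) {
    # A node covers the square [x, x + size] by [y, y + size].
    list(x = x, y = y, size = size, circle = NULL, children = NULL)
}

touches <- function(node, cx, cy, r) {
    # Does the circle's bounding box overlap this node's square at all?
    !(cx + r < node$x || cx - r > node$x + node$size ||
      cy + r < node$y || cy - r > node$y + node$size)
}

child_index <- function(node, circle) {
    # Which of the four quadrants holds the circle's centre?
    half <- node$size / 2
    1 + (circle$x >= node$x + half) + 2 * (circle$y >= node$y + half)
}

insert <- function(node, circle) {
    if (is.null(node$children)) {
        if (is.null(node$circle)) {         # empty leaf: just store the circle
            node$circle <- circle
            return(node)
        }
        half <- node$size / 2               # occupied leaf: split into four children
        node$children <- list(
            make_node(node$x,        node$y,        half),
            make_node(node$x + half, node$y,        half),
            make_node(node$x,        node$y + half, half),
            make_node(node$x + half, node$y + half, half)
        )
        old <- node$circle
        node$circle <- NULL
        i <- child_index(node, old)         # push the old occupant down a level
        node$children[[i]] <- insert(node$children[[i]], old)
    }
    i <- child_index(node, circle)          # descend toward the new circle's quadrant
    node$children[[i]] <- insert(node$children[[i]], circle)
    node
}

candidates <- function(node, cx, cy, r) {
    # Collect every stored circle whose cell could possibly overlap the
    # proposed circle; cells that don't touch it are skipped wholesale.
    if (!touches(node, cx, cy, r)) return(list())
    found <- if (is.null(node$circle)) list() else list(node$circle)
    for (child in node$children) {
        found <- c(found, candidates(child, cx, cy, r))
    }
    found
}

root <- make_node(0, 0, 512)                          # empty 512-by-512 canvas
root <- insert(root, list(x = 100, y = 100, r = 40))
root <- insert(root, list(x = 300, y = 120, r = 15))
length(candidates(root, 310, 110, 10))                # 1: only the circle at (300, 120) comes back

The final overlap test is then the same distance check as before, run only against the returned candidates rather than against every circle on the canvas.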
There are many fine implementations of quadtrees, and I certainly don’t claim that mine is the best, but it was very satisfying to learn it myself.4
Into the Quadforest
I’m glossing over one other thing. We don’t just want to compare the circles’ positions; we also need to account for their sizes. Using a single quadtree, we have to assume that the potential search area is a square with side lengths equal to the diameter of the proposed new circle, plus the maximum circle diameter stored in the tree. In practice, as the circles get smaller, we’ll be checking a much bigger area, and many more circles, than is necessary. The solution is to use multiple quadtrees: the first stores and searches against the largest circles, the next smaller ones, and so on. In this way, we can ensure that larger circles are checked first, restoring the advantage from the naive method.
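Roughly, and again only as an R illustration reusing the helpers sketched above (the size-class cut-offs here are arbitrary placeholders, not the values the real sketch uses), the forest is a list of quadtrees, one per size class, and the overlap test walks them from the class holding the largest circles down to the smallest:

class_max <- c(100, 25, 5, 1)   # hypothetical upper radius bound of each class; assumes radius <= 100
class_of  <- function(r) tail(which(r <= class_max), 1)   # smallest class that still fits r

forest_new <- function(canvas_size) {
    lapply(class_max, function(m) make_node(0, 0, canvas_size))
}

forest_insert <- function(forest, circle) {
    k <- class_of(circle$r)
    forest[[k]] <- insert(forest[[k]], circle)
    forest
}

forest_overlaps <- function(forest, new_circle) {
    for (k in seq_along(forest)) {
        # Pad the search radius by the largest radius this class can hold, so a
        # big neighbour whose centre sits in a more distant cell is still considered.
        pad  <- class_max[k]
        hits <- candidates(forest[[k]], new_circle$x, new_circle$y, new_circle$r + pad)
        for (other in hits) {
            if (overlaps(new_circle, other)) return(TRUE)
        }
    }
    FALSE
}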
All these optimizations really do make a difference. When the maximum circle size is large (say 200 pixels in diameter), the quadforest search is about three times faster than the naive list search. When the maximum circle size is smaller (say 20 pixels in diameter), the quadforest is about thirty times faster. Code that pits the two methods against each other in a race is available in the repo.
Happy Accidents
Having gotten the algorithm working nicely, I wanted some visual interest, or as the kids say, “art”. My first thought was to have the circles pop into existence with an animation. I’m not clearing the canvas on each new frame, which saves me from having to redraw every circle for every frame. And because I was testing my quadtree, the drawing commands were set to draw unfilled, outline shapes. So when I specified a five-frame “pop” animation, what I ended up with was five concentric rings for each circle—the ghosts of the pop animation. These appear thin and delicate for the larger circles, but thicker and almost solid for the smaller ones. The unexpected patterning produces a variety of texture and brightness that I really enjoy.
And there you have it. Efficient random circle packing, done fast, live, and pretty. The code for this algorithm is available on Github.
-
If I were doing this in Processing proper, it wouldn’t really matter. A lightly optimized version of this basic algorithm can pack a large sketch almost completely in a few seconds. But this is P5, which means Javascript in a web browser. More optimizations are necessary. ↩
-
Assuming aardvark is the first entry in a 128 word dictionary, you can get to it in 7 steps using the binary search method, as opposed to 1 step using the start-to-finish naive method. If zebra is the last entry in the dictionary, it also requires 7 steps with binary search, as opposed to 128 with the naive method. ↩
-
More or less. As the canvas fills up, it’ll get harder to find new valid locations, and a very large, very dense, or very lopsided tree would introduce some inefficiencies. Wikipedia, as always, has more. ↩
-
The code for this sketch is mostly the quadtree implementation, with a few specialized functions for handling the circle stuff. Maybe I should retitle this post. ↩
posted February 10 2018
compound inequalities in R
Suppose you’re writing some code and you want to check to see if a variable, say x, is between two other numbers. The standard way to do this comparison in most languages, including R, would be something like:
x >= 10 & x <= 20
Not too shabby, all things considered. Of course, the repetition of the variable name is unfortunate, especially if you’re checking a.very.long.variable.name. You also have to flip either the second comparison operator or the order of its arguments, which can be a little confusing to read. Utility functions like dplyr::between give you the ability to write:
between(x, 10, 20)
That’s more convenient, but between assumes that the comparison should be inclusive on both sides; it implements >= and <=, but not > or <. We could write our own version of between to allow flags for whether each side of the comparison should be inclusive, but at that point the syntax gets more cumbersome than the thing it’s meant to replace.
What we really want is the ability to write the comparison as a compound inequality, which you might remember from grade school. They look like this:
10 < x <= 20
This succinctly expresses that x should be greater than 10 and less than or equal to 20. Expressing exclusive or inclusive comparisons on either side is as easy as changing the operators, and you type the compared variables and values only once.
It turns out that it’s possible to implement custom compound inequality operators in R, and it’s not hard. The main hurdle is that the custom operator needs to do one of two things: either store the result of a new comparison but return the original data (to be passed along to the second comparison), or return the result of a comparison that already has some stored results attached. Since R allows us to set arbitrary attributes on data primitives, we can easily meet these requirements.
'%<<%' <- function(lhs, rhs) {
    if (is.null(attr(lhs, 'compound-inequality-partial'))) {
        out <- rhs
        attr(out, 'compound-inequality-partial') <- lhs < rhs
    } else {
        out <- lhs < rhs & attr(lhs, 'compound-inequality-partial')
    }
    return(out)
}
That’s all we need to implement compound “less-than” comparisons (we’ll also need similar definitions for %<<=%, %>>%, and %>>=%, but more on that in a minute). I’m using %<<% instead of just %<% because the complementary %>% operator would conflict with the widely used pipe operator. The doubled symbol also neatly reinforces that this operator is meant to evaluate two comparisons.
Whenever the custom %<<% is encountered, it checks to see if the argument on the lefthand side has an attribute called compound-inequality-partial. If it doesn’t, the result of the comparison is attached to the original data as an attribute with that name, and the modified data are returned. If the attribute name exists, the function checks to see whether both the comparison stored in compound-inequality-partial and the second comparison are true. That’s all there is to it.
Implementing this functionality for the remaining three comparisons is pretty repetitive, so let’s instead write a generalized comparison function. To do this, we’ll take advantage of the fact that in R, 1 + 2 can also be invoked as '+'(1, 2) or do.call('+', list(1, 2)). Here’s our generalized helper:
compound.inequality <- function(lhs, rhs, comparison) {
    if (is.null(attr(lhs, 'compound-inequality-partial'))) {
        out <- rhs
        attr(out, 'compound-inequality-partial') <- do.call(comparison, list(lhs, rhs))
    } else {
        out <- do.call(comparison, list(lhs, rhs)) & attr(lhs, 'compound-inequality-partial')
    }
    return(out)
}
And then the definitions for all four operators become:
'%<<%' <- function(lhs, rhs) {
    return(compound.inequality(lhs, rhs, '<'))
}
'%<<=%' <- function(lhs, rhs) {
    return(compound.inequality(lhs, rhs, '<='))
}
'%>>%' <- function(lhs, rhs) {
    return(compound.inequality(lhs, rhs, '>'))
}
'%>>=%' <- function(lhs, rhs) {
    return(compound.inequality(lhs, rhs, '>='))
}
The comparison 2 %<<% 4 %<<% 6 is true. If you only wrote the first comparison, you’d get back 4, with TRUE attached as an attribute. Now it’s easy, and very readable, to specify compound “between” comparisons that are inclusive or exclusive on either side.
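As a quick sanity check with made-up values, the operators chain and vectorise, so they can be used directly for subsetting:

x <- c(5, 10, 15, 20, 25)
10 %<<% x %<<=% 20      # FALSE FALSE  TRUE  TRUE FALSE: strictly above 10, at most 20
x[10 %<<% x %<<=% 20]   # 15 20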
I’m not sure that I’d actually use this in “real” scripts, since these types of comparisons are relatively rare for me. But it was a fun problem to think about, and useful enough to share. This surprisingly simple solution demonstrates the flexibility of the R language. Code for this project is also available on Github.
posted September 7 2015
the pleasures and pitfalls of data visualization with maps
I’ve been thinking more about data visualization, if the previous post wasn’t enough of an indication. I recently stumbled upon an interesting post by Sean Lorenz at the Domino Data Lab blog on how to use R’s ggmap package to put stylish map tiles into your visualizations.
In the midst of his engaging and accessible tour of the package, Lorenz offers the following visualization of fatal motor vehicle collisions by state (or at any rate, a version very similar to the one I have here). Click to embiggen:
Lorenz says that this is a nice visualization of “the biggest offenders: California, Florida, Texas and New York.” But something about that list of states in that particular order seemed somehow familiar, and caught my attention.
Disclaimer
Before I continue, I’d like to state that the following is an informal analysis. It is intended for entertainment and educational purposes only, and is not scientifically rigorous, exhaustive, or in any way conclusive.
Data and R code for this project are available in my Github repository.
What Were We Talking About?
So, back to that map. The one where California, Florida, Texas, and New York were the “biggest offenders”. This struck me as suspicious, as California, Florida, Texas, and New York are coincidentally the most populous states in the country (in that order). A close reading of Lorenz’s post shows that he is, indeed, working with the raw number of collisions per state, and is not normalizing by the states’ populations. That’s important because, as you can see, there’s a very strong relationship between the number of people in a state and the number of accidents:
With a relationship that strong (even removing the Big Four outliers, R² = 0.80), a plot of raw motor vehicle accidents is essentially just a map of population. Dividing the number of collisions per state by that state’s population yields a very different map:
As you can see, many of the bubbles are now roughly the same size. Florida still stands out from the crowd, and there’s a state somewhere in the Northeast with a very high accident rate, though it’s hard to tell which one, exactly. This map is no longer useful as a visualization, because it’s no longer clarifying the relationships between the data points. This is the main problem with bubble maps: they often visualize data that do not actually have a meaningful spatial relationship, and they do so in a way that would be hard to examine even without the superfluous geography.
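For reference, the normalization behind that per-capita map is a one-liner; roughly (the column names here are placeholders, not necessarily the ones in my repo):

# Accidents per person, then expressed as a z-score so states are compared
# to one another rather than to their own population size.
accidents$rate   <- accidents$collisions / accidents$population
accidents$rate.z <- as.numeric(scale(accidents$rate))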
So how else might we visualize this information? Well, how about a choropleth map, with each state simply colored by accident rate:
The brighter the state, the higher the rate.1 Here, it’s more obvious that Delaware, New Mexico, Louisiana, and Florida have high accident rates, but which has the highest? And what about that vast undifferentiated middle? Maybe the problem is that all those subtle shades of blue are hard to tell apart. Maybe we should simplify the coloring and bin the data into quantiles?
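One way to do that binning, continuing the placeholder columns from the sketch above, is to cut the standardized rates at their quartiles:

# Bin the standardized rates into four equal-count groups (quartiles).
breaks <- quantile(accidents$rate.z, probs = seq(0, 1, by = 0.25))
accidents$rate.bin <- cut(accidents$rate.z, breaks = breaks,
                          include.lowest = TRUE,
                          labels = c('lowest', 'low', 'high', 'highest'))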
Colorful, and simpler, but not necessarily more useful. From this map, you’d get the impression that Florida, Delaware, Texas, California, and many other states all have comparable accident rates, which isn’t necessarily true. How about we ditch our commitment to maps and just plot the data points?
Ah, clarity at last. Here I’m still coloring the points by quantile so that you can see how much of the data’s variability was hidden in the previous map. Now it’s immediately clear that Delaware, New Mexico, Florida, Louisiana, and South Carolina all have unusually high accident rates (arguably, so do North Carolina and Arizona). Beyond that, almost every other state clusters within one standard deviation of the national average, with South Dakota having a notably low accident rate.
Of course, Lorenz’s point wasn’t really about the accident data, it was about how nifty maps can be. Having just spent the last few paragraphs demonstrating the ways in which maps can fool us, you might get the impression that I’m down on maps. I’m really not. Maps can be beautiful and informative, but by definition, they need to show us data with some spatial relevance. Take, for instance, this map of the Boston marathon.
The lovely tiles are pulled from Stamen Maps, courtesy of the ggmap package, and I really can’t emphasize enough how much work ggmap is doing for you here.2 From there, it was pretty easy to overlay some custom data. The red line traces the path of the Boston Marathon, and the blue line shows Boston’s city limits. Curiously, very little of the Boston Marathon actually happens in Boston. Starting way out west in Hopkinton, the marathon doesn’t touch Boston proper until it briefly treads through parts of Brighton. Then the route passes through Brookline (a separate township from Boston, and quite ardent about it), before re-entering the city limits near Boston’s Fenway neighborhood, just a couple of miles from the finish line.
-
Here I’m converting the accident rate (essentially, accidents per person) to a z-score. Z-scores have a variety of useful properties, one of which is that for any normally distributed set of observations (as the accident rates are, more or less), 99% of the data should fall between values of -3 and 3. This is more useful than the teensy percentage values, which are hard to interpret as a meaningful unit. Put another way, the simple percentage values tell me about the accident rate’s relationship to its state (not very useful), while the z-score tells me about the accident rates relative to each other (much more useful). ↩
-
Really. Adding in that clean outline of Boston’s city limits was a nightmare. You have no idea how difficult it was to a) find good shape data, b) read that data into R correctly, c) project the data into latitude/longitude coordinates, and d) plot it neatly. That ggmap can put so much beautifully formatted data into a plot so quickly is a real marvel. ↩
posted August 17 2015
understanding ggplot: an example
If you use R, odds are good that you’ve also come across ggplot2, by Hadley Wickham. This wildly popular plotting package is one of the most useful tools in the entire R ecosystem. If it didn’t exist, I would have had to invent it (or a significantly inferior version of it). ggplot’s elegant, powerful syntax can turn you into a data vis wizard, if you know what you’re doing. It’s getting to “I know what I’m doing” that can be tough.
Earlier today, I came across a Bloomberg article on how more older drivers are staying on the road and becoming the auto industry’s fastest-growing demographic. It’s an interesting read, but what caught my attention was this figure:
From a data visualization standpoint, this plot is actually several flavors of weird and bad (more on that later), but suppose you find yourself enamored of its fun colors and jaunty angles, and maybe you want to see how much of this you can replicate purely with ggplot.
First things first, download the data, which has just three columns: age.group, year, and value. Next we’ll calculate a marker for whether the number of drivers in an age group is increasing, as well as the actual percentage change:1
data.bloom <- ddply(data.bloom, .(age.group), mutate,
is.increasing = value[year == 2013] >= value[year == 2003],
percent = (value[year == 2013] - value[year == 2003]) / value[year == 2003] * 100
);
And now we’re ready to plot! Let’s get the basics going:
plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = value)) +
geom_point(aes(y = value)) +
facet_wrap(facets = ~age.group, nrow = 2);
print(plot.bloom);
ggplot’s syntax often mystifies new users, so let’s go through it.
- Line 1 is the main call to ggplot. It’s what tells R that we’re building a new ggplot object. The ggplot() function really only needs two arguments: a dataset (self-explanatory) and a set of aesthetics, defined in the mysterious aes() function called inside of ggplot(). ggplot will attempt to go through each individual row of the dataset and put it on our plot. The aesthetics defined in aes() tell ggplot which columns define which properties of the plot. In this main call to aes(), I’ve only specified that “year” should go on the x-axis. Notice that I can refer to the “year” column by name alone; no quotes, no reference to its parent data frame. The aes() function implicitly knows about its associated dataset. This aesthetic mapping will then trickle down to every subsequent piece of the plot (unless I change the aesthetic mapping later on down the line).
- Line 2 is our first geom, which defines a graphical object. In this case, I’m using geom_ribbon(), which is useful for building polygonal shapes such as shaded regions, error bars, highlights, and so on. The geom_ribbon() function is pretty smart; it wants you to define “ymin”, “ymax”, “xmin”, and “xmax”, but if you give it just “x”, it’ll use that value for the min and max values. Since we already defined “x” up in ggplot(), we just have to give it “ymin” and “ymax”. Here I’m saying that I want “ymin” to always be 0, and “ymax” to take on the data values.
- Line 3 calls geom_point(), and it does what it says on the tin. It already has “x” from the call to ggplot(), and it needs a “y”. In this case, that’s just “value” again.
- Line 4 calls facet_wrap(). Faceting, an implementation of the small multiples technique, is perhaps the most powerful feature of ggplot. Faceting makes it possible to take a plot and split it into smaller sub-plots defined by some other factor or combination of factors. Our call to facet_wrap() simply says to split the plot up by “age.group”, and arrange the resulting set of sub-plots into exactly 2 rows.2
Four lines of code3, and we’re 90% of the way there. This is the point at which you’d call over your boss to show him your new, exciting, preliminary finding. But you wouldn’t call over your boss’s boss. The plot needs more work first. The labeling on the x-axis is screwy and there’s a distinct lack of eye-catching color here. Let’s fix that.
plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
geom_point(aes(y = value)) +
facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);
Here are the changes I’ve made:
- Line 2 now adds fill = is.increasing to the aesthetics. This will color the polygons according to whether “is.increasing” is TRUE or FALSE.
- Line 6’s call to scale_fill_manual() tells ggplot that the “fill” aesthetic is going to be controlled manually. While there are a number of built-in color scales, such as scale_fill_brewer() or scale_fill_grey(), scale_fill_manual() simply allows us to define our own colors. In this case, I’ve picked the specific colors from the Bloomberg figure, and mapped them to the two possible values in the “is.increasing” column.
- Line 5 calls scale_x_continuous(). Since our “year” column is numeric data, ggplot assumes that 2003 and 2013 are just two points on a line (you can see lots of intermediate values in the plot above, as ggplot attempted to create neatly spaced tick marks). One way to clear up the labels would be to convert the year column to a factor, and then ggplot would label only the values that exist in the dataset. Another way would be to convert the year column to Date objects, and use scale_x_date(), but that’s overkill here. Instead, I’ve simply used the “breaks” argument to restrict labeling to the two values found in the dataset. The two-element vector being sent to “expand” tells ggplot how much padding to add to the ends of the x-axis. The first value defines a percentage value (16% here), while the second would add some raw value (here, some number of years). The default is c(0.04, 0), so I’ve essentially just increased the padding a little.
- Lastly, Line 4 adds scales = 'free' to facet_wrap(). In the previous plot, notice that each facet shared the same axes, and axis values were only written on the far left and bottom of the plot. This is efficient and readable, but it’s not how the Bloomberg plot does things. For the sake of replication, we need x-axis labels on each facet.
We’re getting closer, but these changes have produced some undesirable effects. The use of a fill aesthetic has caused ggplot to helpfully add a legend, which we don’t want, while the use of scales = 'free' has caused each facet to plot the data against its own set of y-axis limits, losing the nice comparable scaling we had before. We can keep the individual axes and the unified scaling with one addition:
plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
geom_point(aes(y = value)) +
facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);
Line 6 establishes explicit limits for the y-axes (ranging from 0 to 50), and knocks out the padding entirely, so that the data touches the y-axis, as in the Bloomberg plot. But wait! The Bloomberg figure also prints the value of each data point, and includes the written percentage as well. Two calls to geom_text() can get that done:
plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
geom_point(aes(y = value)) +
geom_text(aes(label = sprintf('%0.1f', value), y = value), vjust = -1, size=3) +
geom_text(subset=.(year == 2013), aes(label = sprintf('%+0.1f%%', percent)), x = 2008, y = 0, vjust = -1, fontface = 'bold', size=3) +
facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595'));
print(plot.bloom);
Lines 4 and 5 accomplish similar things, so let’s just focus on line 5:
- Notice that the “label” argument inside aes() invokes sprintf(), which is a very handy way of ensuring that the percentage values are all neatly rounded to one decimal place and attached to a percent symbol. It’s possible to do these types of transformations within ggplot, thus saving you the trouble of having to create transforms within the dataset itself.
- “x” is set to 2008, which is halfway between 2003 and 2013, thus ensuring that the label appears at the horizontal center of the plot. Note that “x” is set outside of aes(). You can do that if you’re setting an aesthetic to a single value.
- Likewise, “y” is always 0, the baseline of the plot, and a “vjust” of -1 will set the text slightly above that baseline.
- The “fontface” and “size” attributes should be fairly self-explanatory.
- Lastly, remember when I said that ggplot wants to plot every row in your dataset? This means that ggplot would print the percentage values twice, right on top of each other, since they appear twice (one per age group). The “subset” parameter ensures that I’m only pulling the labels once.
Finally, the rest is cosmetic:
plot.bloom <- ggplot(data=data.bloom, aes(x = year)) +
geom_ribbon(aes(ymin = 0, ymax = value, fill = is.increasing)) +
geom_point(color='white', size=3, aes(y = value)) +
geom_point(color='black', size=2, aes(y = value)) +
geom_text(aes(label = sprintf('%0.1f', value), y = value), vjust = -1, size=3) +
geom_text(subset=.(year == 2013), aes(label = sprintf('%+0.1f%%', percent)), x = 2008, y = 0, vjust = -1, fontface = 'bold', size=3) +
facet_wrap(facets = ~age.group, nrow = 2, scales = 'free') +
scale_x_continuous(breaks = unique(data.bloom$year), expand=c(0.16, 0)) +
scale_y_continuous(limits=c(0, 50), expand=c(0, 0)) +
scale_fill_manual(values=c('TRUE'='#99e5e5', 'FALSE'='#f27595')) +
labs(x=NULL, y=NULL) +
theme_classic() + theme(legend.position='none',
axis.line.y=element_blank(),
axis.ticks.y=element_blank(),
axis.text.y=element_blank(),
axis.text.x=element_text(color='#aaaaaa'),
strip.text=element_text(face='bold'),
strip.background=element_rect(fill='#eeeeee', color=NA),
panel.margin.x = unit(0.25, 'in'),
panel.margin.y = unit(0.25, 'in')
);
print(plot.bloom);
- Lines 3 and 4 create two duplicate sets of points, in black and white, thus creating a white outline around the points.
- Line 11 turns off the axis labeling.
- Line 12 tells ggplot to use its built-in “classic” theme, which features a white background and no gridlines, thus saving us some typing in theme().
- Lines 12-20 are just one big adventure in the aforementioned theme(), which allows us to control the style of the plot. ggplot builds everything out of some basic theme elements, including element_rect(), element_text(), and element_line(). Setting any parameter of the theme to element_blank() removes that element from the plot itself, automatically re-arranging the remaining elements to use the available space.4
As far as I’m aware, this is as close as you can get to reproducing the Bloomberg plot without resorting to image editing software. ggplot can’t color facet labels individually (notice that in the Bloomberg version, the labels for the two oldest age groups are a different color than the rest). While I could muck around with arrows and label positioning for the percentage values, it would involve a lot of finicky trial and error, and generally isn’t worth the trouble.
So, we’ve done it. We’ve replicated this plot, which is horrible for the following reasons:
- The primary data are redundantly represented by three different elements: the colored polygons, the data points, and the data labels.
- Percentage change is represented by two separate elements: the color of the polygon and the printed label.
- The repetition of the axis labels is unnecessary.
- Splitting the pairs of points into a series of eight sub-plots obscures the relationship between age group and the number of active drivers. There’s a lot of empty space between the meaningful data, and the arrangement of sub-plots into two rows makes the age relationship harder to examine.
Using the same dataset, one could write this code:
plot.line <- ggplot(data=data.bloom, aes(x = age.group, y = value, color = as.factor(year))) +
geom_line(aes(group = year)) +
geom_point(size = 5.5, color = 'white') +
geom_point(size = 4) +
geom_text(aes(label=sprintf('%+0.1f%%', percent), color = is.increasing), y = -Inf, vjust = -1, size=3, show_guides=F) +
geom_text(data=data.frame(year=c('2003', '\n2013')), aes(label=year), x=Inf, y = Inf, hjust=1, vjust=1) +
scale_y_continuous(limits = c(0, 45)) +
labs(x = 'Age Group', y = 'Number of Licensed Drivers (millions)') +
scale_color_manual(values = c('FALSE'='#f00e48', 'TRUE'='#24a57c', '2003'='lightblue', '2013'='orange', '\n2013'='orange')) +
theme_classic() + theme(legend.position = 'none',
axis.line = element_line(color='#999999'),
axis.ticks = element_line(color='#999999'),
axis.text.x = element_text(angle=45, hjust=1, vjust=1)
);
print(plot.line)
And produce this plot:
You could argue with some of the choices I’ve made here. For example, I’m still using points and lines to encode the same data. But I tend to subscribe to Robert Kosara’s definition of chart junk, which states that, “chart junk is any element of a chart that does not contribute to clarifying the intended message.” Here, the lines serve to remind the audience that the data are on a rough age continuum, while the points clarify which parts are real data and which are interpolation. I’ve also added color to the percentage change values, as I feel it’s important to highlight the one age group that is experiencing a decrease in licensure. Eagle-eyed readers will probably notice that I’m abusing ggplot a little, in that I’ve suppressed the automated legend and am using geom_text() to create a custom legend of colored text (and the tricks I had to pull in scale_color_manual() are just weird).5
So what have we learned about ggplot? Look at how much explaining I had to do after each piece of code. This tells you that ggplot packs an awful lot of power into a few commands. A lot of that power comes from the things ggplot does implicitly, like establishing uniform axis limits across facets and building sensible axis labels based on the type of data used for the “x” and “y” aesthetics. Notice that ggplot even arranged the facets in the correct order.6 At the same time, ggplot allows for a lot of explicit control through the various scale() commands or theme(). The key to mastering ggplot is understanding how the system “thinks” about your data, and what it will try to do for you. Lastly, notice that I was able to cook up a basic visualization of the data with just a few succinct lines of code, and gradually build the visualization’s complexity, iterating its design until I arrived at the presentation I wanted. This is a powerful feature, and one ideal for data exploration.
Feel free to grab the data and R code for this project.
-
We will, of course, be using plyr, another indispensable tool from the mind of Dr. Wickham. In this case, we use
mutate()
in the call toddply()
to add two new columns to the dataset (which I’m calling “data.bloom”). For each age group, calculate whether the value increases from 2003 to 2013 (“is.increasing”), and the percentage change (“percent”). ↩ -
Note that
facet_wrap()
andfacet_grid()
use an elegant formula interface for defining facets, of the formrows ~ columns
. In this case, I only need to define either the rows or the columns. But in another dataset I could facet, say,age.group + income ~ gender
. ↩ -
Technically everything before
print()
is a single line of code, and I’m inserting line breaks for readability (notice the+
signs at the end of each line, which connect all of the ggplot commands). ↩ -
ggplot’s sensible way of handling the automatic layout and scaling of plot elements is half the reason I prefer it over the base plotting system. ↩
-
And notice my use of
y = -Inf
in the call togeom_text()
. This tells ggplot to place the text at the minimum of the y-axis, whatever that might turn out to be, a very useful feature that is documented nowhere. ↩ -
Really, we just got lucky there. When you are faceting based on character data (as we are here), ggplot attempts to arrange facets alphabetically, which works to our favor in this case. If you needed to place the facets in an arbitrary order, your best bet would be to convert your faceting columns into ordered factors, in which case ggplot will place the facets using the underlying order. ↩
posted December 9 2014
the first 90%
I work with lots of data. Not what you’d call “big data”, at least not technically, but maybe “biggish”. More than enough that Excel would crash just trying to open the dataset, assuming you were foolish enough to try. The amount of data is voluminous enough, and the relationship between the raw data and what you’re trying to analyze complex enough, that you need pretty decent data management chops to even access it correctly. But let’s say you have accessed it correctly. Now you can proceed to perform your analysis, make data visualizations, and be a sorcerer of the digital age. Right?
Wrong. You left out the most important step: getting your data into the right format, making sure each data point has all the right labels in the right places to allow you to proceed to the real science. Cleaning data—that is, pulling in the unprocessed data, transforming it, rearranging it, relabeling it, discarding garbage, and otherwise getting it into a format that will play nicely with your analysis tools—is easily 90% of the job.
Let me give you an example.
This is a plot of the hours of daylight (sunrise time subtracted from sunset time) that Boston, Massachusetts received throughout 2014. I got the data from the US Naval Observatory after reading this post about the merits of Daylight Savings Time. Request a data file for any location in the US, and you’ll find it looks like this (scroll the box rightward to see the whole thing):
o , o , BOSTON, MASSACHUSETTS Astronomical Applications Dept.
Location: W071 05, N42 19 Rise and Set for the Sun for 2014 U. S. Naval Observatory
Washington, DC 20392-5420
Eastern Standard Time
Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec.
Day Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set
h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m
01 0713 1623 0658 1659 0620 1734 0527 1810 0440 1844 0410 1914 0411 1925 0437 1904 0510 1818 0541 1726 0618 1638 0654 1613
02 0714 1624 0657 1700 0618 1735 0525 1811 0438 1845 0410 1915 0412 1925 0438 1903 0511 1817 0543 1724 0619 1637 0655 1613
03 0714 1624 0656 1701 0616 1737 0524 1812 0437 1846 0409 1916 0413 1924 0439 1901 0512 1815 0544 1722 0620 1635 0656 1612
04 0714 1625 0655 1702 0615 1738 0522 1814 0436 1847 0409 1917 0413 1924 0440 1900 0513 1813 0545 1721 0621 1634 0657 1612
05 0713 1626 0653 1704 0613 1739 0520 1815 0435 1848 0409 1917 0414 1924 0441 1859 0514 1811 0546 1719 0623 1633 0658 1612
06 0713 1627 0652 1705 0612 1740 0518 1816 0433 1849 0408 1918 0414 1924 0442 1858 0515 1810 0547 1717 0624 1632 0659 1612
07 0713 1628 0651 1706 0610 1741 0517 1817 0432 1850 0408 1919 0415 1923 0443 1856 0516 1808 0548 1716 0625 1631 0700 1612
08 0713 1629 0650 1708 0608 1743 0515 1818 0431 1852 0408 1919 0416 1923 0444 1855 0517 1806 0549 1714 0626 1629 0701 1612
09 0713 1630 0649 1709 0607 1744 0513 1819 0430 1853 0408 1920 0416 1922 0445 1854 0518 1805 0550 1712 0628 1628 0702 1612
10 0713 1632 0647 1710 0605 1745 0512 1820 0428 1854 0407 1920 0417 1922 0446 1852 0519 1803 0551 1711 0629 1627 0702 1612
11 0712 1633 0646 1712 0603 1746 0510 1821 0427 1855 0407 1921 0418 1921 0447 1851 0520 1801 0553 1709 0630 1626 0703 1612
12 0712 1634 0645 1713 0601 1747 0508 1823 0426 1856 0407 1921 0419 1921 0448 1850 0521 1759 0554 1707 0631 1625 0704 1612
13 0712 1635 0644 1714 0600 1749 0507 1824 0425 1857 0407 1922 0419 1920 0450 1848 0522 1758 0555 1706 0633 1624 0705 1612
14 0711 1636 0642 1715 0558 1750 0505 1825 0424 1858 0407 1922 0420 1920 0451 1847 0523 1756 0556 1704 0634 1623 0706 1612
15 0711 1637 0641 1717 0556 1751 0504 1826 0423 1859 0407 1923 0421 1919 0452 1845 0524 1754 0557 1702 0635 1622 0706 1613
16 0710 1638 0639 1718 0555 1752 0502 1827 0422 1900 0407 1923 0422 1919 0453 1844 0525 1752 0558 1701 0636 1622 0707 1613
17 0710 1640 0638 1719 0553 1753 0500 1828 0421 1901 0407 1923 0423 1918 0454 1842 0526 1751 0559 1659 0637 1621 0708 1613
18 0709 1641 0637 1721 0551 1754 0459 1829 0420 1902 0407 1924 0424 1917 0455 1841 0527 1749 0601 1658 0639 1620 0708 1614
19 0709 1642 0635 1722 0549 1755 0457 1830 0419 1903 0407 1924 0424 1916 0456 1839 0529 1747 0602 1656 0640 1619 0709 1614
20 0708 1643 0634 1723 0548 1757 0456 1832 0418 1904 0408 1924 0425 1916 0457 1838 0530 1745 0603 1655 0641 1618 0709 1614
21 0707 1644 0632 1724 0546 1758 0454 1833 0418 1905 0408 1925 0426 1915 0458 1836 0531 1743 0604 1653 0642 1618 0710 1615
22 0707 1646 0631 1726 0544 1759 0453 1834 0417 1906 0408 1925 0427 1914 0459 1835 0532 1742 0605 1652 0644 1617 0711 1615
23 0706 1647 0629 1727 0543 1800 0451 1835 0416 1907 0408 1925 0428 1913 0500 1833 0533 1740 0607 1650 0645 1617 0711 1616
24 0705 1648 0628 1728 0541 1801 0450 1836 0415 1908 0409 1925 0429 1912 0501 1832 0534 1738 0608 1649 0646 1616 0711 1617
25 0704 1650 0626 1729 0539 1802 0448 1837 0414 1909 0409 1925 0430 1911 0502 1830 0535 1736 0609 1647 0647 1615 0712 1617
26 0704 1651 0624 1731 0537 1803 0447 1838 0414 1910 0409 1925 0431 1910 0503 1828 0536 1735 0610 1646 0648 1615 0712 1618
27 0703 1652 0623 1732 0536 1805 0445 1839 0413 1910 0410 1925 0432 1909 0504 1827 0537 1733 0611 1644 0649 1614 0712 1619
28 0702 1653 0621 1733 0534 1806 0444 1841 0412 1911 0410 1925 0433 1908 0505 1825 0538 1731 0613 1643 0650 1614 0713 1619
29 0701 1655 0532 1807 0443 1842 0412 1912 0411 1925 0434 1907 0506 1823 0539 1729 0614 1642 0652 1614 0713 1620
30 0700 1656 0530 1808 0441 1843 0411 1913 0411 1925 0435 1906 0507 1822 0540 1728 0615 1640 0653 1613 0713 1621
31 0659 1657 0529 1809 0411 1914 0436 1905 0509 1820 0616 1639 0713 1622
Add one hour for daylight time, if and when in use.
Getting that plot out of this data turns out to be a little tricky, and most of the trick is in the import and cleanup phases. Right now, the data are arranged such that the day of the month is on the rows, while the month, hour, minute, and sunrise/sunset label are on the columns. This is often called “wide” data, which is easy to look at, but usually hard to work with. Our goal is to create a “long” dataset in which each row holds a single timestamp corresponding to one day’s sunrise or sunset (essentially, two rows per day). I’m going to show you how to do it using R. You’ll also need the following R packages: reshape2, plyr, and lubridate.
First things first, we need to import the data, ideally so that each meaningful number (the hours and minutes for each day of the year) ends up in a neat column. While the double-nested headers are unfortunate (hour and minute are nested within sunrise/sunset, which are nested within month), at least the data follow a nice fixed-width format, with each column ending after a predictable number of characters. R happens to have a handy read.fwf function, which is specialized for reading in these types of files.
data.raw <- read.fwf(
file='Boston Daylight Data 2014.txt',
skip=9,
nrows=31,
colClasses='numeric',
strip.white=T,
widths=c(2, rep(c(4, 2, 3, 2), 12))
);
The read.fwf command accomplishes a lot, so I’ve spread its arguments out over several lines. I’m telling the function to read in the file, skip its first nine rows (none of which contain data), read exactly the next 31, make sure to import all the columns as numbers (not text strings), strip out any extra whitespace, and lastly, how many characters wide each column should be. This produces a dataset that looks like this (I’m cutting out a lot of the data, but there are a total of 49 columns and 31 rows):
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 7 13 16 23 6 58 16 59 6 20 17 34 5 27 18 10 4 40 18
2 7 14 16 24 6 57 17 0 6 18 17 35 5 25 18 11 4 38 18
3 7 14 16 24 6 56 17 1 6 16 17 37 5 24 18 12 4 37 18
4 7 14 16 25 6 55 17 2 6 15 17 38 5 22 18 14 4 36 18
5 7 13 16 26 6 53 17 4 6 13 17 39 5 20 18 15 4 35 18
6 7 13 16 27 6 52 17 5 6 12 17 40 5 18 18 16 4 33 18
Now we just need to name the columns:
colnames(data.raw) <- c('day', paste(rep(month.abb, each=4), rep(c('rise', 'set'), each=2), c('hour', 'minute')));
This is a somewhat confusing use of the paste function, but basically I’m creating a vector of names: the first one is “day”, followed by names that follow the convention “Month rise/set hour/minute” (for example, “Jan rise hour”). Creating the labels at this stage saves us the trouble of having to extract them later.1 Our next step is to melt the dataset.
data.daylight <- melt(data.raw, id.vars='day');
By default, melt wants to reduce a dataset to just two columns: “variable” and “value” (“variable” becomes a column containing the dataset’s former column names, and “value” stores their corresponding values). The columns specified in id.vars are preserved in the new data frame, and are not melted into the “variable” column. So now our dataset looks like this:
day variable value
1 Jan rise hour 7
2 Jan rise hour 7
3 Jan rise hour 7
4 Jan rise hour 7
5 Jan rise hour 7
6 Jan rise hour 7
Now I want to take my “variable” column and split it into three new columns: month, event (sunrise/sunset), and time (hour/minute). This is easily done with colsplit. Note that I’m combining it with cbind, so that I can attach the new columns to my dataset without creating a temporary variable.
data.daylight <- cbind(data.daylight, colsplit(data.daylight$variable, ' ', c('month', 'event', 'time')));
Which makes the data look like this:
day variable value month event time
1 Jan rise hour 7 Jan rise hour
2 Jan rise hour 7 Jan rise hour
3 Jan rise hour 7 Jan rise hour
4 Jan rise hour 7 Jan rise hour
5 Jan rise hour 7 Jan rise hour
6 Jan rise hour 7 Jan rise hour
We’re nearly there. All that’s left is to get each event’s hour and minute into the same row. As near as I can tell, there’s no better way to do it than with the handy dcast function. With this function, I’m saying that “month”, “day”, and “event” should define the rows, while the different values stored in “time” should form new columns.
data.daylight <- dcast(data.daylight, month + day + event ~ time);
month day event hour minute
Apr 1 rise 5 27
Apr 1 set 18 10
Apr 2 rise 5 25
Apr 2 set 18 11
Apr 3 rise 5 24
Apr 3 set 18 12
From importing the data to this use of dcast, I’ve only written five lines of code. Now would be a great time to scroll back up and remember how the data looked originally. I’ll wait.
And that’s what I call “the first 90%”. The data are now in a highly flexible “long” format, and can be used with ease. For example, say we wanted to a) convert the “month” and “day” columns into proper Date data, which will make plotting much easier, and b) calculate the minute of the day at which the sunrise/sunset event occurred. Enter mutate, the easiest way to do this kind of transformation (with a call to lubridate’s ymd function to turn strings of numbers into Dates):
data.daylight <- mutate(data.daylight,
date=ymd(paste('2014', month, day)),
event.minute=hour * 60 + minute);
month day event hour minute date event.minute
Apr 1 rise 5 27 2014-04-01 327
Apr 1 set 18 10 2014-04-01 1090
Apr 2 rise 5 25 2014-04-02 325
Apr 2 set 18 11 2014-04-02 1091
Apr 3 rise 5 24 2014-04-03 324
Apr 3 set 18 12 2014-04-03 1092
Think about how tedious and error-prone it would have been to create the equivalent of the “date” and “event.minute” columns with the data as originally formatted. But now we’re getting into what I call “the other 90%”, which is another story for another time.
-
There are a lot of different ways to skin a cat in R, and therefore lots of different ways you might have generated and assigned these labels. In fact, there are lots of ways to do almost anything in R. Before I knew about read.fwf, I used readLines and some clever regular expression magic to separate out the time values. Trust me, read.fwf is much easier. ↩