the first 90%

I work with lots of data. Not what you’d call “big data”, at least not technically, but maybe “biggish”: more than enough that Excel would crash just trying to open the dataset, assuming you were foolish enough to try. The data are voluminous enough, and the relationship between the raw data and what you’re trying to analyze complex enough, that you need pretty decent data-management chops even to access it correctly. But let’s say you have accessed it correctly. Now you can proceed to perform your analysis, make data visualizations, and be a sorcerer of the digital age. Right?

Wrong. You left out the most important step: getting your data into the right format, making sure each data point has all the right labels in the right places to allow you to proceed to the real science. Cleaning data—that is, pulling in the unprocessed data, transforming it, rearranging it, relabeling it, discarding garbage, and otherwise getting it into a format that will play nicely with your analysis tools—is easily 90% of the job.

Let me give you an example.

This is a plot of the hours of daylight (sunrise time subtracted from sunset time) that Boston, Massachusetts received throughout 2014. I got the data from the US Naval Observatory after reading this post about the merits of Daylight Saving Time. Request a data file for any location in the US, and you’ll find it looks like this (scroll the box rightward to see the whole thing):

             o  ,    o  ,                                BOSTON, MASSACHUSETTS                         Astronomical Applications Dept.
Location: W071 05, N42 19                          Rise and Set for the Sun for 2014                   U. S. Naval Observatory        
                                                                                                       Washington, DC  20392-5420     
                                                         Eastern Standard Time                                                        


       Jan.       Feb.       Mar.       Apr.       May        June       July       Aug.       Sept.      Oct.       Nov.       Dec.  
Day Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set  Rise  Set
     h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m   h m  h m
01  0713 1623  0658 1659  0620 1734  0527 1810  0440 1844  0410 1914  0411 1925  0437 1904  0510 1818  0541 1726  0618 1638  0654 1613
02  0714 1624  0657 1700  0618 1735  0525 1811  0438 1845  0410 1915  0412 1925  0438 1903  0511 1817  0543 1724  0619 1637  0655 1613
03  0714 1624  0656 1701  0616 1737  0524 1812  0437 1846  0409 1916  0413 1924  0439 1901  0512 1815  0544 1722  0620 1635  0656 1612
04  0714 1625  0655 1702  0615 1738  0522 1814  0436 1847  0409 1917  0413 1924  0440 1900  0513 1813  0545 1721  0621 1634  0657 1612
05  0713 1626  0653 1704  0613 1739  0520 1815  0435 1848  0409 1917  0414 1924  0441 1859  0514 1811  0546 1719  0623 1633  0658 1612
06  0713 1627  0652 1705  0612 1740  0518 1816  0433 1849  0408 1918  0414 1924  0442 1858  0515 1810  0547 1717  0624 1632  0659 1612
07  0713 1628  0651 1706  0610 1741  0517 1817  0432 1850  0408 1919  0415 1923  0443 1856  0516 1808  0548 1716  0625 1631  0700 1612
08  0713 1629  0650 1708  0608 1743  0515 1818  0431 1852  0408 1919  0416 1923  0444 1855  0517 1806  0549 1714  0626 1629  0701 1612
09  0713 1630  0649 1709  0607 1744  0513 1819  0430 1853  0408 1920  0416 1922  0445 1854  0518 1805  0550 1712  0628 1628  0702 1612
10  0713 1632  0647 1710  0605 1745  0512 1820  0428 1854  0407 1920  0417 1922  0446 1852  0519 1803  0551 1711  0629 1627  0702 1612
11  0712 1633  0646 1712  0603 1746  0510 1821  0427 1855  0407 1921  0418 1921  0447 1851  0520 1801  0553 1709  0630 1626  0703 1612
12  0712 1634  0645 1713  0601 1747  0508 1823  0426 1856  0407 1921  0419 1921  0448 1850  0521 1759  0554 1707  0631 1625  0704 1612
13  0712 1635  0644 1714  0600 1749  0507 1824  0425 1857  0407 1922  0419 1920  0450 1848  0522 1758  0555 1706  0633 1624  0705 1612
14  0711 1636  0642 1715  0558 1750  0505 1825  0424 1858  0407 1922  0420 1920  0451 1847  0523 1756  0556 1704  0634 1623  0706 1612
15  0711 1637  0641 1717  0556 1751  0504 1826  0423 1859  0407 1923  0421 1919  0452 1845  0524 1754  0557 1702  0635 1622  0706 1613
16  0710 1638  0639 1718  0555 1752  0502 1827  0422 1900  0407 1923  0422 1919  0453 1844  0525 1752  0558 1701  0636 1622  0707 1613
17  0710 1640  0638 1719  0553 1753  0500 1828  0421 1901  0407 1923  0423 1918  0454 1842  0526 1751  0559 1659  0637 1621  0708 1613
18  0709 1641  0637 1721  0551 1754  0459 1829  0420 1902  0407 1924  0424 1917  0455 1841  0527 1749  0601 1658  0639 1620  0708 1614
19  0709 1642  0635 1722  0549 1755  0457 1830  0419 1903  0407 1924  0424 1916  0456 1839  0529 1747  0602 1656  0640 1619  0709 1614
20  0708 1643  0634 1723  0548 1757  0456 1832  0418 1904  0408 1924  0425 1916  0457 1838  0530 1745  0603 1655  0641 1618  0709 1614
21  0707 1644  0632 1724  0546 1758  0454 1833  0418 1905  0408 1925  0426 1915  0458 1836  0531 1743  0604 1653  0642 1618  0710 1615
22  0707 1646  0631 1726  0544 1759  0453 1834  0417 1906  0408 1925  0427 1914  0459 1835  0532 1742  0605 1652  0644 1617  0711 1615
23  0706 1647  0629 1727  0543 1800  0451 1835  0416 1907  0408 1925  0428 1913  0500 1833  0533 1740  0607 1650  0645 1617  0711 1616
24  0705 1648  0628 1728  0541 1801  0450 1836  0415 1908  0409 1925  0429 1912  0501 1832  0534 1738  0608 1649  0646 1616  0711 1617
25  0704 1650  0626 1729  0539 1802  0448 1837  0414 1909  0409 1925  0430 1911  0502 1830  0535 1736  0609 1647  0647 1615  0712 1617
26  0704 1651  0624 1731  0537 1803  0447 1838  0414 1910  0409 1925  0431 1910  0503 1828  0536 1735  0610 1646  0648 1615  0712 1618
27  0703 1652  0623 1732  0536 1805  0445 1839  0413 1910  0410 1925  0432 1909  0504 1827  0537 1733  0611 1644  0649 1614  0712 1619
28  0702 1653  0621 1733  0534 1806  0444 1841  0412 1911  0410 1925  0433 1908  0505 1825  0538 1731  0613 1643  0650 1614  0713 1619
29  0701 1655             0532 1807  0443 1842  0412 1912  0411 1925  0434 1907  0506 1823  0539 1729  0614 1642  0652 1614  0713 1620
30  0700 1656             0530 1808  0441 1843  0411 1913  0411 1925  0435 1906  0507 1822  0540 1728  0615 1640  0653 1613  0713 1621
31  0659 1657             0529 1809             0411 1914             0436 1905  0509 1820             0616 1639             0713 1622

                                             Add one hour for daylight time, if and when in use.

Getting that plot out of this data turns out to be a little tricky, and most of the trick is in the import and cleanup phases. Right now, the data are arranged such that the day of the month is on the rows, while the month, hour, minute, and sunrise/sunset label are on the columns. This is often called “wide” data, which is easy to look at, but usually hard to work with. Our goal is to create a “long” dataset in which each row holds a single timestamp corresponding to one day’s sunrise or sunset (essentially, two rows per day). I’m going to show you how to do it using R. You’ll also need the following R packages: reshape2, plyr, and lubridate.
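Before any of that, load the packages (assuming they’re installed). And to fix the idea of melting wide data into long data before we do it for real, here’s a toy example with made-up numbers (this little table is mine, not part of the USNO data):

```r
library(reshape2)   # melt, colsplit, dcast
library(plyr)       # mutate
library(lubridate)  # ymd

# A toy 'wide' table: one row per day, one column per measurement.
wide <- data.frame(day=1:2, rise=c(713, 714), set=c(1623, 1624))

# Melting stacks the measurement columns into variable/value pairs,
# giving one 'long' row per (day, measurement) combination.
long <- melt(wide, id.vars='day')
# long now has columns day, variable, value:
# (1, rise, 713), (2, rise, 714), (1, set, 1623), (2, set, 1624)
```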

First things first, we need to import the data, ideally so that each meaningful number (the hours and minutes for each day of the year) ends up in a neat column. While the double-nested headers are unfortunate (hour and minute are nested within sunrise/sunset, which are nested within month), at least the data follow a nice fixed-width format, with each column ending after a predictable number of characters. R happens to have a handy read.fwf function, which is specialized for reading in these types of files.

data.raw <- read.fwf(
                file='Boston Daylight Data 2014.txt', 
                skip=9, 
                nrows=31,
                colClasses='numeric', 
                strip.white=TRUE,
                widths=c(2, rep(c(4, 2, 3, 2), 12))
);

The read.fwf command accomplishes a lot, so I’ve spread its arguments out over several lines. I’m telling the function to read in the file, skip its first nine rows (none of which contain data), read exactly the next 31, import all the columns as numbers (not text strings), strip out any extra whitespace, and, lastly, split each line into columns of the specified character widths. This produces a dataset that looks like this (I’m cutting out a lot of the data, but there are a total of 49 columns and 31 rows):

 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
  1  7 13 16 23  6 58 16 59   6  20  17  34   5  27  18  10   4  40  18
  2  7 14 16 24  6 57 17  0   6  18  17  35   5  25  18  11   4  38  18
  3  7 14 16 24  6 56 17  1   6  16  17  37   5  24  18  12   4  37  18
  4  7 14 16 25  6 55 17  2   6  15  17  38   5  22  18  14   4  36  18
  5  7 13 16 26  6 53 17  4   6  13  17  39   5  20  18  15   4  35  18
  6  7 13 16 27  6 52 17  5   6  12  17  40   5  18  18  16   4  33  18

Now we just need to name the columns:

colnames(data.raw) <- c('day', paste(rep(month.abb, each=4), rep(c('rise', 'set'), each=2), c('hour', 'minute')));

This is a somewhat confusing use of the paste function, but basically I’m creating a vector of names: the first one is “day”, followed by names that follow the convention “Month rise/set hour/minute” (for example, “Jan rise hour”). Creating the labels at this stage saves us the trouble of having to extract them later.1 Our next step is to melt the dataset.
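If the recycling is hard to picture, you can run the label-building expression on its own (this peek is mine, not part of the original pipeline):

```r
# rep(month.abb, each=4) repeats each month four times (48 names total);
# the shorter 'rise'/'set' and 'hour'/'minute' vectors recycle under it.
labels <- c('day', paste(rep(month.abb, each=4),
                         rep(c('rise', 'set'), each=2),
                         c('hour', 'minute')))
head(labels, 5)
# "day" "Jan rise hour" "Jan rise minute" "Jan set hour" "Jan set minute"
length(labels)  # 49, one per column of data.raw
```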

data.daylight <- melt(data.raw, id.vars='day');

By default, melt wants to reduce a dataset to just two columns: “variable” and “value” (“variable” becomes a column containing the dataset’s former column names, and “value” stores their corresponding values). The columns specified in id.vars are preserved in the new data frame, and are not melted into the “variable” column. So now our dataset looks like this:

 day      variable value
   1 Jan rise hour     7
   2 Jan rise hour     7
   3 Jan rise hour     7
   4 Jan rise hour     7
   5 Jan rise hour     7
   6 Jan rise hour     7

Now I want to take my “variable” column and split it into three new columns: month, event (sunrise/sunset), and time (hour/minute). This is easily done with colsplit. Note that I’m combining it with cbind, so that I can attach the new columns to my dataset without creating a temporary variable.

data.daylight <- cbind(data.daylight, colsplit(data.daylight$variable, ' ', c('month', 'event', 'time')));

Which makes the data look like this:

 day      variable value month event time
   1 Jan rise hour     7   Jan  rise hour
   2 Jan rise hour     7   Jan  rise hour
   3 Jan rise hour     7   Jan  rise hour
   4 Jan rise hour     7   Jan  rise hour
   5 Jan rise hour     7   Jan  rise hour
   6 Jan rise hour     7   Jan  rise hour

We’re nearly there. All that’s left is to get each event’s hour and minute into the same row. As near as I can tell, there’s no better way to do it than with the handy dcast function. With this function, I’m saying that “month”, “day”, and “event” should define the rows, while the different values stored in “time” should form new columns.

data.daylight <- dcast(data.daylight, month + day + event ~ time);
 month day event hour minute
   Apr   1  rise    5     27
   Apr   1   set   18     10
   Apr   2  rise    5     25
   Apr   2   set   18     11
   Apr   3  rise    5     24
   Apr   3   set   18     12

From importing the data to this use of dcast, I’ve only written five lines of code. Now would be a great time to scroll back up and remember how the data looked originally. I’ll wait.

And that’s what I call “the first 90%”. The data are now in a highly flexible “long” format, and can be used with ease. For example, say we wanted to a) convert the “month” and “day” columns into proper Date data, which will make plotting much easier, and b) calculate the minute of the day at which the sunrise/sunset event occurred. Enter mutate, the easiest way to do this kind of transformation (with a call to lubridate’s ymd function to turn strings of numbers into Dates):

data.daylight <- mutate(data.daylight,
                        date=ymd(paste('2014', month, day)),
                        event.minute=hour * 60 + minute);
 month day event hour minute       date event.minute
   Apr   1  rise    5     27 2014-04-01          327
   Apr   1   set   18     10 2014-04-01         1090
   Apr   2  rise    5     25 2014-04-02          325
   Apr   2   set   18     11 2014-04-02         1091
   Apr   3  rise    5     24 2014-04-03          324
   Apr   3   set   18     12 2014-04-03         1092
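With the data in this shape, the daylight calculation behind the plot at the top of the post is one short step away. Here’s a minimal sketch; the helper function and its name are mine, not from the original code, and merge() does the work of pairing each date’s rise with its set:

```r
# Hours of daylight per day: pair each date's rise and set rows, then
# subtract. merge() sorts by date, so the result comes out chronological
# even though dcast left the months in alphabetical order.
daylight.hours <- function(data.daylight) {
  rise <- data.daylight[data.daylight$event == 'rise', c('date', 'event.minute')]
  set  <- data.daylight[data.daylight$event == 'set',  c('date', 'event.minute')]
  paired <- merge(rise, set, by='date', suffixes=c('.rise', '.set'))
  paired$hours <- (paired$event.minute.set - paired$event.minute.rise) / 60
  paired
}
```

After that, something like plot(hours ~ date, data=daylight.hours(data.daylight), type='l') gets you most of the way to the opening figure.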

Think about how tedious and error-prone it would have been to create the equivalent of the “date” and “event.minute” columns with the data as originally formatted. But now we’re getting into what I call “the other 90%”, which is another story for another time.


  1. There are many ways to skin a cat in R, and therefore many ways you might have generated and assigned these labels; in fact, there are many ways to do almost anything in R. Before I knew about read.fwf, I used readLines and some clever regular expression magic to separate out the time values. Trust me, read.fwf is much easier.