posted December 9 2014
the first 90%
I work with lots of data. Not what you’d call “big data”, at least not technically, but maybe “biggish”. More than enough that Excel would crash just trying to open the dataset, assuming you were foolish enough to try. The amount of data is voluminous enough, and the relationship between the raw data and what you’re trying to analyze complex enough, that you need pretty decent data management chops to even access it correctly. But let’s say you have accessed it correctly. Now you can proceed to perform your analysis, make data visaulizations, and be a sorcerer of the digital age. Right?
Wrong. You left out the most important step: getting your data into the right format, making sure each data point has all the right labels in the right places to allow you to proceed to the real science. Cleaning data—that is, pulling in the unprocessed data, transforming it, rearranging it, relabeling it, discarding garbage, and otherwise getting it into a format that will play nicely with your analysis tools—is easily 90% of the job.
Let me give you an example.
This is a plot of the hours of daylight (sunset time subtracted from sunrise time) that Boston, Massachusetts received throughout 2014. I got the data from the US Naval Observatory after reading this post about the merits of Daylight Savings Time. Request a data file for any location in the US, and you’ll find it looks like this (scroll the box rightward to see the whole thing):
o , o , BOSTON, MASSACHUSETTS Astronomical Applications Dept.
Location: W071 05, N42 19 Rise and Set for the Sun for 2014 U. S. Naval Observatory
Washington, DC 20392-5420
Eastern Standard Time
Jan. Feb. Mar. Apr. May June July Aug. Sept. Oct. Nov. Dec.
Day Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set Rise Set
h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m h m
01 0713 1623 0658 1659 0620 1734 0527 1810 0440 1844 0410 1914 0411 1925 0437 1904 0510 1818 0541 1726 0618 1638 0654 1613
02 0714 1624 0657 1700 0618 1735 0525 1811 0438 1845 0410 1915 0412 1925 0438 1903 0511 1817 0543 1724 0619 1637 0655 1613
03 0714 1624 0656 1701 0616 1737 0524 1812 0437 1846 0409 1916 0413 1924 0439 1901 0512 1815 0544 1722 0620 1635 0656 1612
04 0714 1625 0655 1702 0615 1738 0522 1814 0436 1847 0409 1917 0413 1924 0440 1900 0513 1813 0545 1721 0621 1634 0657 1612
05 0713 1626 0653 1704 0613 1739 0520 1815 0435 1848 0409 1917 0414 1924 0441 1859 0514 1811 0546 1719 0623 1633 0658 1612
06 0713 1627 0652 1705 0612 1740 0518 1816 0433 1849 0408 1918 0414 1924 0442 1858 0515 1810 0547 1717 0624 1632 0659 1612
07 0713 1628 0651 1706 0610 1741 0517 1817 0432 1850 0408 1919 0415 1923 0443 1856 0516 1808 0548 1716 0625 1631 0700 1612
08 0713 1629 0650 1708 0608 1743 0515 1818 0431 1852 0408 1919 0416 1923 0444 1855 0517 1806 0549 1714 0626 1629 0701 1612
09 0713 1630 0649 1709 0607 1744 0513 1819 0430 1853 0408 1920 0416 1922 0445 1854 0518 1805 0550 1712 0628 1628 0702 1612
10 0713 1632 0647 1710 0605 1745 0512 1820 0428 1854 0407 1920 0417 1922 0446 1852 0519 1803 0551 1711 0629 1627 0702 1612
11 0712 1633 0646 1712 0603 1746 0510 1821 0427 1855 0407 1921 0418 1921 0447 1851 0520 1801 0553 1709 0630 1626 0703 1612
12 0712 1634 0645 1713 0601 1747 0508 1823 0426 1856 0407 1921 0419 1921 0448 1850 0521 1759 0554 1707 0631 1625 0704 1612
13 0712 1635 0644 1714 0600 1749 0507 1824 0425 1857 0407 1922 0419 1920 0450 1848 0522 1758 0555 1706 0633 1624 0705 1612
14 0711 1636 0642 1715 0558 1750 0505 1825 0424 1858 0407 1922 0420 1920 0451 1847 0523 1756 0556 1704 0634 1623 0706 1612
15 0711 1637 0641 1717 0556 1751 0504 1826 0423 1859 0407 1923 0421 1919 0452 1845 0524 1754 0557 1702 0635 1622 0706 1613
16 0710 1638 0639 1718 0555 1752 0502 1827 0422 1900 0407 1923 0422 1919 0453 1844 0525 1752 0558 1701 0636 1622 0707 1613
17 0710 1640 0638 1719 0553 1753 0500 1828 0421 1901 0407 1923 0423 1918 0454 1842 0526 1751 0559 1659 0637 1621 0708 1613
18 0709 1641 0637 1721 0551 1754 0459 1829 0420 1902 0407 1924 0424 1917 0455 1841 0527 1749 0601 1658 0639 1620 0708 1614
19 0709 1642 0635 1722 0549 1755 0457 1830 0419 1903 0407 1924 0424 1916 0456 1839 0529 1747 0602 1656 0640 1619 0709 1614
20 0708 1643 0634 1723 0548 1757 0456 1832 0418 1904 0408 1924 0425 1916 0457 1838 0530 1745 0603 1655 0641 1618 0709 1614
21 0707 1644 0632 1724 0546 1758 0454 1833 0418 1905 0408 1925 0426 1915 0458 1836 0531 1743 0604 1653 0642 1618 0710 1615
22 0707 1646 0631 1726 0544 1759 0453 1834 0417 1906 0408 1925 0427 1914 0459 1835 0532 1742 0605 1652 0644 1617 0711 1615
23 0706 1647 0629 1727 0543 1800 0451 1835 0416 1907 0408 1925 0428 1913 0500 1833 0533 1740 0607 1650 0645 1617 0711 1616
24 0705 1648 0628 1728 0541 1801 0450 1836 0415 1908 0409 1925 0429 1912 0501 1832 0534 1738 0608 1649 0646 1616 0711 1617
25 0704 1650 0626 1729 0539 1802 0448 1837 0414 1909 0409 1925 0430 1911 0502 1830 0535 1736 0609 1647 0647 1615 0712 1617
26 0704 1651 0624 1731 0537 1803 0447 1838 0414 1910 0409 1925 0431 1910 0503 1828 0536 1735 0610 1646 0648 1615 0712 1618
27 0703 1652 0623 1732 0536 1805 0445 1839 0413 1910 0410 1925 0432 1909 0504 1827 0537 1733 0611 1644 0649 1614 0712 1619
28 0702 1653 0621 1733 0534 1806 0444 1841 0412 1911 0410 1925 0433 1908 0505 1825 0538 1731 0613 1643 0650 1614 0713 1619
29 0701 1655 0532 1807 0443 1842 0412 1912 0411 1925 0434 1907 0506 1823 0539 1729 0614 1642 0652 1614 0713 1620
30 0700 1656 0530 1808 0441 1843 0411 1913 0411 1925 0435 1906 0507 1822 0540 1728 0615 1640 0653 1613 0713 1621
31 0659 1657 0529 1809 0411 1914 0436 1905 0509 1820 0616 1639 0713 1622
Add one hour for daylight time, if and when in use.
Getting that plot out of this data turns out to be a little tricky, and most of the trick is in the import and cleanup phases. Right now, the data are arranged such that the day of the month is on the rows, while the month, hour, minute, and sunrise/sunset label are on the columns. This is often called “wide” data, which is easy to look at, but usually hard to work with. Our goal is to create a “long” dataset in which each row holds a single timestamp corresponding to one day’s sunrise or sunset (essentially, two rows per day). I’m going to show you how to do it using R. You’ll also need the following R packages: reshape2
, plyr
, and lubridate
.
First things first, we need to import the data, ideally so that each meaningful number (the hours and minutes for each day of the year) ends up in a neat column. While the double-nested headers are unfortunate (hour and minute are nested within sunrise/sunset, which are nested within month), at least the data follow a nice fixed-width format, with each column ending after a predictable number of characters. R happens to have a handy read.fwf
function, which is specialized for reading in these types of files.
data.raw <- read.fwf(
file='Boston Daylight Data 2014.txt',
skip=9,
nrows=31,
colClasses='numeric',
strip.white=T,
widths=c(2, rep(c(4, 2, 3, 2), 12))
);
The read.fwf
command accomplishes a lot, so I’ve spread its arguments out over several lines. I’m telling the function to read in the file, skip its first nine rows (none of which contain data), read exactly the next 31, make sure to import all the columns as numbers (not text strings), strip out any extra whitespace, and lastly, how many characters wide each column should be. This produces a dataset that looks like this (I’m cutting out a lot of the data, but there are a total of 49 columns and 31 rows):
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 7 13 16 23 6 58 16 59 6 20 17 34 5 27 18 10 4 40 18
2 7 14 16 24 6 57 17 0 6 18 17 35 5 25 18 11 4 38 18
3 7 14 16 24 6 56 17 1 6 16 17 37 5 24 18 12 4 37 18
4 7 14 16 25 6 55 17 2 6 15 17 38 5 22 18 14 4 36 18
5 7 13 16 26 6 53 17 4 6 13 17 39 5 20 18 15 4 35 18
6 7 13 16 27 6 52 17 5 6 12 17 40 5 18 18 16 4 33 18
Now we just need to name the columns:
colnames(data.raw) <- c('day', paste(rep(month.abb, each=4), rep(c('rise', 'set'), each=2), c('hour', 'minute')));
This is a somewhat confusing use of the paste
function, but basically I’m creating a vector of names: the first one is “day”, followed by names that follow the convention “Month rise/set hour/minute” (for example, “Jan rise hour”). Creating the labels at this stage saves us the trouble of having to extract them later.1 Our next step is to melt
the dataset.
data.daylight <- melt(data.raw, id.vars='day');
By default, melt
wants to reduce a dataset to just two columns: “variable” and “value” (“variable” becomes a column containing the dataset’s former column names, and “value” stores their corresponding values). The columns specified in id.vars
are preserved in the new data frame, and are not melted into the “variable” column. So now our dataset looks like this:
day variable value
1 Jan rise hour 7
2 Jan rise hour 7
3 Jan rise hour 7
4 Jan rise hour 7
5 Jan rise hour 7
6 Jan rise hour 7
Now I want to take my “variable” column and split it into three new columns: month, event (sunrise/sunset), and time (hour/minute). This is easily done with colsplit
. Note that I’m combining it with cbind
, so that I can attach the new columns to my dataset without creating a temporary variable.
data.daylight <- cbind(data.daylight, colsplit(data.daylight$variable, ' ', c('month', 'event', 'time')));
Which makes the data look like this:
day variable value month event time
1 Jan rise hour 7 Jan rise hour
2 Jan rise hour 7 Jan rise hour
3 Jan rise hour 7 Jan rise hour
4 Jan rise hour 7 Jan rise hour
5 Jan rise hour 7 Jan rise hour
6 Jan rise hour 7 Jan rise hour
We’re nearly there. All that’s left is to get each event’s hour and minute into the same row. As near as I can tell, there’s no better way to do it than with the handy dcast
function. With this function, I’m saying that “month”, “day”, and “event” should define the rows, while the different values stored in “time” should form new columns.
data.daylight <- dcast(data.daylight, month + day + event ~ time);
month day event hour minute
Apr 1 rise 5 27
Apr 1 set 18 10
Apr 2 rise 5 25
Apr 2 set 18 11
Apr 3 rise 5 24
Apr 3 set 18 12
From importing the data to this use of dcast
, I’ve only written five lines of code. Now would be a great time to scroll back up and remember how the data looked originally. I’ll wait.
And that’s what I call “the first 90%”. The data are now in a highly flexible “long” format, and can be used with ease. For example, say we wanted to a) convert the “month” and “day” columns into proper Date data, which will make plotting much easier, and b) calculate the minute of the day at which the sunrise/sunset event occurred. Enter mutate
, the easiest way to do this kind of transformation (with a call to lubridate
’s ymd
function to turn strings of numbers into Dates):
data.daylight <- mutate(data.daylight,
date=ymd(paste('2014', month, day)),
event.minute=hour * 60 + minute);
month day event hour minute date event.minute
Apr 1 rise 5 27 2014-04-01 327
Apr 1 set 18 10 2014-04-01 1090
Apr 2 rise 5 25 2014-04-02 325
Apr 2 set 18 11 2014-04-02 1091
Apr 3 rise 5 24 2014-04-03 324
Apr 3 set 18 12 2014-04-03 1092
Think about how tedious and error-prone it would have been to create the equivalent of the “date” and “event.minute” columns with the data as originally formatted. But now we’re getting into what I call “the other 90%”, which is another story for another time.
-
There a lot of different ways to skin a cat in R, and therefore lots of different ways you might have generated and assigned these labels. In fact, there are lots of ways to do almost anything in R. Before I knew about
read.fwf
, I usedreadLines
and some clever regular expression magic to separate out the time values. Trust me,read.fwf
is much easier. ↩