4.1 Tidy data
There is a principle of making data “tidy”, promoted by Dr. Hadley Wickham. Tidying data puts it into the form that software can digest most easily. A dataset is messy or tidy depending on how its rows, columns and tables align with observations, variables and observational units.
The properties of a tidy dataset are:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table.
This forms a standardized way to structure a dataset, and so makes it easy for the analyst to develop standard pipelines.
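As a minimal sketch of the idea, here is a small made-up table (not one of this chapter’s datasets) laid out two ways: the messy version hides the year variable in its column names, while the tidy version gives each variable its own column.
library(tidyverse)
# Messy: the column names 2019 and 2020 are really values of a 'year' variable
messy <- tribble(
  ~country, ~`2019`, ~`2020`,
  'A',          100,     110,
  'B',           50,      55
)
# Tidy: one column per variable, one row per observation
tidy <- tribble(
  ~country, ~year, ~cases,
  'A',       2019,    100,
  'A',       2020,    110,
  'B',       2019,     50,
  'B',       2020,     55
)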
A dataset can be messy in many ways. Some of the more common issues are listed below:
- Column names contain values, not just variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables
Sometimes the messier format is better for data entry, but it is poorly suited to data analysis.
We’ll show a few examples here, but a more detailed discussion is available online.
The workhorse for this tidying activity is the tidyr package, part of the tidyverse meta-package. We’ll tend to start every analysis by loading the tidyverse package, so we are covered.
4.1.1 Variable in column names
library(tidyverse)
library(rio) # import() below comes from the rio package
pew <- import('data/pew.csv')
head(pew)
## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
## 1 Agnostic 27 34 60 81 76 137
## 2 Atheist 12 27 37 52 35 70
## 3 Buddhist 27 21 30 34 33 58
## 4 Catholic 418 617 732 670 638 1116
## 5 Don’t know/refused 15 14 15 11 10 35
## 6 Evangelical Prot 575 869 1064 982 881 1486
## $75-100k $100-150k >150k Don't know/refused
## 1 122 109 84 96
## 2 73 59 74 76
## 3 62 39 53 54
## 4 949 792 633 1489
## 5 21 17 18 116
## 6 949 723 414 1529
This dataset has actual data in the column headers, rather than variable names. This information needs to be captured into a column. We should ideally have 3 columns in this dataset: religion, income and frequency. We can achieve this using a function called gather, which takes a wide dataset and makes it tall. We will do this by forming a pipeline (think of this as a sentence), starting with the dataset.
pew %>%
gather(income, frequency, -religion)
## religion income frequency
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don’t know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
## 7 Hindu <$10k 1
## 8 Historically Black Prot <$10k 228
## 9 Jehovah's Witness <$10k 20
## 10 Jewish <$10k 19
## 11 Mainline Prot <$10k 289
## 12 Mormon <$10k 29
## 13 Muslim <$10k 6
## 14 Orthodox <$10k 13
## 15 Other Christian <$10k 9
## 16 Other Faiths <$10k 20
## 17 Other World Religions <$10k 5
## 18 Unaffiliated <$10k 217
## 19 Agnostic $10-20k 34
## 20 Atheist $10-20k 27
## 21 Buddhist $10-20k 21
## 22 Catholic $10-20k 617
## 23 Don’t know/refused $10-20k 14
## 24 Evangelical Prot $10-20k 869
## 25 Hindu $10-20k 9
## 26 Historically Black Prot $10-20k 244
## 27 Jehovah's Witness $10-20k 27
## 28 Jewish $10-20k 19
## 29 Mainline Prot $10-20k 495
## 30 Mormon $10-20k 40
## 31 Muslim $10-20k 7
## 32 Orthodox $10-20k 17
## 33 Other Christian $10-20k 7
## 34 Other Faiths $10-20k 33
## 35 Other World Religions $10-20k 2
## 36 Unaffiliated $10-20k 299
## 37 Agnostic $20-30k 60
## 38 Atheist $20-30k 37
## 39 Buddhist $20-30k 30
## 40 Catholic $20-30k 732
## 41 Don’t know/refused $20-30k 15
## 42 Evangelical Prot $20-30k 1064
## 43 Hindu $20-30k 7
## 44 Historically Black Prot $20-30k 236
## 45 Jehovah's Witness $20-30k 24
## 46 Jewish $20-30k 25
## 47 Mainline Prot $20-30k 619
## 48 Mormon $20-30k 48
## 49 Muslim $20-30k 9
## 50 Orthodox $20-30k 23
## 51 Other Christian $20-30k 11
## 52 Other Faiths $20-30k 40
## 53 Other World Religions $20-30k 3
## 54 Unaffiliated $20-30k 374
## 55 Agnostic $30-40k 81
## 56 Atheist $30-40k 52
## 57 Buddhist $30-40k 34
## 58 Catholic $30-40k 670
## 59 Don’t know/refused $30-40k 11
## 60 Evangelical Prot $30-40k 982
## 61 Hindu $30-40k 9
## 62 Historically Black Prot $30-40k 238
## 63 Jehovah's Witness $30-40k 24
## 64 Jewish $30-40k 25
## 65 Mainline Prot $30-40k 655
## 66 Mormon $30-40k 51
## 67 Muslim $30-40k 10
## 68 Orthodox $30-40k 32
## 69 Other Christian $30-40k 13
## 70 Other Faiths $30-40k 46
## 71 Other World Religions $30-40k 4
## 72 Unaffiliated $30-40k 365
## 73 Agnostic $40-50k 76
## 74 Atheist $40-50k 35
## 75 Buddhist $40-50k 33
## 76 Catholic $40-50k 638
## 77 Don’t know/refused $40-50k 10
## 78 Evangelical Prot $40-50k 881
## 79 Hindu $40-50k 11
## 80 Historically Black Prot $40-50k 197
## 81 Jehovah's Witness $40-50k 21
## 82 Jewish $40-50k 30
## 83 Mainline Prot $40-50k 651
## 84 Mormon $40-50k 56
## 85 Muslim $40-50k 9
## 86 Orthodox $40-50k 32
## 87 Other Christian $40-50k 13
## 88 Other Faiths $40-50k 49
## 89 Other World Religions $40-50k 2
## 90 Unaffiliated $40-50k 341
## 91 Agnostic $50-75k 137
## 92 Atheist $50-75k 70
## 93 Buddhist $50-75k 58
## 94 Catholic $50-75k 1116
## 95 Don’t know/refused $50-75k 35
## 96 Evangelical Prot $50-75k 1486
## 97 Hindu $50-75k 34
## 98 Historically Black Prot $50-75k 223
## 99 Jehovah's Witness $50-75k 30
## 100 Jewish $50-75k 95
## 101 Mainline Prot $50-75k 1107
## 102 Mormon $50-75k 112
## 103 Muslim $50-75k 23
## 104 Orthodox $50-75k 47
## 105 Other Christian $50-75k 14
## 106 Other Faiths $50-75k 63
## 107 Other World Religions $50-75k 7
## 108 Unaffiliated $50-75k 528
## 109 Agnostic $75-100k 122
## 110 Atheist $75-100k 73
## 111 Buddhist $75-100k 62
## 112 Catholic $75-100k 949
## 113 Don’t know/refused $75-100k 21
## 114 Evangelical Prot $75-100k 949
## 115 Hindu $75-100k 47
## 116 Historically Black Prot $75-100k 131
## 117 Jehovah's Witness $75-100k 15
## 118 Jewish $75-100k 69
## 119 Mainline Prot $75-100k 939
## 120 Mormon $75-100k 85
## 121 Muslim $75-100k 16
## 122 Orthodox $75-100k 38
## 123 Other Christian $75-100k 18
## 124 Other Faiths $75-100k 46
## 125 Other World Religions $75-100k 3
## 126 Unaffiliated $75-100k 407
## 127 Agnostic $100-150k 109
## 128 Atheist $100-150k 59
## 129 Buddhist $100-150k 39
## 130 Catholic $100-150k 792
## 131 Don’t know/refused $100-150k 17
## 132 Evangelical Prot $100-150k 723
## 133 Hindu $100-150k 48
## 134 Historically Black Prot $100-150k 81
## 135 Jehovah's Witness $100-150k 11
## 136 Jewish $100-150k 87
## 137 Mainline Prot $100-150k 753
## 138 Mormon $100-150k 49
## 139 Muslim $100-150k 8
## 140 Orthodox $100-150k 42
## 141 Other Christian $100-150k 14
## 142 Other Faiths $100-150k 40
## 143 Other World Religions $100-150k 4
## 144 Unaffiliated $100-150k 321
## 145 Agnostic >150k 84
## 146 Atheist >150k 74
## 147 Buddhist >150k 53
## 148 Catholic >150k 633
## 149 Don’t know/refused >150k 18
## 150 Evangelical Prot >150k 414
## 151 Hindu >150k 54
## 152 Historically Black Prot >150k 78
## 153 Jehovah's Witness >150k 6
## 154 Jewish >150k 151
## 155 Mainline Prot >150k 634
## 156 Mormon >150k 42
## 157 Muslim >150k 6
## 158 Orthodox >150k 46
## 159 Other Christian >150k 12
## 160 Other Faiths >150k 41
## 161 Other World Religions >150k 4
## 162 Unaffiliated >150k 258
## 163 Agnostic Don't know/refused 96
## 164 Atheist Don't know/refused 76
## 165 Buddhist Don't know/refused 54
## 166 Catholic Don't know/refused 1489
## 167 Don’t know/refused Don't know/refused 116
## 168 Evangelical Prot Don't know/refused 1529
## 169 Hindu Don't know/refused 37
## 170 Historically Black Prot Don't know/refused 339
## 171 Jehovah's Witness Don't know/refused 37
## 172 Jewish Don't know/refused 162
## 173 Mainline Prot Don't know/refused 1328
## 174 Mormon Don't know/refused 69
## 175 Muslim Don't know/refused 22
## 176 Orthodox Don't know/refused 73
## 177 Other Christian Don't know/refused 18
## 178 Other Faiths Don't know/refused 71
## 179 Other World Religions Don't know/refused 8
## 180 Unaffiliated Don't know/refused 597
Let’s parse this out. First we see the new operator %>%, which you can think of as the word “then”. So we start with the dataset pew, “then” we gather its columns into two columns, income and frequency. We don’t want the religion column to be part of this operation, so we “minus” it out, which says: don’t apply the gather operation to the religion column, but use everything else. The religion column gets repeated as needed.
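As a quick illustration of the pipe (using head here purely for demonstration), the following two calls are equivalent:
head(pew, 3)     # the usual function call
pew %>% head(3)  # the same call written as a pipeline: "take pew, then show its first 3 rows"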
The %>% operator can be easily typed in RStudio using the shortcut Ctrl-Shift-M (Cmd-Shift-M on a Mac).
This is now a tidy dataset, since each column is a single variable and each row is a single observation.
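As an aside, recent versions of tidyr supersede gather with pivot_longer. Assuming you have a reasonably current tidyverse installed, the same reshaping can be written as below; gather still works, so we stick with it in this chapter.
pew %>%
  pivot_longer(-religion, names_to = 'income', values_to = 'frequency')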
4.1.2 Multiple variables in column names
tb <- import('data/tb.csv') %>% as_tibble()
head(tb)
## # A tibble: 6 x 22
## iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 AD 1989 NA NA NA NA NA NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA NA NA NA NA NA
## # … with 10 more variables: f04 <int>, f514 <int>, f014 <int>,
## # f1524 <int>, f2534 <int>, f3544 <int>, f4554 <int>, f5564 <int>,
## # f65 <int>, fu <int>
Notice that the column names contain both sex and age group data. First we’ll gather the sex/age columns, as before. Note that there are many missing values in this dataset. These are denoted in R by NA.
tb %>%
gather(sex_age, n, -iso2, -year, -fu)
## # A tibble: 109,611 x 5
## iso2 year fu sex_age n
## <chr> <int> <int> <chr> <int>
## 1 AD 1989 NA m04 NA
## 2 AD 1990 NA m04 NA
## 3 AD 1991 NA m04 NA
## 4 AD 1992 NA m04 NA
## 5 AD 1993 NA m04 NA
## 6 AD 1994 NA m04 NA
## 7 AD 1996 NA m04 NA
## 8 AD 1997 NA m04 NA
## 9 AD 1998 NA m04 NA
## 10 AD 1999 NA m04 NA
## # … with 109,601 more rows
Since there are a lot of missing values here, we can drop them in the above step by adding the na.rm option.
tb %>% gather(sex_age, n, -iso2, -year, -fu, na.rm = TRUE)
## # A tibble: 35,478 x 5
## iso2 year fu sex_age n
## <chr> <int> <int> <chr> <int>
## 1 AD 2005 0 m04 0
## 2 AD 2006 0 m04 0
## 3 AD 2008 0 m04 0
## 4 AE 2006 NA m04 0
## 5 AE 2007 NA m04 0
## 6 AE 2008 0 m04 0
## 7 AG 2007 NA m04 0
## 8 AL 2005 0 m04 0
## 9 AL 2006 0 m04 1
## 10 AL 2007 0 m04 0
## # … with 35,468 more rows
We can now use the function separate to split the data in the sex_age column into sex and age. In this case the data is in a fixed-width format (the first character holds the sex), so we can use that:
tb %>%
gather(sex_age, n, -iso2, -year, -fu, na.rm=T) %>%
separate(sex_age, c("sex","age"), sep=1)
## # A tibble: 35,478 x 6
## iso2 year fu sex age n
## <chr> <int> <int> <chr> <chr> <int>
## 1 AD 2005 0 m 04 0
## 2 AD 2006 0 m 04 0
## 3 AD 2008 0 m 04 0
## 4 AE 2006 NA m 04 0
## 5 AE 2007 NA m 04 0
## 6 AE 2008 0 m 04 0
## 7 AG 2007 NA m 04 0
## 8 AL 2005 0 m 04 0
## 9 AL 2006 0 m 04 1
## 10 AL 2007 0 m 04 0
## # … with 35,468 more rows
If the data were separated by a symbol, like _, we would use sep = "_" instead.
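A minimal sketch of that case, using a small made-up tibble (not the tb data) where sex and age are joined by an underscore:
demo <- tibble(sex_age = c('m_04', 'f_1524'), n = c(0, 3))
demo %>%
  separate(sex_age, c("sex", "age"), sep = "_")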
4.1.3 Variables stored in rows and columns
weather <- import('data/weather.csv') %>% as_tibble()
weather
## # A tibble: 22 x 35
## id year month element d1 d2 d3 d4 d5 d6 d7
## <chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 MX17… 2010 1 tmax NA NA NA NA NA NA NA
## 2 MX17… 2010 1 tmin NA NA NA NA NA NA NA
## 3 MX17… 2010 2 tmax NA 27.3 24.1 NA NA NA NA
## 4 MX17… 2010 2 tmin NA 14.4 14.4 NA NA NA NA
## 5 MX17… 2010 3 tmax NA NA NA NA 32.1 NA NA
## 6 MX17… 2010 3 tmin NA NA NA NA 14.2 NA NA
## 7 MX17… 2010 4 tmax NA NA NA NA NA NA NA
## 8 MX17… 2010 4 tmin NA NA NA NA NA NA NA
## 9 MX17… 2010 5 tmax NA NA NA NA NA NA NA
## 10 MX17… 2010 5 tmin NA NA NA NA NA NA NA
## # … with 12 more rows, and 24 more variables: d8 <dbl>, d9 <lgl>,
## # d10 <dbl>, d11 <dbl>, d12 <lgl>, d13 <dbl>, d14 <dbl>, d15 <dbl>,
## # d16 <dbl>, d17 <dbl>, d18 <lgl>, d19 <lgl>, d20 <lgl>, d21 <lgl>,
## # d22 <lgl>, d23 <dbl>, d24 <lgl>, d25 <dbl>, d26 <dbl>, d27 <dbl>,
## # d28 <dbl>, d29 <dbl>, d30 <dbl>, d31 <dbl>
Here, for each year and month, the data for each day of the month is stored in columns. For each day, two values are noted: the max (tmax) and min (tmin) temperature that day, stored as rows.
We start by gathering the extra columns as before:
weather %>%
gather(day, temp, d1:d31)
## # A tibble: 682 x 6
## id year month element day temp
## <chr> <int> <int> <chr> <chr> <dbl>
## 1 MX17004 2010 1 tmax d1 NA
## 2 MX17004 2010 1 tmin d1 NA
## 3 MX17004 2010 2 tmax d1 NA
## 4 MX17004 2010 2 tmin d1 NA
## 5 MX17004 2010 3 tmax d1 NA
## 6 MX17004 2010 3 tmin d1 NA
## 7 MX17004 2010 4 tmax d1 NA
## 8 MX17004 2010 4 tmin d1 NA
## 9 MX17004 2010 5 tmax d1 NA
## 10 MX17004 2010 5 tmin d1 NA
## # … with 672 more rows
Here’s a new notation: d1:d31. This means all the columns starting at d1 and ending at d31. This notation originally comes from creating sequences of numbers; see what happens if you type 1:30 in the console.
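Inside tidyverse verbs, the colon selects a contiguous range of columns. For example (using dplyr’s select just to show the idea):
weather %>%
  select(id, d1:d3)   # keeps id plus every column from d1 through d3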
Now, for each date, we have two rows of data. These need to be two columns of data. So we need the reverse operation from gather. This is called spread.
weather %>%
gather(date, temp, d1:d31) %>%
spread(element, temp)
## # A tibble: 341 x 6
## id year month date tmax tmin
## <chr> <int> <int> <chr> <dbl> <dbl>
## 1 MX17004 2010 1 d1 NA NA
## 2 MX17004 2010 1 d10 NA NA
## 3 MX17004 2010 1 d11 NA NA
## 4 MX17004 2010 1 d12 NA NA
## 5 MX17004 2010 1 d13 NA NA
## 6 MX17004 2010 1 d14 NA NA
## 7 MX17004 2010 1 d15 NA NA
## 8 MX17004 2010 1 d16 NA NA
## 9 MX17004 2010 1 d17 NA NA
## 10 MX17004 2010 1 d18 NA NA
## # … with 331 more rows
We tell spread which column’s values should become the new column names (element here) and which column provides the data to fill those columns (temp here).
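As with gather, spread has a newer replacement, pivot_wider. Assuming a current tidyr, the same pipeline could be written as the following sketch:
weather %>%
  pivot_longer(d1:d31, names_to = 'date', values_to = 'temp') %>%
  pivot_wider(names_from = element, values_from = temp)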