4.2 Data cleaning

The weather data set shows that we still need to do a bit more cleaning to this data to make it workable. Mainly, we need to fix the dat column to make it numeric. Note the odd ordering, where d1 is followed by d10. This is an alphabetical ordering rather than a numeric ordering. We’ll now add to our pipeline (sentence) to make this happen:

weather %>% 
  gather(date, temp, d1:d31) %>% 
  spread(element, temp) %>% 
  mutate(date = parse_number(date))
## # A tibble: 341 x 6
##    id       year month  date  tmax  tmin
##    <chr>   <int> <int> <dbl> <dbl> <dbl>
##  1 MX17004  2010     1     1    NA    NA
##  2 MX17004  2010     1    10    NA    NA
##  3 MX17004  2010     1    11    NA    NA
##  4 MX17004  2010     1    12    NA    NA
##  5 MX17004  2010     1    13    NA    NA
##  6 MX17004  2010     1    14    NA    NA
##  7 MX17004  2010     1    15    NA    NA
##  8 MX17004  2010     1    16    NA    NA
##  9 MX17004  2010     1    17    NA    NA
## 10 MX17004  2010     1    18    NA    NA
## # … with 331 more rows

Here we introduce another “verb”, mutate. This function changes a column, either in-place as we did here, or by creating a new variable.

The data is still not quite in the right format, since the date column is in a weird order. We can add another verb to this pipe to fix that: arrange.

weather %>% 
  gather(date, temp, d1:d31) %>% 
  spread(element, temp) %>% 
  mutate(date = parse_number(date)) %>% 
  arrange(date)
## # A tibble: 341 x 6
##    id       year month  date  tmax  tmin
##    <chr>   <int> <int> <dbl> <dbl> <dbl>
##  1 MX17004  2010     1     1    NA    NA
##  2 MX17004  2010     2     1    NA    NA
##  3 MX17004  2010     3     1    NA    NA
##  4 MX17004  2010     4     1    NA    NA
##  5 MX17004  2010     5     1    NA    NA
##  6 MX17004  2010     6     1    NA    NA
##  7 MX17004  2010     7     1    NA    NA
##  8 MX17004  2010     8     1    NA    NA
##  9 MX17004  2010    10     1    NA    NA
## 10 MX17004  2010    11     1    NA    NA
## # … with 331 more rows

Not quite, right? We’re not used to seeing all the 1st of the months together, and so forth. We want all the daes for month 1, then all the dates for month two, and so on. This can be done by modifying the arrange command, by sorting first by month and then by date (essentially within month).

weather %>% 
  gather(date, temp, d1:d31) %>% 
  spread(element, temp) %>% 
  mutate(date = parse_number(date)) %>% 
  arrange(month, date)
## # A tibble: 341 x 6
##    id       year month  date  tmax  tmin
##    <chr>   <int> <int> <dbl> <dbl> <dbl>
##  1 MX17004  2010     1     1    NA    NA
##  2 MX17004  2010     1     2    NA    NA
##  3 MX17004  2010     1     3    NA    NA
##  4 MX17004  2010     1     4    NA    NA
##  5 MX17004  2010     1     5    NA    NA
##  6 MX17004  2010     1     6    NA    NA
##  7 MX17004  2010     1     7    NA    NA
##  8 MX17004  2010     1     8    NA    NA
##  9 MX17004  2010     1     9    NA    NA
## 10 MX17004  2010     1    10    NA    NA
## # … with 331 more rows

Finally, if we want to save this, we need to assign this final product a name.

weather2 <- weather %>% 
  gather(date, temp, d1:d31) %>% 
  spread(element, temp) %>% 
  mutate(date = parse_number(date)) %>% 
  arrange(month, date)

Exercise

The file data/mbta.xlsx contains monthly data on number of commuter trips by different modalities on the MBTA system n Boston. It is in a messy format. It also has an additional quirk in that it has a title on the first line that isn’t even data. You can avoid loading that in by using the option skip=1 (i.e. skip the first line) when you import. Work through this process to clean this dataset into tidy form. I’ll also note that you can “minus” columns by position as well as name, so gather(date, avg_trips, -1, -mode) is valid to not involve the first column and the mode column.