4.2 Data cleaning
The weather data set shows that we still need to do a bit more cleaning to this data to make it workable.
Mainly, we need to fix the dat
column to make it numeric. Note the odd ordering, where d1
is followed by d10
. This
is an alphabetical ordering rather than a numeric ordering. We’ll now add to our pipeline (sentence) to make this
happen:
weather %>%
gather(date, temp, d1:d31) %>%
spread(element, temp) %>%
mutate(date = parse_number(date))
## # A tibble: 341 x 6
## id year month date tmax tmin
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 1 NA NA
## 2 MX17004 2010 1 10 NA NA
## 3 MX17004 2010 1 11 NA NA
## 4 MX17004 2010 1 12 NA NA
## 5 MX17004 2010 1 13 NA NA
## 6 MX17004 2010 1 14 NA NA
## 7 MX17004 2010 1 15 NA NA
## 8 MX17004 2010 1 16 NA NA
## 9 MX17004 2010 1 17 NA NA
## 10 MX17004 2010 1 18 NA NA
## # … with 331 more rows
Here we introduce another “verb”, mutate
. This function changes a column, either in-place as we did here,
or by creating a new variable.
The data is still not quite in the right format, since the date column is in a weird order. We can add
another verb to this pipe to fix that: arrange
.
weather %>%
gather(date, temp, d1:d31) %>%
spread(element, temp) %>%
mutate(date = parse_number(date)) %>%
arrange(date)
## # A tibble: 341 x 6
## id year month date tmax tmin
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 1 NA NA
## 2 MX17004 2010 2 1 NA NA
## 3 MX17004 2010 3 1 NA NA
## 4 MX17004 2010 4 1 NA NA
## 5 MX17004 2010 5 1 NA NA
## 6 MX17004 2010 6 1 NA NA
## 7 MX17004 2010 7 1 NA NA
## 8 MX17004 2010 8 1 NA NA
## 9 MX17004 2010 10 1 NA NA
## 10 MX17004 2010 11 1 NA NA
## # … with 331 more rows
Not quite, right? We’re not used to seeing all the 1st of the months together, and so forth. We want all the daes for month 1, then all the dates for month two, and so on. This can be done by modifying the arrange
command, by sorting first by month and then by date (essentially within month).
weather %>%
gather(date, temp, d1:d31) %>%
spread(element, temp) %>%
mutate(date = parse_number(date)) %>%
arrange(month, date)
## # A tibble: 341 x 6
## id year month date tmax tmin
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 MX17004 2010 1 1 NA NA
## 2 MX17004 2010 1 2 NA NA
## 3 MX17004 2010 1 3 NA NA
## 4 MX17004 2010 1 4 NA NA
## 5 MX17004 2010 1 5 NA NA
## 6 MX17004 2010 1 6 NA NA
## 7 MX17004 2010 1 7 NA NA
## 8 MX17004 2010 1 8 NA NA
## 9 MX17004 2010 1 9 NA NA
## 10 MX17004 2010 1 10 NA NA
## # … with 331 more rows
Finally, if we want to save this, we need to assign this final product a name.
weather2 <- weather %>%
gather(date, temp, d1:d31) %>%
spread(element, temp) %>%
mutate(date = parse_number(date)) %>%
arrange(month, date)
Exercise
The file data/mbta.xlsx
contains monthly data on number of commuter trips by different modalities on the MBTA system n Boston. It
is in a messy format. It also has an additional quirk in that it has a title on the first line that
isn’t even data. You can avoid loading that in by using the option skip=1
(i.e. skip the first line) when you
import. Work through this process to clean this dataset into tidy form. I’ll also note that you can “minus” columns by
position as well as name, so gather(date, avg_trips, -1, -mode)
is valid to not involve the first column and the mode
column.