4.3 Cleaning up data types and values
After you have tidied your data, lets call that tidy dataset mbta2
.
mbta2
## # A tibble: 638 x 5
## ..1 mode year month avg_trips
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 All Modes by Qtr 2007 01 NA
## 2 2 Boat 2007 01 4
## 3 3 Bus 2007 01 335.819
## 4 4 Commuter Rail 2007 01 142.2
## 5 5 Heavy Rail 2007 01 435.294
## 6 6 Light Rail 2007 01 227.231
## 7 7 Pct Chg / Yr 2007 01 0.02
## 8 8 Private Bus 2007 01 4.772
## 9 9 RIDE 2007 01 4.9
## 10 10 Trackless Trolley 2007 01 12.757
## # … with 628 more rows
We see that there’s still some issues. If you look at the top of the dataset, you’ll see
that year, month and avg_trips are all character variables and not numeric variables. (You can see
this if you converted to a tibble using as_tibble
. Otherwise, type str(mbta2)
at the console). Also, there
is this odd column with the name ..1
that is just an index of rows. Lastly, the row marked TOTAL
is
not necessary since it’s a derived row, and the All Modes by Qtr
row is missing in many times, and appears inconsistent
with TOTAL
.
First, let’s deal with the type issue.
mbta2 %>%
mutate(
year = parse_number(year),
month = parse_number(month),
avg_trips = parse_number(avg_trips)
)
## # A tibble: 638 x 5
## ..1 mode year month avg_trips
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 All Modes by Qtr 2007 1 NA
## 2 2 Boat 2007 1 4
## 3 3 Bus 2007 1 336.
## 4 4 Commuter Rail 2007 1 142.
## 5 5 Heavy Rail 2007 1 435.
## 6 6 Light Rail 2007 1 227.
## 7 7 Pct Chg / Yr 2007 1 0.02
## 8 8 Private Bus 2007 1 4.77
## 9 9 RIDE 2007 1 4.9
## 10 10 Trackless Trolley 2007 1 12.8
## # … with 628 more rows
A more advanced version of this operation would be
mbta2 %>%
mutate_at(vars(year, month, avg_trips), parse_number)
Next we want to get rid of that first column. The verb we’ll use here is select
.
mbta2 %>%
mutate(
year = parse_number(year),
month = parse_number(month),
avg_trips = parse_number(avg_trips)
) %>%
select(-1) # Get rid of 1st column
## # A tibble: 638 x 4
## mode year month avg_trips
## <chr> <dbl> <dbl> <dbl>
## 1 All Modes by Qtr 2007 1 NA
## 2 Boat 2007 1 4
## 3 Bus 2007 1 336.
## 4 Commuter Rail 2007 1 142.
## 5 Heavy Rail 2007 1 435.
## 6 Light Rail 2007 1 227.
## 7 Pct Chg / Yr 2007 1 0.02
## 8 Private Bus 2007 1 4.77
## 9 RIDE 2007 1 4.9
## 10 Trackless Trolley 2007 1 12.8
## # … with 628 more rows
Lastly, we want to get rid of rows where mode equals TOTAL or “All Modes by Qtr”. Our verb here is filter
.
mbta3 <- mbta2 %>%
mutate(
year = parse_number(year),
month = parse_number(month),
avg_trips = parse_number(avg_trips)
) %>%
select(-1) %>%
filter(mode != 'TOTAL', mode != "All Modes by Qtr")
Note that the strings in quotes have to be exact matches to what you want to look for. The !=
means “not equals”.
We’re assigning this to a new variable, mbta3, which is our clean dataset.
In R, filtering refers to keeping or removing rows that meet some criterion; selecting refers to keeping or removing columns. The corresponding “verbs” to put into your pipeline are
filter
andselect
.