4.3 Cleaning up data types and values

After you have tidied your data, lets call that tidy dataset mbta2.

mbta2

## # A tibble: 638 x 5
##      ..1 mode              year  month avg_trips
##    <dbl> <chr>             <chr> <chr> <chr>    
##  1     1 All Modes by Qtr  2007  01    NA       
##  2     2 Boat              2007  01    4        
##  3     3 Bus               2007  01    335.819  
##  4     4 Commuter Rail     2007  01    142.2    
##  5     5 Heavy Rail        2007  01    435.294  
##  6     6 Light Rail        2007  01    227.231  
##  7     7 Pct Chg / Yr      2007  01    0.02     
##  8     8 Private Bus       2007  01    4.772    
##  9     9 RIDE              2007  01    4.9      
## 10    10 Trackless Trolley 2007  01    12.757   
## # … with 628 more rows

We see that there’s still some issues. If you look at the top of the dataset, you’ll see that year, month and avg_trips are all character variables and not numeric variables. (You can see this if you converted to a tibble using as_tibble. Otherwise, type str(mbta2) at the console). Also, there is this odd column with the name ..1 that is just an index of rows. Lastly, the row marked TOTAL is not necessary since it’s a derived row, and the All Modes by Qtr row is missing in many times, and appears inconsistent with TOTAL.

First, let’s deal with the type issue.

mbta2 %>% 
  mutate(
    year = parse_number(year),
    month = parse_number(month),
    avg_trips = parse_number(avg_trips)
  )

## # A tibble: 638 x 5
##      ..1 mode               year month avg_trips
##    <dbl> <chr>             <dbl> <dbl>     <dbl>
##  1     1 All Modes by Qtr   2007     1     NA   
##  2     2 Boat               2007     1      4   
##  3     3 Bus                2007     1    336.  
##  4     4 Commuter Rail      2007     1    142.  
##  5     5 Heavy Rail         2007     1    435.  
##  6     6 Light Rail         2007     1    227.  
##  7     7 Pct Chg / Yr       2007     1      0.02
##  8     8 Private Bus        2007     1      4.77
##  9     9 RIDE               2007     1      4.9 
## 10    10 Trackless Trolley  2007     1     12.8 
## # … with 628 more rows

A more advanced version of this operation would be

mbta2 %>% 
  mutate_at(vars(year, month, avg_trips), parse_number)

Next we want to get rid of that first column. The verb we’ll use here is select.

mbta2 %>% 
  mutate(
    year = parse_number(year),
    month = parse_number(month),
    avg_trips = parse_number(avg_trips)
  ) %>% 
  select(-1) # Get rid of 1st column

## # A tibble: 638 x 4
##    mode               year month avg_trips
##    <chr>             <dbl> <dbl>     <dbl>
##  1 All Modes by Qtr   2007     1     NA   
##  2 Boat               2007     1      4   
##  3 Bus                2007     1    336.  
##  4 Commuter Rail      2007     1    142.  
##  5 Heavy Rail         2007     1    435.  
##  6 Light Rail         2007     1    227.  
##  7 Pct Chg / Yr       2007     1      0.02
##  8 Private Bus        2007     1      4.77
##  9 RIDE               2007     1      4.9 
## 10 Trackless Trolley  2007     1     12.8 
## # … with 628 more rows

Lastly, we want to get rid of rows where mode equals TOTAL or “All Modes by Qtr”. Our verb here is filter.

mbta3 <- mbta2 %>% 
  mutate(
    year = parse_number(year),
    month = parse_number(month),
    avg_trips = parse_number(avg_trips)
  ) %>% 
  select(-1) %>% 
  filter(mode != 'TOTAL', mode != "All Modes by Qtr")

Note that the strings in quotes have to be exact matches to what you want to look for. The != means “not equals”.

We’re assigning this to a new variable, mbta3, which is our clean dataset.

In R, filtering refers to keeping or removing rows that meet some criterion; selecting refers to keeping or removing columns. The corresponding “verbs” to put into your pipeline are filter and select.