Chapter 5 Describing data

Data cleaning and munging are important, yet tedious, work. Now that we’re done with that, we can get to the fun part: exploring the data.

Our workhorse for this will be the dplyr package, part of the tidyverse. This package provides six basic verbs:

  • filter: filter a dataset by rows
  • select: select columns of a dataset
  • arrange: arrange rows of a dataset by values of some variables
  • mutate: add new columns or transform existing ones
  • group_by: split a dataset by values of some variables, so that we can apply verbs to each split
  • summarize: compute various summaries from the data
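
As a quick illustration of how these verbs chain together, here is a minimal sketch on R's built-in mtcars data. The filter condition and the columns chosen are arbitrary and only serve the example.

```r
library(dplyr)

# Keep the 4-cylinder cars, pick three columns, and sort by mpg (highest first)
four_cyl <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, hp, wt) %>%
  arrange(desc(mpg))

head(four_cyl, 3)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe-based chaining possible.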

dplyr also has verbs that let you stitch datasets together: left_join, right_join, inner_join, full_join, semi_join, anti_join, bind_rows and bind_cols.
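
To make the differences between some of the joins concrete, here is a small sketch. The two tables, economy and power, are hypothetical and exist only for this illustration; they share the key column cars.

```r
library(dplyr)

# Two small hypothetical tables that share the key column `cars`
economy <- data.frame(cars = c("Mazda RX4", "Datsun 710", "Valiant"),
                      mpg  = c(21.0, 22.8, 18.1),
                      stringsAsFactors = FALSE)
power   <- data.frame(cars = c("Mazda RX4", "Valiant", "Fiat 128"),
                      hp   = c(110, 105, 66),
                      stringsAsFactors = FALSE)

# All rows of economy; hp is NA where power has no matching car
left <- left_join(economy, power, by = "cars")

# Only the cars that appear in both tables
inner <- inner_join(economy, power, by = "cars")

# Every car that appears in either table
full <- full_join(economy, power, by = "cars")

# Cars in economy that have no match in power
only_economy <- anti_join(economy, power, by = "cars")
```

Here left has 3 rows, inner has 2, full has 4, and only_economy contains just the Datsun 710.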

I’ll be using the latest dplyr, version 0.8.0.9009. If you have an older version, most but not all things will work.

We’ll start with the built-in mtcars dataset.

library(tidyverse)
mtcars1 <- mtcars %>% rownames_to_column('cars') %>% as_tibble()
mtcars1
## # A tibble: 32 x 12
##    cars     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazda…  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2 Mazda…  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3 Datsu…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4 Horne…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5 Horne…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6 Valia…  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7 Duste…  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8 Merc …  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9 Merc …  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10 Merc …  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

Let’s get some summary statistics from this dataset. First, let’s compute the average mpg, displacement, and horsepower of these cars.

mtcars1 %>% summarize(mpg = mean(mpg, na.rm=T), disp = mean(disp, na.rm=T), hp = mean(hp, na.rm=T))
## # A tibble: 1 x 3
##     mpg  disp    hp
##   <dbl> <dbl> <dbl>
## 1  20.1  231.  147.

I’m adding the option na.rm=T to remove any missing values before averaging; without it, even a single missing value in a column would make its mean NA.
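
The effect of na.rm is easiest to see on a small vector. The vector x below is made up for the illustration.

```r
# A hypothetical vector with one missing value
x <- c(2, 4, NA, 6)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 4, the mean of 2, 4 and 6
```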

Most dplyr verbs also have scoped versions, *_all, *_at and *_if, which can be useful. The *_all versions act upon all the columns, the *_at versions on specified columns (specified the same way as in select) and the *_if versions on columns with particular properties, such as being numeric. So the above code could be written as

mtcars1 %>% summarize_at(vars(mpg, disp, hp), mean, na.rm = T)
## # A tibble: 1 x 3
##     mpg  disp    hp
##   <dbl> <dbl> <dbl>
## 1  20.1  231.  147.
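
A note for readers on newer installations: in dplyr 1.0.0 and later, the scoped variants are marked as superseded in favour of across(). The sketch below assumes such a version is installed (it will not run on the 0.8 series used in this chapter) and should give the same three means.

```r
library(dplyr)

# Same summary as above, written with across() (assumes dplyr >= 1.0.0)
means_across <- mtcars %>%
  summarize(across(c(mpg, disp, hp), ~ mean(.x, na.rm = TRUE)))

means_across
```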

Factors or categorical variables

R has a data type called a factor, which is meant for categorical or discrete variables; things like sex, race, and, in this data, cyl. We need to transform the column to this data type first.

mtcars1 <- mtcars1 %>% 
  mutate(cyl = as.factor(cyl))

There are a few others that might need this treatment.

mtcars1 <- mtcars1 %>% 
  mutate_at(vars(cyl, vs, am, gear, carb), as.factor)

Let’s see what we’ve done.

str(mtcars1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    32 obs. of  12 variables:
##  $ cars: chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
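
One thing to keep in mind after such a conversion: a factor stores integer codes that point to its levels, as the str output shows. Converting it straight back with as.numeric therefore returns the codes rather than the original values. A small sketch, using a made-up vector:

```r
# A hypothetical vector of cylinder counts
cyl_values <- c(6, 6, 4, 8)
cyl_factor <- as.factor(cyl_values)

levels(cyl_factor)                    # "4" "6" "8"
as.numeric(cyl_factor)                # 2 2 1 3: the internal codes
as.numeric(as.character(cyl_factor))  # 6 6 4 8: the original values
```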

Now, let’s compute averages of all the non-factor, or numeric, variables.

mtcars1 %>% summarize_if(is.numeric, mean)
## # A tibble: 1 x 6
##     mpg  disp    hp  drat    wt  qsec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  20.1  231.  147.  3.60  3.22  17.8
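
Factors become especially handy together with group_by, which splits the data so that summarize is applied within each level. The following is a minimal sketch; it starts from the raw mtcars data so that it runs on its own, and the choice of summaries is arbitrary.

```r
library(dplyr)

# Number of cars, average mpg and average horsepower per cylinder count
by_cyl <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarize(n = n(), mpg = mean(mpg), hp = mean(hp))

by_cyl
```

The result has one row per level of cyl rather than a single row for the whole dataset.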

We can get a summary of all the variables using the function summary.

summary(mtcars1)
##      cars                mpg        cyl         disp             hp       
##  Length:32          Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0  
##  Class :character   1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5  
##  Mode  :character   Median :19.20   8:14   Median :196.3   Median :123.0  
##                     Mean   :20.09          Mean   :230.7   Mean   :146.7  
##                     3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0  
##                     Max.   :33.90          Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec       vs     am     gear  
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   0:18   0:19   3:15  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1:14   1:13   4:12  
##  Median :3.695   Median :3.325   Median :17.71                 5: 5  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85                       
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90                       
##  Max.   :4.930   Max.   :5.424   Max.   :22.90                       
##  carb  
##  1: 7  
##  2:10  
##  3: 3  
##  4:10  
##  6: 1  
##  8: 1

However, this does not give us a tidy dataset. Some alternatives are the skimr package and the ezsummary package.

# install.packages('skimr')
library(skimr)
skim(mtcars1)
## Skim summary statistics
##  n obs: 32 
##  n variables: 12 
## 
## ── Variable type:character ───────────────────────────────────────────────────────────────
##  variable missing complete  n min max empty n_unique
##      cars       0       32 32   7  19     0       32
## 
## ── Variable type:factor ──────────────────────────────────────────────────────────────────
##  variable missing complete  n n_unique                top_counts ordered
##        am       0       32 32        2       0: 19, 1: 13, NA: 0   FALSE
##      carb       0       32 32        6  2: 10, 4: 10, 1: 7, 3: 3   FALSE
##       cyl       0       32 32        3 8: 14, 4: 11, 6: 7, NA: 0   FALSE
##      gear       0       32 32        3 3: 15, 4: 12, 5: 5, NA: 0   FALSE
##        vs       0       32 32        2       0: 18, 1: 14, NA: 0   FALSE
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────
##  variable missing complete  n   mean     sd    p0    p25    p50    p75
##      disp       0       32 32 230.72 123.94 71.1  120.83 196.3  326   
##      drat       0       32 32   3.6    0.53  2.76   3.08   3.7    3.92
##        hp       0       32 32 146.69  68.56 52     96.5  123    180   
##       mpg       0       32 32  20.09   6.03 10.4   15.43  19.2   22.8 
##      qsec       0       32 32  17.85   1.79 14.5   16.89  17.71  18.9 
##        wt       0       32 32   3.22   0.98  1.51   2.58   3.33   3.61
##    p100     hist
##  472    ▇▆▁▂▅▃▁▂
##    4.93 ▃▇▁▅▇▂▁▁
##  335    ▃▇▃▅▂▃▁▁
##   33.9  ▃▇▇▇▃▂▂▂
##   22.9  ▃▂▇▆▃▃▁▁
##    5.42 ▃▃▃▇▆▁▁▂
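
If adding a package is not an option, a tidy summary table can also be assembled from the verbs in this chapter. The sketch below is one possible approach, not the method skimr uses internally. It assumes tidyr 1.0.0 or later for pivot_longer, and the column names avg, std_dev and med are arbitrary choices.

```r
library(dplyr)
library(tidyr)

# Reshape to one row per (variable, value) pair, then summarize per variable
tidy_summary <- mtcars %>%
  select(mpg, disp, hp) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(avg = mean(value), std_dev = sd(value), med = median(value))

tidy_summary
```

The result has one row per variable and one column per statistic, so it can be filtered, joined or plotted like any other dataset.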