Chapter 5 Describing data
Data cleaning and munging is important, yet tedious work. Now that we’re done with that, we can get to the fun part of exploring the data.
Our workhorse for this will be the dplyr package, part of the tidyverse. This package provides 5 basic verbs:
filter: filter a dataset by rows
select: select columns of a dataset
arrange: arrange rows of a dataset by values of some variables
group_by: split a dataset by values of some variables, so that we can apply verbs to each split
summarize: compute various summaries from the data
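As a quick illustration of the five verbs listed above, here is a minimal sketch on the built-in mtcars data. The object names (four_cyl, small, by_mpg, mpg_by_cyl) are made up for this example.

```r
library(dplyr)

# filter: keep only the rows for 4-cylinder cars
four_cyl <- mtcars %>% filter(cyl == 4)

# select: keep only three columns
small <- mtcars %>% select(mpg, cyl, hp)

# arrange: sort rows from highest to lowest mpg
by_mpg <- mtcars %>% arrange(desc(mpg))

# group_by + summarize: split by cylinder count,
# then compute one summary row per group
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg))

mpg_by_cyl
```

Each verb takes a dataset as its first argument and returns a new dataset, which is why the verbs chain together naturally with the pipe.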
dplyr also has verbs that let you stitch datasets together: left_join, right_join, inner_join, full_join, semi_join, anti_join, bind_rows and bind_cols.
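To give a feel for how the joins differ, here is a small sketch using two made-up tables (people and scores, both hypothetical) that share a key column id. It shows only three of the verbs; the others follow the same calling pattern.

```r
library(dplyr)

# Two small made-up tables that share the key column `id`
people <- tibble(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
scores <- tibble(id = c(2, 3, 4), score = c(90, 85, 70))

# All rows of people; score is NA where there is no match (id 1)
left_res <- left_join(people, scores, by = "id")

# Only the ids present in both tables (2 and 3)
inner_res <- inner_join(people, scores, by = "id")

# Rows of people that have no match in scores (id 1)
anti_res <- anti_join(people, scores, by = "id")

left_res
inner_res
anti_res
```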
I’ll be using the latest dplyr, version 0.8.0.9009. If you have an older version, most but not all things will work.
We’ll start with the built-in mtcars dataset.
library(tidyverse)
mtcars1 <- mtcars %>% rownames_to_column('cars') %>% as_tibble()
mtcars1
## # A tibble: 32 x 12
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valia… 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 Duste… 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 Merc … 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 Merc … 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 Merc … 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
Let’s get some summary statistics from this dataset. First, let’s compute the average mpg, displacement, and horsepower of these cars.
mtcars1 %>% summarize(mpg = mean(mpg, na.rm=T), disp = mean(disp, na.rm=T), hp = mean(hp, na.rm=T))
## # A tibble: 1 x 3
## mpg disp hp
## <dbl> <dbl> <dbl>
## 1 20.1 231. 147.
I’m adding the option na.rm=T to remove any missing values; if there is even a single missing value, the mean will be computed as NA.
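The effect of na.rm is easy to see on a tiny vector. The vector x below is made up for illustration.

```r
# A small vector with one missing value
x <- c(1, NA, 3)

mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2, the mean of the remaining values 1 and 3
```

Writing TRUE rather than T is slightly safer in scripts, since T is an ordinary variable that can be reassigned while TRUE is a reserved word.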
Most dplyr verbs also have scoped versions *_all, *_at, *_if, which can be useful. The *_all versions act upon all the columns, the *_at versions on specified columns (specified the same way as in select), and the *_if versions on columns with particular properties. So the above code could be written as
mtcars1 %>% summarize_at(vars(mpg, disp, hp), mean, na.rm = T)
## # A tibble: 1 x 3
## mpg disp hp
## <dbl> <dbl> <dbl>
## 1 20.1 231. 147.
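If you are on a newer dplyr than the one used here, the same summary can also be written with across(). This is a sketch that assumes dplyr 1.0.0 or later, where across() is available; the object names scoped and modern are made up for the comparison.

```r
library(dplyr)

# Scoped form, as in the text: apply mean() to three named columns
scoped <- mtcars %>%
  summarize_at(vars(mpg, disp, hp), mean, na.rm = TRUE)

# across() form (assumes dplyr >= 1.0.0)
modern <- mtcars %>%
  summarize(across(c(mpg, disp, hp), ~ mean(.x, na.rm = TRUE)))

scoped
modern
```

Both forms should give the same three column means; which one to use depends mainly on the dplyr version you have installed.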
Factors or categorical variables
R has a data type called a factor, which is meant for categorical or discrete variables; things like sex, race, and, in this data, cyl. We need to transform the column to this data type first.
mtcars1 <- mtcars1 %>%
  mutate(cyl = as.factor(cyl))
There are a few others that might need this treatment.
mtcars1 <- mtcars1 %>%
  mutate_at(vars(cyl, vs, am, gear, carb), as.factor)
Let’s see what we’ve done.
str(mtcars1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 12 variables:
## $ cars: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
## $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...
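A factor stores its categories as labelled levels rather than as numbers, which has a few practical consequences. The sketch below uses base R only; cyl_f is a made-up name for a factor copy of the cyl column.

```r
# Convert the numeric cyl column to a factor
cyl_f <- as.factor(mtcars$cyl)

levels(cyl_f)      # the distinct categories, stored as character labels
table(cyl_f)       # number of cars in each category
is.numeric(cyl_f)  # FALSE: a factor is no longer treated as numeric

# as.numeric() on a factor returns the internal level codes (1, 2, 3),
# not the original values; go through as.character() to recover them
codes  <- as.numeric(cyl_f)
values <- as.numeric(as.character(cyl_f))
```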
Now, let’s compute averages of all the non-factor, or numeric, variables.
mtcars1 %>% summarize_if(is.numeric, mean)
## # A tibble: 1 x 6
## mpg disp hp drat wt qsec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20.1 231. 147. 3.60 3.22 17.8
We can get a summary of all the variables using the function summary.
summary(mtcars1)
## cars mpg cyl disp hp
## Length:32 Min. :10.40 4:11 Min. : 71.1 Min. : 52.0
## Class :character 1st Qu.:15.43 6: 7 1st Qu.:120.8 1st Qu.: 96.5
## Mode :character Median :19.20 8:14 Median :196.3 Median :123.0
## Mean :20.09 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :472.0 Max. :335.0
## drat wt qsec vs am gear
## Min. :2.760 Min. :1.513 Min. :14.50 0:18 0:19 3:15
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1:14 1:13 4:12
## Median :3.695 Median :3.325 Median :17.71 5: 5
## Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :4.930 Max. :5.424 Max. :22.90
## carb
## 1: 7
## 2:10
## 3: 3
## 4:10
## 6: 1
## 8: 1
However, this does not give us a tidy dataset. Some alternatives are the skimr package and the ezsummary package.
# install.packages('skimr')
library(skimr)
skim(mtcars1)
## Skim summary statistics
## n obs: 32
## n variables: 12
##
## ── Variable type:character ───────────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## cars 0 32 32 7 19 0 32
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────
## variable missing complete n n_unique top_counts ordered
## am 0 32 32 2 0: 19, 1: 13, NA: 0 FALSE
## carb 0 32 32 6 2: 10, 4: 10, 1: 7, 3: 3 FALSE
## cyl 0 32 32 3 8: 14, 4: 11, 6: 7, NA: 0 FALSE
## gear 0 32 32 3 3: 15, 4: 12, 5: 5, NA: 0 FALSE
## vs 0 32 32 2 0: 18, 1: 14, NA: 0 FALSE
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## disp 0 32 32 230.72 123.94 71.1 120.83 196.3 326
## drat 0 32 32 3.6 0.53 2.76 3.08 3.7 3.92
## hp 0 32 32 146.69 68.56 52 96.5 123 180
## mpg 0 32 32 20.09 6.03 10.4 15.43 19.2 22.8
## qsec 0 32 32 17.85 1.79 14.5 16.89 17.71 18.9
## wt 0 32 32 3.22 0.98 1.51 2.58 3.33 3.61
## p100 hist
## 472 ▇▆▁▂▅▃▁▂
## 4.93 ▃▇▁▅▇▂▁▁
## 335 ▃▇▃▅▂▃▁▁
## 33.9 ▃▇▇▇▃▂▂▂
## 22.9 ▃▂▇▆▃▃▁▁
## 5.42 ▃▃▃▇▆▁▁▂
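For comparison, a tidy summary (one row per variable) can also be put together with dplyr and tidyr alone. This is only a sketch: it assumes tidyr 1.0.0 or later for pivot_longer(), and the name tidy_summary and the output column names are made up for this example.

```r
library(dplyr)
library(tidyr)

tidy_summary <- mtcars %>%
  # keep the continuous variables only
  select(mpg, disp, hp, drat, wt, qsec) %>%
  # reshape to long format: one row per (variable, value) pair
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  # one summary row per variable
  group_by(variable) %>%
  summarize(
    mean_value   = mean(value),
    sd_value     = sd(value),
    median_value = median(value)
  )

tidy_summary
```

Because the result is an ordinary data frame, it can be filtered, joined, or plotted like any other dataset.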