The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
9/25/2018
A set of R packages that:
Tidy datasets are all alike,
but every messy dataset is messy in its own way
Tidy data is a computer-friendly format based on the following characteristics:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
All other forms of data can be considered messy data.
There are many ways data can be messy. An incomplete list….
Country | < $10K | $10-20K | $20-50K | $50-100K | > $100K |
---|---|---|---|---|---|
India | 40 | 25 | 25 | 9 | 1 |
USA | 20 | 20 | 20 | 30 | 10 |
Country | Income | Percentage |
---|---|---|
India | < $10K | 40 |
USA | < $10K | 20 |
This is a case of reshaping, or melting, the data from a wide layout to a long one.
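As a minimal sketch of this reshaping step, the wide income table above can be melted into the Country / Income / Percentage layout with gather() from tidyr. The object names income_wide and income_long are illustrative assumptions; the values are copied from the table.

```r
library(tidyr)

# Wide table from above: one column per income bracket
income_wide <- data.frame(
  Country    = c("India", "USA"),
  `< $10K`   = c(40, 20),
  `$10-20K`  = c(25, 20),
  `$20-50K`  = c(25, 20),
  `$50-100K` = c(9, 30),
  `> $100K`  = c(1, 10),
  check.names = FALSE,        # keep the bracket labels as column names
  stringsAsFactors = FALSE
)

# Melt every column except Country into Income / Percentage pairs
income_long <- gather(income_wide, Income, Percentage, -Country)
head(income_long, 2)
```

Each country now contributes one row per income bracket, so the bracket becomes data rather than a column header.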
Country | Year | M_0-14 | F_0-14 | M_15-60 | F_15-60 | M_60+ | F_60+ |
---|---|---|---|---|---|---|---|
UK | 2010 | | | | | | |
UK | 2011 | | | | | | |
Country | Year | Gender | Age | Count |
---|---|---|---|---|
Separating columns into different variables
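A rough sketch of this step, assuming made-up counts (the table above has none) and only two of the gender/age columns: first melt the combined columns, then split the combined name into Gender and Age. The names pop_wide, pop_long and group are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts; only the column layout follows the table above
pop_wide <- data.frame(
  Country  = c("UK", "UK"),
  Year     = c(2010, 2011),
  `M_0-14` = c(10, 11),
  `F_0-14` = c(12, 13),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

pop_long <- pop_wide %>%
  gather(group, Count, -Country, -Year) %>%         # one row per column/cell
  separate(group, c("Gender", "Age"), sep = "_")    # "M_0-14" -> "M", "0-14"
pop_long
```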
The typical steps are
data(mtcars)
knitr::kable(head(mtcars, 3))
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
The car names are stored as rownames, so they are not in a column.
link <- 'https://dl.dropboxusercontent.com/s/pqavhcckshqxtjm/brca.csv'
download.file(link, 'brca.csv')
brca_data <- read.csv('brca.csv', stringsAsFactors=FALSE)
tidyverse package
The tidyverse package is a meta-package that installs a set of packages that are useful for data cleaning, data tidying and data munging (manipulating data to get a computationally “attractive” dataset).
tidyverse package
# install.packages('tidyverse')
library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
You can specify a function from a particular package as dplyr::filter. Note that there are two colons.
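To make the masking concrete, here is a small sketch that calls both functions named filter explicitly. The object names six_cyl and moving_avg are illustrative only.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
six_cyl <- dplyr::filter(mtcars, cyl == 6)
nrow(six_cyl)

# stats::filter() is a different function: it applies a linear filter
# to a numeric series (here, a two-point moving average)
moving_avg <- stats::filter(c(1, 2, 3, 4), rep(1 / 2, 2))
class(moving_avg)
```

Writing the package name in front removes any doubt about which filter is being called, whatever the order in which packages were loaded.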
tidyverse packages
Package | Description |
---|---|
ggplot2 | Data visualization (next week) |
tibble | data.frame on steroids |
tidyr | Data tidying (today) |
readr | Reading text files (CSV) |
purrr | Applying functions to data iteratively (later this semester) |
dplyr | Data cleaning and munging (today) |
stringr | String (character) manipulation |
forcats | Manipulating categorical variables |
tidyverse packages
Package | Description |
---|---|
readxl | Read Excel files |
haven | Read SAS, SPSS, Stata files |
lubridate | Deal with dates and times |
magrittr | Provides the pipe operator %>% |
glue | Makes pasting text and data easier |
library(tidyverse)
updated_cars <- mtcars %>%
  rownames_to_column(var = 'Model') %>%
  mutate(kmpg = mpg * 1.6) %>%
  select(Model, kmpg, cyl, disp) %>%
  filter(cyl == 6)
Model | kmpg | cyl | disp |
---|---|---|---|
Mazda RX4 | 33.60 | 6 | 160.0 |
Mazda RX4 Wag | 33.60 | 6 | 160.0 |
Hornet 4 Drive | 34.24 | 6 | 258.0 |
Valiant | 28.96 | 6 | 225.0 |
Merc 280 | 30.72 | 6 | 167.6 |
Merc 280C | 28.48 | 6 | 167.6 |
library(tidyverse)
updated_cars <- mtcars %>%                 # Take the data set
  rownames_to_column(var = 'Model') %>%    # Make rownames a column
  select(Model, mpg, cyl, disp) %>%        # Keep only certain columns
  mutate(kmpg = mpg * 1.6) %>%             # Create a new variable
  filter(cyl == 6)                         # Keep only certain rows
The idea is to use verbs to express operations on a dataset, so it is easier to express what you want to do in code.
The pipe operator %>% (technically from the package magrittr) takes a data.frame or tibble object on the left, then “pipes” it to a function that takes the data.frame object as its first argument.
mtcars %>% mutate(kmpg = mpg * 1.6)
would be the same as
mutate(mtcars, kmpg = mpg * 1.6)
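A quick way to convince yourself of this equivalence is to run both forms and compare the results. The names piped and nested are just for illustration.

```r
library(dplyr)

piped  <- mtcars %>% mutate(kmpg = mpg * 1.6)   # data on the left of the pipe
nested <- mutate(mtcars, kmpg = mpg * 1.6)      # data as the first argument

identical(piped, nested)
```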
mtcars %>%
  rownames_to_column(var = "Model") %>%
  select(Model:disp) %>%
  mutate(kmpg = mpg * 1.6) %>%
  filter(cyl == 6)
tmp <- rownames_to_column(mtcars, var="Model")
tmp2 <- select(tmp, Model:disp)
tmp3 <- mutate(tmp2, kmpg = mpg * 1.6)
tmp4 <- filter(tmp3, cyl == 6)
Both are fine, but I find that pipes help me translate my thoughts into code better.
updated_cars <- mtcars %>%
  rownames_to_column(var = "Model") %>%
  select(Model:disp) %>%
  mutate(kmpg = mpg * 1.6) %>%
  filter(cyl == 6)
mtcars[,'Model'] <- rownames(mtcars)
tmp <- mtcars[,c('Model','mpg','cyl','disp')]  # Can't use the : operator for names
tmp[,'kmpg'] <- tmp[,'mpg'] * 1.6              # Note we need to quote the names
updated_cars <- tmp[tmp[,'cyl'] == 6,]
The idea is to make it easier to write expressive code without getting too hung up on syntax.
mtcars
 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
mtcars %>% rownames_to_column(var = "Model")
Model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
mtcars %>%
  rownames_to_column(var = "Model") %>%
  select(Model:disp)
Model | mpg | cyl | disp |
---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 |
Mazda RX4 Wag | 21.0 | 6 | 160 |
Datsun 710 | 22.8 | 4 | 108 |
mtcars %>%
  rownames_to_column(var = "Model") %>%
  select(Model:disp) %>%
  mutate(kmpg = mpg * 1.6)
Model | mpg | cyl | disp | kmpg |
---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 33.60 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 33.60 |
Datsun 710 | 22.8 | 4 | 108 | 36.48 |
mtcars %>%
  rownames_to_column(var = "Model") %>%
  select(Model:disp) %>%
  mutate(kmpg = mpg * 1.6) %>%
  filter(cyl == 6)
Model | mpg | cyl | disp | kmpg |
---|---|---|---|---|
Mazda RX4 | 21 | 6 | 160 | 33.6 |
Mazda RX4 Wag | 21 | 6 | 160 | 33.6 |
link <- 'https://dl.dropboxusercontent.com/s/pqavhcckshqxtjm/brca.csv'
download.file(link, 'brca.csv')
brca_data <- read.csv('brca.csv', stringsAsFactors=FALSE)
library(tidyverse)
names(brca_data)
##  [1] "id"                      "diagnosis"
##  [3] "radius_mean"             "texture_mean"
##  [5] "perimeter_mean"          "area_mean"
##  [7] "smoothness_mean"         "compactness_mean"
##  [9] "concavity_mean"          "concave.points_mean"
## [11] "symmetry_mean"           "fractal_dimension_mean"
## [13] "radius_se"               "texture_se"
## [15] "perimeter_se"            "area_se"
## [17] "smoothness_se"           "compactness_se"
## [19] "concavity_se"            "concave.points_se"
## [21] "symmetry_se"             "fractal_dimension_se"
## [23] "radius_worst"            "texture_worst"
## [25] "perimeter_worst"         "area_worst"
## [27] "smoothness_worst"        "compactness_worst"
## [29] "concavity_worst"         "concave.points_worst"
## [31] "symmetry_worst"          "fractal_dimension_worst"
## [33] "X"
library(tidyverse)
brca_data
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | X |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.8 | 1001 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.6 | 2019 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NA |
842517 | M | 20.57 | 17.77 | 132.9 | 1326 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.8 | 1956 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NA |
84300903 | M | 19.69 | 21.25 | 130.0 | 1203 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.5 | 1709 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NA |
library(tidyverse)
brca_data %>%
  select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'),
         -starts_with('fractal'))  # removes columns starting with "fractal"
id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
842302 | M | 17.99 | 10.38 | 122.8 | 1001 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 25.38 | 17.33 | 184.6 | 2019 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 |
842517 | M | 20.57 | 17.77 | 132.9 | 1326 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 24.99 | 23.41 | 158.8 | 1956 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 |
84300903 | M | 19.69 | 21.25 | 130.0 | 1203 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 23.57 | 25.53 | 152.5 | 1709 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 |
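The selection helpers can be tried without downloading the file. The sketch below uses a toy data frame whose column names mimic a few of those in brca_data; the values and the names toy and kept are made up for illustration.

```r
library(dplyr)

# Toy data frame; column names mimic brca_data, values are made up
toy <- data.frame(
  id = 1:2,
  diagnosis = c("M", "B"),
  radius_mean = c(17.9, 13.5),
  radius_se = c(1.1, 0.4),
  fractal_dimension_mean = c(0.08, 0.06),
  X = NA,
  stringsAsFactors = FALSE
)

kept <- toy %>%
  select(id, diagnosis,
         ends_with("mean"), ends_with("se"),   # pick columns by suffix
         -starts_with("fractal"))              # then drop by prefix
names(kept)
```

The empty X column is dropped simply because no helper matches it.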
library(tidyverse)
brca_data %>%
  select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'),
         -starts_with('fractal')) %>%
  gather(variable, value, -id, -diagnosis)  # operate on everything but id and diagnosis
The column names become the values of variable, and every value stays aligned with its id and diagnosis.
id | diagnosis | variable | value |
---|---|---|---|
842302 | M | radius_mean | 17.99 |
842517 | M | radius_mean | 20.57 |
84300903 | M | radius_mean | 19.69 |
library(tidyverse)
brca_data %>%
  select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'),
         -starts_with('fractal')) %>%
  gather(variable, value, -id, -diagnosis) %>%
  separate(variable, c("Variable","stat"), sep="_", remove = TRUE)
Split variable into 2 columns, Variable and stat.
id | diagnosis | Variable | stat | value |
---|---|---|---|---|
842302 | M | radius | mean | 17.99 |
842517 | M | radius | mean | 20.57 |
84300903 | M | radius | mean | 19.69 |
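A small sketch of what separate() does with sep = "_", using made-up rows. It also shows what happens to a name with two underscores such as fractal_dimension_mean, which is presumably why the fractal columns were dropped earlier (that motive is a guess, not something the slides state). The names long and split_cols are hypothetical.

```r
library(tidyr)

# Made-up long-format rows shaped like the gather() output above
long <- data.frame(
  id = c(1, 1),
  variable = c("radius_mean", "fractal_dimension_mean"),
  value = c(17.9, 0.08),
  stringsAsFactors = FALSE
)

# The second name splits into three pieces; by default separate()
# warns and drops the extra piece, so the warning is silenced here
split_cols <- suppressWarnings(
  separate(long, variable, c("Variable", "stat"), sep = "_")
)
split_cols
```

For the second row, stat ends up holding "dimension" instead of "mean", so names with extra underscores need to be removed or renamed before splitting.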
library(tidyverse)
brca_data %>%
  select(id, diagnosis, ends_with('mean'), ends_with('se'), ends_with('worst'),
         -starts_with('fractal')) %>%
  gather(variable, value, -id, -diagnosis) %>%
  separate(variable, c("Variable","stat"), sep="_", remove = TRUE) %>%
  spread(stat, value)
id | diagnosis | Variable | mean | se | worst |
---|---|---|---|---|---|
8670 | M | area | 748.90000 | 48.31000 | 1156.0000 |
8670 | M | compactness | 0.12230 | 0.01484 | 0.2394 |
8670 | M | concave.points | 0.08087 | 0.01093 | 0.1514 |
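The last step, spread(), can be sketched on a few made-up rows in the id / Variable / stat / value layout. The names long and wide are illustrative only.

```r
library(tidyr)

# Made-up values in the id / Variable / stat / value layout shown above
long <- data.frame(
  id = c(1, 1, 1),
  Variable = "radius",
  stat = c("mean", "se", "worst"),
  value = c(17.9, 1.1, 25.4),
  stringsAsFactors = FALSE
)

# Each distinct value of stat becomes its own column, filled from value
wide <- spread(long, stat, value)
wide

# In tidyr 1.0.0 and later, pivot_wider() is the recommended successor:
# wide <- pivot_wider(long, names_from = stat, values_from = value)
```

The three rows collapse into a single row per id and Variable, with mean, se and worst side by side.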