Pipelines and workflows

class: center, middle, inverse, title-slide

# Pipelines and workflows
### Abhijit Dasgupta
### BIOF 339

---

layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

---
class: inverse, middle, center

# Pipes in the tidyverse

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

## Pipes

---

We've seen two types of pipes in R.

.pull-left[
The pipe operator `%>%` from the **magrittr** package

```r
library(tidyverse) # includes magrittr
library(palmerpenguins)

penguins %>% 
  group_by(species) %>% 
  mutate(across(bill_length_mm:body_mass_g,
                function(x) replace_na(x, mean(x, na.rm=T)))) %>% 
  ungroup() %>% 
  summarise(across(bill_length_mm:body_mass_g, median))
```

]
.pull-right[
The `+` symbol used as a pipe-like operator  in **ggplot2**

```r
ggplot(penguins, 
       aes(x = bill_length_mm,
           y = body_mass_g))+
  geom_point(aes(color = species, shape = island))
```

]

---

You can combine the two pipes into a workflow to create a visualization

.pull-left[

```r
penguins %>% 
  group_by(species) %>% 
  mutate(across(bill_length_mm:body_mass_g,
                function(x) replace_na(x, mean(x, na.rm=T)))) %>% 
  ungroup() %>% 
  ggplot(aes(x = bill_length_mm,
*            y = body_mass_g)) +
    geom_point(aes(shape = island,
                   color = species))+
    labs(x = 'Bill length(mm)',
         y = 'Body mass (g)') +
    hrbrthemes::theme_ipsum()
```

The **ggplot** pipe has to be at the end of the workflow. Also note, we're not adding the data argument to `ggplot` since it is tidyverse-compatible and slots the end of the previous pipe into the `data` argument
]
.pull-right[
![](04-pipes_files/figure-html/unnamed-chunk-5-1.png)
]

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

## Rowwise operations

---

The **dplyr** package allows you to do rowwise operations much more easily than before within a pipe using
the `rowwise` function. For example

.pull-left[

```r
mpg %>%
  select(manufacturer, year, cty, hwy) %>% 
  rowwise() %>% 
  mutate(avg_mpg = mean(c(hwy, cty)))
```

```
# A tibble: 234 × 5
# Rowwise: 
   manufacturer  year   cty   hwy avg_mpg
   <chr>        <int> <int> <int>   <dbl>
 1 audi          1999    18    29    23.5
 2 audi          1999    21    29    25  
 3 audi          2008    20    31    25.5
 4 audi          2008    21    30    25.5
 5 audi          1999    16    26    21  
 6 audi          1999    18    26    22  
 7 audi          2008    18    27    22.5
 8 audi          1999    18    26    22  
 9 audi          1999    16    25    20.5
10 audi          2008    20    28    24  
# … with 224 more rows
```
]
.pull-right[
The `rowwise` function creates groups, one per row, and allows operations to occur along rows and across columns.

> What would the result be if you omitted the `rowwise` function in the pipe?

]

---

If you want to continue the pipe to incorporate the more traditonal column-wise operations, you need to use `ungroup` before proceeding

.pull-left[

```r
mpg %>% 
  select(manufacturer, year, cty, hwy) %>% 
  rowwise() %>% 
  mutate(avg_mpg = mean(c(hwy, cty))) %>% 
* ungroup() %>%
  ggplot(aes(x = avg_mpg)) +
    geom_histogram(bins = 50)+
    ggthemes::theme_few()
```
]
.pull-right[
![](04-pipes_files/figure-html/unnamed-chunk-8-1.png)
]

---

There are some nice shortcuts, in line with the `select` function, even with rowwise operations

.pull-left[

```r
diamonds %>% 
  select(carat, x:z) %>% 
  rowwise() %>% 
* mutate(vol = prod(c_across(x:z))) %>%
  ungroup() %>% 
  ggplot(aes(x = vol,
             y = carat))+
    geom_point() + 
    ggthemes::theme_fivethirtyeight()
```

.footnote[Much more details about the possibilities of the `rowwise` function are available [here](https://dplyr.tidyverse.org/articles/rowwise.html)]
]
.pull-right[
![](04-pipes_files/figure-html/unnamed-chunk-9-1.png)

]

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

---
class: inverse, center, middle

# Prepping data for modeling

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

## Recipes

---

.acid[
The idea of the **recipes** package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. “feature engineering”)
]

This is done in the context of supervised modeling, e.g. regression, decision trees

The idea is to define the dependent and independent variables, and then creating a pipeline to 
modify the independent variables through various statistical procedures.

---

We'll start with the credit data in the **modeldata** package

```r
library(recipes)
library(modeldata)
data("credit_data")

glimpse(credit_data)
```

```
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital   <fct> married, widow, married, single, single, married, married, s…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…
```

---

Create an initial recipe based on the model that will be fit

```r
rec <- recipe(Status ~ Seniority + Time + Age + Records, data = credit_data)
```

.pull-left[

```r
rec
```

```
Data Recipe

Inputs:

role #variables
   outcome          1
 predictor          4
```

]
.pull-right[

```r
summary(rec, original=TRUE)   
```

```
# A tibble: 5 × 4
  variable  type    role      source  
  <chr>     <chr>   <chr>     <chr>   
1 Seniority numeric predictor original
2 Time      numeric predictor original
3 Age       numeric predictor original
4 Records   nominal predictor original
5 Status    nominal outcome   original
```

]

---

.pull-left[
Add a step to convert nominal variables into dummies

```r
(dummied <- rec %>% step_dummy(Records))
```

```
Data Recipe

Inputs:

role #variables
   outcome          1
 predictor          4

Operations:

Dummy variables from Records
```
]
.pull-right[
Then apply it to your data

```r
dummied <- prep(dummied, training = credit_data)
with_dummy <- bake(dummied, new_data = credit_data)
head(with_dummy)
```

```
# A tibble: 6 × 5
  Seniority  Time   Age Status Records_yes
      <int> <int> <int> <fct>        <dbl>
1         9    60    30 good             0
2        17    60    58 good             0
3        10    36    46 bad              1
4         0    60    24 good             0
5         0    36    26 good             0
6         1    60    36 good             0
```

]

---

The **recipes** package provides a rich variety of data steps that can be used to prepare a data set.

```r
iris_recipe <- iris %>% 
  recipe(Species ~ .) %>% 
  step_corr(all_predictors()) %>% 
  step_center(all_predictors(), -all_outcomes()) %>% 
  step_scale(all_predictors() , -all_outcomes()) %>% 
  prep()
iris_recipe
```

```
Data Recipe

Inputs:

role #variables
   outcome          1
 predictor          4

Training data contained 150 data points and no missing data.

Operations:

Correlation filter removed Petal.Length [trained]
Centering for Sepal.Length, Sepal.Width, Petal.Width [trained]
Scaling for Sepal.Length, Sepal.Width, Petal.Width [trained]
```

---

This recipe can then be applied to the same or a different dataset

```r
iris1 <- bake(iris_recipe, iris)
glimpse(iris1)
```

```
Rows: 150
Columns: 4
$ Sepal.Length <dbl> -0.89767388, -1.13920048, -1.38072709, -1.50149039, -1.01…
$ Sepal.Width  <dbl> 1.01560199, -0.13153881, 0.32731751, 0.09788935, 1.245030…
$ Petal.Width  <dbl> -1.3110521, -1.3110521, -1.3110521, -1.3110521, -1.311052…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
```

.footnote[You can go into more details at [tidymodels.org](https://www.tidymodels.org/), 
with a nice introduction [here](https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/)]

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

---
class: inverse, middle, center

# Workflows

---
layout: true

<div class="my-header">
<span>BIOF 339: Practical R</span>
</div>

## Workflows

---
background-image: url(../img/0246OS_00_02.png)
background-size: contain

---
background-image: url(../img/data-science-explore.png)
background-size: contain

---
background-image: url(../img/tidypipeline.jpg)
background-size: contain

---

+ Create one script file for each node in your workflow
+ Save intermediate data or objects using `saveRDS` so that
    - they can be imported quickly by the next step
    - Each link in the chain can be checked and verified
+ You can summarize your entire workflow within one script:

```r
source('01-ingest.R')
source('02-munge.R')
source('03-exploreviz.R')
source('04-eda.R')
source('05-models.R')
source('06-results.R')
```

---

### A personal story

I wrote a paper using R Markdown with a reasonable pipeline for data analyses, modeling and visualization

Output to Word for submission to a journal

Three months later, reviews came in asking for using updated data

Changed the data at the beginning of my workflow, ran the workflow, and had revised manuscript in 10 minutes.

.center[.heat[Quickest turnaround ever!!]]

---

### Some ideas ([*Efficient Programming*](https://csgillespie.github.io/efficientR/workflow.html) by Gillespie and Lovelace)

1. Start without writing code but with a clear mind and perhaps a pen and paper. This will ensure you keep your objectives at the forefront of your mind, without getting lost in the technology.
1. Make a plan. The size and nature will depend on the project but timelines, resources and ‘chunking’ the work will make you more effective when you start.
1. Select the packages you will use for implementing the plan early. Minutes spent researching and selecting from the available options could save hours in the future.
1. Document your work at every stage; work can only be effective if it’s communicated clearly and code can only be efficiently understood if it’s commented.
1. Make your entire workflow as reproducible as possible. knitr can help with this in the phase of documentation.