Practical R: Data Ingestion

# Practical R: Data Ingestion
### Abhijit Dasgupta
### BIOF 339

---

## A quick refresh

+ We talked about various data structures in R
+ The primacy of the `data.frame`
  - Extracting individual variables from a data frame
    - `breast_cancer$ER.Status`, `breast_cancer[,'ER.Status']`, `breast_cancer[['ER.Status']]`
  - Extracting rows of a `data.frame`
+ Identifying data classes using the `class` function
+ Recognizing different classes: `numeric`, `character`, `factor`, `Date`, ..
  - testing for a class: `is.numeric`
  - converting to a class: `as.numeric`

---

## RMarkdown tip of the day

You can add options to each R chunk to add or suppress output

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Option </th>
   <th style="text-align:left;"> Property </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> echo=TRUE/FALSE </td>
   <td style="text-align:left;"> Does the document show the R code </td>
  </tr>
  <tr>
   <td style="text-align:left;"> eval=TRUE/FALSE </td>
   <td style="text-align:left;"> Does the chunk get evaluated by R </td>
  </tr>
  <tr>
   <td style="text-align:left;"> message=TRUE/FALSE </td>
   <td style="text-align:left;"> Do messages get printed </td>
  </tr>
  <tr>
   <td style="text-align:left;"> warning=TRUE/FALSE </td>
   <td style="text-align:left;"> Do warnings get printed </td>
  </tr>
</tbody>
</table>

You can also set these globally in a RMD file by putting the following in the first R chunk:

```r
knitr::opts_chunk$set(echo=T, eval=T, message=F, warning=F)
```

See [here](https://yihui.name/knitr/options/#chunk-options) for the full gory details

.footnote[Note that the correct way to write `TRUE` and `FALSE` is .heatinline[all caps]. They can be shortened to `T` and `F` respectively, but it's better to get used to the full word.]
---

## Package tip of the semester

Use

```r
library(tidyverse)
```
or

```r
pacman::p_load('tidyverse')
```

for pretty much every R script and R Markdown file (put this at the top of a script file, but after the header in a R Markdown)

---
class: middle, center

# Data ingestion

---

## Data ingestion

Unlike Excel, you have to pull data into R for R to operate on it

Typically your data is in some sort of file (Excel, csv, sas7bdat, dta, txt)

You need to find a way to pull it into R

The GUI you've used is one way, but not very programmatic

---

## Data ingestion

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Type </th>
   <th style="text-align:left;"> Function </th>
   <th style="text-align:left;"> Package </th>
   <th style="text-align:left;"> Notes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> read_csv </td>
   <td style="text-align:left;"> readr </td>
   <td style="text-align:left;"> Takes care of formatting </td>
  </tr>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> read.csv </td>
   <td style="text-align:left;"> base </td>
   <td style="text-align:left;"> Built in </td>
  </tr>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> fread </td>
   <td style="text-align:left;"> data.table </td>
   <td style="text-align:left;"> Fastest </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Excel </td>
   <td style="text-align:left;"> read_excel </td>
   <td style="text-align:left;"> readxl </td>
   <td style="text-align:left;">  </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sas7bdat </td>
   <td style="text-align:left;"> read_sas </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> SAS format </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sav </td>
   <td style="text-align:left;"> read_spss </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> SPSS format </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dta </td>
   <td style="text-align:left;"> read_dta </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> Stata format </td>
  </tr>
</tbody>
</table>

---

# Data ingestion

We will use [this](../data/BreastCancer_Clinical.csv) csv data and [this](../data/BreastCancer.xlsx) Excel data for the following:

```r
brca_clinical <- readr::read_csv('../data/BreastCancer_Clinical.csv')
brca_clinical2 <- data.table::fread('../data/BreastCancer_Clinical.csv')
```

```r
str(brca_clinical)
```

```
spec_tbl_df [77 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Complete TCGA ID                   : chr [1:77] "TCGA-A2-A0CM" "TCGA-BH-A18Q" "TCGA-A7-A0CE" "TCGA-D8-A142" ...
 $ Gender                             : chr [1:77] "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
 $ Age at Initial Pathologic Diagnosis: num [1:77] 40 56 57 74 61 67 45 48 59 36 ...
 $ ER Status                          : chr [1:77] "Negative" "Negative" "Negative" "Negative" ...
 $ PR Status                          : chr [1:77] "Negative" "Negative" "Negative" "Negative" ...
 $ HER2 Final Status                  : chr [1:77] "Negative" "Negative" "Negative" "Negative" ...
 $ Tumor                              : chr [1:77] "T2" "T2" "T2" "T3" ...
 $ Tumor--T1 Coded                    : chr [1:77] "T_Other" "T_Other" "T_Other" "T_Other" ...
 $ Node                               : chr [1:77] "N0" "N1" "N0" "N0" ...
 $ Node-Coded                         : chr [1:77] "Negative" "Positive" "Negative" "Negative" ...
 $ Metastasis                         : chr [1:77] "M0" "M0" "M0" "M0" ...
 $ Metastasis-Coded                   : chr [1:77] "Negative" "Negative" "Negative" "Negative" ...
 $ AJCC Stage                         : chr [1:77] "Stage IIA" "Stage IIB" "Stage IIA" "Stage IIB" ...
 $ Converted Stage                    : chr [1:77] "Stage IIA" "No_Conversion" "Stage IIA" "Stage IIB" ...
 $ Survival Data Form                 : chr [1:77] "followup" "enrollment" "followup" "followup" ...
 $ Vital Status                       : chr [1:77] "DECEASED" "DECEASED" "LIVING" "LIVING" ...
 $ Days to Date of Last Contact       : num [1:77] 754 1692 309 425 775 ...
 $ Days to date of Death              : num [1:77] 754 1692 NA NA NA ...
 $ OS event                           : num [1:77] 1 1 0 0 0 0 0 0 0 0 ...
 $ OS Time                            : num [1:77] 754 1692 309 425 775 ...
 $ PAM50 mRNA                         : chr [1:77] "Basal-like" "Basal-like" "Basal-like" "Basal-like" ...
 $ SigClust Unsupervised mRNA         : num [1:77] -12 -12 0 0 -12 -12 -12 -12 0 0 ...
 $ SigClust Intrinsic mRNA            : num [1:77] -13 -13 -13 -13 -13 -13 -13 -13 -13 -13 ...
 $ miRNA Clusters                     : num [1:77] 4 5 5 3 2 5 4 4 5 5 ...
 $ methylation Clusters               : num [1:77] 4 5 5 5 5 5 5 5 5 5 ...
 $ RPPA Clusters                      : chr [1:77] "Basal" "Basal" "Basal" "X" ...
 $ CN Clusters                        : num [1:77] 4 1 1 1 1 3 3 1 3 1 ...
 $ Integrated Clusters (with PAM50)   : num [1:77] 2 2 2 2 2 2 2 2 2 2 ...
 $ Integrated Clusters (no exp)       : num [1:77] 1 2 2 2 2 2 2 2 2 2 ...
 $ Integrated Clusters (unsup exp)    : num [1:77] 1 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "spec")=
  .. cols(
  ..   `Complete TCGA ID` = col_character(),
  ..   Gender = col_character(),
  ..   `Age at Initial Pathologic Diagnosis` = col_double(),
  ..   `ER Status` = col_character(),
  ..   `PR Status` = col_character(),
  ..   `HER2 Final Status` = col_character(),
  ..   Tumor = col_character(),
  ..   `Tumor--T1 Coded` = col_character(),
  ..   Node = col_character(),
  ..   `Node-Coded` = col_character(),
  ..   Metastasis = col_character(),
  ..   `Metastasis-Coded` = col_character(),
  ..   `AJCC Stage` = col_character(),
  ..   `Converted Stage` = col_character(),
  ..   `Survival Data Form` = col_character(),
  ..   `Vital Status` = col_character(),
  ..   `Days to Date of Last Contact` = col_double(),
  ..   `Days to date of Death` = col_double(),
  ..   `OS event` = col_double(),
  ..   `OS Time` = col_double(),
  ..   `PAM50 mRNA` = col_character(),
  ..   `SigClust Unsupervised mRNA` = col_double(),
  ..   `SigClust Intrinsic mRNA` = col_double(),
  ..   `miRNA Clusters` = col_double(),
  ..   `methylation Clusters` = col_double(),
  ..   `RPPA Clusters` = col_character(),
  ..   `CN Clusters` = col_double(),
  ..   `Integrated Clusters (with PAM50)` = col_double(),
  ..   `Integrated Clusters (no exp)` = col_double(),
  ..   `Integrated Clusters (unsup exp)` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
```
]
.pull-right[

```r
str(brca_clinical2)
```

```
Classes 'data.table' and 'data.frame':	77 obs. of  30 variables:
 $ Complete TCGA ID                   : chr  "TCGA-A2-A0CM" "TCGA-BH-A18Q" "TCGA-A7-A0CE" "TCGA-D8-A142" ...
 $ Gender                             : chr  "FEMALE" "FEMALE" "FEMALE" "FEMALE" ...
 $ Age at Initial Pathologic Diagnosis: int  40 56 57 74 61 67 45 48 59 36 ...
 $ ER Status                          : chr  "Negative" "Negative" "Negative" "Negative" ...
 $ PR Status                          : chr  "Negative" "Negative" "Negative" "Negative" ...
 $ HER2 Final Status                  : chr  "Negative" "Negative" "Negative" "Negative" ...
 $ Tumor                              : chr  "T2" "T2" "T2" "T3" ...
 $ Tumor--T1 Coded                    : chr  "T_Other" "T_Other" "T_Other" "T_Other" ...
 $ Node                               : chr  "N0" "N1" "N0" "N0" ...
 $ Node-Coded                         : chr  "Negative" "Positive" "Negative" "Negative" ...
 $ Metastasis                         : chr  "M0" "M0" "M0" "M0" ...
 $ Metastasis-Coded                   : chr  "Negative" "Negative" "Negative" "Negative" ...
 $ AJCC Stage                         : chr  "Stage IIA" "Stage IIB" "Stage IIA" "Stage IIB" ...
 $ Converted Stage                    : chr  "Stage IIA" "No_Conversion" "Stage IIA" "Stage IIB" ...
 $ Survival Data Form                 : chr  "followup" "enrollment" "followup" "followup" ...
 $ Vital Status                       : chr  "DECEASED" "DECEASED" "LIVING" "LIVING" ...
 $ Days to Date of Last Contact       : int  754 1692 309 425 775 964 1027 1288 1319 1471 ...
 $ Days to date of Death              : num  754 1692 NA NA NA ...
 $ OS event                           : int  1 1 0 0 0 0 0 0 0 0 ...
 $ OS Time                            : int  754 1692 309 425 775 964 1027 1288 1319 1471 ...
 $ PAM50 mRNA                         : chr  "Basal-like" "Basal-like" "Basal-like" "Basal-like" ...
 $ SigClust Unsupervised mRNA         : int  -12 -12 0 0 -12 -12 -12 -12 0 0 ...
 $ SigClust Intrinsic mRNA            : int  -13 -13 -13 -13 -13 -13 -13 -13 -13 -13 ...
 $ miRNA Clusters                     : int  4 5 5 3 2 5 4 4 5 5 ...
 $ methylation Clusters               : int  4 5 5 5 5 5 5 5 5 5 ...
 $ RPPA Clusters                      : chr  "Basal" "Basal" "Basal" "X" ...
 $ CN Clusters                        : int  4 1 1 1 1 3 3 1 3 1 ...
 $ Integrated Clusters (with PAM50)   : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Integrated Clusters (no exp)       : int  1 2 2 2 2 2 2 2 2 2 ...
 $ Integrated Clusters (unsup exp)    : int  1 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr> 
```
]

---

## A note on two "super"-data.frame objects

```
# A tibble: 6 × 30
  `Complete TCGA ID` Gender `Age at Initial Pathologic … `ER Status` `PR Status`
  <chr>              <chr>                         <dbl> <chr>       <chr>      
1 TCGA-A2-A0CM       FEMALE                           40 Negative    Negative   
2 TCGA-BH-A18Q       FEMALE                           56 Negative    Negative   
3 TCGA-A7-A0CE       FEMALE                           57 Negative    Negative   
4 TCGA-D8-A142       FEMALE                           74 Negative    Negative   
5 TCGA-AO-A0J6       FEMALE                           61 Negative    Negative   
6 TCGA-A2-A0YM       FEMALE                           67 Negative    Negative   
# … with 25 more variables: HER2 Final Status <chr>, Tumor <chr>,
#   Tumor--T1 Coded <chr>, Node <chr>, Node-Coded <chr>, Metastasis <chr>,
#   Metastasis-Coded <chr>, AJCC Stage <chr>, Converted Stage <chr>,
#   Survival Data Form <chr>, Vital Status <chr>,
#   Days to Date of Last Contact <dbl>, Days to date of Death <dbl>,
#   OS event <dbl>, OS Time <dbl>, PAM50 mRNA <chr>,
#   SigClust Unsupervised mRNA <dbl>, SigClust Intrinsic mRNA <dbl>, …
```
]
.pull-right[
A `data.table`

```
   Complete TCGA ID Gender Age at Initial Pathologic Diagnosis ER Status
1:     TCGA-A2-A0CM FEMALE                                  40  Negative
2:     TCGA-BH-A18Q FEMALE                                  56  Negative
3:     TCGA-A7-A0CE FEMALE                                  57  Negative
4:     TCGA-D8-A142 FEMALE                                  74  Negative
5:     TCGA-AO-A0J6 FEMALE                                  61  Negative
6:     TCGA-A2-A0YM FEMALE                                  67  Negative
   PR Status HER2 Final Status Tumor Tumor--T1 Coded Node Node-Coded Metastasis
1:  Negative          Negative    T2         T_Other   N0   Negative         M0
2:  Negative          Negative    T2         T_Other   N1   Positive         M0
3:  Negative          Negative    T2         T_Other   N0   Negative         M0
4:  Negative          Negative    T3         T_Other   N0   Negative         M0
5:  Negative          Negative    T2         T_Other   N0   Negative         M0
6:  Negative          Negative    T2         T_Other   N0   Negative         M0
   Metastasis-Coded AJCC Stage Converted Stage Survival Data Form Vital Status
1:         Negative  Stage IIA       Stage IIA           followup     DECEASED
2:         Negative  Stage IIB   No_Conversion         enrollment     DECEASED
3:         Negative  Stage IIA       Stage IIA           followup       LIVING
4:         Negative  Stage IIB       Stage IIB           followup       LIVING
5:         Negative  Stage IIA       Stage IIA           followup       LIVING
6:         Negative  Stage IIA       Stage IIA           followup       LIVING
   Days to Date of Last Contact Days to date of Death OS event OS Time
1:                          754                   754        1     754
2:                         1692                  1692        1    1692
3:                          309                    NA        0     309
4:                          425                    NA        0     425
5:                          775                    NA        0     775
6:                          964                    NA        0     964
   PAM50 mRNA SigClust Unsupervised mRNA SigClust Intrinsic mRNA miRNA Clusters
1: Basal-like                        -12                     -13              4
2: Basal-like                        -12                     -13              5
3: Basal-like                          0                     -13              5
4: Basal-like                          0                     -13              3
5: Basal-like                        -12                     -13              2
6: Basal-like                        -12                     -13              5
   methylation Clusters RPPA Clusters CN Clusters
1:                    4         Basal           4
2:                    5         Basal           1
3:                    5         Basal           1
4:                    5             X           1
5:                    5         Basal           1
6:                    5         Basal           3
   Integrated Clusters (with PAM50) Integrated Clusters (no exp)
1:                                2                            1
2:                                2                            2
3:                                2                            2
4:                                2                            2
5:                                2                            2
6:                                2                            2
   Integrated Clusters (unsup exp)
1:                               1
2:                               2
3:                               2
4:                               2
5:                               2
6:                               2
```
]

---

## A note on two "super"-data.frame objects

+ A `tibble` works pretty much like any `data.frame`, but the printing is a little saner
+ A `data.table` is faster, has more inherent functionality, but has a very different syntax

We'll work almost entirely with `tibble`'s and not `data.table`

Suggested modifications:

+ If using `fread`, convert the resulting object to a `data.frame` or `tibble` using `as_data_frame()` or `as_tibble()`
+ Convert the column names to not have spaces using, for example,

```r
brca_clinical <- janitor::clean_names(brca_clinical)
```

---
background-image: url(../img/janitor_clean_names.png)
background-size: contain

---

## Data ingestion

Note that you **have** to give a name to what you're importing using `read_*` or whatever you're using, otherwise it won't stay in R

```r
brca_clinical <- readr::read_csv('../data/BreastCancer_Clinical.csv')
```

![](../img/env.png)

> See what happens if you don't give a name to a dataset you ingest.

---

## Reading Excel

You can find the names of the sheets in an Excel file:

```r
readxl::excel_sheets('../data/BreastCancer.xlsx')
```

```
[1] "Cllinical"  "Expression"
```

So you can ingest a particular sheet from an Excel file using

```r
brca_expression <- readxl::read_excel('../data/BreastCancer.xlsx', sheet='Expression')
```

---
class: middle, center

# Data export

---

## Data export

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Type </th>
   <th style="text-align:left;"> Function </th>
   <th style="text-align:left;"> Package </th>
   <th style="text-align:left;"> Notes </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> write_csv </td>
   <td style="text-align:left;"> readr </td>
   <td style="text-align:left;"> Takes care of formatting </td>
  </tr>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> write.csv </td>
   <td style="text-align:left;"> base </td>
   <td style="text-align:left;"> Built in </td>
  </tr>
  <tr>
   <td style="text-align:left;"> csv </td>
   <td style="text-align:left;"> fwrite </td>
   <td style="text-align:left;"> data.table </td>
   <td style="text-align:left;"> Fastest </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Excel </td>
   <td style="text-align:left;"> write.xlsx </td>
   <td style="text-align:left;"> openxlsx </td>
   <td style="text-align:left;">  </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sas7bdat </td>
   <td style="text-align:left;"> write_sas </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> SAS format </td>
  </tr>
  <tr>
   <td style="text-align:left;"> sav </td>
   <td style="text-align:left;"> write_spss </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> SPSS format </td>
  </tr>
  <tr>
   <td style="text-align:left;"> dta </td>
   <td style="text-align:left;"> write_dta </td>
   <td style="text-align:left;"> haven </td>
   <td style="text-align:left;"> Stata format </td>
  </tr>
</tbody>
</table>

We'll often save tabular results using these functions

.footnote[These can also be useful for exporting results, but the R Markdown related packages are better for that]

---

<div class="my-header">
<span>BIOF 339: Practical R</span></div>

## Simplifying import/export

---

We'll be using a package that makes this easier.

It's called **rio** and it has two basic functions: `import` and `export`.

The `rio` package uses the different packages mentioned earlier but unifies it into a single syntax

For example:

```r
rio::import('data/clinical_data_breast_cancer_modified.csv')
```

**rio** reads the end of the file being imported or exported and decides which functions from which package should be used for the job.

**rio** accesses different packages that are right for each job, so you don't have to.

---

You can also import multiple sheets from Excel, or multiple objects from .RData files, into a list of data frames

```r
dat <- rio::import_list('data/BreastCancer.xlsx')
```

```r
class(dat)
```

```
[1] "list"
```

```r
names(dat)
```

```
[1] "Cllinical"  "Expression"
```

```r
map_chr(dat, class)
```

```
   Cllinical   Expression 
"data.frame" "data.frame" 
```
---

<div class="my-header">
<span>BIOF 339: Practical R</span></div>

---

## Saving your work

You would often like to store intermediate datasets, and final datasets, so that you can access them quickly.

There are several ways of saving even large datasets so that they can be quickly accessed.

| Function  | Package | Example                                  | Retrieving the stored data          |
|-----------|---------|------------------------------------------|-------------------------------------|
| saveRDS   | base    | `saveRDS(weather, file = 'weather.rds')` | `weather <- readRDS('weather.rds')` |
| write_fst | fst     | `write_fst(weather, file='weather.fst')` | `weather <- read_fst('weather.fst')` |

These methods are meant for storing .fatinline[single objects]

---

## Saving your work

If you want to store all of your objects into a single file, you can store them in a .RData file.

```r
save.image(file="<filename>.RData")
```

To keep multiple specified objects in a .RData file,

```r
save(<obj1>, <obj2>, <obj3>, file = "<filename>.RData")
```

------

## Retrieving your work

You can retrieve the objects in a .RData file using the function `load`.

```r
load(file = "<filename>.RData")
```

This will store each object in its original name in your R environment.