class: center, middle, inverse, title-slide # Practical R: Data Ingestion ### Abhijit Dasgupta ### BIOF 339 --- ## A quick refresh + We talked about various data structures in R + The primacy of the `data.frame` - Extracting individual variables from a data frame - `breast_cancer$ER.Status`, `breast_cancer[,'ER.Status']`, `breast_cancer[['ER.Status']]` - Extracting rows of a `data.frame` + Identifying data classes using the `class` function + Recognizing different classes: `numeric`, `character`, `factor`, `Date`, .. - testing for a class: `is.numeric` - converting to a class: `as.numeric` --- ## RMarkdown tip of the day You can add options to each R chunk to add or suppress output <table> <thead> <tr> <th style="text-align:left;"> Option </th> <th style="text-align:left;"> Property </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> echo=TRUE/FALSE </td> <td style="text-align:left;"> Does the document show the R code </td> </tr> <tr> <td style="text-align:left;"> eval=TRUE/FALSE </td> <td style="text-align:left;"> Does the chunk get evaluated by R </td> </tr> <tr> <td style="text-align:left;"> message=TRUE/FALSE </td> <td style="text-align:left;"> Do messages get printed </td> </tr> <tr> <td style="text-align:left;"> warning=TRUE/FALSE </td> <td style="text-align:left;"> Do warnings get printed </td> </tr> </tbody> </table> You can also set these globally in a RMD file by putting the following in the first R chunk: ```r knitr::opts_chunk$set(echo=T, eval=T, message=F, warning=F) ``` See [here](https://yihui.name/knitr/options/#chunk-options) for the full gory details .footnote[Note that the correct way to write `TRUE` and `FALSE` is .heatinline[all caps]. They can be shortened to `T` and `F` respectively, but it's better to get used to the full word.] --- ## Package tip of the semester Use ```r library(tidyverse) ``` or ```r pacman::p_load('tidyverse') ``` for pretty much every R script and R Markdown file (put this at the top of a script file, but after the header in a R Markdown) --- class: middle, center # Data ingestion --- ## Data ingestion Unlike Excel, you have to pull data into R for R to operate on it Typically your data is in some sort of file (Excel, csv, sas7bdat, dta, txt) You need to find a way to pull it into R The GUI you've used is one way, but not very programmatic --- ## Data ingestion <table> <thead> <tr> <th style="text-align:left;"> Type </th> <th style="text-align:left;"> Function </th> <th style="text-align:left;"> Package </th> <th style="text-align:left;"> Notes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> read_csv </td> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> Takes care of formatting </td> </tr> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> read.csv </td> <td style="text-align:left;"> base </td> <td style="text-align:left;"> Built in </td> </tr> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> fread </td> <td style="text-align:left;"> data.table </td> <td style="text-align:left;"> Fastest </td> </tr> <tr> <td style="text-align:left;"> Excel </td> <td style="text-align:left;"> read_excel </td> <td style="text-align:left;"> readxl </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> sas7bdat </td> <td style="text-align:left;"> read_sas </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> SAS format </td> </tr> <tr> <td style="text-align:left;"> sav </td> <td style="text-align:left;"> read_spss </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> SPSS format </td> </tr> <tr> <td style="text-align:left;"> dta </td> <td style="text-align:left;"> read_dta </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> Stata format </td> </tr> </tbody> </table> --- # Data ingestion We will use [this](../data/BreastCancer_Clinical.csv) csv data and [this](../data/BreastCancer.xlsx) Excel data for the following: ```r brca_clinical <- readr::read_csv('../data/BreastCancer_Clinical.csv') brca_clinical2 <- data.table::fread('../data/BreastCancer_Clinical.csv') ``` .pull-left[ ```r str(brca_clinical) ``` ``` spec_tbl_df [77 × 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ Complete TCGA ID : chr [1:77] "TCGA-A2-A0CM" "TCGA-BH-A18Q" "TCGA-A7-A0CE" "TCGA-D8-A142" ... $ Gender : chr [1:77] "FEMALE" "FEMALE" "FEMALE" "FEMALE" ... $ Age at Initial Pathologic Diagnosis: num [1:77] 40 56 57 74 61 67 45 48 59 36 ... $ ER Status : chr [1:77] "Negative" "Negative" "Negative" "Negative" ... $ PR Status : chr [1:77] "Negative" "Negative" "Negative" "Negative" ... $ HER2 Final Status : chr [1:77] "Negative" "Negative" "Negative" "Negative" ... $ Tumor : chr [1:77] "T2" "T2" "T2" "T3" ... $ Tumor--T1 Coded : chr [1:77] "T_Other" "T_Other" "T_Other" "T_Other" ... $ Node : chr [1:77] "N0" "N1" "N0" "N0" ... $ Node-Coded : chr [1:77] "Negative" "Positive" "Negative" "Negative" ... $ Metastasis : chr [1:77] "M0" "M0" "M0" "M0" ... $ Metastasis-Coded : chr [1:77] "Negative" "Negative" "Negative" "Negative" ... $ AJCC Stage : chr [1:77] "Stage IIA" "Stage IIB" "Stage IIA" "Stage IIB" ... $ Converted Stage : chr [1:77] "Stage IIA" "No_Conversion" "Stage IIA" "Stage IIB" ... $ Survival Data Form : chr [1:77] "followup" "enrollment" "followup" "followup" ... $ Vital Status : chr [1:77] "DECEASED" "DECEASED" "LIVING" "LIVING" ... $ Days to Date of Last Contact : num [1:77] 754 1692 309 425 775 ... $ Days to date of Death : num [1:77] 754 1692 NA NA NA ... $ OS event : num [1:77] 1 1 0 0 0 0 0 0 0 0 ... $ OS Time : num [1:77] 754 1692 309 425 775 ... $ PAM50 mRNA : chr [1:77] "Basal-like" "Basal-like" "Basal-like" "Basal-like" ... $ SigClust Unsupervised mRNA : num [1:77] -12 -12 0 0 -12 -12 -12 -12 0 0 ... $ SigClust Intrinsic mRNA : num [1:77] -13 -13 -13 -13 -13 -13 -13 -13 -13 -13 ... $ miRNA Clusters : num [1:77] 4 5 5 3 2 5 4 4 5 5 ... $ methylation Clusters : num [1:77] 4 5 5 5 5 5 5 5 5 5 ... $ RPPA Clusters : chr [1:77] "Basal" "Basal" "Basal" "X" ... $ CN Clusters : num [1:77] 4 1 1 1 1 3 3 1 3 1 ... $ Integrated Clusters (with PAM50) : num [1:77] 2 2 2 2 2 2 2 2 2 2 ... $ Integrated Clusters (no exp) : num [1:77] 1 2 2 2 2 2 2 2 2 2 ... $ Integrated Clusters (unsup exp) : num [1:77] 1 2 2 2 2 2 2 2 2 2 ... - attr(*, "spec")= .. cols( .. `Complete TCGA ID` = col_character(), .. Gender = col_character(), .. `Age at Initial Pathologic Diagnosis` = col_double(), .. `ER Status` = col_character(), .. `PR Status` = col_character(), .. `HER2 Final Status` = col_character(), .. Tumor = col_character(), .. `Tumor--T1 Coded` = col_character(), .. Node = col_character(), .. `Node-Coded` = col_character(), .. Metastasis = col_character(), .. `Metastasis-Coded` = col_character(), .. `AJCC Stage` = col_character(), .. `Converted Stage` = col_character(), .. `Survival Data Form` = col_character(), .. `Vital Status` = col_character(), .. `Days to Date of Last Contact` = col_double(), .. `Days to date of Death` = col_double(), .. `OS event` = col_double(), .. `OS Time` = col_double(), .. `PAM50 mRNA` = col_character(), .. `SigClust Unsupervised mRNA` = col_double(), .. `SigClust Intrinsic mRNA` = col_double(), .. `miRNA Clusters` = col_double(), .. `methylation Clusters` = col_double(), .. `RPPA Clusters` = col_character(), .. `CN Clusters` = col_double(), .. `Integrated Clusters (with PAM50)` = col_double(), .. `Integrated Clusters (no exp)` = col_double(), .. `Integrated Clusters (unsup exp)` = col_double() .. ) - attr(*, "problems")=<externalptr> ``` ] .pull-right[ ```r str(brca_clinical2) ``` ``` Classes 'data.table' and 'data.frame': 77 obs. of 30 variables: $ Complete TCGA ID : chr "TCGA-A2-A0CM" "TCGA-BH-A18Q" "TCGA-A7-A0CE" "TCGA-D8-A142" ... $ Gender : chr "FEMALE" "FEMALE" "FEMALE" "FEMALE" ... $ Age at Initial Pathologic Diagnosis: int 40 56 57 74 61 67 45 48 59 36 ... $ ER Status : chr "Negative" "Negative" "Negative" "Negative" ... $ PR Status : chr "Negative" "Negative" "Negative" "Negative" ... $ HER2 Final Status : chr "Negative" "Negative" "Negative" "Negative" ... $ Tumor : chr "T2" "T2" "T2" "T3" ... $ Tumor--T1 Coded : chr "T_Other" "T_Other" "T_Other" "T_Other" ... $ Node : chr "N0" "N1" "N0" "N0" ... $ Node-Coded : chr "Negative" "Positive" "Negative" "Negative" ... $ Metastasis : chr "M0" "M0" "M0" "M0" ... $ Metastasis-Coded : chr "Negative" "Negative" "Negative" "Negative" ... $ AJCC Stage : chr "Stage IIA" "Stage IIB" "Stage IIA" "Stage IIB" ... $ Converted Stage : chr "Stage IIA" "No_Conversion" "Stage IIA" "Stage IIB" ... $ Survival Data Form : chr "followup" "enrollment" "followup" "followup" ... $ Vital Status : chr "DECEASED" "DECEASED" "LIVING" "LIVING" ... $ Days to Date of Last Contact : int 754 1692 309 425 775 964 1027 1288 1319 1471 ... $ Days to date of Death : num 754 1692 NA NA NA ... $ OS event : int 1 1 0 0 0 0 0 0 0 0 ... $ OS Time : int 754 1692 309 425 775 964 1027 1288 1319 1471 ... $ PAM50 mRNA : chr "Basal-like" "Basal-like" "Basal-like" "Basal-like" ... $ SigClust Unsupervised mRNA : int -12 -12 0 0 -12 -12 -12 -12 0 0 ... $ SigClust Intrinsic mRNA : int -13 -13 -13 -13 -13 -13 -13 -13 -13 -13 ... $ miRNA Clusters : int 4 5 5 3 2 5 4 4 5 5 ... $ methylation Clusters : int 4 5 5 5 5 5 5 5 5 5 ... $ RPPA Clusters : chr "Basal" "Basal" "Basal" "X" ... $ CN Clusters : int 4 1 1 1 1 3 3 1 3 1 ... $ Integrated Clusters (with PAM50) : int 2 2 2 2 2 2 2 2 2 2 ... $ Integrated Clusters (no exp) : int 1 2 2 2 2 2 2 2 2 2 ... $ Integrated Clusters (unsup exp) : int 1 2 2 2 2 2 2 2 2 2 ... - attr(*, ".internal.selfref")=<externalptr> ``` ] --- ## A note on two "super"-data.frame objects .pull-left[ A `tibble` ``` # A tibble: 6 × 30 `Complete TCGA ID` Gender `Age at Initial Pathologic … `ER Status` `PR Status` <chr> <chr> <dbl> <chr> <chr> 1 TCGA-A2-A0CM FEMALE 40 Negative Negative 2 TCGA-BH-A18Q FEMALE 56 Negative Negative 3 TCGA-A7-A0CE FEMALE 57 Negative Negative 4 TCGA-D8-A142 FEMALE 74 Negative Negative 5 TCGA-AO-A0J6 FEMALE 61 Negative Negative 6 TCGA-A2-A0YM FEMALE 67 Negative Negative # … with 25 more variables: HER2 Final Status <chr>, Tumor <chr>, # Tumor--T1 Coded <chr>, Node <chr>, Node-Coded <chr>, Metastasis <chr>, # Metastasis-Coded <chr>, AJCC Stage <chr>, Converted Stage <chr>, # Survival Data Form <chr>, Vital Status <chr>, # Days to Date of Last Contact <dbl>, Days to date of Death <dbl>, # OS event <dbl>, OS Time <dbl>, PAM50 mRNA <chr>, # SigClust Unsupervised mRNA <dbl>, SigClust Intrinsic mRNA <dbl>, … ``` ] .pull-right[ A `data.table` ``` Complete TCGA ID Gender Age at Initial Pathologic Diagnosis ER Status 1: TCGA-A2-A0CM FEMALE 40 Negative 2: TCGA-BH-A18Q FEMALE 56 Negative 3: TCGA-A7-A0CE FEMALE 57 Negative 4: TCGA-D8-A142 FEMALE 74 Negative 5: TCGA-AO-A0J6 FEMALE 61 Negative 6: TCGA-A2-A0YM FEMALE 67 Negative PR Status HER2 Final Status Tumor Tumor--T1 Coded Node Node-Coded Metastasis 1: Negative Negative T2 T_Other N0 Negative M0 2: Negative Negative T2 T_Other N1 Positive M0 3: Negative Negative T2 T_Other N0 Negative M0 4: Negative Negative T3 T_Other N0 Negative M0 5: Negative Negative T2 T_Other N0 Negative M0 6: Negative Negative T2 T_Other N0 Negative M0 Metastasis-Coded AJCC Stage Converted Stage Survival Data Form Vital Status 1: Negative Stage IIA Stage IIA followup DECEASED 2: Negative Stage IIB No_Conversion enrollment DECEASED 3: Negative Stage IIA Stage IIA followup LIVING 4: Negative Stage IIB Stage IIB followup LIVING 5: Negative Stage IIA Stage IIA followup LIVING 6: Negative Stage IIA Stage IIA followup LIVING Days to Date of Last Contact Days to date of Death OS event OS Time 1: 754 754 1 754 2: 1692 1692 1 1692 3: 309 NA 0 309 4: 425 NA 0 425 5: 775 NA 0 775 6: 964 NA 0 964 PAM50 mRNA SigClust Unsupervised mRNA SigClust Intrinsic mRNA miRNA Clusters 1: Basal-like -12 -13 4 2: Basal-like -12 -13 5 3: Basal-like 0 -13 5 4: Basal-like 0 -13 3 5: Basal-like -12 -13 2 6: Basal-like -12 -13 5 methylation Clusters RPPA Clusters CN Clusters 1: 4 Basal 4 2: 5 Basal 1 3: 5 Basal 1 4: 5 X 1 5: 5 Basal 1 6: 5 Basal 3 Integrated Clusters (with PAM50) Integrated Clusters (no exp) 1: 2 1 2: 2 2 3: 2 2 4: 2 2 5: 2 2 6: 2 2 Integrated Clusters (unsup exp) 1: 1 2: 2 3: 2 4: 2 5: 2 6: 2 ``` ] --- ## A note on two "super"-data.frame objects + A `tibble` works pretty much like any `data.frame`, but the printing is a little saner + A `data.table` is faster, has more inherent functionality, but has a very different syntax We'll work almost entirely with `tibble`'s and not `data.table` -- Suggested modifications: + If using `fread`, convert the resulting object to a `data.frame` or `tibble` using `as_data_frame()` or `as_tibble()` + Convert the column names to not have spaces using, for example, ```r brca_clinical <- janitor::clean_names(brca_clinical) ``` --- background-image: url(../img/janitor_clean_names.png) background-size: contain --- ## Data ingestion Note that you **have** to give a name to what you're importing using `read_*` or whatever you're using, otherwise it won't stay in R ```r brca_clinical <- readr::read_csv('../data/BreastCancer_Clinical.csv') ``` ![](../img/env.png) > See what happens if you don't give a name to a dataset you ingest. --- ## Reading Excel You can find the names of the sheets in an Excel file: ```r readxl::excel_sheets('../data/BreastCancer.xlsx') ``` ``` [1] "Cllinical" "Expression" ``` So you can ingest a particular sheet from an Excel file using ```r brca_expression <- readxl::read_excel('../data/BreastCancer.xlsx', sheet='Expression') ``` --- class: middle, center # Data export --- ## Data export <table> <thead> <tr> <th style="text-align:left;"> Type </th> <th style="text-align:left;"> Function </th> <th style="text-align:left;"> Package </th> <th style="text-align:left;"> Notes </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> write_csv </td> <td style="text-align:left;"> readr </td> <td style="text-align:left;"> Takes care of formatting </td> </tr> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> write.csv </td> <td style="text-align:left;"> base </td> <td style="text-align:left;"> Built in </td> </tr> <tr> <td style="text-align:left;"> csv </td> <td style="text-align:left;"> fwrite </td> <td style="text-align:left;"> data.table </td> <td style="text-align:left;"> Fastest </td> </tr> <tr> <td style="text-align:left;"> Excel </td> <td style="text-align:left;"> write.xlsx </td> <td style="text-align:left;"> openxlsx </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> sas7bdat </td> <td style="text-align:left;"> write_sas </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> SAS format </td> </tr> <tr> <td style="text-align:left;"> sav </td> <td style="text-align:left;"> write_spss </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> SPSS format </td> </tr> <tr> <td style="text-align:left;"> dta </td> <td style="text-align:left;"> write_dta </td> <td style="text-align:left;"> haven </td> <td style="text-align:left;"> Stata format </td> </tr> </tbody> </table> We'll often save tabular results using these functions .footnote[These can also be useful for exporting results, but the R Markdown related packages are better for that] --- layout: true <div class="my-header"> <span>BIOF 339: Practical R</span></div> ## Simplifying import/export --- We'll be using a package that makes this easier. It's called **rio** and it has two basic functions: `import` and `export`. The `rio` package uses the different packages mentioned earlier but unifies it into a single syntax For example: ```r rio::import('data/clinical_data_breast_cancer_modified.csv') ``` -- **rio** reads the end of the file being imported or exported and decides which functions from which package should be used for the job. **rio** accesses different packages that are right for each job, so you don't have to. --- You can also import multiple sheets from Excel, or multiple objects from .RData files, into a list of data frames ```r dat <- rio::import_list('data/BreastCancer.xlsx') ``` ```r class(dat) ``` ``` [1] "list" ``` ```r names(dat) ``` ``` [1] "Cllinical" "Expression" ``` ```r map_chr(dat, class) ``` ``` Cllinical Expression "data.frame" "data.frame" ``` --- layout: true <div class="my-header"> <span>BIOF 339: Practical R</span></div> --- ## Saving your work You would often like to store intermediate datasets, and final datasets, so that you can access them quickly. There are several ways of saving even large datasets so that they can be quickly accessed. | Function | Package | Example | Retrieving the stored data | |-----------|---------|------------------------------------------|-------------------------------------| | saveRDS | base | `saveRDS(weather, file = 'weather.rds')` | `weather <- readRDS('weather.rds')` | | write_fst | fst | `write_fst(weather, file='weather.fst')` | `weather <- read_fst('weather.fst')` | These methods are meant for storing .fatinline[single objects] --- ## Saving your work If you want to store all of your objects into a single file, you can store them in a .RData file. ```r save.image(file="<filename>.RData") ``` To keep multiple specified objects in a .RData file, ```r save(<obj1>, <obj2>, <obj3>, file = "<filename>.RData") ``` ------ ## Retrieving your work You can retrieve the objects in a .RData file using the function `load`. ```r load(file = "<filename>.RData") ``` This will store each object in its original name in your R environment.