class: center, middle, inverse, title-slide # Joins, Split-Apply-Combine & MCPs ### Abhijit Dasgupta ### October 24, 2018 --- layout: true <div class="my-header"> <span>BIOF 339, Fall 2018</span></div> --- # Goals today + Learn how to join data sets (merging) + Split-apply-combine + Split a dataset into a list of several datasets + Do something to each dataset + Put the results back together + Use it for + Running tests for many variables + Understand why we need multiple comparison procedures (MCP) + Things to think about ??? We need to + put datasets capturing different attributes together to find a complete picture + evaluate different attributes to see if they contribute to our understanding + hedge our bets to ensure that what we find is signal, not chance --- # Data This data set is taken from a breast cancer proteome database available [here](https://www.kaggle.com/piotrgrabo/breastcancerproteomes) and modified for this exercise. + Clinical data: [CSV](lecture_joins_sac_data/BreastCancer_Clinical.csv)|[XLSX](lecture_joins_sac_data/BreastCancer_Clinical.xlsx) + Proteome data: [CSV](lecture_joins_sac_data/BreastCancer_Expression.csv)|[XLSX](lecture_joins_sac_data/BreastCancer_Expression.xlsx) --- class: inverse, middle, center # Joins --- # Putting data sets together + Quite often, data on individuals lie in different tables - Clinical, demographic and bioinformatic data -- - Drug, procedure, and payment data (think Medicare) -- - Personal health data across different healthcare entities --- # Joining data sets We already talked about `cbind` and `rbind`: .pull-left[ <span style="text-align:center;">`cbind`</span> <img src="lecture_joins_sac_data/addcol.png" width="640" /> ] .pull-right[ <span style="text-align:center;">`rbind`</span> <img src="lecture_joins_sac_data/addrow.png" width="640" /> ] --- # Joining data sets .pull-left[ We will talk about more general ways of joining two datasets. We will assume: 1. We have two rectangular data sets (so `data.frame` or `tibble`) 1. 
There is at least one variable (column) in common, even if they have different names - ID number - SSN (Social Security number) - Identifiable information ] .pull-right[ <img src="lecture_joins_sac_data/merge.png" height="10%"/> ] --- # Joining data sets <img width="100%" src="lecture_joins_sac_data/joins.png"/> -- <table width="100%"> <tr> <td style="text-align:center;">inner_join</td> <td style="text-align:center;">left_join</td> <td style="text-align:center;">right_join</td> <td style="text-align:center;">full_join</td> </tr></table> -- The "join condition" is the set of common variables in the two datasets, i.e. rows are matched when the values of the common variables in the left dataset match the values of the common variables in the right dataset --- ## Data example ```r library(tidyverse) library(readxl) clinical <- read_excel('lecture_joins_sac_data/BreastCancer_Clinical.xlsx') %>% set_names(str_replace_all(names(.), '[ -]+', '_')) proteome <- read_excel('lecture_joins_sac_data/BreastCancer_Expression.xlsx') %>% set_names(str_replace_all(names(.), '[ -]+', '_')) ``` .pull-left[ ``` # A tibble: 105 x 30 Complete_TCGA_ID Gender Age_at_Initial_… ER_Status PR_Status <chr> <chr> <dbl> <chr> <chr> 1 TCGA-A2-A0T2 FEMALE 66 Negative Negative 2 TCGA-A2-A0CM FEMALE 40 Negative Negative 3 TCGA-BH-A18V FEMALE 48 Negative Negative 4 TCGA-BH-A18Q FEMALE 56 Negative Negative 5 TCGA-BH-A0E0 FEMALE 38 Negative Negative 6 TCGA-A7-A0CE FEMALE 57 Negative Negative 7 TCGA-D8-A142 FEMALE 74 Negative Negative 8 TCGA-A2-A0D0 FEMALE 60 Negative Negative 9 TCGA-AO-A0J6 FEMALE 61 Negative Negative 10 TCGA-A2-A0YM FEMALE 67 Negative Negative # ... 
with 95 more rows, and 25 more variables: HER2_Final_Status <chr>, # Tumor <chr>, Tumor_T1_Coded <chr>, Node <chr>, Node_Coded <chr>, # Metastasis <chr>, Metastasis_Coded <chr>, AJCC_Stage <chr>, # Converted_Stage <chr>, Survival_Data_Form <chr>, Vital_Status <chr>, # Days_to_Date_of_Last_Contact <dbl>, Days_to_date_of_Death <dbl>, # OS_event <dbl>, OS_Time <dbl>, PAM50_mRNA <chr>, # SigClust_Unsupervised_mRNA <dbl>, SigClust_Intrinsic_mRNA <dbl>, # miRNA_Clusters <dbl>, methylation_Clusters <dbl>, RPPA_Clusters <chr>, # CN_Clusters <dbl>, `Integrated_Clusters_(with_PAM50)` <dbl>, # `Integrated_Clusters_(no_exp)` <dbl>, # `Integrated_Clusters_(unsup_exp)` <dbl> ``` ] .pull-right[ ``` # A tibble: 83 x 11 TCGA_ID NP_958782 NP_958785 NP_958786 NP_000436 NP_958781 NP_958780 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 TCGA-A… 1.10 1.11 1.11 1.11 1.12 1.11 2 TCGA-C… 2.61 2.65 2.65 2.65 2.65 2.65 3 TCGA-A… -0.660 -0.649 -0.654 -0.632 -0.640 -0.654 4 TCGA-B… 0.195 0.215 0.215 0.205 0.215 0.215 5 TCGA-C… -0.494 -0.504 -0.501 -0.510 -0.504 -0.504 6 TCGA-C… 2.77 2.78 2.78 2.80 2.79 2.78 7 TCGA-E… 0.863 0.870 0.870 0.866 0.870 0.870 8 TCGA-C… 1.41 1.41 1.41 1.41 1.41 1.41 9 TCGA-A… 1.19 1.19 1.19 1.19 1.20 1.19 10 TCGA-A… 1.10 1.10 1.10 1.10 1.09 1.10 # ... with 73 more rows, and 4 more variables: NP_958783 <dbl>, # NP_958784 <dbl>, NP_112598 <dbl>, NP_001611 <dbl> ``` ] --- ## Data example ```r library(readxl) clinical <- read_excel('lecture_joins_sac_data/BreastCancer_Clinical.xlsx') %>% set_names(str_replace_all(names(.), '[ -]+', '_')) proteome <- read_excel('lecture_joins_sac_data/BreastCancer_Expression.xlsx') %>% set_names(str_replace_all(names(.), '[ -]+', '_')) ``` .pull-left[ ``` # A tibble: 105 x 2 Complete_TCGA_ID Gender <chr> <chr> 1 TCGA-A2-A0T2 FEMALE 2 TCGA-A2-A0CM FEMALE 3 TCGA-BH-A18V FEMALE 4 TCGA-BH-A18Q FEMALE 5 TCGA-BH-A0E0 FEMALE 6 TCGA-A7-A0CE FEMALE 7 TCGA-D8-A142 FEMALE 8 TCGA-A2-A0D0 FEMALE 9 TCGA-AO-A0J6 FEMALE 10 TCGA-A2-A0YM FEMALE # ... 
with 95 more rows ``` ] .pull-right[ ``` # A tibble: 83 x 2 TCGA_ID NP_958782 <chr> <dbl> 1 TCGA-AO-A12D 1.10 2 TCGA-C8-A131 2.61 3 TCGA-AO-A12B -0.660 4 TCGA-BH-A18Q 0.195 5 TCGA-C8-A130 -0.494 6 TCGA-C8-A138 2.77 7 TCGA-E2-A154 0.863 8 TCGA-C8-A12L 1.41 9 TCGA-A2-A0EX 1.19 10 TCGA-AO-A12D 1.10 # ... with 73 more rows ``` ] -- We see that both have the same ID variable, but with different names and different orders ??? Let's keep only the first two columns so we can see the ID variable --- ## Data example Let's make sure that the ID's are truly IDs, i.e. each row has a unique value ```r length(unique(clinical$Complete_TCGA_ID)) == nrow(clinical) ``` ``` [1] TRUE ``` -- ```r length(unique(proteome$TCGA_ID)) == nrow(proteome) ``` ``` [1] FALSE ``` -- <div style="height:25%;margins:auto;"> <img style="display:block; margin:0 auto; height: 70%;" src="https://twitchy.com/wp-content/uploads/2015/04/screen-shot-2015-04-13-at-2-06-38-pm-300x300.png"/> </div> ??? We need the ID variables to be unique for each row. If we use multiple columns to define the "ID" then each row needs to have a unique set of values for those columns. Otherwise the joins get confused about which rows go with which rows. 
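---

## Aside: what a duplicated ID does to a join

A small toy sketch (hypothetical data, not the course data) of why duplicate IDs cause trouble: a join keeps **every** matching pair of rows, so a duplicated ID silently multiplies rows in the result

```r
library(dplyr)

# Toy tables: the ID "A" appears twice in the right-hand table
left_tbl  <- tibble(id = c("A", "B"),      x = 1:2)
right_tbl <- tibble(id = c("A", "A", "B"), y = 4:6)

# "A" matches two right-hand rows, so the result has 3 rows, not 2
inner_join(left_tbl, right_tbl, by = "id")
```

Deduplicating first, as on the next slide, avoids this row multiplication.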
--- ## Data example For convenience we'll keep the first instance for each ID in the `proteome` data ```r proteome <- proteome %>% filter(!duplicated(TCGA_ID)) ``` > `duplicated` = TRUE if a previous row contains the same value -- ```r length(unique(proteome$TCGA_ID)) == nrow(proteome) ``` ``` [1] TRUE ``` --- ## Inner join ```r common_rows <- inner_join(clinical[,1:6], proteome, by=c('Complete_TCGA_ID'='TCGA_ID')) ``` ``` # A tibble: 77 x 16 Complete_TCGA_ID Gender Age_at_Initial_… ER_Status PR_Status <chr> <chr> <dbl> <chr> <chr> 1 TCGA-A2-A0CM FEMALE 40 Negative Negative 2 TCGA-BH-A18Q FEMALE 56 Negative Negative 3 TCGA-A7-A0CE FEMALE 57 Negative Negative 4 TCGA-D8-A142 FEMALE 74 Negative Negative 5 TCGA-AO-A0J6 FEMALE 61 Negative Negative 6 TCGA-A2-A0YM FEMALE 67 Negative Negative 7 TCGA-A2-A0D2 FEMALE 45 Negative Negative 8 TCGA-A2-A0SX FEMALE 48 Negative Negative 9 TCGA-AO-A0JL FEMALE 59 Negative Negative 10 TCGA-AO-A12F FEMALE 36 Negative Negative # ... with 67 more rows, and 11 more variables: HER2_Final_Status <chr>, # NP_958782 <dbl>, NP_958785 <dbl>, NP_958786 <dbl>, NP_000436 <dbl>, # NP_958781 <dbl>, NP_958780 <dbl>, NP_958783 <dbl>, NP_958784 <dbl>, # NP_112598 <dbl>, NP_001611 <dbl> ``` -- Note that we have all the columns from both datasets, but only 77 rows, which is the common set of IDs from the two datasets -- > If you don't include the `by` option, R will attempt to match values of any columns with the same names --- ## Left join ```r left_rows <- left_join(clinical[,1:6], proteome, by=c('Complete_TCGA_ID'='TCGA_ID')) ``` ``` # A tibble: 105 x 16 Complete_TCGA_ID Gender Age_at_Initial_Pathologic_Diagnosis ER_Status <chr> <chr> <dbl> <chr> 1 TCGA-A2-A0T2 FEMALE 66 Negative 2 TCGA-A2-A0CM FEMALE 40 Negative 3 TCGA-BH-A18V FEMALE 48 Negative PR_Status HER2_Final_Status NP_958782 NP_958785 NP_958786 NP_000436 <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Negative Negative NA NA NA NA 2 Negative Negative 0.683 0.694 0.698 0.687 3 Negative Negative NA NA NA 
NA NP_958781 NP_958780 NP_958783 NP_958784 NP_112598 NP_001611 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 NA NA NA NA NA NA 2 0.687 0.698 0.698 0.698 -2.65 -0.984 3 NA NA NA NA NA NA # ... with 102 more rows ``` We get 105 rows, which is all the rows of `clinical`, combined with the rows of `proteome` with common IDs. The rest of the rows get `NA` for the proteome columns. --- ## Right join ```r right_rows <- right_join(clinical[,1:6], proteome, by=c('Complete_TCGA_ID'='TCGA_ID')) ``` ``` # A tibble: 80 x 16 Complete_TCGA_ID Gender Age_at_Initial_Pathologic_Diagnosis ER_Status <chr> <chr> <dbl> <chr> 1 TCGA-AO-A12D FEMALE 43 Negative 2 TCGA-C8-A131 FEMALE 82 Negative 3 TCGA-AO-A12B FEMALE 63 Positive PR_Status HER2_Final_Status NP_958782 NP_958785 NP_958786 NP_000436 <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Negative Positive 1.10 1.11 1.11 1.11 2 Negative Negative 2.61 2.65 2.65 2.65 3 Positive Negative -0.660 -0.649 -0.654 -0.632 NP_958781 NP_958780 NP_958783 NP_958784 NP_112598 NP_001611 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1.12 1.11 1.11 1.11 -1.52 0.483 2 2.65 2.65 2.65 2.65 3.91 -1.05 3 -0.640 -0.654 -0.649 -0.649 -0.618 1.22 # ... with 77 more rows ``` -- Here we get 80 rows, which is all the rows of `proteome`, along with the rows of `clinical` with common IDs, but with the columns of `clinical` appearing first. 
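A useful sanity check (a sketch, assuming `clinical` and `proteome` are loaded as above): a right join should contain the same rows as the left join with the two tables swapped; only the column order differs

```r
# These two calls should keep the same set of rows
r <- right_join(clinical[,1:6], proteome,
                by = c('Complete_TCGA_ID' = 'TCGA_ID'))
l <- left_join(proteome, clinical[,1:6],
               by = c('TCGA_ID' = 'Complete_TCGA_ID'))
nrow(r) == nrow(l)  # both should keep all 80 proteome IDs
```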
--- ## Outer/Full Join ```r full_rows <- full_join(clinical[,1:6], proteome, by=c('Complete_TCGA_ID'='TCGA_ID')) ``` ``` # A tibble: 108 x 16 Complete_TCGA_ID Gender Age_at_Initial_Pathologic_Diagnosis ER_Status <chr> <chr> <dbl> <chr> 1 TCGA-A2-A0T2 FEMALE 66 Negative 2 TCGA-A2-A0CM FEMALE 40 Negative 3 TCGA-BH-A18V FEMALE 48 Negative PR_Status HER2_Final_Status NP_958782 NP_958785 NP_958786 NP_000436 <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Negative Negative NA NA NA NA 2 Negative Negative 0.683 0.694 0.698 0.687 3 Negative Negative NA NA NA NA NP_958781 NP_958780 NP_958783 NP_958784 NP_112598 NP_001611 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 NA NA NA NA NA NA 2 0.687 0.698 0.698 0.698 -2.65 -0.984 3 NA NA NA NA NA NA # ... with 105 more rows ``` -- Here we obtain 108 rows and 16 columns. So we've expanded the data in both rows and columns, putting missing values in where needed. --- # Joins In each of `inner_join`, `left_join`, `right_join` and `full_join`, the number of columns always increases. There are also two joins where the number of columns doesn't increase. They aren't really "joins" in that sense, but really fancy filters on a dataset. <table> <thead> <tr> <th style="text-align:left;"> Join </th> <th style="text-align:left;"> Use </th> <th style="text-align:left;"> Description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> semi_join </td> <td style="text-align:left;"> semi_join(A,B) </td> <td style="text-align:left;"> Keep rows in A where ID matches some ID value in B </td> </tr> <tr> <td style="text-align:left;"> anti_join </td> <td style="text-align:left;"> anti_join(A,B) </td> <td style="text-align:left;"> Keep rows in A where ID does NOT match any ID value in B </td> </tr> </tbody> </table> These just filter the rows of `A` without adding any columns of `B`. 
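For example, with the course data (a sketch, assuming `clinical` and `proteome` are loaded as above), these two filters split the clinical table into patients with and without proteome measurements

```r
# Clinical rows that have a matching proteome record; note that no
# proteome columns are added
with_proteome <- semi_join(clinical, proteome,
                           by = c('Complete_TCGA_ID' = 'TCGA_ID'))

# Clinical rows with no matching proteome record
without_proteome <- anti_join(clinical, proteome,
                              by = c('Complete_TCGA_ID' = 'TCGA_ID'))

# Together the two pieces partition the 105 clinical rows
nrow(with_proteome) + nrow(without_proteome)  # 105
```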
--- class: inverse, middle, center ## Are there protein expression differences between ER +ve and ER -ve breast cancers --- # Create analytic dataset ```r final_data <- clinical %>% inner_join(proteome, by=c("Complete_TCGA_ID"="TCGA_ID")) %>% filter(Gender =='FEMALE') %>% select(Complete_TCGA_ID, Age_at_Initial_Pathologic_Diagnosis, ER_Status, starts_with("NP")) # grabs all the protein data ``` ``` # A tibble: 75 x 13 Complete_TCGA_ID Age_at_Initial_Pathologic_Diagnosis ER_Status NP_958782 <chr> <dbl> <chr> <dbl> 1 TCGA-A2-A0CM 40 Negative 0.683 2 TCGA-BH-A18Q 56 Negative 0.195 3 TCGA-A7-A0CE 57 Negative -1.12 NP_958785 NP_958786 NP_000436 NP_958781 NP_958780 NP_958783 NP_958784 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0.694 0.698 0.687 0.687 0.698 0.698 0.698 2 0.215 0.215 0.205 0.215 0.215 0.215 0.215 3 -1.12 -1.12 -1.13 -1.13 -1.12 -1.12 -1.12 NP_112598 NP_001611 <dbl> <dbl> 1 -2.65 -0.984 2 -1.04 -0.517 3 2.24 -2.58 # ... with 72 more rows ``` --- ## Protein-specific analyses We want to analyze each protein separately, while maintaining alignment with ER status and age. -- The R trick is to make this wide table long, so you can split on the rows ```r final_data2 <- final_data %>% gather(protein, expression, starts_with('NP')) %>% arrange(Complete_TCGA_ID) ``` ``` # A tibble: 750 x 5 Complete_TCGA_ID Age_at_Initial_Pathologic_Diagnosis ER_Status protein <chr> <dbl> <chr> <chr> 1 TCGA-A2-A0CM 40 Negative NP_958782 2 TCGA-A2-A0CM 40 Negative NP_958785 3 TCGA-A2-A0CM 40 Negative NP_958786 expression <dbl> 1 0.683 2 0.694 3 0.698 # ... 
with 747 more rows ``` --- # Split-apply-combine <img src="lecture_joins_sac_data/split-apply-combine.png" width="2956" /> --- ## Splitting data by protein There are two ways of doing this: ```r final_data2_grp <- final_data2 %>% group_by(protein) ``` or ```r final_data2_nest <- final_data2 %>% group_by(protein) %>% nest() ``` .right[.tiny[Updated this thanks to https://www.brodrigues.co/blog/2017-03-29-make-ggplot2-purrr/]] -- ``` # A tibble: 10 x 2 protein data <chr> <list> 1 NP_958782 <tibble [75 × 4]> 2 NP_958785 <tibble [75 × 4]> 3 NP_958786 <tibble [75 × 4]> 4 NP_000436 <tibble [75 × 4]> 5 NP_958781 <tibble [75 × 4]> 6 NP_958780 <tibble [75 × 4]> 7 NP_958783 <tibble [75 × 4]> 8 NP_958784 <tibble [75 × 4]> 9 NP_112598 <tibble [75 × 4]> 10 NP_001611 <tibble [75 × 4]> ``` -- This is an example of a `list-column`. We will actually use this form, since it's a bit clearer to understand ??? We've stored a list within a column, aligned to each protein. Advantages: 1. Visual ease and alignment 1. Can use `dplyr` verbs along with `purrr` functions We know how to apply a function over elements of a list. The base R way is using `lapply`, while the tidyverse way is to use `map`. These two functions are essentially the same, but the `map_*` functions make it much clearer than `lapply` what the type of output is. --- <img src="lecture_joins_sac_data/split-apply-combine.png" width="2956" /> ??? 
Now we have to apply some function to each split dataset that will get us the desired result --- class: inverse, middle, center # Side note: Functions --- # Functions Functions are **rules** written in R code that take some _input_ and give some _output_ ```r #' @param d A data.frame object #' #' @return The p-value for the 2-sided t-test comparing expression between ER_Status my_test <- function(d){ ttest <- t.test(expression ~ ER_Status, data = d) return(ttest$p.value) } ``` This function takes in a `data.frame`, does some operations on it (runs a t-test, and extracts the p-value) and returns a value (the p-value). > In general, a function can output any kind of R object. We'll learn by example, but for more details, see [this chapter](http://r4ds.had.co.nz/functions.html) in _R for Data Science_ by Wickham & Grolemund. We will **apply** this function to each split dataset in `final_data2_nest` --- ``` # A tibble: 10 x 2 protein data <chr> <list> 1 NP_958782 <tibble [75 × 4]> 2 NP_958785 <tibble [75 × 4]> 3 NP_958786 <tibble [75 × 4]> 4 NP_000436 <tibble [75 × 4]> 5 NP_958781 <tibble [75 × 4]> 6 NP_958780 <tibble [75 × 4]> 7 NP_958783 <tibble [75 × 4]> 8 NP_958784 <tibble [75 × 4]> 9 NP_112598 <tibble [75 × 4]> 10 NP_001611 <tibble [75 × 4]> ``` Let's take a look at an element in the `data` column ```r final_data2_nest$data[[1]] ``` ``` # A tibble: 75 x 4 Complete_TCGA_ID Age_at_Initial_Pathologic_Diagnosis ER_Status expression <chr> <dbl> <chr> <dbl> 1 TCGA-A2-A0CM 40 Negative 0.683 2 TCGA-A2-A0D2 45 Negative 0.107 3 TCGA-A2-A0EQ 64 Negative -0.913 # ... with 72 more rows ``` ??? 
Each element in the data column is just a tibble --- ## Applying a function to each split dataset We will use the function `purrr::map` to do this: ```r final_data2_nest %>% mutate(pval = map(data, my_test)) ``` ``` # A tibble: 10 x 3 protein data pval <chr> <list> <list> 1 NP_958782 <tibble [75 × 4]> <dbl [1]> 2 NP_958785 <tibble [75 × 4]> <dbl [1]> 3 NP_958786 <tibble [75 × 4]> <dbl [1]> 4 NP_000436 <tibble [75 × 4]> <dbl [1]> 5 NP_958781 <tibble [75 × 4]> <dbl [1]> 6 NP_958780 <tibble [75 × 4]> <dbl [1]> 7 NP_958783 <tibble [75 × 4]> <dbl [1]> 8 NP_958784 <tibble [75 × 4]> <dbl [1]> 9 NP_112598 <tibble [75 × 4]> <dbl [1]> 10 NP_001611 <tibble [75 × 4]> <dbl [1]> ``` --- <img src="lecture_joins_sac_data/split-apply-combine.png" width="2956" /> ??? Now we have to combine the results so that each protein gets a corresponding p-value --- ## Combining the split results ```r final_data2_nest %>% mutate(pval = map(data, my_test)) %>% mutate(pval = unlist(pval)) ``` ``` # A tibble: 10 x 3 protein data pval <chr> <list> <dbl> 1 NP_958782 <tibble [75 × 4]> 0.924 2 NP_958785 <tibble [75 × 4]> 0.934 3 NP_958786 <tibble [75 × 4]> 0.940 4 NP_000436 <tibble [75 × 4]> 0.922 5 NP_958781 <tibble [75 × 4]> 0.926 6 NP_958780 <tibble [75 × 4]> 0.930 7 NP_958783 <tibble [75 × 4]> 0.931 8 NP_958784 <tibble [75 × 4]> 0.931 9 NP_112598 <tibble [75 × 4]> 0.908 10 NP_001611 <tibble [75 × 4]> 0.000432 ``` -- This could be done in one operation, as well ```r final_data2_nest %>% mutate(pval = map_dbl(data, my_test)) ``` --- # What's `map` doing? ```r final_data2_nest %>% mutate(pval = map_dbl(data, my_test)) %>% head(3) ``` ``` # A tibble: 3 x 3 protein data pval <chr> <list> <dbl> 1 NP_958782 <tibble [75 × 4]> 0.924 2 NP_958785 <tibble [75 × 4]> 0.934 3 NP_958786 <tibble [75 × 4]> 0.940 ``` -- ```r my_test(final_data2_nest$data[[1]]) ``` ``` [1] 0.9238144 ``` -- ```r my_test(final_data2_nest$data[[2]]) ``` ``` [1] 0.9340165 ``` --- # The `map` function 1. 
**`map(data, my_test)`**: - `data` is a list of data.frames, and `my_test` is a function that takes a data.frame as input and produces some output, that is stored in a list 1. **`map(data, ~t.test(expression ~ ER_Status, data = .))`**: - Apply an anonymous function to each element of `data`, where the `.` serves as a place holder for an element of `data`. The anonymous function must start with a `~`. Note that the result of the anonymous function is the output of a `t.test`, which is a kind of object in R 1. **`map(data, "ER_Status")`**: - Extract the element `ER_Status` from each element of `data` --- ## Pipelining this process ```r final_data2 %>% nest(-protein) %>% mutate(test = map(data, ~t.test(expression ~ ER_Status, data = .))) %>% mutate(pval = map_dbl(test, 'p.value')) %>% head(3) ``` ``` # A tibble: 3 x 4 protein data test pval <chr> <list> <list> <dbl> 1 NP_958782 <tibble [75 × 4]> <S3: htest> 0.924 2 NP_958785 <tibble [75 × 4]> <S3: htest> 0.934 3 NP_958786 <tibble [75 × 4]> <S3: htest> 0.940 ``` -- Cleaning it up ```r final_data2 %>% nest(-protein) %>% mutate(test = map(data, ~t.test(expression ~ ER_Status, data = .))) %>% mutate(pval = map_dbl(test, 'p.value')) %>% * select(protein, pval) %>% head(3) ``` ``` # A tibble: 3 x 2 protein pval <chr> <dbl> 1 NP_958782 0.924 2 NP_958785 0.934 3 NP_958786 0.940 ``` --- class: inverse, middle, center # Multiple comparison procedures (MCP) --- ## Why do we need it? + Recall, in hypothesis tests, the Type I error (or false positive rate) is `$$\Pr(Reject\ H_0 | H_0\ is\ true)$$` This is typically limited by the testing procedure to 5%. - For the more technically interested, this is from the _Neyman-Pearson lemma_ + There is always a chance we are wrong!! --- ## Why do we need it? 
Imagine your test is like a biased coin, with heads being "Reject `\(H_0\)`" and tails being "Do not reject `\(H_0\)`" Now assume `\(H_0\)` is true, and you're doing multiple tests using the same data <table> <thead> <tr> <th style="text-align:left;"> Number of tests </th> <th style="text-align:left;"> Coin tosses </th> <th style="text-align:right;"> Pr(at least one head) </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 1 toss </td> <td style="text-align:right;"> 0.05 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:left;"> 2 tosses </td> <td style="text-align:right;"> 0.10 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:left;"> 5 tosses </td> <td style="text-align:right;"> 0.23 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:left;"> 10 tosses </td> <td style="text-align:right;"> 0.40 </td> </tr> <tr> <td style="text-align:left;"> 100 </td> <td style="text-align:left;"> 100 tosses </td> <td style="text-align:right;"> 0.99 </td> </tr> <tr> <td style="text-align:left;"> 1,000,000 </td> <td style="text-align:left;"> 1 million tosses </td> <td style="text-align:right;"> 1.00 </td> </tr> </tbody> </table> -- This means, if you're doing 1 million tests (like, e.g., a GWAS), the chance that you get **at least one false positive** is practically 1, i.e. a sure shot. This requires some kind of multiple comparisons adjustment so we don't make excessive errors and get fooled into thinking we have significant results --- ## Bonferroni correction If you have _n_ tests using the same data, then make sure that the Type I error is `\(0.05/n\)`. This means that for 100 tests, we'd reject the null hypothesis if the p-value was less than 0.0005 rather than 0.05. How does this help? 
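These family-wise error rates can be computed directly (a quick sketch; like the tables here, it assumes the tests are independent): the chance of at least one false positive in `\(n\)` tests, each at level `\(\alpha\)`, is `\(1 - (1-\alpha)^n\)`

```r
# Pr(at least one false positive) across n independent tests,
# each run at level alpha
fwer <- function(n, alpha = 0.05) 1 - (1 - alpha)^n

fwer(10)                   # ~0.401, the uncorrected rate for 10 tests
fwer(10, alpha = 0.05/10)  # ~0.049 after a Bonferroni correction
```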
<table> <thead> <tr> <th style="text-align:left;"> Number of tests </th> <th style="text-align:right;"> Nominal FP rate </th> <th style="text-align:right;"> Corrected FP rate </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 0.050 </td> <td style="text-align:right;"> 0.050 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 0.098 </td> <td style="text-align:right;"> 0.049 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 0.226 </td> <td style="text-align:right;"> 0.049 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 0.401 </td> <td style="text-align:right;"> 0.049 </td> </tr> <tr> <td style="text-align:left;"> 100 </td> <td style="text-align:right;"> 0.994 </td> <td style="text-align:right;"> 0.049 </td> </tr> <tr> <td style="text-align:left;"> 1000000 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.049 </td> </tr> </tbody> </table> The FP rate is the probability of getting at least one false positive. -- Realize that this is quite stringent, since we're not allowing any false positives. The Bonferroni correction is thus quite conservative. --- # Bonferroni correction There is a price to pay for this stringency, in terms of > Statistical Power = Pr(Reject `\(H_0\)` | `\(H_0\)` is FALSE) > > This is the chance that the test will reject the null when the null hypothesis is wrong. We would like this to be high, so that we can detect true differences. 
This is usually set at 80% <table> <thead> <tr> <th style="text-align:left;"> Number of tests </th> <th style="text-align:right;"> Nominal FP rate </th> <th style="text-align:right;"> Bonferroni-corrected FP rate </th> <th style="text-align:right;"> Statistical power </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 0.050 </td> <td style="text-align:right;"> 0.050 </td> <td style="text-align:right;"> 0.800 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 0.098 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.706 </td> </tr> <tr> <td style="text-align:left;"> 5 </td> <td style="text-align:right;"> 0.226 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.573 </td> </tr> <tr> <td style="text-align:left;"> 10 </td> <td style="text-align:right;"> 0.401 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.474 </td> </tr> <tr> <td style="text-align:left;"> 100 </td> <td style="text-align:right;"> 0.994 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.212 </td> </tr> <tr> <td style="text-align:left;"> 1000 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.076 </td> </tr> <tr> <td style="text-align:left;"> 10000 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.023 </td> </tr> <tr> <td style="text-align:left;"> 100000 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.006 </td> </tr> <tr> <td style="text-align:left;"> 1000000 </td> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.001 </td> </tr> </tbody> </table> --- # False discovery rates (FDR) The FDR is > the expected 
proportion of false positives (incorrectly rejected null hypotheses) among all the hypotheses we reject. So if we set our FDR threshold to 0.05 and we identify 100 positives (reject 100 tests), then on average 5 of those 100 will be false positives. ### Benjamini-Hochberg (BH) procedure This controls the FDR by ordering the p-values from highest to lowest and then judging their significance on a sliding scale. The adjusted values here are often referred to as _q-values_. -- FDR control procedures retain more statistical power than Bonferroni corrections. There are other variants of this like the Benjamini-Yekutieli (BY) procedure. --- # An example If we see 10 heads in a row, we can statistically test whether the coin is biased or not. ```r prop.test(x = 10, n = 10, p = 0.5) # x = heads, n = tosses ``` ``` 1-sample proportions test with continuity correction data: 10 out of 10, null probability 0.5 X-squared = 8.1, df = 1, p-value = 0.004427 alternative hypothesis: true p is not equal to 0.5 95 percent confidence interval: 0.6554628 1.0000000 sample estimates: p 1 ``` --- ### P-value adjustment .small[Suppose we want to test if pennies in circulation are biased (i.e. chance of a head is not 0.5). We collect 20,000 pennies and flip each of them 100 times, recording the number of heads. We can simulate this experiment in R using the `rbinom` function, use the `prop.test` function to test our hypothesis, and then get adjusted p-values using the `p.adjust` function.] ```r set.seed(1243) # Initialize the random number generator coinTable <- tibble(heads = rbinom(n = 20000, size = 100, prob = 0.5)) coinTable <- coinTable %>% mutate(pvals = map_dbl(heads, ~prop.test(., 100, 0.5)$p.value)) *coinTable <- coinTable %>% mutate(bonf = p.adjust(pvals, method = 'bonferroni')) %>% * mutate(q_value = p.adjust(pvals, method = 'fdr')) ``` ``` # A tibble: 20,000 x 4 heads pvals bonf q_value <int> <dbl> <dbl> <dbl> 1 56 0.271 1 0.996 2 47 0.617 1 0.996 3 48 0.764 1 0.996 # ... 
with 2e+04 more rows ``` .small[The proportion of p-values < 0.05 is:] ``` # A tibble: 1 x 3 pvals bonf q_value <dbl> <dbl> <dbl> 1 0.0356 0 0 ``` ??? + Note that `map` can take a vector as input, since a vector is converted internally into a list of numbers --- ### P-value adjustment If the coins are truly biased, we can see that the Bonferroni misses most of them whereas the q-value doesn't. ```r set.seed(1243) # Initialize the random number generator *coinTable <- tibble(heads = rbinom(n = 20000, size = 100, prob = 0.7)) coinTable <- coinTable %>% mutate(pvals = map_dbl(heads, ~prop.test(., 100, 0.5)$p.value)) coinTable <- coinTable %>% mutate(bonf = p.adjust(pvals, method = 'bonferroni')) %>% mutate(q_value = p.adjust(pvals, method = 'fdr')) ``` ``` # A tibble: 20,000 x 4 heads pvals bonf q_value <int> <dbl> <dbl> <dbl> 1 64 0.00693 1 0.00753 2 72 0.0000171 0.342 0.0000451 3 71 0.0000413 0.826 0.0000892 # ... with 2e+04 more rows ``` ```r coinTable %>% summarize_at(vars(-heads), funs(mean(. < 0.05))) ``` ``` # A tibble: 1 x 3 pvals bonf q_value <dbl> <dbl> <dbl> 1 0.979 0.168 0.979 ``` --- class: middle, center # Next week: Statistical models in R