Running many t-tests

Suppose you have an expression dataset, and want to run t-tests for several genes/probes comparing two conditions. We can use tidyverse concepts to do this. First, let’s get some data.

# BiocManager::install('affydata')
data(Dilution, package = 'affydata')
eset <- mas5(Dilution) # pre-proccessing
dat <- exprs(eset) # extract expressions
Dat <- as.data.frame(t(dat)) %>% rownames_to_column('sample') %>% as_tibble() %>% 
  mutate(scanner = as_factor(pData(Dilution)$scanner)) %>% # Add "scanner" from phenoData
  select(sample, scanner, everything()) # Re-order columns

Now, this dataset has 12625 probesets, and we want to run a t-test on each one, comparing the scanner types (which are of two types). There are a couple of ways to do this, which take about the same amount of time, and create a tibble of results, with one row for each probeset. The first is a direct application of purrr::map, acknowledging that a tibble or data.frame is basically a list of columns, and map can apply a function to each column and get the results.

results1 <- Dat %>% select(-sample, -scanner) %>% # Keep only probeset data
  map(~ tidy(t.test(. ~ Dat$scanner))) %>% # T-test of each column against scanner, and tidy
  bind_rows(.id = 'probe') # Puts it all together into a data frame
head(results1)
## # A tibble: 6 x 11
##   probe estimate estimate1 estimate2 statistic p.value parameter conf.low
##   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1 100_…     73.4     487.      414.      1.13  0.436        1.20   -490. 
## 2 1000…    108.      809.      700.    103.    0.00231      1.22     99.6
## 3 1001…    -23.9      75.9      99.8    -2.11  0.195        1.66    -83.6
## 4 1002…    -18.9     150.      169.     -1.57  0.341        1.12   -137. 
## 5 1003…    -23.8      42.7      66.5    -0.657 0.618        1.15   -362. 
## 6 1004…     45.4     186.      141.      1.75  0.299        1.19   -183. 
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>

The second approach is what is recommended in the broom package; you create a list-column of tibbles using nest, one per probeset, run t-tests on each tibble using map, and then unnest to put it all together. This is done after using gather to convert the dataset from wide to tall form.

results2 <- Dat %>% 
  gather(probe, value, -sample, -scanner) %>% # wide-to-tall
  nest(-probe) %>% # Create nested data frames using list-columns
  mutate(tests = map(data, ~tidy(t.test(value ~ scanner, data = .)))) %>% # run t-tests
  select(-data) %>% 
  unnest() # Create final data.frame
head(results2)
## # A tibble: 6 x 11
##   probe estimate estimate1 estimate2 statistic p.value parameter conf.low
##   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
## 1 100_…     73.4     487.      414.      1.13  0.436        1.20   -490. 
## 2 1000…    108.      809.      700.    103.    0.00231      1.22     99.6
## 3 1001…    -23.9      75.9      99.8    -2.11  0.195        1.66    -83.6
## 4 1002…    -18.9     150.      169.     -1.57  0.341        1.12   -137. 
## 5 1003…    -23.8      42.7      66.5    -0.657 0.618        1.15   -362. 
## 6 1004…     45.4     186.      141.      1.75  0.299        1.19   -183. 
## # ... with 3 more variables: conf.high <dbl>, method <chr>,
## #   alternative <chr>