In this homework you will have to submit 3 files. (1) The R script from the Data preparation section, (2) the Rmd file from the Data analysis section and (3) the knitted HTML file from the Data analysis section. The results of the data analysis section should all be present in the Rmd and HTML files. The R script should be well-commented so that the reader can understand what each section of code is supposed to do.
In this homework we will play with some publicly available RNA-Seq data. Information about the experiment is available here. Briefly, this is a mouse experiment where the authors were interested in PIK3CA mutation variants. This interest is derived from PIK3CA’s implication in glioblastoma tumorogenesis. There are 26 mice with different engineered genetics.
The data are available as a zip file here. The experimental design is available here.
Open a R script file in the RStudio Project you created. This will be Part 1 of your submission. Your code should be annotated with sufficient comments so that the reader/grader can understand what you are intending to do with each part of your code.
Read all 26 files into a list in R, where the list consists of 26 data frames
library(here)
library(rio)
dat <- import_list(dir(here('data/GSE123519_RAW/'), full.names = TRUE))
For each of the data frames perform the following data processing
tracking_id
, gene_id
, and locus
through FPKM
.gene_id
to capital letters (use stringr::str_to_upper
or other equivalent functions)locus
variable into a chromosome
and a location
variable.library(tidyverse)
dat1 <- dat %>%
map(select, tracking_id, gene_id, locus:FPKM) %>%
map(mutate, gene_id = stringr::str_to_upper(gene_id)) %>%
map(separate, locus, c('chromosome','location'), sep = ':') %>%
map(filter, FPKM >= 1)
Now make a single data set stacking the processed datasets, adding a column that specifies the sample identifier of each observation. You might like to call this column sampleID
to make the next step simpler
dat2 <- bind_rows(dat1, .id = 'sampleID') %>%
mutate(sampleID = str_extract(sampleID, 'GSM\\d+'))
Import the design information into a data frame. Add the design information to the genetic dataset you created above, ensuring that ids and genotype specification (stored in the column Class
) are aligned.
design <- import(here('data/GSE123519design.xlsx'))
dat2 <- dat2 %>% left_join(design)
Note that there is an add-on to the genotype information, in the form -1
, -2
to specify which particular sample it is. This is unnecessary and would create problems for any comparative analyses using t-tests or ANOVA later. Please remove it so that you just have genotype information.
dat2 <- dat2 %>%
mutate(Class = str_remove(Class, '-[0-9]'))
There is one idiosyncratic genotype specification which is a re-rerun. Fix this to match the other genotypes.
dat2 <- dat2 %>%
mutate(Class = str_remove(Class,'-re-run-59'))
Clean the chromosome
values so that you only have data of the form “chr##” where ## refers to the chromosome number.
dat2 <- dat2 %>%
mutate(chromosome = str_extract(chromosome, 'chr[[:alnum:]]+'))
Save the dataset using saveRDS
to a .rds file
saveRDS(dat2, 'gene_data.rds', compress = T)
You can choose whether you fix all the genotype information before or after you join the design dataset to the genetic dataset. Whatever makes sense to you.
Open a R Markdown file and perform the following tasks. Knit this R Markdown file to HTML. Both the Rmd and html files will be Part 2 of your submission
dat2 <- readRDS('gene_data.rds')
dat2 %>% group_by(Class) %>% summarise(FPKM = median(FPKM, na.rm=T))
## # A tibble: 8 × 2
## Class FPKM
## <chr> <dbl>
## 1 2xcr 7.78
## 2 2xcr + c420r 7.87
## 3 2xcr+e545k 7.95
## 4 2xcr+h1047r 8.10
## 5 2xcr+m1043i 8.16
## 6 2xcr+pik3ca 8.20
## 7 2xcr+r88q 7.85
## 8 wt ctx 7.11
dat2 %>% group_by(chromosome) %>% summarize(N = length(unique(gene_id)))
## # A tibble: 22 × 2
## chromosome N
## <chr> <int>
## 1 chr1 953
## 2 chr10 756
## 3 chr11 1304
## 4 chr12 544
## 5 chr13 597
## 6 chr14 570
## 7 chr15 625
## 8 chr16 493
## 9 chr17 790
## 10 chr18 399
## # … with 12 more rows
dat2 %>% filter(str_detect(gene_id, 'PIK3C')) %>%
summarize(length(unique(gene_id)))
## length(unique(gene_id))
## 1 7
dat2 %>% filter(str_detect(gene_id, 'PIK3C')) %>%
select(sampleID, gene_id) %>%
distinct() %>% count(gene_id) %>% mutate(rel_freq = n/26)
## gene_id n rel_freq
## 1 PIK3C2A 23 0.8846154
## 2 PIK3C2B 21 0.8076923
## 3 PIK3C3 26 1.0000000
## 4 PIK3CA 26 1.0000000
## 5 PIK3CB 26 1.0000000
## 6 PIK3CD 26 1.0000000
## 7 PIK3CG 15 0.5769231
dat2 %>% mutate(pik3c_ind = ifelse(str_detect(gene_id, 'PIK3C'), 1, 0)) %>%
t.test(FPKM ~ pik3c_ind, data = .) %>%
broom::tidy()
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28.7 33.4 4.75 70.2 0 1241. 27.9 29.5
## # … with 2 more variables: method <chr>, alternative <chr>
dat2 %>% group_by(Class) %>% summarize(FPKM = median(FPKM, na.rm=T)) %>%
ggplot(aes(Class, FPKM))+geom_bar(stat='identity') + coord_flip()
# or
dat2 %>% ggplot(aes(Class, FPKM)) + stat_summary(geom='bar', fun = median) +
coord_flip()