In this homework you will have to submit 3 files. (1) The R script from the Data preparation section, (2) the Rmd file from the Data analysis section and (3) the knitted HTML file from the Data analysis section. The results of the data analysis section should all be present in the Rmd and HTML files. The R script should be well-commented so that the reader can understand what each section of code is supposed to do.

In this homework we will play with some publicly available RNA-Seq data. Information about the experiment is available here. Briefly, this is a mouse experiment where the authors were interested in PIK3CA mutation variants. This interest is derived from PIK3CA’s implication in glioblastoma tumorogenesis. There are 26 mice with different engineered genetics.

The data are available as a zip file here. The experimental design is available here.

Prelim work

  1. Download and unzip the data file in the data directory you just created. You should get 26 text files, perhaps in a sub-directory. Note this, since you’ll need it in the next step. Also download the experimental design file here.

Homework

Data preparation

Open a R script file in the RStudio Project you created. This will be Part 1 of your submission. Your code should be annotated with sufficient comments so that the reader/grader can understand what you are intending to do with each part of your code.

  1. Read all 26 files into a list in R, where the list consists of 26 data frames

    library(here)
    library(rio)
    dat <- import_list(dir(here('data/GSE123519_RAW/'), full.names = TRUE))
  2. For each of the data frames perform the following data processing

    1. Keep only the variables tracking_id, gene_id, and locus through FPKM.
    2. Change the gene names in gene_id to capital letters (use stringr::str_to_upper or other equivalent functions)
    3. Split the locus variable into a chromosome and a location variable.
    4. Filter out all observations with FPKM less than 1 (no real scientific reason, just because)
    library(tidyverse)
    dat1 <- dat %>% 
        map(select, tracking_id, gene_id, locus:FPKM) %>% 
        map(mutate, gene_id = stringr::str_to_upper(gene_id)) %>% 
        map(separate, locus, c('chromosome','location'), sep = ':') %>% 
        map(filter, FPKM >= 1)
  3. Now make a single data set stacking the processed datasets, adding a column that specifies the sample identifier of each observation. You might like to call this column sampleID to make the next step simpler

    dat2 <- bind_rows(dat1, .id = 'sampleID') %>% 
        mutate(sampleID = str_extract(sampleID, 'GSM\\d+'))
  4. Import the design information into a data frame. Add the design information to the genetic dataset you created above, ensuring that ids and genotype specification (stored in the column Class) are aligned.

    design <- import(here('data/GSE123519design.xlsx'))
    dat2 <- dat2 %>% left_join(design)
  5. Note that there is an add-on to the genotype information, in the form -1, -2 to specify which particular sample it is. This is unnecessary and would create problems for any comparative analyses using t-tests or ANOVA later. Please remove it so that you just have genotype information.

    dat2 <- dat2 %>% 
        mutate(Class = str_remove(Class, '-[0-9]'))
  6. There is one idiosyncratic genotype specification which is a re-rerun. Fix this to match the other genotypes.

    dat2 <- dat2 %>% 
        mutate(Class = str_remove(Class,'-re-run-59'))
  7. Clean the chromosome values so that you only have data of the form “chr##” where ## refers to the chromosome number.

    dat2 <- dat2 %>% 
        mutate(chromosome = str_extract(chromosome, 'chr[[:alnum:]]+'))
  8. Save the dataset using saveRDS to a .rds file

    saveRDS(dat2, 'gene_data.rds', compress = T)

You can choose whether you fix all the genotype information before or after you join the design dataset to the genetic dataset. Whatever makes sense to you.

Data analysis

Open a R Markdown file and perform the following tasks. Knit this R Markdown file to HTML. Both the Rmd and html files will be Part 2 of your submission

  1. Provide the following tables:
    • median FPKM levels by Class
    • How many unique genes are interrogated per chromosome
    dat2 <- readRDS('gene_data.rds')
    dat2 %>% group_by(Class) %>% summarise(FPKM = median(FPKM, na.rm=T))
    ## # A tibble: 8 × 2
    ##   Class         FPKM
    ##   <chr>        <dbl>
    ## 1 2xcr          7.78
    ## 2 2xcr + c420r  7.87
    ## 3 2xcr+e545k    7.95
    ## 4 2xcr+h1047r   8.10
    ## 5 2xcr+m1043i   8.16
    ## 6 2xcr+pik3ca   8.20
    ## 7 2xcr+r88q     7.85
    ## 8 wt ctx        7.11
    dat2 %>% group_by(chromosome) %>% summarize(N = length(unique(gene_id)))
    ## # A tibble: 22 × 2
    ##    chromosome     N
    ##    <chr>      <int>
    ##  1 chr1         953
    ##  2 chr10        756
    ##  3 chr11       1304
    ##  4 chr12        544
    ##  5 chr13        597
    ##  6 chr14        570
    ##  7 chr15        625
    ##  8 chr16        493
    ##  9 chr17        790
    ## 10 chr18        399
    ## # … with 12 more rows
  2. How many unique genes of the form PIK3C* are present in the dataset
dat2 %>% filter(str_detect(gene_id, 'PIK3C')) %>%
    summarize(length(unique(gene_id)))
##   length(unique(gene_id))
## 1                       7
  1. What are the relative frequencies of each PIK3CA variant among the 26 mice. Assume a missing value means the variant is not present.
dat2 %>% filter(str_detect(gene_id, 'PIK3C')) %>%
    select(sampleID, gene_id) %>% 
    distinct() %>% count(gene_id) %>% mutate(rel_freq = n/26)
##   gene_id  n  rel_freq
## 1 PIK3C2A 23 0.8846154
## 2 PIK3C2B 21 0.8076923
## 3  PIK3C3 26 1.0000000
## 4  PIK3CA 26 1.0000000
## 5  PIK3CB 26 1.0000000
## 6  PIK3CD 26 1.0000000
## 7  PIK3CG 15 0.5769231
  1. Perform a statistical test to see if FPKM expression levels are different at PIK3C* genes compared to all other genes. (the correct test is actually complicated, but doing t-test, wilcoxon test or permutation tests are all allowed here)
dat2 %>% mutate(pik3c_ind = ifelse(str_detect(gene_id, 'PIK3C'), 1, 0)) %>% 
    t.test(FPKM ~ pik3c_ind, data = .) %>% 
    broom::tidy()
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     28.7      33.4      4.75      70.2       0     1241.     27.9      29.5
## # … with 2 more variables: method <chr>, alternative <chr>
  1. Draw a barplot showing the median FPKM levels by Class
dat2 %>% group_by(Class) %>% summarize(FPKM = median(FPKM, na.rm=T)) %>% 
    ggplot(aes(Class, FPKM))+geom_bar(stat='identity') + coord_flip()

# or 

dat2 %>% ggplot(aes(Class, FPKM)) + stat_summary(geom='bar', fun = median) +
    coord_flip()