In this homework you will submit three files: (1) the R script from the Data preparation section, (2) the Rmd file from the Data analysis section, and (3) the knitted HTML file from the Data analysis section. All results of the data analysis should be present in the Rmd and HTML files. The R script should be well commented so that the reader can understand what each section of code is supposed to do.
In this homework we will play with some publicly available RNA-Seq data. Information about the experiment is available here. Briefly, this is a mouse experiment in which the authors were interested in PIK3CA mutation variants. This interest stems from PIK3CA’s implication in glioblastoma tumorigenesis. There are 26 mice with different engineered genotypes.
The data are available as a zip file here. The experimental design is available here.
Please let me know if these links don’t work.
Open an R script file in your class RStudio Project. This will be Part 1 of your submission. Your code should be annotated with sufficient comments so that the reader/grader can understand what you intend to do with each part of your code.
Read all 26 files into a list in R, so that the list consists of 26 data frames.
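One way to read a folder of files into a list is sketched below. The folder, the file names, and the tab-separated layout are assumptions made up for illustration (the sketch writes two tiny stand-in files to a temporary directory so that it runs on its own). Substitute the path of the unzipped data and whichever reader matches the real files.

```r
# Hypothetical stand-in for the unzipped data folder: two tiny tab-separated files.
data_dir <- file.path(tempdir(), "rnaseq_demo")
dir.create(data_dir, showWarnings = FALSE)
toy <- data.frame(tracking_id = c("g1", "g2"),
                  gene_id     = c("Pik3ca", "Actb"),
                  locus       = c("chr9:100-200", "chr5:300-400"),
                  FPKM        = c(1.5, 20))
for (nm in c("sampleA.txt", "sampleB.txt")) {
  write.table(toy, file.path(data_dir, nm), sep = "\t",
              row.names = FALSE, quote = FALSE)
}

# List every matching file in the folder, then read each one into a data frame.
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
dat_list <- lapply(files, read.delim, stringsAsFactors = FALSE)

# Name each list element after its file so the sample can be tracked later.
names(dat_list) <- tools::file_path_sans_ext(basename(files))
```

Naming the list elements at this stage is a design choice: the names can be reused later as sample identifiers when the data frames are stacked.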
For each of the data frames, perform the following data processing:
Keep only the variables tracking_id, gene_id, and locus through FPKM.
Convert gene_id to capital letters (use stringr::str_to_upper or other equivalent functions).
Split the locus variable into a chromosome and a location variable.
Now make a single data set by stacking the processed data sets, adding a column that specifies the sample identifier of each observation. You might like to call this column sampleID to make the next step simpler.
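The per-sample processing and the stacking step can be sketched in base R as follows. This is only an illustration on invented toy data: the extra columns, the locus format "chr9:100-200", and the helper names make_toy and process_one are assumptions, and the sketch reads the steps as keeping a subset of columns, upper-casing gene_id, and splitting locus at its colon.

```r
# Toy list of two data frames; column names follow the text, values are invented.
make_toy <- function() {
  data.frame(tracking_id = c("g1", "g2"),
             class_code  = c("-", "-"),
             gene_id     = c("Pik3ca", "Actb"),
             locus       = c("chr9:100-200", "chr5:300-400"),
             length      = c("-", "-"),
             FPKM        = c(1.5, 20),
             FPKM_status = c("OK", "OK"),
             stringsAsFactors = FALSE)
}
dat_list <- list(sampleA = make_toy(), sampleB = make_toy())

process_one <- function(d) {
  # Keep tracking_id, gene_id, and every column from locus through FPKM.
  first <- match("locus", names(d))
  last  <- match("FPKM", names(d))
  d <- d[, c("tracking_id", "gene_id", names(d)[first:last])]
  # Upper-case the gene names (base-R equivalent of stringr::str_to_upper).
  d$gene_id <- toupper(d$gene_id)
  # Split locus at the first colon (assumed format "chr9:100-200").
  d$chromosome <- sub(":.*$", "", d$locus)
  d$location   <- sub("^[^:]*:", "", d$locus)
  d$locus <- NULL
  d
}

processed <- lapply(dat_list, process_one)

# Stack the data frames, recording which sample each row came from.
with_id <- Map(function(d, id) cbind(sampleID = id, d),
               processed, names(processed))
stacked <- do.call(rbind, unname(with_id))
rownames(stacked) <- NULL
```

Writing the processing as one function and applying it with lapply keeps the 26 data frames consistent, since every sample goes through exactly the same steps.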
Import the design information into a data frame. Add the design information to the genetic dataset you created above, ensuring that ids and genotype specification (stored in the column Class) are aligned.
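A join on the sample identifier is one way to keep ids and genotypes aligned. The sketch below uses base R's merge on invented toy tables. The key name sampleID and the Class values are assumptions, and the real design file may name its id column differently.

```r
# Toy expression data and toy design table; all values are invented.
expr <- data.frame(sampleID = c("s1", "s1", "s2"),
                   gene_id  = c("PIK3CA", "ACTB", "PIK3CA"),
                   FPKM     = c(1.5, 20, 3.2))
design <- data.frame(sampleID = c("s1", "s2"),
                     Class    = c("WT-1", "WT-2"))

# Left join: keep every expression row and attach the matching Class.
joined <- merge(expr, design, by = "sampleID", all.x = TRUE)

# Sanity check: an NA in Class would signal an id that failed to match.
stopifnot(!anyNA(joined$Class))
```

Joining by key, instead of binding columns side by side, means the result does not depend on the row order of either table.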
Note that there is an add-on to the genotype information, in the form -1, -2, to specify which particular sample it is. This is unnecessary and would create problems for any comparative analyses using t-tests or ANOVA later. Please remove it so that you just have genotype information.
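Removing a trailing replicate number can be sketched with a regular expression. The Class labels below are invented for illustration and are not taken from the design file.

```r
# Invented Class labels with a trailing replicate number.
Class <- c("WT-1", "WT-2", "MUT-1", "MUT-12")

# Drop a final hyphen followed by digits; anything else is left untouched.
genotype <- sub("-[0-9]+$", "", Class)
```

Anchoring the pattern with $ matters: only a suffix at the very end of the label is removed, so a hyphen inside a genotype name would survive.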
There is one idiosyncratic genotype specification, which is a re-run. Fix this to match the other genotypes.
Clean the chromosome values so that you only have data of the form “chr##” where ## refers to the chromosome number.
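One reading of this step is to keep only the rows whose chromosome value is exactly "chr" followed by digits. The sketch below takes that reading on invented values. Note that under this literal interpretation a value such as chrX is dropped as well, which is an assumption to check against what the assignment intends.

```r
# Invented chromosome values, some of which do not fit the "chr##" form.
dat <- data.frame(chromosome = c("chr1", "chr19", "chrX", "chrUn_random", "chr7"),
                  FPKM       = c(1, 2, 3, 4, 5))

# Keep rows whose chromosome is exactly "chr" followed by one or more digits.
keep  <- grepl("^chr[0-9]+$", dat$chromosome)
clean <- dat[keep, ]
```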
Save the dataset using saveRDS to a .rds file.
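A minimal round trip with saveRDS and readRDS looks like the following. The file name and the temporary directory are placeholders; in the homework the file would go into the project folder.

```r
# Small invented data frame standing in for the cleaned dataset.
dat <- data.frame(sampleID = c("s1", "s2"), FPKM = c(1.5, 3.2))

# Hypothetical output path; replace with a path inside your project.
out_file <- file.path(tempdir(), "rnaseq_clean.rds")
saveRDS(dat, out_file)

# Reading the file back should return the same object.
restored <- readRDS(out_file)
stopifnot(identical(dat, restored))
```

Unlike save, saveRDS stores a single object without its name, so the analysis file can assign it to any variable it likes when calling readRDS.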
You can choose whether you fix all the genotype information before or after you join the design dataset to the genetic dataset; do whatever makes sense to you.
Open an R Markdown file and perform the following tasks. Knit this R Markdown file to HTML. Both the Rmd and HTML files will be Part 2 of your submission.
Provide the following tables:
How many unique genes of the form PIK3C* are present in the dataset?
What are the relative frequencies of each PIK3CA variant among the 26 mice? Assume a missing value means the variant is not present.
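Both tables can be sketched in base R on invented data. The sketch assumes that the variant is recorded in Class, that each mouse has one Class value, and that a missing value is recoded to a label of its own (here "none", a made-up label). The real data may call for a different treatment.

```r
# Invented long-format data: several rows per mouse, one Class per mouse.
dat <- data.frame(sampleID = c("s1", "s1", "s2", "s2", "s3", "s3"),
                  gene_id  = c("PIK3CA", "ACTB", "PIK3CB", "ACTB", "PIK3CA", "PIK3C2A"),
                  Class    = c("WT", "WT", "MUT", "MUT", NA, NA),
                  stringsAsFactors = FALSE)

# Table 1: unique gene names that start with "PIK3C".
pik_genes <- unique(dat$gene_id[startsWith(dat$gene_id, "PIK3C")])
n_pik <- length(pik_genes)

# Table 2: reduce to one row per mouse before counting, so that mice with
# more genes measured do not get more weight.
per_mouse <- unique(dat[, c("sampleID", "Class")])
per_mouse$Class[is.na(per_mouse$Class)] <- "none"  # missing = variant not present
rel_freq <- prop.table(table(per_mouse$Class))
```

In the R Markdown file, a function such as knitr::kable can turn results like these into formatted tables.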
Perform a statistical test to see if FPKM expression levels are different at PIK3C* genes (i.e. any gene name starting with “PIK3C”) compared to all other genes. (The correct test is actually complicated, but a t-test, a Wilcoxon test, or a permutation test are all allowed here.)
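As an illustration of one of the allowed options, the sketch below runs a Wilcoxon rank-sum test on simulated FPKM values. The gene names and numbers are invented, so the resulting p-value says nothing about the real data.

```r
# Simulated data: 4 PIK3C* genes and 20 other genes with random FPKM values.
set.seed(1)
dat <- data.frame(gene_id = c(paste0("PIK3C", c("A", "B", "D", "G")),
                              paste0("GENE", 1:20)),
                  FPKM    = c(rexp(4, rate = 1/5), rexp(20, rate = 1/10)),
                  stringsAsFactors = FALSE)

# Flag the genes whose name starts with "PIK3C".
is_pik <- startsWith(dat$gene_id, "PIK3C")

# Two-sample Wilcoxon rank-sum test: PIK3C* genes versus all other genes.
res <- wilcox.test(dat$FPKM[is_pik], dat$FPKM[!is_pik])
res$p.value
```

A t-test would follow the same pattern with t.test in place of wilcox.test. The rank-based test is used here only because FPKM values tend to be strongly skewed.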
Draw a barplot showing the median FPKM levels by Class.
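A base-R sketch of such a barplot on invented values follows. The Class labels and FPKM numbers are made up, and a ggplot2 version would work equally well.

```r
# Invented data: three FPKM values for each of two classes.
dat <- data.frame(Class = c("WT", "WT", "WT", "MUT", "MUT", "MUT"),
                  FPKM  = c(1, 2, 9, 4, 5, 6))

# Median FPKM within each Class, returned as a named vector.
med <- tapply(dat$FPKM, dat$Class, median)

# One bar per Class, with height equal to the median FPKM.
barplot(med, xlab = "Class", ylab = "Median FPKM",
        main = "Median FPKM by Class")
```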