We will use data from a gene expression experiment on chronic lymphocytic leukemia (CLL) patients, available from Bioconductor. You can download the data here.

  1. Read the data into R using the read_csv function from the readr package, part of the tidyverse metapackage.

  2. This data set contains some Affymetrix control probes, stored in columns whose names start with “AFFX”. Exclude these columns from further analysis and save the result to a new data frame (i.e. give it a new name).

  3. The first two columns are the sample IDs and the disease phenotypes. The rest are expression values, one column per probeset. Let’s make this dataset tidy using gather, making sure that the SampleID and Disease columns are kept intact and repeated over the rows of each probeset’s data.

  4. Remove the rows with missing disease status.
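
The four steps above can be sketched on a tiny stand-in file. This is only an illustration under stated assumptions, not a reference solution: the sample rows, probe names, expression values, and Disease labels are made up, and the object names (`cll_raw`, `cll_probes`, `cll_long`, `cll_complete`, `probeset`, `expression`) are hypothetical. With the real data, point `read_csv` at the downloaded file instead of the temporary one.

```r
library(tidyverse)

# Made-up stand-in for the downloaded CSV (values are NOT from the real data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SampleID,Disease,1000_at,1001_at,AFFX-BioB-5_at",
  "CLL10,NA,5.7,4.5,7.9",
  "CLL11,progressive,5.9,4.3,7.8",
  "CLL12,stable,5.6,4.6,8.0"
), csv_path)

# 1. Read the data; "NA" is parsed as a missing value by default
cll_raw <- read_csv(csv_path)

# 2. Drop the control-probe columns and save under a new name
cll_probes <- select(cll_raw, -starts_with("AFFX"))

# 3. Gather every probeset column into key/value pairs,
#    keeping SampleID and Disease as identifier columns
cll_long <- gather(cll_probes, key = "probeset", value = "expression",
                   -SampleID, -Disease)

# 4. Keep only the rows with a known disease status
cll_complete <- filter(cll_long, !is.na(Disease))
```

In recent tidyr releases `gather` is superseded by `pivot_longer`; `pivot_longer(cll_probes, -c(SampleID, Disease), names_to = "probeset", values_to = "expression")` produces the same long layout, with the rows in a different order.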


Note

I created the dataset used here using the following code:

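# biocLite() was the Bioconductor installer when this code was written;
# current Bioconductor releases use BiocManager::install('CLL') instead
source('http://bioconductor.org/biocLite.R')
biocLite('CLL')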
library(CLL)
library(affy)
data(CLLbatch)
data(disease)
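d <- rma(CLLbatch)                 # RMA-normalized expression set
dat <- t(as.data.frame(exprs(d)))  # transpose: samples in rows, probesets in columns
dat <- cbind(disease, dat)         # prepend the SampleID and Disease columns
write.csv(dat, file = 'geneexpressions.csv', row.names = FALSE)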
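
To see why the file ends up with one row per sample, here is a minimal base R sketch of the same transpose-and-bind pattern. It assumes a toy matrix in place of `exprs(d)` and a toy data frame in place of `disease`; all names and values in it (`expr`, `pheno`, `wide`, the probe and sample labels) are made up for illustration.

```r
# Toy stand-in for exprs(d): probesets in rows, samples in columns (made-up values)
expr <- matrix(
  c(5.1, 7.2, 6.3,
    4.8, 7.9, 6.0),
  nrow = 3,
  dimnames = list(c("probe_a", "probe_b", "probe_c"), c("S1.CEL", "S2.CEL"))
)

# Toy stand-in for the `disease` data frame
pheno <- data.frame(SampleID = c("S1", "S2"), Disease = c("stable", NA))

# Transposing puts samples in rows, so the phenotype columns line up with them
wide <- cbind(pheno, t(as.data.frame(expr)))

# Write without row names and read the file back to inspect the layout
out_file <- tempfile(fileext = ".csv")
write.csv(wide, file = out_file, row.names = FALSE)
check <- read.csv(out_file, check.names = FALSE)
```

Here `check` has one row per sample, with `SampleID` and `Disease` first and one column per probeset after them, which is the wide layout the exercise starts from.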