We will use data from a gene expression experiment on chronic lymphocytic leukemia (CLL) patients, available from Bioconductor. You can download the data here.
Read the data into R using the read_csv function from the readr package, part of the tidyverse metapackage.
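As a minimal sketch of the reading step, the example below writes a tiny stand-in CSV to a temporary file and reads it back with read_csv. The sample IDs, disease labels, and numbers are made up for illustration, and the two probe column names are only examples; with the real download you would pass the path of the downloaded file instead of toy_path.

```r
library(readr)

# Toy stand-in for the downloaded file (values are invented for illustration)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SampleID,Disease,1000_at,AFFX-BioB-5_at",
  "S1,stable,5.74,8.12",
  "S2,progres.,5.87,8.30"
), toy_path)

# read_csv returns a tibble and keeps non-syntactic column names unchanged
dat <- read_csv(toy_path)
dat
```

Unlike base read.csv, read_csv does not rewrite column names such as 1000_at, so they need backticks when referred to directly in later code.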
This dataset contains some Affymetrix control probes that we want to exclude from further analysis. They are in columns whose names start with “AFFX”. Save the result to a new data frame (i.e., give it a new name).
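One possible way to drop the control probes is select with the starts_with helper from dplyr. The sketch below uses a small made-up tibble, and dat_noaffx is just a name chosen here for the new data frame.

```r
library(dplyr)

# Toy data: one ordinary probe and one control probe (values are invented)
dat <- tibble(
  SampleID = c("S1", "S2"),
  Disease = c("stable", NA),
  `1000_at` = c(5.74, 5.87),
  `AFFX-BioB-5_at` = c(8.12, 8.30)
)

# Drop every column whose name starts with "AFFX" and store the result under a new name
dat_noaffx <- dat %>% select(-starts_with("AFFX"))
names(dat_noaffx)
```

Note that starts_with ignores case by default, which does not matter here but can be switched off with ignore.case = FALSE.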
The first two columns are the sample IDs and the disease phenotypes. The rest are expression values from the probesets. Let’s make this dataset tidy using gather, making sure that we keep the SampleID and Disease columns intact and repeated over the rows of each probeset’s data.
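The reshaping step might look like the sketch below, again on a made-up tibble. The key and value column names (probeset and expression) are assumptions chosen for this example; excluding SampleID and Disease with a minus sign keeps them as identifier columns that are repeated for every probeset.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per sample, one column per probeset (values are invented)
dat_noaffx <- tibble(
  SampleID = c("S1", "S2"),
  Disease = c("stable", NA),
  `1000_at` = c(5.74, 5.87),
  `1001_at` = c(3.10, 3.20)
)

# Gather all probeset columns into two columns: the probeset name and its expression value
dat_tidy <- dat_noaffx %>%
  gather(key = "probeset", value = "expression", -SampleID, -Disease)
dat_tidy
```

In current tidyr, gather is superseded by pivot_longer, which would express the same step as pivot_longer(-c(SampleID, Disease), names_to = "probeset", values_to = "expression").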
Filter out the rows with missing disease status.
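A minimal sketch of this filtering step, assuming missing disease status is coded as NA in the Disease column. The tibble is made up, and dat_complete is a name chosen for this example.

```r
library(dplyr)

# Toy long-format data; sample S2 has no recorded disease status (values are invented)
dat_tidy <- tibble(
  SampleID = c("S1", "S2", "S1", "S2"),
  Disease = c("stable", NA, "stable", NA),
  probeset = c("1000_at", "1000_at", "1001_at", "1001_at"),
  expression = c(5.74, 5.87, 3.10, 3.20)
)

# Keep only the rows where Disease is not missing
dat_complete <- dat_tidy %>% filter(!is.na(Disease))
dat_complete
```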
I created the dataset used here using the following code:
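# Install and load the CLL data package from Bioconductor.
# (biocLite was the installer at the time; current Bioconductor uses BiocManager::install('CLL').)
source('http://bioconductor.org/biocLite.R')
biocLite('CLL')
library(CLL)
library(affy)
# Load the raw AffyBatch object and the per-sample disease table
data(CLLbatch)
data(disease)
# RMA-normalize to get probeset-level expression values
d <- rma(CLLbatch)
# Transpose so that samples are rows and probesets are columns
dat <- t(as.data.frame(exprs(d)))
# Prepend the SampleID and Disease columns
dat <- cbind(disease, dat)
write.csv(dat, file='geneexpressions.csv', row.names = FALSE)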