R is a functional programming language. This means that functions are “objects”, just like data frames, vectors, and other things that are assigned to variables and passed to other functions.
November 14th, 2018
R is a functional programming language. This means that functions are “objects”, just like data frames, vectors, and other things that are assigned to variables and passed to other functions.
The name of a functions is actually the name of a variable that contains the function, in the same way that the
log
## function (x, base = exp(1)) .Primitive("log")
This means that we can create a copy of a function by assigning its value to a new variable.
myLogFunction <- log myLogFunction
## function (x, base = exp(1)) .Primitive("log")
myNumber <- 7 class(myNumber)
## [1] "numeric"
class(log)
## [1] "function"
We can create a new function using the word “function” followed by the functions arguments and one or more R statements.
myDumbFunction <- function() 42 myDumbFunction()
## [1] 42
This is a function with no arguments. Usually functions have arguments, which we will see next. Here,
myDumbFunction
gives the same answer whenever it’s called
If there is more than one statement in a function, they should be enclosed in curly brackets:
doubleIt <- function(x) { myResult <- x * 2 myResult # or, explicitly, return(myResult) } doubleIt(5)
## [1] 10
The last statement within the curly brackets will be the value returned by the function.
x
is the function argument, in that it is a placeholder we can replace with an actual value when calling the function
Inside a function, variables that existed in your environment can be used and even changed. However, any changes made, including changing data stored in variables and creating new variables, happens solely within the function. Your environment stays the same.
exists("myResult")
## [1] FALSE
myResult <- 1000 doubleItOutput <- doubleIt(2) myResult
## [1] 1000
The data set used in today’s lecture comes from an siRNA screen that we published a few years ago. The screen looked for genes that influence parkin translocation.
High-content genome-wide RNAi screens identify regulators of parkin upstream of mitophagy. Hasson SA, Kane LA, Yamano K, Huang CH, Sliter DA, Buehler E, Wang C, Heman-Ackah SM, Hessa T, Guha R, Martin SE, Youle RJ. Nature. 2013.
The data set will be available for download from the lectures portion of the class web page, also here.
library(readxl) ambion <- read_excel("lecture_functions_data/nature.parkin.gw.xlsx", skip = 3) str(ambion)
## Classes 'tbl_df', 'tbl' and 'data.frame': 65196 obs. of 25 variables: ## $ Vendor Supplied Gene Symbol : chr "LMAN1" "MED15" "PFN2" "KIAA1191" ... ## $ Sample : num 8.91 7.92 7.97 11.2 8.88 ... ## $ Median Negative Control on Plate : num 38.6 33.3 31.4 43.8 34.3 ... ## $ Median Positive Control on Plate : num 96.7 96.7 97.1 96.5 96.1 ... ## $ PPT Sample as Percentage of Negative Control : num 23.1 23.7 25.4 25.6 25.9 ... ## $ PPT MAD Z-Score : num -2.42 -2.4 -2.36 -2.35 -2.35 ... ## $ PPT MAD Log MAD Z-Score : num -4.82 -4.74 -4.54 -4.52 -4.49 ... ## $ Median PPT Log Mad Z-Score for all siRNAs having the same seed sequence: num -2.056 -1.421 -1.421 -1.388 -0.485 ... ## $ Number of siRNAs having the same seed sequence : num 23 152 152 28 18 28 152 28 75 18 ... ## $ Cell Count, Sample : num 1286 1203 1177 1060 1272 ... ## $ Median Negative Control Cell Count on Plate : num 1278 1310 1260 1344 1256 ... ## $ Median Positive Control Cell Count on Plate : num 1030 1050 978 1106 985 ... ## $ Sample Cell Count, MAD Z-Score Normalized to Negative Contol : num 1.109 0.241 0.399 -1.027 1.168 ... ## $ Sample__1 : num 8.94 16.79 11.3 11.13 12.89 ... ## $ Median Negative Control Mitophagy on Plate : num 10.7 28.1 29 15.5 12.1 ... ## $ Median Positive Control Mitophagy on Plate : num 22.4 49 50.1 27.7 22 ... ## $ Sample Mitophagy, MAD Z-Score Normalized to Negative Control : num 0.0466 -0.8887 -1.7 -0.4281 0.9416 ... ## $ HUGO Gene Symbol : chr "LMAN1" "PCQAP" "PFN2" "KIAA1191" ... ## $ Entrez GeneID : num 3998 51586 5217 57179 284221 ... ## $ REFSEQ : chr "NM_005570" "NM_001003891" "NM_002628" "NM_001079684" ... ## $ DESCRIPTION : chr "lectin, mannose-binding, 1" "mediator complex subunit 15" "profilin 2" "KIAA1191" ... ## $ PLATE ID : chr "QDA101771" "QDA101996" "QDA103288" "QDA103435" ... ## $ Row Number : num 1 15 14 16 1 2 1 16 10 1 ... ## $ Column Number : num 4 21 19 11 12 1 4 16 1 14 ... ## $ Ambion siRNA ID : chr "s8218" "s28366" "s10379" "s226909" ...
options(dplyr.width = Inf) # show all cols ambion %>% summarize_all(function(x) sum(is.na(x)))
## # A tibble: 1 x 25 ## `Vendor Supplied Gene Symbol` Sample `Median Negative Control on Plate` ## <int> <int> <int> ## 1 441 441 441 ## `Median Positive Control on Plate` ## <int> ## 1 441 ## `PPT Sample as Percentage of Negative Control` `PPT MAD Z-Score` ## <int> <int> ## 1 441 441 ## `PPT MAD Log MAD Z-Score` ## <int> ## 1 441 ## `Median PPT Log Mad Z-Score for all siRNAs having the same seed sequence` ## <int> ## 1 441 ## `Number of siRNAs having the same seed sequence` `Cell Count, Sample` ## <int> <int> ## 1 441 441 ## `Median Negative Control Cell Count on Plate` ## <int> ## 1 441 ## `Median Positive Control Cell Count on Plate` ## <int> ## 1 441 ## `Sample Cell Count, MAD Z-Score Normalized to Negative Contol` Sample__1 ## <int> <int> ## 1 441 441 ## `Median Negative Control Mitophagy on Plate` ## <int> ## 1 441 ## `Median Positive Control Mitophagy on Plate` ## <int> ## 1 441 ## `Sample Mitophagy, MAD Z-Score Normalized to Negative Control` ## <int> ## 1 441 ## `HUGO Gene Symbol` `Entrez GeneID` REFSEQ DESCRIPTION `PLATE ID` ## <int> <int> <int> <int> <int> ## 1 0 441 441 441 441 ## `Row Number` `Column Number` `Ambion siRNA ID` ## <int> <int> <int> ## 1 441 441 441
# apply(ambion, 2, function(x) sum(is.na(x)))
#ambion[is.na(ambion[,1]),][1,] ambion %>% filter(is.na(.[,1])) %>% slice(1)
## # A tibble: 1 x 25 ## `Vendor Supplied Gene Symbol` Sample `Median Negative Control on Plate` ## <chr> <dbl> <dbl> ## 1 <NA> NA NA ## `Median Positive Control on Plate` ## <dbl> ## 1 NA ## `PPT Sample as Percentage of Negative Control` `PPT MAD Z-Score` ## <dbl> <dbl> ## 1 NA NA ## `PPT MAD Log MAD Z-Score` ## <dbl> ## 1 NA ## `Median PPT Log Mad Z-Score for all siRNAs having the same seed sequence` ## <dbl> ## 1 NA ## `Number of siRNAs having the same seed sequence` `Cell Count, Sample` ## <dbl> <dbl> ## 1 NA NA ## `Median Negative Control Cell Count on Plate` ## <dbl> ## 1 NA ## `Median Positive Control Cell Count on Plate` ## <dbl> ## 1 NA ## `Sample Cell Count, MAD Z-Score Normalized to Negative Contol` Sample__1 ## <dbl> <dbl> ## 1 NA NA ## `Median Negative Control Mitophagy on Plate` ## <dbl> ## 1 NA ## `Median Positive Control Mitophagy on Plate` ## <dbl> ## 1 NA ## `Sample Mitophagy, MAD Z-Score Normalized to Negative Control` ## <dbl> ## 1 NA ## `HUGO Gene Symbol` `Entrez GeneID` REFSEQ DESCRIPTION `PLATE ID` ## <chr> <dbl> <chr> <chr> <chr> ## 1 PRKCE NA <NA> <NA> <NA> ## `Row Number` `Column Number` `Ambion siRNA ID` ## <dbl> <dbl> <chr> ## 1 NA NA <NA>
# ambion <- ambion[! is.na(ambion[,2]), ] ambion <- ambion %>% filter(!is.na(Sample)) #apply(ambion, 2, function(x) sum(is.na(x))) ambion %>% summarize_all(function(x) sum(is.na(x)))
## # A tibble: 1 x 25 ## `Vendor Supplied Gene Symbol` Sample `Median Negative Control on Plate` ## <int> <int> <int> ## 1 0 0 0 ## `Median Positive Control on Plate` ## <int> ## 1 0 ## `PPT Sample as Percentage of Negative Control` `PPT MAD Z-Score` ## <int> <int> ## 1 0 0 ## `PPT MAD Log MAD Z-Score` ## <int> ## 1 0 ## `Median PPT Log Mad Z-Score for all siRNAs having the same seed sequence` ## <int> ## 1 0 ## `Number of siRNAs having the same seed sequence` `Cell Count, Sample` ## <int> <int> ## 1 0 0 ## `Median Negative Control Cell Count on Plate` ## <int> ## 1 0 ## `Median Positive Control Cell Count on Plate` ## <int> ## 1 0 ## `Sample Cell Count, MAD Z-Score Normalized to Negative Contol` Sample__1 ## <int> <int> ## 1 0 0 ## `Median Negative Control Mitophagy on Plate` ## <int> ## 1 0 ## `Median Positive Control Mitophagy on Plate` ## <int> ## 1 0 ## `Sample Mitophagy, MAD Z-Score Normalized to Negative Control` ## <int> ## 1 0 ## `HUGO Gene Symbol` `Entrez GeneID` REFSEQ DESCRIPTION `PLATE ID` ## <int> <int> <int> <int> <int> ## 1 0 0 0 0 0 ## `Row Number` `Column Number` `Ambion siRNA ID` ## <int> <int> <int> ## 1 0 0 0
Often it will be helpful to create a new data frame with only the data we wish to analyze.
#ambion.simple <- ambion[,c(19,25,1,21,7,13,17)] ambion.simple <- select(ambion, 19,25,1,21,7,13,17) ambion.simple[1,]
## # A tibble: 1 x 7 ## `Entrez GeneID` `Ambion siRNA ID` `Vendor Supplied Gene Symbol` ## <dbl> <chr> <chr> ## 1 3998 s8218 LMAN1 ## DESCRIPTION `PPT MAD Log MAD Z-Score` ## <chr> <dbl> ## 1 lectin, mannose-binding, 1 -4.82 ## `Sample Cell Count, MAD Z-Score Normalized to Negative Contol` ## <dbl> ## 1 1.11 ## `Sample Mitophagy, MAD Z-Score Normalized to Negative Control` ## <dbl> ## 1 0.0466
library(knitr, quietly = TRUE) colnames(ambion.simple) <- c("GeneID","siRNA","Symbol","Description", "PPT","Cells","Mitophagy") kable(head(ambion.simple, n=4), format = "markdown")
GeneID | siRNA | Symbol | Description | PPT | Cells | Mitophagy |
---|---|---|---|---|---|---|
3998 | s8218 | LMAN1 | lectin, mannose-binding, 1 | -4.822230 | 1.1085178 | 0.0466291 |
51586 | s28366 | MED15 | mediator complex subunit 15 | -4.739415 | 0.2405872 | -0.8887362 |
5217 | s10379 | PFN2 | profilin 2 | -4.539309 | 0.3994161 | -1.7000123 |
57179 | s226909 | KIAA1191 | KIAA1191 | -4.516443 | -1.0274124 | -0.4280631 |
library(knitr, quietly = TRUE) ambion.simple <- ambion.simple %>% set_names(c("GeneID","siRNA","Symbol","Description", "PPT","Cells","Mitophagy")) head(ambion.simple, n = 4) %>% kable(format='markdown')
GeneID | siRNA | Symbol | Description | PPT | Cells | Mitophagy |
---|---|---|---|---|---|---|
3998 | s8218 | LMAN1 | lectin, mannose-binding, 1 | -4.822230 | 1.1085178 | 0.0466291 |
51586 | s28366 | MED15 | mediator complex subunit 15 | -4.739415 | 0.2405872 | -0.8887362 |
5217 | s10379 | PFN2 | profilin 2 | -4.539309 | 0.3994161 | -1.7000123 |
57179 | s226909 | KIAA1191 | KIAA1191 | -4.516443 | -1.0274124 | -0.4280631 |
library(ggplot2, quietly = TRUE) ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_point(alpha=0.5)
ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_point(alpha=0.5) + geom_point(data = ambion.simple %>% filter(Symbol=='PARK2'), #ambion.simple[ambion.simple$Symbol == "PARK2",], color="blue")
ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol=='PARK2'), color="blue")
ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol=="PARK2"), color="red", shape=17, size =5) + ggtitle("PARK2")
description <- ambion.simple$Description[ambion.simple$Symbol == "PARK2"][1] description
## [1] "Parkinson disease (autosomal recessive, juvenile) 2, parkin"
myTitle <- paste("PARK2",description,sep=": ") myTitle
## [1] "PARK2: Parkinson disease (autosomal recessive, juvenile) 2, parkin"
ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol == "PARK2"), color="red", shape=17, size =5) + ggtitle(myTitle)
Now that we have our custom plot looking right, we would like to be able to do the same for other genes but without so much typing. First, make a new R Script in RStudio:
Frequently, making a function will simply be a function of selecting the right parts of your history and hitting the “to source” button.
graphGene <- function(gene) { description <- ambion.simple$Description[ambion.simple$Symbol == "PARK2"][1] myTitle <- paste("PARK2",description,sep=": ") ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol=="PARK2"), color="red", shape=17, size =5) + ggtitle(myTitle) }
graphGene <- function(gene) { description <- ambion.simple$Description[ambion.simple$Symbol == gene][1] myTitle <- paste(gene,description,sep=": ") ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol == gene), color="red", shape=17, size =5) + ggtitle(myTitle) }
graphGene <- function(gene) { if(!is.character(gene)) stop("Need to provide a string") if(!(gene %in% ambion.simple$Symbol)) stop("Need to provide a valid gene") description <- ambion.simple$Description[ambion.simple$Symbol == gene][1] myTitle <- paste(gene,description,sep=": ") ggplot(ambion.simple, aes(x=PPT, y=Cells)) + geom_density2d() + geom_point(data = ambion.simple %>% filter(Symbol == gene), color="red", shape=17, size =5) + ggtitle(myTitle) }
graphGene("PINK1")
pdfGene <- function(gene, file=paste(gene,".pdf", sep="")) { pdf(file, width=5, height=5) graphGene(gene) dev.off() }
We can use the ellipse notation (…) to indicate that extra arguments to our function should be passed on to a function that is inside our function (in this case pdf).
pdfGene <- function(gene, file=paste(gene,".pdf", sep=""), ...) { pdf(file, ...) print(graphGene(gene)) dev.off() } pdfGene("PINK1", width=10, height=10)
We can decide whether something happens in our function using “if” and “if/else”.
sillyFunction <- function(x) { if (x < 5) { returnValue <- x } else { returnValue <- x / 2 } return(returnValue) } sillyFunction(12)
## [1] 6
pdfGenes <- function(genes, ...) { for (gene in genes) { # This will work through gene by gene pdfGene(gene, file = paste0(gene, '.pdf'), ...) } } pdfGenes(c("PLK1","PINK1","BRCA1"))