Create a standardize_mean function that standardizes a vector by subtracting its mean and dividing by its standard deviation.
standardize_mean <- function(x) {
  return((x - mean(x)) / sd(x))
}
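As a quick check on a made-up vector, the result should have mean 0 and standard deviation 1:
x <- c(2, 4, 6, 8, 10)
z <- standardize_mean(x)
mean(z)  # 0, up to floating-point error
sd(z)    # 1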
The robust counterparts of the mean and standard deviation are the median and the median absolute deviation (computed in R by median and mad), respectively. Create a standardize_median function.
standardize_median <- function(x){
  return((x - median(x)) / mad(x))
}
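To see why the robust version can be preferable, compare the two on a made-up vector with an outlier:
x <- c(1, 2, 3, 4, 100)
standardize_mean(x)    # the outlier inflates sd(x), compressing the other values toward 0
standardize_median(x)  # median() and mad() are barely affected by the outlier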
Update standardize_mean so that it handles missing values:
standardize_mean <- function(x){
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}
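For example, on a small made-up vector the earlier version would return all NA's, while this one standardizes the non-missing values:
x <- c(1, 2, NA, 4)
standardize_mean(x)  # standardized values, with NA preserved in place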
Create a standardize function that takes as inputs a vector of numbers, an element named method that will be either “mean” or “median”, any other inputs you like, and outputs the corresponding standardized variable.
standardize <- function(x, method = 'mean'){
  if (method == 'mean') {
    # return() is needed here; without it the function falls through to the
    # second if, which evaluates to NULL when method is 'mean'
    return(standardize_mean(x))
  }
  if (method == 'median') {
    return(standardize_median(x))
  }
}
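Usage is then, for example (again with made-up data):
x <- c(3, 1, 4, 1, 5)
standardize(x)                     # uses the mean/sd version
standardize(x, method = 'median')  # uses the median/mad version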
A lot of metrics for binary outcomes are derived from a contingency table, which is a 2 x 2 table of frequency counts comparing the true outcome with the predicted outcome. This can be obtained using table(truth, predicted), where truth and predicted are provided by the user and are vectors of equal length comprising 1’s and 0’s.
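For instance, with two short made-up vectors:
truth <- c(1, 0, 1, 1, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1)
table(truth, predicted)  # rows are truth (0/1), columns are predicted (0/1)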
The package mlbench has many machine learning datasets. Install and load the mlbench package and then access the Pima Indians Diabetes dataset.
# Install mlbench only if it isn't already available
if (!('mlbench' %in% installed.packages()[,1])) install.packages('mlbench', repos = 'http://cran.rstudio.com')
library(mlbench)
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
Fit a logistic model to this data using all the predictors.
logistic_model <- glm(diabetes ~ ., data = PimaIndiansDiabetes, family ='binomial')
Grab the predictions from this model, and dichotomize them to 0/1.
predicted <- ifelse(predict(logistic_model, type = 'response') > 0.5, 1, 0)
You can then create a contingency table comparing observed to predicted data, after converting the outcome (diabetes) to a 0/1 variable.
PimaIndiansDiabetes$diabetes <- ifelse(PimaIndiansDiabetes$diabetes == 'pos',1,0)
t1 <- table(PimaIndiansDiabetes$diabetes, predicted)
t1
##    predicted
##       0   1
##   0 445  55
##   1 112 156
In the following exercises, the inputs to the functions should be the observed diabetes data and the predictions from the logistic model, either dichotomized or as probabilities, as appropriate. You can verify your numbers with the table above.
# Accuracy: proportion of predictions that match the truth
accuracy_score <- function(truth, predicted){
  mean(truth == predicted)
}
# Sensitivity (true positive rate): proportion of true 1's predicted as 1
sens_score <- function(truth, predicted){
  sum(truth == 1 & predicted == 1) / sum(truth == 1)
}
# Specificity (true negative rate): proportion of true 0's predicted as 0
spec_score <- function(truth, predicted){
  sum(truth == 0 & predicted == 0) / sum(truth == 0)
}
# Positive predictive value: proportion of predicted 1's that are truly 1
ppv <- function(truth, predicted){
  sum(predicted == 1 & truth == 1) / sum(predicted == 1)
}
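As a sanity check, the values these return should match the table above: accuracy is (445 + 156)/768, sensitivity 156/268, specificity 445/500, and PPV 156/211.
accuracy_score(PimaIndiansDiabetes$diabetes, predicted)  # (445 + 156)/768 = 0.782
sens_score(PimaIndiansDiabetes$diabetes, predicted)      # 156/268 = 0.582
spec_score(PimaIndiansDiabetes$diabetes, predicted)      # 445/500 = 0.89
ppv(PimaIndiansDiabetes$diabetes, predicted)             # 156/211 = 0.739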
Create a roc function that computes the sensitivity and specificity of the predicted probabilities over a grid of thresholds. You can use the functions you created above.
roc <- function(truth, predicted){
  thresholds <- seq(0, 1, by = 0.05)
  out <- data.frame('threshold' = thresholds, 'sensitivity' = 0, 'specificity' = 0)
  for (i in seq_along(thresholds)) {
    # dichotomize the predicted probabilities at each threshold
    pred <- ifelse(predicted > thresholds[i], 1, 0)
    out[i, 'sensitivity'] <- sens_score(truth, pred)
    out[i, 'specificity'] <- spec_score(truth, pred)
  }
  return(out)
}
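For example, using the predicted probabilities from the logistic model fit earlier:
probs <- predict(logistic_model, type = 'response')
head(roc(PimaIndiansDiabetes$diabetes, probs))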
Create a rocplot function that plots specificity against 1 - sensitivity.
rocplot <- function(truth, predicted){
  require(ggplot2)
  ggplot(roc(truth, predicted), aes(1 - sensitivity, specificity)) +
    geom_line() +
    labs(x = '1 - Sensitivity', y = 'Specificity')
}
rocplot(PimaIndiansDiabetes$diabetes, predict(logistic_model, type='response'))
## Loading required package: ggplot2
Bonus question: Add a feature to the rocplot function to compute the “area under the curve”, that is, the area between the curve, the x-axis, and the right edge of the plot. Hint: The area of a trapezium with parallel sides of length a and b, and height h, is h x (a + b)/2.
rocplot2 <- function(truth, predicted, auc = FALSE){
  require(ggplot2)
  d <- roc(truth, predicted)
  plt <- ggplot(d, aes(1 - sensitivity, specificity)) +
    geom_line() +
    labs(x = '1 - Sensitivity', y = 'Specificity')
  print(plt)
  if (auc) {
    # Trapezium rule along the x-axis (1 - sensitivity): each strip has
    # height h = x[i+1] - x[i] and parallel sides y[i] and y[i+1]. Note
    # that the x-spacing is not uniform, so h must come from diff(x)
    # rather than the constant 0.05 threshold step.
    x <- 1 - d$sensitivity
    y <- d$specificity
    approx_auc <- sum(diff(x) * (y[-1] + y[-length(y)]) / 2)
    print(approx_auc)
  }
}
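Calling it with auc = TRUE draws the plot and prints the approximate area:
rocplot2(PimaIndiansDiabetes$diabetes, predict(logistic_model, type = 'response'), auc = TRUE)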