library(datasets)
aggModel <- lm(Fertility ~ Agriculture, data=swiss)
summary(aggModel)
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5374 -7.8685 -0.6362 9.0464 24.4858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.30438 4.25126 14.185 <2e-16 ***
## Agriculture 0.19420 0.07671 2.532 0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
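As a quick check on that slope, confint() turns the estimate and standard error into a confidence interval (a minimal sketch using the model fit above):
# 95% confidence interval for the Agriculture slope; it should exclude zero,
# consistent with the p-value of 0.0149 in the summary
confint(aggModel, "Agriculture", level = 0.95)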
Now build a model predicting Fertility from Education alone.
educationModel <- lm(Fertility ~ Education, data=swiss)
summary(educationModel)
##
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.036 -6.711 -1.011 9.526 19.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.6101 2.1041 37.836 < 2e-16 ***
## Education -0.8624 0.1448 -5.954 3.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
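Education has a much stronger (and negative) relationship with Fertility. A minimal sketch of a base-graphics plot (labels are my own choices) makes the fit visible:
# Scatterplot of the raw data with the fitted regression line overlaid
plot(swiss$Education, swiss$Fertility, xlab = "Education", ylab = "Fertility")
abline(educationModel, col = "red")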
And finally, fit a model using both variables.
jointModel <- lm(Fertility ~ Education + Agriculture, data=swiss)
summary(jointModel)
##
## Call:
## lm(formula = Fertility ~ Education + Agriculture, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3072 -6.6157 -0.9443 8.7028 20.5291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.08005 5.78180 14.542 < 2e-16 ***
## Education -0.96276 0.18906 -5.092 7.1e-06 ***
## Agriculture -0.06648 0.08005 -0.830 0.411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.479 on 44 degrees of freedom
## Multiple R-squared: 0.4492, Adjusted R-squared: 0.4242
## F-statistic: 17.95 on 2 and 44 DF, p-value: 2e-06
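Notice that the Agriculture coefficient flips sign: +0.194 on its own, but -0.066 (and non-significant) once Education is in the model. A quick side-by-side look at the two fits above makes this explicit:
# Agriculture's slope with and without Education in the model
coef(aggModel)["Agriculture"]    # positive when Agriculture is the only predictor
coef(jointModel)["Agriculture"]  # negative, and no longer significant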
Use anova() to test whether the combination of Agriculture and Education is statistically significantly better than Education alone.
anova(educationModel, jointModel, test="Chisq")
## Analysis of Variance Table
##
## Model 1: Fertility ~ Education
## Model 2: Fertility ~ Education + Agriculture
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 45 4015.2
## 2 44 3953.3 1 61.966 0.4063
Since the p-value (0.41) is not small, we cannot conclude that adding the Agriculture variable significantly improves the model.
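As an aside, for nested linear models the default F test tells the same story; a minimal sketch:
# Same nested-model comparison with the default F test; expect a similarly
# large p-value
anova(educationModel, jointModel)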
Finally, use the step() function to quickly find an optimal model for predicting Fertility, selecting features from all of the available predictors. The easiest way to do this is to set the object argument to a model containing all the other variables (lm(Fertility ~ ., data = swiss)) and leave the remaining options at their defaults.
stepModel <- step(lm(Fertility ~ .,data=swiss))
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
summary(stepModel)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Note that only one variable, Examination, is dropped from the model.
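If you just want the selected model without the printed trace, trace = 0 silences step(); a minimal sketch (quietStep is my own name) that also compares AIC directly:
# Rerun the selection silently, then compare AICs of the full and reduced models.
# Note that AIC() differs from the extractAIC() values step() prints by an
# additive constant, but the ordering of models is the same.
quietStep <- step(lm(Fertility ~ ., data = swiss), trace = 0)
AIC(lm(Fertility ~ ., data = swiss))  # full model
AIC(quietStep)                        # model with Examination dropped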
Linear models assume a continuous outcome; for a binary outcome, use logistic regression. Read in the UCLA graduate admissions dataset and model admission status as a function of GRE score and GPA with glm() and family = "binomial".
gradData <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
summary(glm(formula = admit ~ gre + gpa, family = "binomial", data = gradData))
##
## Call:
## glm(formula = admit ~ gre + gpa, family = "binomial", data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2730 -0.8988 -0.7206 1.3013 2.0620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.949378 1.075093 -4.604 4.15e-06 ***
## gre 0.002691 0.001057 2.544 0.0109 *
## gpa 0.754687 0.319586 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 480.34 on 397 degrees of freedom
## AIC: 486.34
##
## Number of Fisher Scoring iterations: 4
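Logistic-regression coefficients live on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. A minimal sketch (admitModel is my own name for the same fit):
# Store the fit, then convert log-odds coefficients to odds ratios:
# each value is the multiplicative change in the odds of admission per
# one-unit increase in that predictor
admitModel <- glm(admit ~ gre + gpa, family = "binomial", data = gradData)
exp(coef(admitModel))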
Although the parameter estimates for these two variables differ greatly in size, recall that the predictors are on very different scales. To compare them, we need to normalize each predictor (subtract its mean and divide by its standard deviation).
gradData$greZ <- (gradData$gre - mean(gradData$gre)) / sd(gradData$gre)
gradData$gpaZ <- (gradData$gpa - mean(gradData$gpa)) / sd(gradData$gpa)
summary(glm(formula = admit ~ greZ + gpaZ, family = "binomial", data = gradData))
##
## Call:
## glm(formula = admit ~ greZ + gpaZ, family = "binomial", data = gradData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2730 -0.8988 -0.7206 1.3013 2.0620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8098 0.1120 -7.233 4.74e-13 ***
## greZ 0.3108 0.1222 2.544 0.0109 *
## gpaZ 0.2872 0.1216 2.361 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 480.34 on 397 degrees of freedom
## AIC: 486.34
##
## Number of Fisher Scoring iterations: 4
After normalization, the parameter estimates for the two predictors are very close (overlapping, once the standard errors of the estimates are taken into account), so we can conclude that the two predictors carry roughly equal weight.
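As a side note, base R's scale() performs the same z-scoring in one call; a minimal sketch checking it against the manual version above:
# scale() centers and rescales; as.numeric() drops its matrix attributes
all.equal(as.numeric(scale(gradData$gre)), gradData$greZ)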