1. Create a linear model from the “swiss” data set (in the datasets package), predicting Fertility based on Agriculture.
library(datasets)
aggModel <- lm(Fertility ~ Agriculture, data=swiss)
summary(aggModel)
## 
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5374  -7.8685  -0.6362   9.0464  24.4858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.30438    4.25126  14.185   <2e-16 ***
## Agriculture  0.19420    0.07671   2.532   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared:  0.1247, Adjusted R-squared:  0.1052 
## F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492
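
As a quick visual check (not required by the exercise), we could plot the raw data and overlay the fitted line; a minimal sketch in base R:

plot(Fertility ~ Agriculture, data = swiss)  # scatterplot of the raw data
abline(aggModel)                             # overlay the fitted regression line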

Now build a model predicting Fertility based on Education.

educationModel <- lm(Fertility ~ Education, data=swiss)
summary(educationModel)
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.036  -6.711  -1.011   9.526  19.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared:  0.4406, Adjusted R-squared:  0.4282 
## F-statistic: 35.45 on 1 and 45 DF,  p-value: 3.659e-07
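
Education alone explains much more of the variance than Agriculture did. As a sketch, we could pull the R-squared values straight from the two fitted models to compare:

summary(aggModel)$r.squared        # ~0.12, from the output above
summary(educationModel)$r.squared  # ~0.44, from the output above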

And finally, build a model using both variables.

jointModel <- lm(Fertility ~ Education + Agriculture, data=swiss)
summary(jointModel)
## 
## Call:
## lm(formula = Fertility ~ Education + Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3072  -6.6157  -0.9443   8.7028  20.5291 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 84.08005    5.78180  14.542  < 2e-16 ***
## Education   -0.96276    0.18906  -5.092  7.1e-06 ***
## Agriculture -0.06648    0.08005  -0.830    0.411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.479 on 44 degrees of freedom
## Multiple R-squared:  0.4492, Adjusted R-squared:  0.4242 
## F-statistic: 17.95 on 2 and 44 DF,  p-value: 2e-06
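
To line up all three fits at once, one option (not part of the exercise) is to compare their AIC values:

AIC(aggModel, educationModel, jointModel)  # lower AIC = better fit/complexity trade-off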

Use anova to test whether the model combining Agriculture and Education is statistically significantly better than the model with Education alone.

anova(educationModel, jointModel, test="Chisq")
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Education
## Model 2: Fertility ~ Education + Agriculture
##   Res.Df    RSS Df Sum of Sq Pr(>Chi)
## 1     45 4015.2                      
## 2     44 3953.3  1    61.966   0.4063

Since the p-value (0.41) is not small, we can’t conclude that adding the Agriculture variable significantly improves the model.
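
As an aside, for nested Gaussian linear models the default F test (omitting test="Chisq") is the more conventional choice; a sketch, which reaches the same conclusion here:

anova(educationModel, jointModel)  # default F test; its p-value (~0.41) matches the
                                   # t test on the Agriculture coefficient in jointModel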

Finally, use the step function to quickly find an optimal model for predicting Fertility, selecting features from all the predictor variables. The easiest way to do this is to set the “object” argument to a model containing all the other variables (lm(Fertility ~ ., data=swiss)), with no other options specified.

stepModel <- step(lm(Fertility ~ .,data=swiss))
## Start:  AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## - Examination       1     53.03 2158.1 189.86
## <none>                          2105.0 190.69
## - Agriculture       1    307.72 2412.8 195.10
## - Infant.Mortality  1    408.75 2513.8 197.03
## - Catholic          1    447.71 2552.8 197.75
## - Education         1   1162.56 3267.6 209.36
## 
## Step:  AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          2158.1 189.86
## - Agriculture       1    264.18 2422.2 193.29
## - Infant.Mortality  1    409.81 2567.9 196.03
## - Catholic          1    956.57 3114.6 205.10
## - Education         1   2249.97 4408.0 221.43
summary(stepModel)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10

Note that only one variable (Examination) gets dropped from the model.
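
To double-check, a sketch like the following confirms which predictors survived and tests whether dropping Examination significantly hurt the fit:

formula(stepModel)  # Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
anova(stepModel, lm(Fertility ~ ., data = swiss))  # F test on the dropped Examination term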

  1. Using the “read.csv” function, read in a data frame of data about graduate student acceptance (https://stats.idre.ucla.edu/stat/data/binary.csv). Build a logistic regression model that predicts acceptance based on the predictor variables GRE and GPA. Do these two predictor variables seem to have equal importance in the prediction? How do you know? (Adapted from the web tutorial at https://stats.idre.ucla.edu/r/dae/logit-regression/)
gradData <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
summary(glm(formula = admit ~ gre + gpa, family = "binomial", data = gradData))
## 
## Call:
## glm(formula = admit ~ gre + gpa, family = "binomial", data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2730  -0.8988  -0.7206   1.3013   2.0620  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.949378   1.075093  -4.604 4.15e-06 ***
## gre          0.002691   0.001057   2.544   0.0109 *  
## gpa          0.754687   0.319586   2.361   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 480.34  on 397  degrees of freedom
## AIC: 486.34
## 
## Number of Fisher Scoring iterations: 4
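
Since logistic-regression coefficients are on the log-odds scale, a common follow-up is to exponentiate them into odds ratios. A minimal sketch (admitModel is a name introduced here, not part of the original):

admitModel <- glm(admit ~ gre + gpa, family = "binomial", data = gradData)  # refit and save
exp(coef(admitModel))    # odds ratios: multiplicative change in the odds of admission
                         # for a one-unit increase in each predictor
exp(confint(admitModel)) # profile-likelihood confidence intervals on the odds-ratio scale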

Although the parameter estimates for these two variables are very different in size, recall that the variables themselves are on very different scales. To compare, we need to normalize (subtract the mean and divide by the standard deviation; R’s built-in scale() function does the same thing).

gradData$greZ <- (gradData$gre - mean(gradData$gre)) / sd(gradData$gre)
gradData$gpaZ <- (gradData$gpa - mean(gradData$gpa)) / sd(gradData$gpa)
summary(glm(formula = admit ~ greZ + gpaZ, family = "binomial", data = gradData))
## 
## Call:
## glm(formula = admit ~ greZ + gpaZ, family = "binomial", data = gradData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2730  -0.8988  -0.7206   1.3013   2.0620  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8098     0.1120  -7.233 4.74e-13 ***
## greZ          0.3108     0.1222   2.544   0.0109 *  
## gpaZ          0.2872     0.1216   2.361   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 480.34  on 397  degrees of freedom
## AIC: 486.34
## 
## Number of Fisher Scoring iterations: 4

After normalization, the parameter estimates for the two predictors are very close (overlapping when considering the standard errors of the parameters), and so we can conclude that they have roughly equal weight.
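
One way to back up the “overlapping” claim is to compute confidence intervals for the two standardized coefficients; a sketch (zModel is a name introduced here):

zModel <- glm(admit ~ greZ + gpaZ, family = "binomial", data = gradData)  # refit and save
confint(zModel, parm = c("greZ", "gpaZ"))  # the two intervals overlap substantially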