In this homework you will have to submit 2 files: the R Markdown file that you will create to answer this homework, and the knitted HTML file derived from it.

  1. Start with the PimaIndiansDiabetes2 dataset from the mlbench package. Your objective is to analyze the factors that are associated with getting diabetes in this data.
    1. Explore the dataset to see if there are any variables that you might need to omit. Justify your decisions

    2. Fit a logistic regression with diabetes as the dependent variable and the rest of the variables you chose to keep in as the independent variables.

    3. Create a proper formatted table (not raw output) of the results (slopes, confidence intervals, and p-values)

    4. Present these results graphically

  2. We have presidential approval ratings going back to Franklin D. Roosevelt available from the American Presidency Project. Download this data (stored in an Excel file) from the class data link.
    1. Load this data into a list of data frames, one per president

    2. Combine this data into a single data frame, with a column designating the president. Also, clean the column names.

    3. If you look at the structure of the data, you’ll see that start_date and end_date are datetime objects, so they are treated as dates in R. Plot the approval percent at each end date against the date as a line graph, colored by which president is in office at the time. Do not make a legend.

    4. Plot the percentage of uncertain respondents by end date, similar to c. Put dotted vertical lines corresponding to each inauguration date (inaugurations in the US occur on January 20 of the year following a presidential election, and the election years are years divisible by 4; you can count back from 2020 if you like).

    5. Use gtsummary::tbl_summary to create a table summarizing approving, disapproving and unsure percentages by president. Transform the original dataset so that this table has columns in chronological order.

    6. Create a new column for days in office using the following code (I named my data frame approve, and my variable names might be slightly different from yours.)

      library(lubridate)
      approve <- approve %>% 
          group_by(president) %>% 
          mutate(days = end_date - min(start_date)) %>% 
          mutate(days = as.duration(days)/ddays(1))

      Then create a facetted plot showing approval percent against days in office by president, and add a regression line and the correlation coefficient to each plot. You should probably throw out President Roosevelt since his term was before presidential term limits were implemented. The facets should follow chronological order.


For question 2d, you can create a compatible vector of inauguration dates, by formatting strings in the form “YYYY-MM-DD”, and then passing them through lubridate::as_datetime. The lubridate package is the tidyverse package that addresses dates. You can then pass that vector of inauguration dates into geom_vline as the xintercept argument.