For this homework, submit two files: the R Markdown file you create to answer the questions, and the HTML file knitted from it.
Use the PimaIndiansDiabetes2 dataset from the mlbench package. Your objective is to analyze the factors that are associated with diabetes in this data.
Explore the dataset to see whether there are any variables you might need to omit. Justify your decisions.
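One common reason to omit a variable is a large share of missing values. The sketch below shows one way to check for that, using a small made-up data frame (the name toy and its values are assumptions for illustration, not the real data).

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  glucose = c(148, 85, NA, 89, 137),
  insulin = c(NA, NA, NA, 94, 168),
  age     = c(50, 31, 32, 21, 33)
)

# Share of missing values in each column, largest first
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(missing_share, 2))

# Number of rows that would remain if every incomplete row were dropped
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

Comparing the two numbers helps weigh dropping a sparse column against dropping the rows with missing values.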
Fit a logistic regression with diabetes as the dependent variable and the variables you chose to keep as the independent variables.
Create a properly formatted table (not raw output) of the results (slopes, confidence intervals, and p-values).
Present these results graphically.
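The fitting and tabulating steps above can be sketched in base R as follows. The data are simulated, and the names toy, x1, x2 and y are hypothetical. The interval shown is the Wald interval from confint.default, which is one possible choice rather than a required one.

```r
# Simulated data; the variable names x1, x2 and y are hypothetical
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(rbinom(n, 1, plogis(-0.5 + 1.2 * x1)),
             levels = c(0, 1), labels = c("neg", "pos"))
toy <- data.frame(y, x1, x2)

# Logistic regression: with a factor response, glm() treats the
# first level ("neg") as failure and models the other level
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Collect estimates, Wald 95% confidence intervals and p-values
est <- coef(summary(fit))
ci  <- confint.default(fit)  # confint(fit) would give profile-likelihood intervals
results <- data.frame(
  term      = rownames(est),
  estimate  = est[, "Estimate"],
  conf_low  = ci[, 1],
  conf_high = ci[, 2],
  p_value   = est[, "Pr(>|z|)"],
  row.names = NULL
)
print(results)
```

A data frame like results can be passed to a table formatter such as knitr::kable, and the same columns can feed a plot of estimates with interval bars. gtsummary::tbl_regression is another route to a formatted table.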
Load this data into a list of data frames, one per president.
Combine this data into a single data frame, with a column designating the president. Also, clean the column names.
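A minimal sketch of the combining step, assuming the list is named by president. The list polls, its element names, and its values are made up for illustration. dplyr::bind_rows and janitor::clean_names are one possible pair of tools for this.

```r
library(dplyr)
library(janitor)

# Toy stand-in for the real input: a named list of data frames (made-up values)
polls <- list(
  "President A" = data.frame(`Start Date` = "2001-02-01", Approving = 57,
                             check.names = FALSE),
  "President B" = data.frame(`Start Date` = "2009-01-21", Approving = 68,
                             check.names = FALSE)
)

# bind_rows() stacks the list; .id stores each element's name in a new column
combined <- bind_rows(polls, .id = "president") %>%
  clean_names()  # e.g. "Start Date" becomes "start_date"

print(combined)
```

If the list has no names, the .id column holds the element positions instead, so naming the list first keeps the president labels.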
If you look at the structure of the data, you'll see that start_date and end_date are datetime objects, so they are treated as dates in R. Plot the approval percent at each end date against the date as a line graph, colored by which president is in office at the time. Do not make a legend.
Plot the percentage of uncertain respondents by end date, similar to the approval plot in the previous part. Add dotted vertical lines at each inauguration date (in the US, inaugurations occur on January 20 of the year following a presidential election, and election years are the years divisible by 4; you can count back from 2020 if you like).
Use gtsummary::tbl_summary to create a table summarizing approving, disapproving and unsure percentages by president. Transform the original dataset so that this table has columns in chronological order.
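Grouped output is usually ordered by factor level, so one way to get chronological order is to re-level the president column by each president's earliest poll date. The sketch below uses made-up data. The president names Young and Adams are hypothetical and chosen so that chronological and alphabetical order differ.

```r
# Toy data (made-up values); "Young" is polled before "Adams"
approve_toy <- data.frame(
  president  = c("Adams", "Young", "Adams", "Young"),
  start_date = as.Date(c("2009-02-01", "2001-02-01", "2010-02-01", "2002-02-01"))
)

# Earliest poll date per president, sorted from earliest to latest
first_poll <- sort(tapply(approve_toy$start_date, approve_toy$president, min))

# Re-level the factor so its levels follow that chronological order
approve_toy$president <- factor(approve_toy$president, levels = names(first_poll))
print(levels(approve_toy$president))
```

forcats::fct_reorder(president, start_date, .fun = min) is a tidyverse alternative that should give the same ordering.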
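Create a new column for days in office using the following code (I named my data frame approve, and my variable names might be slightly different from yours):
library(dplyr)      # provides %>%, group_by() and mutate()
library(lubridate)  # provides as.duration() and ddays()
approve <- approve %>%
  group_by(president) %>%
  # time elapsed since each president's first poll start date
  mutate(days = end_date - min(start_date)) %>%
  # convert that time difference to a plain number of days
  mutate(days = as.duration(days) / ddays(1))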
Then create a faceted plot showing approval percent against days in office by president, and add a regression line and the correlation coefficient to each plot. You should probably throw out President Roosevelt, since his term was before presidential term limits were implemented. The facets should follow chronological order.
Hint: you can create a vector of inauguration dates with lubridate::as_datetime. The lubridate package is the tidyverse package for working with dates. You can then pass that vector of inauguration dates into geom_vline as the xintercept argument.
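The hint can be sketched as follows. The 1980 to 2016 range of election years and the toy data frame (its name, columns, and values) are assumptions for illustration; adjust the range to cover the presidents in your data.

```r
library(ggplot2)
library(lubridate)

# Election years are divisible by 4; inauguration is January 20 of the next year.
# The 1980-2016 range is an arbitrary choice for illustration.
election_years <- seq(1980, 2016, by = 4)
inaugurations  <- as_datetime(paste0(election_years + 1, "-01-20"))

# Toy polling data (made-up values) with a datetime x axis
toy <- data.frame(
  end_date = as_datetime(c("1982-06-01", "1995-06-01", "2010-06-01")),
  unsure   = c(12, 9, 7)
)

# One dotted vertical line is drawn per element of the xintercept vector
p <- ggplot(toy, aes(x = end_date, y = unsure)) +
  geom_line() +
  geom_vline(xintercept = inaugurations, linetype = "dotted")
print(p)
```

The vector is built with as_datetime rather than as_date so that it matches the datetime type of the x axis.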