For this homework, submit two files: the R Markdown file that you create to answer the questions, and the knitted HTML file derived from it.

Readings

Our discussion this week will be based on this paper:

  1. "Good enough practices in scientific computing" by Wilson et al. (pdf)

Homework

  1. The mlbench package has several data sets available as exemplar data for machine learning. Please load it in your R session using pacman::p_load('mlbench'). Then enable access to the Pima Indians Diabetes data by running data("PimaIndiansDiabetes2"). You’ll notice that the PimaIndiansDiabetes2 object appears in your Environment pane.

  2. Visualize and assess this data set for missingness patterns using functions from the visdat and naniar packages.

    1. Visualize the data using the vis_dat function.

    2. Visualize missing data patterns using the heatmap and the upset plot. Are there any missing data patterns that stick out?

    3. Provide a plot that shows the percent of data missing by variable.

    4. Provide the same information as a table using miss_var_summary. Please make sure the output is prettified for HTML and is not raw console output.
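
The functions named in this question can be tried out on a different data set first. The sketch below is only an illustration: it assumes the built-in airquality data rather than the Pima data, assumes that "the heatmap" refers to visdat::vis_miss and "the upset plot" to naniar::gg_miss_upset (which needs the UpSetR package installed), and uses knitr::kable as one possible way to prettify a table for HTML.

```r
library(visdat)
library(naniar)
library(knitr)

# airquality ships with base R and has missing values in Ozone and Solar.R
vis_dat(airquality)                        # column types plus missing cells
vis_miss(airquality)                       # heatmap of missing vs. present
gg_miss_upset(airquality)                  # co-occurrence of missingness
gg_miss_var(airquality, show_pct = TRUE)   # percent missing per variable

# Tabular version of the per-variable summary, rendered as an HTML-friendly table
miss_tbl <- miss_var_summary(airquality)
kable(miss_tbl, digits = 1)
```

In an R Markdown chunk each plot prints automatically; in a plain script you would wrap the plotting calls in print().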

  3. Generate a summary of the PimaIndiansDiabetes2 data using skimr::skim. Make sure that this R chunk has the option results='asis' for this question.

  4. Generate another summary table for this data using the gtsummary package.

    1. First use labelled::var_label to set labels for each of the variables. You can continue the following code to label all the variables.
    library(labelled)  # provides var_label()
    var_label(PimaIndiansDiabetes2) <-
      list(pregnant = 'Times pregnant',
           glucose = 'Plasma glucose',
           triceps = 'Triceps skin fold (mm)',
           pressure = 'Diastolic BP (mmHg)'
      )
    2. Then use gtsummary::tbl_summary to create a summary table stratified by diabetes status. See the tbl_summary tutorial for the kinds of options you might add. To find it, go to the Packages pane, click on gtsummary, and then click on “User guides, package vignettes and other documentation.”
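
As a rough illustration of what a stratified summary table looks like, the sketch below uses the trial example data bundled with gtsummary as a stand-in for the Pima data. The choice of columns and of the missing and add_n options is an assumption made for the example, not a required answer.

```r
library(gtsummary)
library(dplyr)

# trial is an example data set shipped with gtsummary (stand-in data)
tbl <- trial %>%
  select(trt, age, grade) %>%
  tbl_summary(
    by = trt,           # column that defines the strata
    missing = "ifany"   # add a missing-value row only where values are missing
  ) %>%
  add_n()               # add a column with the number of non-missing values

tbl
```

Variable labels set with labelled::var_label are picked up automatically as row labels.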
  5. The msleep data is included in ggplot2. Create a single pipeline that

    1. filters the data to include just carnivores and omnivores,
    2. creates a new variable that identifies which quintile of the body weight distribution each animal falls in (see quantile and cut; quintiles divide data into fifths). Call this variable bodywt_cat. [Be careful of creating missing values when you discretize. Identify and solve that problem too.]
    3. imputes the missing values in sleep_rem with the median of the bodywt_cat and vore group each animal falls in,
    4. plots boxplots of sleep_rem by bodywt_cat, faceted by vore.
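
The imputation step relies on a group-wise median. A minimal sketch of that one technique on made-up data is shown below; the tibble toy and its columns grp and x are hypothetical and stand in for the grouping variables and sleep_rem.

```r
library(dplyr)

# Hypothetical toy data: one grouping column and one column with gaps
toy <- tibble(
  grp = c("a", "a", "a", "b", "b", "b"),
  x   = c(1, NA, 3, 10, 20, NA)
)

imputed <- toy %>%
  group_by(grp) %>%
  # within each group, replace NA by that group's median of the observed values
  mutate(x = ifelse(is.na(x), median(x, na.rm = TRUE), x)) %>%
  ungroup()

imputed
```

Grouping by more than one variable works the same way: list all of them in group_by().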
  6. Install the limma package from Bioconductor using the following code:
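
A common way to install a Bioconductor package is sketched here, on the assumption that BiocManager is the intended installer; it is itself installed from CRAN if not already present.

```r
# Assumes BiocManager is the intended route to Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("limma")
```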

    1. Load the BreastCancer_Clinical.csv file and call it brca. Use janitor::clean_names to update the column names.

    2. Create a new data.frame brca1 that contains the columns complete_tcga_id, er_status, pr_status and her2_final_status from brca.

    3. Use rowwise() to create a new variable triple_neg which is ‘Yes’ if all the ER, PR and HER2 statuses are negative, and ‘No’ otherwise. I suggest first converting the ER, PR, and HER2 statuses to 0/1 variables (0 = “Negative”), and then creating triple_neg based on whether the sum of the statuses is 0 or not.

    4. Report the number of triple negative cancers.

    5. limma has two functions, vennCounts and vennDiagram. Pass the ER, PR and HER2 columns from the transformed brca1 dataset from step 3 into vennCounts and pipe the results into vennDiagram. This will create a Venn diagram for the ER, PR and HER2 status. Verify that the count of triple negatives from the Venn diagram (the number outside the circles but inside the rectangle) matches what you found in step 4.
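
The mechanics of rowwise() and of the two limma functions can be seen on made-up data. In the sketch below, the tibble toy and its 0/1 columns A, B and C are hypothetical stand-ins for the recoded status columns, and none plays the role of triple_neg.

```r
library(dplyr)
library(limma)

# Hypothetical 0/1 indicators; rows 1 and 4 are zero in every column
toy <- tibble(
  A = c(0, 1, 1, 0, 0),
  B = c(0, 1, 0, 0, 1),
  C = c(0, 0, 1, 0, 1)
)

toy <- toy %>%
  rowwise() %>%
  # the sum is taken across the columns of each row, not down a column
  mutate(none = ifelse(sum(c_across(A:C)) == 0, "Yes", "No")) %>%
  ungroup()

# vennCounts tabulates every 0/1 combination; vennDiagram draws the result
counts <- toy %>%
  select(A, B, C) %>%
  vennCounts()

vennDiagram(counts)
```

The first row of counts holds the all-zero combination, which is the number drawn outside the circles, so it can be compared directly with the number of "Yes" values in none.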