Instructions:

I suggest you start working on this week’s assignment using just a R script file, and work out the different things you have to do using just R code.

The submission, of course, has to be a R Markdown file and the corresponding knitted HTML file. You will create this R Markdown file on your own, adding R chunks in it with code copied from your scirpt file. You also have to write in the question numbers and a minimal narrative in markdown as part of the R Markdown file, so that it reads more like a report.

The submission will consist of your R markdown file and knitted HTML file. You have to make sure that the HTML file is properly rendered. Both files are required, and omitting one is an incomplete assignment

Also, please remove any and all template-based text from the R markdown file before you submit it. We don’t want to see the template materials in any submissions.

Reading

  1. Please read Chapter 3 of R4DS. This is a really nice reference. If you’re interested, look at section 3.7 for a deeper understanding of how different statistical transformations can be accommodated within different geometries in ggplot2.

Exercises

  1. Let’s look at the breast cancer data set we’be been using.

    1. Create a bar plot of the frequencies of ER status (Hint: geom_bar. How would you modify the dataset so that there is no separate bar for NA’s?
    2. Color each bar by the frequencies of HER2 status within each group. (Hint: you can add aes(color='HER2.Status') or aes(fill='HER2.Status') to the geom_bar statement. What is the difference in these two choices?)
    3. I would like to see a graph showing the relative frequencies of HER2 status within each ER group. Read the documentation ?geom_bar as well as section 3.8 of R4DS to find this solution.
  2. The diamonds dataset is included in the ggplot2 package.
    1. Add a column to the diamonds dataset that is the volume of each diamond (multiplying the x, y and z variables)
    2. Plot a scatter plot of diamond price against diamond volume
    3. Modify the plot to add colors based on the cut of each diamond
    4. Modify the previous plot to construct a facetted plot where each facet is based on a color.
  3. The following links provide data for U.S. incidence rates per 100,000, standardized to the 2000 standard U.S. population, for brain, colon, esophageal, lung and oral cancers for the period 1975-2016. These data/HW6 are provided from the SEER program.

    Our goal is to create a single graphic showing the patterns of incidence rates over this time period for the 5 cancers, something like the one below:

    1. Each csv file is in the same format. Open one in a text editor like Notepad (Windows) or TextEdit (Mac) to see lines you may need to skip. (I don’t suggest Excel since it has a habit of changing formating and adding columns when you save back into csv). Then read each of the five files into R. For convenience I’ll refer to these datasets as brain, colon, esophagus, lung and oral.
      • Confirm that each dataset has 43 rows and 10 columns
    2. Create three separate datasets for each site, one for both sexes, one for males and one for females. Each dataset should still have year of diagnosis. So each dataset should have 43 rows and 4 columns.
  4. For the next few points, I will refer to the both sexes datasets. You’ll do the same for the male and female datasets.

    1. Each dataset has the same column headers. We’ll be joining the data sets into a single data set using the year of diagnosis as the key, i.e. the common variable that will determine the joins. This means we have to make the column names for the different data sets (except the year column) distinctive. Change the column names replacing both_sexes with the names of the cancer sites
      • Hints: names(brain) gives the column names for the brain data. You can also change column names using names(brain) <- ... since this is just a vector. In particular, you can try something like names(brain) <- stringr::str_replace(names(brain), 'both_sexes', 'brain')
  5. Create a new composite dataset by repeatedly using left_join or inner_join to add each site-specific dataset to the composite data set. What I mean is, create a dataset joining A and B, then join C to the result, then join D to the result and so on.

    1. For each of these datasets:
      1. Remove the row with the overall 1975-2016 rates
      2. Use pivot_longer to make a dataset with 3 columns: year, type of cancer and cancer incidence rate
  6. Create 3 plots like the one above, one for all races, one for whites and one for blacks. Assign the ggplot code for each to a name, i.e. something like plt1 <- ggplot(...) + .... Display the graph for all races, and create and display a panelled plot where the white and black plots are presented side-by-side. You can use functions from cowplot, ggpubr or patchwork as you like.

    1. Make sure you have labelled everything properly, including titles, so that the graphical information is self-evident.