Homework 3

Exercises

Let’s look at the breast cancer data set we’ve been using.

library(tidyverse) # provides %>%, dplyr and ggplot2
library(rio)
library(here)
brca <- import(here('data/brca_data.rds')) %>% 
  janitor::clean_names()

Create a bar plot of the frequencies of ER status. (Hint: geom_bar.) How would you modify the dataset so that there is no separate bar for NA’s?
```
ggplot(brca, aes(x = er_status))+
  geom_bar()
```
```
ggplot(brca %>% filter(!is.na(er_status)),
       aes(x = er_status))+
  geom_bar()
```
Color each bar by the frequencies of HER2 status within each group. (Hint: you can add aes(color = her2_final_status) or aes(fill = her2_final_status) to the geom_bar statement. What is the difference between these two choices?)
```
ggplot(brca %>% filter(!is.na(er_status), !is.na(her2_final_status)),
       aes(x = er_status, fill=her2_final_status))+
  geom_bar(position = 'stack') # This is the default
```
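On the difference between the two choices in the hint: for bars, color sets the outline while fill sets the interior. A minimal sketch follows, using the mpg data that ships with ggplot2 as a stand-in for brca (an assumption, made only so the example runs on its own).

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for brca here (assumption)

# color = drv changes only the outline of each bar segment
p_color <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(color = drv))

# fill = drv changes the interior of each bar segment
p_fill <- ggplot(mpg, aes(x = class)) +
  geom_bar(aes(fill = drv))

p_color
p_fill
```

For a stacked bar chart, fill is usually the one you want, since outlines alone are hard to read.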

I would like to see a graph showing the relative frequencies of HER2 status within each ER group. Read the documentation ?geom_bar as well as section 3.8 of R4DS to find this solution.

ggplot(brca %>% filter(!is.na(er_status), !is.na(her2_final_status)),
       aes(x = er_status, fill=her2_final_status))+
  geom_bar(position = 'fill')+ # Rescale each bar to proportions (not the default)
  scale_y_continuous('Percent', labels = scales::label_percent())+
  labs(x = 'ER status', fill = 'HER2 status')
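The proportions that position = 'fill' draws can also be computed directly, which is a useful check on the plot. This is a sketch on a small made-up data frame (toy and its values are assumptions, not the brca data):

```r
library(dplyr)

# Toy stand-in for brca; the values are made up for illustration
toy <- data.frame(
  er_status = c('Positive', 'Positive', 'Positive', 'Negative', 'Negative'),
  her2_final_status = c('Negative', 'Negative', 'Positive', 'Negative', 'Positive')
)

rel_freq <- toy %>%
  count(er_status, her2_final_status) %>%  # counts per combination
  group_by(er_status) %>%
  mutate(prop = n / sum(n)) %>%            # proportions within each ER group
  ungroup()

rel_freq
```

Within each ER group the prop values sum to 1, which is what the filled bars show.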

The diamonds dataset is included in the ggplot2 package.
1. Add a column to the diamonds dataset that is the volume of each diamond (multiplying the x, y and z variables)
```
diamonds1 <- diamonds %>% mutate(volume = x * y * z)
```
2. Plot a scatter plot of diamond price against diamond volume
```
ggplot(diamonds1, aes(x=volume, y=price))+
  geom_point()
```
3. Modify the plot to add colors based on the cut of each diamond
```
ggplot(diamonds1, aes(x=volume, y = price))+
  geom_point(aes(color=cut))
```
4. Modify the previous plot to construct a faceted plot where each facet corresponds to one of the colors in the previous plot (that is, one cut).
```
ggplot(diamonds1, aes(x = volume, y = price))+
  geom_point()+
  facet_wrap(~cut)
```

The following links provide data for U.S. incidence rates per 100,000, standardized to the 2000 standard U.S. population, for brain, colon, esophageal, lung and oral cancers for the period 1975-2016. These data (data/HW6) are provided by the SEER program.

Our goal is to create a single graphic showing the patterns of incidence rates over this time period for the 5 cancers, something like the one below:

Each csv file is in the same format. Open one in a text editor like Notepad (Windows) or TextEdit (Mac) to see which lines you may need to skip. (I don’t suggest Excel, since it has a habit of changing formatting and adding columns when you save back to csv.) Then read each of the five files into R. For convenience I’ll refer to these datasets as brain, colon, esophagus, lung and oral.
- Confirm that each dataset has 43 rows and 10 columns

library(tidyverse)
library(rio)
library(here)

brain <- import(here('data/Brain.csv'), skip=4)
colon <- import(here('data/Colon.csv'), skip=4)
esophagus <- import(here('data/Esophagus.csv'), skip=4)
lung <- import(here('data/Lung.csv'), skip=4)
oral <- import(here('data/Oral.csv'), skip=4)

str(brain)

## 'data.frame':    43 obs. of  10 variables:
##  $ Year of Diagnosis   : chr  "1975-2016" "1975" "1976" "1977" ...
##  $ All Races,Both Sexes: num  6.59 5.85 5.82 6.17 5.76 6.12 6.3 6.51 6.42 6.31 ...
##  $ All Races,Males     : num  7.88 6.84 7.14 7.76 6.79 7.42 7.58 8.07 7.93 7.6 ...
##  $ All Races,Females   : num  5.51 5.01 4.68 4.89 4.91 5.01 5.24 5.2 5.24 5.19 ...
##  $ Whites,Both Sexes   : num  7.22 6.21 6.18 6.6 6.1 6.6 6.81 6.9 6.92 6.88 ...
##  $ Whites,Males        : num  8.61 7.31 7.51 8.26 7.19 8.03 8.2 8.44 8.57 8.2 ...
##  $ Whites,Females      : num  6.04 5.28 5.03 5.27 5.19 5.37 5.65 5.63 5.64 5.74 ...
##  $ Blacks,Both Sexes   : num  4.08 4.14 3.32 3.55 3.86 3.69 3.14 5.02 3.71 2.75 ...
##  $ Blacks,Males        : num  4.79 4.31 5.37 5.17 4.34 4.19 3.35 7.24 4.4 3.79 ...
##  $ Blacks,Females      : chr  "3.51" "3.88" "-" "2.47" ...

str(colon)

## 'data.frame':    43 obs. of  10 variables:
##  $ Year of Diagnosis   : chr  "1975-2016" "1975" "1976" "1977" ...
##  $ All Races,Both Sexes: num  52.1 59.5 61.3 62.4 62 ...
##  $ All Races,Males     : num  60.9 68.5 71.7 73.5 72.3 ...
##  $ All Races,Females   : num  45.5 53.7 54.1 55 55.2 ...
##  $ Whites,Both Sexes   : num  52.2 60.2 62.2 63.2 62.8 ...
##  $ Whites,Males        : num  60.9 69.8 72.7 74.3 73.9 ...
##  $ Whites,Females      : num  45.5 54.1 55 55.8 55.6 ...
##  $ Blacks,Both Sexes   : num  58 56.9 55 60.8 62.2 ...
##  $ Blacks,Males        : num  67.1 59.3 58.5 75.8 65.8 ...
##  $ Blacks,Females      : num  52 54.7 51.3 51.5 60.6 ...
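Rather than reading the str() output for each dataset, the dimensions can be checked in one go. This sketch uses placeholder data frames of the expected shape (an assumption); in the homework the list would hold the imported brain, colon, esophagus, lung and oral data frames.

```r
# Stand-ins with the expected shape; replace with the imported data frames
datasets <- list(
  brain = as.data.frame(matrix(0, nrow = 43, ncol = 10)),
  colon = as.data.frame(matrix(0, nrow = 43, ncol = 10))
)

dims <- sapply(datasets, dim)  # 2 x k matrix: row counts, then column counts
dims

# Stops with an error if any dataset has the wrong shape
stopifnot(all(dims[1, ] == 43), all(dims[2, ] == 10))
```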

Create three separate datasets for each site, one for both sexes, one for males and one for females. Each dataset should still have year of diagnosis. So each dataset should have 43 rows and 4 columns.

brain_both_sexes <- select(brain, `Year of Diagnosis`, ends_with('Sexes'))
brain_male <- select(brain, `Year of Diagnosis`, ends_with(',Males')) # the comma keeps 'Females' from matching
brain_female <- select(brain, `Year of Diagnosis`, ends_with('Females'))

str(brain_both_sexes)

## 'data.frame':    43 obs. of  4 variables:
##  $ Year of Diagnosis   : chr  "1975-2016" "1975" "1976" "1977" ...
##  $ All Races,Both Sexes: num  6.59 5.85 5.82 6.17 5.76 6.12 6.3 6.51 6.42 6.31 ...
##  $ Whites,Both Sexes   : num  7.22 6.21 6.18 6.6 6.1 6.6 6.81 6.9 6.92 6.88 ...
##  $ Blacks,Both Sexes   : num  4.08 4.14 3.32 3.55 3.86 3.69 3.14 5.02 3.71 2.75 ...

str(brain_male)

## 'data.frame':    43 obs. of  4 variables:
##  $ Year of Diagnosis: chr  "1975-2016" "1975" "1976" "1977" ...
##  $ All Races,Males  : num  7.88 6.84 7.14 7.76 6.79 7.42 7.58 8.07 7.93 7.6 ...
##  $ Whites,Males     : num  8.61 7.31 7.51 8.26 7.19 8.03 8.2 8.44 8.57 8.2 ...
##  $ Blacks,Males     : num  4.79 4.31 5.37 5.17 4.34 4.19 3.35 7.24 4.4 3.79 ...

str(brain_female)

## 'data.frame':    43 obs. of  4 variables:
##  $ Year of Diagnosis: chr  "1975-2016" "1975" "1976" "1977" ...
##  $ All Races,Females: num  5.51 5.01 4.68 4.89 4.91 5.01 5.24 5.2 5.24 5.19 ...
##  $ Whites,Females   : num  6.04 5.28 5.03 5.27 5.19 5.37 5.65 5.63 5.64 5.74 ...
##  $ Blacks,Females   : chr  "3.51" "3.88" "-" "2.47" ...

colon_both_sexes <- select(colon, `Year of Diagnosis`, ends_with('Sexes'))
esophagus_both_sexes <- select(esophagus, `Year of Diagnosis`, ends_with('Sexes'))
lung_both_sexes <- select(lung, `Year of Diagnosis`, ends_with('Sexes'))
oral_both_sexes <- select(oral, `Year of Diagnosis`, ends_with('Sexes'))
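One thing to watch when selecting the male columns: ends_with() ignores case by default, so the pattern 'Males' also matches columns ending in 'Females'. A sketch on a toy data frame with the same naming pattern (toy and its values are placeholders):

```r
library(dplyr)

# Toy data frame with the same naming pattern as the SEER files
toy <- data.frame(
  `Year of Diagnosis` = '1975',
  `All Races,Males` = 6.84,
  `All Races,Females` = 5.01,
  check.names = FALSE
)

# Matches both columns, because 'Females' ends in 'males' when case is ignored
names(select(toy, ends_with('Males')))

# Including the comma restricts the match to the Males column
names(select(toy, ends_with(',Males')))
```

Passing ignore.case = FALSE to ends_with() is another way to get the same restriction.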

For the next few points, I will refer to the both sexes datasets. You’ll do the same for the male and female datasets.
1. Each dataset has the same column headers. We’ll be joining the data sets into a single data set using the year of diagnosis as the key, i.e. the common variable that will determine the joins. This means we have to make the column names for the different data sets (except the year column) distinctive. Change the column names, replacing Both Sexes with the name of the cancer site.
  - Hints: names(brain) gives the column names for the brain data. You can also change column names using names(brain) <- ... since this is just a vector. In particular, you can try something like names(brain) <- stringr::str_replace(names(brain), 'Both Sexes', 'brain')
```
names(brain_both_sexes) <- str_replace(names(brain_both_sexes), 'Both Sexes','brain')
names(colon_both_sexes) <- str_replace(names(colon_both_sexes), 'Both Sexes','colon')
names(esophagus_both_sexes) <- str_replace(names(esophagus_both_sexes), 'Both Sexes','esophagus')
names(lung_both_sexes) <- str_replace(names(lung_both_sexes), 'Both Sexes','lung')
names(oral_both_sexes) <- str_replace(names(oral_both_sexes), 'Both Sexes','oral')
```
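Since the same replacement is repeated for every site, it can be wrapped in a small function. This is only a sketch: rename_site is a hypothetical helper, not part of the original solution, and toy is a placeholder data frame.

```r
library(stringr)

# rename_site is a hypothetical helper, not part of the original solution
rename_site <- function(d, site) {
  names(d) <- str_replace(names(d), 'Both Sexes', site)
  d
}

# Toy data frame with SEER-style column names (values are placeholders)
toy <- data.frame(
  `Year of Diagnosis` = '1975',
  `All Races,Both Sexes` = 1,
  `Whites,Both Sexes` = 2,
  check.names = FALSE
)

names(rename_site(toy, 'brain'))
```

The year column is left alone because its name does not contain 'Both Sexes'.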

Create a new composite dataset by repeatedly using left_join or inner_join to add each site-specific dataset to the composite data set. What I mean is, create a dataset joining A and B, then join C to the result, then join D to the result and so on.

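both_sexes <- brain_both_sexes %>% 
  left_join(colon_both_sexes, by = 'Year of Diagnosis') %>% 
  left_join(esophagus_both_sexes, by = 'Year of Diagnosis') %>% 
  left_join(lung_both_sexes, by = 'Year of Diagnosis') %>% 
  left_join(oral_both_sexes, by = 'Year of Diagnosis')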
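The "join A and B, then join C to the result" pattern can also be written with purrr::reduce, which applies a two-argument function along a list. This sketch uses three small made-up data frames (a, b, c and their values are assumptions):

```r
library(dplyr)
library(purrr)

# Made-up data frames sharing a key column
a <- data.frame(year = 1975:1976, brain = c(1, 2))
b <- data.frame(year = 1975:1976, colon = c(3, 4))
c <- data.frame(year = 1975:1976, lung = c(5, 6))

# Joins a and b first, then joins c to that result
composite <- reduce(
  list(a, b, c),
  function(x, y) left_join(x, y, by = 'year')
)

composite
```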

For each of these datasets:
1. Remove the row with the overall 1975-2016 rates
2. Use pivot_longer to make a dataset with 3 columns: year, type of cancer and cancer incidence rate

both_sexes <- both_sexes %>% 
    slice(-1) %>% 
    mutate(across(everything(), as.numeric)) %>% 
    pivot_longer(names_to = 'race_site', 
                 values_to = 'rate',
                 cols = c(-`Year of Diagnosis`)) %>% 
    separate(race_site, c('race','site'), sep=',')
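As an alternative to the separate() step, pivot_longer can split the column names itself via names_sep. This is a sketch on a toy wide data frame (wide and its values are placeholders); it should give the same four columns as the code above.

```r
library(tidyr)

# Toy wide data frame with the renamed SEER-style columns
wide <- data.frame(
  `Year of Diagnosis` = c(1975, 1976),
  `All Races,brain` = c(1, 2),
  `Whites,colon` = c(3, 4),
  check.names = FALSE
)

long <- pivot_longer(
  wide,
  cols = -`Year of Diagnosis`,
  names_to = c('race', 'site'),  # one new column per piece of the name
  names_sep = ',',               # split the old column names at the comma
  values_to = 'rate'
)

long
```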

Create 3 plots like the one above, one for all races, one for whites and one for blacks. Assign the ggplot code for each to a name, i.e. something like plt1 <- ggplot(...) + .... Display the graph for all races, and create and display a panelled plot where the white and black plots are presented side-by-side. You can use functions from cowplot, ggpubr or patchwork as you like.

Make sure you have labelled everything properly, including titles, so that the graphical information is self-evident.

plt1 <- both_sexes %>% filter(race=='All Races') %>% 
    ggplot(aes(x = `Year of Diagnosis`, y  = rate, color = site))+
    geom_line()
plt2 <- both_sexes %>% filter(race=='Whites') %>% 
    ggplot(aes(x = `Year of Diagnosis`, y  = rate, color = site))+
    geom_line()
plt3 <- both_sexes %>% filter(race=='Blacks') %>% 
    ggplot(aes(x = `Year of Diagnosis`, y  = rate, color = site))+
    geom_line()

library(cowplot)
plot_grid(plt1, 
          plot_grid(plt2, plt3, nrow=1,
                    labels = c('Whites','Blacks')),
          ncol=1,
          labels = c('All races', ''))
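The plots above still use the default axis and legend labels. One way to add titles and labels is labs(), sketched here on a small toy data frame (toy, its values and the label wording are assumptions):

```r
library(ggplot2)

# Toy long-format data; values are placeholders
toy <- data.frame(
  year = rep(1975:1977, times = 2),
  site = rep(c('brain', 'colon'), each = 3),
  rate = c(5.9, 5.8, 6.2, 59.5, 61.3, 62.4)
)

plt_labelled <- ggplot(toy, aes(x = year, y = rate, color = site)) +
  geom_line() +
  labs(
    title = 'Cancer incidence, all races, both sexes',
    x = 'Year of diagnosis',
    y = 'Incidence rate per 100,000',
    color = 'Cancer site'
  )

plt_labelled
```

The same labs() call can be added to plt1, plt2 and plt3, changing only the title.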

The code I developed in the screencast is available here

BIOF 339