Instructions:
I suggest you start working on this week’s assignment using just an R script file, and work out the different things you have to do using just R code.
The submission, of course, has to be an R Markdown file and the corresponding knitted HTML file. You will create this R Markdown file on your own, adding R chunks to it with code copied from your script file. You also have to write in the question numbers and a minimal narrative in markdown as part of the R Markdown file, so that it reads more like a report.
The submission will consist of your R Markdown file and knitted HTML file. You have to make sure that the HTML file is properly rendered. Both files are required; omitting either one makes the assignment incomplete.
Also, please remove any and all template-based text from the R Markdown file before you submit it. We don’t want to see the template materials in any submissions.
Let’s look at the breast cancer data set we’ve been using.
library(tidyverse) # provides ggplot2, dplyr and the %>% pipe
library(rio)
library(here)
brca <- import(here('data/brca_data.rds')) %>%
  janitor::clean_names()
Create a bar plot of the frequencies of ER status (Hint: geom_bar). How would you modify the dataset so that there is no separate bar for NA’s?
ggplot(brca, aes(x = er_status))+
geom_bar()
ggplot(brca %>% filter(!is.na(er_status)),
aes(x = er_status))+
geom_bar()
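As a side note, tidyr::drop_na() should do the same job as the filter() call above. The sketch below uses a small made-up data frame (toy is not part of the assignment data) to check that the two approaches keep the same rows.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the er_status column; not the real brca data
toy <- tibble(er_status = c('Positive', NA, 'Negative', 'Positive'))

# Approach used in the solution: keep rows where er_status is not missing
kept_filter <- toy %>% filter(!is.na(er_status))

# Alternative: drop rows with a missing value in the named column
kept_drop <- toy %>% drop_na(er_status)

nrow(kept_filter)
nrow(kept_drop)
```

Either version can be piped straight into ggplot().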
Color each bar by the frequencies of HER2 status within each group. (Hint: you can add aes(color = her2_final_status) or aes(fill = her2_final_status) to the geom_bar statement. What is the difference between these two choices?)
ggplot(brca %>% filter(!is.na(er_status), !is.na(her2_final_status)),
aes(x = er_status, fill=her2_final_status))+
geom_bar(position = 'stack') # This is the default
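To see the difference between the color and fill aesthetics without the real data, here is a minimal sketch on a made-up data frame (toy, er and her2 are illustrative names only). It uses layer_data() from ggplot2 to look at what each plot actually computed: color only varies the bar outlines, while fill varies the bar interiors.

```r
library(ggplot2)

# Made-up stand-in for the ER and HER2 columns; values are invented
toy <- data.frame(
  er   = c('Positive', 'Positive', 'Negative', 'Negative', 'Positive'),
  her2 = c('Negative', 'Positive', 'Negative', 'Negative', 'Negative')
)

# color maps HER2 status to the outline of each bar segment
p_color <- ggplot(toy, aes(x = er)) + geom_bar(aes(color = her2))

# fill maps HER2 status to the interior of each bar segment
p_fill <- ggplot(toy, aes(x = er)) + geom_bar(aes(fill = her2))

# Data computed for the bar layer of each plot
d_color <- layer_data(p_color)
d_fill <- layer_data(p_fill)

unique(d_color$colour) # two outline colors, one per HER2 level
unique(d_fill$fill)    # two fill colors, one per HER2 level
```

Printing p_color and p_fill side by side makes the visual difference obvious; fill is usually the one you want for bar plots.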
I would like to see a graph showing the relative frequencies of HER2 status within each ER group. Read the documentation ?geom_bar
as well as section 3.8 of R4DS to find this solution.
ggplot(brca %>% filter(!is.na(er_status), !is.na(her2_final_status)),
aes(x = er_status, fill=her2_final_status))+
geom_bar(position = 'fill')+ # Rescale each bar so the segments show proportions
scale_y_continuous('Percent', labels = scales::label_percent())+
labs(x = 'ER status', fill = 'HER2 status')
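If you want to check the numbers behind a position = 'fill' plot, the same relative frequencies can be computed by hand. This is only a sketch on invented data (toy and its values are assumptions, not the brca data): count the combinations, then divide by the total within each ER group.

```r
library(dplyr)

# Invented data: 4 ER-positive and 2 ER-negative cases
toy <- tibble(
  er   = c('Positive', 'Positive', 'Positive', 'Positive', 'Negative', 'Negative'),
  her2 = c('Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Positive')
)

rel_freq <- toy %>%
  count(er, her2) %>%           # frequency of each ER/HER2 combination
  group_by(er) %>%
  mutate(prop = n / sum(n)) %>% # proportion within each ER group
  ungroup()

rel_freq
```

Within each ER group the prop values sum to 1, which is what the stacked bars of height 1 display.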
The diamonds dataset is included in the ggplot2 package.
Add a column to the diamonds dataset that is the volume of each diamond (multiplying the x, y and z variables)
diamonds1 <- diamonds %>% mutate(volume = x * y * z)
Plot a scatter plot of diamond price against diamond volume
ggplot(diamonds1, aes(x=volume, y=price))+
geom_point()
Modify the plot to add colors based on the cut of each diamond
ggplot(diamonds1, aes(x=volume, y = price))+
geom_point(aes(color=cut))
Modify the previous plot to construct a faceted plot where each facet corresponds to one of the colors in the previous plot, i.e. one cut.
ggplot(diamonds1, aes(x = volume, y = price))+
geom_point()+
facet_wrap(~cut)
The following links provide data for U.S. incidence rates per 100,000, standardized to the 2000 standard U.S. population, for brain, colon, esophageal, lung and oral cancers for the period 1975-2016. These data are provided by the SEER program.
Our goal is to create a single graphic showing the patterns of incidence rates over this time period for the 5 cancers, something like the one below:
Data sets: brain, colon, esophagus, lung and oral.
library(tidyverse)
library(rio)
library(here)
brain <- import(here('data/Brain.csv'), skip=4)
colon <- import(here('data/Colon.csv'), skip=4)
esophagus <- import(here('data/Esophagus.csv'), skip=4)
lung <- import(here('data/Lung.csv'), skip=4)
oral <- import(here('data/Oral.csv'), skip=4)
str(brain)
## 'data.frame': 43 obs. of 10 variables:
## $ Year of Diagnosis : chr "1975-2016" "1975" "1976" "1977" ...
## $ All Races,Both Sexes: num 6.59 5.85 5.82 6.17 5.76 6.12 6.3 6.51 6.42 6.31 ...
## $ All Races,Males : num 7.88 6.84 7.14 7.76 6.79 7.42 7.58 8.07 7.93 7.6 ...
## $ All Races,Females : num 5.51 5.01 4.68 4.89 4.91 5.01 5.24 5.2 5.24 5.19 ...
## $ Whites,Both Sexes : num 7.22 6.21 6.18 6.6 6.1 6.6 6.81 6.9 6.92 6.88 ...
## $ Whites,Males : num 8.61 7.31 7.51 8.26 7.19 8.03 8.2 8.44 8.57 8.2 ...
## $ Whites,Females : num 6.04 5.28 5.03 5.27 5.19 5.37 5.65 5.63 5.64 5.74 ...
## $ Blacks,Both Sexes : num 4.08 4.14 3.32 3.55 3.86 3.69 3.14 5.02 3.71 2.75 ...
## $ Blacks,Males : num 4.79 4.31 5.37 5.17 4.34 4.19 3.35 7.24 4.4 3.79 ...
## $ Blacks,Females : chr "3.51" "3.88" "-" "2.47" ...
str(colon)
## 'data.frame': 43 obs. of 10 variables:
## $ Year of Diagnosis : chr "1975-2016" "1975" "1976" "1977" ...
## $ All Races,Both Sexes: num 52.1 59.5 61.3 62.4 62 ...
## $ All Races,Males : num 60.9 68.5 71.7 73.5 72.3 ...
## $ All Races,Females : num 45.5 53.7 54.1 55 55.2 ...
## $ Whites,Both Sexes : num 52.2 60.2 62.2 63.2 62.8 ...
## $ Whites,Males : num 60.9 69.8 72.7 74.3 73.9 ...
## $ Whites,Females : num 45.5 54.1 55 55.8 55.6 ...
## $ Blacks,Both Sexes : num 58 56.9 55 60.8 62.2 ...
## $ Blacks,Males : num 67.1 59.3 58.5 75.8 65.8 ...
## $ Blacks,Females : num 52 54.7 51.3 51.5 60.6 ...
brain_both_sexes <- select(brain, `Year of Diagnosis`, ends_with('Sexes'))
# ends_with() ignores case by default, so 'Males' would also match 'Females';
# including the comma restricts the match to the male columns
brain_male <- select(brain, `Year of Diagnosis`, ends_with(',Males'))
brain_female <- select(brain, `Year of Diagnosis`, ends_with('Females'))
str(brain_both_sexes)
## 'data.frame': 43 obs. of 4 variables:
## $ Year of Diagnosis : chr "1975-2016" "1975" "1976" "1977" ...
## $ All Races,Both Sexes: num 6.59 5.85 5.82 6.17 5.76 6.12 6.3 6.51 6.42 6.31 ...
## $ Whites,Both Sexes : num 7.22 6.21 6.18 6.6 6.1 6.6 6.81 6.9 6.92 6.88 ...
## $ Blacks,Both Sexes : num 4.08 4.14 3.32 3.55 3.86 3.69 3.14 5.02 3.71 2.75 ...
str(brain_male)
## 'data.frame': 43 obs. of 4 variables:
## $ Year of Diagnosis: chr "1975-2016" "1975" "1976" "1977" ...
## $ All Races,Males : num 7.88 6.84 7.14 7.76 6.79 7.42 7.58 8.07 7.93 7.6 ...
## $ Whites,Males : num 8.61 7.31 7.51 8.26 7.19 8.03 8.2 8.44 8.57 8.2 ...
## $ Blacks,Males : num 4.79 4.31 5.37 5.17 4.34 4.19 3.35 7.24 4.4 3.79 ...
str(brain_female)
## 'data.frame': 43 obs. of 4 variables:
## $ Year of Diagnosis: chr "1975-2016" "1975" "1976" "1977" ...
## $ All Races,Females: num 5.51 5.01 4.68 4.89 4.91 5.01 5.24 5.2 5.24 5.19 ...
## $ Whites,Females : num 6.04 5.28 5.03 5.27 5.19 5.37 5.65 5.63 5.64 5.74 ...
## $ Blacks,Females : chr "3.51" "3.88" "-" "2.47" ...
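One thing to keep in mind when selecting the male columns: ends_with() matches case-insensitively by default, so the pattern 'Males' also matches column names ending in 'Females'. The sketch below shows this on a tiny made-up data frame (toy is illustrative only) and one possible way around it, the ignore.case argument.

```r
library(dplyr)

# Tiny stand-in with the same style of column names as the SEER files
toy <- tibble(
  `Year of Diagnosis` = c('1975', '1976'),
  `All Races,Males`   = c(6.84, 7.14),
  `All Races,Females` = c(5.01, 4.68)
)

# Default behaviour: case is ignored, so 'Females' matches too
loose <- select(toy, ends_with('Males'))

# Case-sensitive matching keeps only the male column
strict <- select(toy, ends_with('Males', ignore.case = FALSE))

names(loose)
names(strict)
```

Checking names() or str() after every select() is a cheap way to catch this kind of surprise.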
colon_both_sexes <- select(colon, `Year of Diagnosis`, ends_with('Sexes'))
esophagus_both_sexes <- select(esophagus, `Year of Diagnosis`, ends_with('Sexes'))
lung_both_sexes <- select(lung, `Year of Diagnosis`, ends_with('Sexes'))
oral_both_sexes <- select(oral, `Year of Diagnosis`, ends_with('Sexes'))
For the next few points, I will refer to the both sexes datasets. You’ll do the same for the male and female datasets.
In the column names of each dataset, replace Both Sexes with the name of the cancer site.
names(brain_both_sexes) gives the column names for the brain data. You can also change column names using names(brain_both_sexes) <- ..., since this is just a vector. In particular, you can try something like names(brain_both_sexes) <- stringr::str_replace(names(brain_both_sexes), 'Both Sexes', 'brain')
names(brain_both_sexes) <- str_replace(names(brain_both_sexes), 'Both Sexes','brain')
names(colon_both_sexes) <- str_replace(names(colon_both_sexes), 'Both Sexes','colon')
names(esophagus_both_sexes) <- str_replace(names(esophagus_both_sexes), 'Both Sexes','esophagus')
names(lung_both_sexes) <- str_replace(names(lung_both_sexes), 'Both Sexes','lung')
names(oral_both_sexes) <- str_replace(names(oral_both_sexes), 'Both Sexes','oral')
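Since the same renaming is repeated for every site, you could wrap it in a small helper. This is just a sketch: rename_site is a hypothetical function name of my own, and toy is a one-row stand-in for a both sexes dataset.

```r
library(stringr)

# Hypothetical helper: swap 'Both Sexes' for the site name in all column names
rename_site <- function(df, site) {
  names(df) <- str_replace(names(df), 'Both Sexes', site)
  df
}

# One-row stand-in; check.names = FALSE keeps the spaces and commas
toy <- data.frame(
  `Year of Diagnosis`    = 1975,
  `All Races,Both Sexes` = 5.85,
  `Whites,Both Sexes`    = 6.21,
  check.names = FALSE
)

renamed <- rename_site(toy, 'brain')
names(renamed)
```

With the real data this would be called as, for example, rename_site(colon_both_sexes, 'colon').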
Create a new composite dataset by repeatedly using left_join or inner_join to add each site-specific dataset to the composite data set. What I mean is, create a dataset joining A and B, then join C to the result, then join D to the result and so on.
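# Naming the key explicitly avoids the "Joining with `by = ...`" message
both_sexes <- brain_both_sexes %>%
  left_join(colon_both_sexes, by = 'Year of Diagnosis') %>%
  left_join(esophagus_both_sexes, by = 'Year of Diagnosis') %>%
  left_join(lung_both_sexes, by = 'Year of Diagnosis') %>%
  left_join(oral_both_sexes, by = 'Year of Diagnosis')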
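The "join A and B, then join C to the result" pattern can also be written with purrr::reduce(), which applies a two-argument function along a list while carrying the result forward. The sketch below uses three made-up tables (a, b and cc, with invented numbers) rather than the SEER data.

```r
library(dplyr)
library(purrr)

# Three made-up site tables sharing a year column
a  <- tibble(year = 1975:1977, brain = c(1, 2, 3))
b  <- tibble(year = 1975:1977, colon = c(4, 5, 6))
cc <- tibble(year = 1975:1977, lung  = c(7, 8, 9))

# Join a and b, then join cc to the result
combined <- reduce(
  list(a, b, cc),
  function(x, y) left_join(x, y, by = 'year')
)

combined
```

For five tables the chained version in the solution is perfectly readable; reduce() starts to pay off when the number of tables grows.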
Use pivot_longer to make a dataset with 3 columns: year, type of cancer and cancer incidence rate
both_sexes <- both_sexes %>%
slice(-1) %>%
mutate(across(everything(), as.numeric)) %>%
pivot_longer(names_to = 'race_site',
values_to = 'rate',
cols = c(-`Year of Diagnosis`)) %>%
separate(race_site, c('race','site'), sep=',')
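Here is a small, self-contained sketch of the same reshaping on made-up data, so you can see what each step does. The table wide and its values are invented; the '-' entry imitates the placeholder visible in the Blacks,Females column earlier. Turning '-' into NA explicitly, as done here with na_if(), is one option for avoiding the coercion warning that as.numeric() would otherwise give.

```r
library(dplyr)
library(tidyr)

# Made-up wide table: one column per race/site combination
wide <- tibble(
  year              = c('1975', '1976'),
  `All Races,brain` = c(5.85, 5.82),
  `Blacks,brain`    = c('4.14', '-')
)

long <- wide %>%
  # Treat '-' as missing, then convert every column to numeric
  mutate(across(everything(), ~ as.numeric(na_if(as.character(.x), '-')))) %>%
  # One row per year and race/site combination
  pivot_longer(cols = -year, names_to = 'race_site', values_to = 'rate') %>%
  # Split 'All Races,brain' into race = 'All Races' and site = 'brain'
  separate(race_site, into = c('race', 'site'), sep = ',')

long
```

The result has one row per year, race and site, which is the shape ggplot() needs for mapping site to color.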
Create 3 plots like the one above, one for all races, one for whites and one for blacks. Assign the ggplot code for each to a name, i.e. something like plt1 <- ggplot(...) + .... Display the graph for all races, and create and display a panelled plot where the white and black plots are presented side-by-side. You can use functions from cowplot, ggpubr or patchwork as you like.
plt1 <- both_sexes %>% filter(race=='All Races') %>%
ggplot(aes(x = `Year of Diagnosis`, y = rate, color = site))+
geom_line()
plt2 <- both_sexes %>% filter(race=='Whites') %>%
ggplot(aes(x = `Year of Diagnosis`, y = rate, color = site))+
geom_line()
plt3 <- both_sexes %>% filter(race=='Blacks') %>%
ggplot(aes(x = `Year of Diagnosis`, y = rate, color = site))+
geom_line()
library(cowplot)
plot_grid(plt1,
plot_grid(plt2, plt3, nrow=1,
labels = c('Whites','Blacks')),
ncol=1,
labels = c('All races', ''))
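If you prefer patchwork, a similar layout can be sketched with its / (stack) and | (side-by-side) operators. The example below is only an illustration on made-up data: toy, make_plot and the p_* names are my own, not part of the assignment.

```r
library(ggplot2)
library(patchwork)

# Made-up rates for two sites over three years
toy <- data.frame(
  year = rep(1975:1977, times = 2),
  rate = c(5, 6, 7, 50, 55, 60),
  site = rep(c('brain', 'colon'), each = 3)
)

# Hypothetical helper so the three panels share the same code
make_plot <- function(title) {
  ggplot(toy, aes(x = year, y = rate, color = site)) +
    geom_line() +
    ggtitle(title)
}

p_all   <- make_plot('All races')
p_white <- make_plot('Whites')
p_black <- make_plot('Blacks')

# One plot on top, two side by side underneath, with a single shared legend
panel <- p_all / (p_white | p_black) + plot_layout(guides = 'collect')
panel
```

Here the titles come from ggtitle() inside each plot rather than from a labels argument, which is one design choice among several.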
The code I developed in the screencast is available here