Instructions

You are expected to edit this R Markdown document, especially the R code chunks. Once you’re done, please change eval=F to eval=T in each chunk in order to make the code run. Most of the R chunks are fill-in-the-blanks, with a few left empty for you to completely fill in.

Your submission for this week will comprise 2 files:

  1. This file, after editing to complete the assignments
  2. The HTML file obtained by knitting this file

Exercises

  1. Fix the code in the following chunk so that it runs without errors.
my_variable <-  1:5
my_var1able
mean(airquality$wind)
table(iris$Sepal_Length)
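When a name is not found, it often helps to list the exact names R knows about. The sketch below uses the built-in mtcars data, which is not part of this assignment and is chosen only for illustration; the variable names `col_names`, `has_mpg`, `has_Mpg` and `missing_col` are hypothetical.

```r
# Object and column names in R are case-sensitive and must match exactly.
col_names <- names(mtcars)

has_mpg <- "mpg" %in% col_names   # exact lower-case spelling
has_Mpg <- "Mpg" %in% col_names   # same letters, different capitalisation

# `$` with a column name that does not exist returns NULL instead of an error,
# so a misspelled column can fail silently further down the line.
missing_col <- mtcars$Mpg

print(col_names)
print(c(has_mpg, has_Mpg))
print(is.null(missing_col))
```

Running `names()` on the data sets used in the chunk above is one way to compare what you typed with what is actually stored.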
  1. Download the data “clinical_data_breast_cancer_modified.csv” from the data link on Canvas and save it in the data folder of your RStudio Project for this class.
  1. Import the data into R, calling it breast_cancer.
# Code to import data file here

__________ <- __::______("data/clinical_data_breast_cancer_modified.csv")
  1. Check the data types for each of the variables
# Check for data types

____(breast_cancer)
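As a rough illustration of what a type check can reveal, the sketch below builds a small made-up data frame (`toy`, with hypothetical columns `id`, `sex` and `age`) and lists the class of every column. It is only one of several ways to inspect types and is not necessarily the function intended for the blank above.

```r
# A toy data frame, invented for illustration only.
toy <- data.frame(
  id  = 1:3,
  sex = c("F", "M", "F"),
  age = c("45", "52", "unknown"),   # one non-numeric entry
  stringsAsFactors = FALSE
)

# One class per column: a single stray text value makes `age` character.
col_classes <- sapply(toy, class)
print(col_classes)

# Converting to numeric turns the stray entry into NA (with a warning,
# suppressed here so the example runs quietly).
numeric_age <- suppressWarnings(as.numeric(toy$age))
print(numeric_age)
```

A column that you expect to be numeric but that shows up as character usually contains at least one entry that could not be read as a number.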
  1. Convert the Gender, ER Status, PR Status and HER2 status variables to factors
# Convert here

breast_cancer$Gender <- as.factor(breast_cancer$Gender)

## You can repeat this for the other variables, making sure you spell them properly. Or you can
## proceed as follows:

breast_cancer <- breast_cancer %>%
    mutate(Gender = as.factor(Gender), 
           ______= ________(_______),
           ..... # fill this in with the other variables
          )

## Comment out one of the two strategies for your submission
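After converting a column, `levels()` and `table()` give a quick view of what the factor now contains. A minimal sketch on a made-up vector (`status` and `status_f` are hypothetical names, not columns from the assignment data):

```r
# A made-up character vector, for illustration only.
status <- c("Positive", "Negative", "Positive", "Indeterminate")

status_f <- as.factor(status)

# By default the levels are the sorted unique values.
print(levels(status_f))

# table() counts how many observations fall in each level.
print(table(status_f))
```

Checking the levels this way also exposes misspellings or unexpected categories that became levels of their own.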
  1. Identify any other variables that might be the wrong type and convert them
# Convert any new variables here
  1. Verify that all the variables in breast_cancer match the modifications you made
  1. The dataset “clinical_data_breast_cancer_hw.csv” provides clinical and pathological data for 105 breast cancer patients.
  1. Read this data into R, calling it brca_data.
brca_data <- ____::________('data/clinical_data_breast_cancer_hw.csv')
  1. This data is in tidy format, but needs cleaning.
    1. Evaluate whether all the variables are of the correct type. If not, evaluate why, and correct the problem.
___(brca_data)

## Add code here to correct any problems in the data set
  1. For the molecular results for ER, PR and HER2, any entry that is not Positive or Negative should be changed to a missing value.
## There are two approaches here. The first looks at the data and finds out the actual values that aren't Positive or Negative
brca_data1 <- brca_data %>%
    mutate(ER.Status = recode_factor(ER.Status, Indeterminate = NA_character_), 
    mutate(________ = recode_factor(_______,  '_________' = NA_character_))

## There is an error in the above code, and a couple of solutions.
  
## The other is to run the same function on the columns for ER, PR and HER2 status

clean_markers <- function(x){
  x <- recode_factor(x, Positive = 'Positive', _______ = '________', .default = NA_character_)
  return(x)
}

brca_data2 <- brca_data %>%
    mutate(across(c(____, ____, ____), ________))

## Verify that both data sets are identical

all(brca_data1 == brca_data2, na.rm=TRUE)
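One caveat about this kind of check: with `na.rm = TRUE`, any position where either data set is missing is ignored, so two data sets can pass even when their missing values sit in different places. The sketch below shows this on two made-up data frames (`a` and `b` are hypothetical and unrelated to the assignment data).

```r
# Two toy data frames that differ only in where a value is missing.
a <- data.frame(id = 1:3, status = c("Positive", NA, "Negative"),
                stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, status = c("Positive", "Negative", "Negative"),
                stringsAsFactors = FALSE)

# Element-wise comparison; the NA position is dropped before all() is applied.
loose_check <- all(a == b, na.rm = TRUE)

# Stricter checks: exact equality, and equality of the missing-value pattern.
strict_check    <- identical(a, b)
same_na_pattern <- identical(is.na(a), is.na(b))

print(c(loose = loose_check, strict = strict_check, na_pattern = same_na_pattern))
```

Comparing the missing-value patterns as well may therefore be a useful complement to the check above.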
  1. Create a new dataset that contains patient ID, ER, PR and HER2 status.
brca_data3 <- _______(brca_data, ______, ______, _______, _______)
  1. Creating new variables (what dplyr function will you use for all of these?): ___________

  2. Create a variable giving the TNM status of each patient. The T, N and M statuses are given separately. I want a single variable encoded as, for example, “T2N0M0”. [Hint: The function paste is your friend]

brca_data <- brca_data %>%
    ______(tnm_status = ___________(Tumor, Node, Metastasis, sep = ""))
brca_data3$tnm_status <- brca_data$tnm_status
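The effect of the `sep` argument can be seen on a small made-up example (the vectors `first` and `second` are hypothetical): `paste` works element by element, and its default separator is a single space.

```r
# Two made-up character vectors of equal length.
first  <- c("A", "B")
second <- c("x", "y")

with_space <- paste(first, second)             # default sep = " "
no_space   <- paste(first, second, sep = "")   # pieces glued together directly

print(with_space)
print(no_space)

# paste0() is shorthand for paste(..., sep = "").
print(identical(no_space, paste0(first, second)))
```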
  1. Create a variable that divides the patients into 3 categories based on molecular profiles [Hint: The function case_when might help]:
    - Luminal (ER positive and/or PR positive)
    - HER2 (HER2 positive)
    - Basal-like (ER, PR and HER2 negative)
brca_data3 <- brca_data3 %>%
    mutate(mol_cat = case_when( # fill in the next 4 lines

      
      
      ))
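Because `case_when` checks its conditions from top to bottom and uses the first one that is TRUE, the order matters whenever categories can overlap. A minimal sketch on made-up scores (the tibble `scores` and its columns are hypothetical, and the cut-offs are arbitrary):

```r
library(dplyr)

# Made-up data, for illustration only.
scores <- tibble(score = c(95, 85, 40, NA))

scores <- scores %>%
  mutate(grade = case_when(
    score >= 90 ~ "high",     # checked first
    score >= 50 ~ "medium",   # only reached if the line above was not TRUE
    score <  50 ~ "low"
  ))

# Rows where no condition is TRUE (here the NA score) get NA.
print(scores)
```

Several conditions can be combined in a single line with `&` and `|` when a category depends on more than one variable.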
  1. Create a variable that equals the time of death if they are dead (see the variable Vital.Status) or the time of last contact if they are alive. This is a common computation for survival analysis studies, and is called the overall survival time. [Hint: the function ifelse might be useful]
## Fill this in yourself. Time to start leaving the nest
brca_data <- brca_data %>% 
  mutate(event_time = ___________(Vital.Status == 1, _____________, _______________))
brca_data3$event_time <- brca_data$event_time
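`ifelse` is vectorised: for each position it takes the value from the second argument where the test is TRUE and from the third argument otherwise. A small sketch with made-up vectors (`flag`, `if_yes` and `if_no` are hypothetical names):

```r
# Made-up vectors of equal length.
flag   <- c(1, 0, 1)
if_yes <- c(10, 20, 30)
if_no  <- c(100, 200, 300)

# Position 1 and 3 come from if_yes, position 2 comes from if_no.
picked <- ifelse(flag == 1, if_yes, if_no)
print(picked)
```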

Save the cleaned breast cancer dataset as brca_cleaned. You can save this to your computer using saveRDS(brca_cleaned, file="<a filename of your choice>.rds"). We’ll be using this dataset again when we do plots and modeling.
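An RDS file keeps column types such as factors intact, which a CSV file does not. The round trip can be sketched with a made-up data frame and a temporary file (`toy`, `path` and `restored` are hypothetical names; use your own file name for the submission):

```r
# A made-up data frame with a factor column.
toy <- data.frame(
  id     = 1:3,
  status = factor(c("Positive", "Negative", "Positive"))
)

# Write to a temporary file, then read it back.
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
restored <- readRDS(path)

# The restored object matches the original, including the factor column.
print(identical(toy, restored))
print(class(restored$status))
```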