class: center, middle, inverse, title-slide # Data validation and exploration ### Abhijit Dasgupta ### BIOF 339 --- layout: true <div class="my-header"> <span>BIOF 339: Practical R</span> </div> --- ## Plan today - Dynamic exploration of data - Data validation - Missing data evaluation --- class: middle, center # Why go back to this? --- ## This is important!! + Most of the time in an analysis is spent understanding and cleaning data + Recognize that unless you've ended up with good-quality data, the rest of the analyses are moot + This is tedious, careful, non-sexy work - Hard to tell your boss you're still fixing the data - No real results yet - But essential to understanding the appropriate analyses and the tweaks you may need. --- ## What does a dataset look like? .pull-left[ ```r library(tidyverse) library(visdat) beaches <- rio::import('../data/sydneybeaches3.csv') vis_dat(beaches) ``` ] .pull-right[ ![](03-eda_files/figure-html/12-CleaningExploring-1-1.png)<!-- --> ] --- ## What does a dataset look like? .pull-left[ ```r brca <- rio::import('../data/clinical_data_breast_cancer_modified.csv') vis_dat(brca) ``` ] .pull-right[ ![](03-eda_files/figure-html/12-CleaningExploring-2-1.png)<!-- --> ] --- ## What does a dataset look like? .pull-left[ ```r vis_dat(airquality) ``` These plots give a nice insight into 1. data types 1. Missing data patterns (more on this later) ] .pull-right[ ![](03-eda_files/figure-html/12-CleaningExploring-3-1.png)<!-- --> ] --- class: middle, center # Let's get a bit more quantitative --- ## `summary` and `str`/`glimpse` are a first pass .pull-left[ ```r summary(airquality) ``` ``` Ozone Solar.R Wind Temp Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00 Median : 31.50 Median :205.0 Median : 9.700 Median :79.00 Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00 Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00 NA's :37 NA's :7 Month Day Min. :5.000 Min. : 1.0 1st Qu.:6.000 1st Qu.: 8.0 Median :7.000 Median :16.0 Mean :6.993 Mean :15.8 3rd Qu.:8.000 3rd Qu.:23.0 Max. :9.000 Max. :31.0 ``` ] .pull-right[ ```r glimpse(airquality) ``` ``` Rows: 153 Columns: 6 $ Ozone <int> 41, 36, 12, 18, NA, 28, 23, 19, 8, NA, 7, 16, 11, 14, 18, 14, … $ Solar.R <int> 190, 118, 149, 313, NA, NA, 299, 99, 19, 194, NA, 256, 290, 27… $ Wind <dbl> 7.4, 8.0, 12.6, 11.5, 14.3, 14.9, 8.6, 13.8, 20.1, 8.6, 6.9, 9… $ Temp <int> 67, 72, 74, 62, 56, 66, 65, 59, 61, 69, 74, 69, 66, 68, 58, 64… $ Month <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,… $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,… ``` ] --- ## Validating data values + We can certainly be reactive by just describing the data and looking for anomalies. + For larger data or multiple data files it makes sense to be proactive and catch errors that you want to avoid, before exploring for new errors. + The `assertthat` package provides nice tools to do this -- > **Note to self:** I don't do this enough. This is a good defensive programming technique that can catch crucial problems that aren't always automatically flagged as errors --- ## Being assertive ```r library(assertthat) assert_that(all(between(airquality$Day, 1, 31) )) ``` ``` [1] TRUE ``` ```r assert_that(is.factor(mpg$manufacturer)) ``` ``` Error: mpg$manufacturer is not a factor ``` ```r assert_that(all(beaches$season_name %in% c('Summer','Winter','Spring', 'Fall'))) ``` ``` Error: Elements 11, 12, 13, 14, 15, ... of beaches$season_name %in% c("Summer", "Winter", "Spring", "Fall") are not true ``` --- ## Being assertive + `assert_that` generates an error, which will stop things + `see_if` does the same validation, but just generates a `TRUE/FALSE`, which you can capture ```r see_if(is.factor(mpg$manufacturer)) ``` ``` [1] FALSE attr(,"msg") [1] "mpg$manufacturer is not a factor" ``` + `validate_that` generates `TRUE` if the assertion is true, otherwise generates a string explaining the error ```r validate_that(is.factor(mpg$manufacturer)) ``` ``` [1] "mpg$manufacturer is not a factor" ``` ```r validate_that(is.character(mpg$manufacturer)) ``` ``` [1] TRUE ``` --- ## Being assertive You can even write your own validation functions and custom messages ```r is_odd <- function(x){ assert_that(is.numeric(x), length(x)==1) x %% 2 == 1 } assert_that(is_odd(2)) ``` ``` Error: is_odd(x = 2) is not TRUE ``` ```r on_failure(is_odd) <- function(call, env) { paste0(deparse(call$x), " is even") # This is a R trick } assert_that(is_odd(2)) ``` ``` Error: 2 is even ``` ```r is_odd(1:2) ``` ``` Error: length(x) not equal to 1 ``` --- class: middle,center # Missing data --- ## Missing data R denotes missing data as `NA`, and supplies several functions to deal with missing data. The most fundamental is `is.na`, which gives a TRUE/FALSE answer ```r is.na(NA) ``` ``` [1] TRUE ``` ```r is.na(25) ``` ``` [1] FALSE ``` --- ## Missing data When we get a new dataset, it's useful to get a sense of the missingness ```r mpg %>% summarize(across(everything(), function(x) sum(is.na(x)))) ``` ``` # A tibble: 1 × 11 manufacturer model displ year cyl trans drv cty hwy fl class <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> 1 0 0 0 0 0 0 0 0 0 0 0 ``` --- ## Missing data The `naniar` package allows a tidyverse-compatible way to deal with missing data ```r library(naniar) weather <- rio::import('../data/weather.csv') all_complete(mpg) ``` ``` [1] TRUE ``` ```r all_complete(weather) ``` ``` [1] FALSE ``` ```r weather %>% summarize_all(pct_complete) ``` ``` id year month element d1 d2 d3 d4 d5 d6 1 100 100 100 100 9.090909 18.18182 18.18182 9.090909 36.36364 9.090909 d7 d8 d9 d10 d11 d12 d13 d14 d15 1 9.090909 9.090909 0 9.090909 9.090909 0 9.090909 18.18182 9.090909 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 1 9.090909 9.090909 0 0 0 0 0 18.18182 0 9.090909 9.090909 27.27273 d28 d29 d30 d31 1 9.090909 18.18182 9.090909 9.090909 ``` --- ## Missing data ```r gg_miss_case(weather, show_pct = T) ``` ![](03-eda_files/figure-html/12-CleaningExploring-13-1.png)<!-- --> --- ## Missing data ```r gg_miss_var(weather, show_pct=T) ``` ``` Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use `guide = "none"` instead. ``` ![](03-eda_files/figure-html/12-CleaningExploring-14-1.png)<!-- --> --- ## Are there patterns to the missing data + Most analyses assume that data is either - Missing completely at random (MCAR) - Missing at random (MAR) + MCAR means - The missing data is just a random subset of the data + MAR means - Whether data is missing is related to values of some other variable(s) - If we control for those variable(s), the missing data would form a random subset of each of those data subsets defined by unique values of these variables --- ## Are there patterns to the missing data #### MAR or MCAR allows us to ignore the missing data, since it doesn't bias our analyses #### If data are **not** MCAR or MAR, we really need to understand the issing data mechanism and how that might affect our results. --- ## Co-patterns of missingness .pull-left[ ```r gg_miss_upset(airquality) ``` ![](03-eda_files/figure-html/v4-1.png)<!-- --> ] .pull-right[ ```r gg_miss_upset(riskfactors) ``` ![](03-eda_files/figure-html/v5-1.png)<!-- --> ] --- ## Co-patterns of missingness .pull-left[ ```r ggplot(airquality, aes(x = Ozone, y = Solar.R)) + geom_point() ``` ``` Warning: Removed 42 rows containing missing values (geom_point). ``` ![](03-eda_files/figure-html/d1-1.png)<!-- --> ] .pull-right[ ```r ggplot(airquality, aes(x = Ozone, y = Solar.R)) + geom_miss_point() ``` ![](03-eda_files/figure-html/12-CleaningExploring-15-1.png)<!-- --> ] --- ## Co-patterns of missingness ```r gg_miss_fct(x = riskfactors, fct = marital) ``` ![](03-eda_files/figure-html/12-CleaningExploring-16-1.png)<!-- --> --- ## Replacing missing data `tidyr` has a function `replace_na` which will replace all missing values with some particular value. In the weather dataset, values are missing generally because there wasn't recorded rainfall on a day. So these values should really be 0 ```r weather1 <- weather %>% mutate(d1 = replace_na(d1, 0)) pct_miss(weather1$d1) ``` ``` [1] 0 ``` --- ### Question: How would you replace all the missing values with 0? -- ```r weather %>% mutate(across(everything(),function(x) replace_na(x, 0))) ``` -- ### How would you replace the missing values with the mean of the variable? -- ```r weather %>% mutate(across(where(is.numeric), function(x) replace_na(x, mean(x, na.rm=T)))) ``` ---