class: center, middle, inverse, title-slide

# Data structures in R

### Abhijit Dasgupta

### BIOF 339

---
layout: true

<div class="my-header">
<span>BIOF339</span></div>

---

## A quick refresh

+ R is a scripting language for data analysis and statistics
+ R Markdown is a way of combining textual information and R code to produce reproducible documents
+ RStudio is an integrated environment that makes it easier to work with R

.pull-left[
You type commands (_code_) for R to run.

- objects like data (_nouns_)
- functions that do something to R objects (_verbs_)
]

.pull-right[
Examples

```r
airquality
diamonds
summary(airquality)
```
]

---

# Objects in R

.pull-left[
Let's start with the `airquality` data.

- It is an object
- of class `class(airquality)` = data.frame

How about each column? Let's look at the Ozone and Wind columns

- We can access them using `airquality$Ozone` and `airquality$Wind`
- `class(airquality$Ozone)` = integer
- `class(airquality$Wind)` = numeric
]

.pull-right[
```
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19
20    11      44  9.7   62     5  20
21     1       8  9.7   59     5  21
22    11     320 16.6   73     5  22
23     4      25  9.7   61     5  23
24    32      92 12.0   61     5  24
25    NA      66 16.6   57     5  25
```
]

---

# Objects in R

```r
head(iris)
```

```
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

```r
str(iris)
```

```
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Now we see another type of object, a `factor`

---

# Objects in R

```r
library(ggplot2)
str(midwest)
```

```
tibble [437 × 28] (S3: tbl_df/tbl/data.frame)
 $ PID                 : int [1:437] 561 562 563 564 565 566 567 568 569 570 ...
 $ county              : chr [1:437] "ADAMS" "ALEXANDER" "BOND" "BOONE" ...
 $ state               : chr [1:437] "IL" "IL" "IL" "IL" ...
 $ area                : num [1:437] 0.052 0.014 0.022 0.017 0.018 0.05 0.017 0.027 0.024 0.058 ...
 $ poptotal            : int [1:437] 66090 10626 14991 30806 5836 35688 5322 16805 13437 173025 ...
 $ popdensity          : num [1:437] 1271 759 681 1812 324 ...
 $ popwhite            : int [1:437] 63917 7054 14477 29344 5264 35157 5298 16519 13384 146506 ...
 $ popblack            : int [1:437] 1702 3496 429 127 547 50 1 111 16 16559 ...
 $ popamerindian       : int [1:437] 98 19 35 46 14 65 8 30 8 331 ...
 $ popasian            : int [1:437] 249 48 16 150 5 195 15 61 23 8033 ...
 $ popother            : int [1:437] 124 9 34 1139 6 221 0 84 6 1596 ...
 $ percwhite           : num [1:437] 96.7 66.4 96.6 95.3 90.2 ...
 $ percblack           : num [1:437] 2.575 32.9 2.862 0.412 9.373 ...
 $ percamerindan       : num [1:437] 0.148 0.179 0.233 0.149 0.24 ...
 $ percasian           : num [1:437] 0.3768 0.4517 0.1067 0.4869 0.0857 ...
 $ percother           : num [1:437] 0.1876 0.0847 0.2268 3.6973 0.1028 ...
 $ popadults           : int [1:437] 43298 6724 9669 19272 3979 23444 3583 11323 8825 95971 ...
 $ perchsd             : num [1:437] 75.1 59.7 69.3 75.5 68.9 ...
 $ percollege          : num [1:437] 19.6 11.2 17 17.3 14.5 ...
 $ percprof            : num [1:437] 4.36 2.87 4.49 4.2 3.37 ...
 $ poppovertyknown     : int [1:437] 63628 10529 14235 30337 4815 35107 5241 16455 13081 154934 ...
 $ percpovertyknown    : num [1:437] 96.3 99.1 95 98.5 82.5 ...
 $ percbelowpoverty    : num [1:437] 13.15 32.24 12.07 7.21 13.52 ...
 $ percchildbelowpovert: num [1:437] 18 45.8 14 11.2 13 ...
 $ percadultpoverty    : num [1:437] 11.01 27.39 10.85 5.54 11.14 ...
 $ percelderlypoverty  : num [1:437] 12.44 25.23 12.7 6.22 19.2 ...
 $ inmetro             : int [1:437] 0 0 0 1 0 0 0 0 0 1 ...
 $ category            : chr [1:437] "AAR" "LHR" "AAR" "ALU" ...
```

Here we have a `character`.

---

# Objects in R

The most common types of data we see are `numeric`, `character` and `factor`. You can also see `Date` and `logical`.

You can test whether data is of a particular type, or convert from one data type to another.

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Data type </th>
   <th style="text-align:left;"> Test </th>
   <th style="text-align:left;"> Convert </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> numeric </td>
   <td style="text-align:left;"> is.numeric </td>
   <td style="text-align:left;"> as.numeric </td>
  </tr>
  <tr>
   <td style="text-align:left;"> character </td>
   <td style="text-align:left;"> is.character </td>
   <td style="text-align:left;"> as.character </td>
  </tr>
  <tr>
   <td style="text-align:left;"> factor </td>
   <td style="text-align:left;"> is.factor </td>
   <td style="text-align:left;"> as.factor </td>
  </tr>
</tbody>
</table>

--

This conversion is important. Why?
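---

# Objects in R

One possible answer, as a minimal sketch with made-up values (`dose_text` and `dose_factor` are hypothetical names, not part of the course data): numbers that arrive as text or as a factor need converting before you can do arithmetic on them, and converting a factor directly returns its internal integer codes rather than the values you see printed.

```r
# Hypothetical example: numbers that were read in as text
dose_text <- c("10", "2.5", "40")
is.numeric(dose_text)                   # FALSE, so mean(dose_text) would not work

dose <- as.numeric(dose_text)           # convert text to numbers
mean(dose)                              # 17.5

# The same values stored as a factor
dose_factor <- factor(dose_text)
as.numeric(dose_factor)                 # 1 2 3: the integer codes, not the doses
as.numeric(as.character(dose_factor))   # 10.0 2.5 40.0: convert via character first
```

Going through `as.character` first is the safe route whenever a factor holds values that are really numbers.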
---

# Factors

Factors are uniquely an R thing. They are meant to represent categorical data (gender, race, state, phenotype).

They look like character vectors, but internally they are stored as integer codes, so you have to be a bit careful with them.

--

Whenever you're in doubt, convert them to characters using `as.character`.

We'll see the utility of factors when we do data munging, summaries and modeling.

---

## Every object in R has a name

You give an object a name using the syntax `name <- object`

Naming conventions:

1. snake_case or pothole_case
1. CamelCase
1. Some.people.use.periods

I'm partial to `snake_case`. The point here is to make expressive names using English so you know what is stored in the name.

---

## A silly exercise

From the iris dataset, save each column into a new object, giving it a name. Then see what kind of data that object contains.
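---

## A silly exercise

A possible start, sketched for two of the five columns. The object names `sepal_length` and `species` are just illustrative choices; any expressive name would do.

```r
# Save two columns of the built-in iris data under descriptive names
sepal_length <- iris$Sepal.Length
species <- iris$Species

# Check what kind of data each object holds
class(sepal_length)   # "numeric"
class(species)        # "factor"
levels(species)       # the categories the factor can take
```

The remaining three columns follow the same pattern.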
---

## Bigger objects

### Scalar -> vector (array) -> matrix (2-d array)

--

+ A scalar is a single number or word
+ A vector is a bunch of scalars arranged in a row or column
+ A matrix is a bunch of scalars arranged in rows and columns

#### All the elements in each of these **must** be of the same data type

---

## Examples

```r
2
```

```
[1] 2
```

--

```r
c(2,3,4,5,6)
```

```
[1] 2 3 4 5 6
```

> `c()` is the concatenate function

--

```r
matrix(c(1,2,3,4), byrow = TRUE, nrow = 2)
```

```
     [,1] [,2]
[1,]    1    2
[2,]    3    4
```

---
class:middle,center

# Data comes in many flavors

---

## Heterogeneous data

From Excel, we are familiar with keeping different kinds of data together in a spreadsheet

+ Expression levels (numeric)
+ Gene names (character)
+ Date of experiment (Date)

In R, the objects that can hold heterogeneous data are `data.frame` and `list`

---
class:middle, center

# Data sets

---

## Typical data structure

+ Data is typically in a rectangular format
    + spreadsheet, database table
    + CSV (comma-separated values) or TSV (tab-separated values) files
+ Characteristics
    + Rows are observations
    + Columns are variables
    + Each column has the same number of observations

> [__Tidy data__](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) is a particularly amenable format for data analysis.

---

# The `data.frame`

Data frames are the primary mode of storing datasets in R.

They were revolutionary in that they kept heterogeneous data together.

They share properties of both a __matrix__ and a __list__.

```r
class(airquality)
```

```
[1] "data.frame"
```

> Technically, a data.frame is a list of vectors (or objects, generally) of the same length

---
class: middle, center

# Load some data

---

We'll load the `spine` dataset into R. To do this, download the data from the web, and store it in the main folder in your project. Then, in the Environment pane, import it using the **Import Dataset** button.
You will use the `From text (readr)` option

---

![](img/readr1.png)

---

![](img/readr2.png)

---

![](img/readr3.png)

---
class:center, middle

# A digression: Lists and Matrices

---

# Matrices

A __matrix__ is a rectangular array of data _of the same type_

```r
matrix(0, nrow=2, ncol=4)
```

```
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
```

```r
matrix(letters, nrow=2)
```

```
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "c"  "e"  "g"  "i"  "k"  "m"  "o"  "q"  "s"   "u"   "w"   "y"
[2,] "b"  "d"  "f"  "h"  "j"  "l"  "n"  "p"  "r"  "t"   "v"   "x"   "z"
```

```r
matrix(letters, nrow=2, byrow=TRUE)
```

```
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j"   "k"   "l"   "m"
[2,] "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"   "x"   "y"   "z"
```

---

# Matrices

You can create a matrix from a set of _vectors_ of the same length

```r
x <- c(1,2,3,4)
y <- c(10,20,30,40)
```

Put columns together

```r
cbind(c(1,2,3,4), c(10,20,30,40)) ## Column bind
```

```
     [,1] [,2]
[1,]    1   10
[2,]    2   20
[3,]    3   30
[4,]    4   40
```

---

# Matrices

You can create a matrix from a set of _vectors_ of the same length

```r
x <- c(1,2,3,4)
y <- c(10,20,30,40)
```

Put rows together

```r
example_matrix <- rbind(c(1,2,3,4), c(10,20,30,40)) ## Row bind
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

---

# Extracting elements

```r
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

```r
example_matrix[1,] ## Extracts 1st row
```

```
[1] 1 2 3 4
```

```r
example_matrix[,2:3] ## Extracts 2nd & 3rd columns
```

```
     [,1] [,2]
[1,]    2    3
[2,]   20   30
```

```r
example_matrix[1,4] ## Extracts the element in row 1, column 4
```

```
[1] 4
```

---

# Matrix properties

```r
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

```r
nrow(example_matrix) ## Number of rows
```

```
[1] 2
```

```r
ncol(example_matrix) ## Number of columns
```

```
[1] 4
```

```r
dim(example_matrix) ## Rows and columns in one call
```

```
[1] 2 4
```

---

# Matrix arithmetic
```r
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

```r
example_matrix + 5 ## Add 5 to each element
```

```
     [,1] [,2] [,3] [,4]
[1,]    6    7    8    9
[2,]   15   25   35   45
```

```r
example_matrix * 2 ## Multiply each element by 2
```

```
     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   20   40   60   80
```

---

# Two matrices

```r
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

```r
example_matrix2 <- rbind(3:6, 9:12)
example_matrix2
```

```
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    9   10   11   12
```

```r
example_matrix + example_matrix2
```

```
     [,1] [,2] [,3] [,4]
[1,]    4    6    8   10
[2,]   19   30   41   52
```

---

# Two matrices

```r
example_matrix
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
```

```r
example_matrix2
```

```
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    9   10   11   12
```

```r
example_matrix * example_matrix2 ## Not matrix multiplication, but element-wise multiplication
```

```
     [,1] [,2] [,3] [,4]
[1,]    3    8   15   24
[2,]   90  200  330  480
```

---

# Two matrices

```r
rbind(example_matrix, example_matrix2)
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]   10   20   30   40
[3,]    3    4    5    6
[4,]    9   10   11   12
```

```r
cbind(example_matrix, example_matrix2)
```

```
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    2    3    4    3    4    5    6
[2,]   10   20   30   40    9   10   11   12
```

---

# Two matrices

```r
dim(example_matrix2)
```

```
[1] 2 4
```

```r
t(example_matrix2) ## Transpose of a matrix
```

```
     [,1] [,2]
[1,]    3    9
[2,]    4   10
[3,]    5   11
[4,]    6   12
```

```r
example_matrix %*% t(example_matrix2) ## Matrix multiplication
```

```
     [,1] [,2]
[1,]   50  110
[2,]  500 1100
```

---

# Lists

Lists are collections of arbitrary objects in R

```r
example_list <- list(c('Andy','Brian','Harry'), c(12, 16, 16),
                     c(TRUE, TRUE, FALSE), matrix(1, nrow=2, ncol=3))
example_list
```

```
[[1]]
[1] "Andy"  "Brian" "Harry"

[[2]]
[1] 12 16 16

[[3]]
[1]  TRUE  TRUE FALSE

[[4]]
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
```

---

# Extracting elements from lists

```r
example_list[[3]]
```

```
[1]  TRUE  TRUE FALSE
```

```r
example_list[1:2]
```

```
[[1]]
[1] "Andy"  "Brian" "Harry"

[[2]]
[1] 12 16 16
```

---

# Extracting elements from lists

```r
example_list[[4]]
```

```
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
```

```r
class(example_list[[4]])
```

```
[1] "matrix" "array"
```

```r
example_list[[4]][1,]
```

```
[1] 1 1 1
```

---

# Named lists

```r
example_named_list <- list('Names' = c('Andy','Brian','Harry'),
                           "YearsOfEducation" = c(12, 16, 16),
                           "Married" = c(TRUE, TRUE, FALSE),
                           'something' = matrix(1, nrow=2, ncol=3))
example_named_list[['Names']]
```

```
[1] "Andy"  "Brian" "Harry"
```

```r
example_named_list$Names
```

```
[1] "Andy"  "Brian" "Harry"
```

```r
example_named_list$Names[3]
```

```
[1] "Harry"
```

---
class: middle, center

# Back to a Data Frame

---

# Data frames

A data.frame object is a __named list__ where each element is of the same length

You can use both _matrix_ and _list_ functions to operate on data.frame objects!!

---

# Data Frames

```r
head(data_spine)
```

```
  Pelvic.incidence Pelvic.tilt Lumbar.lordosis.angle Sacral.slope Pelvic.radius
1         63.02782   22.552586              39.60912     40.47523      98.67292
2         39.05695   10.060991              25.01538     28.99596     114.40543
3         68.83202   22.218482              50.09219     46.61354     105.98514
4         69.29701   24.652878              44.31124     44.64413     101.86850
5         49.71286    9.652075              28.31741     40.06078     108.16872
6         40.25020   13.921907              25.12495     26.32829     130.32787
  Degree.spondylolisthesis Pelvic.slope Direct.tilt Thoracic.slope
1                -0.254400    0.7445035     12.5661        14.5386
2                 4.564259    0.4151857     12.8874        17.5323
3                -3.530317    0.4748892     26.8343        17.4861
4                11.211523    0.3693453     23.5603        12.7074
5                 7.918501    0.5433605     35.4940        15.9546
6                 2.230652    0.7899929     29.3230        12.0036
  Cervical.tilt Sacrum.angle Scoliosis.slope Class.attribute
1      15.30468   -28.658501         43.5123        Abnormal
2      16.78486   -25.530607         16.1102        Abnormal
3      16.65897   -29.031888         19.2221        Abnormal
4      11.42447   -30.470246         18.8329        Abnormal
5       8.87237   -16.378376         24.9171        Abnormal
6      10.40462    -1.512209          9.6548        Abnormal
```

---

# Data Frames

```r
dim(data_spine)
```

```
[1] 310  13
```

```r
nrow(data_spine)
```

```
[1] 310
```

```r
data_spine_small <- data_spine[1:4,] ## Matrix operation
```

---

# Data Frames

```r
data_spine_small[,2] ## Matrix extraction by position
```

```
[1] 22.55259 10.06099 22.21848 24.65288
```

```r
data_spine_small[[2]] ## List extraction by position
```

```
[1] 22.55259 10.06099 22.21848 24.65288
```

---

# Data Frames

```r
data_spine_small[['Pelvic.tilt']] ## Named list extraction
```

```
[1] 22.55259 10.06099 22.21848 24.65288
```

```r
data_spine_small[,'Pelvic.tilt'] ## Data frame named column extraction
```

```
[1] 22.55259 10.06099 22.21848 24.65288
```

```r
data_spine_small$Pelvic.tilt ## Dollar sign extraction
```

```
[1] 22.55259 10.06099 22.21848 24.65288
```

---

# Data Frames

My preference, in order, is for

1. _data frame named column extraction_ `data_spine_small[,'Pelvic.tilt']`
2. _named list extraction_ `data_spine_small[['Pelvic.tilt']]`
3. _dollar-based extraction_ `data_spine_small$Pelvic.tilt`

---

# Data Frames

```r
names(data_spine_small)
```

```
 [1] "Pelvic.incidence"         "Pelvic.tilt"
 [3] "Lumbar.lordosis.angle"    "Sacral.slope"
 [5] "Pelvic.radius"            "Degree.spondylolisthesis"
 [7] "Pelvic.slope"             "Direct.tilt"
 [9] "Thoracic.slope"           "Cervical.tilt"
[11] "Sacrum.angle"             "Scoliosis.slope"
[13] "Class.attribute"
```

```r
data_spine_small[,c('Pelvic.tilt', 'Pelvic.slope','Class.attribute')]
```

```
  Pelvic.tilt Pelvic.slope Class.attribute
1    22.55259    0.7445035        Abnormal
2    10.06099    0.4151857        Abnormal
3    22.21848    0.4748892        Abnormal
4    24.65288    0.3693453        Abnormal
```
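
---

# Data Frames

To try these ideas without the spine file, here is a minimal sketch on a toy data frame. The object `toy` and its column names are made up for illustration, and `stringsAsFactors = FALSE` is set only so the result is the same across R versions. It suggests why a data frame can be treated both as a list of columns and as a matrix-like table.

```r
# Hypothetical toy data; names and values are invented for this sketch
toy <- data.frame(
  name = c("Andy", "Brian", "Harry"),
  years_of_education = c(12, 16, 16),
  married = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

is.list(toy)   # TRUE: a data frame is a list of columns
length(toy)    # list view: the number of columns
dim(toy)       # matrix view: rows, then columns

# The three extraction styles return the same vector
by_matrix <- toy[, "years_of_education"]
by_list   <- toy[["years_of_education"]]
by_dollar <- toy$years_of_education
identical(by_matrix, by_list) && identical(by_list, by_dollar)

# sapply() is a list function: it visits each column in turn
column_types <- sapply(toy, class)
column_types
```

Because each column is its own vector, the columns can have different types while every value within a column shares one type.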