4.5 Cleaning from Excel files

Excel, being omnipresent, creates its own sets of difficulties. Excel, on top of being a data entry and analysis platform, is also a visual platform, so tables are often created to look good rather than be tidy. Things we commonly find in Excel files include colored rows and columns, multiple lines of headers, multiple rows with variables, typos resulting in numeric data becoming string data, and many others.

Let’s start with a data set that’s structured similarly to many government datasets around the world.

Notice that there are two levels of headers on top, each spanning multiple columns. Data is in paired columns, with the first column being the statistic and the second column being its standard error. The first row is a title, which we don’t need, and the 4th row is also not needed.

We’ll first tidy-fy this dataset, and then we’ll clean it up a bit. Let’s first think about the structure we want to get to for tidy-fication. The actual data lies in the statistics and the standard errors, and each of the column groupings represents different variables, so they should be in columns.

You can try to load this data using import, but you’ll find it’s a big mess. There are two powerful packages (by the same group of developers), tidyxl and unpivotr, that are fantastic tools for “fixing” Excel files for analysis.

Let’s start with tidyxl.

library(tidyxl)
dataset1 <- xlsx_cells('data/attendance.xlsx')
dataset1

## # A tibble: 1,173 x 21
##    sheet address   row   col is_blank data_type error logical numeric
##    <chr> <chr>   <int> <int> <lgl>    <chr>     <chr> <lgl>     <dbl>
##  1 Tabl… A1          1     1 FALSE    character <NA>  NA           NA
##  2 Tabl… B1          1     2 TRUE     blank     <NA>  NA           NA
##  3 Tabl… C1          1     3 TRUE     blank     <NA>  NA           NA
##  4 Tabl… D1          1     4 TRUE     blank     <NA>  NA           NA
##  5 Tabl… E1          1     5 TRUE     blank     <NA>  NA           NA
##  6 Tabl… F1          1     6 TRUE     blank     <NA>  NA           NA
##  7 Tabl… G1          1     7 TRUE     blank     <NA>  NA           NA
##  8 Tabl… H1          1     8 TRUE     blank     <NA>  NA           NA
##  9 Tabl… I1          1     9 TRUE     blank     <NA>  NA           NA
## 10 Tabl… J1          1    10 TRUE     blank     <NA>  NA           NA
## # … with 1,163 more rows, and 12 more variables: date <dttm>,
## #   character <chr>, character_formatted <list>, formula <chr>,
## #   is_array <lgl>, formula_ref <chr>, formula_group <int>, comment <chr>,
## #   height <dbl>, width <dbl>, style_format <chr>, local_format_id <int>

Notice that this pulls in a lot of meta-data in a tidy form, including information about cell formatting. This will be really useful in many situations.

First, lets get rid of the rows we don’t need.

dataset1 <- dataset1 %>% filter(row != 1, row != 4, row < 65)

Now we could manipulate this dataset using tidyverse tools, but unpivotr is much mor poweful. First, we are going to pull off the two headers. unpivotr does this using the function behead (suggestive?), with the first argument being the direction (‘N’, “S”, ‘E’,‘W’,etc) of the table that the header is present. We will also consider the first column, consisting of state names, as a header on the left.

library(unpivotr)
dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State)

## # A tibble: 960 x 7
##      row   col data_type  numeric tophead               head2     State    
##    <int> <int> <chr>        <dbl> <chr>                 <chr>     <chr>    
##  1     5     2 numeric    9.31e+1 Total elementary, se… ADA as p… "   Unit…
##  2     5     3 numeric    2.19e-1 <NA>                  <NA>      "   Unit…
##  3     5     4 numeric    6.64e+0 <NA>                  Average … "   Unit…
##  4     5     5 numeric    1.76e-2 <NA>                  <NA>      "   Unit…
##  5     5     6 numeric    1.80e+2 <NA>                  Average … "   Unit…
##  6     5     7 numeric    1.43e-1 <NA>                  <NA>      "   Unit…
##  7     5     8 numeric    1.19e+3 <NA>                  Average … "   Unit…
##  8     5     9 numeric    3.09e+0 <NA>                  <NA>      "   Unit…
##  9     5    10 numeric    9.40e+1 Elementary schools    ADA as p… "   Unit…
## 10     5    11 numeric    2.69e-1 <NA>                  <NA>      "   Unit…
## # … with 950 more rows

We need to separate the statistics and the standard errors from consecutive columns, and also make them headers.

dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  mutate(header = ifelse(col %% 2 == 0, 'stats','se'))

## # A tibble: 960 x 8
##      row   col data_type  numeric tophead          head2    State    header
##    <int> <int> <chr>        <dbl> <chr>            <chr>    <chr>    <chr> 
##  1     5     2 numeric    9.31e+1 Total elementar… ADA as … "   Uni… stats 
##  2     5     3 numeric    2.19e-1 <NA>             <NA>     "   Uni… se    
##  3     5     4 numeric    6.64e+0 <NA>             Average… "   Uni… stats 
##  4     5     5 numeric    1.76e-2 <NA>             <NA>     "   Uni… se    
##  5     5     6 numeric    1.80e+2 <NA>             Average… "   Uni… stats 
##  6     5     7 numeric    1.43e-1 <NA>             <NA>     "   Uni… se    
##  7     5     8 numeric    1.19e+3 <NA>             Average… "   Uni… stats 
##  8     5     9 numeric    3.09e+0 <NA>             <NA>     "   Uni… se    
##  9     5    10 numeric    9.40e+1 Elementary scho… ADA as … "   Uni… stats 
## 10     5    11 numeric    2.69e-1 <NA>             <NA>     "   Uni… se    
## # … with 950 more rows

The %% operator computes the remainder if the left side is divided by the right side. So the criterion is asking which columns are even. The ifelse statement says, if the criterion is met, write “stats”, otherwise write “se”. This new variable is assigned to the dataset with the variable name “header”.

Notice that we have actually tidy-fied this dataset, but there’s missing data here, since the column headers span several rows visually but are only credited to the first column it covers. So we need to fill in the entries for the remaining rows with the corresponding entry from the earliest column. There is a function fill in tidyr that does this general trick, using a method called last value carried forward.

dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  fill(tophead) %>% 
  fill(head2)

## # A tibble: 960 x 7
##      row   col data_type  numeric tophead               head2     State    
##    <int> <int> <chr>        <dbl> <chr>                 <chr>     <chr>    
##  1     5     2 numeric    9.31e+1 Total elementary, se… ADA as p… "   Unit…
##  2     5     3 numeric    2.19e-1 Total elementary, se… ADA as p… "   Unit…
##  3     5     4 numeric    6.64e+0 Total elementary, se… Average … "   Unit…
##  4     5     5 numeric    1.76e-2 Total elementary, se… Average … "   Unit…
##  5     5     6 numeric    1.80e+2 Total elementary, se… Average … "   Unit…
##  6     5     7 numeric    1.43e-1 Total elementary, se… Average … "   Unit…
##  7     5     8 numeric    1.19e+3 Total elementary, se… Average … "   Unit…
##  8     5     9 numeric    3.09e+0 Total elementary, se… Average … "   Unit…
##  9     5    10 numeric    9.40e+1 Elementary schools    ADA as p… "   Unit…
## 10     5    11 numeric    2.69e-1 Elementary schools    ADA as p… "   Unit…
## # … with 950 more rows

To make this really tidy, we need to make two columns titled stats and se from this. We’ve seen this using spread, but there is a slightly more robust method from unpivotr called spatter which is meant for this unique structure.

tidy_dataset <- dataset1 %>% 
  behead('N', tophead) %>% 
  behead('N', head2) %>% 
  behead('W', State) %>% 
  select(row, col, data_type, numeric, tophead, head2, State) %>% 
  mutate(header = ifelse(col %% 2 == 0, 'stats','se')) %>% 
  fill(tophead) %>% 
  fill(head2) %>% 
  select(row, numeric, tophead, head2, State, header) %>% 
  spatter(header, numeric) %>% 
  select(-row)
tidy_dataset

## # A tibble: 480 x 5
##    tophead                        head2          State            se  stats
##    <chr>                          <chr>          <chr>         <dbl>  <dbl>
##  1 Elementary schools             ADA as percen… "   United … 0.269  9.40e1
##  2 Elementary schools             Average hours… "   United … 0.0160 6.66e0
##  3 Secondary schools              ADA as percen… "   United … 0.432  9.11e1
##  4 Secondary schools              Average hours… "   United … 0.0403 6.59e0
##  5 Total elementary, secondary, … ADA as percen… "   United … 0.219  9.31e1
##  6 Total elementary, secondary, … Average days … "   United … 0.143  1.80e2
##  7 Total elementary, secondary, … Average hours… "   United … 0.0176 6.64e0
##  8 Total elementary, secondary, … Average hours… "   United … 3.09   1.19e3
##  9 Elementary schools             ADA as percen… Alabama ...… 1.84   9.38e1
## 10 Elementary schools             Average hours… Alabama ...… 0.0759 7.04e0
## # … with 470 more rows

We have to clean the State variable. We’ll use the methods in the stringr package, which is already loaded with the tidyverse.

tidy_dataset <- tidy_dataset %>% 
  mutate(State = str_remove(State, '\\.+')) %>% 
  mutate(State = str_trim(State))

The first verb removes all the . in the variable, using something called a regular expression. This particular expression means that we want to look for sequences of dots, and remove them. The \\ before the . tells R that we really mean ., since the . has a different meaning in regular expressions.

The second verb trims away blank spaces before and after each entry.

We’ll hold on to this dataset for the visualization section. Just to be safe, let’s save it.

saveRDS(tidy_dataset, file = 'data/attendance.rds')

This saves the data in an R-specific format that will allow us to load it quickly.

The RDS format is an open standard and so it can be called from other programs if the appropriate programs are written.

Dealing with visual formating (colors)

The dataset we’ll use for this has identifiable information, so I will not expose it publicly. It is available in your files as data/classlist.xlsx.

Since we’re interested in background and font colors here, which are informative, we also need to load the format information into R.

library(tidyxl)
library(unpivotr)

dataset2 <- xlsx_cells('data/classlist.xlsx')
formats <- xlsx_formats('data/classlist.xlsx')

format_id <- dataset2$local_format_id
dataset2$font_color <- formats$local$font$color$rgb[format_id]
dataset2$bg_color <- formats$local$fill$patternFill$fgColor$rgb[format_id]

unique(dataset2$font_color)

## [1] "FF000000" "FF0563C1" "FFFF0000"

unique(dataset2$bg_color)

## [1] "FFFFC000" NA         "FFE7E6E6"

So we can filter rows based on these two colors if we want.

To tidy-fy this dataset, we realize that there are really two interweaved datasets. The odd rows are one dataset and the even rows are another dataset.

dat1 <- dataset2 %>% 
  filter( row %% 2 == 1) %>% # odd rows
  behead('N', header) %>% 
  mutate(row = (row+1)/2) # make the row numbers sequential

dat2 <- dataset2 %>% 
  filter(row %% 2 == 0) %>%  # even rows
  behead('N', header) %>% 
  mutate(row = row/2) %>% # make row numbers sequential
  mutate(col = col+4) # These will be the last 4 cols of new data

tidy_dataset2 <- 
  rbind(dat1, dat2) %>% # Put datsets on top of each other
  select(row, data_type, numeric, character, header) %>% 
  spatter(header) %>% 
  select(-row, -numeric)

We’ll do a couple of finesse things to finish. First, we’ll make the names with no spaces (they’re a pain to write otherwise) and put the student name on the first column.

tidy_dataset2 <- tidy_dataset2 %>% 
  set_names(make.names(names(.))) %>% 
  select(Student.Name, everything())

make.names changes a vector of names into “approved” space-free format, replacing the space with .. One shortcut I’ve used is using . as an argument to names, which means that the . is replaced by the “noun” that is being acted on by the “verbs”. You will notice that I can also do multiple verbs together to work sequentially.