5.1 Split-apply-combine, a.k.a. MapReduce

Split-apply-combine is a powerful paradigm for analyzing subgroups within a dataset. The basic idea is to split the data into pieces based on the values of one or more variables, apply the same operation to each piece, and then stitch the results back together.

For example, in the mtcars data, we might want to know what the average mpg is by the number of cylinders. The way to do this is:

mtcars1 %>% 
  group_by(cyl) %>% 
  summarize(mpg_mean = mean(mpg))

## # A tibble: 3 x 2
##   cyl   mpg_mean
##   <fct>    <dbl>
## 1 4         26.7
## 2 6         19.7
## 3 8         15.1
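To make the three steps explicit, the same averages can be computed with base R alone. This is only a sketch: it uses the built-in mtcars rather than mtcars1 (so cyl ends up as a character column, not a factor), and the names pieces, piece_means and mpg_by_cyl are made up for illustration.

```r
# Split: one data frame per distinct value of cyl
pieces <- split(mtcars, mtcars$cyl)

# Apply: compute the same summary on every piece
piece_means <- sapply(pieces, function(piece) mean(piece$mpg))

# Combine: stitch the per-piece results back into a single data frame
mpg_by_cyl <- data.frame(
  cyl = names(piece_means),
  mpg_mean = unname(piece_means)
)
mpg_by_cyl
```

The group_by() and summarize() pipeline does the same work, but keeps the split and combine steps out of sight.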

Once again, the scoped versions of summarize also work in this pipe:

mtcars1 %>% 
  group_by(cyl) %>% 
  summarize_if(is.numeric, mean)

## # A tibble: 3 x 7
##   cyl     mpg  disp    hp  drat    wt  qsec
##   <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4      26.7  105.  82.6  4.07  2.29  19.1
## 2 6      19.7  183. 122.   3.59  3.12  18.0
## 3 8      15.1  353. 209.   3.23  4.00  16.8

Let’s go a bit further and compute the medians as well.

mtcars1 %>% 
  group_by(cyl) %>% 
  summarize_if(is.numeric, list('mean'= mean, 'median' = median))

## # A tibble: 3 x 13
##   cyl   mpg_mean disp_mean hp_mean drat_mean wt_mean qsec_mean mpg_median
##   <fct>    <dbl>     <dbl>   <dbl>     <dbl>   <dbl>     <dbl>      <dbl>
## 1 4         26.7      105.    82.6      4.07    2.29      19.1       26  
## 2 6         19.7      183.   122.       3.59    3.12      18.0       19.7
## 3 8         15.1      353.   209.       3.23    4.00      16.8       15.2
## # … with 5 more variables: disp_median <dbl>, hp_median <dbl>,
## #   drat_median <dbl>, wt_median <dbl>, qsec_median <dbl>
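In dplyr 1.0.0 and later, the scoped verbs such as summarize_if are superseded by across(). A rough equivalent of the call above might look like the following sketch. It assumes dplyr >= 1.0.0 and uses the built-in mtcars with cyl converted to a factor as a stand-in for mtcars1, so it summarizes more columns than the output shown above.

```r
library(dplyr)

# Assumption: a stand-in for mtcars1, with cyl stored as a factor
cars <- mtcars %>% mutate(cyl = factor(cyl))

summary_by_cyl <- cars %>%
  group_by(cyl) %>%
  # where(is.numeric) selects the numeric columns; each function in the
  # named list is applied to every selected column
  summarize(across(where(is.numeric), list(mean = mean, median = median)))

names(summary_by_cyl)
```

The column names follow the same variable_function pattern, but across() orders them by variable (mpg_mean, mpg_median, disp_mean, ...) rather than by function.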

We can look at a second dataset of individual violent incidents in Western Africa from 2000 through the start of 2019 (the date range given in the file name). Using the same paradigm, we can easily get the number of incidents per country and year.

west_africa <- import('data/2000-01-01-2019-01-01-Western_Africa.csv')
west_africa %>% group_by(country, year) %>% tally()

## # A tibble: 290 x 3
## # Groups:   country [15]
##    country  year     n
##    <chr>   <int> <int>
##  1 Benin    2000     1
##  2 Benin    2001     3
##  3 Benin    2002     1
##  4 Benin    2003     2
##  5 Benin    2004     2
##  6 Benin    2005     2
##  7 Benin    2006     1
##  8 Benin    2007     3
##  9 Benin    2008     1
## 10 Benin    2009     2
## # … with 280 more rows
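dplyr also provides count(), which combines the grouping and tallying steps. The sketch below uses a small made-up data frame, incidents, as a hypothetical stand-in for west_africa, since the real file is not reproduced here.

```r
library(dplyr)

# Hypothetical toy data: one row per incident
incidents <- tibble(
  country = c("Benin", "Benin", "Benin", "Togo", "Togo"),
  year    = c(2000L, 2001L, 2001L, 2000L, 2000L)
)

# The long form used above; tally() leaves the result grouped by country
long_way <- incidents %>% group_by(country, year) %>% tally()

# count() gives the same counts and returns an ungrouped result
short_way <- incidents %>% count(country, year)
short_way
```

The one practical difference is the grouping: the group_by() and tally() version is still grouped by country, as the "Groups" line in the output above shows.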

For display, we can reshape this into a wide dataset, with one row per country and one column per year:

west_africa %>% group_by(country, year) %>% tally() %>% 
  spread(year, n)

## # A tibble: 15 x 21
## # Groups:   country [15]
##    country `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##    <chr>    <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
##  1 Benin        1      3      1      2      2      2      1      3      1
##  2 Burkin…     22      6      6      1      4      6      8      1     12
##  3 Gambia       8     14     13     11      5      4      6      2      4
##  4 Ghana       10      8      7     17      7      3      3      5     11
##  5 Guinea     180     70     14     10     11     15      7     46     15
##  6 Guinea…      9      2      3      5      4     13     21      2      2
##  7 Ivory …    133     34    135    177    101     45     30      6     24
##  8 Liberia     87    171    148    242     22     26     22      9     17
##  9 Mali         4      5      2      3      3      2     10     11     21
## 10 Maurit…      4      1      3     13      2      9      3      5     16
## 11 Niger       11      9     42      6     17      9      8     31     28
## 12 Nigeria    168    118    160    207    277    198    120    200    208
## 13 Senegal     86     61     40     18     11     11     29     24     20
## 14 Sierra…    495    224      5     18     14      5      1      6     15
## 15 Togo         4      4      3      4      3     25      1      1      1
## # … with 11 more variables: `2009` <int>, `2010` <int>, `2011` <int>,
## #   `2012` <int>, `2013` <int>, `2014` <int>, `2015` <int>, `2016` <int>,
## #   `2017` <int>, `2018` <int>, `2019` <int>
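spread() still works, but since tidyr 1.0.0 it is superseded by pivot_wider(). A minimal sketch of the same reshaping follows, assuming tidyr >= 1.0.0 and using a small made-up table, counts, in place of the real tallies.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy counts; Togo has no row for 2001
counts <- tibble(
  country = c("Benin", "Benin", "Togo"),
  year    = c(2000L, 2001L, 2000L),
  n       = c(1L, 3L, 4L)
)

wide <- counts %>%
  pivot_wider(
    names_from  = year,          # one new column per year
    values_from = n,             # cells are filled from n
    values_fill = list(n = 0L)   # missing country-year pairs become 0, not NA
  )
wide
```

Filling with 0 is a choice made for this sketch; by default both spread() and pivot_wider() leave NA where a country has no row for a given year.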

We’ll save this dataset for visualization later.

west_africa %>% group_by(country, year) %>% tally() %>% 
  spread(year, n) %>% 
  saveRDS('data/west_africa.rds')
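Note that the saved object is still grouped by country, and readRDS() restores it that way. If that is not wanted, one option is to call ungroup() before saving. The sketch below shows the round trip with a made-up table and a temporary file; the path and object names are illustrative only.

```r
library(dplyr)

# Hypothetical toy tallies, grouped the same way as in the text
counts <- tibble(
  country = c("Benin", "Benin", "Togo"),
  year    = c(2000L, 2001L, 2000L)
) %>%
  group_by(country, year) %>%
  tally()

is_grouped_df(counts)   # tally() drops only the last grouping variable

# Drop the remaining grouping, save, and read the object back
path <- tempfile(fileext = ".rds")
counts %>% ungroup() %>% saveRDS(path)
restored <- readRDS(path)
is_grouped_df(restored)
```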