Overview

Once a dataset has been tidied, often one of the first things we want to do is generate summary statistics. We’ll be using the mpg dataset that is built into the tidyverse for this workshop. This dataset contains information about cars (such as engine size, fuel economy) produced by a number of different manufacturers. How would be go about generating (e.g.) the means and standard deviations grouped by car manufacturer for one of our variables? Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.

  

  

Remember to set up a .Rproj file for this workshop before continuing. In your script, you’ll first need to load the tidyverse.

library(tidyverse)

Using group_by() and summarise()

We are going to use the group_by() function to group the dataset, and then the summarise() function to calculate the mean and sd of the hwy variable. The summarise() function can take a lot of different functions to give us summary statistics. To read more about the different options, type ?summarise in the Console window. Commonly used ones are mean(), median(), sd().

mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
##    manufacturer mean_hwy sd_hwy number
##    <chr>           <dbl>  <dbl>  <int>
##  1 audi             26.4   2.18     18
##  2 chevrolet        21.9   5.11     19
##  3 dodge            17.9   3.57     37
##  4 ford             19.4   3.33     25
##  5 honda            32.6   2.55      9
##  6 hyundai          26.9   2.18     14
##  7 jeep             17.6   3.25      8
##  8 land rover       16.5   1.73      4
##  9 lincoln          17     1         3
## 10 mercury          18     1.15      4
## 11 nissan           24.6   5.09     13
## 12 pontiac          26.4   1.14      5
## 13 subaru           25.6   1.16     14
## 14 toyota           24.9   6.17     34
## 15 volkswagen       29.2   5.32     27

Note that this output is currently ordered alphabetically by the first column manufacturer. What if we wanted to order this out by mean highway fuel economy highest (best) to lowest (worst)? We can use the arrange function.

Re-ordering the output with arrange()

mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
  arrange(mean_hwy)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
##    manufacturer mean_hwy sd_hwy number
##    <chr>           <dbl>  <dbl>  <int>
##  1 land rover       16.5   1.73      4
##  2 lincoln          17     1         3
##  3 jeep             17.6   3.25      8
##  4 dodge            17.9   3.57     37
##  5 mercury          18     1.15      4
##  6 ford             19.4   3.33     25
##  7 chevrolet        21.9   5.11     19
##  8 nissan           24.6   5.09     13
##  9 toyota           24.9   6.17     34
## 10 subaru           25.6   1.16     14
## 11 pontiac          26.4   1.14      5
## 12 audi             26.4   2.18     18
## 13 hyundai          26.9   2.18     14
## 14 volkswagen       29.2   5.32     27
## 15 honda            32.6   2.55      9

Hmm, so that isn’t what we want - this is going from lowest to highest which is the default in R. We can change that by putting a - sign in from of the parameter we can to order by.

mpg %>%
  group_by(manufacturer) %>%
  summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
  arrange(-mean_hwy)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
##    manufacturer mean_hwy sd_hwy number
##    <chr>           <dbl>  <dbl>  <int>
##  1 honda            32.6   2.55      9
##  2 volkswagen       29.2   5.32     27
##  3 hyundai          26.9   2.18     14
##  4 audi             26.4   2.18     18
##  5 pontiac          26.4   1.14      5
##  6 subaru           25.6   1.16     14
##  7 toyota           24.9   6.17     34
##  8 nissan           24.6   5.09     13
##  9 chevrolet        21.9   5.11     19
## 10 ford             19.4   3.33     25
## 11 mercury          18     1.15      4
## 12 dodge            17.9   3.57     37
## 13 jeep             17.6   3.25      8
## 14 lincoln          17     1         3
## 15 land rover       16.5   1.73      4

This is looking better.

The summarise_at() variant

As well as using summarise(), you can use related functions such as summarise_at(). This is a scoped version of the summarise() function that can be applied across multiple columns. Note, when using summarise_at() you need to put the columns you want to summarise over in quotes. You also need to provide the summary function - in this case mean. Finally, in case our dataset contains any missing values (indicated by NA), we set the parameter na.rm = TRUE. This will ensure that missing data points are removed before the operation is applied. If we had missing data, but didn’t tell R what we wanted to do with it, it would have thrown an error.

mpg %>% 
  group_by(manufacturer) %>%
  summarise_at(c("displ", "cty", "hwy"), mean, na.rm = TRUE)
## # A tibble: 15 x 4
##    manufacturer displ   cty   hwy
##    <chr>        <dbl> <dbl> <dbl>
##  1 audi          2.54  17.6  26.4
##  2 chevrolet     5.06  15    21.9
##  3 dodge         4.38  13.1  17.9
##  4 ford          4.54  14    19.4
##  5 honda         1.71  24.4  32.6
##  6 hyundai       2.43  18.6  26.9
##  7 jeep          4.58  13.5  17.6
##  8 land rover    4.3   11.5  16.5
##  9 lincoln       5.4   11.3  17  
## 10 mercury       4.4   13.2  18  
## 11 nissan        3.27  18.1  24.6
## 12 pontiac       3.96  17    26.4
## 13 subaru        2.46  19.3  25.6
## 14 toyota        2.95  18.5  24.9
## 15 volkswagen    2.26  20.9  29.2

The summarise_if() variant

Imagine we had a really big dataset and wanted to summarise all columns that were of a certain type. We can use the summarise_if() function to work out the mean for each of our car manufactures as follows:

mpg %>% 
  group_by(manufacturer) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
## # A tibble: 15 x 6
##    manufacturer displ  year   cyl   cty   hwy
##    <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 audi          2.54 2004.  5.22  17.6  26.4
##  2 chevrolet     5.06 2005.  7.26  15    21.9
##  3 dodge         4.38 2004.  7.08  13.1  17.9
##  4 ford          4.54 2003.  7.2   14    19.4
##  5 honda         1.71 2003   4     24.4  32.6
##  6 hyundai       2.43 2004.  4.86  18.6  26.9
##  7 jeep          4.58 2006.  7.25  13.5  17.6
##  8 land rover    4.3  2004.  8     11.5  16.5
##  9 lincoln       5.4  2002   8     11.3  17  
## 10 mercury       4.4  2004.  7     13.2  18  
## 11 nissan        3.27 2004.  5.54  18.1  24.6
## 12 pontiac       3.96 2003.  6.4   17    26.4
## 13 subaru        2.46 2004.  4     19.3  25.6
## 14 toyota        2.95 2003.  5.12  18.5  24.9
## 15 volkswagen    2.26 2003.  4.59  20.9  29.2

The first parameter in summarise_if() is the logical test applied to each column - in this case, if a column is numeric (i.e., a number) - then the test evaluates to TRUE and the second function, mean, is applied. Again, we tell R to ignore missing (NA) data values with the na.rm = TRUE parameter.

R functions differ in terms of what arguments they take. I often forget them - if you start typing a function name, you’ll get a little bubble above where you’re typing to remind you what parameters are needed. And if you can’t remember the details, you can just type help(function_name) or ?function_name in the console for any function that you need help with. A lot of data analysis with R (or Python or wny other language really) involves a fair bit of Googling. This is normal. There are some things I can never remember and am always having to look up!

Adding columns using mutate()

We can add a new column that I’m calling mean_hwy using the mutate() function like this.

mpg %>% 
  group_by(manufacturer) %>%
  mutate(mean_hwy = mean(hwy), sd_hwy = sd(hwy))
## # A tibble: 234 x 13
## # Groups:   manufacturer [15]
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows, and 2 more variables: mean_hwy <dbl>, sd_hwy <dbl>

We have too many columns to display on this page so we can drop a coouple by using the select() function slightly differently. By putting a - sign in front of a column names in select() we end up dropping it.

mpg %>% 
  group_by(manufacturer) %>%
  mutate(mean_hwy = mean(hwy), sd_hwy = sd(hwy)) %>%
  select(-class, -trans)
## # A tibble: 234 x 11
## # Groups:   manufacturer [15]
##    manufacturer model  displ  year   cyl drv     cty   hwy fl    mean_hwy sd_hwy
##    <chr>        <chr>  <dbl> <int> <int> <chr> <int> <int> <chr>    <dbl>  <dbl>
##  1 audi         a4       1.8  1999     4 f        18    29 p         26.4   2.18
##  2 audi         a4       1.8  1999     4 f        21    29 p         26.4   2.18
##  3 audi         a4       2    2008     4 f        20    31 p         26.4   2.18
##  4 audi         a4       2    2008     4 f        21    30 p         26.4   2.18
##  5 audi         a4       2.8  1999     6 f        16    26 p         26.4   2.18
##  6 audi         a4       2.8  1999     6 f        18    26 p         26.4   2.18
##  7 audi         a4       3.1  2008     6 f        18    27 p         26.4   2.18
##  8 audi         a4 qu…   1.8  1999     4 4        18    26 p         26.4   2.18
##  9 audi         a4 qu…   1.8  1999     4 4        16    25 p         26.4   2.18
## 10 audi         a4 qu…   2    2008     4 4        20    28 p         26.4   2.18
## # … with 224 more rows

Note that this doesn’t change the mpg dataset permanently - the changes won’t be saved unless we map the output of this code onto a new variable. Below I am doing this by using the assignment operator <- to map it onto a new variable I’m calling mpg_with_mean. Note that we remove the grouping at the end as we don’t want our grouping rule to remain in our new data frame.

mpg_with_mean <- mpg %>% 
  group_by(manufacturer) %>%
    mutate(mean_hwy = mean(hwy), sd_hyw = sd(hwy)) %>%
  ungroup() %>%
  select(-class, -trans) 

We can then inspect this new variable using head() and str().

head(mpg_with_mean)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl drv     cty   hwy fl    mean_hwy sd_hyw
##   <chr>        <chr> <dbl> <int> <int> <chr> <int> <int> <chr>    <dbl>  <dbl>
## 1 audi         a4      1.8  1999     4 f        18    29 p         26.4   2.18
## 2 audi         a4      1.8  1999     4 f        21    29 p         26.4   2.18
## 3 audi         a4      2    2008     4 f        20    31 p         26.4   2.18
## 4 audi         a4      2    2008     4 f        21    30 p         26.4   2.18
## 5 audi         a4      2.8  1999     6 f        16    26 p         26.4   2.18
## 6 audi         a4      2.8  1999     6 f        18    26 p         26.4   2.18
str(mpg_with_mean)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ mean_hwy    : num [1:234] 26.4 26.4 26.4 26.4 26.4 ...
##  $ sd_hyw      : num [1:234] 2.18 2.18 2.18 2.18 2.18 ...

Your challenge

The Tidyverse has a number of other built-in data sets. Another one is the starwars data set. You can have a look at it by typing starwars or by typing view(starwars). This second option will open the data set in a new window. Have a go playing around with it. Work out the mean height of humans in the Star Wars universe. There might be some missing data (indicated by NA). You can use the na.rm = TRUE parameter in your summarise() function to ignore these values when generating your summary statistics.

Another way to filter out NA values is to use the filter() function in your pipeline. The function is.na() returns a logical value of TRUE of FALSE. The operator ! means NOT, so the expression !is.na(height) will return TRUE when the height value is present, and FALSE if absent. By combining this with filter() we have the line filter(!is.na(height)) which will filter only the cases where we have height data (i.e., !is.na(height) is TRUE). So your code might look like this:

starwars %>%
  filter(!is.na(height)) %>%
  filter(species == "Human") %>%
  summarise(mean_height = mean(height))

Replace the word mean in the summarise() line with median. What other things can you replace it with? Hint: type ?summarise in the console. What other summary information can you extract from this dataset?

Improve this Workshop

If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.