Once a dataset has been tidied, often one of the first things we want to do is generate summary statistics. We’ll be using the mpg
dataset that is built into the tidyverse
for this workshop. This dataset contains information about cars (such as engine size, fuel economy) produced by a number of different manufacturers. How would be go about generating (e.g.) the means and standard deviations grouped by car manufacturer for one of our variables? Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.
Remember to set up a .Rproj file for this workshop before continuing. In your script, you’ll first need to load the tidyverse.
library(tidyverse)
group_by()
and summarise()
We are going to use the group_by()
function to group the dataset, and then the summarise()
function to calculate the mean and sd of the hwy
variable. The summarise()
function can take a lot of different functions to give us summary statistics. To read more about the different options, type ?summarise
in the Console window. Commonly used ones are mean()
, median()
, sd()
.
mpg %>%
group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 audi 26.4 2.18 18
## 2 chevrolet 21.9 5.11 19
## 3 dodge 17.9 3.57 37
## 4 ford 19.4 3.33 25
## 5 honda 32.6 2.55 9
## 6 hyundai 26.9 2.18 14
## 7 jeep 17.6 3.25 8
## 8 land rover 16.5 1.73 4
## 9 lincoln 17 1 3
## 10 mercury 18 1.15 4
## 11 nissan 24.6 5.09 13
## 12 pontiac 26.4 1.14 5
## 13 subaru 25.6 1.16 14
## 14 toyota 24.9 6.17 34
## 15 volkswagen 29.2 5.32 27
Note that this output is currently ordered alphabetically by the first column manufacturer
. What if we wanted to order this out by mean highway fuel economy highest (best) to lowest (worst)? We can use the arrange function.
arrange()
mpg %>%
group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
arrange(mean_hwy)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 land rover 16.5 1.73 4
## 2 lincoln 17 1 3
## 3 jeep 17.6 3.25 8
## 4 dodge 17.9 3.57 37
## 5 mercury 18 1.15 4
## 6 ford 19.4 3.33 25
## 7 chevrolet 21.9 5.11 19
## 8 nissan 24.6 5.09 13
## 9 toyota 24.9 6.17 34
## 10 subaru 25.6 1.16 14
## 11 pontiac 26.4 1.14 5
## 12 audi 26.4 2.18 18
## 13 hyundai 26.9 2.18 14
## 14 volkswagen 29.2 5.32 27
## 15 honda 32.6 2.55 9
Hmm, so that isn’t what we want - this is going from lowest to highest which is the default in R. We can change that by putting a -
sign in from of the parameter we can to order by.
mpg %>%
group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
arrange(-mean_hwy)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 15 x 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 honda 32.6 2.55 9
## 2 volkswagen 29.2 5.32 27
## 3 hyundai 26.9 2.18 14
## 4 audi 26.4 2.18 18
## 5 pontiac 26.4 1.14 5
## 6 subaru 25.6 1.16 14
## 7 toyota 24.9 6.17 34
## 8 nissan 24.6 5.09 13
## 9 chevrolet 21.9 5.11 19
## 10 ford 19.4 3.33 25
## 11 mercury 18 1.15 4
## 12 dodge 17.9 3.57 37
## 13 jeep 17.6 3.25 8
## 14 lincoln 17 1 3
## 15 land rover 16.5 1.73 4
This is looking better.
summarise_at()
variantAs well as using summarise()
, you can use related functions such as summarise_at()
. This is a scoped version of the summarise()
function that can be applied across multiple columns. Note, when using summarise_at()
you need to put the columns you want to summarise over in quotes. You also need to provide the summary function - in this case mean
. Finally, in case our dataset contains any missing values (indicated by NA
), we set the parameter na.rm = TRUE
. This will ensure that missing data points are removed before the operation is applied. If we had missing data, but didn’t tell R what we wanted to do with it, it would have thrown an error.
mpg %>%
group_by(manufacturer) %>%
summarise_at(c("displ", "cty", "hwy"), mean, na.rm = TRUE)
## # A tibble: 15 x 4
## manufacturer displ cty hwy
## <chr> <dbl> <dbl> <dbl>
## 1 audi 2.54 17.6 26.4
## 2 chevrolet 5.06 15 21.9
## 3 dodge 4.38 13.1 17.9
## 4 ford 4.54 14 19.4
## 5 honda 1.71 24.4 32.6
## 6 hyundai 2.43 18.6 26.9
## 7 jeep 4.58 13.5 17.6
## 8 land rover 4.3 11.5 16.5
## 9 lincoln 5.4 11.3 17
## 10 mercury 4.4 13.2 18
## 11 nissan 3.27 18.1 24.6
## 12 pontiac 3.96 17 26.4
## 13 subaru 2.46 19.3 25.6
## 14 toyota 2.95 18.5 24.9
## 15 volkswagen 2.26 20.9 29.2
summarise_if()
variantImagine we had a really big dataset and wanted to summarise all columns that were of a certain type. We can use the summarise_if()
function to work out the mean for each of our car manufactures as follows:
mpg %>%
group_by(manufacturer) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
## # A tibble: 15 x 6
## manufacturer displ year cyl cty hwy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2.54 2004. 5.22 17.6 26.4
## 2 chevrolet 5.06 2005. 7.26 15 21.9
## 3 dodge 4.38 2004. 7.08 13.1 17.9
## 4 ford 4.54 2003. 7.2 14 19.4
## 5 honda 1.71 2003 4 24.4 32.6
## 6 hyundai 2.43 2004. 4.86 18.6 26.9
## 7 jeep 4.58 2006. 7.25 13.5 17.6
## 8 land rover 4.3 2004. 8 11.5 16.5
## 9 lincoln 5.4 2002 8 11.3 17
## 10 mercury 4.4 2004. 7 13.2 18
## 11 nissan 3.27 2004. 5.54 18.1 24.6
## 12 pontiac 3.96 2003. 6.4 17 26.4
## 13 subaru 2.46 2004. 4 19.3 25.6
## 14 toyota 2.95 2003. 5.12 18.5 24.9
## 15 volkswagen 2.26 2003. 4.59 20.9 29.2
The first parameter in summarise_if()
is the logical test applied to each column - in this case, if a column is numeric (i.e., a number) - then the test evaluates to TRUE and the second function, mean
, is applied. Again, we tell R to ignore missing (NA
) data values with the na.rm = TRUE
parameter.
R functions differ in terms of what arguments they take. I often forget them - if you start typing a function name, you’ll get a little bubble above where you’re typing to remind you what parameters are needed. And if you can’t remember the details, you can just type help(function_name)
or ?function_name
in the console for any function that you need help with. A lot of data analysis with R (or Python or wny other language really) involves a fair bit of Googling. This is normal. There are some things I can never remember and am always having to look up!
mutate()
We can add a new column that I’m calling mean_hwy
using the mutate()
function like this.
mpg %>%
group_by(manufacturer) %>%
mutate(mean_hwy = mean(hwy), sd_hwy = sd(hwy))
## # A tibble: 234 x 13
## # Groups: manufacturer [15]
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows, and 2 more variables: mean_hwy <dbl>, sd_hwy <dbl>
We have too many columns to display on this page so we can drop a coouple by using the select()
function slightly differently. By putting a -
sign in front of a column names in select()
we end up dropping it.
mpg %>%
group_by(manufacturer) %>%
mutate(mean_hwy = mean(hwy), sd_hwy = sd(hwy)) %>%
select(-class, -trans)
## # A tibble: 234 x 11
## # Groups: manufacturer [15]
## manufacturer model displ year cyl drv cty hwy fl mean_hwy sd_hwy
## <chr> <chr> <dbl> <int> <int> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 audi a4 1.8 1999 4 f 18 29 p 26.4 2.18
## 2 audi a4 1.8 1999 4 f 21 29 p 26.4 2.18
## 3 audi a4 2 2008 4 f 20 31 p 26.4 2.18
## 4 audi a4 2 2008 4 f 21 30 p 26.4 2.18
## 5 audi a4 2.8 1999 6 f 16 26 p 26.4 2.18
## 6 audi a4 2.8 1999 6 f 18 26 p 26.4 2.18
## 7 audi a4 3.1 2008 6 f 18 27 p 26.4 2.18
## 8 audi a4 qu… 1.8 1999 4 4 18 26 p 26.4 2.18
## 9 audi a4 qu… 1.8 1999 4 4 16 25 p 26.4 2.18
## 10 audi a4 qu… 2 2008 4 4 20 28 p 26.4 2.18
## # … with 224 more rows
Note that this doesn’t change the mpg
dataset permanently - the changes won’t be saved unless we map the output of this code onto a new variable. Below I am doing this by using the assignment operator <-
to map it onto a new variable I’m calling mpg_with_mean
. Note that we remove the grouping at the end as we don’t want our grouping rule to remain in our new data frame.
mpg_with_mean <- mpg %>%
group_by(manufacturer) %>%
mutate(mean_hwy = mean(hwy), sd_hyw = sd(hwy)) %>%
ungroup() %>%
select(-class, -trans)
We can then inspect this new variable using head()
and str()
.
head(mpg_with_mean)
## # A tibble: 6 x 11
## manufacturer model displ year cyl drv cty hwy fl mean_hwy sd_hyw
## <chr> <chr> <dbl> <int> <int> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 audi a4 1.8 1999 4 f 18 29 p 26.4 2.18
## 2 audi a4 1.8 1999 4 f 21 29 p 26.4 2.18
## 3 audi a4 2 2008 4 f 20 31 p 26.4 2.18
## 4 audi a4 2 2008 4 f 21 30 p 26.4 2.18
## 5 audi a4 2.8 1999 6 f 16 26 p 26.4 2.18
## 6 audi a4 2.8 1999 6 f 18 26 p 26.4 2.18
str(mpg_with_mean)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ mean_hwy : num [1:234] 26.4 26.4 26.4 26.4 26.4 ...
## $ sd_hyw : num [1:234] 2.18 2.18 2.18 2.18 2.18 ...
The Tidyverse
has a number of other built-in data sets. Another one is the starwars
data set. You can have a look at it by typing starwars
or by typing view(starwars)
. This second option will open the data set in a new window. Have a go playing around with it. Work out the mean height of humans in the Star Wars universe. There might be some missing data (indicated by NA
). You can use the na.rm = TRUE
parameter in your summarise()
function to ignore these values when generating your summary statistics.
Another way to filter out NA
values is to use the filter()
function in your pipeline. The function is.na()
returns a logical value of TRUE of FALSE. The operator !
means NOT, so the expression !is.na(height)
will return TRUE when the height value is present, and FALSE if absent. By combining this with filter()
we have the line filter(!is.na(height))
which will filter only the cases where we have height data (i.e., !is.na(height)
is TRUE). So your code might look like this:
starwars %>%
filter(!is.na(height)) %>%
filter(species == "Human") %>%
summarise(mean_height = mean(height))
Replace the word mean
in the summarise()
line with median
. What other things can you replace it with? Hint: type ?summarise
in the console. What other summary information can you extract from this dataset?
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.