Overview

In this workshop we shall take our first look at some key tools in the Tidyverse that will allow us to wrangle and tidy our data so that it’s in the format that we need in order to visualize and model it. The Tidyverse is a collection of packages that all ‘play nicely’ with each other. They are based on a common philosophy where data are represented in rectangular format (i.e., with rows and columns). These rectangular structures are known in the Tidyverse as tibbles. If you’re interested, you can read more about tibbles in the R4DS book here.

Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.

Loading the Tidyverse

Let’s take our first look at data wrangling. We are going to start with a dataset that comes with the Tidyverse. The dataset is called mpg and comprises fuel economy data from 1999 to 2008 for 38 popular models of cars in the US.

First, we need to load the tidyverse library with the following:

library(tidyverse)

If you run this line without having first installed the Tidyverse on your computer, you will encounter an error. R packages only need to be installed once, so if you want to load one into your library for the first time, you need to install it with install.packages(*packagename*).

For the tidyverse we need to install it with:

install.packages(tidyverse)

Once you have installed the tidyverse, you can then load it into your llbrary with the library() function. You only ever need to install a package once on your machine (unless you have updated R or you want to install the most up-to-date version of a particular package). When you are writing your R scripts, you never want to have the install.packages() function in the body of the script as if someone else were to run your script, this would update packages on their computer (which they might not want).

The `mpg` dataset

The mpg dataset is loaded as part of the Tidyverse In the help file, which you can access by typing help(mpg) or ?mpg we see the following:

Description This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

A data frame with 234 rows and 11 variables.

manufacturer - manufacturer model - model name displ - engine displacement, in litres year - year of manufacture cyl - number of cylinders trans -type of transmission drv -f = front-wheel drive, r = rear wheel drive, 4 = 4wd cty - city miles per gallon hwy - highway miles per gallon fl - fuel type class - “type” of car

Using `head()` and `str()`

We can explore the mpg dataset that is loaded with the Tidyverse in a number of ways. If we want to look at the first 6 lines of the dataset, we can use the head() function.

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

We see that it is a tibble - or a rectangular data frame - made up of rows and columns. This is tidy format where each observation corresponds to a row. Most of the analyses we will run in R involve tidy data. Within the Tidyverse, the tibble is the standard way to represent data. You’ll spend a lot of time tidying and wrangling your data to get it into this format! By doing this in R using a script that you write, you are making this key stage reproducible. You can run the script again on an updated or different dataset - thus likely saving you lots of time!

We can also ask for information about the structure of our dataset with str(). This will tell us about the columns, what type of variable each is, the number of rows etc.

str(mpg)

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Use `select()` to select columns

If we want to, we could just select one of the columns using the select() function. Below we are just selecing the column entitled manufacturer.

mpg %>%
  select(manufacturer)

## # A tibble: 234 x 1
##    manufacturer
##    <chr>       
##  1 audi        
##  2 audi        
##  3 audi        
##  4 audi        
##  5 audi        
##  6 audi        
##  7 audi        
##  8 audi        
##  9 audi        
## 10 audi        
## # … with 224 more rows

Related to the select() function is rename(). It does exactly what you think it might; it renames a column.

We can also look at the different car manufacturers in the dataset by using the distinct() function. This gives us the unique manufacturer names. This function can be quite handy if you want to check a dataset for duplicates of (e.g.) participant IDs.

mpg %>%
  distinct(manufacturer)

## # A tibble: 15 x 1
##    manufacturer
##    <chr>       
##  1 audi        
##  2 chevrolet   
##  3 dodge       
##  4 ford        
##  5 honda       
##  6 hyundai     
##  7 jeep        
##  8 land rover  
##  9 lincoln     
## 10 mercury     
## 11 nissan      
## 12 pontiac     
## 13 subaru      
## 14 toyota      
## 15 volkswagen

Use `filter()` to select rows

Sometimes we might want to select only a subset of rows in our dataset. We can do that using the filter() function. For example, here we filter our dataset to include only cars made by ‘honda’.

mpg %>%
  filter(manufacturer == "honda")

## # A tibble: 9 x 11
##   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class   
##   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>   
## 1 honda        civic   1.6  1999     4 manual(… f        28    33 r     subcomp…
## 2 honda        civic   1.6  1999     4 auto(l4) f        24    32 r     subcomp…
## 3 honda        civic   1.6  1999     4 manual(… f        25    32 r     subcomp…
## 4 honda        civic   1.6  1999     4 manual(… f        23    29 p     subcomp…
## 5 honda        civic   1.6  1999     4 auto(l4) f        24    32 r     subcomp…
## 6 honda        civic   1.8  2008     4 manual(… f        26    34 r     subcomp…
## 7 honda        civic   1.8  2008     4 auto(l5) f        25    36 r     subcomp…
## 8 honda        civic   1.8  2008     4 auto(l5) f        24    36 c     subcomp…
## 9 honda        civic   2    2008     4 manual(… f        21    29 p     subcomp…

Note, we use the operator == which means ‘is equal to’. This is a logical operator - other logical operators include less than <, greater than >, less than or equal to <=, greater then or equal to >=, and is not equal to !=.

We can also filter using a combination of possibilities via logical OR | or logical AND &. The first code chunk below filters the dataset for cases where the manufacturer is ‘honda’ OR ‘toyota’.

mpg %>%
  filter(manufacturer == "honda" | manufacturer == "toyota")

## # A tibble: 43 x 11
##    manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class 
##    <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr> 
##  1 honda        civic     1.6  1999     4 manual… f        28    33 r     subco…
##  2 honda        civic     1.6  1999     4 auto(l… f        24    32 r     subco…
##  3 honda        civic     1.6  1999     4 manual… f        25    32 r     subco…
##  4 honda        civic     1.6  1999     4 manual… f        23    29 p     subco…
##  5 honda        civic     1.6  1999     4 auto(l… f        24    32 r     subco…
##  6 honda        civic     1.8  2008     4 manual… f        26    34 r     subco…
##  7 honda        civic     1.8  2008     4 auto(l… f        25    36 r     subco…
##  8 honda        civic     1.8  2008     4 auto(l… f        24    36 c     subco…
##  9 honda        civic     2    2008     4 manual… f        21    29 p     subco…
## 10 toyota       4runne…   2.7  1999     4 manual… 4        15    20 r     suv   
## # … with 33 more rows

While below we filter for cases where the manufacturer is ‘honda’ and the year of manufacture is ‘1999’.

mpg %>% 
  filter(manufacturer == "honda" & year == "1999")

## # A tibble: 5 x 11
##   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class   
##   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>   
## 1 honda        civic   1.6  1999     4 manual(… f        28    33 r     subcomp…
## 2 honda        civic   1.6  1999     4 auto(l4) f        24    32 r     subcomp…
## 3 honda        civic   1.6  1999     4 manual(… f        25    32 r     subcomp…
## 4 honda        civic   1.6  1999     4 manual(… f        23    29 p     subcomp…
## 5 honda        civic   1.6  1999     4 auto(l4) f        24    32 r     subcomp…

Combining functions

We can combine the use of filter() with select() to filter for case where the manufacturer is ‘honda’, the year of manufacture is ‘1999’ and we only want to display these two columns plus those telling us about fuel economy - cty and hwy.

mpg %>% 
  filter(manufacturer == "honda" & year == "1999") %>%
  select(manufacturer, year, cty, hwy)

## # A tibble: 5 x 4
##   manufacturer  year   cty   hwy
##   <chr>        <int> <int> <int>
## 1 honda         1999    28    33
## 2 honda         1999    24    32
## 3 honda         1999    25    32
## 4 honda         1999    23    29
## 5 honda         1999    24    32

By combining just a few functions, you can imagine that we can build some quite complex data wrangling rules quite straightforwardly.

The pipe `%>%`

Note that in these examples above we used the %>% operator - this is called the pipe and allows us to pass information from one side of the pipe to the other. You can read it out load as ‘and then’. All of the functions (such as select(), filter() etc.) in the Tidyverse are known as verbs, and they describe what they do. The pipe is one of the most commonly used operators in the Tidyverse and allows us to chain together different lines of code - with the output of each line being passed on as input into the next. In this example, the dataset mpg is passed along to the distinct() function where we ask for a list of the distinct (i.e., unique) manufacturers. This output itself is a vector. Vectors are a basic data structure and contain elements of the same type - for example, a bunch of numbers. We can add another line to our piped chain to tell us how many elements are in this vector. We could read this out loud as ‘take the dataset mpg, and then work out the distinct manufacturer names, and then count them’.

mpg %>% 
  distinct(manufacturer) %>%
  count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    15

Tidying up a dataset

Tidying variable names

At the moment, the car manufacturer names are all in lower case. It would look a lot nicer if they were in title case (i.e., with capitalisation on the first letter of each word). We can use the mutate() function to create a new column - this time, the name of the new column is also the name of the old column that we’re wanting to modify using the function str_to_title(). What this will do is overwrite the column manufacturer and replace it with the new version with the car manufacturer names in title case.

mpg %>%
  mutate(manufacturer = str_to_title(manufacturer))

## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 Audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 Audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 Audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 Audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 Audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 Audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 Audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 Audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 Audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 Audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

The column model is also lowercase. Let’s make that title case too. We can use the mutate() function to work over more than one column at the same time like this:

mpg %>%
  mutate(manufacturer = str_to_title(manufacturer), model = str_to_title(model))

## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 Audi         A4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 Audi         A4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 Audi         A4         2    2008     4 manual… f        20    31 p     comp…
##  4 Audi         A4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 Audi         A4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 Audi         A4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 Audi         A4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 Audi         A4 Quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 Audi         A4 Quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 Audi         A4 Quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows

There are quite a few columns there, so how about we select just the manufacturer, model, year, transmission, and hwy columns:

mpg %>%
  mutate(manufacturer = str_to_title(manufacturer), model = str_to_title(model)) %>%
  select(manufacturer, model, year, trans, hwy)

## # A tibble: 234 x 5
##    manufacturer model       year trans        hwy
##    <chr>        <chr>      <int> <chr>      <int>
##  1 Audi         A4          1999 auto(l5)      29
##  2 Audi         A4          1999 manual(m5)    29
##  3 Audi         A4          2008 manual(m6)    31
##  4 Audi         A4          2008 auto(av)      30
##  5 Audi         A4          1999 auto(l5)      26
##  6 Audi         A4          1999 manual(m5)    26
##  7 Audi         A4          2008 auto(av)      27
##  8 Audi         A4 Quattro  1999 manual(m5)    26
##  9 Audi         A4 Quattro  1999 auto(l5)      25
## 10 Audi         A4 Quattro  2008 manual(m6)    28
## # … with 224 more rows

Recoding variables

In the real world, data frames do not always arrive on our computer in tidy format. Very often you need to engage in some data tidying before you can do anything useful with them. We’re going to look at an example of how we go from messy data to tidy data.

my_messy_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_data.csv")

We ran a reaction time experiment with 24 participants and 4 conditions - they are numbered 1-4 in our data file.

head(my_messy_data)

## # A tibble: 6 x 3
##   participant condition    rt
##         <dbl>     <dbl> <dbl>
## 1           1         1   879
## 2           1         2  1027
## 3           1         3  1108
## 4           1         4   765
## 5           2         1  1042
## 6           2         2  1050

This is a repeated measures design where we had one factor (Prime Type) with two levels (A vs. B) and a second factor (Target Type) with two levels (A vs. B). We want to recode our data frame so it better matches our experimental design. First we need to recode our 4 conditions like this:

Recode condition columns follows: Condition 1 = Prime A, Target A Condition 2 = Prime A, Target B Condition 3 = Prime B, Target A Condition 4 = Prime B, Target B

my_messy_data %>% 
  mutate(condition = recode(condition,
                            "1" = "PrimeA_TargetA",
                            "2" = "PrimeA_TargetB", 
                            "3" = "PrimeB_TargetA", 
                            "4" = "PrimeB_TargetB")) %>%
  head()

## # A tibble: 6 x 3
##   participant condition         rt
##         <dbl> <chr>          <dbl>
## 1           1 PrimeA_TargetA   879
## 2           1 PrimeA_TargetB  1027
## 3           1 PrimeB_TargetA  1108
## 4           1 PrimeB_TargetB   765
## 5           2 PrimeA_TargetA  1042
## 6           2 PrimeA_TargetB  1050

We now need to separate out our Condition column into two - one for our first factor (Prime), and one for our second factor (Target). The separate() function does just this - when used in conjunction with a piped tibble, it needs to know which column we want to separate, what new columns to create by separating that original column, and on what basis we want to do the separation. In the example below we tell separate() that we want to separate the column labeled condition into two new columns called Prime and Target and we want to do this at any points where a _ is present in the column to be separated.

my_messy_data %>% 
  mutate(condition = recode(condition,
                            "1" = "PrimeA_TargetA",
                            "2" = "PrimeA_TargetB", 
                            "3" = "PrimeB_TargetA", 
                            "4" = "PrimeB_TargetB")) %>%
  separate(col = "condition", into = c("Prime", "Target"), sep = "_")

## # A tibble: 96 x 4
##    participant Prime  Target     rt
##          <dbl> <chr>  <chr>   <dbl>
##  1           1 PrimeA TargetA   879
##  2           1 PrimeA TargetB  1027
##  3           1 PrimeB TargetA  1108
##  4           1 PrimeB TargetB   765
##  5           2 PrimeA TargetA  1042
##  6           2 PrimeA TargetB  1050
##  7           2 PrimeB TargetA   942
##  8           2 PrimeB TargetB   945
##  9           3 PrimeA TargetA   943
## 10           3 PrimeA TargetB   910
## # … with 86 more rows

my_messy_data %>% 
  mutate(condition = recode(condition,
                            "1" = "PrimeA_TargetA",
                            "2" = "PrimeA_TargetB", 
                            "3" = "PrimeB_TargetA", 
                            "4" = "PrimeB_TargetB")) %>%
  separate(col = "condition", into = c("Prime", "Target"), sep = "_") %>%
  mutate(Prime = factor(Prime), Target = factor(Target))

## # A tibble: 96 x 4
##    participant Prime  Target     rt
##          <dbl> <fct>  <fct>   <dbl>
##  1           1 PrimeA TargetA   879
##  2           1 PrimeA TargetB  1027
##  3           1 PrimeB TargetA  1108
##  4           1 PrimeB TargetB   765
##  5           2 PrimeA TargetA  1042
##  6           2 PrimeA TargetB  1050
##  7           2 PrimeB TargetA   942
##  8           2 PrimeB TargetB   945
##  9           3 PrimeA TargetA   943
## 10           3 PrimeA TargetB   910
## # … with 86 more rows

The pivot functions

Most of the analysis we will conduct in R requires our data to be in tidy, or long, format. In such data sets, one row corresponds to one observation. In the real world, data are rarely in the right format for analysis. In R, the pivot_wider() and pivot_longer() functions are designed to reshape our data files. First, let’s load a datafile that is in wide format (i.e., multiple observations per row). It is from an experiment where we had four conditions (labelled Condition1, Condition2, Condition3, and Condition4). In addition to there being a column for each of the 4 conditions, we also have a column corresponding to participant ID. Each cell in the data set corresponds to a reaction time (measured in milliseconds).

my_wide_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_wide_data.csv")

The `pivot_longer()` function

head(my_wide_data)

## # A tibble: 6 x 5
##      ID Condition1 Condition2 Condition3 Condition4
##   <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1     1        487        492        499        488
## 2     2        502        494        517        508
## 3     3        510        483        488        509
## 4     4        476        488        513        521
## 5     5        504        478        513        504
## 6     6        505        486        503        495

So, we can see the data file is in wide format. We want to reshape it to long format. We can do that using the pivot_longer() function.

Minially, we need to specify the data frame that we want to reshape, the columns that we want to ‘pivot’ into longer format, the name of the new column that we are creating, and the name of the column that will hold the values of our reshaped data frame. We are going to map the output to a variable I’m calling my_longer_data.

my_longer_data <- my_wide_data %>%
  pivot_longer(cols = c(Condition1, Condition2, Condition3, Condition4), 
               names_to = "Condition", 
               values_to = "RT")

Now let’s have a look at what our reshaped data frame looks like.

head(my_longer_data)

## # A tibble: 6 x 3
##      ID Condition     RT
##   <dbl> <chr>      <dbl>
## 1     1 Condition1   487
## 2     1 Condition2   492
## 3     1 Condition3   499
## 4     1 Condition4   488
## 5     2 Condition1   502
## 6     2 Condition2   494

So you can see our data are now in long - or tidy - format with one observation per row. Note that our Condition column isn’t coded as a factor. It’s important that our data set reflects the structure of our experiment so let’s convert that column to a factor - note that in the following code we are now ‘saving’ the change as we are not mapping the output onto a variable name.

my_longer_data %>%
  mutate(Condition = factor(Condition)) %>%
  head()

## # A tibble: 6 x 3
##      ID Condition     RT
##   <dbl> <fct>      <dbl>
## 1     1 Condition1   487
## 2     1 Condition2   492
## 3     1 Condition3   499
## 4     1 Condition4   488
## 5     2 Condition1   502
## 6     2 Condition2   494

The `pivot_wider()` function

We can use the pivot_wider() function to reshape a long data frame so that it goes from long to wide format. It works similarly to pivot_longer(). Let’s take our new, long, data frame and turn it back into wide format. With pivot_wider() we minimally need to specify the data frame that we want to reshape, and a pair or arguments (names_from and values_from) that describe from which column to get the name of the output column, and from which column to get the cell values.

my_wider_data <- my_longer_data %>%
  pivot_wider(names_from = "Condition", 
              values_from = "RT")

We can check that our data set is back in wide format.

head(my_wider_data)

## # A tibble: 6 x 5
##      ID Condition1 Condition2 Condition3 Condition4
##   <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
## 1     1        487        492        499        488
## 2     2        502        494        517        508
## 3     3        510        483        488        509
## 4     4        476        488        513        521
## 5     5        504        478        513        504
## 6     6        505        486        503        495

Joining Two Datasets

Sometimes you might need to combine two datasets. For example, you might have one dataset that contains reading time data (like the one above) and another than contains individual difference measures for the participants in the first dataset. How would we go about combining these two datasets so that we end up with one that includes both the reading time data and the individual difference measures (that perhaps we want to covary out later)? Luckily, the {dplyr} package contains a number of join functions that allows you to join together different tibbles. First, let’s load the data that contains the individual different measures.

individual_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/individual_diffs.csv")

Let’s look at the first few rows of the individual differences data. This dataset contains the ID numbers of our participants plus measures of IQ (the iq column) and Working Memory (the wm column).

head(individual_diffs)

## # A tibble: 6 x 3
##      ID    iq    wm
##   <dbl> <dbl> <dbl>
## 1     1   100     9
## 2     2   108     8
## 3     3   116     9
## 4     4    95     9
## 5     5    83    11
## 6     6    73    10

We want to combine this dataset with our reading time dataset from above my_longer_data which looks like this:

head(my_longer_data)

## # A tibble: 6 x 3
##      ID Condition     RT
##   <dbl> <chr>      <dbl>
## 1     1 Condition1   487
## 2     1 Condition2   492
## 3     1 Condition3   499
## 4     1 Condition4   488
## 5     2 Condition1   502
## 6     2 Condition2   494

Full Join

We can combine using one of the join functions. There are a variety of options including full_join() which includes all of the rows in tibble one or tibble two that we want to join. Other options include inner_join() which includes all of the rows in tibble one and tibble 2, as well as left_join() and right_join().

combined_data <- full_join(my_longer_data, individual_diffs, by = "ID")

We now see that our dataset are combined as we’d expect.

combined_data

## # A tibble: 128 x 5
##       ID Condition     RT    iq    wm
##    <dbl> <chr>      <dbl> <dbl> <dbl>
##  1     1 Condition1   487   100     9
##  2     1 Condition2   492   100     9
##  3     1 Condition3   499   100     9
##  4     1 Condition4   488   100     9
##  5     2 Condition1   502   108     8
##  6     2 Condition2   494   108     8
##  7     2 Condition3   517   108     8
##  8     2 Condition4   508   108     8
##  9     3 Condition1   510   116     9
## 10     3 Condition2   483   116     9
## # … with 118 more rows

Left Join

Of course, you may be thinking that we could just do a quick bit of Excel cut and paste of the columns we want from one dataset to the other. But what about the case where our individual differences file contains 10,000 participant IDs (in random order) and we’re only interested in combining the two datasets where there is a match?

large_ind_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/large_ind_diffs.csv")

head(large_ind_diffs)

## # A tibble: 6 x 3
##      ID    iq    wm
##   <dbl> <dbl> <dbl>
## 1  6057    93     7
## 2  2723    86     7
## 3  1088    97     9
## 4  8687    87     8
## 5  4223    77    11
## 6   369    95     9

We can actually use another join function (left_join()) to combine these two datasets, but only where there is a match of ID with the first of the two datasets (my_longer_data) in the function call.

left_join(my_longer_data, large_ind_diffs, by = "ID")

## # A tibble: 128 x 5
##       ID Condition     RT    iq    wm
##    <dbl> <chr>      <dbl> <dbl> <dbl>
##  1     1 Condition1   487   100     9
##  2     1 Condition2   492   100     9
##  3     1 Condition3   499   100     9
##  4     1 Condition4   488   100     9
##  5     2 Condition1   502   108     8
##  6     2 Condition2   494   108     8
##  7     2 Condition3   517   108     8
##  8     2 Condition4   508   108     8
##  9     3 Condition1   510   116     9
## 10     3 Condition2   483   116     9
## # … with 118 more rows

Your challenge

Have a go at recreating what I’ve done above by writing your first script using RStudio Desktop.

Improve this Workshop

If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.

Data Wrangling