In this workshop we shall take our first look at some key tools in the Tidyverse that will allow us to wrangle and tidy our data so that it’s in the format that we need in order to visualize and model it. The Tidyverse is a collection of packages that all ‘play nicely’ with each other. They are based on a common philosophy where data are represented in rectangular format (i.e., with rows and columns). These rectangular structures are known in the Tidyverse as tibbles
. If you’re interested, you can read more about tibbles
in the R4DS book here.
Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.
Let’s take our first look at data wrangling. We are going to start with a dataset that comes with the Tidyverse. The dataset is called mpg
and comprises fuel economy data from 1999 to 2008 for 38 popular models of cars in the US.
First, we need to load the tidyverse
library with the following:
library(tidyverse)
If you run this line without having first installed the Tidyverse on your computer, you will encounter an error. R packages only need to be installed once, so if you want to load one into your library for the first time, you need to install it with install.packages(*packagename*)
.
For the tidyverse
we need to install it with:
install.packages(tidyverse)
Once you have installed the tidyverse
, you can then load it into your llbrary with the library()
function. You only ever need to install a package once on your machine (unless you have updated R or you want to install the most up-to-date version of a particular package). When you are writing your R scripts, you never want to have the install.packages()
function in the body of the script as if someone else were to run your script, this would update packages on their computer (which they might not want).
mpg
datasetThe mpg
dataset is loaded as part of the Tidyverse In the help file, which you can access by typing help(mpg)
or ?mpg
we see the following:
Description This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
A data frame with 234 rows and 11 variables.
manufacturer - manufacturer model - model name displ - engine displacement, in litres year - year of manufacture cyl - number of cylinders trans -type of transmission drv -f = front-wheel drive, r = rear wheel drive, 4 = 4wd cty - city miles per gallon hwy - highway miles per gallon fl - fuel type class - “type” of car
head()
and str()
We can explore the mpg
dataset that is loaded with the Tidyverse in a number of ways. If we want to look at the first 6 lines of the dataset, we can use the head()
function.
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
We see that it is a tibble - or a rectangular data frame - made up of rows and columns. This is tidy
format where each observation corresponds to a row. Most of the analyses we will run in R involve tidy
data. Within the Tidyverse, the tibble
is the standard way to represent data. You’ll spend a lot of time tidying and wrangling your data to get it into this format! By doing this in R using a script that you write, you are making this key stage reproducible. You can run the script again on an updated or different dataset - thus likely saving you lots of time!
We can also ask for information about the structure of our dataset with str()
. This will tell us about the columns, what type of variable each is, the number of rows etc.
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
select()
to select columnsIf we want to, we could just select one of the columns using the select()
function. Below we are just selecing the column entitled manufacturer
.
mpg %>%
select(manufacturer)
## # A tibble: 234 x 1
## manufacturer
## <chr>
## 1 audi
## 2 audi
## 3 audi
## 4 audi
## 5 audi
## 6 audi
## 7 audi
## 8 audi
## 9 audi
## 10 audi
## # … with 224 more rows
Related to the select()
function is rename()
. It does exactly what you think it might; it renames a column.
We can also look at the different car manufacturers in the dataset by using the distinct()
function. This gives us the unique manufacturer names. This function can be quite handy if you want to check a dataset for duplicates of (e.g.) participant IDs.
mpg %>%
distinct(manufacturer)
## # A tibble: 15 x 1
## manufacturer
## <chr>
## 1 audi
## 2 chevrolet
## 3 dodge
## 4 ford
## 5 honda
## 6 hyundai
## 7 jeep
## 8 land rover
## 9 lincoln
## 10 mercury
## 11 nissan
## 12 pontiac
## 13 subaru
## 14 toyota
## 15 volkswagen
filter()
to select rowsSometimes we might want to select only a subset of rows in our dataset. We can do that using the filter()
function. For example, here we filter our dataset to include only cars made by ‘honda’.
mpg %>%
filter(manufacturer == "honda")
## # A tibble: 9 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual(… f 28 33 r subcomp…
## 2 honda civic 1.6 1999 4 auto(l4) f 24 32 r subcomp…
## 3 honda civic 1.6 1999 4 manual(… f 25 32 r subcomp…
## 4 honda civic 1.6 1999 4 manual(… f 23 29 p subcomp…
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subcomp…
## 6 honda civic 1.8 2008 4 manual(… f 26 34 r subcomp…
## 7 honda civic 1.8 2008 4 auto(l5) f 25 36 r subcomp…
## 8 honda civic 1.8 2008 4 auto(l5) f 24 36 c subcomp…
## 9 honda civic 2 2008 4 manual(… f 21 29 p subcomp…
Note, we use the operator ==
which means ‘is equal to’. This is a logical operator - other logical operators include less than <
, greater than >
, less than or equal to <=
, greater then or equal to >=
, and is not equal to !=
.
We can also filter using a combination of possibilities via logical OR |
or logical AND &
. The first code chunk below filters the dataset for cases where the manufacturer is ‘honda’ OR ‘toyota’.
mpg %>%
filter(manufacturer == "honda" | manufacturer == "toyota")
## # A tibble: 43 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual… f 28 33 r subco…
## 2 honda civic 1.6 1999 4 auto(l… f 24 32 r subco…
## 3 honda civic 1.6 1999 4 manual… f 25 32 r subco…
## 4 honda civic 1.6 1999 4 manual… f 23 29 p subco…
## 5 honda civic 1.6 1999 4 auto(l… f 24 32 r subco…
## 6 honda civic 1.8 2008 4 manual… f 26 34 r subco…
## 7 honda civic 1.8 2008 4 auto(l… f 25 36 r subco…
## 8 honda civic 1.8 2008 4 auto(l… f 24 36 c subco…
## 9 honda civic 2 2008 4 manual… f 21 29 p subco…
## 10 toyota 4runne… 2.7 1999 4 manual… 4 15 20 r suv
## # … with 33 more rows
While below we filter for cases where the manufacturer is ‘honda’ and the year of manufacture is ‘1999’.
mpg %>%
filter(manufacturer == "honda" & year == "1999")
## # A tibble: 5 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual(… f 28 33 r subcomp…
## 2 honda civic 1.6 1999 4 auto(l4) f 24 32 r subcomp…
## 3 honda civic 1.6 1999 4 manual(… f 25 32 r subcomp…
## 4 honda civic 1.6 1999 4 manual(… f 23 29 p subcomp…
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subcomp…
We can combine the use of filter()
with select()
to filter for case where the manufacturer is ‘honda’, the year of manufacture is ‘1999’ and we only want to display these two columns plus those telling us about fuel economy - cty
and hwy
.
mpg %>%
filter(manufacturer == "honda" & year == "1999") %>%
select(manufacturer, year, cty, hwy)
## # A tibble: 5 x 4
## manufacturer year cty hwy
## <chr> <int> <int> <int>
## 1 honda 1999 28 33
## 2 honda 1999 24 32
## 3 honda 1999 25 32
## 4 honda 1999 23 29
## 5 honda 1999 24 32
By combining just a few functions, you can imagine that we can build some quite complex data wrangling rules quite straightforwardly.
%>%
Note that in these examples above we used the %>%
operator - this is called the pipe and allows us to pass information from one side of the pipe to the other. You can read it out load as ‘and then’. All of the functions (such as select()
, filter()
etc.) in the Tidyverse are known as verbs, and they describe what they do. The pipe is one of the most commonly used operators in the Tidyverse and allows us to chain together different lines of code - with the output of each line being passed on as input into the next. In this example, the dataset mpg
is passed along to the distinct()
function where we ask for a list of the distinct (i.e., unique) manufacturers. This output itself is a vector. Vectors are a basic data structure and contain elements of the same type - for example, a bunch of numbers. We can add another line to our piped chain to tell us how many elements are in this vector. We could read this out loud as ‘take the dataset mpg, and then work out the distinct manufacturer names, and then count them’.
mpg %>%
distinct(manufacturer) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 15
At the moment, the car manufacturer names are all in lower case. It would look a lot nicer if they were in title case (i.e., with capitalisation on the first letter of each word). We can use the mutate()
function to create a new column - this time, the name of the new column is also the name of the old column that we’re wanting to modify using the function str_to_title()
. What this will do is overwrite the column manufacturer
and replace it with the new version with the car manufacturer names in title case.
mpg %>%
mutate(manufacturer = str_to_title(manufacturer))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 Audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 Audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 Audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 Audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 Audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 Audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 Audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 Audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 Audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 Audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
The column model
is also lowercase. Let’s make that title case too. We can use the mutate()
function to work over more than one column at the same time like this:
mpg %>%
mutate(manufacturer = str_to_title(manufacturer), model = str_to_title(model))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 Audi A4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 Audi A4 1.8 1999 4 manual… f 21 29 p comp…
## 3 Audi A4 2 2008 4 manual… f 20 31 p comp…
## 4 Audi A4 2 2008 4 auto(a… f 21 30 p comp…
## 5 Audi A4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 Audi A4 2.8 1999 6 manual… f 18 26 p comp…
## 7 Audi A4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 Audi A4 Quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 Audi A4 Quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 Audi A4 Quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
There are quite a few columns there, so how about we select just the manufacturer, model, year, transmission, and hwy columns:
mpg %>%
mutate(manufacturer = str_to_title(manufacturer), model = str_to_title(model)) %>%
select(manufacturer, model, year, trans, hwy)
## # A tibble: 234 x 5
## manufacturer model year trans hwy
## <chr> <chr> <int> <chr> <int>
## 1 Audi A4 1999 auto(l5) 29
## 2 Audi A4 1999 manual(m5) 29
## 3 Audi A4 2008 manual(m6) 31
## 4 Audi A4 2008 auto(av) 30
## 5 Audi A4 1999 auto(l5) 26
## 6 Audi A4 1999 manual(m5) 26
## 7 Audi A4 2008 auto(av) 27
## 8 Audi A4 Quattro 1999 manual(m5) 26
## 9 Audi A4 Quattro 1999 auto(l5) 25
## 10 Audi A4 Quattro 2008 manual(m6) 28
## # … with 224 more rows
In the real world, data frames do not always arrive on our computer in tidy format. Very often you need to engage in some data tidying before you can do anything useful with them. We’re going to look at an example of how we go from messy data to tidy data.
my_messy_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_data.csv")
We ran a reaction time experiment with 24 participants and 4 conditions - they are numbered 1-4 in our data file.
head(my_messy_data)
## # A tibble: 6 x 3
## participant condition rt
## <dbl> <dbl> <dbl>
## 1 1 1 879
## 2 1 2 1027
## 3 1 3 1108
## 4 1 4 765
## 5 2 1 1042
## 6 2 2 1050
This is a repeated measures design where we had one factor (Prime Type) with two levels (A vs. B) and a second factor (Target Type) with two levels (A vs. B). We want to recode our data frame so it better matches our experimental design. First we need to recode our 4 conditions like this:
Recode condition columns follows: Condition 1 = Prime A, Target A Condition 2 = Prime A, Target B Condition 3 = Prime B, Target A Condition 4 = Prime B, Target B
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
head()
## # A tibble: 6 x 3
## participant condition rt
## <dbl> <chr> <dbl>
## 1 1 PrimeA_TargetA 879
## 2 1 PrimeA_TargetB 1027
## 3 1 PrimeB_TargetA 1108
## 4 1 PrimeB_TargetB 765
## 5 2 PrimeA_TargetA 1042
## 6 2 PrimeA_TargetB 1050
We now need to separate out our Condition column into two - one for our first factor (Prime), and one for our second factor (Target). The separate()
function does just this - when used in conjunction with a piped tibble, it needs to know which column we want to separate, what new columns to create by separating that original column, and on what basis we want to do the separation. In the example below we tell separate()
that we want to separate the column labeled condition
into two new columns called Prime
and Target
and we want to do this at any points where a _
is present in the column to be separated.
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
separate(col = "condition", into = c("Prime", "Target"), sep = "_")
## # A tibble: 96 x 4
## participant Prime Target rt
## <dbl> <chr> <chr> <dbl>
## 1 1 PrimeA TargetA 879
## 2 1 PrimeA TargetB 1027
## 3 1 PrimeB TargetA 1108
## 4 1 PrimeB TargetB 765
## 5 2 PrimeA TargetA 1042
## 6 2 PrimeA TargetB 1050
## 7 2 PrimeB TargetA 942
## 8 2 PrimeB TargetB 945
## 9 3 PrimeA TargetA 943
## 10 3 PrimeA TargetB 910
## # … with 86 more rows
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
separate(col = "condition", into = c("Prime", "Target"), sep = "_") %>%
mutate(Prime = factor(Prime), Target = factor(Target))
## # A tibble: 96 x 4
## participant Prime Target rt
## <dbl> <fct> <fct> <dbl>
## 1 1 PrimeA TargetA 879
## 2 1 PrimeA TargetB 1027
## 3 1 PrimeB TargetA 1108
## 4 1 PrimeB TargetB 765
## 5 2 PrimeA TargetA 1042
## 6 2 PrimeA TargetB 1050
## 7 2 PrimeB TargetA 942
## 8 2 PrimeB TargetB 945
## 9 3 PrimeA TargetA 943
## 10 3 PrimeA TargetB 910
## # … with 86 more rows
Most of the analysis we will conduct in R requires our data to be in tidy, or long, format. In such data sets, one row corresponds to one observation. In the real world, data are rarely in the right format for analysis. In R, the pivot_wider()
and pivot_longer()
functions are designed to reshape our data files. First, let’s load a datafile that is in wide format (i.e., multiple observations per row). It is from an experiment where we had four conditions (labelled Condition1, Condition2, Condition3, and Condition4). In addition to there being a column for each of the 4 conditions, we also have a column corresponding to participant ID. Each cell in the data set corresponds to a reaction time (measured in milliseconds).
my_wide_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_wide_data.csv")
pivot_longer()
functionhead(my_wide_data)
## # A tibble: 6 x 5
## ID Condition1 Condition2 Condition3 Condition4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 487 492 499 488
## 2 2 502 494 517 508
## 3 3 510 483 488 509
## 4 4 476 488 513 521
## 5 5 504 478 513 504
## 6 6 505 486 503 495
So, we can see the data file is in wide format. We want to reshape it to long format. We can do that using the pivot_longer()
function.
Minially, we need to specify the data frame that we want to reshape, the columns that we want to ‘pivot’ into longer format, the name of the new column that we are creating, and the name of the column that will hold the values of our reshaped data frame. We are going to map the output to a variable I’m calling my_longer_data
.
my_longer_data <- my_wide_data %>%
pivot_longer(cols = c(Condition1, Condition2, Condition3, Condition4),
names_to = "Condition",
values_to = "RT")
Now let’s have a look at what our reshaped data frame looks like.
head(my_longer_data)
## # A tibble: 6 x 3
## ID Condition RT
## <dbl> <chr> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
So you can see our data are now in long - or tidy - format with one observation per row. Note that our Condition
column isn’t coded as a factor. It’s important that our data set reflects the structure of our experiment so let’s convert that column to a factor - note that in the following code we are now ‘saving’ the change as we are not mapping the output onto a variable name.
my_longer_data %>%
mutate(Condition = factor(Condition)) %>%
head()
## # A tibble: 6 x 3
## ID Condition RT
## <dbl> <fct> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
pivot_wider()
functionWe can use the pivot_wider()
function to reshape a long data frame so that it goes from long to wide format. It works similarly to pivot_longer()
. Let’s take our new, long, data frame and turn it back into wide format. With pivot_wider()
we minimally need to specify the data frame that we want to reshape, and a pair or arguments (names_from and values_from) that describe from which column to get the name of the output column, and from which column to get the cell values.
my_wider_data <- my_longer_data %>%
pivot_wider(names_from = "Condition",
values_from = "RT")
We can check that our data set is back in wide format.
head(my_wider_data)
## # A tibble: 6 x 5
## ID Condition1 Condition2 Condition3 Condition4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 487 492 499 488
## 2 2 502 494 517 508
## 3 3 510 483 488 509
## 4 4 476 488 513 521
## 5 5 504 478 513 504
## 6 6 505 486 503 495
Sometimes you might need to combine two datasets. For example, you might have one dataset that contains reading time data (like the one above) and another than contains individual difference measures for the participants in the first dataset. How would we go about combining these two datasets so that we end up with one that includes both the reading time data and the individual difference measures (that perhaps we want to covary out later)? Luckily, the {dplyr}
package contains a number of join functions that allows you to join together different tibbles. First, let’s load the data that contains the individual different measures.
individual_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/individual_diffs.csv")
Let’s look at the first few rows of the individual differences data. This dataset contains the ID numbers of our participants plus measures of IQ (the iq column) and Working Memory (the wm column).
head(individual_diffs)
## # A tibble: 6 x 3
## ID iq wm
## <dbl> <dbl> <dbl>
## 1 1 100 9
## 2 2 108 8
## 3 3 116 9
## 4 4 95 9
## 5 5 83 11
## 6 6 73 10
We want to combine this dataset with our reading time dataset from above my_longer_data
which looks like this:
head(my_longer_data)
## # A tibble: 6 x 3
## ID Condition RT
## <dbl> <chr> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
We can combine using one of the join functions. There are a variety of options including full_join()
which includes all of the rows in tibble one or tibble two that we want to join. Other options include inner_join()
which includes all of the rows in tibble one and tibble 2, as well as left_join()
and right_join()
.
combined_data <- full_join(my_longer_data, individual_diffs, by = "ID")
We now see that our dataset are combined as we’d expect.
combined_data
## # A tibble: 128 x 5
## ID Condition RT iq wm
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Condition1 487 100 9
## 2 1 Condition2 492 100 9
## 3 1 Condition3 499 100 9
## 4 1 Condition4 488 100 9
## 5 2 Condition1 502 108 8
## 6 2 Condition2 494 108 8
## 7 2 Condition3 517 108 8
## 8 2 Condition4 508 108 8
## 9 3 Condition1 510 116 9
## 10 3 Condition2 483 116 9
## # … with 118 more rows
Of course, you may be thinking that we could just do a quick bit of Excel cut and paste of the columns we want from one dataset to the other. But what about the case where our individual differences file contains 10,000 participant IDs (in random order) and we’re only interested in combining the two datasets where there is a match?
large_ind_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/large_ind_diffs.csv")
head(large_ind_diffs)
## # A tibble: 6 x 3
## ID iq wm
## <dbl> <dbl> <dbl>
## 1 6057 93 7
## 2 2723 86 7
## 3 1088 97 9
## 4 8687 87 8
## 5 4223 77 11
## 6 369 95 9
We can actually use another join function (left_join()
) to combine these two datasets, but only where there is a match of ID with the first of the two datasets (my_longer_data
) in the function call.
left_join(my_longer_data, large_ind_diffs, by = "ID")
## # A tibble: 128 x 5
## ID Condition RT iq wm
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Condition1 487 100 9
## 2 1 Condition2 492 100 9
## 3 1 Condition3 499 100 9
## 4 1 Condition4 488 100 9
## 5 2 Condition1 502 108 8
## 6 2 Condition2 494 108 8
## 7 2 Condition3 517 108 8
## 8 2 Condition4 508 108 8
## 9 3 Condition1 510 116 9
## 10 3 Condition2 483 116 9
## # … with 118 more rows
Have a go at recreating what I’ve done above by writing your first script using RStudio Desktop.
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.