In this session we are going to examine data simulation. Specifically, we’ll look at why simulating data sets before you run a study can be incredibly helpful for determining whether your design has enough power to detect the effect size of interest. Data simulation can also help ensure that you will be able to build the statistical model you want when it comes to analysing data from your experiment. This is particularly important in the context of mixed models, where you need an appropriate number of observations at the different levels of your random effects in order to estimate the model parameters. Data simulation is also useful for understanding how the distribution of p-values behaves as a function of sample size, effect size, and so on.
You can download the slides in .odp format by clicking here and in .pdf format by clicking on the image below.
Once you’ve watched the video above, run the code below on your own machines. The following code corresponds to the bulk of the code I use in the slides and talk about in the video.
We first need to load the key packages that we’ll be using: the tidyverse and the broom package. The broom package contains functions which allow us to turn the output of statistical tests into tibbles.
library(tidyverse)
library(broom)
Computers tend to generate numbers pseudorandomly rather than in a truly random fashion. They do this using an algorithm. This algorithm has a starting point, and if you ensure it always starts at the same point, it will always produce the same sequence of random numbers. In R, we set the starting point using the set.seed() function. First, let’s generate two sequences of random numbers without setting the starting point of the algorithm.
Using the rnorm() function we generate 5 numbers from a normal distribution with a mean of 0 and a standard deviation of 1. We do this twice.
rnorm(5, 0, 1)
## [1] -1.42020284 1.68105040 -0.02273563 1.27937398 2.13594442
rnorm(5, 0, 1)
## [1] 0.01388909 0.12533095 -0.02938404 0.77014470 1.17642354
As you can see, the sequences differ. Our code and the output associated with generating the random numbers isn’t reproducible. Run it again and you’ll get two new sequences of 5 random numbers. We can, however, make things reproducible by using the set.seed() function before each rnorm() call.
set.seed(1234)
rnorm(5, 0, 1)
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
set.seed(1234)
rnorm(5, 0, 1)
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
The two sequences are identical, and will always be identical if preceded by the same seed (or starting point) for the random number generator algorithm.
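We can also check this programmatically. The snippet below is a small sketch of my own (it isn’t part of the original slides), and the names first_draw and second_draw are just illustrative: we draw the same sequence twice with the same seed and compare the two draws with identical().

# sketch: same seed, so the two draws should match exactly
set.seed(1234)
first_draw <- rnorm(5, 0, 1)
set.seed(1234)
second_draw <- rnorm(5, 0, 1)
identical(first_draw, second_draw)  # returns TRUE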
Let’s start by simulating data from a one-factor between-participants experiment. Each of the 24 participants will have one dependent variable measure. First, let’s create a sequence of numbers 1:24 corresponding to our participant numbers. In R, this sequence is known as a vector - one of the basic data structures. It contains elements of the same type (in this case, integers).
participant <- seq(1:24)
participant
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Now we need to create the conditions - Condition 1 we will label “fast” and Condition 2 we will label “slow”. We use the c() function to combine the arguments that follow it (i.e., “fast” and “slow”) into a vector. The first 12 participants are in the “fast” condition, and the second 12 are in the “slow” condition. We use the rep() function to replicate an element (defined in the first argument) a certain number of times (defined in the second argument).
condition <- c(rep("fast", times = 12), rep("slow", times = 12))
condition
## [1] "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast"
## [11] "fast" "fast" "slow" "slow" "slow" "slow" "slow" "slow" "slow" "slow"
## [21] "slow" "slow" "slow" "slow"
We now have “fast” and “slow” repeated 12 times each.
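As an aside (this alternative is a sketch of my own rather than part of the slides, and condition_alt is just an illustrative name), the same vector can be built in a single call using the each argument of rep():

# sketch: an equivalent way to build the condition vector
condition_alt <- rep(c("fast", "slow"), each = 12)
all(condition_alt == condition)  # TRUE - both approaches give the same vector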
Now we need to simulate our data. We will assume we’re sampling from the normal distribution, so we’ll use the rnorm() function. This samples from a normal distribution with the mean and sd that we specify. We want to simulate the data for our “fast” condition as coming from a distribution with mean = 1000 and sd = 50, and the data for our “slow” condition as coming from a distribution with mean = 1020 and sd = 50. We need to make sure we set up our sampling with rnorm() in the same order as we specified the condition variable.
To make sure we can reproduce these random samples in future, we use the set.seed() function to specify the starting point of the random number generation.
set.seed(1234)
dv <- c(rnorm(12, 1000, 50), rnorm(12, 1020, 50))
dv
## [1] 939.6467 1013.8715 1054.2221 882.7151 1021.4562 1025.3028 971.2630
## [8] 972.6684 971.7774 955.4981 976.1404 950.0807 981.1873 1023.2229
## [15] 1067.9747 1014.4857 994.4495 974.4402 978.1414 1140.7918 1026.7044
## [22] 995.4657 997.9726 1042.9795
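As a quick sanity check (my own addition, not part of the slides), the mean of the first 12 values should land somewhere near 1000 and the mean of the last 12 somewhere near 1020, give or take sampling error:

# sketch: sample means for the "fast" and "slow" halves of the dv vector
mean(dv[1:12])
mean(dv[13:24])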
We now need to combine our 3 columns into a tibble. We use the cbind() function to first bind the three variables together as columns, and then as_tibble() to convert these combined columns into a tibble.
my_data <- as_tibble(cbind(participant, condition, dv))
my_data
## # A tibble: 24 x 3
## participant condition dv
## <chr> <chr> <chr>
## 1 1 fast 939.646712530729
## 2 2 fast 1013.87146210553
## 3 3 fast 1054.22205883415
## 4 4 fast 882.715114868533
## 5 5 fast 1021.45623444055
## 6 6 fast 1025.30279460788
## 7 7 fast 971.263001993268
## 8 8 fast 972.668407210791
## 9 9 fast 971.777400045336
## 10 10 fast 955.498108547795
## # … with 14 more rows
As you can see, the columns in the tibble aren’t yet of the correct type. The column condition should be a factor, and dv an integer. Let’s fix that and map the tibble with the corrected column types onto a new variable called my_tidied_data.
my_tidied_data <- my_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
my_tidied_data
## # A tibble: 24 x 3
## participant condition dv
## <chr> <fct> <int>
## 1 1 fast 939
## 2 2 fast 1013
## 3 3 fast 1054
## 4 4 fast 882
## 5 5 fast 1021
## 6 6 fast 1025
## 7 7 fast 971
## 8 8 fast 972
## 9 9 fast 971
## 10 10 fast 955
## # … with 14 more rows
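Incidentally (this alternative is a sketch of my own rather than how it’s done in the slides, and my_tidied_data_alt is just an illustrative name), you can avoid the character coercion that cbind() introduces by building the tibble directly with tibble(), setting the column types as you go:

# sketch: build the tibble in one step with the column types we want
my_tidied_data_alt <- tibble(participant = participant,
                             condition = factor(condition),
                             dv = as.integer(dv))
my_tidied_data_alt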
Let’s build a plot to ensure our data looks like it should.
ggplot(my_tidied_data, aes(x = condition, y = dv, fill = condition)) +
geom_violin(width = .25) +
stat_summary(fun.data = "mean_cl_boot", colour = "black") +
geom_jitter(alpha = .2, width = .05) +
guides(fill = FALSE) +
labs(x = "Condition", y = "DV (ms.)") +
theme_minimal()
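Alongside the plot, we can also summarise each condition numerically (another small sketch of my own, not from the slides) to check the means and standard deviations are roughly where we put them:

# sketch: descriptive statistics for each condition
my_tidied_data %>%
  group_by(condition) %>%
  summarise(mean_dv = mean(dv), sd_dv = sd(dv), n = n())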
Given a bit of sampling error, this looks about right. Now, let’s run a between-participants t-test to see if the difference between the conditions is significant. I am going to use the tidy() function from the broom package to store the output of t.test() as a tibble.
t.test(filter(my_tidied_data, condition == "fast")$dv,
filter(my_tidied_data, condition == "slow")$dv, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: filter(my_tidied_data, condition == "fast")$dv and filter(my_tidied_data, condition == "slow")$dv
## t = -2.202, df = 21.987, p-value = 0.03845
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -81.233786 -2.432881
## sample estimates:
## mean of x mean of y
## 977.4167 1019.2500
result <- tidy(t.test(filter(my_tidied_data, condition == "fast")$dv,
filter(my_tidied_data, condition == "slow")$dv, paired = FALSE))
result
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -41.8 977. 1019. -2.20 0.0385 22.0 -81.2 -2.43
## # … with 2 more variables: method <chr>, alternative <chr>
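Because the result is now a tibble, individual values can be pulled out with standard tidyverse verbs (a brief illustrative sketch of my own):

# sketch: extract just the p-value from the tidied t-test output
result %>%
  pull(p.value)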
As you’ll see from the output, the p-value for the t-test is .0385. So we have found a difference. But is N = 24 a big enough sample size to consistently detect a significant difference? Or did we just get lucky in this one case?
Let’s turn now to simulating a number of datasets, each with a sample size of 24. The next code chunk runs through the procedure for creating one dataset, but does so 10 times (thus giving us 10 simulated datasets).
total_samples <- 10
sample_size <- 24
participant <- rep(1:sample_size)
condition <- c(rep("fast", times = sample_size/2),
rep("slow", times = sample_size/2))
all_data <- NULL
for (i in 1:total_samples) {
sample <- i
set.seed(1233 + i)
dv <- c(rnorm(sample_size/2, 1000, 50), rnorm(sample_size/2, 1020, 50))
my_data <- as_tibble(cbind(participant, condition, dv, sample))
all_data <- rbind(my_data, all_data)
}
all_tidied_data <- all_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
Let’s plot the averages for each of the two conditions from each of these 10 simulated datasets on the same graph.
all_tidied_data %>%
group_by(condition, sample) %>%
summarise(average = mean(dv)) %>%
ggplot(aes(x = condition, y = average, group = condition,
label = sample)) +
geom_jitter(width = .1, alpha = .5) +
stat_summary(fun.data = "mean_cl_boot", colour = "blue") +
geom_text(check_overlap = TRUE, nudge_x = .2, nudge_y = 0, colour =
"black") +
labs(x = "Condition", y = "DV(ms.)") +
theme_minimal()
The blue dots correspond to the mean of means - and they are pretty close to the population means. There’s quite a lot of variation around each of these dots though, again indicative of sampling error. The mean for our “fast” condition is quite close to the population mean (1000), while the mean for our “slow” condition is almost exactly the population mean (1020). Sample 2 has an extreme mean for the “slow” condition - indeed, numerically the “slow” condition is faster than the “fast” condition in Sample 2. This is sampling error in practice and further highlights the problem with small sample sizes.
Now, instead of simulating 10 datasets, let’s simulate 100. The code below for simulating 100 datasets is identical to the code above for simulating 10, apart from one place. Can you spot it? Do you see how that one change now gives us 100 simulations?
total_samples <- 100
sample_size <- 24
participant <- rep(1:sample_size)
condition <- c(rep("fast", times = sample_size/2),
rep("slow", times = sample_size/2))
all_data <- NULL
for (i in 1:total_samples) {
sample <- i
set.seed(1233 + i)
dv <- c(rnorm(sample_size/2, 1000, 50), rnorm(sample_size/2, 1020, 50))
my_data <- as_tibble(cbind(participant, condition, dv, sample))
all_data <- rbind(my_data, all_data)
}
all_tidied_data <- all_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
all_tidied_data %>%
group_by(condition, sample) %>%
summarise(average = mean(dv)) %>%
ggplot(aes(x = condition, y = average, group = condition,
label = sample)) +
geom_jitter(width = .1, alpha = .5) +
stat_summary(fun.data = "mean_cl_boot", colour = "blue") +
geom_text(check_overlap = TRUE, nudge_x = .2, nudge_y = 0, colour =
"black", size = 3) +
labs(x = "Condition", y = "DV(ms.)") +
theme_minimal()
Overall the “slow” condition RTs are higher than the “fast” condition RTs - but we can spot some simulations where the difference is negligible or even goes the other way (e.g., Simulation 100). The blue circles correspond to the overall means and are pretty much exactly equal to the population means of 1000 and 1020.
Now, let’s run t-tests for each of the 100 datasets.
result <- NULL
for (i in 1:total_samples) {
result <- rbind(tidy(t.test(filter(all_tidied_data,
condition == "fast" & sample == i)$dv,
filter(all_tidied_data,
condition == "slow" & sample == i)$dv,
paired = FALSE)), result)
}
result
## # A tibble: 100 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.5 1013. 1007. 0.364 0.720 20.2 -30.7 43.7
## 2 -15 1000. 1015. -0.695 0.494 22.0 -59.7 29.7
## 3 7.92 1019. 1011. 0.445 0.661 21.0 -29.1 44.9
## 4 -16.4 984. 999. -0.697 0.493 22.0 -65.3 32.5
## 5 -10.8 1002. 1012. -0.517 0.612 16.9 -54.7 33.2
## 6 7.25 1000. 993 0.359 0.723 20.4 -34.8 49.3
## 7 -35 994. 1030. -1.66 0.113 20.6 -79.0 8.99
## 8 -27.5 983 1010. -1.40 0.175 21.8 -68.2 13.2
## 9 4.83 1026. 1021. 0.246 0.808 18.3 -36.3 46.0
## 10 -35.8 996. 1032 -1.58 0.130 18.9 -83.2 11.5
## # … with 90 more rows, and 2 more variables: method <chr>, alternative <chr>
How many of these 100 simulated datasets have produced a difference that is statistically significant? Remember, we know the “Fast” and “Slow” populations do differ from each other. But can we detect this difference consistently?
result %>%
filter(p.value < .05) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 17
So, only 17 of the 100 simulations detect the difference - even though the difference is there and waiting to be found. With 24 participants, we have a very under-powered experiment. Indeed, the estimated power for this design to detect the effect we are looking for is just 0.17. We should be aiming to run studies with about 0.80 power; with power much lower than that, we are largely wasting our time.
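As a cross-check on that figure (this is my own addition and uses the analytic approach rather than simulation), base R’s power.t.test() function gives the power of a two-sample t-test for a given per-group n, difference in means (delta), and standard deviation, and can also solve for the per-group n needed to reach a target power. Its answer should land close to the 0.17 we estimated by simulation:

# analytic power for 12 participants per group, a 20 ms difference, and sd = 50
power.t.test(n = 12, delta = 20, sd = 50, sig.level = .05)

# per-group n needed to reach 0.80 power for the same effect
power.t.test(delta = 20, sd = 50, sig.level = .05, power = .8)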
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.