In this session we are going to examine data simulation. Specifically, we’ll look at why simulating data sets before you run a study can be incredibly helpful for determining whether your design has enough power to detect the effect size of interest. Data simulation can also help ensure that you will be able to build the statistical model you want when it comes to analysing data from your experiment. This is particularly important in the context of mixed models, where you need an appropriate number of observations at the different levels of your random effects in order to estimate the model parameters. Data simulation is also useful for understanding how the distribution of p-values behaves as a function of sample size, effect size, and so on.
You can download the slides in .odp format by clicking here and in .pdf format by clicking on the image below.
Once you’ve watched the video above, run the code below on your own machines. The following code corresponds to the bulk of the code I use in the slides and talk about in the video.
We first need to load the key packages that we’ll be using: the tidyverse and the broom package. The broom package contains functions which allow us to turn the output of statistical tests into tibbles.
library(tidyverse)
library(broom)
Computers tend to generate numbers pseudorandomly rather than in a truly random fashion. They do this using an algorithm. This algorithm has a starting point, and if you ensure it always starts at the same point, it will always produce the same sequence of random numbers. In R, we set the starting point using the set.seed() function. First, let’s generate two sequences of random numbers without setting the starting point of the algorithm.
Using the rnorm() function we generate 5 numbers from a normal distribution with a mean of 0 and a standard deviation of 1. We do this twice.
rnorm(5, 0, 1)
## [1] -1.42020284 1.68105040 -0.02273563 1.27937398 2.13594442
rnorm(5, 0, 1)
## [1] 0.01388909 0.12533095 -0.02938404 0.77014470 1.17642354
As you can see, the sequences differ. Our code and the output associated with generating the random numbers isn’t reproducible. Run it again and you’ll get two new sequences of 5 random numbers. We can, however, make things reproducible by using the set.seed() function before each rnorm() call.
set.seed(1234)
rnorm(5, 0, 1)
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
set.seed(1234)
rnorm(5, 0, 1)
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
The two sequences are identical, and will always be identical if preceded by the same seed (or starting point) for the random number generator algorithm.
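We can also check this programmatically. The snippet below is a small sketch of my own (it isn’t part of the original slides), and the names first_draw and second_draw are just illustrative: we draw the same sequence twice with the same seed and compare the two draws with identical().

# sketch: same seed, so the two draws should match exactly
set.seed(1234)
first_draw <- rnorm(5, 0, 1)
set.seed(1234)
second_draw <- rnorm(5, 0, 1)
identical(first_draw, second_draw)  # returns TRUE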
Let’s start by simulating data from a one-factor between-participants experiment. Each of the 24 participants will have one dependent variable measure. First, let’s create a sequence of numbers 1:24 corresponding to our participant numbers. In R, this sequence is known as a vector - one of the basic data structures. It contains elements of the same type (in this case, integers).
participant <- seq(1:24)
participant
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Now we need to create the conditions - Condition 1 we will label “fast” and Condition 2 we will label “slow”. We use the c() function to combine the arguments that follow it (i.e., “fast” and “slow”) into a vector. The first 12 participants are in the “fast” condition, and the second 12 are in the “slow” condition. We use the rep() function to replicate an element (defined in the first argument) a certain number of times (defined in the second argument).
condition <- c(rep("fast", times = 12), rep("slow", times = 12))
condition
## [1] "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast" "fast"
## [11] "fast" "fast" "slow" "slow" "slow" "slow" "slow" "slow" "slow" "slow"
## [21] "slow" "slow" "slow" "slow"
We now have “fast” and “slow” repeated 12 times each.
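As an aside (this alternative is a sketch of my own rather than part of the slides, and condition_alt is just an illustrative name), the same vector can be built in a single call using the each argument of rep():

# sketch: an equivalent way to build the condition vector
condition_alt <- rep(c("fast", "slow"), each = 12)
all(condition_alt == condition)  # TRUE - both approaches give the same vector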
Now we need to simulate our data. We will assume we’re sampling from the normal distribution, so we’ll use the rnorm() function. This samples from a normal distribution with the mean and sd that we specify. We want to simulate the data for our “fast” condition as coming from a distribution with mean = 1000 and sd = 50, and the data for our “slow” condition as coming from a distribution with mean = 1020 and sd = 50. We need to make sure we set up our sampling with rnorm() in the same order as we specified the condition variable.
To make sure we can reproduce these random samples in future, we use the set.seed() function to specify the starting point of the random number generation.
set.seed(1234)
dv <- c(rnorm(12, 1000, 50), rnorm(12, 1020, 50))
dv
## [1] 939.6467 1013.8715 1054.2221 882.7151 1021.4562 1025.3028 971.2630
## [8] 972.6684 971.7774 955.4981 976.1404 950.0807 981.1873 1023.2229
## [15] 1067.9747 1014.4857 994.4495 974.4402 978.1414 1140.7918 1026.7044
## [22] 995.4657 997.9726 1042.9795
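As a quick sanity check (my own addition, not part of the slides), the mean of the first 12 values should land somewhere near 1000 and the mean of the last 12 somewhere near 1020, give or take sampling error:

# sketch: sample means for the "fast" and "slow" halves of the dv vector
mean(dv[1:12])
mean(dv[13:24])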
We now need to combine our 3 columns into a tibble. We use the cbind() function to first bind the three variables together as columns, and then as_tibble() to convert these combined columns into a tibble.
my_data <- as_tibble(cbind(participant, condition, dv))
my_data
## # A tibble: 24 x 3
## participant condition dv
## <chr> <chr> <chr>
## 1 1 fast 939.646712530729
## 2 2 fast 1013.87146210553
## 3 3 fast 1054.22205883415
## 4 4 fast 882.715114868533
## 5 5 fast 1021.45623444055
## 6 6 fast 1025.30279460788
## 7 7 fast 971.263001993268
## 8 8 fast 972.668407210791
## 9 9 fast 971.777400045336
## 10 10 fast 955.498108547795
## # … with 14 more rows
As you can see, the columns in the tibble aren’t yet of the correct type. The column condition should be a factor, and dv an integer. Let’s fix that and map the tibble with the corrected column types onto a new variable called my_tidied_data.
my_tidied_data <- my_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
my_tidied_data
## # A tibble: 24 x 3
## participant condition dv
## <chr> <fct> <int>
## 1 1 fast 939
## 2 2 fast 1013
## 3 3 fast 1054
## 4 4 fast 882
## 5 5 fast 1021
## 6 6 fast 1025
## 7 7 fast 971
## 8 8 fast 972
## 9 9 fast 971
## 10 10 fast 955
## # … with 14 more rows
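Incidentally (this alternative is a sketch of my own rather than how it’s done in the slides, and my_tidied_data_alt is just an illustrative name), you can avoid the character coercion that cbind() introduces by building the tibble directly with tibble(), setting the column types as you go:

# sketch: build the tibble in one step with the column types we want
my_tidied_data_alt <- tibble(participant = participant,
                             condition = factor(condition),
                             dv = as.integer(dv))
my_tidied_data_alt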
Let’s build a plot to ensure our data looks like it should.
ggplot(my_tidied_data, aes(x = condition, y = dv, fill = condition)) +
geom_violin(width = .25) +
stat_summary(fun.data = "mean_cl_boot", colour = "black") +
geom_jitter(alpha = .2, width = .05) +
guides(fill = FALSE) +
labs(x = "Condition", y = "DV (ms.)") +
theme_minimal()
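Alongside the plot, we can also summarise each condition numerically (another small sketch of my own, not from the slides) to check the means and standard deviations are roughly where we put them:

# sketch: descriptive statistics for each condition
my_tidied_data %>%
  group_by(condition) %>%
  summarise(mean_dv = mean(dv), sd_dv = sd(dv), n = n())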
Given a bit of sampling error, this looks about right. Now, let’s run a between-participants t-test to see if the difference between the conditions is significant. I am going to use the tidy() function from the broom package to store the output of t.test() as a tibble.
t.test(filter(my_tidied_data, condition == "fast")$dv,
filter(my_tidied_data, condition == "slow")$dv, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: filter(my_tidied_data, condition == "fast")$dv and filter(my_tidied_data, condition == "slow")$dv
## t = -2.202, df = 21.987, p-value = 0.03845
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -81.233786 -2.432881
## sample estimates:
## mean of x mean of y
## 977.4167 1019.2500
result <- tidy(t.test(filter(my_tidied_data, condition == "fast")$dv,
filter(my_tidied_data, condition == "slow")$dv, paired = FALSE))
result
## # A tibble: 1 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -41.8 977. 1019. -2.20 0.0385 22.0 -81.2 -2.43
## # … with 2 more variables: method <chr>, alternative <chr>
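Because the result is now a tibble, individual values can be pulled out with standard tidyverse verbs (a brief illustrative sketch of my own):

# sketch: extract just the p-value from the tidied t-test output
result %>%
  pull(p.value)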
As you’ll see from the output, the p-value for the t-test is .0385. So we have found a difference. But is N = 24 a big enough sample size to consistently detect a significant difference? Or did we just get lucky in this one case?
Let’s turn now to simulating a number of datasets, each with a sample size of 24. The next code chunk runs through the procedure for creating one dataset, but does so 10 times (thus giving us 10 simulated datasets).
total_samples <- 10
sample_size <- 24
participant <- rep(1:sample_size)
condition <- c(rep("fast", times = sample_size/2),
rep("slow", times = sample_size/2))
all_data <- NULL
for (i in 1:total_samples) {
sample <- i
set.seed(1233 + i)
dv <- c(rnorm(sample_size/2, 1000, 50), rnorm(sample_size/2, 1020, 50))
my_data <- as_tibble(cbind(participant, condition, dv, sample))
all_data <- rbind(my_data, all_data)
}
all_tidied_data <- all_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
Let’s plot the averages for each of the two conditions from each of these 10 simulated datasets on the same graph.
all_tidied_data %>%
group_by(condition, sample) %>%
summarise(average = mean(dv)) %>%
ggplot(aes(x = condition, y = average, group = condition,
label = sample)) +
geom_jitter(width = .1, alpha = .5) +
stat_summary(fun.data = "mean_cl_boot", colour = "blue") +
geom_text(check_overlap = TRUE, nudge_x = .2, nudge_y = 0, colour =
"black") +
labs(x = "Condition", y = "DV(ms.)") +
theme_minimal()
The blue dots correspond to the mean of means - and they are pretty close to the population means. There’s quite a lot of variation around each of these dots though, again indicative of sampling error. The mean for our “fast” condition is quite close to the population mean (1000), while the mean for our “slow” condition is almost exactly the population mean (1020). Sample 2 has an extreme mean for the “slow” condition - indeed, numerically the “slow” condition is faster than the “fast” condition in Sample 2. This is sampling error in practice and further highlights the problem with small sample sizes.
Now, instead of simulating 10 datasets, let’s simulate 100. The code below for simulating 100 datasets is identical to the code above for simulating 10, apart from one place. Can you spot it? Do you see how that one change now gives us 100 simulations?
total_samples <- 100
sample_size <- 24
participant <- rep(1:sample_size)
condition <- c(rep("fast", times = sample_size/2),
rep("slow", times = sample_size/2))
all_data <- NULL
for (i in 1:total_samples) {
sample <- i
set.seed(1233 + i)
dv <- c(rnorm(sample_size/2, 1000, 50), rnorm(sample_size/2, 1020, 50))
my_data <- as_tibble(cbind(participant, condition, dv, sample))
all_data <- rbind(my_data, all_data)
}
all_tidied_data <- all_data %>%
mutate(condition = factor(condition), dv = as.integer(dv))
all_tidied_data %>%
group_by(condition, sample) %>%
summarise(average = mean(dv)) %>%
ggplot(aes(x = condition, y = average, group = condition,
label = sample)) +
geom_jitter(width = .1, alpha = .5) +
stat_summary(fun.data = "mean_cl_boot", colour = "blue") +
geom_text(check_overlap = TRUE, nudge_x = .2, nudge_y = 0, colour =
"black", size = 3) +
labs(x = "Condition", y = "DV(ms.)") +
theme_minimal()
Overall the “slow” condition RTs are higher than the “fast” condition RTs - but we can spot some simulations where the difference is negligible or even goes the other way (e.g., Simulation 100). The blue circles correspond to the overall means and are pretty much exactly equal to the population means of 1000 and 1020.
Now, let’s run t-tests for each of the 100 datasets.
result <- NULL
for (i in 1:total_samples) {
result <- rbind(tidy(t.test(filter(all_tidied_data,
condition == "fast" & sample == i)$dv,
filter(all_tidied_data,
condition == "slow" & sample == i)$dv,
paired = FALSE)), result)
}
result
## # A tibble: 100 x 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6.5 1013. 1007. 0.364 0.720 20.2 -30.7 43.7
## 2 -15 1000. 1015. -0.695 0.494 22.0 -59.7 29.7
## 3 7.92 1019. 1011. 0.445 0.661 21.0 -29.1 44.9
## 4 -16.4 984. 999. -0.697 0.493 22.0 -65.3 32.5
## 5 -10.8 1002. 1012. -0.517 0.612 16.9 -54.7 33.2
## 6 7.25 1000. 993 0.359 0.723 20.4 -34.8 49.3
## 7 -35 994. 1030. -1.66 0.113 20.6 -79.0 8.99
## 8 -27.5 983 1010. -1.40 0.175 21.8 -68.2 13.2
## 9 4.83 1026. 1021. 0.246 0.808 18.3 -36.3 46.0
## 10 -35.8 996. 1032 -1.58 0.130 18.9 -83.2 11.5
## # … with 90 more rows, and 2 more variables: method <chr>, alternative <chr>
How many of these 100 simulated datasets have produced a difference that is statistically significant? Remember, we know the “Fast” and “Slow” populations do differ from each other. But can we detect this difference consistently?
result %>%
filter(p.value < .05) %>%
count()
## # A tibble: 1 x 1
## n
## <int>
## 1 17
So, only 17 of the 100 simulations detect the difference - even though the difference is there and waiting to be found. With 24 participants, we have a very under-powered experiment. Indeed, the estimated power for this design to detect the effect we are looking for is just 0.17. We should be aiming to run studies with about 0.80 power; with power much lower than that, we are largely wasting our time.
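As a cross-check on that figure (this is my own addition and uses the analytic approach rather than simulation), base R’s power.t.test() function gives the power of a two-sample t-test for a given per-group n, difference in means (delta), and standard deviation, and can also solve for the per-group n needed to reach a target power. Its answer should land close to the 0.17 we estimated by simulation:

# analytic power for 12 participants per group, a 20 ms difference, and sd = 50
power.t.test(n = 12, delta = 20, sd = 50, sig.level = .05)

# per-group n needed to reach 0.80 power for the same effect
power.t.test(delta = 20, sd = 50, sig.level = .05, power = .8)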
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.