First we need to install and load the packages we'll be using. We're going to use the tidyverse packages plus a few others. The Hmisc package gives us the rcorr() function for calculating Pearson's r. Remember, if you haven't previously installed these packages on your laptop, you first need to type install.packages("packagename") in the console before you can call the library() function for that package.
library(tidyverse)
library(Hmisc)
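As an aside, if you're not sure whether a package is already installed, you can guard the install with requireNamespace() so that install.packages() only runs when the package is missing. A minimal sketch:

# Install Hmisc only if it isn't already available
if (!requireNamespace("Hmisc", quietly = TRUE)) {
  install.packages("Hmisc")
}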
Import the dataset called crime_dataset.csv - this dataset contains population data, housing price index data, and crime data for cities in the US. It is from Kaggle datasets: https://www.kaggle.com/sandeep04201988/housing-price-index-using-crime-rate-data/version/1
We can use the head() function to display the first few rows of our dataset, which we call “crime”.
crime <- read_csv("https://bit.ly/2Z5zQlY")
head(crime)
## # A tibble: 6 x 9
## Year index_nsa `City, State` Population `Violent Crimes` Homicides Rapes
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1975 41.1 Atlanta, GA 490584 8033 185 443
## 2 1975 30.8 Chicago, IL 3150000 37160 818 1657
## 3 1975 36.4 Cleveland, OH 659931 10403 288 491
## 4 1975 20.9 Oakland, CA 337748 5900 111 316
## 5 1975 20.4 Seattle, WA 503500 3971 52 324
## 6 NA NA <NA> NA NA NA NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
First let’s do some wrangling. There is one column that combines both City and State information. Let’s separate that information out into two new columns called “City” and “State” using the separate() function. Then have a look at what you now have. How has the output of head(crime) changed from above?
crime <- separate(crime, 'City, State', into = c("City", "State"))
head(crime)
## # A tibble: 6 x 10
## Year index_nsa City State Population `Violent Crimes` Homicides Rapes
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1975 41.1 Atla… GA 490584 8033 185 443
## 2 1975 30.8 Chic… IL 3150000 37160 818 1657
## 3 1975 36.4 Clev… OH 659931 10403 288 491
## 4 1975 20.9 Oakl… CA 337748 5900 111 316
## 5 1975 20.4 Seat… WA 503500 3971 52 324
## 6 NA NA <NA> <NA> NA NA NA NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
Now let’s rename the columns: we’ll change the name of the “index_nsa” column (which is column 2) to “House_price” and get rid of the space in the “Violent Crimes” heading (which is column 6). See how the output of head(crime) has changed again?
colnames(crime)[2] <- "House_price"
colnames(crime)[6] <- "Violent_Crimes"
head(crime)
## # A tibble: 6 x 10
## Year House_price City State Population Violent_Crimes Homicides Rapes
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1975 41.1 Atla… GA 490584 8033 185 443
## 2 1975 30.8 Chic… IL 3150000 37160 818 1657
## 3 1975 36.4 Clev… OH 659931 10403 288 491
## 4 1975 20.9 Oakl… CA 337748 5900 111 316
## 5 1975 20.4 Seat… WA 503500 3971 52 324
## 6 NA NA <NA> <NA> NA NA NA NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
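Note, renaming columns by their position is fragile if the column order ever changes. An equivalent alternative is dplyr's rename() function, which matches columns by name instead - don't run both, this is just another route to the same head(crime) output above:

# Tidyverse alternative: rename columns by name rather than by position
crime <- rename(crime,
                House_price = index_nsa,
                Violent_Crimes = `Violent Crimes`)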
We might hypothesise that as population size increases, crime also increases. Let’s start by building a scatterplot.
crime %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) +
  geom_point() +
  geom_smooth(method = "lm")
This plot looks pretty interesting. How about calculating Pearson’s r?
rcorr(crime$Population, crime$Violent_Crimes)
## x y
## x 1.00 0.81
## y 0.81 1.00
##
## n
## x y
## x 1714 1708
## y 1708 1708
##
## P
## x y
## x 0
## y 0
Look at the r and p-values - r = .81 and p < .001. Squaring r (.81 * .81) gives about .66, so ~66% of the variance in our Violent_Crimes variable is explained by our Population size variable. Clearly there is a positive relationship between population size and the rate of violent crime. From the plot, we might conclude that the relationship is being overly influenced by crime in a small number of very large cities (top right of the plot above). Let’s exclude cities with populations of 2,000,000 or more.
crime_filtered <- filter(crime, Population < 2000000)
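Incidentally, we can sanity-check the variance-explained figure from a moment ago by squaring Pearson's r directly. A quick sketch using base R's cor() on the unfiltered data (dropping incomplete pairs of observations); this should come out at about .66:

# r-squared: proportion of variance in Violent_Crimes explained by Population
cor(crime$Population, crime$Violent_Crimes, use = "complete.obs")^2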
Now let’s redo the plot:
crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) +
  geom_point() +
  geom_smooth(method = "lm")
And calculate Pearson’s r.
rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
## x y
## x 1.00 0.69
## y 0.69 1.00
##
## n
## x y
## x 1659 1653
## y 1653 1653
##
## P
## x y
## x 0
## y 0
There is still a clear positive relationship (r = .69). Let’s build a linear model. The dataset contains a lot of data and each city appears a number of times (once for each year). For our linear model, our observations need to be independent of each other, so let’s focus on just the year 2015. That way each city will appear only once.
First we apply our filter.
crime_filtered <- filter(crime_filtered, Year == 2015)
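As a quick check that each city now appears only once (assuming no two cities share a name), we can count the rows per city - an empty result means no city is duplicated. A minimal sketch:

# Any city appearing more than once after filtering to 2015?
crime_filtered %>%
  count(City) %>%
  filter(n > 1)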
Then we build a plot. I’m using the geom_text() layer to plot the City names, and setting the check_overlap parameter to TRUE so that labels which would overlap are not drawn.
crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes, label = City)) +
  geom_point() +
  geom_text(nudge_y = 500, check_overlap = TRUE) +
  geom_smooth(method = "lm") +
  xlim(0, 1800000)
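Note that check_overlap = TRUE simply drops any label that would collide with one already drawn, so some cities may go unlabelled. If you'd rather keep every label, the ggrepel package can nudge labels apart instead. A sketch, assuming ggrepel is installed:

# Repel labels away from each other instead of dropping overlapping ones
library(ggrepel)

crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes, label = City)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "lm") +
  xlim(0, 1800000)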
This shows a clear positive linear relationship so let’s work out Pearson’s r.
rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
## x y
## x 1.00 0.65
## y 0.65 1.00
##
## n
## x y
## x 42 40
## y 40 40
##
## P
## x y
## x 0
## y 0
Imagine we are a city planner, and we want to know by how much we think violent crimes might increase as a function of population size. In other words, we want to work out how the violent crime rate is predicted by population size.
We’re going to build two linear models - one (model1) where we use the mean of our outcome variable as the only predictor, and a second (model2) where we use Population size to predict the Violent Crimes outcome.
model1 <- lm(Violent_Crimes ~ 1, data = crime_filtered)          # intercept-only (mean) model
model2 <- lm(Violent_Crimes ~ Population, data = crime_filtered) # Population as predictor
Let’s use the anova() function to see if our model with Population as the predictor is better than the one using just the mean.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Violent_Crimes ~ 1
## Model 2: Violent_Crimes ~ Population
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 445568991
## 2 38 257690819 1 187878173 27.705 5.813e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is - the models differ, and you’ll see that the residual sum of squares (the error) is smaller in the second model (the one with Population as the predictor). This means the deviation between our observed data and the regression line of model2 is significantly less than the deviation between our observed data and the mean as a model of our data (model1). So let’s get the parameter estimates of model2.
summary(model2)
##
## Call:
## lm(formula = Violent_Crimes ~ Population, data = crime_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5465.8 -1633.4 -809.1 684.3 6213.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.443e+02 9.216e+02 1.025 0.312
## Population 6.963e-03 1.323e-03 5.264 5.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2604 on 38 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4217, Adjusted R-squared: 0.4064
## F-statistic: 27.71 on 1 and 38 DF, p-value: 5.813e-06
The intercept corresponds to where our regression line intercepts the y-axis, and the Population parameter corresponds to the slope of our line. We see that for every increase in population of 1, violent crimes increase by 0.006963 - in other words, roughly 7 extra violent crimes for every additional 1,000 people.
For a city with a population of about a million, we would predict about 7907 violent crimes. We calculate this by multiplying the estimate of our predictor (0.006963) by 1,000,000 and then adding the intercept (944.3). This gives us 7907.3 crimes, which tallies with what you see in our regression line above. We may have a few outliers.
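We can check that by-hand calculation with the predict() function, and, since we suspect outliers, have a look at base R's standard diagnostic plots for lm objects. A quick sketch:

# Predicted violent crimes for a city of 1,000,000 - should be about 7907
predict(model2, newdata = data.frame(Population = 1000000))

# Four standard diagnostic plots (residuals, Q-Q, scale-location, leverage)
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))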
First we need to load the packages we’ll be using - these are the tidyverse, afex (for ANOVA), and emmeans (for running pairwise comparisons).
library(tidyverse)
library(afex)
library(emmeans)
We need to load our first data file. When we load it, we assign it to a new variable we’re calling my_data. Note, the bitly link is just a shortcut to where the data file is in my GitHub account (https://raw.githubusercontent.com/ajstewartlang/Keele_Sept_2019/master/Afternoon/ANOVA_data1.csv).
my_data <- read_csv("https://bit.ly/2YZRibL")
24 participants responded to a word that was either common (i.e., high lexical frequency) or rare (i.e., low lexical frequency). This is our IV and is coded as ‘high’ vs. ‘low’. Our DV is reaction time and is coded as ‘RT’. Subject number is coded as ‘Subject’. We want to know whether there is a difference between conditions (and if so, where that difference lies). Visualise the data, generate descriptives, and run the appropriate ANOVA to determine whether our independent variable (Condition) has an influence on our dependent variable (RT).
If you type my_data you will see the first 10 rows of our dataframe.
my_data
## # A tibble: 24 x 3
## Subject Condition RT
## <dbl> <chr> <dbl>
## 1 1 low 1103
## 2 2 low 1170
## 3 3 low 1225
## 4 4 low 1084
## 5 5 low 1219
## 6 6 low 1203
## 7 7 low 1208
## 8 8 low 1311
## 9 9 low 1078
## 10 10 low 1326
## # … with 14 more rows
We need to set our Condition column to a factor.
my_data$Condition <- as.factor(my_data$Condition)
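Equivalently, in tidyverse style, the same conversion can be written with mutate() - just an alternative to the line above:

# Tidyverse alternative: convert Condition to a factor inside mutate()
my_data <- my_data %>%
  mutate(Condition = factor(Condition))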
We are first going to plot the data using ggplot - we’ll use both geom_violin() and geom_jitter() layers. This way we’ll be able to see the distribution shapes and also the raw data (jittered so we don’t have lots of points sitting on top of each other). We’re also going to add some summary statistics using the stat_summary() function - specifically, we’re asking for each condition mean, plus the 95% confidence interval around each mean, to be plotted.
my_data %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +
  geom_violin() +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.data = "mean_cl_boot") +
  guides(fill = FALSE)
We’re now going to generate some descriptives. Note, we are using the pipe operator %>% which allows us to ‘pipe’ values from left to right. The following could be read as ‘take the dataframe called my_data, pass it along to the group_by() function and group our data by Condition, pass this grouped data along to the summarise() function, and give me the mean and SD of the RT values for each group’.
my_data %>%
  group_by(Condition) %>%
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 2 x 3
## Condition mean_rt sd_rt
## <fct> <dbl> <dbl>
## 1 high 865. 74.7
## 2 low 1178. 85.7
We’re now going to build our ANOVA model. This is a simple between subjects design. We are going to map the output to a variable we’re calling ‘model’.
model <- aov_4(RT ~ Condition + (1 | Subject), data = my_data)
## Contrasts set to contr.sum for the following variables: Condition
We can now ask for a summary of our model in ANOVA table format using the anova() function.
anova(model)
## Anova Table (Type 3 tests)
##
## Response: RT
## num Df den Df MSE F ges Pr(>F)
## Condition 1 22 6464.7 91.217 0.80568 2.767e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To interpret our output, we don’t need any further comparisons as Condition has just two levels.
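Incidentally, with only two levels a one-way between-subjects ANOVA is equivalent to an independent-samples t-test (the F value is the square of the t value). A quick check, assuming equal variances as the ANOVA does:

# Equivalent independent t-test: t^2 should equal the F above (~91.2)
t.test(RT ~ Condition, data = my_data, var.equal = TRUE)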
Let’s load our second data file. Load it and map it onto a variable called my_data2.
my_data2 <- read_csv("https://bit.ly/2Z8vRoN")
These data are also from a reaction time experiment, but with a slightly more complex design. 48 participants responded to a word that differed in how frequent it was. This factor is between participants and we have four levels, coded as ‘very low’, ‘low’, ‘high’, and ‘very high’. Our DV is reaction time and is coded as ‘RT’. Subject number is coded as ‘Subject’. We want to know if there is a difference between our conditions (and if so, where that difference lies).
We need to set our Condition column to a factor.
my_data2$Condition <- as.factor(my_data2$Condition)
Let’s plot the data…
my_data2 %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +
  geom_violin() +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.data = "mean_cl_boot") +
  guides(fill = FALSE)
Note that it might be better to swap around the order of the factor levels in our plot. We can do that using the factor() function. Let’s reorder…
my_data2$Condition <- factor(my_data2$Condition, levels = c("very low", "low", "high", "very high"))
my_data2 %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) +
  geom_violin() +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.data = "mean_cl_boot") +
  guides(fill = FALSE)
This graph looks better.
Now let’s generate some descriptives.
my_data2 %>%
  group_by(Condition) %>%
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 4 x 3
## Condition mean_rt sd_rt
## <fct> <dbl> <dbl>
## 1 very low 1478. 85.7
## 2 low 1165. 74.7
## 3 high 927. 78.6
## 4 very high 613. 112.
Finally, let’s build our model.
model <- aov_4(RT ~ Condition + (1 | Subject), data = my_data2)
## Contrasts set to contr.sum for the following variables: Condition
anova(model)
## Anova Table (Type 3 tests)
##
## Response: RT
## num Df den Df MSE F ges Pr(>F)
## Condition 3 44 7928.2 203.21 0.93268 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s interpret the main effect using the emmeans() function, correcting for multiple comparisons using the Bonferroni correction.
emmeans(model, pairwise ~ Condition, adjust = "Bonferroni")
## $emmeans
## Condition emmean SE df lower.CL upper.CL
## very low 1478 25.7 44 1426 1530
## low 1165 25.7 44 1113 1216
## high 927 25.7 44 875 979
## very high 613 25.7 44 561 664
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## very low - low 314 36.4 44 8.624 <.0001
## very low - high 551 36.4 44 15.160 <.0001
## very low - very high 866 36.4 44 23.810 <.0001
## low - high 238 36.4 44 6.536 <.0001
## low - very high 552 36.4 44 15.185 <.0001
## high - very high 314 36.4 44 8.650 <.0001
##
## P value adjustment: bonferroni method for 6 tests
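If you also want confidence intervals around those pairwise differences, confint() works on the contrasts component of the emmeans output. A sketch:

# Bonferroni-adjusted confidence intervals for the pairwise differences
confint(emmeans(model, pairwise ~ Condition, adjust = "bonferroni")$contrasts)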
Load the third data file and map it onto the variable my_data3.
my_data3 <- read_csv("https://bit.ly/2N6Z64P")
These data are from a 2 x 2 repeated measures reaction time experiment. We were interested in how quickly participants could respond to images that were Large vs. Small and in Colour vs. Black & White. We expect that Large Colour images will be responded to more quickly than Small B & W images. We’re not sure about Small Colour images and Large B & W images. We measured the response times of 24 participants responding to an image in each of these four conditions. We want to determine if there is a difference between our conditions (and if so, where that difference lies).
We need to set the two columns (Size and Colour) as factors.
my_data3$Size <- as.factor(my_data3$Size)
my_data3$Colour <- as.factor(my_data3$Colour)
First let’s plot (and roll)…
my_data3 %>%
  ggplot(aes(x = Size:Colour, y = RT, fill = Size:Colour)) +
  geom_violin() +
  geom_jitter(width = .1, alpha = .5) +
  stat_summary(fun.data = "mean_cl_boot", colour = "black") +
  guides(fill = FALSE) +
  labs(title = "Violin Plot of Reaction Time (ms.) by Condition\nwith Means and 95% Confidence Intervals",
       x = "Condition", y = "RT (ms.)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
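For a factorial design like this, an interaction plot of the four condition means can be a useful companion to the violins - roughly parallel lines suggest little interaction. A minimal sketch:

# Interaction plot: mean RT for each Size x Colour combination
my_data3 %>%
  group_by(Size, Colour) %>%
  summarise(mean_rt = mean(RT)) %>%
  ggplot(aes(x = Size, y = mean_rt, colour = Colour, group = Colour)) +
  geom_point(size = 3) +
  geom_line() +
  labs(x = "Size", y = "Mean RT (ms.)")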
Now we’re going to generate some descriptives.
my_data3 %>%
  group_by(Size, Colour) %>%
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 4 x 4
## # Groups: Size [2]
## Size Colour mean_rt sd_rt
## <fct> <fct> <dbl> <dbl>
## 1 Large B&W 1228. 133.
## 2 Large Colour 593. 79.0
## 3 Small B&W 1420. 104.
## 4 Small Colour 957. 118.
We will build our ANOVA model using the aov_4() function from the afex package. The term (1 + Size * Colour | Subject) corresponds to our two repeated measures factors and the fact that we’re aggregating over our Subjects.
model <- aov_4(RT ~ Size * Colour + (1 + Size * Colour | Subject), data = my_data3)
summary(model)
##
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
##
## Sum Sq num Df Error SS den Df F value Pr(>F)
## (Intercept) 105787507 1 222362 23 10942.11 < 2.2e-16 ***
## Size 1856206 1 214572 23 198.97 8.235e-13 ***
## Colour 7226489 1 317030 23 524.27 < 2.2e-16 ***
## Size:Colour 177418 1 368287 23 11.08 0.002921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our interaction is significant so we run pairwise comparisons to determine where our difference(s) is/are…
emmeans(model, pairwise ~ Size * Colour, adjust = "Bonferroni")
## $emmeans
## Size Colour emmean SE df lower.CL upper.CL
## Large B.W 1228 22.5 87.3 1183 1273
## Small B.W 1420 22.5 87.3 1375 1465
## Large Colour 593 22.5 87.3 549 638
## Small Colour 957 22.5 87.3 913 1002
##
## Warning: EMMs are biased unless design is perfectly balanced
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## Large,B.W - Small,B.W -192 32.5 43.0 -5.913 <.0001
## Large,B.W - Large,Colour 635 35.2 45.7 18.013 <.0001
## Large,B.W - Small,Colour 271 31.0 44.4 8.721 <.0001
## Small,B.W - Large,Colour 827 31.0 44.4 26.644 <.0001
## Small,B.W - Small,Colour 463 35.2 45.7 13.133 <.0001
## Large,Colour - Small,Colour -364 32.5 43.0 -11.204 <.0001
##
## P value adjustment: bonferroni method for 6 tests
Even with the conservative Bonferroni adjustment, you’ll see that every condition differs from every other condition.
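An alternative way to unpack the interaction is via simple effects - comparing one factor at each level of the other - using emmeans' conditioning syntax. A sketch:

# Simple effects: compare Colour at each level of Size
emmeans(model, pairwise ~ Colour | Size, adjust = "bonferroni")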