The Linear Model

First we need to load the packages we'll be using - the tidyverse packages plus a few others. The Hmisc package gives us the rcorr() function for calculating Pearson's r. Remember, if you haven't previously installed these packages on your laptop you first need to type install.packages("packagename") in the console before you can call the library() function for that package.
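
For example, to install everything this section needs in one go (you only need to do this once per machine):

install.packages(c("tidyverse", "Hmisc"))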

library(tidyverse)
library(Hmisc)

Import the dataset called crime_dataset.csv - this dataset contains population data, housing price index data and crime data for cities in the US.

It is from Kaggle datasets: https://www.kaggle.com/sandeep04201988/housing-price-index-using-crime-rate-data/version/1

We read the data in with read_csv(), assigning it to a variable called "crime", and then use the function head() to display the first few rows.

crime <- read_csv("https://bit.ly/2Z5zQlY")
head(crime)
## # A tibble: 6 x 9
##    Year index_nsa `City, State` Population `Violent Crimes` Homicides Rapes
##   <dbl>     <dbl> <chr>              <dbl>            <dbl>     <dbl> <dbl>
## 1  1975      41.1 Atlanta, GA       490584             8033       185   443
## 2  1975      30.8 Chicago, IL      3150000            37160       818  1657
## 3  1975      36.4 Cleveland, OH     659931            10403       288   491
## 4  1975      20.9 Oakland, CA       337748             5900       111   316
## 5  1975      20.4 Seattle, WA       503500             3971        52   324
## 6    NA      NA   <NA>                  NA               NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>

First let’s do some wrangling. There is one column that combines both City and State information. Let’s separate that information out into two new columns called “City” and “State” using the function separate(). Then have a look at what you now have. How has the output of head(crime) changed from above?

crime <- separate(crime, 'City, State', into=c("City", "State"))
head(crime)
## # A tibble: 6 x 10
##    Year index_nsa City  State Population `Violent Crimes` Homicides Rapes
##   <dbl>     <dbl> <chr> <chr>      <dbl>            <dbl>     <dbl> <dbl>
## 1  1975      41.1 Atla… GA        490584             8033       185   443
## 2  1975      30.8 Chic… IL       3150000            37160       818  1657
## 3  1975      36.4 Clev… OH        659931            10403       288   491
## 4  1975      20.9 Oakl… CA        337748             5900       111   316
## 5  1975      20.4 Seat… WA        503500             3971        52   324
## 6    NA      NA   <NA>  <NA>          NA               NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
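
One thing to be aware of: by default, separate() splits on any sequence of non-alphanumeric characters. That works for the cities shown above, but it can clip multi-word city names (e.g. "New York"). If you want to be safe, you can pass the separator explicitly - this is an alternative to the call above, not an extra step:

# Alternative to the separate() call above, with an explicit separator:
# crime <- separate(crime, 'City, State', into = c("City", "State"), sep = ", ")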

Now let’s rename the columns to change the name of the “index_nsa” column (which is column 2) to “House_price” and get rid of the space in the “Violent Crimes” heading (which is column 6). See how the output of head(crime) has changed again?

colnames(crime)[2] <- "House_price"
colnames(crime)[6] <- "Violent_Crimes"
head(crime)
## # A tibble: 6 x 10
##    Year House_price City  State Population Violent_Crimes Homicides Rapes
##   <dbl>       <dbl> <chr> <chr>      <dbl>          <dbl>     <dbl> <dbl>
## 1  1975        41.1 Atla… GA        490584           8033       185   443
## 2  1975        30.8 Chic… IL       3150000          37160       818  1657
## 3  1975        36.4 Clev… OH        659931          10403       288   491
## 4  1975        20.9 Oakl… CA        337748           5900       111   316
## 5  1975        20.4 Seat… WA        503500           3971        52   324
## 6    NA        NA   <NA>  <NA>          NA             NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
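
As an aside, a more tidyverse-style way to do the same renaming is with dplyr's rename(), which refers to columns by name rather than by position:

# Equivalent to the two colnames() assignments above:
# crime <- rename(crime, House_price = index_nsa, Violent_Crimes = `Violent Crimes`)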

We might hypothesise that as population size increases, the number of violent crimes also increases. Let's start by building a scatterplot.

crime %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) + 
  geom_point() + 
  geom_smooth(method = "lm")

This plot looks pretty interesting. How about calculating Pearson’s r?

rcorr(crime$Population, crime$Violent_Crimes)
##      x    y
## x 1.00 0.81
## y 0.81 1.00
## 
## n
##      x    y
## x 1714 1708
## y 1708 1708
## 
## P
##   x  y 
## x     0
## y  0
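
Note that rcorr() returns a list, so if you want to work with these values programmatically you can pull Pearson's r out of the $r matrix and square it to get the proportion of variance explained:

corr_output <- rcorr(crime$Population, crime$Violent_Crimes)
corr_output$r[1, 2]    # Pearson's r (about .81)
corr_output$r[1, 2]^2  # proportion of variance explained (about .66)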

Look at the r and p-values - r = .81 and p < .001, so about 66% of the variance in our Violent_Crimes variable is explained by our Population variable (r-squared = .81^2 ≈ .66). Clearly there is a positive relationship between population size and violent crime. From the plot, though, we might worry that the relationship is being overly influenced by a small number of very large cities (top right of the plot above). Let's exclude cities with populations greater than 2,000,000.

crime_filtered <- filter(crime, Population < 2000000)

Now let’s redo the plot:

crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) + 
  geom_point() + 
  geom_smooth(method = "lm")

And calculate Pearson’s r.

rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
##      x    y
## x 1.00 0.69
## y 0.69 1.00
## 
## n
##      x    y
## x 1659 1653
## y 1653 1653
## 
## P
##   x  y 
## x     0
## y  0

There is still a clear positive relationship (r = .69). Let's build a linear model. The dataset contains a lot of data and each city appears multiple times (once per year). For our linear model, our observations need to be independent of each other, so let's focus on just the year 2015. That way each city appears only once.

First we apply our filter.

crime_filtered <- filter(crime_filtered, Year == 2015)

Then we build a plot. I'm using the geom_text() layer to plot the City names, with the check_overlap parameter set to TRUE so that overlapping labels are dropped rather than drawn on top of each other.

crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes, label = City)) + 
  geom_point() + 
  geom_text(nudge_y = 500, check_overlap = TRUE) + 
  geom_smooth(method = "lm") + 
  xlim(0,1800000)

This shows a clear positive linear relationship so let’s work out Pearson’s r.

rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
##      x    y
## x 1.00 0.65
## y 0.65 1.00
## 
## n
##    x  y
## x 42 40
## y 40 40
## 
## P
##   x  y 
## x     0
## y  0

Imagine we are a city planner, and we want to know by how much we think violent crimes might increase as a function of population size. In other words, we want to work out how the violent crime rate is predicted by population size.

We're going to build two linear models - model1, where we use the mean of our outcome variable as the predictor, and model2, where we use Population size to predict the Violent_Crimes outcome.

model1 <- lm(Violent_Crimes ~ 1, data = crime_filtered)
model2 <- lm(Violent_Crimes ~ Population, data = crime_filtered)
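
As a quick sanity check, model1 contains just one parameter - the intercept - and its estimate should be the mean of our outcome variable:

coef(model1)  # the intercept of the intercept-only model...
mean(crime_filtered$Violent_Crimes, na.rm = TRUE)  # ...should equal this mean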

Let’s use the anova() function to see if our model with Population as the predictor is better than the one using just the mean.

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: Violent_Crimes ~ 1
## Model 2: Violent_Crimes ~ Population
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1     39 445568991                                  
## 2     38 257690819  1 187878173 27.705 5.813e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is - the models differ, and you'll see that the residual sum of squares (the error) is less in the second model (the one with Population as the predictor). This means the deviation between our observed data and the regression line (model2) is significantly less than the deviation between our observed data and the mean as a model of our data (model1).
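
If you want to pull those residual sums of squares out of the models directly, the deviance() function returns the RSS of an lm model:

deviance(model1)  # RSS with the mean as our model
deviance(model2)  # RSS with Population as the predictor - much smaller

So let's get the parameter estimates of model2.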

summary(model2)
## 
## Call:
## lm(formula = Violent_Crimes ~ Population, data = crime_filtered)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5465.8 -1633.4  -809.1   684.3  6213.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.443e+02  9.216e+02   1.025    0.312    
## Population  6.963e-03  1.323e-03   5.264 5.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2604 on 38 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4217, Adjusted R-squared:  0.4064 
## F-statistic: 27.71 on 1 and 38 DF,  p-value: 5.813e-06

The intercept corresponds to where our regression line intercepts the y-axis, and the Population parameter corresponds to the slope of our line. We see that for every one-person increase in population there is an extra 0.006963 increase in violent crimes - in other words, about 7 extra violent crimes for every extra 1,000 people.

For a city with a population of about a million, we would predict about 7907 violent crimes. We calculate this by multiplying the estimate of our predictor (0.006963) by 1,000,000 and then adding the intercept (944.3). This gives us 7907.3 crimes - which tallies with the regression line in the plot above. We may have a few outliers.
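
Rather than doing this arithmetic by hand, we can ask R for the same prediction using the predict() function:

# Predicted number of violent crimes for a city of 1,000,000 people
predict(model2, newdata = tibble(Population = 1000000))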

ANOVA

First we need to load the packages we’ll be using - these are the tidyverse, afex (for ANOVA), and emmeans (for running pairwise comparisons).

library(tidyverse)
library(afex)
library(emmeans)

We need to load our first data file. When we load it, we assign it to a new variable we're calling my_data. Note, the bit.ly link is just a shortcut to where the data file lives in my GitHub account (https://raw.githubusercontent.com/ajstewartlang/Keele_Sept_2019/master/Afternoon/ANOVA_data1.csv).

my_data <- read_csv("https://bit.ly/2YZRibL")

24 participants responded to a word that was either common (i.e., high lexical frequency) or rare (i.e., low lexical frequency). This is our IV and is coded as 'high' vs. 'low'. Our DV is reaction time and is coded as 'RT'. Subject number is coded as 'Subject'. We want to know whether there is a difference between conditions (and if so, where that difference lies). Visualise the data, generate descriptives, and run the appropriate ANOVA to determine whether our independent variable (Condition) has an influence on our dependent variable (RT).

If you type my_data you will see the first 10 rows of our dataframe.

my_data
## # A tibble: 24 x 3
##    Subject Condition    RT
##      <dbl> <chr>     <dbl>
##  1       1 low        1103
##  2       2 low        1170
##  3       3 low        1225
##  4       4 low        1084
##  5       5 low        1219
##  6       6 low        1203
##  7       7 low        1208
##  8       8 low        1311
##  9       9 low        1078
## 10      10 low        1326
## # … with 14 more rows

We need to set our Condition column to a factor.

my_data$Condition <- as.factor(my_data$Condition)
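
You can check this has worked (and see what the levels are) by calling:

levels(my_data$Condition)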

We are first going to plot the data using ggplot - we'll use both geom_violin() and geom_jitter() layers so that we can see both the distribution shapes and the raw data (jittered so we don't have lots of points sitting on top of each other). We're also going to add some summary statistics using the stat_summary() function - specifically, we're asking for each condition mean plus the 95% confidence interval around each mean to be plotted.

my_data %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) + 
  geom_violin() + 
  geom_jitter(width = .1, alpha = .5) + 
  stat_summary(fun.data = "mean_cl_boot") + 
  guides(fill = FALSE)

We’re now going to generate some descriptives. Note, we are using the pipe operator %>% which allows us to ‘pipe’ values from left to right. The following could be read as ‘take the dataframe called my_data, pass it along to the function group_by() and group our data by condition, pass this grouped data along to the summarise() function and give me the mean and SD of the RT values for each group’.

my_data %>% 
  group_by(Condition) %>% 
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 2 x 3
##   Condition mean_rt sd_rt
##   <fct>       <dbl> <dbl>
## 1 high         865.  74.7
## 2 low         1178.  85.7

We’re now going to build our ANOVA model. This is a simple between subjects design. We are going to map the output to a variable we’re calling ‘model’.

model <- aov_4(RT ~ Condition + (1 | Subject), data = my_data)
## Contrasts set to contr.sum for the following variables: Condition

We can now ask for a summary of our model in ANOVA table format using the anova() function.

anova(model)
## Anova Table (Type 3 tests)
## 
## Response: RT
##           num Df den Df    MSE      F     ges    Pr(>F)    
## Condition      1     22 6464.7 91.217 0.80568 2.767e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To interpret our output, we don't need any further comparisons as Condition has just two levels.
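
If you did want the estimated condition means (with standard errors and confidence intervals), you could still ask emmeans for them:

emmeans(model, ~ Condition)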

Let’s load our second datafile. Load it and map it onto a variable called my_data2.

my_data2 <- read_csv("https://bit.ly/2Z8vRoN")

These data are also from a reaction time experiment but with a slightly more complex design. 48 participants responded to a word that differed in how frequent it was. This factor is between participants and we have four levels coded as 'very low', 'low', 'high', and 'very high'. Our DV is reaction time and is coded as 'RT'. Subject number is coded as 'Subject'. We want to know if there is a difference between our conditions (and if so, where that difference lies).

We need to set our Condition column to a factor.

my_data2$Condition <- as.factor(my_data2$Condition)

Let’s plot the data…

my_data2 %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) + 
  geom_violin() + 
  geom_jitter(width = .1, alpha=.5) + 
  stat_summary(fun.data = "mean_cl_boot") + 
  guides(fill = FALSE)

Note that it might be better to rearrange the order of the factor levels in our plot. We can do that using the factor() function. Let's reorder…

my_data2$Condition <- factor(my_data2$Condition, levels = c("very low", "low", "high", "very high"))
my_data2 %>%
  ggplot(aes(x = Condition, y = RT, fill = Condition)) + 
  geom_violin() + 
  geom_jitter(width = .1, alpha = .5) + 
  stat_summary(fun.data = "mean_cl_boot") + 
  guides(fill = FALSE)

This graph looks better.

Now let’s generate some descriptives.

my_data2 %>% 
  group_by(Condition) %>% 
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 4 x 3
##   Condition mean_rt sd_rt
##   <fct>       <dbl> <dbl>
## 1 very low    1478.  85.7
## 2 low         1165.  74.7
## 3 high         927.  78.6
## 4 very high    613. 112.

Finally, let’s build our model.

model <- aov_4(RT ~ Condition + (1 | Subject), data = my_data2)
## Contrasts set to contr.sum for the following variables: Condition
anova(model)
## Anova Table (Type 3 tests)
## 
## Response: RT
##           num Df den Df    MSE      F     ges    Pr(>F)    
## Condition      3     44 7928.2 203.21 0.93268 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s interpret the main effect using the emmeans() function and correcting for multiple comparisons using the Bonferroni correction.

emmeans(model, pairwise ~ Condition, adjust = "Bonferroni")
## $emmeans
##  Condition emmean   SE df lower.CL upper.CL
##  very low    1478 25.7 44     1426     1530
##  low         1165 25.7 44     1113     1216
##  high         927 25.7 44      875      979
##  very high    613 25.7 44      561      664
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast             estimate   SE df t.ratio p.value
##  very low - low            314 36.4 44  8.624  <.0001 
##  very low - high           551 36.4 44 15.160  <.0001 
##  very low - very high      866 36.4 44 23.810  <.0001 
##  low - high                238 36.4 44  6.536  <.0001 
##  low - very high           552 36.4 44 15.185  <.0001 
##  high - very high          314 36.4 44  8.650  <.0001 
## 
## P value adjustment: bonferroni method for 6 tests

Load the third datafile and map it onto the variable my_data3.

my_data3 <- read_csv("https://bit.ly/2N6Z64P")

These data are from a 2 x 2 repeated measures reaction time experiment. We were interested in how quickly participants could respond to images that were Large vs. Small and in Colour vs. Black & White. We expect that Large Colour images will be responded to more quickly than Small B & W images. We're not sure about Small Colour images and Large B & W images. We measured the response times of 24 participants responding to an image in each of these four conditions. We want to determine if there is a difference between our conditions (and if so, where that difference lies).

We need to set the two columns (Size and Colour) as factors.

my_data3$Size <- as.factor(my_data3$Size)
my_data3$Colour <- as.factor(my_data3$Colour)

First let’s plot (and roll)…

my_data3 %>%
  ggplot(aes(x = Size:Colour, y = RT, fill = Size:Colour)) + 
  geom_violin() + 
  geom_jitter(width = .1, alpha = .5) + 
  stat_summary(fun.data = "mean_cl_boot", colour = "black") + 
  guides(fill = FALSE) + 
  labs(title = "Violin Plot of Reaction Time (ms.) by Condition\nwith Means and 95% Confidence Intervals", x = "Condition", y = "RT (ms.)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Now we’re going to generate some descriptives.

my_data3 %>% 
  group_by(Size, Colour) %>% 
  summarise(mean_rt = mean(RT), sd_rt = sd(RT))
## # A tibble: 4 x 4
## # Groups:   Size [2]
##   Size  Colour mean_rt sd_rt
##   <fct> <fct>    <dbl> <dbl>
## 1 Large B&W      1228. 133. 
## 2 Large Colour    593.  79.0
## 3 Small B&W      1420. 104. 
## 4 Small Colour    957. 118.

We will build our ANOVA model using the aov_4() function in the afex package. The term (1 + Size * Colour | Subject) indicates that both of our factors (and their interaction) are repeated measures - that is, they vary within Subjects.

model <- aov_4(RT ~ Size * Colour + (1 + Size * Colour | Subject), data = my_data3)
summary(model)
## 
## Univariate Type III Repeated-Measures ANOVA Assuming Sphericity
## 
##                Sum Sq num Df Error SS den Df  F value    Pr(>F)    
## (Intercept) 105787507      1   222362     23 10942.11 < 2.2e-16 ***
## Size          1856206      1   214572     23   198.97 8.235e-13 ***
## Colour        7226489      1   317030     23   524.27 < 2.2e-16 ***
## Size:Colour    177418      1   368287     23    11.08  0.002921 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our interaction is significant so we run pairwise comparisons to determine where our difference(s) is/are…

emmeans(model, pairwise ~ Size * Colour, adjust = "Bonferroni")
## $emmeans
##  Size  Colour emmean   SE   df lower.CL upper.CL
##  Large B.W      1228 22.5 87.3     1183     1273
##  Small B.W      1420 22.5 87.3     1375     1465
##  Large Colour    593 22.5 87.3      549      638
##  Small Colour    957 22.5 87.3      913     1002
## 
## Warning: EMMs are biased unless design is perfectly balanced 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast                    estimate   SE   df t.ratio p.value
##  Large,B.W - Small,B.W           -192 32.5 43.0  -5.913 <.0001 
##  Large,B.W - Large,Colour         635 35.2 45.7  18.013 <.0001 
##  Large,B.W - Small,Colour         271 31.0 44.4   8.721 <.0001 
##  Small,B.W - Large,Colour         827 31.0 44.4  26.644 <.0001 
##  Small,B.W - Small,Colour         463 35.2 45.7  13.133 <.0001 
##  Large,Colour - Small,Colour     -364 32.5 43.0 -11.204 <.0001 
## 
## P value adjustment: bonferroni method for 6 tests

Even with the conservative Bonferroni adjustment, you’ll see that every condition differs from every other condition.
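
If you preferred to unpack the interaction in terms of simple effects instead - for example, the effect of Colour at each level of Size - emmeans can condition the contrasts on another factor:

emmeans(model, pairwise ~ Colour | Size, adjust = "Bonferroni")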