Overview

First off, I’d like you to watch the following video, which builds on the first regression workshop. We explore how to build regression models with more than one predictor in R using the lm() function, test our model assumptions, and interpret the output. We then cover different ways of building stepwise regression models with multiple predictors, before finishing with mediation and moderation.


Slides

You can download the slides in .odp format by clicking here and in .pdf format by clicking on the image below.


Link to GLM Regression part 2 slides


Multiple Regression

In standard multiple regression, all the independent variables (IVs) are entered into the equation at the same time, and each is evaluated for its unique contribution. Let’s work through a specific example.

An educational psychologist conducted a study investigating the psycholinguistic variables that contribute to spelling performance in primary school children aged between 7 and 9 years. The researcher presented children with 48 words that varied systematically according to features such as age of acquisition, word frequency, word length, and imageability. The psychologist wanted to check whether performance on this test accurately reflected children’s spelling ability as estimated by a standardised spelling test - that is, whether her test was appropriate.

Children’s chronological age (in months) (age), their reading age (RA), their standardised reading age (std_RA), and their standardised spelling score (std_SPELL) were chosen as predictor variables. The criterion variable (Y) was the percentage correct spelling (corr_spell) score attained by each child using the list of 48 words.

First we need to load the packages we need - the library() function assumes they are already installed on your machine. If they are not, you need to install them first with install.packages("packagename"):

The Packages We Need

library(tidyverse) # Load the tidyverse packages
library(Hmisc) # Needed for correlation
library(MASS) # Needed for maths functions
library(car) # Needed for VIF calculation
library(olsrr) # Needed for stepwise regression 
library(performance) # Needed to check model assumptions
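
If some of these packages are missing on your machine, a minimal sketch for installing just the missing ones (the package names below simply match the library() calls above):

pkgs <- c("tidyverse", "Hmisc", "MASS", "car", "olsrr", "performance")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())] # which are not yet installed?
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)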

Import the Data

You now need to read in the data file.

MRes_tut2 <- read_csv("https://raw.githubusercontent.com/ajstewartlang/10_glm_regression_pt2/master/data/MRes_tut2.csv")
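
It’s worth having a quick look at what we’ve just read in - head() prints the first few rows:

head(MRes_tut2)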

Examining Possible Relationships

Before we start, let’s look at the relationships between our IVs (predictors) and our DV (outcome). We can plot graphs depicting the correlations. We’ll plot test performance against each of our four predictors in turn:

ggplot(MRes_tut2, aes(x = age, y = corr_spell)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) 

ggplot(MRes_tut2, aes(x = RA, y = corr_spell)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) 

ggplot(MRes_tut2, aes(x = std_RA, y = corr_spell)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) 

ggplot(MRes_tut2, aes(x = std_SPELL, y = corr_spell)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) 
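
Alongside the plots, a quick numerical check of the pairwise correlations can be useful. Here is a minimal sketch using the rcorr() function from Hmisc - note the explicit dplyr::select, as MASS (loaded above) masks dplyr’s select():

MRes_tut2 %>%
  dplyr::select(age, RA, std_RA, std_SPELL, corr_spell) %>%
  as.matrix() %>% # rcorr() expects a matrix
  rcorr()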

Model the Data

We are going to do hierarchical regression first - we’ll build one model (which we’ll call model0) that contains just the intercept (i.e., the mean of our outcome variable), and another model (model1) which contains all four of our predictors:

model0 <- lm(corr_spell ~ 1, data = MRes_tut2)
model1 <- lm(corr_spell ~ age + RA + std_RA + std_SPELL, data = MRes_tut2)

Let’s compare them to each other:

anova(model0, model1)
## Analysis of Variance Table
## 
## Model 1: corr_spell ~ 1
## Model 2: corr_spell ~ age + RA + std_RA + std_SPELL
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     46 26348.4                                  
## 2     42  3901.1  4     22447 60.417 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the models differ from each other (look at the p-value of the comparison) and that the model with the four predictors has the lower residual sum of squares (RSS) value, meaning there is less error between the model and the observed data relative to the simpler intercept-only model (i.e., the mean) and the observed data.
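
If you prefer an information criterion view of the same comparison, base R’s AIC() function reports the AIC for each model (lower is better):

AIC(model0, model1) # returns a small table of df and AIC values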

Checking our Assumptions

OK, so they differ - now let’s plot information about our model assumptions - remember, we are particularly interested in the Cook’s distance values for each case…

check_model(model1)

The errors look fairly equally distributed along our fitted values (homoscedasticity) - although a little worse for high fitted values - and from the Q-Q plot we can tell they look fairly normal (the points should follow the diagonal). How about influential cases? Case 10 looks a bit dodgy - it has a high Cook’s distance value - which suggests it is having a disproportionate effect on our model. Let’s exclude it using the filter() function - the operator != means ‘does not equal’, so we are keeping all cases other than Case 10.
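
If you want to see the Cook’s distance values themselves rather than just the plot, a quick sketch using the base cooks.distance() function:

cooks_d <- cooks.distance(model1) # one value per case
head(sort(cooks_d, decreasing = TRUE)) # the most influential cases first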

Dropping an Influential Case

MRes_tut2_drop10 <- filter(MRes_tut2, case != "10")
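
As a quick sanity check (assuming exactly one row has case equal to "10"), the new data frame should have one fewer row than the original:

nrow(MRes_tut2) - nrow(MRes_tut2_drop10) # should be 1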

Re(model) the Data

We now create another model (model2) which doesn’t include Case 10.

model2 <- lm(corr_spell ~ age + RA + std_RA + std_SPELL, data = MRes_tut2_drop10)

Let’s check the model assumptions again using check_model().

Checking our Assumptions

check_model(model2)

Now, let’s look at the multicollinearity values measured by VIF:

vif(model2)
##       age        RA    std_RA std_SPELL 
##  3.843462 14.763168 14.672084  3.140457

It looks like RA and std_RA are problematic. We can look at the correlation between them using the rcorr() function:

rcorr(MRes_tut2_drop10$RA, MRes_tut2_drop10$std_RA)
##      x    y
## x 1.00 0.88
## y 0.88 1.00
## 
## n= 46 
## 
## 
## P
##   x  y 
## x     0
## y  0

Re(model) the Data

The correlation is pretty high (0.88), so let’s exclude the predictor with the highest VIF value (which is RA) and build a new model:

model3 <- lm(corr_spell ~ age + std_RA + std_SPELL, data = MRes_tut2_drop10)
vif(model3)
##       age    std_RA std_SPELL 
##  1.190827  2.636186  2.821235

Checking our Assumptions

These values look ok now. Let’s check the model assumptions again.

check_model(model3)

Summary of our Model

Now let’s generate the coefficients as this looks like a sensible model.

summary(model3)
## 
## Call:
## lm(formula = corr_spell ~ age + std_RA + std_SPELL, data = MRes_tut2_drop10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.801  -6.907   1.327   5.155  24.669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -209.4400    26.7210  -7.838 9.43e-10 ***
## age            1.1033     0.2133   5.172 6.09e-06 ***
## std_RA         0.3804     0.1385   2.747  0.00883 ** 
## std_SPELL      1.2107     0.1650   7.339 4.78e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 42 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8273 
## F-statistic: 72.87 on 3 and 42 DF,  p-value: < 2.2e-16
model0 <- lm(corr_spell ~ 1, data = MRes_tut2_drop10)
anova(model3, model0)
## Analysis of Variance Table
## 
## Model 1: corr_spell ~ age + std_RA + std_SPELL
## Model 2: corr_spell ~ 1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     42  4053.5                                  
## 2     45 25151.0 -3    -21098 72.866 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We’d write our equation as something like:

Percentage correct spelling = -209.44 + 1.10(age) + 0.38(std_RA) + 1.21(std_SPELL) + residual
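
One thing this equation lets us do is generate a predicted spelling score for a new child via predict() - the predictor values below are made up purely for illustration:

# hypothetical child: values chosen only to illustrate predict()
new_child <- tibble(age = 100, std_RA = 100, std_SPELL = 100)
predict(model3, newdata = new_child)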

Stepwise Regression

We can also do stepwise regression. Forward stepwise regression starts with the null model and adds predictors one at a time until they stop explaining additional variance; backward stepwise regression starts with the full model and removes predictors until removal starts to hurt the model’s predictive ability. Let’s keep Case 10 dropped and also drop the high-VIF predictor (RA). Stepwise methods are handy for models with lots of predictors where the order of entry in sequential regression is not obvious.

Model the Data

model0 <- lm(corr_spell ~ 1, data = MRes_tut2_drop10)
model1 <- lm(corr_spell ~ age + std_RA + std_SPELL, data = MRes_tut2_drop10)

Let’s do stepwise forwards:

steplimitsf <- step(model0, scope = list(lower = model0, upper = model1), direction = "forward")
## Start:  AIC=291.98
## corr_spell ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + std_SPELL  1   17802.2  7348.8 237.39
## + std_RA     1   14780.1 10370.9 253.23
## <none>                   25151.0 291.98
## + age        1      48.7 25102.3 293.89
## 
## Step:  AIC=237.39
## corr_spell ~ std_SPELL
## 
##          Df Sum of Sq    RSS    AIC
## + age     1   2567.20 4781.6 219.62
## + std_RA  1    714.14 6634.7 234.69
## <none>                7348.8 237.39
## 
## Step:  AIC=219.62
## corr_spell ~ std_SPELL + age
## 
##          Df Sum of Sq    RSS    AIC
## + std_RA  1     728.1 4053.5 214.02
## <none>                4781.6 219.62
## 
## Step:  AIC=214.02
## corr_spell ~ std_SPELL + age + std_RA
summary(steplimitsf)
## 
## Call:
## lm(formula = corr_spell ~ std_SPELL + age + std_RA, data = MRes_tut2_drop10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.801  -6.907   1.327   5.155  24.669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -209.4400    26.7210  -7.838 9.43e-10 ***
## std_SPELL      1.2107     0.1650   7.339 4.78e-09 ***
## age            1.1033     0.2133   5.172 6.09e-06 ***
## std_RA         0.3804     0.1385   2.747  0.00883 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 42 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8273 
## F-statistic: 72.87 on 3 and 42 DF,  p-value: < 2.2e-16

Stepwise backwards:

steplimitsb <- step(model1, direction = "backward")
## Start:  AIC=214.02
## corr_spell ~ age + std_RA + std_SPELL
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   4053.5 214.02
## - std_RA     1     728.1 4781.6 219.62
## - age        1    2581.2 6634.7 234.69
## - std_SPELL  1    5198.3 9251.8 249.98
summary(steplimitsb)
## 
## Call:
## lm(formula = corr_spell ~ age + std_RA + std_SPELL, data = MRes_tut2_drop10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.801  -6.907   1.327   5.155  24.669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -209.4400    26.7210  -7.838 9.43e-10 ***
## age            1.1033     0.2133   5.172 6.09e-06 ***
## std_RA         0.3804     0.1385   2.747  0.00883 ** 
## std_SPELL      1.2107     0.1650   7.339 4.78e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 42 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8273 
## F-statistic: 72.87 on 3 and 42 DF,  p-value: < 2.2e-16

And stepwise using both forwards and backwards procedures:

steplimitsboth <- step(model0, scope = list(upper = model1), direction = "both")
## Start:  AIC=291.98
## corr_spell ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + std_SPELL  1   17802.2  7348.8 237.39
## + std_RA     1   14780.1 10370.9 253.23
## <none>                   25151.0 291.98
## + age        1      48.7 25102.3 293.89
## 
## Step:  AIC=237.39
## corr_spell ~ std_SPELL
## 
##             Df Sum of Sq     RSS    AIC
## + age        1    2567.2  4781.6 219.62
## + std_RA     1     714.1  6634.7 234.69
## <none>                    7348.8 237.39
## - std_SPELL  1   17802.2 25151.0 291.98
## 
## Step:  AIC=219.62
## corr_spell ~ std_SPELL + age
## 
##             Df Sum of Sq     RSS    AIC
## + std_RA     1     728.1  4053.5 214.02
## <none>                    4781.6 219.62
## - age        1    2567.2  7348.8 237.39
## - std_SPELL  1   20320.7 25102.3 293.89
## 
## Step:  AIC=214.02
## corr_spell ~ std_SPELL + age + std_RA
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   4053.5 214.02
## - std_RA     1     728.1 4781.6 219.62
## - age        1    2581.2 6634.7 234.69
## - std_SPELL  1    5198.3 9251.8 249.98

Checking our Assumptions

check_model(steplimitsboth)

These look ok.

summary(steplimitsboth)
## 
## Call:
## lm(formula = corr_spell ~ std_SPELL + age + std_RA, data = MRes_tut2_drop10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.801  -6.907   1.327   5.155  24.669 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -209.4400    26.7210  -7.838 9.43e-10 ***
## std_SPELL      1.2107     0.1650   7.339 4.78e-09 ***
## age            1.1033     0.2133   5.172 6.09e-06 ***
## std_RA         0.3804     0.1385   2.747  0.00883 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 42 degrees of freedom
## Multiple R-squared:  0.8388, Adjusted R-squared:  0.8273 
## F-statistic: 72.87 on 3 and 42 DF,  p-value: < 2.2e-16

You’ll see that the same final model is arrived at in each case. We have three significant predictors.
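
If you want interval estimates to go with those coefficients, base R’s confint() function gives confidence intervals for the final model:

confint(steplimitsboth) # 95% confidence intervals by default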

Entering predictors based on their p-values

We can also use the ols_step_forward_p() function from the olsrr package - this finds the best model by entering the predictors on the basis of their p-values. There are other selection methods within the olsrr package. To see them, type olsrr:: into the console window.

pmodel <- ols_step_forward_p(model1)
pmodel
## 
##                              Selection Summary                              
## ---------------------------------------------------------------------------
##         Variable                   Adj.                                        
## Step     Entered     R-Square    R-Square     C(p)        AIC        RMSE      
## ---------------------------------------------------------------------------
##    1    std_SPELL      0.7078      0.7012    34.1439    369.9304    12.9236    
##    2    age            0.8099      0.8010     9.5442    352.1614    10.5452    
##    3    std_RA         0.8388      0.8273     4.0000    346.5624     9.8241    
## ---------------------------------------------------------------------------

In this case, the model with the lowest AIC value is also the one arrived at via the sequential procedure based on p-values - but this may not always be the case.
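
olsrr also provides a plot() method for these stepwise objects, which charts the fit criteria (e.g., R-squared, AIC) at each step - a quick sketch, assuming a reasonably recent version of the package:

plot(pmodel)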

Improve this Workshop

If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.