First off, I’d like you to watch the following videos, which start by revising the basics of correlation before examining how we build regression models.
After having watched the videos above, I’d like you to work through the following simple linear regression example in R. Remember to create a new `.Rproj` file to keep things organised.
First we need to load the packages we need. We’re going to load the `tidyverse` set of packages, plus a few others. The `Hmisc` package gives us the `rcorr()` function for calculating Pearson’s r, and the `performance` package lets us test our model assumptions. Remember, if you haven’t previously installed these packages on your laptop, you first need to type `install.packages("packagename")` in the console before you can call the `library()` function for that package. You may also need to install the package `see` to get the `performance` package working. If so, do that in the console by typing `install.packages("see")`.
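If you’re starting from scratch, the whole set can be installed with a one-off command along these lines (a sketch - run it once in the console, not in your script):

# One-off installation in the console; skip any packages you already have.
install.packages(c("tidyverse", "Hmisc", "performance", "see"))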
library(tidyverse)
library(Hmisc)
library(performance)
Import the dataset called `crime_dataset.csv` - this dataset contains population data, housing price index data, and crime data for cities in the US. We can use the function `head()` to display the first few rows of our dataset, which we’ve called `crime`.
crime <- read_csv("https://raw.githubusercontent.com/ajstewartlang/09_glm_regression_pt1/master/data/crime_dataset.csv")
head(crime)
## # A tibble: 6 × 9
##    Year index_nsa `City, State` Population `Violent Crimes` Homicides Rapes
##   <dbl>     <dbl> <chr>              <dbl>            <dbl>     <dbl> <dbl>
## 1  1975      41.1 Atlanta, GA       490584             8033       185   443
## 2  1975      30.8 Chicago, IL      3150000            37160       818  1657
## 3  1975      36.4 Cleveland, OH     659931            10403       288   491
## 4  1975      20.9 Oakland, CA       337748             5900       111   316
## 5  1975      20.4 Seattle, WA       503500             3971        52   324
## 6    NA      NA   <NA>                  NA               NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
First let’s do some wrangling. There is one column that combines both City and State information. Let’s separate that information out into two new columns called “City” and “State” using the function `separate()`. We’ll also rename the columns to change the name of the “index_nsa” column to “House_price” and to get rid of the space in the “Violent Crimes” heading.
crime_tidied <- crime %>%
  separate(col = "City, State", into = c("City", "State")) %>%
  rename(House_price = index_nsa) %>%
  rename(Violent_Crimes = "Violent Crimes")
head(crime_tidied)
## # A tibble: 6 × 10
##    Year House_price City      State Population Violent_Crimes Homicides Rapes
##   <dbl>       <dbl> <chr>     <chr>      <dbl>          <dbl>     <dbl> <dbl>
## 1  1975        41.1 Atlanta   GA        490584           8033       185   443
## 2  1975        30.8 Chicago   IL       3150000          37160       818  1657
## 3  1975        36.4 Cleveland OH        659931          10403       288   491
## 4  1975        20.9 Oakland   CA        337748           5900       111   316
## 5  1975        20.4 Seattle   WA        503500           3971        52   324
## 6    NA        NA   <NA>      <NA>          NA             NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
We might hypothesise that as population size increases, the rate of violent crime also increases. Let’s start by building a scatterplot.
crime_tidied %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) +
  labs(x = "Population",
       y = "Violent Crimes")
This plot looks pretty interesting. How about calculating Pearson’s r?
rcorr(crime_tidied$Population, crime_tidied$Violent_Crimes)
##      x    y
## x 1.00 0.81
## y 0.81 1.00
##
## n
##      x    y
## x 1714 1708
## y 1708 1708
##
## P
##   x  y
## x     0
## y  0
Look at the r and p-values - r = .81 and p < .001. Squaring r gives the proportion of variance explained, so roughly 66% (.81^2 ≈ .66) of the variance in our Violent_Crimes variable is explained by our Population size variable. Clearly there is a positive relationship between population size and the rate of violent crime. From the plot, though, we might suspect that the relationship is being overly influenced by crime in a small number of very large cities (top right of the plot above). Let’s exclude cities with populations greater than 2,000,000.
crime_filtered <- filter(crime_tidied, Population < 2000000)
Now let’s redo the plot. As there are still likely to be quite a lot of points (and thus overplotting, with many points appearing in roughly the same place), we can set the alpha parameter to < 1 in the `geom_point()` line of code. This parameter corresponds to the translucency of each point. Change it to other values to see what happens.
crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) +
  geom_point(alpha = .25) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(text = element_text(size = 13)) +
  labs(x = "Population",
       y = "Violent Crimes")
And calculate Pearson’s r.
rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
##      x    y
## x 1.00 0.69
## y 0.69 1.00
##
## n
##      x    y
## x 1659 1653
## y 1653 1653
##
## P
##   x  y
## x     0
## y  0
There is still a clear positive relationship (r = .69). Let’s build a linear model. The dataset contains a lot of data, and each city appears a number of times (once for each year). For our linear model, our observations need to be independent of each other, so let’s focus on just the year 2015. That way each city will appear only once.
First we apply our filter.
crime_filtered <- filter(crime_filtered, Year == 2015)
Then we build a plot. I’m using the layer `geom_text()` to plot the City names, and I’m setting the `check_overlap` parameter to `TRUE` to ensure the labels don’t overlap.
crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes, label = City)) +
  geom_point() +
  geom_text(nudge_y = 500, check_overlap = TRUE) +
  geom_smooth(method = "lm", se = FALSE) +
  xlim(0, 1800000) +
  theme_minimal() +
  theme(text = element_text(size = 13)) +
  labs(x = "Population",
       y = "Violent Crimes")
This shows a clear positive linear relationship so let’s work out Pearson’s r.
rcorr(crime_filtered$Population, crime_filtered$Violent_Crimes)
##      x    y
## x 1.00 0.65
## y 0.65 1.00
##
## n
##    x  y
## x 42 40
## y 40 40
##
## P
##   x  y
## x     0
## y  0
Imagine we are city planners and want to know by how much violent crimes might increase as a function of population size. In other words, we want to work out how the violent crime rate is predicted by population size.
We’re going to build two linear models - one (`model1`) where we use the mean of our outcome variable as the predictor, and a second (`model2`) where we use Population size to predict the Violent Crimes outcome.
model1 <- lm(Violent_Crimes ~ 1, data = crime_filtered)
model2 <- lm(Violent_Crimes ~ Population, data = crime_filtered)
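As a quick sanity check - a sketch using base R’s `coef()` - the single parameter estimated by the intercept-only model is just the mean of the outcome variable:

# lm() drops the rows with a missing outcome, hence na.rm = TRUE here.
coef(model1)
mean(crime_filtered$Violent_Crimes, na.rm = TRUE)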
Let’s use the `check_model()` function from the `performance` package to check the assumptions of our model.
check_model(model2)
Our dataset is small, and so some of our diagnostic plots don’t look great. We’ll come back to the influential outlier (case 29) later - but for now let’s use the `anova()` function to see if our model with Population as the predictor is better than the one using just the mean.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Violent_Crimes ~ 1
## Model 2: Violent_Crimes ~ Population
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)
## 1     39 445568991
## 2     38 257690819  1 187878173 27.705 5.813e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is - the models differ, and you’ll see that the residual sum of squares (or the error) is less in the second model (the one with Population as the predictor). This means the deviation between our observed data and the regression line model (`model2`) is significantly less than the deviation between our observed data and the mean as a model of our data (`model1`).
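You can also pull those residual sums of squares out directly - a quick sketch using base R’s `residuals()`:

# These should match the RSS column in the anova() table above.
sum(residuals(model1)^2)
sum(residuals(model2)^2)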
So let’s get the parameter estimates of `model2`.
summary(model2)
##
## Call:
## lm(formula = Violent_Crimes ~ Population, data = crime_filtered)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5465.8 -1633.4  -809.1   684.3  6213.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.443e+02  9.216e+02   1.025    0.312
## Population  6.963e-03  1.323e-03   5.264 5.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2604 on 38 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.4217, Adjusted R-squared: 0.4064
## F-statistic: 27.71 on 1 and 38 DF, p-value: 5.813e-06
The intercept corresponds to the point where our regression line intercepts the y-axis, and the Population parameter corresponds to the slope of our line. We see that for every increase in population of 1, there is an extra 0.006963 increase in violent crimes - in other words, about 7 extra violent crimes for every additional 1,000 people.
For a city with a population of about a million, the model predicts about 7907 Violent Crimes. We calculate this by multiplying the estimate of our predictor (0.006963) by 1,000,000 and then adding the intercept (944.3). This gives us 7907.3 crimes, which tallies with what you see in our regression line above. We may have a few outliers - how would you figure out what those are? Try excluding any outliers you find and re-building your model.
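We can also ask R for that prediction directly - a minimal sketch using base R’s `predict()` (the column name in `newdata` must match the predictor in our model formula):

# Predicted violent crimes for a city of one million people; this should
# come out at roughly 7907, matching the hand calculation above.
predict(model2, newdata = tibble(Population = 1000000))

As for the outliers, one common approach (a sketch, not the only option - the object names below are just illustrative) is to flag cases whose Cook’s distance exceeds the 4/n rule of thumb:

# Drop the rows with missing values first, so that row numbers line up
# with the cases lm() actually uses when computing Cook's distances.
crime_complete <- crime_filtered %>%
  filter(!is.na(Violent_Crimes), !is.na(Population))

model2_complete <- lm(Violent_Crimes ~ Population, data = crime_complete)
cooks <- cooks.distance(model2_complete)
crime_complete[cooks > 4 / length(cooks), ]  # candidate influential cases

# Re-build the model without those cases and compare the slope estimate.
model2_trimmed <- lm(Violent_Crimes ~ Population,
                     data = crime_complete[cooks <= 4 / length(cooks), ])
summary(model2_trimmed)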
You now have three tasks:

1. Check whether the same relationship holds for population size and robberies in 2015 (a starter sketch for this task appears below).
2. Are house prices predicted by the number of violent crimes in 2015?
3. Are house prices predicted by population size in 2015?
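Here’s a starter sketch for the first task. It reuses `crime_filtered` (already restricted to 2015 and to cities with populations under two million); the object names are just illustrative, and the same pattern adapts to the other two tasks:

# Does population size predict robberies in 2015?
rcorr(crime_filtered$Population, crime_filtered$Robberies)

model_robberies <- lm(Robberies ~ Population, data = crime_filtered)
check_model(model_robberies)  # check the model assumptions
summary(model_robberies)      # slope, intercept, and R-squared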
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.