The Linear Model

First we need to load the packages we need. We’re going to load the tidyverse packages plus a few others. The package Hmisc allows us to use the rcorr() function for calculating Pearson’s r. Remember, if you haven’t previously installed these packages on your laptop, you first need to type install.packages("packagename") in the console before you can call the library() function for that package.

library(tidyverse)
library(Hmisc)

Import the dataset called crime_dataset.csv - this dataset contains population data, housing price index data and crime data for cities in the US.

It is from Kaggle datasets: https://www.kaggle.com/sandeep04201988/housing-price-index-using-crime-rate-data/version/1

We read the file into a tibble called “crime” using read_csv(), then use the function head() to display its first few rows.

crime <- read_csv("https://bit.ly/2Z5zQlY")
head(crime)
## # A tibble: 6 x 9
##    Year index_nsa `City, State` Population `Violent Crimes` Homicides Rapes
##   <dbl>     <dbl> <chr>              <dbl>            <dbl>     <dbl> <dbl>
## 1  1975      41.1 Atlanta, GA       490584             8033       185   443
## 2  1975      30.8 Chicago, IL      3150000            37160       818  1657
## 3  1975      36.4 Cleveland, OH     659931            10403       288   491
## 4  1975      20.9 Oakland, CA       337748             5900       111   316
## 5  1975      20.4 Seattle, WA       503500             3971        52   324
## 6    NA      NA   <NA>                  NA               NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>

First let’s do some wrangling. There is one column that combines both City and State information. Let’s separate that information out into two new columns called “City” and “State” using the function separate(). Then have a look at what you now have. How has the output of head(crime) changed from above?

# Split on the comma-plus-space only, so multi-word city names stay in one piece
crime <- separate(crime, "City, State", into = c("City", "State"), sep = ", ")
head(crime)
## # A tibble: 6 x 10
##    Year index_nsa City  State Population `Violent Crimes` Homicides Rapes
##   <dbl>     <dbl> <chr> <chr>      <dbl>            <dbl>     <dbl> <dbl>
## 1  1975      41.1 Atla… GA        490584             8033       185   443
## 2  1975      30.8 Chic… IL       3150000            37160       818  1657
## 3  1975      36.4 Clev… OH        659931            10403       288   491
## 4  1975      20.9 Oakl… CA        337748             5900       111   316
## 5  1975      20.4 Seat… WA        503500             3971        52   324
## 6    NA      NA   <NA>  <NA>          NA               NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
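
By default, separate() splits on any run of non-alphanumeric characters, so a city name that contains a space would be split in the wrong place. The sketch below compares the default with an explicit sep = ", ". It uses a made-up two-row data frame (toy is a hypothetical name, not part of the crime data).

```r
library(tidyr)

# Hypothetical toy data: one single-word and one multi-word city name
toy <- data.frame(`City, State` = c("Atlanta, GA", "New York, NY"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Default separator: any run of non-alphanumeric characters, so the space
# inside "New York" also counts as a split point (tidyr warns and drops "NY")
split_default <- suppressWarnings(
  separate(toy, "City, State", into = c("City", "State"))
)

# Explicit separator: split only on the comma followed by a space
split_explicit <- separate(toy, "City, State", into = c("City", "State"),
                           sep = ", ")

split_default$City   # "Atlanta" "New"
split_explicit$City  # "Atlanta" "New York"
```

If every city name in a dataset is a single word, the two calls give the same result. The explicit separator is simply the safer choice.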

Now let’s rename the columns to change the name of the “index_nsa” column (which is column 2) to “House_price” and get rid of the space in the “Violent Crimes” heading (which is column 6). See how the output of head(crime) has changed again?

colnames(crime)[2] <- "House_price"
colnames(crime)[6] <- "Violent_Crimes"
head(crime)
## # A tibble: 6 x 10
##    Year House_price City  State Population Violent_Crimes Homicides Rapes
##   <dbl>       <dbl> <chr> <chr>      <dbl>          <dbl>     <dbl> <dbl>
## 1  1975        41.1 Atla… GA        490584           8033       185   443
## 2  1975        30.8 Chic… IL       3150000          37160       818  1657
## 3  1975        36.4 Clev… OH        659931          10403       288   491
## 4  1975        20.9 Oakl… CA        337748           5900       111   316
## 5  1975        20.4 Seat… WA        503500           3971        52   324
## 6    NA        NA   <NA>  <NA>          NA             NA        NA    NA
## # … with 2 more variables: Assaults <dbl>, Robberies <dbl>
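
Renaming by position works, but it would rename the wrong column if the column order ever changed. A minimal alternative sketch uses rename() from dplyr (part of the tidyverse) to rename by name. The one-row toy data frame is invented for illustration and only mimics the relevant column names.

```r
library(dplyr)

# Hypothetical one-row data frame with the same awkward column names
toy <- data.frame(Year = 1975, index_nsa = 41.1, `Violent Crimes` = 8033,
                  check.names = FALSE)

# rename() takes new_name = old_name; backticks are needed for names with spaces
toy_renamed <- rename(toy,
                      House_price = index_nsa,
                      Violent_Crimes = `Violent Crimes`)

names(toy_renamed)  # "Year" "House_price" "Violent_Crimes"
```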

We might first expect that as population size increases, the number of violent crimes also increases. Let’s start by building a scatter plot.

crime %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) + 
  geom_point() + 
  geom_smooth(method = "lm")
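
The straight line added by geom_smooth(method = "lm") is an ordinary least-squares fit of y on x. As a rough sketch, the same kind of fit can be obtained directly with lm(). Here it is run on only the five complete rows visible in the head() output above (first_rows is a name chosen for this illustration), so the coefficients will not match a fit to the full dataset.

```r
# The five complete rows shown in the head() output, typed in by hand
first_rows <- data.frame(
  Population     = c(490584, 3150000, 659931, 337748, 503500),
  Violent_Crimes = c(8033, 37160, 10403, 5900, 3971)
)

# Regress the number of violent crimes on population size
fit <- lm(Violent_Crimes ~ Population, data = first_rows)

# Intercept and slope of the fitted line
coef(fit)
```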

This plot looks pretty interesting. How about calculating Pearson’s r?

rcorr(crime$Population, crime$Violent_Crimes)
##      x    y
## x 1.00 0.81
## y 0.81 1.00
## 
## n
##      x    y
## x 1714 1708
## y 1708 1708
## 
## P
##   x  y 
## x     0
## y  0
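
The correlation matrix above reports r = 0.81. Squaring r gives the proportion of variance in one variable that is linearly accounted for by the other. A quick check, using the rounded value printed by rcorr() (so the result is itself approximate):

```r
r <- 0.81          # rounded value taken from the rcorr() output
r_squared <- r^2   # proportion of shared variance
round(r_squared, 2)  # 0.66
```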

Look at the r and p-values: r = .81 and p < .001 (rcorr() prints the p-value as 0 because it is rounded). Squaring r gives r² ≈ .66, so about 66% of the variance in our Violent_Crimes variable is explained by our Population variable. Clearly there is a positive relationship between population size and the number of violent crimes. From the plot, we might suspect that the relationship is being overly influenced by crime in a small number of very large cities (top right of the plot above). Let’s exclude cities with populations of 2,000,000 or more.

crime_filtered <- filter(crime, Population < 2000000)
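
One detail worth knowing: filter() keeps only rows where the condition is TRUE, so rows with a missing Population (like the all-NA row seen in the head() output) are dropped as well. A toy sketch with invented values:

```r
library(dplyr)

# Hypothetical cities: one small, one very large, one with missing population
toy <- data.frame(City = c("A", "B", "C"),
                  Population = c(500000, 3150000, NA),
                  stringsAsFactors = FALSE)

# NA < 2000000 evaluates to NA, which filter() treats as "do not keep"
kept <- filter(toy, Population < 2000000)

kept$City  # "A"
```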

Now let’s redo the plot:

crime_filtered %>%
  ggplot(aes(x = Population, y = Violent_Crimes)) + 
  geom_point() + 
  geom_smooth(method = "lm")