You’re going to have a go at building your own script using RStudio and doing some data wrangling and visualisation of a dataset.
If this is your first time using RStudio, remember to check the Preferences window and ensure you have unchecked the restore .RData to workspace at startup option and have selected never for Save Workspace to .RData on exit
Before you start, remember to make a new project file - call it something like “week3”. Save the project folder somewhere sensible (maybe create a new folder called something like R analyses) - the script that you write and any data that you save or want to import should be in the folder associated with your project.
We’re going to use a couple of packages - the Tidyverse packages, the new gganimate package, and the NHANES package (which contains the NHANES dataset you’re going to use). If you haven’t used or installed these packages yet, you need to do that first by typing into the console:
install.packages("tidyverse")
install.packages("NHANES")
You only ever have to do this installation once - unless you update R. Packages themselves update regularly - the easiest way to check whether you’re running the most up to date versions is by clicking on “Packages”" in the right hand pane - then click on “Update”. This will give you the option to update individual packages, or all of them.
To begin with, you need to select a new R Script under the File -> New File menu option (see below). You’re going to write all of the following R coded in that script window.
library(tidyverse)
library(NHANES)
We’re going to start by using the NHANES dataset. This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).
First we’re going to explore the NHANES dataset.
colnames(NHANES)
## [1] "ID" "SurveyYr" "Gender"
## [4] "Age" "AgeDecade" "AgeMonths"
## [7] "Race1" "Race3" "Education"
## [10] "MaritalStatus" "HHIncome" "HHIncomeMid"
## [13] "Poverty" "HomeRooms" "HomeOwn"
## [16] "Work" "Weight" "Length"
## [19] "HeadCirc" "Height" "BMI"
## [22] "BMICatUnder20yrs" "BMI_WHO" "Pulse"
## [25] "BPSysAve" "BPDiaAve" "BPSys1"
## [28] "BPDia1" "BPSys2" "BPDia2"
## [31] "BPSys3" "BPDia3" "Testosterone"
## [34] "DirectChol" "TotChol" "UrineVol1"
## [37] "UrineFlow1" "UrineVol2" "UrineFlow2"
## [40] "Diabetes" "DiabetesAge" "HealthGen"
## [43] "DaysPhysHlthBad" "DaysMentHlthBad" "LittleInterest"
## [46] "Depressed" "nPregnancies" "nBabies"
## [49] "Age1stBaby" "SleepHrsNight" "SleepTrouble"
## [52] "PhysActive" "PhysActiveDays" "TVHrsDay"
## [55] "CompHrsDay" "TVHrsDayChild" "CompHrsDayChild"
## [58] "Alcohol12PlusYr" "AlcoholDay" "AlcoholYear"
## [61] "SmokeNow" "Smoke100" "Smoke100n"
## [64] "SmokeAge" "Marijuana" "AgeFirstMarij"
## [67] "RegularMarij" "AgeRegMarij" "HardDrugs"
## [70] "SexEver" "SexAge" "SexNumPartnLife"
## [73] "SexNumPartYear" "SameSex" "SexOrientation"
## [76] "PregnantNow"
We can see there are 76 columns of data.
nrow(NHANES)
## [1] 10000
And 10,000 rows. If we use the function head()
we can see the first few rows of the dataframe.
head(NHANES)
## # A tibble: 6 x 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 2 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 3 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
## 5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col…
## 6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
## # … with 67 more variables: MaritalStatus <fct>, HHIncome <fct>,
## # HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>, HomeOwn <fct>,
## # Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>, Height <dbl>,
## # BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
## # BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
## # BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## # Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>,
## # UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>,
## # DiabetesAge <int>, HealthGen <fct>, DaysPhysHlthBad <int>,
## # DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>,
## # nPregnancies <int>, nBabies <int>, Age1stBaby <int>,
## # SleepHrsNight <int>, SleepTrouble <fct>, PhysActive <fct>,
## # PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
## # TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
## # Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
## # RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
## # SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>,
## # SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>
It looks like some participants appear more than once in the dataset - the first few rows are all for participant ID 51624 - so looks like there is duplicate data. We can use the select()
function alongwith the n_distinct()
function to tell us the unique number of IDs in the dataset.
NHANES %>%
select(ID) %>%
n_distinct()
## [1] 6779
We see we have 6,779 unique individuals. Let’s tidy our data to remove duplicate IDs. Note that below we’re using the pipe operator %>%
You can read it as ‘and then’ so it means we’re taking the NHANES dataset and then filtering it keep just rows with distinct ID numbers. The pipe operator really helps with the readability of your data wrangling code and is an integral part of the tidyverse philosophy - tidy data and tidy code.
NHANES_tidied <- NHANES %>%
distinct(ID, .keep_all = TRUE)
nrow(NHANES_tidied)
## [1] 6779
NHANES_tidied
## # A tibble: 6,779 x 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
## 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
## 2 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
## 3 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col…
## 4 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
## 5 51646 2009_10 male 8 " 0-9" 101 White <NA> <NA>
## 6 51647 2009_10 female 45 " 40-49" 541 White <NA> College …
## 7 51654 2009_10 male 66 " 60-69" 795 White <NA> Some Col…
## 8 51656 2009_10 male 58 " 50-59" 707 White <NA> College …
## 9 51657 2009_10 male 54 " 50-59" 654 White <NA> 9 - 11th…
## 10 51659 2009_10 female 10 " 10-19" 123 White <NA> <NA>
## # … with 6,769 more rows, and 67 more variables: MaritalStatus <fct>,
## # HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
## # HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
## # Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
## # Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>,
## # BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>,
## # Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>,
## # UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>,
## # DiabetesAge <int>, HealthGen <fct>, DaysPhysHlthBad <int>,
## # DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>,
## # nPregnancies <int>, nBabies <int>, Age1stBaby <int>,
## # SleepHrsNight <int>, SleepTrouble <fct>, PhysActive <fct>,
## # PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
## # TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
## # Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
## # RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
## # SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>,
## # SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>
OK, so our tidied dataset is assigned to the variable NHANES_tidied and has 6,779 rows - as we’d expect given we have 6,779 unique individuals.
Let’s start exploring the data. We have lots of potential variables and relationships to explore. I see we have one labelled Education
which is coded as a factor. We also have information related to health such as BMI - first of all lets plot a histogram of BMI.
NHANES_tidied %>%
ggplot(aes(x = BMI)) +
geom_histogram(bins = 100, na.rm = TRUE)
We see a pretty right skewed distribution here. Note our first use of the na.rm parameter - this parameter appears in many tidyverse functions and by setting it to TRUE we tell R to ignore any parts of our dataframe where we have missing data (which is indicated by NA).
Does BMI vary as a function of Education level? In the code below we are using the data stored in the variable NHANES_tidied, grouping it by Education, then summarising to generate the median BMI for each of our groups. Again, we use the na.rm = TRUE
parameter with the summarise()
function this time to remove any missing values (NA) from our calculation.
NHANES_tidied %>%
group_by(Education) %>%
summarise(median = median(BMI, na.rm = TRUE))
## # A tibble: 6 x 2
## Education median
## <fct> <dbl>
## 1 8th Grade 28.6
## 2 9 - 11th Grade 28.2
## 3 High School 28.2
## 4 Some College 28.4
## 5 College Grad 26.5
## 6 <NA> 18.9
So it looks like those with College eduction have the lowest median BMI (ignoring the NA category which corresponds to cases where we don’t have Education level recorded).
Let’s graph it! Note here we’re filtering out cases where we don’t have BMI value recorded. The function is.na()
returns TRUE when applied to a case of missing data (NA) - we use the !
operator to negate this and combine several of these expressions together using the logical AND operator &
.
The line of code below starting with filter()
means filter cases where Education is not missing AND BMI is not missing. This means that the NHANES_tidied data that gets passed to the ggplot()
call has no missing data for the key variables we’re interested in.
I then add a geom_boxplot()
layer to create a boxplot for each level of our Education factor - and I’m arranging the order in which they’re displayed by using the fct_reorder()
function to do the re-ordering based on the median BMI for the factor Education.
The guides(colour = FALSE)
call supresses displaying the colour legend - delete it and rerun the code to see what changes.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(BMI)) %>%
group_by(Education) %>%
ggplot(aes(x = fct_reorder(Education, BMI, median), y = BMI, colour = Education)) +
geom_violin() +
geom_jitter(alpha = .2, width = .1) +
geom_boxplot(alpha = .5) +
guides(colour = FALSE) +
labs(title = "Examining the effect of education level on median BMI",
x = "Education Level",
y = "BMI")
We can also plot histograms of BMI separately for each Education level - we use the facet_wrap()
function to do this.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(BMI)) %>%
group_by(Education) %>%
ggplot(aes(x = BMI, fill = Education)) +
geom_histogram() +
guides(fill = FALSE) +
labs(title = "Examining the effect of education level on BMI",
x = "BMI",
y = "Number of cases") +
facet_wrap(~ Education)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the above graph, notice that the same y-axis scale is used for each plot - this makes comparisons a little tricky as there are different numbers of cases for each Eduction level. Add the following scales = "free"
after Education
in the facet_wrap line. What changes?
Instead of generating the histograms using a count, we could generate them using a density function. Let’s also add a density curve.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(BMI)) %>%
group_by(Education) %>%
ggplot(aes(x = BMI, fill = Education)) +
geom_histogram(aes(y = ..density..)) +
geom_density(aes(y = ..density..)) +
guides(fill = FALSE) +
labs( title = "Examining the effect of education level on BMI",
x = "BMI",
y = "Density") +
facet_wrap(~ Education)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What about whether people are working or not?
NHANES_tidied %>%
group_by(Work) %>%
summarise(median = median(BMI, na.rm = TRUE))
## # A tibble: 4 x 2
## Work median
## <fct> <dbl>
## 1 Looking 28.0
## 2 NotWorking 27.6
## 3 Working 27.3
## 4 <NA> 17.7
Here, numerically at least, people who are working have the lowest median BMI. We could now combine our two group_by()
calls into one:
NHANES_tidied %>%
group_by(Education, Work) %>%
summarise(median = median(BMI, na.rm = TRUE))
## # A tibble: 20 x 3
## # Groups: Education [6]
## Education Work median
## <fct> <fct> <dbl>
## 1 8th Grade Looking 27.1
## 2 8th Grade NotWorking 28.8
## 3 8th Grade Working 28.4
## 4 8th Grade <NA> 28
## 5 9 - 11th Grade Looking 29.5
## 6 9 - 11th Grade NotWorking 28.4
## 7 9 - 11th Grade Working 27.9
## 8 High School Looking 28.3
## 9 High School NotWorking 28.2
## 10 High School Working 28.1
## 11 Some College Looking 29.8
## 12 Some College NotWorking 28.4
## 13 Some College Working 28.4
## 14 College Grad Looking 26.8
## 15 College Grad NotWorking 26.6
## 16 College Grad Working 26.5
## 17 <NA> Looking 25.5
## 18 <NA> NotWorking 24.1
## 19 <NA> Working 23.2
## 20 <NA> <NA> 17.7
This is a little hard to read so let’s do a couple of things, first we’re going to filter out cases of missing data indicated by NA - we need to do that both for the Education
variable and for the Work
variable.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work)) %>%
group_by(Education, Work) %>%
summarise(median = median(BMI, na.rm = TRUE))
## # A tibble: 15 x 3
## # Groups: Education [5]
## Education Work median
## <fct> <fct> <dbl>
## 1 8th Grade Looking 27.1
## 2 8th Grade NotWorking 28.8
## 3 8th Grade Working 28.4
## 4 9 - 11th Grade Looking 29.5
## 5 9 - 11th Grade NotWorking 28.4
## 6 9 - 11th Grade Working 27.9
## 7 High School Looking 28.3
## 8 High School NotWorking 28.2
## 9 High School Working 28.1
## 10 Some College Looking 29.8
## 11 Some College NotWorking 28.4
## 12 Some College Working 28.4
## 13 College Grad Looking 26.8
## 14 College Grad NotWorking 26.6
## 15 College Grad Working 26.5
So now we only have levels of each of our two factors where we have data - but the order in which they’re presented isn’t helpful - let’s order such that we go from highest median BMI to lowest - we use the arrange()
function to arrange by descending median values.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work)) %>%
group_by(Education, Work) %>%
summarise(median = median(BMI, na.rm = TRUE)) %>%
arrange(desc(median))
## # A tibble: 15 x 3
## # Groups: Education [5]
## Education Work median
## <fct> <fct> <dbl>
## 1 Some College Looking 29.8
## 2 9 - 11th Grade Looking 29.5
## 3 8th Grade NotWorking 28.8
## 4 8th Grade Working 28.4
## 5 Some College NotWorking 28.4
## 6 Some College Working 28.4
## 7 9 - 11th Grade NotWorking 28.4
## 8 High School Looking 28.3
## 9 High School NotWorking 28.2
## 10 High School Working 28.1
## 11 9 - 11th Grade Working 27.9
## 12 8th Grade Looking 27.1
## 13 College Grad Looking 26.8
## 14 College Grad NotWorking 26.6
## 15 College Grad Working 26.5
So we see those with highest median BMI have Some College Education, and are looking for work, while those with lowest have College Grad education and are in work.
Let’s graph it! Note here we’re again filtering out cases where we don’t have BMI value recorded. The line of code below starting with filter()
means filter cases where Education is not missing AND Work is not missing AND BMI is not missing. This means that the NHANES_tidied data that gets passed to the ggplot()
call has no missing data for any of the key variables we’re interested in.
Again I am adding a geom_boxplot()
layer to create a box plot for each combination of our factors Education and Work - and I’m arranging the order in which they’re displayed by using the fct_reorder()
function to do the re-ordering based on the median BMI for each combination of factors. I then use the coord_flip()
function to flip the x and y-axes to make the graph more readable.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work) & !is.na(BMI)) %>%
ggplot(aes(x = fct_reorder(Education:Work, BMI, median),
y = BMI,
colour = Education:Work)) +
geom_boxplot() +
coord_flip() +
guides(colour = FALSE) +
labs(title = "Examining the effect of education level and employment \nstatus on BMI",
x = "Education X Working",
y = "BMI")
As we’ve reordered based on the median BMI, it’s a little hard to see any patterns as we have two factors involved in that reordering - Education and Work. So let’s simplify and look at Work status first (reordering on the basis of median BMI). We can see that the highest median BMI group are those looking for work (and the lowest those in work).
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work) & !is.na(BMI)) %>%
ggplot(aes(x = fct_reorder(Work, BMI, median),
y = BMI,
colour = Work)) +
geom_boxplot() +
coord_flip() +
guides(colour = FALSE) +
labs(title = "Examining the effect of employment status on BMI",
x = "Working",
y = "BMI")
Now let’s look at Education on BMI. Now we see that the lowest median BMI group are those who are college graduates.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work) & !is.na(BMI)) %>%
ggplot(aes(x = fct_reorder(Education, BMI, median),
y = BMI,
colour = Education)) +
geom_boxplot() +
coord_flip() +
guides(colour = FALSE) +
labs(title = "Examining the effect of education level on BMI",
x = "Education",
y = "BMI")
What about separate plots by race? Let’s first check how many values we have for the Race1
variable.
NHANES_tidied %>%
select(Race1) %>%
distinct()
## # A tibble: 5 x 1
## Race1
## <fct>
## 1 White
## 2 Other
## 3 Mexican
## 4 Black
## 5 Hispanic
OK, let’s filter out the Other race category and plot BMI first as a function of Race. For the last bit of the filter we use the operator !=
which means “not equal to” - so the last bit of the filter expression means AND (&) Race1 is not equal to “Other”.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(Work) & !is.na(BMI) & Race1 != "Other") %>%
ggplot(aes(x = fct_reorder(Race1, BMI, median),
y = BMI,
colour = Race1)) +
geom_boxplot() +
guides(colour = FALSE) +
labs(title = "Examining Race and BMI",
x = "Race",
y = "BMI")
What else might influence BMI? Instead of just looking at Race, might there be Gender differences? Have a go yourself redoing the above but exploring the question of how Gender instead might interact with Education level and Employment status to influence BMI. What about looking at Race AND Gender together? You can do that by changing the last line of the code above to: facet_wrap(~ Race1:Gender)
What else might be worth exploring in the NHANES dataset?
Finally, we’re going to create some animated graphs using the gganimate package. You need to download it first and then install it.
install.packages("gganimate")
library(gganimate)
You might also need to install the gifski
and png
packages - you should be prompted automatically if so.
install.packages("gifski")
install.packages("png")
We’re going to plot some BMI by Education level violin and box plots animated by Race.
NHANES_tidied %>%
filter(!is.na(Education) & !is.na(BMI)) %>%
group_by(Education) %>%
ggplot(aes(x = fct_reorder(Education, BMI, median), y = BMI, colour = Education)) +
geom_violin() +
geom_boxplot(alpha = .5) +
guides(colour = FALSE) +
labs(title = "Examining the effect of education level on median BMI for Race = {closest_state}",
x = "Education Level",
y = "BMI") +
transition_states(Race1, transition_length = 2, state_length = 2) +
ease_aes("linear")
These graphs are great as they can be saved as animated gif files and then used in presentations. To save the last animated plot you built, you can use the anim_save()
function such as > anim_save(filename = "my_plot.gif")