class: center, middle, title-slide

# Reproducible Data Visualisations
### Andrew Stewart
University of Manchester
Division of Neuroscience and Experimental Psychology
Software Sustainability Institute Fellow
Twitter: @ajstewart_lang<br>GitHub: github.com/ajstewartlang
### (updated: 2019-06-04)

---
class: center

# Science that can be replicated <i>vs.</i> science that can be reproduced

.left[
<b>Replicable Science</b> is when someone else can run a study the same as or conceptually equivalent to your one, and find a similar pattern of effects.

<b>Reproducible Science</b> is when someone else can take your data and your analysis code, run it and then find the same effects that you have reported.
]

--

<b>How do we make our science more replicable? How do we make our science more reproducible?</b>

---
# A move towards open science…

You really should read this book!

.pull-left[
<img src="images/deadly.jpg" width="80%" />
]

.pull-right[<br><br>Sins include <i>p</i>-hacking, lack of power, HARKing, failing (refusal) to share data and code, too many researcher degrees of freedom…
]

---
<img src="images/Gelman.jpg" width="80%" />

http://www.stat.columbia.edu/~gelman/

Andrew Gelman gives the following recommendations to researchers:

- Analyze all your data.
- Present all your comparisons.
- Put in the effort to take accurate measurements (low bias, low variance, and a large enough sample size).
- Do repeated-measures comparisons where possible.
- Make your data public.

But it's not just the data you need to make public, but also your <b>code</b>!

---
# What role can R play in Open and Reproducible Science?

- R scripts are easy to share allowing for reproducibility and easy public sharing of data and code.
- R is free, open source software that is much more flexible and powerful than SPSS.
- There is an active R community continuously updating statistical tests and packages that run in R.
- As R is a programming language, it forces you to <b>know</b> your data.

---
# R <i>vs.</i> SPSS

.middle[
“SPSS is like a bus - easy to use for the standard things, but very frustrating if you want to do something that is not already pre-programmed.
R is a 4-wheel drive off-roader, with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.” <i>(Greg Snow, 2010, stackoverflow.com)</i>
]

---
# In meme form…

<img src="images/meme.png" width="1203" />

---
# A workflow for reproducible science in the Tidyverse

.center[
<img src="images/tidyverse.png" width="90%" />
]

---
# A workflow for reproducible science in the Tidyverse

<img src="images/tidyflow.png" width="1200" />

https://www.tidyverse.org

---
class: center, middle

# Why Data Visualisation is Important

---
class: center, middle

# Anscombe's Quartet

---
# Plot 1

.pull-left[
![](talk_xaringan_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

.pull-right[

```
## [1] "Mean of X is: 9"
```

```
## [1] "SD of X is: 3.32"
```

```
## [1] "Mean of Y is: 7.5"
```

```
## [1] "SD of Y is: 2.03"
```
]

```
## [1] "Pearson's r is 0.82"
```

---
# Plot 2

.pull-left[
![](talk_xaringan_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

.pull-right[

```
## [1] "Mean of X is: 9"
```

```
## [1] "SD of X is: 3.32"
```

```
## [1] "Mean of Y is: 7.5"
```

```
## [1] "SD of Y is: 2.03"
```
]

```
## [1] "Pearson's r is 0.82"
```

---
# Plot 3

.pull-left[
![](talk_xaringan_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]

.pull-right[

```
## [1] "Mean of X is: 9"
```

```
## [1] "SD of X is: 3.32"
```

```
## [1] "Mean of Y is: 7.5"
```

```
## [1] "SD of Y is: 2.03"
```
]

```
## [1] "Pearson's r is 0.82"
```

---
# Plot 4

.pull-left[
![](talk_xaringan_files/figure-html/unnamed-chunk-16-1.png)<!-- -->
]

.pull-right[

```
## [1] "Mean of X is: 9"
```

```
## [1] "SD of X is: 3.32"
```

```
## [1] "Mean of Y is: 7.5"
```

```
## [1] "SD of Y is: 2.03"
```
]

```
## [1] "Pearson's r is 0.82"
```

---
# Plots Based on Aggregated Data Can Mislead…

.center[

```r
ggplot(data1, aes(x = Group, y = RT)) +
  geom_boxplot()
```

![](talk_xaringan_files/figure-html/unnamed-chunk-20-1.png)<!-- -->
]

---
# But look more closely at the actual data…

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-21-1.png)<!-- -->
]

---
# The distribution of data matters

The data on the previous slide are clearly bimodal with no data point near the mean. Distribution shape matters and we need to capture that in our data visualisations.

If we only plotted and reported information related to aggregated data, we would not be honest about what our data look like.

---
# Reasons for visualising data

--

For yourself - once you have collected your data, you should visualise it before you build any statistical models - do the data look (roughly) as expected with the right number of data points?

--

For others - when you present your work in a talk, on a poster, or in a published paper you want the viewer to be able to quickly and unambiguously extract the intended meaning from your visualisation.

--

Just as the reproducibility of statistical models is important in the context of engaging in open and reproducible science, so too is the reproducibility of data visualisations.

---
# ggplot2

The ggplot2 package is part of the Tidyverse and is based around the Grammar of Graphics (Wickham, 2010): https://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf

Start with defining your data and aesthetics of the plot, before adding geometric objects (geoms), information about labelling, faceting etc.
Each plot can be built up gradually, layer by layer like the following:

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-23-1.png)<!-- -->
]

.bottom[

```r
ggplot(data_long, aes(x = Condition, y = RT)) +
  geom_jitter(alpha = .25, position = position_jitter(0.05))
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-25-1.png)<!-- -->
]

.bottom[

```r
ggplot(data_long, aes(x = Condition, y = RT)) +
  geom_jitter(alpha = .25, position = position_jitter(0.05)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-27-1.png)<!-- -->
]

.bottom[

```r
ggplot(data_long, aes(x = Condition, y = RT)) +
  geom_jitter(alpha = .25, position = position_jitter(0.05)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1) +
  geom_violin(aes(fill = Condition), alpha = .2)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-29-1.png)<!-- -->
]

.bottom[

```r
ggplot(data_long, aes(x = Condition, y = RT)) +
  geom_jitter(alpha = .25, position = position_jitter(0.05)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1) +
  geom_violin(aes(fill = Condition), alpha = .2) +
  guides(fill = FALSE)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-31-1.png)<!-- -->
]

.bottom[

```r
ggplot(data_long, aes(x = Condition, y = RT)) +
  geom_jitter(alpha = .25, position = position_jitter(0.05)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "black", size = 1) +
  geom_violin(aes(fill = Condition), alpha = .2) +
  guides(fill = FALSE) +
  coord_flip()
```
]

---
# Violin Plots

These are violin plots, an example of an RDI plot: they capture the Raw data, information about the Distribution, and some Inferential statistics (e.g., Confidence Intervals).
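The confidence intervals drawn by `stat_summary(fun.data = "mean_cl_boot")` on the earlier slides are bootstrap intervals. As a rough sketch of the idea only (this is not the Hmisc code that ggplot2 calls), a percentile bootstrap for the mean can be written in base R. The function name `boot_mean_ci` and the simulated reaction times are made up for illustration:

```r
# Percentile bootstrap confidence interval for a mean (base R only).
# boot_mean_ci is a hypothetical helper, not part of ggplot2 or Hmisc.
boot_mean_ci <- function(x, n_boot = 1000, conf = 0.95) {
  x <- x[!is.na(x)]
  # Resample the data with replacement and keep the mean of each resample
  boot_means <- replicate(n_boot, mean(sample(x, size = length(x), replace = TRUE)))
  # The interval is given by the matching quantiles of the resampled means
  limits <- quantile(boot_means, probs = c((1 - conf) / 2, (1 + conf) / 2), names = FALSE)
  c(y = mean(x), ymin = limits[1], ymax = limits[2])
}

set.seed(1234)
rt <- rnorm(50, mean = 1000, sd = 100)  # simulated reaction times, illustration only
ci <- boot_mean_ci(rt)
print(round(ci, 1))
```

The returned names (`y`, `ymin`, `ymax`) follow the shape that `stat_summary()` expects from a `fun.data` function, so a helper like this could in principle be passed to it directly.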
We can modify other characteristics of the plot such as the colour palette we're using, the orientation, and we can also add some labels:

![](talk_xaringan_files/figure-html/unnamed-chunk-34-1.png)<!-- -->

---
# Building interactive visualisations using the plotly package

.center[
]

---
# Raincloud Plots

![](talk_xaringan_files/figure-html/unnamed-chunk-36-1.png)<!-- -->

.footnote[Allen M, Poggiali D, Whitaker K et al. Raincloud plots: a multi-platform tool for robust data visualization [version 1; peer review: 2 approved]. Wellcome Open Res 2019, 4:63 https://doi.org/10.12688/wellcomeopenres.15191.1
]

---
# Using Different Themes

You can change the ggplot theme to one of a number of built-in ones (or define your own). On the next slides you'll see the same plot with (a) the Economist theme, (b) the fivethirtyeight theme, (c) the Tufte theme, and (d) the solarized theme. Below is the plot with the default theme.

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-37-1.png)<!-- -->
]

---
class: center, middle

![](talk_xaringan_files/figure-html/unnamed-chunk-38-1.png)<!-- -->

---
class: center, middle

![](talk_xaringan_files/figure-html/unnamed-chunk-39-1.png)<!-- -->

---
class: center, middle

![](talk_xaringan_files/figure-html/unnamed-chunk-40-1.png)<!-- -->

---
class: center, middle

![](talk_xaringan_files/figure-html/unnamed-chunk-41-1.png)<!-- -->

---
# The BBC Cookbook

The BBC (like many other organisations such as the FT) use R and ggplot to generate their data visualisations. They have even published their own style guide and code for their BBC data visualisation theme.

https://bbc.github.io/rcookbook/

.center[
<img src="images/bbc.png" width="80%" />
]

---
# World Happiness Data

We can have a look at the World Happiness dataset that measures Happiness (called Life Ladder) and a bunch of other things (e.g., GDP) across countries over time.
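Before plotting a dataset like this, it can help to check numerically how much of each column is missing, which is roughly what `vis_miss()` shows graphically. A minimal base R sketch follows. The data frame `toy_happy` and all of its values are invented for illustration and are not taken from the World Happiness dataset:

```r
# A small stand-in data frame. The column names mirror those used in this talk,
# but the countries and values are invented for illustration.
toy_happy <- data.frame(
  country = c("A", "B", "C", "D"),
  year = c(2016, 2016, 2015, 2016),
  `Life Ladder` = c(7.0, 6.0, NA, 5.0),
  `Social support` = c(0.9, NA, NA, 0.8),
  check.names = FALSE  # keep the spaces in the column names
)

# Proportion of missing values in each column
prop_missing <- colMeans(is.na(toy_happy))
print(prop_missing)
```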
.center[

```r
# vis_dat() and vis_miss() come from the visdat package
vis_dat(happy_data)
```

![](talk_xaringan_files/figure-html/unnamed-chunk-44-1.png)<!-- -->
]

---
.center[

```r
vis_miss(happy_data)
```

![](talk_xaringan_files/figure-html/unnamed-chunk-45-1.png)<!-- -->
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-46-1.png)<!-- -->
]

.bottom[

```r
# theme_tufte() and theme_fivethirtyeight() come from the ggthemes package
happy_data %>%
  group_by(country) %>%
  filter(!is.na(`Life Ladder`) & year == 2016) %>%
  summarise(score = `Life Ladder`) %>%
  mutate(country = reorder(country, score)) %>%
  top_n(20, score) %>%
  ggplot(aes(x = score, y = country)) +
  geom_point() +
  labs(x = "Happiness Index Score",
       y = "Country",
       title = "Top 20 Happiest Countries in 2016") +
  theme_tufte(base_size = 15)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-48-1.png)<!-- -->
]

.bottom[

```r
country_list <- c("United Kingdom", "France", "Germany",
                  "Italy", "Norway", "United States")

happy_data %>%
  filter(country %in% country_list) %>%
  filter(!is.na(`Social support`)) %>%
  mutate(score = `Social support`) %>%
  mutate(country = reorder(country, score)) %>%
  ggplot(aes(y = score, x = country, fill = country)) +
  geom_boxplot(width = .5) +
  labs(y = "Boxplot of Social support",
       x = "Country",
       title = "Social support") +
  guides(fill = FALSE) +
  coord_flip() +
  theme_tufte(base_size = 15)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-50-1.png)<!-- -->
]

.bottom[

```r
happy_data %>%
  filter(country %in% country_list) %>%
  group_by(country) %>%
  mutate(score = `Life Ladder`) %>%
  ungroup() %>%
  mutate(country = reorder(country, score)) %>%
  ggplot(aes(x = country, y = score)) +
  geom_boxplot() +
  labs(x = "Country",
       y = "Happiness Index Score",
       title = "Average Happiness Index for 6 Countries") +
  coord_flip() +
  theme_fivethirtyeight(base_size = 15)
```
]

---
.top[
![](talk_xaringan_files/figure-html/unnamed-chunk-52-1.png)<!-- -->
]

.bottom[

```r
country_list <- c("United Kingdom", "France", "Germany",
                  "Italy", "Norway", "United States")

happy_data %>%
  filter(country %in% country_list) %>%
  group_by(year) %>%
  filter(!is.na(`Life Ladder`)) %>%
  ggplot(aes(x = year, y = `Life Ladder`)) +
  geom_line() +
  facet_wrap(~ country) +
  labs(x = "Year",
       y = "Happiness index",
       title = "Happiness Index over Time for 6 Countries") +
  theme_fivethirtyeight(base_size = 15) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
]

---
# Visualising Qualitative Data

Maybe you have lots of qualitative data and are interested in running a content analysis. In the next example, I'm examining all the text in HG Wells' The War of the Worlds.

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-54-1.png)<!-- -->
]

---

```r
# Get The War of the Worlds from Project Gutenberg (gutenbergr package) ####
titles <- "The War of the Worlds"

books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")

# Split the text into words and drop stop words (tidytext package)
text_waroftheworlds <- books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

text_waroftheworlds %>%
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col() +
  coord_flip() +
  guides(fill = FALSE) +
  labs(title = "Top 10 words in The War of the Worlds")

text_waroftheworlds_count <- text_waroftheworlds %>%
  count(word) %>%
  top_n(200, n)
```

---
.center[.middle[
![](talk_xaringan_files/figure-html/unnamed-chunk-56-1.png)<!-- -->
]]

---

```r
# wordcloud() is from the wordcloud package; brewer.pal() is from RColorBrewer
set.seed(1234)
wordcloud(words = text_waroftheworlds_count$word,
          freq = text_waroftheworlds_count$n,
          min.freq = 1, scale = c(3, 1),
          max.words = 125, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
```

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-58-1.png)<!-- -->
]

---
# Sentiment Analysis

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-60-1.png)<!-- -->
]

---

```r
# get_sentiments() is from the tidytext package
sentiments <- get_sentiments("bing")

word_counts <- text_waroftheworlds %>%
  inner_join(sentiments) %>%
  count(word, sentiment, sort = TRUE)
```

```r
word_counts %>%
  filter(n > 20) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment",
       title = "Sentiment Analysis of Words in The War of the Worlds")
```

---
# Visualising Data from Twitter

Scraping Twitter using the rtweet package for everyone's favourite progressive Swedish death metal band, Opeth! 🤘

<p>

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-64-1.png)<!-- -->
]

---
class: center

# Geospatial Plotting of Tweets

.center[
]

---
class: middle, center

# Teaching with Animations

---
# Data Simulations and Data Visualisation - Hockey Game Simulation

Imagine a hockey game where we know that Team A scores exactly 1 goal for sure and Team B takes 20 shots, each with a 5.5% chance of going in. Which team would you rather be? (Nothing additional happens if you tie.)

```r
set.seed(1234)

# Simulate 10,000 games: Team B takes 20 shots, each scoring with probability 0.055
team_b_goals <- NULL
for(i in 1:10000) {
  score <- sum(sample(c(1, 0), size = 20, replace = TRUE, prob = c(0.055, 1-.055)))
  team_b_goals <- c(team_b_goals, score)}

# Team A always scores exactly 1 goal
team_a_goals <- rep(1, 10000)

all_games <- as_tibble(cbind(team_a_goals, team_b_goals))
```

---

```
## [1] "Games where Team A scores more goals than Team B: 3145"
```

```
## [1] "Games where Team A scores fewer goals than Team B: 3050"
```

```
## [1] "Games where there is a tie : 3805"
```

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-70-1.gif)<!-- -->
]

---
# Illustrating sampling error with animations

On the left, N=20 per sample, with both conditions drawn from the <b>same</b> population. We <b>appear</b> to find differences between our conditions. On the right, when N=500, it's clear the two conditions are equivalent.

.pull-left[
![](talk_xaringan_files/figure-html/unnamed-chunk-71-1.gif)<!-- -->
]

.pull-right[
![](talk_xaringan_files/figure-html/unnamed-chunk-72-1.gif)<!-- -->
]

---
# Sample size of 25 when population r = .5

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-73-1.png)<!-- -->
]

---
class: center

![](talk_xaringan_files/figure-html/unnamed-chunk-74-1.gif)<!-- -->

---
# Sample size of 250 when population r = .5

.center[
![](talk_xaringan_files/figure-html/unnamed-chunk-75-1.png)<!-- -->
]

---
class: center

![](talk_xaringan_files/figure-html/unnamed-chunk-76-1.gif)<!-- -->

---
class: center, middle

# Make it Reproducible

---
class: middle

"<i>(visualisations)</i> should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer...
...the moment you manually edit a figure, your final figure becomes irreproducible. A third party cannot generate the exact same figure you did. Interactive plot programs are a bad idea. They inherently force you to manually prepare your figures...be aware that Excel is an interactive plot program...and is not recommended for figure preparation (or data analysis)."

.pull-left[
.center[
<img src="images/wilke.png" width="200" />
]
]

.pull-right[
<i>Claus Wilke, Fundamentals of Data Visualization (2019). Page xiii.</i>
]

---
# Using OSF or GitHub

Both the Open Science Framework (https://osf.io) and GitHub (https://github.com) can be used to host your data and analysis code.

If you have pre-registered your study or have gone down the registered report route (where your research is unconditionally accepted for publication before you've started collecting data) then you can link your data and analysis code with the pre-registration itself.

However, if you just make your data and code publicly available, an interested reader still has to manually re-run your code to reproduce your results. They need to know which R version and which package versions you used in your analysis.

---
# Use Zenodo to give your data and code a DOI

.center[
<img src="images/zenodo.png" width="550" />
]

You can share this DOI alongside your output - this also allows others to cite your dataset if they use it.

---
# Using R Markdown

Using R Markdown you can generate a document that contains narrative, your code AND output, making your analysis transparent and allowing a reader to see your analysis, output, and text explaining what you did (and why).

You can use your R Markdown script to produce HTML, Word, or PDF documents. HTML documents allow you to embed animated graphs.
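One way to tell a reader which R version and package versions were used is to end the R Markdown document with a chunk that records them. The sketch below uses base R only. The list name `env_record` is made up, and `stats` and `utils` are stand-ins for whichever packages an analysis actually loads:

```r
# Record the software environment so that a reader can match it (base R only).
env_record <- list(
  r_version = R.version.string,
  platform  = R.version$platform,
  # "stats" and "utils" are placeholders; list the packages your analysis uses
  packages  = vapply(c("stats", "utils"),
                     function(pkg) as.character(packageVersion(pkg)),
                     character(1))
)
str(env_record)

# sessionInfo() gives a fuller report, including every attached package
print(sessionInfo())
```

Because the chunk is run every time the document is knitted, the recorded versions always match the environment that produced the output.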
---
# Example R Markdown Document

.center[
<img src="images/graph1.png" width="575" />
]

---
# Using R Markdown Templates

.pull-left[
A number of R Markdown templates are available that allow you to write entire papers and talks (like this one) in R Markdown.
<br><br>
The Papaja template (https://github.com/crsh) by Frederik Aust allows you to write APA formatted outputs and render in HTML, Word, and PDF format.

.center[
<img src="images/frederik.png" width="175" />
]
]

.pull-right[
This talk was written in R Markdown using the Xaringan Presentation Ninja package (https://github.com/yihui/xaringan) by Yihui Xie at RStudio.
<br><br>
When I 'knit' my Markdown file, all the data visualisations, Twitter scraping etc. happen on the fly as my slides are rendered.

.center[
<img src="images/yihui.png" width="175" />
]
]

---
# Interested in finding out more?

.pull-left[
<img src="images/r4ds.jpg" width="175" />
<img src="images/cover.jpg" width="175" />
]

.pull-right[
<img src="images/healy.jpg" width="175" />
<img src="images/wilke.png" width="175" />
]

---
<img src="images/tidytuesday.png" width="1241" />

---
# Thanks!

.pull-left[
<img src="images/ssi.png" width="1575" height="80%" />
]

.pull-right[
<img src="images/bsbr.jpg" width="640" height="70%" />
]

---
class: center

# A Fully Reproducible Talk (just add my accent)

All slides and R code used to generate these slides are available here:

https://github.com/ajstewartlang/ajstewartlang.github.io/tree/master/Lancaster_talk

<img src="images/github.svg" width="50%" />

.footnote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan), [**knitr**](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).
]