In any set of texts (such as books or interview transcripts) it’s often useful to be able to quantify key aspects of the constituent parts (e.g., words, phrases). For example, some types of language may be more common in one interview transcript than in another, and it can be useful to visualise the content of a particular text to compare it with others. In this session we are going to examine how to use the {tidytext} package in R to carry out some simple text analysis. We will count the occurrences of words in a text, run a basic sentiment analysis to see which sentiments are most common in a text, and use term frequency-inverse document frequency (tf-idf) to identify which words (or phrases) are most distinctive of one text compared to a set of others. The material I’m going to cover is very much based on the fantastic “Text Mining with R” book by Julia Silge and David Robinson. Scroll down to find a link to the book - or better still, buy it!
You can download the slides in .odp format by clicking here and in .pdf format by clicking on the image below.
To use the {tidytext} package we’ll load it alongside the {tidyverse} and {gutenbergr} libraries.
library(tidyverse)
library(tidytext)
library(gutenbergr)
Once we have loaded the appropriate packages into our session, we’ll download the four books we’re interested in (three by H.G. Wells, plus Jules Verne’s Twenty Thousand Leagues under the Sea) and assign them to the variable books. You can go to the Project Gutenberg site directly if you’d like to see what other books are available to download.
titles <- c("The War of the Worlds",
"The Time Machine",
"Twenty Thousand Leagues under the Sea",
"The Invisible Man: A Grotesque Romance")
books <- gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
Once the books are downloaded, we need to un-nest the words in each line so that the text of the books in our tibble is in tidy (long) format, with one word per row. We are also going to remove the “stop words”: very common words such as “the” and “of” that carry little meaning on their own. You can view them with view(stop_words) if you are interested in seeing which words are included. As stop_words is just a tibble with a word column and a lexicon column, you can also add other words to it if you like.
all_text <- books %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
all_text
## # A tibble: 91,676 x 3
## gutenberg_id title word
## <int> <chr> <chr>
## 1 35 The Time Machine time
## 2 35 The Time Machine machine
## 3 35 The Time Machine invention
## 4 35 The Time Machine contents
## 5 35 The Time Machine introduction
## 6 35 The Time Machine ii
## 7 35 The Time Machine machine
## 8 35 The Time Machine iii
## 9 35 The Time Machine time
## 10 35 The Time Machine traveller
## # … with 91,666 more rows
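If you do want to add your own stop words, a minimal sketch might look like the one below. The extra words ("chapter", "ii", "iii"), the "custom" lexicon label and the one-line toy text are all made up for illustration; in practice you would pass your own tibble of books instead.

```r
library(dplyr)
library(tidytext)

# Hypothetical extra stop words; "custom" is just an arbitrary lexicon label
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("chapter", "ii", "iii"), lexicon = "custom")
)

# A one-line toy text standing in for a downloaded book
toy <- tibble(text = "Chapter II the Time Machine")

# Tokenise into one (lower-cased) word per row, then drop the stop words
toy_words <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(my_stop_words, by = "word")

toy_words
```

Naming the join column with by = "word" also silences the “Joining, by” message.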
As our tibble is now in tidy format, we can easily extract and visualise summary statistics of the 10 most common words in each of the books in our tibble.
all_text %>%
filter(title == "The Time Machine") %>%
count(word, sort = TRUE) %>%
top_n(10)
## Selecting by n
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 time 207
## 2 machine 88
## 3 white 61
## 4 traveller 57
## 5 hand 49
## 6 morlocks 48
## 7 people 46
## 8 weena 46
## 9 found 44
## 10 light 43
all_text %>%
filter(title == "The Time Machine") %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = word)) +
geom_col() +
coord_flip() +
guides(fill = "none") +
labs(x = "Word",
y = "Count",
title = "Top 10 most commonly occurring words in The Time Machine") +
theme_minimal()
## Selecting by n
Have a go at changing the code above to summarise and visualise the top 10 words in one of the other books we have downloaded.
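If you would rather summarise every book at once, one possible approach is to group by title and keep the top words within each group. The sketch below uses slice_max(), which replaces top_n() in {dplyr} 1.0.0 and later, on a small made-up tibble (toy_text) standing in for all_text.

```r
library(dplyr)

# Toy stand-in for all_text: one row per word per book (made-up data)
toy_text <- tibble(
  title = c(rep("Book A", 6), rep("Book B", 6)),
  word  = c("time", "time", "time", "machine", "machine", "white",
            "sea", "sea", "sea", "captain", "captain", "nautilus")
)

# Count words within each book, then keep the 2 most common per book
top_words <- toy_text %>%
  count(title, word) %>%
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%
  ungroup()

top_words
```

With the real data you would swap toy_text for all_text and set n = 10; adding facet_wrap(~ title, scales = "free") to the plot would then give one panel per book.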
In this section we’re going to look at sentiment analysis. The {tidytext} package gives you access to four sentiment databases (or lexicons) via get_sentiments(). We’re going to use the bing database, which contains just under 7,000 words coded for whether their sentiment is positive or negative.
First we’re going to ‘join’ the tidy tibble containing the text of the four books we have downloaded with the sentiment coding associated with each of the words in our tibble.
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
all_text_sentiments <- all_text %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
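This gives us a new tibble that contains only those words from the text that appear in the bing database, plus a new column giving the sentiment of each word. We can plot the 25 most common of these words in the book “The War of the Worlds” via {ggplot2} using the code below. The counts for negative words are flipped to negative values so that the two sentiments point in opposite directions on the plot.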
all_text_sentiments %>%
filter(title == "The War of the Worlds") %>%
count(word, sentiment, sort = TRUE) %>%
top_n(25) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
#mutate(word = reorder(word, n)) %>%
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_col() +
coord_flip() +
labs(title = "Sentiment Analysis of Top 25 Words in The War of the Worlds",
x = "Word",
y = "Count")
## Selecting by n
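Beyond plotting individual words, the same joined tibble can be boiled down to a single net sentiment score per book (positive count minus negative count). The sketch below is one way of doing that, using a made-up tibble (toy_text) in place of all_text; the words in it were chosen because they appear in the bing database.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy stand-in for all_text (made-up data)
toy_text <- tibble(
  title = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book B"),
  word  = c("good", "happy", "terrible", "terrible", "fear", "good")
)

net_sentiment <- toy_text %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(title, sentiment) %>%
  # One column per sentiment, with 0 where a book has no words of that type
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

net_sentiment
```

A raw difference like this still depends on the length of each book, so for real comparisons you may prefer to divide by the number of matched words.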
In addition to looking at the frequencies of occurrence of each word in each book, we can ‘normalise’ things by looking at what usage proportion is associated with each word in each book. A simple frequency measure becomes distorted as the length of a book increases - calculating the proportions of each word in each book gets round this issue.
book_words <- all_text %>%
group_by(title) %>%
count(title, word, sort = TRUE)
total_words <- book_words %>%
group_by(title) %>%
summarise(total = sum(n))
book_words <- left_join(book_words, total_words)
book_words %>%
mutate(proportion = n/total) %>%
group_by(title) %>%
arrange(desc(title), desc(proportion)) %>%
top_n(3) %>%
select(-n, -total)
## # A tibble: 12 x 3
## # Groups: title [4]
## title word proportion
## <chr> <chr> <dbl>
## 1 Twenty Thousand Leagues under the Sea captain 0.0153
## 2 Twenty Thousand Leagues under the Sea nautilus 0.0131
## 3 Twenty Thousand Leagues under the Sea sea 0.00880
## 4 The War of the Worlds martians 0.00722
## 5 The War of the Worlds people 0.00704
## 6 The War of the Worlds black 0.00540
## 7 The Time Machine time 0.0184
## 8 The Time Machine machine 0.00781
## 9 The Time Machine white 0.00541
## 10 The Invisible Man: A Grotesque Romance kemp 0.0117
## 11 The Invisible Man: A Grotesque Romance invisible 0.00990
## 12 The Invisible Man: A Grotesque Romance door 0.00930
book_words %>%
ggplot(aes(x = n/total, fill = title)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~title, ncol = 2, scales = "free")
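You can see from the above that the proportion of each word in each book roughly obeys Zipf’s law: a small number of words occur very frequently while the vast majority occur only rarely, so that a word’s frequency is roughly inversely proportional to its rank.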
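A more direct way to check for a Zipf-like pattern is to rank the words in each book by frequency and look at frequency against rank on log scales, where an exactly inverse relationship shows up as a straight line with a slope of -1. The sketch below does this on made-up counts (toy_counts) that were constructed to follow 120/rank exactly; real books will only approximate this.

```r
library(dplyr)

# Made-up counts that follow n = 120 / rank exactly
toy_counts <- tibble(
  title = "Toy book",
  word  = c("w1", "w2", "w3", "w4"),
  n     = c(120, 60, 40, 30)
)

freq_by_rank <- toy_counts %>%
  group_by(title) %>%
  arrange(desc(n), .by_group = TRUE) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n)) %>%
  ungroup()

# Slope of log frequency on log rank
fit <- lm(log10(term_frequency) ~ log10(rank), data = freq_by_rank)
zipf_slope <- coef(fit)[["log10(rank)"]]
zipf_slope
```

With the real data you would start from book_words instead of toy_counts, and could plot rank against term_frequency with scale_x_log10() and scale_y_log10() to inspect the line by eye.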
The Term Frequency-Inverse Document Frequency (tf-idf) measure gives us insight into which words are uniquely associated with one book (or text) over another. This can be useful if we want to know what words characterise one book relative to others. Below, we work out the measure using the bind_tf_idf() function in {tidytext}.
book_words_tf_idf <- book_words %>%
bind_tf_idf(word, title, n)
book_words_tf_idf %>%
top_n(15, tf_idf) %>%
ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = title)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "Term Frequency-Inverse Document Frequency") +
coord_flip() +
facet_wrap(~ title, ncol = 2, scales = "free") +
theme(text = element_text(size = 8))
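To see what bind_tf_idf() is doing, it can help to run it on counts small enough to check by hand. In the made-up tibble below (toy_counts), "time" appears in both books, so its inverse document frequency is ln(2/2) = 0 and its tf-idf is 0. The other two words each appear in only one book and so get an idf of ln(2).

```r
library(dplyr)
library(tidytext)

# Made-up word counts for two toy books
toy_counts <- tibble(
  title = c("Book A", "Book A", "Book B", "Book B"),
  word  = c("time", "martians", "time", "nautilus"),
  n     = c(3, 1, 2, 2)
)

# tf  = n / total words in that book
# idf = ln(number of books / number of books containing the word)
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, title, n) %>%
  arrange(desc(tf_idf))

toy_tf_idf
```

Here "nautilus" comes out on top: it makes up half of Book B (tf = 0.5) and appears nowhere else, giving a tf-idf of 0.5 × ln(2).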
Up to this point we have been focusing on individual words. When we un-nested our texts originally, we did so on a word-by-word basis. Below we are going to un-nest as bigrams (i.e., word pairs). This can be used to tell us which words typically occur with which other words. We can plot a network graph to demonstrate this. We’ll need to load two extra packages first (install them beforehand if you don’t already have them).
library(igraph)
library(ggraph)
The code below un-nests via bigrams. If you want to, you could modify it to examine word triplets. Just change the n value on line 3, and then create an extra word column in the separate() function call on line 4.
wotw_bigrams <- books %>%
filter(title == "The War of the Worlds") %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(col = bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
bigram_graph <- wotw_bigrams %>%
filter(n > 5) %>%
graph_from_data_frame()
set.seed(1234)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(alpha = .25) +
geom_node_point(alpha = .25) +
geom_node_text(aes(label = name), vjust = -.1, hjust = 1.25, size = 3) +
guides(size = "none") +
xlim(10, 22) +
theme_void()
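For the word-triplet variation mentioned earlier, a minimal sketch of the un-nesting step might look like this. It uses a made-up one-sentence tibble (toy) rather than the downloaded book, and the column names trigram, word1, word2 and word3 are just illustrative choices.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up one-sentence text standing in for a book
toy <- tibble(text = "the martians came from the red planet")

toy_trigrams <- toy %>%
  # n = 3 gives overlapping three-word sequences
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  # Three words need three columns; sep sits outside c()
  separate(col = trigram, into = c("word1", "word2", "word3"), sep = " ")

toy_trigrams
```

Seven words give five overlapping trigrams. With real text you would then filter all three word columns against stop_words$word before counting, just as in the bigram code.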
This is a great book for introducing you to using R for text mining. You can click on the image below to be taken to an electronic version of the book. Both Julia Silge and David Robinson are very active on Twitter and well worth following for all things R related.
If you spot any issues/errors in this workshop, you can raise an issue or create a pull request for this repo.