In the following exercises, we will use the data you collected and preprocessed in the previous sets of exercises (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output in the solutions for these exercises, as we collected the comments at an earlier point in time.
First we need to load the parsed comments data (NB: You might have to adjust the following code to use the correct file path on your computer).
comments <- readRDS("../data/ParsedLWTComments.rds")
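If the call above fails, checking the path first can save some confusion. A small sketch (the path is the one used throughout these exercises; adjust it as needed):

```r
# Check that the file exists before reading it; readRDS() would
# otherwise stop with a less informative error
rds_path <- "../data/ParsedLWTComments.rds"
if (!file.exists(rds_path)) {
  stop("File not found - adjust 'rds_path' to the location on your computer.")
}
comments <- readRDS(rds_path)
```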
After loading the data, we go through the preprocessing steps described in the slides. In a first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
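To see what this step does, here is the same replacement applied to a made-up comment string, using base R's gsub() as an equivalent of the str_replace_all() call above:

```r
# A fabricated example comment containing a newline
x <- "Great video!\nVery informative."

# Replace the newline with a space, as in the preprocessing step above
gsub("\n", " ", x)
# [1] "Great video! Very informative."
```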
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE,
         remove_url = TRUE)
comments_dfm <- dfm(toks) %>% 
  dfm_remove(pattern = quanteda::stopwords("english"))
NB: Your results might look a little different from the solutions shown here, as the comments these solutions are based on were collected a couple of days earlier.
What are the 20 most frequent words in the comments (excluding stopwords)? Store the frequencies in an object called term_freq. You can use the function textstat_frequency() from the quanteda.textstats package to answer this question.
library(quanteda.textstats)
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1475 1 1134 all
## 2 people 668 2 493 all
## 3 just 514 3 442 all
## 4 like 444 4 374 all
## 5 trump 411 5 369 all
## 6 john 394 6 363 all
## 7 one 351 7 298 all
## 8 can 333 8 293 all
## 9 get 309 9 282 all
## 10 know 308 10 276 all
## 11 many 278 11 238 all
## 12 question 265 12 228 all
## 13 government 245 13 202 all
## 14 us 244 14 202 all
## 15 oliver 228 15 214 all
## 16 country 215 16 182 all
## 17 toilets 207 17 193 all
## 18 citizenship 187 18 168 all
## 19 want 185 19 158 all
## 20 think 185 19 174 all
Which words appear in the highest number of comments? You can use the column docfreq from the term_freq object you created in the previous task.
term_freq %>% 
  arrange(-docfreq) %>% 
  head(10)
## feature frequency rank docfreq group
## 1 census 1475 1 1134 all
## 2 people 668 2 493 all
## 3 just 514 3 442 all
## 4 like 444 4 374 all
## 5 trump 411 5 369 all
## 6 john 394 6 363 all
## 7 one 351 7 298 all
## 8 can 333 8 293 all
## 9 get 309 9 282 all
## 10 know 308 10 276 all
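As the identical ordering of the two tables suggests, frequency and docfreq often correlate, but they measure different things: frequency counts every occurrence of a term, while docfreq counts the number of documents (here: comments) that contain the term at least once. A base-R toy illustration (the two toy comments are made up):

```r
# Two fabricated toy comments
toy <- c("census census form", "census data")
toks <- strsplit(toy, " ")

# frequency: total number of occurrences across all comments
freq <- table(unlist(toks))

# docfreq: number of comments containing the term at least once
docfreq <- table(unlist(lapply(toks, unique)))

freq["census"]     # 3
docfreq["census"]  # 2
```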
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>% 
  mutate(Emoji = na_if(Emoji, "NA")) %>% 
  mutate(Emoji = str_trim(Emoji)) %>% 
  filter(!is.na(Emoji)) %>% 
  pull(Emoji) %>% 
  tokens(what = "fastestword")

EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 79 1 50 all
## 2 emoji_rollingonthefloorlaughing 45 2 15 all
## 3 emoji_thinkingface 24 3 15 all
## 4 emoji_grinningfacewithsweat 12 4 10 all
## 5 emoji_registered 12 4 2 all
## 6 emoji_fire 12 4 3 all
## 7 emoji_unamusedface 8 7 8 all
## 8 emoji_loudlycryingface 7 8 5 all
## 9 emoji_smilingfacewithsunglasses 7 8 6 all
## 10 emoji_grinningsquintingface 7 8 5 all
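The tokens(what = "fastestword") call works here because the Emoji column stores the emoji names as a single space-separated string per comment (as produced by the earlier preprocessing exercises); "fastestword" simply splits on whitespace without any further processing. A toy illustration in base R (the example string is made up):

```r
# A fabricated value as it might appear in the Emoji column
x <- "emoji_fire emoji_thinkingface emoji_fire"

# Whitespace splitting is all that is needed for this format
strsplit(x, " ")[[1]]
# "emoji_fire" "emoji_thinkingface" "emoji_fire"
```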
EmojiFreq %>% 
  arrange(-docfreq) %>% 
  head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 79 1 50 all
## 2 emoji_rollingonthefloorlaughing 45 2 15 all
## 3 emoji_thinkingface 24 3 15 all
## 4 emoji_grinningfacewithsweat 12 4 10 all
## 7 emoji_unamusedface 8 7 8 all
## 9 emoji_smilingfacewithsunglasses 7 8 6 all
## 11 emoji_thumbsup 6 11 6 all
## 8 emoji_loudlycryingface 7 8 5 all
## 10 emoji_grinningsquintingface 7 8 5 all
## 14 emoji_facewithrollingeyes 5 14 5 all
Bonus: Create a plot showing the 10 most frequent emojis in the comments. You can have a look at the emoji_mapping_function.R file to see what this function does. Bonus Bonus: Alternatively or additionally, you can also try to recreate the emoji plot approach by Emil Hvitfeldt.
source("../scripts/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
EmojiFreq %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FF74A6",
           alpha = 0.7) +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 100)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10
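If the mapping script is not available, a plain bar chart of the same data can be sketched without the emoji images on the x-axis. This fallback uses only the columns returned by textstat_frequency() and shows the emoji names as rotated axis labels instead:

```r
# Simple fallback plot without the custom emoji mappings
EmojiFreq %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_col(color = "black", fill = "#FF74A6", alpha = 0.7) +
  labs(title = "Most frequent emojis in comments",
       x = "",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```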