In the following exercises, we will use the data you collected and preprocessed in the previous sets of exercises (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output shown in the solutions for these exercises, as we collected the comments at an earlier point in time.
First, we need to load the parsed comments data (NB: you might have to adjust the following code to use the correct file path on your computer).
comments <- readRDS("../data/ParsedLWTComments.rds")
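If you want to check whether the data were loaded correctly, you can first take a quick look at the object. This is a minimal sketch using base R; the exact variables depend on your preprocessing:
# number of comments (rows) and variables (columns)
dim(comments)
# names of the variables in the parsed comments data
names(comments)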
After loading the data, we go through the preprocessing steps described in the slides. In a first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
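To verify that this worked, you can count the comments that still contain a newline character. A small sketch (na.rm = TRUE accounts for comments that are NA in this column, e.g., comments consisting only of emojis, assuming the preprocessing coded them that way):
# count comments that still contain a newline (should return 0)
sum(str_detect(comments$TextEmojiDeleted, "\\n"), na.rm = TRUE)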
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE,
         remove_url = TRUE)
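If you want to see what the tokenizer did, you can inspect the tokens for an individual comment (a quick sketch):
# tokens of the first comment
head(toks[[1]])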
comments_dfm <- dfm(toks) %>%
  dfm_remove(pattern = quanteda::stopwords("english"))
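To get a first impression of the resulting document-feature matrix, you can check its dimensions and most frequent features (a small sketch using quanteda helper functions):
# number of documents (comments) and unique features in the DFM
ndoc(comments_dfm)
nfeat(comments_dfm)
# quick look at the 10 most frequent features
topfeatures(comments_dfm, 10)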
NB: Your results might look a little different, as the comments that the solutions in this exercise are based on were collected a couple of days earlier.
What are the most frequently used words in the comments? Compute the term frequencies and save them in an object called term_freq. You can use the function textstat_frequency() from the quanteda.textstats package to answer this question.
library(quanteda.textstats)
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1801 1 1385 all
## 2 people 998 2 726 all
## 3 just 754 3 651 all
## 4 like 625 4 528 all
## 5 one 509 5 434 all
## 6 can 491 6 430 all
## 7 trump 489 7 438 all
## 8 know 455 8 402 all
## 9 john 437 9 405 all
## 10 get 437 9 389 all
## 11 government 388 11 312 all
## 12 us 373 12 305 all
## 13 question 362 13 307 all
## 14 many 352 14 300 all
## 15 citizens 319 15 236 all
## 16 country 304 16 254 all
## 17 even 288 17 269 all
## 18 think 283 18 258 all
## 19 want 279 19 241 all
## 20 illegal 279 19 214 all
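If you prefer a visual overview, you can also plot these term frequencies. The following is a minimal ggplot2 sketch (ggplot2 is loaded as part of the tidyverse):
# simple bar chart of the 20 most frequent terms
term_freq %>%
  head(20) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_col() +
  labs(x = "", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))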
Which words appear in the largest number of different comments? To answer this question, have a look at the docfreq column from the term_freq object you created in the previous task.
term_freq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 census 1801 1 1385 all
## 2 people 998 2 726 all
## 3 just 754 3 651 all
## 4 like 625 4 528 all
## 7 trump 489 7 438 all
## 5 one 509 5 434 all
## 6 can 491 6 430 all
## 9 john 437 9 405 all
## 8 know 455 8 402 all
## 10 get 437 9 389 all
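Comparing frequency and docfreq also shows whether a word owes its high overall frequency to being repeated within the same comments. A small sketch (the per_comment column is a helper we introduce here, not part of the textstat_frequency() output):
# average number of occurrences per comment containing the term
term_freq %>%
  mutate(per_comment = frequency / docfreq) %>%
  arrange(-per_comment) %>%
  head(10)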
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM). As the emojis are stored as space-separated textual descriptions, we can tokenize them with what = "fastestword", which simply splits the strings on whitespace.
emoji_toks <- comments %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>%
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens(what = "fastestword")
EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 109 1 63 all
## 2 emoji_rollingonthefloorlaughing 60 2 24 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 16 4 14 all
## 5 emoji_registered 13 5 3 all
## 6 emoji_loudlycryingface 12 6 8 all
## 7 emoji_fire 12 6 3 all
## 8 emoji_grinningsquintingface 9 8 6 all
## 9 emoji_smilingfacewithsunglasses 8 9 7 all
## 10 emoji_clappinghands 8 9 2 all
Analogous to the words, we can also sort the emojis by the number of different comments they appear in:
EmojiFreq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 109 1 63 all
## 2 emoji_rollingonthefloorlaughing 60 2 24 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 16 4 14 all
## 6 emoji_loudlycryingface 12 6 8 all
## 11 emoji_unamusedface 8 9 8 all
## 9 emoji_smilingfacewithsunglasses 8 9 7 all
## 12 emoji_winkingface 7 12 7 all
## 13 emoji_thumbsup 7 12 7 all
## 8 emoji_grinningsquintingface 9 8 6 all
Bonus: Create a plot showing the 10 most frequent emojis in the comments. You can use the emoji mapping function provided for this exercise; have a look at the emoji_mapping_function.R file to see what this function does. Bonus Bonus: Alternatively or additionally, you can also try to recreate the emoji plot approach by Emil Hvitfeldt.
source("../../content/R/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
With these mapping objects in place, we can build the bar chart and add the emoji images on top:
EmojiFreq %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FF74A6",
           alpha = 0.7) +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
\nhttps://www.youtube.com/watch?v=1aheRpmurAo",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 120)) +
  # hide the textual emoji names on the x-axis, as the emoji images
  # added via the mapping objects serve as the axis labels
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10