In the following exercises, we will use the data you collected and preprocessed in the previous sets of exercises (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output shown in the solutions for these exercises, as we collected the comments at an earlier point in time.
First, we need to load the parsed comments data (NB: you might have to adjust the following code to use the correct file path on your computer).
comments <- readRDS("../data/ParsedLWTComments.rds")
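If you want to check whether the data were loaded correctly, you can first take a quick look at the object. This is a minimal sketch using base R; the exact variables depend on your preprocessing:
# number of comments (rows) and variables (columns)
dim(comments)
# names of the variables in the parsed comments data
names(comments)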
After loading the data, we go through the preprocessing steps described in the slides. In a first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
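To verify that this worked, you can count the comments that still contain a newline character. A small sketch (na.rm = TRUE accounts for comments that are NA in this column, e.g., comments consisting only of emojis, assuming the preprocessing coded them that way):
# count comments that still contain a newline (should return 0)
sum(str_detect(comments$TextEmojiDeleted, "\\n"), na.rm = TRUE)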
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE,
         remove_url = TRUE)
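If you want to see what the tokenizer did, you can inspect the tokens for an individual comment (a quick sketch):
# tokens of the first comment
head(toks[[1]])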
comments_dfm <- dfm(toks) %>%
  dfm_remove(pattern = quanteda::stopwords("english"))
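To get a first impression of the resulting document-feature matrix, you can check its dimensions and most frequent features (a small sketch using quanteda helper functions):
# number of documents (comments) and unique features in the DFM
ndoc(comments_dfm)
nfeat(comments_dfm)
# quick look at the 10 most frequent features
topfeatures(comments_dfm, 10)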
NB: Your results might look a little different, as the comments that the solutions in this exercise are based on were collected a couple of days earlier.
What are the most frequently used words in the comments? Compute the term frequencies and save them in an object called term_freq. You can use the function textstat_frequency() from the quanteda.textstats package to answer this question.
library(quanteda.textstats)
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1801 1 1385 all
## 2 people 998 2 726 all
## 3 just 754 3 651 all
## 4 like 625 4 528 all
## 5 one 509 5 434 all
## 6 can 491 6 430 all
## 7 trump 489 7 438 all
## 8 know 455 8 402 all
## 9 john 437 9 405 all
## 10 get 437 9 389 all
## 11 government 388 11 312 all
## 12 us 373 12 305 all
## 13 question 362 13 307 all
## 14 many 352 14 300 all
## 15 citizens 319 15 236 all
## 16 country 304 16 254 all
## 17 even 288 17 269 all
## 18 think 283 18 258 all
## 19 want 279 19 241 all
## 20 illegal 279 19 214 all
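If you prefer a visual overview, you can also plot these term frequencies. The following is a minimal ggplot2 sketch (ggplot2 is loaded as part of the tidyverse):
# simple bar chart of the 20 most frequent terms
term_freq %>%
  head(20) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_col() +
  labs(x = "", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))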
Which words appear in the largest number of different comments? To answer this question, have a look at the docfreq column from the term_freq object you created in the previous task.
term_freq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 census 1801 1 1385 all
## 2 people 998 2 726 all
## 3 just 754 3 651 all
## 4 like 625 4 528 all
## 7 trump 489 7 438 all
## 5 one 509 5 434 all
## 6 can 491 6 430 all
## 9 john 437 9 405 all
## 8 know 455 8 402 all
## 10 get 437 9 389 all
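Comparing frequency and docfreq also shows whether a word owes its high overall frequency to being repeated within the same comments. A small sketch (the per_comment column is a helper we introduce here, not part of the textstat_frequency() output):
# average number of occurrences per comment containing the term
term_freq %>%
  mutate(per_comment = frequency / docfreq) %>%
  arrange(-per_comment) %>%
  head(10)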
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM). As the emojis are stored as space-separated textual descriptions, we can tokenize them with what = "fastestword", which simply splits the strings on whitespace.
emoji_toks <- comments %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>%
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens(what = "fastestword")
EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 109 1 63 all
## 2 emoji_rollingonthefloorlaughing 60 2 24 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 16 4 14 all
## 5 emoji_registered 13 5 3 all
## 6 emoji_loudlycryingface 12 6 8 all
## 7 emoji_fire 12 6 3 all
## 8 emoji_grinningsquintingface 9 8 6 all
## 9 emoji_smilingfacewithsunglasses 8 9 7 all
## 10 emoji_clappinghands 8 9 2 all
Analogous to the words, we can also sort the emojis by the number of different comments they appear in:
EmojiFreq %>%
  arrange(-docfreq) %>%
  head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 109 1 63 all
## 2 emoji_rollingonthefloorlaughing 60 2 24 all
## 3 emoji_thinkingface 30 3 19 all
## 4 emoji_grinningfacewithsweat 16 4 14 all
## 6 emoji_loudlycryingface 12 6 8 all
## 11 emoji_unamusedface 8 9 8 all
## 9 emoji_smilingfacewithsunglasses 8 9 7 all
## 12 emoji_winkingface 7 12 7 all
## 13 emoji_thumbsup 7 12 7 all
## 8 emoji_grinningsquintingface 9 8 6 all
Bonus: Create a plot showing the 10 most frequent emojis in the comments. You can use the emoji mapping function provided for this exercise; have a look at the emoji_mapping_function.R file to see what this function does. Bonus Bonus: Alternatively or additionally, you can also try to recreate the emoji plot approach by Emil Hvitfeldt.
source("../../content/R/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
With these mapping objects in place, we can build the bar chart and add the emoji images on top:
EmojiFreq %>%
  head(n = 10) %>%
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FF74A6",
           alpha = 0.7) +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
\nhttps://www.youtube.com/watch?v=1aheRpmurAo",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 120)) +
  # hide the textual emoji names on the x-axis, as the emoji images
  # added via the mapping objects serve as the axis labels
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10