Exercises B1: Basic text analysis of user comments

In the following exercises, we will use the data you have collected and preprocessed in the previous sets of exercises (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different than the output in the solutions for these exercises as we collected the comments earlier.

First we need to load the parsed comments data (NB: You might have to adjust the following code to use the correct file path on your computer).

comments <- readRDS("../data/ParsedLWTComments.rds")

After loading the data, we go through the preprocessing steps described in the slides. In a first step, we remove newline commands from the comment strings (without emojis).

library(tidyverse)

comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))

Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.

library(quanteda)

toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
               remove_punct = TRUE,
               remove_separators = TRUE,
               remove_symbols = TRUE,
               split_hyphens = TRUE,
               remove_url = TRUE)

comments_dfm <- dfm(toks, 
                   remove = quanteda::stopwords("english"))

## Warning: 'remove' is deprecated; use dfm_remove() instead

NB: Your results might look a little different as we have collected the comments that the solutions in this exercise are based on a couple of days ago.

Exercise 1

Which are the 20 most frequently used words in the comments on the video “The Census” by Last Week Tonight with John Oliver? Save the overall word ranking in a new object called term_freq.

Clues

You can use the function textstat_frequency() from the quanteda.textstats package to answer this question.

Exercise 2

Instead of the raw frequency we can also look at the number of comments that a particular word appears in. This metric takes into account that words might be used multiple times in the same comment. What are the 10 words that are used in the highest number of comments on the video “The Census” by Last Week Tonight with John Oliver?

Clues

You should use the variable docfreq from the term_freq object you created in the previous task.

We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missings, tokenize emojis, create DFM).

emoji_toks <- comments %>% 
  mutate(Emoji = na_if(Emoji, "NA")) %>% 
  mutate (Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>% 
  tokens(what = "fastestword")

EmojiDfm <- dfm(emoji_toks)

Exercise 3

What were the 10 most frequently used emojis comments on the video “The Census” by Last Week Tonight with John Oliver?

Clues

The solution is essentially the same as the one for the first task in this exercise (word frequencies).

Exercise 4

The ranking based on raw counts of emojis might be affected by YouTube users “spamming” emojis in their comments (i.e., using the same emojis many times in the same comment). Hence, it makes sense to also look at the number of unique comments that an emoji appears in. What are the 10 emojis that appear in the highest number of comments on the video “The Census” by Last Week Tonight with John Oliver?

Clues

The solution is essentially the same as the one for the second task in this exercise (docfreq for words).

Exercise 5 (Bonus)

If you’re finished with tasks 1-4 and/or want to do/try out something else, you can create an emoji plot similar to the one you saw in the lecture slides for the video “The Census” by Last Week Tonight with John Oliver. We have created a script containing a function for the emoji mapping which you can source with the following code (NB: you might have to adjust the path to the script in the code below). You might also want to have a look at the emoji_mapping_function.R file to see what this functions does. Bonus Bonus: Alternatively or additionally, you can also try to recreate the emoji plot approach by Emil Hvitfeldt.

source("../../content/R/emoji_mapping_function.R")

Clues

You need to add the mapping objects to your plot. To see how you can construct the plot, you can have a look at slide #31 from the session on Basic Text Analysis of User Comments.

Exercises B1: Basic text analysis of user comments

Johannes Breuer, Annika Deubel, & M. Rohangis Mohseni

Automatic sampling and analysis of YouTube data, February 14-15, 2023

Exercise 1

Clues

Exercise 2

Clues

Exercise 3

Clues

Exercise 4

Clues

Exercise 5 (Bonus)

Clues