In the following exercises, we will use the data you collected and preprocessed in the previous sets of exercises (all comments for the video “The Census” by Last Week Tonight with John Oliver). Please note that your results might look slightly different from the output in the solutions for these exercises, as we collected the comments at an earlier point in time.
First we need to load the parsed comments data (NB: You might have to adjust the following code to use the correct file path on your computer).
comments <- readRDS("../data/ParsedLWTComments.rds")
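If the call above fails, checking the path first can save some confusion. A small sketch (the path is the one used throughout these exercises; adjust it as needed):

```r
# Check that the file exists before reading it; readRDS() would
# otherwise stop with a less informative error
rds_path <- "../data/ParsedLWTComments.rds"
if (!file.exists(rds_path)) {
  stop("File not found - adjust 'rds_path' to the location on your computer.")
}
comments <- readRDS(rds_path)
```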
After loading the data, we go through the preprocessing steps described in the slides. In a first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
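To see what this step does, here is the same replacement applied to a made-up comment string, using base R's gsub() as an equivalent of the str_replace_all() call above:

```r
# A fabricated example comment containing a newline
x <- "Great video!\nVery informative."

# Replace the newline with a space, as in the preprocessing step above
gsub("\n", " ", x)
# [1] "Great video! Very informative."
```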
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE,
         remove_url = TRUE)
comments_dfm <- dfm(toks) %>% 
  dfm_remove(pattern = quanteda::stopwords("english"))
NB: Your results might look a little different from the solutions shown here, as the comments these solutions are based on were collected a couple of days earlier.
What are the 20 most frequent words in the comments (excluding stopwords)? Store the frequencies in an object called term_freq. You can use the function textstat_frequency() from the quanteda.textstats package to answer this question.
library(quanteda.textstats)
term_freq <- textstat_frequency(comments_dfm)
head(term_freq, 20)
## feature frequency rank docfreq group
## 1 census 1475 1 1134 all
## 2 people 668 2 493 all
## 3 just 514 3 442 all
## 4 like 444 4 374 all
## 5 trump 411 5 369 all
## 6 john 394 6 363 all
## 7 one 351 7 298 all
## 8 can 333 8 293 all
## 9 get 309 9 282 all
## 10 know 308 10 276 all
## 11 many 278 11 238 all
## 12 question 265 12 228 all
## 13 government 245 13 202 all
## 14 us 244 14 202 all
## 15 oliver 228 15 214 all
## 16 country 215 16 182 all
## 17 toilets 207 17 193 all
## 18 citizenship 187 18 168 all
## 19 want 185 19 158 all
## 20 think 185 19 174 all
Which words appear in the highest number of comments? You can use the column docfreq from the term_freq object you created in the previous task.
term_freq %>% 
  arrange(-docfreq) %>% 
  head(10)
## feature frequency rank docfreq group
## 1 census 1475 1 1134 all
## 2 people 668 2 493 all
## 3 just 514 3 442 all
## 4 like 444 4 374 all
## 5 trump 411 5 369 all
## 6 john 394 6 363 all
## 7 one 351 7 298 all
## 8 can 333 8 293 all
## 9 get 309 9 282 all
## 10 know 308 10 276 all
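As the identical ordering of the two tables suggests, frequency and docfreq often correlate, but they measure different things: frequency counts every occurrence of a term, while docfreq counts the number of documents (here: comments) that contain the term at least once. A base-R toy illustration (the two toy comments are made up):

```r
# Two fabricated toy comments
toy <- c("census census form", "census data")
toks <- strsplit(toy, " ")

# frequency: total number of occurrences across all comments
freq <- table(unlist(toks))

# docfreq: number of comments containing the term at least once
docfreq <- table(unlist(lapply(toks, unique)))

freq["census"]     # 3
docfreq["census"]  # 2
```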
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>% 
  mutate(Emoji = na_if(Emoji, "NA")) %>% 
  mutate(Emoji = str_trim(Emoji)) %>% 
  filter(!is.na(Emoji)) %>% 
  pull(Emoji) %>% 
  tokens(what = "fastestword")

EmojiDfm <- dfm(emoji_toks)
EmojiFreq <- textstat_frequency(EmojiDfm)
head(EmojiFreq, n = 10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 79 1 50 all
## 2 emoji_rollingonthefloorlaughing 45 2 15 all
## 3 emoji_thinkingface 24 3 15 all
## 4 emoji_grinningfacewithsweat 12 4 10 all
## 5 emoji_registered 12 4 2 all
## 6 emoji_fire 12 4 3 all
## 7 emoji_unamusedface 8 7 8 all
## 8 emoji_loudlycryingface 7 8 5 all
## 9 emoji_smilingfacewithsunglasses 7 8 6 all
## 10 emoji_grinningsquintingface 7 8 5 all
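The tokens(what = "fastestword") call works here because the Emoji column stores the emoji names as a single space-separated string per comment (as produced by the earlier preprocessing exercises); "fastestword" simply splits on whitespace without any further processing. A toy illustration in base R (the example string is made up):

```r
# A fabricated value as it might appear in the Emoji column
x <- "emoji_fire emoji_thinkingface emoji_fire"

# Whitespace splitting is all that is needed for this format
strsplit(x, " ")[[1]]
# "emoji_fire" "emoji_thinkingface" "emoji_fire"
```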
EmojiFreq %>% 
  arrange(-docfreq) %>% 
  head(10)
## feature frequency rank docfreq group
## 1 emoji_facewithtearsofjoy 79 1 50 all
## 2 emoji_rollingonthefloorlaughing 45 2 15 all
## 3 emoji_thinkingface 24 3 15 all
## 4 emoji_grinningfacewithsweat 12 4 10 all
## 7 emoji_unamusedface 8 7 8 all
## 9 emoji_smilingfacewithsunglasses 7 8 6 all
## 11 emoji_thumbsup 6 11 6 all
## 8 emoji_loudlycryingface 7 8 5 all
## 10 emoji_grinningsquintingface 7 8 5 all
## 14 emoji_facewithrollingeyes 5 14 5 all
Bonus: Create a plot showing the 10 most frequent emojis in the comments. You can have a look at the emoji_mapping_function.R file to see what this function does. Bonus Bonus: Alternatively or additionally, you can also try to recreate the emoji plot approach by Emil Hvitfeldt.
source("../scripts/emoji_mapping_function.R")
create_emoji_mappings(EmojiFreq, 10)
EmojiFreq %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_bar(stat = "identity",
           color = "black",
           fill = "#FF74A6",
           alpha = 0.7) +
  geom_point() +
  labs(title = "Most frequent emojis in comments",
       subtitle = "The Census: Last Week Tonight with John Oliver (HBO)
       \nhttps://www.youtube.com/watch?v=1aheRpmurAo",
       x = "",
       y = "Frequency") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 100)) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  mapping1 +
  mapping2 +
  mapping3 +
  mapping4 +
  mapping5 +
  mapping6 +
  mapping7 +
  mapping8 +
  mapping9 +
  mapping10
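If the mapping script is not available, a plain bar chart of the same data can be sketched without the emoji images on the x-axis. This fallback uses only the columns returned by textstat_frequency() and shows the emoji names as rotated axis labels instead:

```r
# Simple fallback plot without the custom emoji mappings
EmojiFreq %>% 
  head(n = 10) %>% 
  ggplot(aes(x = reorder(feature, -frequency), y = frequency)) +
  geom_col(color = "black", fill = "#FF74A6", alpha = 0.7) +
  labs(title = "Most frequent emojis in comments",
       x = "",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```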