Exercises A5: Processing and Cleaning User Comments

Exercise 1

Naturally, before we can process the data, we need to load it :-) Load the comment data you scraped in the previous set of exercises into your R session and assign it to an object called comments. Get an overview of the contained variables. What do the variables describe? Why do we have missing data in some of them?

Clues

To load the data, you can use the readRDS() function. To get an overview of the contained variables, you can simply use colnames() or names() (or glimpse() from the dplyr package). To find out more about what the variables mean, you can have a look at the YouTube data API documentation and search for the respective variable descriptions.

Solution 1

# Load data
comments <- readRDS("./data/RawLWTComments.rds")

# overview of columns
colnames(comments)

##  [1] "videoId"               "textDisplay"           "textOriginal"          "authorDisplayName"    
##  [5] "authorProfileImageUrl" "authorChannelUrl"      "authorChannelId.value" "canRate"              
##  [9] "viewerRating"          "likeCount"             "publishedAt"           "updatedAt"            
## [13] "id"                    "parentId"              "moderationStatus"
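
To approach the question about missing data, it can help to count the missing values per column. The following is a minimal sketch on a made-up stand-in data frame (toy_comments and all of its values are assumptions, not the scraped data). Some gaps in the real data are expected by design: according to the YouTube Data API documentation, parentId, for example, is only set for comments that are replies to another comment.

```r
# Toy stand-in for the scraped comments; all values are invented for illustration
toy_comments <- data.frame(
  textOriginal     = c("Great video", "I agree", "First!"),
  likeCount        = c("3", "0", "12"),
  parentId         = c(NA, "abc123", NA),   # only the second comment is a reply
  moderationStatus = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_counts <- colSums(is.na(toy_comments))
na_counts

# Share of missing values per column
colMeans(is.na(toy_comments))
```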

Exercise 2

As a first processing step, we want to remove the following variables: authorProfileImageUrl, authorChannelUrl, authorChannelId.value, videoId, canRate, viewerRating, moderationStatus. Create a new dataframe called Selection containing only the remaining variables.

Clues

You can use the subset() function from base R to keep or remove a selection of variables from a dataframe. For more information on how to use it, have a look at its documentation by running ?subset().

Solution 2

# Select only the columns we need
Selection <- subset(comments, select = -c(authorProfileImageUrl,
                                          authorChannelUrl,
                                          authorChannelId.value,
                                          videoId,
                                          canRate,
                                          viewerRating,
                                          moderationStatus))
# Alternatively, you could, of course, also use dplyr::select()

# Check selection
colnames(Selection)
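
As a rough sketch of the alternatives to subset(), the example below drops columns by name, once with base R and once with dplyr::select(). The data frame toy_comments and the vector drop_cols are made up for illustration and only mimic a few columns of the real data.

```r
library(dplyr)

# Toy stand-in with invented values
toy_comments <- data.frame(
  textOriginal = c("Great video", "I agree"),
  likeCount    = c("3", "0"),
  canRate      = c("TRUE", "TRUE"),
  viewerRating = c("none", "none"),
  stringsAsFactors = FALSE
)

drop_cols <- c("canRate", "viewerRating")

# Base R: keep every column whose name is not in drop_cols
sel_base <- toy_comments[, setdiff(names(toy_comments), drop_cols)]

# dplyr: a minus sign in front of all_of() drops the named columns
sel_dplyr <- select(toy_comments, -all_of(drop_cols))

names(sel_base)
names(sel_dplyr)
```

Storing the names in a character vector makes it easy to reuse the same list of columns elsewhere in a script.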

Exercise 3

Check the class of the variable publishedAt in your new dataframe. Is this class suitable for further analysis? If not, change the class to the appropriate one and compute the time difference in publishing dates between the comment in the first row and the comment in the last row.

Do the same transformation for the variable updatedAt.

Clues

To check the class of the publishedAt variable, you can use the class() function. You can get information about the formatting of the comment timestamp from the YouTube API documentation. To transform character strings into datetime objects in R, you can use the base R function as.POSIXct(). However, we would recommend using the anytime() function from the package of the same name, as it is more convenient. (Note: If you are a tidyverse aficionado, you can also use functions from the lubridate package for this task.)

Solution 3

# check variable class
class(Selection$publishedAt)

# transform to datetime object with as.POSIXct (the trailing Z marks UTC, so set tz explicitly)
DateTime <- as.POSIXct(Selection$publishedAt, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# transform to datetime object with anytime
library(anytime)
Selection$publishedAt <- anytime(Selection$publishedAt,asUTC = TRUE)

# recheck variable class
class(Selection$publishedAt)

# compute time difference in publishing time between first and last comment
Selection$publishedAt[1] - Selection$publishedAt[nrow(Selection)]

# transform the updatedAt variable as well
Selection$updatedAt <- anytime(Selection$updatedAt,asUTC = TRUE)
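
The following sketch illustrates the lubridate route mentioned in the clues and shows how to fix the unit of the time difference. The two timestamps are made up; they only follow the ISO 8601 format that the API returns.

```r
library(lubridate)

# Made-up timestamps in the format returned by the API
stamps <- c("2019-11-20T08:00:00Z", "2019-11-17T20:00:00Z")

# Base R: tz = "UTC" matters because the trailing Z denotes UTC
published <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# lubridate: ymd_hms() parses this format and returns UTC by default
published_lub <- ymd_hms(stamps)
all(published == published_lub)

# difftime() lets you fix the unit instead of relying on the automatic choice
gap_days <- difftime(published[1], published[2], units = "days")
gap_days
```

Subtracting two datetime objects with the minus operator picks a unit automatically, which can be confusing when results are compared across datasets; difftime() with an explicit units argument avoids that.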

Exercise 4

Check the likeCount variable in your data. Is it suitable for numeric analysis? If not, transform it to the appropriate class and test whether your transformation worked.

Clues

You can use the class() function to check the class of an object in R. To change a class, for example from character to numeric, you can use the family of “as”-functions, for example as.numeric().

Solution 4

# check variable class
class(Selection$likeCount)

## [1] "character"

# transform class
Selection$likeCount <- as.numeric(Selection$likeCount)

# recheck class
class(Selection$likeCount)

## [1] "numeric"

summary(Selection$likeCount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   14.41    1.00 5642.00
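
Besides checking the class, one way to test whether the transformation worked is to make sure that it did not introduce new missing values. This is a small sketch with made-up values; likes_chr and bad are hypothetical names.

```r
# Made-up like counts stored as character strings, as in the raw data
likes_chr <- c("0", "14", "5642")
likes_num <- as.numeric(likes_chr)

# A clean conversion introduces no new missing values
sum(is.na(likes_num)) == sum(is.na(likes_chr))

# Strings that are not numbers become NA (normally with a warning)
bad <- suppressWarnings(as.numeric(c("7", "n/a")))
is.na(bad)
```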

Exercise 5

Check the textOriginal column in your Selection dataframe. Some comments contain hyperlinks that we should remove for later text analysis steps. Extract the hyperlinks from the textOriginal column into a new list called Links. In addition, create a new variable called LinksDel that contains the text from textOriginal without hyperlinks.

Clues

The qdapRegex package offers many pre-built functions for detecting, removing, and replacing specific character strings. You can, for example, use the rm_url() function for extracting and replacing hyperlinks. As a reminder: You can check the documentation for this function with ?rm_url().

Solution 5

# load package
library(qdapRegex)

## 
## Attaching package: 'qdapRegex'

## The following object is masked from 'package:dplyr':
## 
##     explain

## The following object is masked from 'package:ggplot2':
## 
##     %+%

# check column
Selection$textOriginal[396:406]

##  [1] "https://www.youtube.com/watch?v=Bs4oSQWdWWw"                                                                                                                                                                                                                   
##  [2] "America = \"Census is the most difficult and the largest peacetime operation was undertaken by the government\"\n\n\nIndia = \"That`s Cute\""                                                                                                                  
##  [3] "The US Census Bureau can conduct periodic surveys that could conceivably ask the number of toilets in a home."                                                                                                                                                 
##  [4] "I live in providence county and I didnt even know this happened lmfao"                                                                                                                                                                                         
##  [5] "Trump confused a real estate appraiser and a census taker..."                                                                                                                                                                                                  
##  [6] "These people who don’t like the census remind me of Ron Swanson"                                                                                                                                                                                               
##  [7] "It’s even worse now since Trump signed an executive order that would make a “second census” that would specifically ask if a person is a citizen"                                                                                                              
##  [8] "Watch this entertaining TikTok video based on the census. Be sure to stay and watch the whole video! https://vm.tiktok.com/7r2pWx/"                                                                                                                            
##  [9] "- Census Man -\nStill a more useful superhero than Aquaman."                                                                                                                                                                                                   
## [10] "I am honestly a little confused about the US, aren't everybody required to have some form of nation number? And doesn't that update when you file for where you live? Why would a census being needed? Shouldn't the government already have that information?"
## [11] "How many toilets do thy have, how many desks do thy have , what's your roof made of .................WTF"

# extract hyperlinks
Links <- rm_url(Selection$textOriginal, extract = TRUE)
Links[396:406]

## [[1]]
## [1] "https://www.youtube.com/watch?v=Bs4oSQWdWWw"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] NA
## 
## [[4]]
## [1] NA
## 
## [[5]]
## [1] NA
## 
## [[6]]
## [1] NA
## 
## [[7]]
## [1] NA
## 
## [[8]]
## [1] "https://vm.tiktok.com/7r2pWx/"
## 
## [[9]]
## [1] NA
## 
## [[10]]
## [1] NA
## 
## [[11]]
## [1] NA

# remove hyperlinks
LinksDel <- rm_url(Selection$textOriginal)
LinksDel[396:406]

##  [1] ""                                                                                                                                                                                                                                                              
##  [2] "America = \"Census is the most difficult and the largest peacetime operation was undertaken by the government\" India = \"That`s Cute\""                                                                                                                       
##  [3] "The US Census Bureau can conduct periodic surveys that could conceivably ask the number of toilets in a home."                                                                                                                                                 
##  [4] "I live in providence county and I didnt even know this happened lmfao"                                                                                                                                                                                         
##  [5] "Trump confused a real estate appraiser and a census taker..."                                                                                                                                                                                                  
##  [6] "These people who don’t like the census remind me of Ron Swanson"                                                                                                                                                                                               
##  [7] "It’s even worse now since Trump signed an executive order that would make a “second census” that would specifically ask if a person is a citizen"                                                                                                              
##  [8] "Watch this entertaining TikTok video based on the census. Be sure to stay and watch the whole video!"                                                                                                                                                          
##  [9] "- Census Man - Still a more useful superhero than Aquaman."                                                                                                                                                                                                    
## [10] "I am honestly a little confused about the US, aren't everybody required to have some form of nation number? And doesn't that update when you file for where you live? Why would a census being needed? Shouldn't the government already have that information?"
## [11] "How many toilets do thy have, how many desks do thy have , what's your roof made of .................WTF"
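
As the output above shows, rm_url() returns NA for comments without a link and an empty string for comments that consisted only of a link. The sketch below, on made-up comments (example.com is a placeholder domain, and has_link and only_link are hypothetical names), turns both observations into logical flags that could be used for filtering later on.

```r
library(qdapRegex)

# Made-up comments
toy_text <- c("https://example.com/watch?v=1",
              "No link here",
              "Watch this https://example.com/a1")

toy_links <- rm_url(toy_text, extract = TRUE)  # list; NA where no URL was found
toy_clean <- rm_url(toy_text)                  # text with URLs removed

# TRUE for comments that contained at least one hyperlink
has_link <- vapply(toy_links, function(x) !all(is.na(x)), logical(1))

# TRUE for comments that consisted of nothing but a hyperlink
only_link <- toy_clean == ""

data.frame(toy_text, has_link, only_link)
```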

Exercise 6

While hyperlinks have been removed in the new LinksDel variable, the strings therein still contain emojis. For our later analysis, we want to do three things:

Create one column without hyperlinks and emojis
Create one column where emojis are replaced by a textual description
Create one column containing only the textual description of emojis

To achieve this, we first need a dictionary of emojis and their corresponding textual descriptions in a usable format. Load the emo package and have a look at the contained dataframe jis. Assign it to a new object called EmojiList. Afterwards, source the provided CamelCase.R script (contained in the folder content\R within the workshop materials) to transform the textual descriptions from regular case to CamelCase. Finally, create a new variable called TextEmoDel containing the text without emojis.

Clues

We created a function that capitalizes the first character of each word. The function is called simpleCap() and the name of the script in which it is stored is CamelCase.R. You can load it into your workspace using the source() function and specifying its location. You can find the script in the folder content\R within the workshop materials. Keep in mind that this function only capitalizes the first letter of each word, so you still need to get rid of the space characters. The gsub() function is a handy tool for this purpose. You can use the ji_replace_all() function from the emo package to replace emojis with an empty string ("").

Solution 6

# load package
library(emo)

# source script
source("./content/R/CamelCase.R")

# reassign the jis dataframe from the emo package to a new object
EmojiList <- jis

# apply the function to all emoji names
CamelCaseEmojis <- lapply(jis$name, simpleCap)

# delete empty spaces
CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ", "", x, fixed = TRUE)})

# replace the names column in the EmojiList df with the new name column
EmojiList[,4] <- unlist(CollapsedEmojis)

# check the first 10 entries in the new df
EmojiList[1:10,c(1,3,4)]

## # A tibble: 10 × 3
##    runes emoji name                       
##    <chr> <chr> <chr>                      
##  1 1F600 😀    GrinningFace               
##  2 1F601 😁    BeamingFaceWithSmilingEyes 
##  3 1F602 😂    FaceWithTearsOfJoy         
##  4 1F923 🤣    RollingOnTheFloorLaughing  
##  5 1F603 😃    GrinningFaceWithBigEyes    
##  6 1F604 😄    GrinningFaceWithSmilingEyes
##  7 1F605 😅    GrinningFaceWithSweat      
##  8 1F606 😆    GrinningSquintingFace      
##  9 1F609 😉    WinkingFace                
## 10 1F60A 😊    SmilingFaceWithSmilingEyes

# create a new text column with emojis (& hyperlinks) removed
TextEmoDel <- ji_replace_all(LinksDel,"")

# Check content of new vectors
LinksDel[c(604,650,934)]

## [1] "More bullshit 😷🤢"                 "😂"                                 "That fake Native American 😂😂😩😩"

TextEmoDel[c(604,650,934)]

## [1] "More bullshit "             ""                           "That fake Native American "
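
If you do not have the CamelCase.R script at hand, the two-step logic (capitalize each word, then remove the spaces) can be sketched as follows. Note that simple_cap_sketch() is a hypothetical stand-in written for this illustration; the simpleCap() function from the workshop materials may be implemented differently.

```r
# Hypothetical stand-in for simpleCap(); the workshop's own version may differ
simple_cap_sketch <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(toupper(substring(words, 1, 1)), substring(words, 2),
        sep = "", collapse = " ")
}

emoji_names <- c("grinning face", "face with tears of joy")

# Capitalize each word, then drop the spaces to get CamelCase
capped <- vapply(emoji_names, simple_cap_sketch, character(1), USE.NAMES = FALSE)
camel  <- gsub(" ", "", capped, fixed = TRUE)
camel
```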

Exercise 7

Ultimately, we want to use our EmojiList dataframe to replace the instances of emojis in our text with textual descriptions. We can do that by looping over all emojis in all texts and replacing them one at a time. There is a problem, however: Some emoji strings are made up of multiple “shorter” emoji strings. If we match parts of a “longer” emoji string and replace it with its textual description, the rest will become unreadable. For this reason, we need to make sure that we replace the emoji from longest to shortest string. Sort the EmojiList dataframe by the length of the emoji column from longest to shortest.

Clues

You can count the number of characters in a vector of text using the nchar() function. You can reorder dataframes using the order() function and you can reverse an order with the rev() function (Note: The tidyverse equivalent here would be to use arrange(desc()) from the dplyr package).

Solution 7

# order from longest to shortest
EmojiList <- EmojiList[rev(order(nchar(EmojiList$emoji))),]

# overview of new order
head(EmojiList[,c(1,3,4)],5)

## # A tibble: 5 × 3
##   runes                                      emoji name            
##   <chr>                                      <chr> <chr>           
## 1 1F469 200D 2764 FE0F 200D 1F48B 200D 1F469 👩‍❤️‍💋‍👩    Kiss:Woman,Woman
## 2 1F468 200D 2764 FE0F 200D 1F48B 200D 1F468 👨‍❤️‍💋‍👨    Kiss:Man,Man    
## 3 1F469 200D 2764 FE0F 200D 1F48B 200D 1F468 👩‍❤️‍💋‍👨    Kiss:Woman,Man  
## 4 1F3F4 E0067 E0062 E0077 E006C E0073 E007F  🏴󠁧󠁢󠁷󠁬󠁳󠁿    Wales           
## 5 1F3F4 E0067 E0062 E0073 E0063 E0074 E007F  🏴󠁧󠁢󠁳󠁣󠁴󠁿    Scotland
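
The sorting step can also be written with the decreasing argument of order() and checked afterwards. The sketch below uses a made-up dictionary with ASCII stand-ins instead of real emojis (toy_dict and its entries are assumptions), so that the string lengths are easy to verify by eye.

```r
# Toy dictionary with ASCII stand-ins for emojis (names are made up)
toy_dict <- data.frame(
  emoji = c(":-)", ":-))", ":)"),
  name  = c("Smile", "BigSmile", "SmallSmile"),
  stringsAsFactors = FALSE
)

# order(..., decreasing = TRUE) sorts from longest to shortest in one step
toy_sorted <- toy_dict[order(nchar(toy_dict$emoji), decreasing = TRUE), ]
toy_sorted

# Sanity check: the lengths never increase from one row to the next
all(diff(nchar(toy_sorted$emoji)) <= 0)
```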

Exercise 8

We now have a working dictionary for replacing emojis with a textual description! Create a new variable called TextEmoRep as a copy of the LinksDel variable. Next, loop through the ordered EmojiList and, for every element in TextEmoRep, replace the contained emoji with “EMOJI_” followed by their textual description. You can use the rm_default() function from the qdapRegex package to replace custom patterns. Be sure to check the documentation so you can set the appropriate options for the function.

NB: There will be warnings in your console even if you are doing everything right, so don’t worry about those.

Clues

Loop through the dictionary sorted from longest to shortest emoji. You need to use a “for loop” to go through all emojis for all comments, one at a time. The paste() function is useful for adding the prefix “EMOJI_” at the beginning of the textual descriptions. Don’t forget to set the arguments fixed = TRUE, clean = FALSE, and trim = FALSE in your call to rm_default().

Solution 8

# assign the column to a new variable
TextEmoRep <- LinksDel

# switch off warnings
options(warn=-1)

# loop through all emojis for all comments
for (i in 1:dim(EmojiList)[1]) {

  TextEmoRep <- rm_default(TextEmoRep,
                    pattern = EmojiList[i,3],
                    replacement = paste0("EMOJI_",
                                       EmojiList[i,4],
                                       " "),
                    fixed = TRUE,
                    clean = FALSE,
                    trim = FALSE)
}

# check results
LinksDel[c(604,650,934)]

## [1] "More bullshit 😷🤢"                 "😂"                                 "That fake Native American 😂😂😩😩"

TextEmoRep[c(604,650,934)]

## [1] "More bullshit EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "                                                
## [2] "EMOJI_FaceWithTearsOfJoy "                                                                                   
## [3] "That fake Native American EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "
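
To see why the order of replacement matters, here is a small sketch that runs the same kind of loop on a toy example. It swaps in base R's gsub() with fixed = TRUE for rm_default() and uses ASCII stand-ins instead of real emojis; replace_all(), toy_dict, and toy_text are hypothetical names introduced only for this illustration.

```r
# Toy dictionary, already sorted from longest to shortest pattern
toy_dict <- data.frame(
  emoji = c(":-))", ":-)"),
  name  = c("BigSmile", "Smile"),
  stringsAsFactors = FALSE
)

# Replace every dictionary entry in turn, in the order given by dict
replace_all <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    text <- gsub(dict$emoji[i],
                 paste0("EMOJI_", dict$name[i], " "),
                 text,
                 fixed = TRUE)  # treat the pattern literally, not as a regex
  }
  text
}

toy_text <- "nice :-)) ok :-)"

# Longest first: both stand-ins are replaced completely
good <- replace_all(toy_text, toy_dict)
good

# Shortest first: ":-)" eats part of ":-))" and leaves a stray ")"
bad <- replace_all(toy_text, toy_dict[2:1, ])
bad
```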

Exercise 9

We now have the original text column, and the text column with removed hyperlinks in which emojis are replaced with their textual descriptions (TextEmoRep). We need one more variable that only contains the textual descriptions of the emojis. For this purpose, you can use the function ExtractEmoji() which we have created and stored in an R script with the same name in the folder content\R within the workshop materials. The new vector should be named Emoji.

Clues

Use the source() function to source the ExtractEmoji.R script from the content\R folder within the workshop materials and then sapply() the ExtractEmoji() function to the variable TextEmoRep. To remove useless rownames from the extracted emojis, you can set names(Emoji) to NULL.

Solution 9

# source script containing the function
source("./content/R/ExtractEmoji.R")

# apply function & remove rownames
Emoji <- sapply(TextEmoRep,ExtractEmoji)
names(Emoji) <- NULL

# check results
LinksDel[c(604,650,934)]

## [1] "More bullshit 😷🤢"                 "😂"                                 "That fake Native American 😂😂😩😩"

TextEmoRep[c(604,650,934)]

## [1] "More bullshit EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "                                                
## [2] "EMOJI_FaceWithTearsOfJoy "                                                                                   
## [3] "That fake Native American EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "

Emoji[c(604,650,934)]

## [1] "EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "                                    
## [2] "EMOJI_FaceWithTearsOfJoy "                                                         
## [3] "EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "
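
For readers without access to the ExtractEmoji.R script, the idea of the extraction step can be sketched with base R regular expressions. extract_emoji_sketch() is a hypothetical stand-in, not the workshop's function; unlike the output shown above, it does not keep a trailing space and returns an empty string for comments without emojis.

```r
# Hypothetical stand-in for ExtractEmoji(); the workshop's own version may differ
extract_emoji_sketch <- function(x) {
  # find every token that starts with "EMOJI_" and runs until the next whitespace
  hits <- regmatches(x, gregexpr("EMOJI_[^[:space:]]+", x))[[1]]
  paste(hits, collapse = " ")
}

toy_rep <- c("More text EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace ",
             "No emoji in this comment")

# USE.NAMES = FALSE avoids the names that sapply() would attach to the result
toy_emoji <- vapply(toy_rep, extract_emoji_sketch, character(1), USE.NAMES = FALSE)
toy_emoji
```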

Exercise 10

We now have selected or created all the variables we need. As a final step in this set of exercises, create a new dataframe called comments_clean that contains the following variables:

Selection$authorDisplayName
Selection$textOriginal
TextEmoRep
TextEmoDel
Emoji
Selection$likeCount
Links
Selection$publishedAt
Selection$updatedAt
Selection$parentId
Selection$id

Set the following names for the columns in the new dataframe:

Author
Text
TextEmojiReplaced
TextEmojiDeleted
Emoji
LikeCount
URL
Published
Updated
ParentId
CommentID

Save the new dataframe as an .rds file with the name ParsedLWTComments.rds in the data folder that you (should) have created for the previous set of exercises.

Clues

You can use the cbind.data.frame() function to paste together multiple columns into a dataframe. Note: You need to set the argument stringsAsFactors = FALSE if your R version is < 4.0.0 to prevent strings from being interpreted as factors. The variables Links and Emoji are lists and can contain multiple values per row. For this reason, you need to enclose them with the I() function to store them as columns within a dataframe. You can save your result using the saveRDS() function.

Solution 10

# create df dataframe (use I() function to enclose Emoji and Links)
comments_clean <- cbind.data.frame(Selection$authorDisplayName,
                                   Selection$textOriginal,
                                   TextEmoRep,
                                   TextEmoDel,
                                   I(Emoji),
                                   Selection$likeCount,
                                   I(Links),
                                   Selection$publishedAt,
                                   Selection$updatedAt,
                                   Selection$parentId,
                                   Selection$id)

# set column names
names(comments_clean) <- c("Author",
                           "Text",
                           "TextEmojiReplaced",
                           "TextEmojiDeleted",
                           "Emoji",
                           "LikeCount",
                           "URL",
                           "Published",
                           "Updated",
                           "ParentId",
                           "CommentID")

# save dataframe
saveRDS(comments_clean, file = "./data/ParsedLWTComments.rds")
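
The role of I() and the save step can be tried out on a small scale. This sketch builds a toy dataframe with a list column and checks that it survives a round trip through saveRDS() and readRDS(); toy_links, toy_df, and the temporary file are assumptions made for illustration and do not touch your data folder.

```r
# Made-up list with zero, one, or several links per comment
toy_links <- list("https://example.com",
                  NA,
                  c("https://a.example", "https://b.example"))

# I() keeps the list as a single column instead of spreading it over several
toy_df <- data.frame(Text = c("one link", "no link", "two links"),
                     URL  = I(toy_links),
                     stringsAsFactors = FALSE)

# Save to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy_df, file = tmp)
restored <- readRDS(tmp)

dim(restored)
restored$URL[[3]]
```

Unlike a CSV file, the .rds format keeps list columns and datetime classes intact, which is why it is a convenient choice for passing the cleaned comments on to later exercises.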

Johannes Breuer, Annika Deubel, & M. Rohangis Mohseni

Automatic sampling and analysis of YouTube data, February 14-15, 2023
