R
session and assign it to an object called
comments
. Get an overview of the contained variables. What
do the variables describe? Why do we have missing data in some of them?
To load the data, you can use the readRDS()
function. To
get an overview of the contained variables, you can simply use
colnames()
or names()
(or
glimpse()
from the dplyr
package). To find out
more about what the variables mean, you can have a look at the YouTube
data API documentation and search for the respective variable
descriptions.
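The loading and inspection steps can be tried out on a small stand-in before touching the real file. The following is a minimal sketch: the data frame `toy` and its values are made up for illustration and are not part of the workshop data.

```r
# Toy stand-in for the workshop data (hypothetical values)
toy <- data.frame(
  id = c("a1", "a2"),
  textOriginal = c("First comment", "Second comment"),
  parentId = c(NA, "a1"),
  stringsAsFactors = FALSE
)

# Write the object to a temporary .rds file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy_loaded <- readRDS(tmp)

# Overview of the contained variables
names(toy_loaded)
str(toy_loaded)

# Number of missing values per column
na_counts <- colSums(is.na(toy_loaded))
na_counts
```

Counting `NA` values per column is a quick way to spot variables that are only filled for some rows.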
# Load data
comments <- readRDS("./data/RawLWTComments.rds")
# overview of columns
colnames(comments)
## [1] "videoId" "textDisplay" "textOriginal" "authorDisplayName"
## [5] "authorProfileImageUrl" "authorChannelUrl" "authorChannelId.value" "canRate"
## [9] "viewerRating" "likeCount" "publishedAt" "updatedAt"
## [13] "id" "parentId" "moderationStatus"
authorProfileImageUrl
, authorChannelUrl
,
authorChannelId.value
, canRate
,
viewerRating
, moderationStatus
. Create a new
dataframe called Selection
containing only the remaining
variables.
You can use the subset()
function from
base R
to keep or remove a selection of variables from a
dataframe. For more information on how to use it, have a look at its
documentation by running ?subset()
.
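As a small illustration of the `select` argument, here is a sketch on a made-up data frame (the object `toy` and its values are hypothetical). A minus sign in front of `c()` drops the listed columns, while a plain `c()` keeps them.

```r
# Hypothetical toy data with two columns we do not need
toy <- data.frame(
  id = 1:3,
  likeCount = c("0", "2", "5"),
  canRate = TRUE,
  viewerRating = "none",
  stringsAsFactors = FALSE
)

# Drop columns by negating the selection
kept <- subset(toy, select = -c(canRate, viewerRating))

# Equivalent: name the columns to keep
kept_alt <- subset(toy, select = c(id, likeCount))

names(kept)
```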
# Select only the columns we need
Selection <- subset(comments, select = -c(authorProfileImageUrl,
                                          authorChannelUrl,
                                          authorChannelId.value,
                                          videoId,
                                          canRate,
                                          viewerRating,
                                          moderationStatus))
# Alternatively, you could, of course, also use dplyr::select()
# Check selection
colnames(Selection)
Check the class of the variable publishedAt
in your new
dataframe. Is this class suitable for further analysis? If not, change
the class to the appropriate one and compute the time difference in
publishing dates between the comment in the first row and the comment in
the last row.
Do the same transformation for the variable
updatedAt
.
To check the class of the publishedAt
variable, you can
use the class()
function. You can find information about the
formatting of the comment timestamps in the YouTube
API documentation. To transform character strings into datetime
objects in R
, you can use the base R
function
as.POSIXct()
. However, we recommend using the
anytime()
function from the package of the same name, as
it is more convenient (Note: If you are a
tidyverse
aficionado, you can also use functions from the
lubridate
package for this task).
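The conversion and the time difference can be sketched with base R alone. The two timestamps below are made up for illustration; the sketch assumes they follow the same ISO 8601 pattern with a trailing "Z" (UTC) as the values in the data.

```r
# Hypothetical timestamps in the same format as publishedAt
stamps <- c("2019-11-18T07:30:00Z", "2019-11-17T07:00:00Z")
class(stamps)

# Parse as UTC; without tz, the local time zone would be assumed
parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
class(parsed)

# Time difference between the first and the last element, in hours
gap <- difftime(parsed[1], parsed[length(parsed)], units = "hours")
gap
```

Using `difftime()` instead of plain subtraction lets you fix the unit explicitly rather than leaving the choice to R.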
# check variable class
class(Selection$publishedAt)
# transform to datetime object with as.POSIXct (the trailing "Z" marks UTC, so set tz explicitly)
DateTime <- as.POSIXct(Selection$publishedAt, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
# transform to datetime object with anytime
library(anytime)
Selection$publishedAt <- anytime(Selection$publishedAt,asUTC = TRUE)
# recheck variable class
class(Selection$publishedAt)
# compute time difference in publishing time between first and last comment
Selection$publishedAt[1] - Selection$publishedAt[nrow(Selection)]
# transform the updatedAt variable as well
Selection$updatedAt <- anytime(Selection$updatedAt,asUTC = TRUE)
Check the likeCount
variable in your data. Is it
suitable for numeric analysis? If not, transform it to the appropriate
class and test whether your transformation worked.
You can use the class()
function to check the class of
an object in R
. To change a class, for example from
character to numeric, you can use the family of “as”-functions, for
example as.numeric()
.
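A minimal sketch of the check, conversion, and re-check on a made-up character vector (the values are hypothetical). Strings that cannot be parsed as numbers would become `NA` with a warning, so it is worth testing for missing values afterwards.

```r
# Hypothetical like counts stored as character strings
likes <- c("0", "14", "5642")
class(likes)

# Convert to numeric and recheck the class
likes_num <- as.numeric(likes)
class(likes_num)

# Test whether the transformation worked: no values were lost
anyNA(likes_num)
summary(likes_num)
```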
# check variable class
class(Selection$likeCount)
## [1] "character"
# transform class
Selection$likeCount <- as.numeric(Selection$likeCount)
# recheck class
class(Selection$likeCount)
## [1] "numeric"
summary(Selection$likeCount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 14.41 1.00 5642.00
Check the textOriginal
column in your
Selection
dataframe. Some comments contain hyperlinks that
we should remove for later text analysis steps. Extract the hyperlinks
from the textOriginal
column into a new list called
Links
. In addition, create a new variable called
LinksDel
that contains the text from
textOriginal
without hyperlinks.
The qdapRegex
package offers many pre-built functions
for detecting, removing, and replacing specific character strings. You
can, for example, use the rm_url()
function for extracting
and replacing hyperlinks. As a reminder: You can check the documentation
for this function with ?rm_url()
.
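To see what extracting and removing hyperlinks does, here is a rough base R sketch. It does not use rm_url(); instead it assumes a deliberately simplified URL pattern ("http://" or "https://" followed by non-whitespace characters), and the example texts are made up. Unlike the rm_url() output shown in this section, regmatches() returns an empty character vector rather than NA for texts without a link.

```r
# Hypothetical comments, two of which contain a link
texts <- c("https://example.org/watch?v=1",
           "No link here",
           "Watch this! https://example.org/abc/")

# Simplified, assumed URL pattern (not the one qdapRegex uses)
url_pattern <- "https?://[^[:space:]]+"

# Extract all matches per text into a list
links <- regmatches(texts, gregexpr(url_pattern, texts))

# Remove the matches and trim leftover whitespace
no_links <- trimws(gsub(url_pattern, "", texts))

links
no_links
```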
# load package
library(qdapRegex)
##
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:dplyr':
##
## explain
## The following object is masked from 'package:ggplot2':
##
## %+%
# check column
Selection$textOriginal[396:406]
## [1] "https://www.youtube.com/watch?v=Bs4oSQWdWWw"
## [2] "America = \"Census is the most difficult and the largest peacetime operation was undertaken by the government\"\n\n\nIndia = \"That`s Cute\""
## [3] "The US Census Bureau can conduct periodic surveys that could conceivably ask the number of toilets in a home."
## [4] "I live in providence county and I didnt even know this happened lmfao"
## [5] "Trump confused a real estate appraiser and a census taker..."
## [6] "These people who don’t like the census remind me of Ron Swanson"
## [7] "It’s even worse now since Trump signed an executive order that would make a “second census” that would specifically ask if a person is a citizen"
## [8] "Watch this entertaining TikTok video based on the census. Be sure to stay and watch the whole video! https://vm.tiktok.com/7r2pWx/"
## [9] "- Census Man -\nStill a more useful superhero than Aquaman."
## [10] "I am honestly a little confused about the US, aren't everybody required to have some form of nation number? And doesn't that update when you file for where you live? Why would a census being needed? Shouldn't the government already have that information?"
## [11] "How many toilets do thy have, how many desks do thy have , what's your roof made of .................WTF"
# extract hyperlinks
Links <- rm_url(Selection$textOriginal, extract = TRUE)
Links[396:406]
## [[1]]
## [1] "https://www.youtube.com/watch?v=Bs4oSQWdWWw"
##
## [[2]]
## [1] NA
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] NA
##
## [[7]]
## [1] NA
##
## [[8]]
## [1] "https://vm.tiktok.com/7r2pWx/"
##
## [[9]]
## [1] NA
##
## [[10]]
## [1] NA
##
## [[11]]
## [1] NA
# remove hyperlinks
LinksDel <- rm_url(Selection$textOriginal)
LinksDel[396:406]
## [1] ""
## [2] "America = \"Census is the most difficult and the largest peacetime operation was undertaken by the government\" India = \"That`s Cute\""
## [3] "The US Census Bureau can conduct periodic surveys that could conceivably ask the number of toilets in a home."
## [4] "I live in providence county and I didnt even know this happened lmfao"
## [5] "Trump confused a real estate appraiser and a census taker..."
## [6] "These people who don’t like the census remind me of Ron Swanson"
## [7] "It’s even worse now since Trump signed an executive order that would make a “second census” that would specifically ask if a person is a citizen"
## [8] "Watch this entertaining TikTok video based on the census. Be sure to stay and watch the whole video!"
## [9] "- Census Man - Still a more useful superhero than Aquaman."
## [10] "I am honestly a little confused about the US, aren't everybody required to have some form of nation number? And doesn't that update when you file for where you live? Why would a census being needed? Shouldn't the government already have that information?"
## [11] "How many toilets do thy have, how many desks do thy have , what's your roof made of .................WTF"
While hyperlinks have been removed in the new LinksDel
variable, the strings therein still contain emojis. For our later
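analysis, we want to do three things: remove the emojis from the text, replace them with textual descriptions, and extract those descriptions into a separate variable.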
To achieve this, we first need a dictionary of emojis and their
corresponding textual descriptions in a usable format. Load the
emo
package and have a look at the contained dataframe
jis
. Assign it to a new object called
EmojiList
. Afterwards, source the provided
CamelCase.R
script (contained in the folder
content\R
within the workshop materials) to transform the
textual description from regular case to CamelCase. Finally, create a
new variable called TextEmoDel
containing the text without
the emoji.
We created a function that capitalizes the first character of each
word. The function is called simpleCap()
and the name of
the script in which it is stored is CamelCase.R
. You can
load it into your workspace using the source()
function and
specifying its location. You can find the script containing this
function in the folder content\R
within the workshop
materials. Keep in mind that this function only capitalizes the first
letters of each word, so you still need to get rid of the extra space
characters. The gsub()
function is a handy tool for this
purpose. You can use the ji_replace_all()
function from the
emo package to replace emojis with an empty string ("").
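The two-step transformation (capitalize each word, then remove the spaces) can be sketched as follows. The function `cap_words()` is a hypothetical stand-in written for this illustration; it is not the simpleCap() function from CamelCase.R, whose implementation is not shown here. The emoji names are made-up examples in the style of the dictionary.

```r
# Hypothetical stand-in: capitalize the first letter of each word
cap_words <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(toupper(substring(words, 1, 1)), substring(words, 2),
        sep = "", collapse = " ")
}

# Example names in regular case
emoji_names <- c("grinning face", "face with tears of joy")

# Step 1: capitalize every word (spaces are still there)
capped <- vapply(emoji_names, cap_words, character(1), USE.NAMES = FALSE)

# Step 2: delete the spaces to obtain CamelCase
camel <- gsub(" ", "", capped, fixed = TRUE)
camel
```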
# load package
library(emo)
# source script
source("./content/R/CamelCase.R")
# reassign the jis dataframe from the emo package to a new object
EmojiList <- jis
# apply the function to all emoji names
CamelCaseEmojis <- lapply(jis$name, simpleCap)
# delete empty spaces
CollapsedEmojis <- lapply(CamelCaseEmojis,function(x){gsub(" ", "", x, fixed = TRUE)})
# replace the names column in the EmojiList df with the new name column
EmojiList[,4] <- unlist(CollapsedEmojis)
# check the first 10 entries in the new df
EmojiList[1:10,c(1,3,4)]
## # A tibble: 10 × 3
## runes emoji name
## <chr> <chr> <chr>
## 1 1F600 😀 GrinningFace
## 2 1F601 😁 BeamingFaceWithSmilingEyes
## 3 1F602 😂 FaceWithTearsOfJoy
## 4 1F923 🤣 RollingOnTheFloorLaughing
## 5 1F603 😃 GrinningFaceWithBigEyes
## 6 1F604 😄 GrinningFaceWithSmilingEyes
## 7 1F605 😅 GrinningFaceWithSweat
## 8 1F606 😆 GrinningSquintingFace
## 9 1F609 😉 WinkingFace
## 10 1F60A 😊 SmilingFaceWithSmilingEyes
# create a new text column with emojis (& hyperlinks) removed
TextEmoDel <- ji_replace_all(LinksDel,"")
# Check content of new vectors
LinksDel[c(604,650,934)]
## [1] "More bullshit 😷🤢" "😂" "That fake Native American 😂😂😩😩"
TextEmoDel[c(604,650,934)]
## [1] "More bullshit " "" "That fake Native American "
Ultimately, we want to use our EmojiList
dataframe to
replace the instances of emojis in our text with textual descriptions.
We can do that by looping over all emojis in all texts and replacing
them one at a time. There is a problem, however: Some emoji strings are
made up of multiple “shorter” emoji strings. If we match parts of a
“longer” emoji string and replace it with its textual description, the
rest will become unreadable. For this reason, we need to make sure that
we replace the emoji from longest to shortest string.
Sort the EmojiList
dataframe by the length of the
emoji
column from longest to shortest.
You can count the number of characters in a vector of text using the
nchar()
function. You can reorder dataframes using the
order()
function and you can reverse an order with the
rev()
function (Note: The tidyverse
equivalent here would be to use arrange(desc())
from the
dplyr
package).
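A small sketch of the sorting step on a made-up dictionary, with ASCII emoticons standing in for emojis (all values are hypothetical). It uses `order(..., decreasing = TRUE)`, which gives the same longest-to-shortest result as `rev(order(...))` except that entries of equal length may end up in a different relative order.

```r
# Hypothetical dictionary; ASCII emoticons stand in for emojis
dict <- data.frame(
  emoji = c(":)", ":-))", ":-)"),
  name = c("Smile", "BigSmile", "NoseSmile"),
  stringsAsFactors = FALSE
)

# Sort rows by the number of characters, longest first
dict_sorted <- dict[order(nchar(dict$emoji), decreasing = TRUE), ]
dict_sorted$emoji
```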
# order from longest to shortest
EmojiList <- EmojiList[rev(order(nchar(EmojiList$emoji))),]
# overview of new order
head(EmojiList[,c(1,3,4)],5)
## # A tibble: 5 × 3
## runes emoji name
## <chr> <chr> <chr>
## 1 1F469 200D 2764 FE0F 200D 1F48B 200D 1F469 👩❤️💋👩 Kiss:Woman,Woman
## 2 1F468 200D 2764 FE0F 200D 1F48B 200D 1F468 👨❤️💋👨 Kiss:Man,Man
## 3 1F469 200D 2764 FE0F 200D 1F48B 200D 1F468 👩❤️💋👨 Kiss:Woman,Man
## 4 1F3F4 E0067 E0062 E0077 E006C E0073 E007F 🏴 Wales
## 5 1F3F4 E0067 E0062 E0073 E0063 E0074 E007F 🏴 Scotland
We now have a working dictionary for replacing emojis with a textual
description! Create a new variable called TextEmoRep
as a
copy of the LinksDel
variable. Next, loop through the
ordered EmojiList
and, for every element in
TextEmoRep
, replace the contained emoji with “EMOJI_”
followed by their textual description. You can use the
rm_default()
function from the qdapRegex
package to replace custom patterns. Be sure to check the documentation
so you can set the appropriate options for the function.
NB: There will be warnings in your console even if you are doing everything right, so don’t worry about those.
Loop through the dictionary sorted from longest to shortest emoji.
You need to use a “for loop” to go through all emojis for all comments,
one at a time. The paste()
function is useful for adding
the prefix “EMOJI_” at the beginning of the textual descriptions. Don’t
forget to set the arguments fixed = TRUE
,
clean = FALSE
and trim = FALSE
in your call to
rm_default().
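Why the order matters can be seen in a toy version of the loop. This sketch uses base R's gsub() with `fixed = TRUE` instead of rm_default(), and ASCII emoticons instead of emojis; the dictionary, the helper `replace_all()`, and the example comments are all hypothetical.

```r
# Hypothetical dictionary, already sorted from longest to shortest
dict <- data.frame(
  emoji = c(":))", ":)"),
  name = c("BigSmile", "Smile"),
  stringsAsFactors = FALSE
)

# Replace every dictionary entry in turn, in the order given
replace_all <- function(text, dict) {
  for (i in seq_len(nrow(dict))) {
    text <- gsub(dict$emoji[i],
                 paste0("EMOJI_", dict$name[i], " "),
                 text,
                 fixed = TRUE)
  }
  text
}

comments_toy <- c("Nice video :))", "ok :)")

# Longest first: the longer emoticon is matched as a whole
longest_first <- replace_all(comments_toy, dict)

# Shortest first: ":)" eats part of ":))" and leaves a stray ")"
shortest_first <- replace_all(comments_toy, dict[rev(seq_len(nrow(dict))), ])

longest_first
shortest_first
```

In the shortest-first run, the first comment ends with a leftover ")" that no dictionary entry can match any more, which is the problem described above.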
# assign the column to a new variable
TextEmoRep <- LinksDel
# switch off warnings
options(warn = -1)
# loop through all emojis for all comments
for (i in seq_len(nrow(EmojiList))) {
  TextEmoRep <- rm_default(TextEmoRep,
                           pattern = EmojiList[i,3],
                           replacement = paste0("EMOJI_",
                                                EmojiList[i,4],
                                                " "),
                           fixed = TRUE,
                           clean = FALSE,
                           trim = FALSE)
}
# switch warnings back on (0 is the default setting)
options(warn = 0)
# check results
LinksDel[c(604,650,934)]
## [1] "More bullshit 😷🤢" "😂" "That fake Native American 😂😂😩😩"
TextEmoRep[c(604,650,934)]
## [1] "More bullshit EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "
## [2] "EMOJI_FaceWithTearsOfJoy "
## [3] "That fake Native American EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "
We now have the original text column, and the text column with
removed hyperlinks in which emojis are replaced with their textual
descriptions (TextEmoRep
). We need one more variable that
only contains the textual descriptions of the emojis. For this
purpose, you can use the function ExtractEmoji()
which we
have created and stored in an R
script with the same name
in the folder content\R
within the workshop materials. The
new vector should be named Emoji
.
Use the source()
function to source the
ExtractEmoji.R
script from the content\R
folder within the workshop materials and then sapply()
the
ExtractEmoji()
function to the variable
TextEmoRep
. To remove useless rownames from the extracted
emojis, you can set names(Emoji)
to NULL.
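The sapply() pattern and the name removal can be sketched with a stand-in function. `extract_tokens()` below is hypothetical and is not the ExtractEmoji() function from the workshop script; it assumes a simplified pattern in which a description consists only of letters and digits after the "EMOJI_" prefix, so names containing characters such as ":" or "," would be cut short.

```r
# Hypothetical stand-in: keep only the "EMOJI_..." tokens of one string
extract_tokens <- function(x) {
  hits <- regmatches(x, gregexpr("EMOJI_[[:alnum:]]+", x))[[1]]
  paste(hits, collapse = " ")
}

# Made-up input in the style of TextEmoRep
replaced <- c("More EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace ",
              "no emoji here")

# sapply() uses the input strings as names of the result
tokens <- sapply(replaced, extract_tokens)
had_names <- !is.null(names(tokens))

# Remove the names
names(tokens) <- NULL
tokens
```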
# source script containing the function
source("./content/R/ExtractEmoji.R")
# apply function & remove rownames
Emoji <- sapply(TextEmoRep,ExtractEmoji)
names(Emoji) <- NULL
# check results
LinksDel[c(604,650,934)]
## [1] "More bullshit 😷🤢" "😂" "That fake Native American 😂😂😩😩"
TextEmoRep[c(604,650,934)]
## [1] "More bullshit EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "
## [2] "EMOJI_FaceWithTearsOfJoy "
## [3] "That fake Native American EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "
Emoji[c(604,650,934)]
## [1] "EMOJI_FaceWithMedicalMask EMOJI_NauseatedFace "
## [2] "EMOJI_FaceWithTearsOfJoy "
## [3] "EMOJI_FaceWithTearsOfJoy EMOJI_FaceWithTearsOfJoy EMOJI_WearyFace EMOJI_WearyFace "
We now have selected or created all the variables we need. As a final
step in this set of exercises, create a new dataframe called
comments_clean
that contains the following variables:
Selection$authorDisplayName
Selection$textOriginal
TextEmoRep
TextEmoDel
Emoji
Selection$likeCount
Links
Selection$publishedAt
Selection$updatedAt
Selection$parentId
Selection$id
Set the following names for the columns in the new dataframe:
Author
Text
TextEmojiReplaced
TextEmojiDeleted
Emoji
LikeCount
URL
Published
Updated
ParentId
CommentID
Save the new dataframe as an .rds
file with the name
ParsedLWTComments.rds
in the data
folder that
you (should) have created for the previous set of exercises.
You can use the cbind.data.frame()
function to paste
together multiple columns into a dataframe. Note: You need to
set the argument stringsAsFactors = FALSE
if your
R
version is < 4.0.0 to prevent strings from being
interpreted as factors. The variables Links
and
Emoji
are lists and can contain multiple values per row.
For this reason, you need to enclose them with the I()
function to store them as columns within a dataframe. You can save your
result using the saveRDS()
function.
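How I() keeps a list as a single column can be tried out on a tiny example. The vectors below are made up for illustration; the second list element is NA to mimic a comment without a link.

```r
# Hypothetical inputs: two plain vectors and one list
author <- c("A", "B")
links <- list(c("https://example.org/1", "https://example.org/2"), NA)
likes <- c(3, 0)

# I() protects the list so that it is stored as one column
toy_clean <- cbind.data.frame(author,
                              I(links),
                              likes,
                              stringsAsFactors = FALSE)

# Set column names
names(toy_clean) <- c("Author", "URL", "LikeCount")

# Save to a temporary file and read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy_clean, file = tmp)
toy_reloaded <- readRDS(tmp)

str(toy_reloaded)
```

The first row of `URL` holds two links, while the data frame still has only two rows, one per comment.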
# create df dataframe (use I() function to enclose Emoji and Links)
comments_clean <- cbind.data.frame(Selection$authorDisplayName,
                                   Selection$textOriginal,
                                   TextEmoRep,
                                   TextEmoDel,
                                   I(Emoji),
                                   Selection$likeCount,
                                   I(Links),
                                   Selection$publishedAt,
                                   Selection$updatedAt,
                                   Selection$parentId,
                                   Selection$id)
# set column names
names(comments_clean) <- c("Author",
                           "Text",
                           "TextEmojiReplaced",
                           "TextEmojiDeleted",
                           "Emoji",
                           "LikeCount",
                           "URL",
                           "Published",
                           "Updated",
                           "ParentId",
                           "CommentID")
# save dataframe
saveRDS(comments_clean, file = "./data/ParsedLWTComments.rds")