R
session and assign it to an object called
comments
. Get an overview of the contained variables. What
do the variables describe? Why do we have missing data in some of them?
To load the data, you can use the readRDS()
function. To
get an overview of the contained variables, you can simply use
colnames()
or names()
(or
glimpse()
from the dplyr
package). To find out
more about what the variables mean, you can have a look at the YouTube
data API documentation and search for the respective variable
descriptions.
authorProfileImageUrl
, authorChannelUrl
,
authorChannelId.value
,canRate
,
viewerRating
, moderationStatus
. Create a new
dataframe called Selection
containing only the remaining
variables.
You can use the subset()
function from
base R
to keep or remove a selection of variables from a
dataframe. For more information on how to use it, have a look at its
documentation by running ?subset()
.
Check the class of the variable publishedAt
in your new
dataframe. Is this class suitable for further analysis? If not, change
the class to the appropriate one and compute the time difference in
publishing dates between the comment in the first row and the comment in
the last row.
Do the same transformation for the variable
updatedAt
.
To check the class of the publishedAt
variable, you can
use the class()
function. You can get information about
formatting of the comment timestamp from the YouTube
API documentation. To transform character strings into datetime
objects in R
, you can use the base R
function
as.POSIXct()
, However, we would recommend using the
anytime()
function from the package with the same name as
that is more convenient (Note: If you are a
tidyverse
afficionado, you can also use functions from the
lubridate
package for this task).
Check the likeCount
variable in your data. Is it
suitable for numeric analysis? If not, transform it to the appropriate
class and test whether your transformation worked.
You can use the class()
function to check the class of
an object in R
. To change a class, for example from
character to numeric, you can use the family of “as”-functions, for
example as.numeric()
.
Check the textOriginal
column in your
Selection
dataframe. Some comments contain hyperlinks that
we should remove for later text analysis steps. Extract the hyperlinks
from the textOriginal
column into a new list called
Links
. In addition, create a new variable called
LinksDel
that contains the text from
textOriginal
without hyperlinks.
The qdapRegex
package offers many pre-built functions
for detecting, removing, and replacing specific character strings. You
can, for example, use the rm_url()
function for extracting
and replacing hyperlinks. As a reminder: You can check the documentation
for this function with ?rm_url()
.
While hyperlinks have been removed in the new LinksDel
variable, the strings therein still contain emojis. For our later
analysis, we want to do three things:
To achieve this, we first need a dictionary of emojis and their
corresponding textual descriptions in a usable format. Load the
emo
package and have a look at the contained dataframe
jis
. Assign it to a new object called
EmojiList
. Afterwards, source the provided
CamelCase.R
script (contained in the folder
content\R
within the workshop materials) to transform the
textual description from regular case to CamelCase. Finally, create a
new variable called TextEmoDel
containing the text without
the emoji.
We created a function that capitalizes the first character of each
word. The function is called simpleCap()
and the name of
the in which the function is stored is CamelCase.R
. You can
load it into your workspace using the source()
function and
specifying its location. You can find the script containing this
function in the folder content\R
within the workshop
materials. Keep in mind that this function only capitalizes the first
letters of each word, so you still need to get rid of the extra space
characters. The gsub()
function is a handy tool for this
purpose. You can use the ji_replace_all()
function from the
emo package to replace emojis with an empty string (““).
Ultimately, we want to use our EmojiList
dataframe to
replace the instances of emojis in our text with textual descriptions.
We can do that by looping over all emojis in all texts and replacing
them one at a time. There is a problem, however: Some emoji strings are
made up of multiple “shorter” emoji strings. If we match parts of a
“longer” emoji string and replace it with its textual description, the
rest will become unreadable. For this reason, we need to make sure that
we replace the emoji from longest to shortest string.
Sort the EmojiList
dataframe by the length of the
emoji
column from longest to shortest.
You can count the number of characters in a vector of text using the
nchar()
function. You can reorder dataframes using the
order()
function and you can reverse an order with the
rev()
function (Note: The tidyverse
equivalent here would be to use arrange(desc())
from the
dplyr
package).
We now have a working dictionary for replacing emojis with a textual
description! Create a new variable called TextEmoRep
as a
copy of the LinksDel
variable. Next, loop through the
ordered EmojiList
and, for every element in
TextEmoRep
, replace the contained emoji with “EMOJI_”
followed by their textual description. You can use the
rm_default()
function from the qdapRegex
package to replace custom patterns. Be sure to check the documentation
so you can set the appropriate options for the function.
NB: There will be warnings in your console even if you are doing everything right, so don’t worry about those.
Loop through the dictionary sorted from longest to shortest emoji.
You need to use a “for loop” to go through all emojis for all comments,
one at a time. The paste()
function is useful for adding
the prefix “EMOJI_” at the beginning of the textual descriptions. Don’t
forget to set the arguments fixed = TRUE
,
clean = TRUE
and trim = FALSE
in your call to
rm_default()
We now have the original text column, and the text column with
removed hyperlinks in which emojis are replaced with their textual
descriptions (TextEmoRep
). We need one more variable that
only contains the textual descriptions of the emojis. For this
purpose, you can use the function ExtractEmoji()
which we
have created and stored in an R
script with the same name
in the folder content\R
within the workshop materials. The
new vector should be named Emoji
.
Use the source()
function to source the
ExtractEmoji.R
script from the content\R
folder within the workshop materials and then sapply()
the
ExtractEmoji()
function to the variable
TextEmoRep
. To remove useless rownames from the extracted
emojis, you can set names(Emoji)
to NULL
We now have selected or created all the variables we need. As a final
step in this set of exercises, create a new dataframe called
comments_clean
that contains the following variables:
Selection$authorDisplayName
Selection$textOriginal
TextEmoRep
TextEmoDel
Emoji
Selection$likeCount
Links
Selection$publishedAt
Selection$updatedAt
Selection$parentId
Selection$id
Set the following names for the columns in the new dataframe:
Author
Text
TextEmojiReplaced
TextEmojiDeleted
Emoji
LikeCount
URL
Published
Updated
ParentId
CommentID
Save the new dataframe as an .rds
file with the name
ParsedLWTComments.rds
in the data
folder that
you (should) have created for the previous set of exercises.
You can use the cbind.data.frame()
function to paste
together multiple columns into a dataframe. Note: You need to
set the argument stringsAsFactors = FALSE
if your
R
version is < 4.0.0 to prevent strings from being
interpreted as factors. The variables Links
and
Emoji
are lists and can contain multiple values per row.
For this reason, you need to enclose them with the I()
function to store them as columns within a dataframe. You can save your
result using the saveRDS()
function.