In these exercises, we will work with the data set you preprocessed yesterday and perform three types of sentiment analysis:

  1. Basic sentiment analysis of text
  2. Slightly more advanced sentiment analysis of text
  3. Experimental sentiment analysis on emojis

Exercise 1

Set up your R session and load your data (i.e., the parsed/processed comments) so that we can perform sentiment analysis. Assign the loaded data to an object named comments. Load the libraries needed for sentiment analysis.

If your R version is < 4.0.0, you need to use the options() function to prevent R from interpreting your character variables as factor variables. If you are not sure how to use the options() function, you can always search for it in the RStudio help panel and have a look at all the different options. Regardless of your R version, you need to load your preprocessed data set using readRDS(). Don’t forget to also load the required packages syuzhet and sentimentr using library().
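A minimal sketch of such a setup is shown here. The file path and the toy dataframe with its text column are hypothetical stand-ins so that the snippet runs on its own; in your session, point readRDS() at the file you saved yesterday.

```r
# Only needed for R < 4.0.0; harmless on newer versions
options(stringsAsFactors = FALSE)

# Hypothetical stand-in for yesterday's file: write a toy dataframe to a
# temporary .rds file so that this sketch is self-contained
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(text = c("Great video", "Not my thing")), rds_path)

# Load the (here: toy) preprocessed comments
comments <- readRDS(rds_path)

# Packages used in the following exercises
library(syuzhet)
library(sentimentr)
```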

Exercise 2

Choose the appropriate column from your comments dataframe and run a basic sentiment analysis on it using the syuzhet package. Save the comment sentiments in a new variable called BasicSentimentSyu and check whether the column has any zero values. If there are zero values, why might this be the case?

Hyperlinks and emojis might cause problems for the sentiment analysis (or any text mining methods, really). You can check whether a variable contains a given value x using the following command table(variable == x) (with variable and x replaced by the appropriate variable name and value, of course).
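As a small illustration of the table() check, using a made-up vector of scores (toy_scores is a hypothetical name, not part of your data):

```r
# Made-up sentiment scores for illustration only
toy_scores <- c(0.5, 0, -1.25, 0, 2)

# Count how many elements are (not) equal to zero
zero_check <- table(toy_scores == 0)
zero_check
```

The TRUE column of the resulting table gives the number of elements equal to the value you tested for.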

Exercise 3

Check the documentation of the syuzhet package and the get_sentiment() function to see which dictionaries are available. Create a correlation matrix for sentiment scores using the different methods (you can leave out Stanford). Which factors might lead to low correlations between the dictionaries? Which dictionary is the best one to use for our case?

You can find the documentation for the get_sentiment() function by searching for its name in the RStudio help panel or by running ?get_sentiment() in your R console. You can also search online for further information. A correlation matrix can be created with the cor() function. As this function takes a matrix or dataframe as input, you need to create one variable for each sentiment dictionary rating and combine them into a dataframe with cbind.data.frame() before passing it to cor().
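The approach can be sketched roughly as follows, assuming the syuzhet package is installed. The toy comments and object names are made up for illustration; in the exercise you would pass your own text column instead.

```r
library(syuzhet)

# Made-up comments for illustration only
toy_comments <- c(
  "I love this video, it is wonderful and amazing.",
  "This is terrible, I hate it.",
  "It was okay, nothing special.",
  "What a horrible, awful waste of time.",
  "Great song, happy memories and pure joy."
)

# One score vector per dictionary, combined into a single dataframe
sentiment_scores <- cbind.data.frame(
  syuzhet = get_sentiment(toy_comments, method = "syuzhet"),
  bing    = get_sentiment(toy_comments, method = "bing"),
  afinn   = get_sentiment(toy_comments, method = "afinn"),
  nrc     = get_sentiment(toy_comments, method = "nrc")
)

# Pairwise correlations between the dictionary-based scores
cor_matrix <- cor(sentiment_scores)
round(cor_matrix, 2)
```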

Exercise 4

Standardize the comment sentiments for the syuzhet method with respect to the total number of words in the respective comment. Call this new variable SentimentPerWord.

Computing the number of words requires multiple functions if you want to use base R. The strsplit() command splits a character string into multiple strings on a specific indicator/separator, for example a space (" "). The unlist() command converts a list of values into a regular vector. The length() function counts the number of elements in a vector, and with the sapply() function you can apply a function to each element of a vector. Using these tools, you can compute the number of words per comment.
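One way to combine these functions is sketched below on made-up comments and scores (toy_comments, toy_scores, and the other object names are hypothetical). Splitting on single spaces is a simplification: repeated spaces would inflate the count, and an empty comment yields zero words, so dividing by the count would not give a usable value.

```r
# Made-up comments and scores for illustration only
toy_comments <- c("I love this video", "Not good", "Best song ever made by anyone")
toy_scores   <- c(0.75, -0.5, 0.5)

# Split each comment on spaces, flatten the result, and count the pieces
word_counts <- sapply(
  toy_comments,
  function(x) length(unlist(strsplit(x, " "))),
  USE.NAMES = FALSE
)

# Sentiment standardized by the number of words in each comment
toy_sentiment_per_word <- toy_scores / word_counts
toy_sentiment_per_word
```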

Exercise 5

Compute comment sentiments using the sentimentr package. Compare the average comment sentiment per word from the sentimentr package with the one we computed before. Which one do you think is more trustworthy and why?

For a total sentiment score per comment, you first have to use the get_sentences() function and then use the sentiment_by() function on the sentences. To plot the two different scores against each other, you need to put them into the same dataframe with cbind.data.frame() first. You can then use the ggplot2 package for plotting.
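The two sentimentr steps might look like this, assuming the package is installed. The toy comments and object names are made up for illustration.

```r
library(sentimentr)

# Made-up comments for illustration only
toy_comments <- c(
  "I love this video. The ending was a bit sad though.",
  "Not good at all.",
  "Best song ever!"
)

# Step 1: split each comment into sentences
comment_sentences <- get_sentences(toy_comments)

# Step 2: aggregate the sentence-level scores to one row per comment
sentimentr_scores <- sentiment_by(comment_sentences)

# Average sentiment per comment
sentimentr_scores$ave_sentiment
```

The result has one row per comment, so the ave_sentiment column can be placed next to other per-comment scores in the same dataframe.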

Exercise 6

Load the emoji dictionary from the lexicon package and assign it to a new object called EmojiSentiments. Change the formatting of the dictionary entries and/or the Emoji column so that they are in the same format and can be matched. You can use the name EmojiToks for an intermediary variable if you need to create one. Afterwards, transform the EmojiSentiments dataframe to a quanteda dictionary object with the as.dictionary() function. Finally, use the tokens_lookup() function to create a new variable for emoji sentiments called EmojiToksSent.

To get an overview of all the available lexicons you can run lexicon::available_data(). The name of the emoji lexicon is “emojis_sentiment”. Lexicons can be accessed with the command lexicon::lexicon_name, using the respective name of the lexicon you want to select. You can use the paste0() and gsub() functions to bring the formatting of the emoji column in line with the dictionary. Keep in mind that a valid dictionary needs appropriate column names; you can look this up in the help file for the as.dictionary() function.
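The reformatting step with paste0() and gsub() can be illustrated on a toy table. Everything here is an assumption for demonstration: the toy names and sentiment values are made up, and the target format (an "emoji_" prefix with spaces removed) depends on how your emojis were encoded during preprocessing.

```r
# Made-up stand-in for a lexicon table; the values are not real lexicon entries
toy_lexicon <- data.frame(
  name      = c("face with tears of joy", "red heart"),
  sentiment = c(0.5, 0.75)
)

# Remove spaces from the names and add a prefix (hypothetical target format)
toy_lexicon$word <- paste0("emoji_", gsub(" ", "", toy_lexicon$name))

# Keep only the columns a dictionary conversion would need; check the
# as.dictionary() help file for the column names it expects
toy_dictionary_input <- toy_lexicon[, c("word", "sentiment")]
toy_dictionary_input
```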

Exercise 7

As a final exercise, plot the distribution of the EmojiToksSent variable.

You can use the simple hist() function from base R to create a histogram. Keep in mind though that you need to transform the tokens object back into a regular numeric vector. You can do this with the unlist() and as.numeric() functions.
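As a rough sketch, here is the conversion on a plain list of character values that stands in for the looked-up tokens (toy_lookup is hypothetical and the values are made up; your tokens object will be larger).

```r
# Hypothetical stand-in for a tokens object: one element per comment,
# each holding the matched sentiment values as character strings
toy_lookup <- list(c("0.5", "0.75"), character(0), c("0.25"))

# Flatten the list and convert the strings to numbers
emoji_scores <- as.numeric(unlist(toy_lookup))

# Histogram of the emoji sentiment scores
hist(emoji_scores, main = "Emoji sentiment scores", xlab = "Sentiment")
```

Comments without any matched emoji contribute no values after unlist(), so the histogram describes emojis rather than comments.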