Exercise 3_1_1: Summary statistics

For this exercise, we will use the same subset of the data from the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany as in the lecture. If you have stored that data set as an .rds file as shown in the slides, you can load it with the following command:

corona_survey <- readRDS("./data/corona_survey.rds")

If you have not saved the wrangled data as an .rds file yet, you need to go through the data wrangling pipeline shown in the EDA slides (again).
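
As a reminder of how the .rds round trip works, here is a minimal sketch using a small made-up data frame and a temporary file. The names `toy`, `tmp_file`, and `toy_reloaded` are placeholders and not part of the course materials.

```r
# Toy stand-in for the wrangled survey data (made-up values)
toy <- data.frame(id = 1:3,
                  sum_sources = c(2, NA, 4))

# Write the object to a temporary .rds file and read it back
tmp_file <- tempfile(fileext = ".rds")
saveRDS(toy, tmp_file)
toy_reloaded <- readRDS(tmp_file)

# Check that the reloaded object matches the original, including column types
identical(toy, toy_reloaded)
```

Unlike a .csv export, an .rds file keeps R-specific attributes such as factor levels, which is why it is convenient for storing wrangled data.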

Also, if you have not done so yet, please install summarytools and psych, as we will need them for the exercises (in addition to base R and the tidyverse packages). The following code chunk checks whether these packages are installed and installs them if they are not.

if (!require(summarytools)) install.packages("summarytools")
if (!require(psych)) install.packages("psych")

1

Using a base R function, print some basic summary statistics for the variables sum_sources and sum_measures.

Clues

We can use the dplyr function for selecting variables and pipe the result into the required function.

solution

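library(tidyverse)  # provides %>%, select(), and starts_with()

corona_survey %>% 
  select(starts_with("sum")) %>%  # keeps sum_measures and sum_sources
  summary()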

##   sum_measures    sum_sources   
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:3.000   1st Qu.:1.000  
##  Median :4.000   Median :2.000  
##  Mean   :3.771   Mean   :2.086  
##  3rd Qu.:5.000   3rd Qu.:3.000  
##  Max.   :6.000   Max.   :5.000  
##  NA's   :579     NA's   :596
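
The same kind of summary can also be produced without dplyr by indexing the columns directly. The sketch below uses a small made-up data frame (`toy`) in place of corona_survey, so the numbers are illustrative only.

```r
# Made-up stand-in for corona_survey (values are illustrative only)
toy <- data.frame(sum_measures = c(3, 4, NA, 5, 6),
                  sum_sources  = c(1, 2, 2, NA, 3),
                  other_var    = c("a", "b", "c", "d", "e"))

# Select the two columns by name and summarize them with base R
toy_summary <- summary(toy[, c("sum_sources", "sum_measures")])
toy_summary

# For a single numeric vector, summary() returns a named vector and
# reports missing values in an extra "NA's" element
single_summary <- summary(toy$sum_measures)
single_summary
```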

2

Using a function from the psych package, print the summary statistics for all variables that assess how much people trust specific people or institutions in dealing with the Corona virus. The summary statistics should include IQR but no measures of skew (and kurtosis).

Clues

All names of the variables we are interested in here start with “trust”. You can find information about the arguments of the describe() function in its help file (?describe).

solution

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:scales':
## 
##     alpha, rescale

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

corona_survey %>% 
  select(starts_with("trust")) %>% 
  describe(skew = FALSE,
           IQR = TRUE)

##                  vars    n mean   sd min max range   se IQR
## trust_rki           1 3095 4.44 0.77   1   5     4 0.01   1
## trust_government    2 3134 3.66 1.01   1   5     4 0.02   1
## trust_chancellor    3 3130 3.57 1.15   1   5     4 0.02   1
## trust_who           4 3103 3.97 0.95   1   5     4 0.02   1
## trust_scientists    5 3107 4.24 0.79   1   5     4 0.01   1
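
To see what the n, se, range, and IQR columns represent, they can be recomputed with base R. The sketch uses a made-up response vector `x`. The formula se = sd / sqrt(n) is consistent with the table above (for trust_rki, 0.77 / sqrt(3095) is roughly 0.01), but ?describe remains the authoritative source for the definitions.

```r
# Made-up 1-5 ratings with one missing value (illustrative only)
x <- c(5, 4, 4, 3, 5, NA, 2, 4)

n_valid <- sum(!is.na(x))                # number of non-missing values
x_sd    <- sd(x, na.rm = TRUE)           # standard deviation
x_se    <- x_sd / sqrt(n_valid)          # standard error of the mean
x_range <- diff(range(x, na.rm = TRUE))  # maximum minus minimum
x_iqr   <- IQR(x, na.rm = TRUE)          # 75th minus 25th percentile

round(c(n = n_valid, sd = x_sd, se = x_se, range = x_range, IQR = x_iqr), 2)
```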

3

Use a function from the summarytools package to get summary statistics for the following variables in your dataset: left_right, sum_measures, mean_trust. Unlike in the lecture, however, we now want all stats (not just the “common” ones).

Clues

You can check the arguments for the function we need via ?descr.

solution

library(summarytools)

corona_survey %>% 
  select(left_right,
         sum_measures,
         mean_trust) %>%
  descr(stats = "all")  # request all statistics explicitly

## Descriptive Statistics  
## 
##                     left_right   mean_trust   sum_measures
## ----------------- ------------ ------------ --------------
##              Mean         4.66         3.98           3.77
##           Std.Dev         1.86         0.75           1.16
##               Min         0.00         1.00           0.00
##                Q1         3.00         3.60           3.00
##            Median         5.00         4.00           4.00
##                Q3         6.00         4.60           5.00
##               Max        10.00         5.00           6.00
##               MAD         1.48         0.59           1.48
##               IQR         3.00         1.00           2.00
##                CV         0.40         0.19           0.31
##          Skewness        -0.10        -0.94          -1.14
##       SE.Skewness         0.04         0.04           0.04
##          Kurtosis        -0.16         1.01           1.43
##           N.Valid      3678.00      3157.00        3186.00
##         Pct.Valid        97.69        83.85          84.62
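
Two of the less familiar rows can be cross-checked by hand. Assuming that CV is the standard deviation divided by the mean and that MAD is the scaled median absolute deviation as returned by base R's mad(), the rounded values in the table can be reproduced as follows. The vector `x` is made up for illustration.

```r
# Rounded means and standard deviations copied from the table above
means <- c(left_right = 4.66, mean_trust = 3.98, sum_measures = 3.77)
sds   <- c(left_right = 1.86, mean_trust = 0.75, sum_measures = 1.16)

# Coefficient of variation: standard deviation relative to the mean
cv <- round(sds / means, 2)
cv

# mad() multiplies the raw median absolute deviation by 1.4826 by default,
# so a raw deviation of 1 scale point shows up as 1.48
x <- c(2, 3, 4, 5, 6)  # made-up values with a raw MAD of 1
mad_x <- round(mad(x), 2)
mad_x
```

The results match the CV row (0.40, 0.19, 0.31) and the MAD of 1.48 reported for left_right and sum_measures.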

4

Now, let’s use functions from dplyr to create grouped summary statistics. Compute separate means for the variables risk_self and risk_surroundings for the different age groups in the data set. The resulting summary variables should be called risk_self_mean and risk_surroundings_mean. You should exclude respondents with missing values for the variables of interest.

Clues

You need to group and summarize the data. There are (at least) two different ways of doing this.

solution

# This is the option that requires more typing but is easier to code
corona_survey %>% 
  select(age_cat,
         starts_with("risk")) %>% 
  drop_na() %>% 
  group_by(age_cat) %>% 
  summarize(risk_self_mean = mean(risk_self),
            risk_surroundings_mean = mean(risk_surroundings))

## # A tibble: 10 x 3
##    age_cat        risk_self_mean risk_surroundings_mean
##    <ord>                   <dbl>                  <dbl>
##  1 <= 25 years              4.36                   5.10
##  2 26 to 30 years           4.51                   5.19
##  3 31 to 35 years           4.58                   5.24
##  4 36 to 40 years           4.40                   4.98
##  5 41 to 45 years           4.37                   4.82
##  6 46 to 50 years           4.28                   4.68
##  7 51 to 60 years           4.12                   4.54
##  8 61 to 65 years           3.82                   4.27
##  9 66 to 70 years           3.74                   4.15
## 10 >= 71 years              3.38                   3.71

# This option is more elegant but somewhat harder to write
# (current dplyr versions document the placeholders as {.col} and {.fn})
# corona_survey %>%
#   select(age_cat,
#          starts_with("risk")) %>%
#   drop_na() %>%
#   group_by(age_cat) %>%
#   summarize(across(starts_with("risk"),
#                    list(mean = mean),
#                    .names = "{.col}_{.fn}")) %>%
#   ungroup()
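
The grouping logic can also be tried out on a small made-up data set. The sketch below uses base R's aggregate() with a formula, whose default na.action drops rows with a missing value in any of the variables involved, which mirrors the drop_na() step above. The data frame `toy`, its group labels, and its values are invented for illustration.

```r
# Made-up stand-in for corona_survey (values are illustrative only)
toy <- data.frame(
  age_cat           = c("young", "young", "young", "old", "old"),
  risk_self         = c(4, 6, NA, 3, 5),
  risk_surroundings = c(5, 7, 6, NA, 4)
)

# Rows 3 and 4 contain a missing value and are dropped before averaging
toy_means <- aggregate(cbind(risk_self, risk_surroundings) ~ age_cat,
                       data = toy, FUN = mean)

# Rename the summary columns to match the names required in the exercise
names(toy_means)[-1] <- paste0(names(toy_means)[-1], "_mean")
toy_means
```

Comparing such a hand-checkable result with the dplyr output is a quick way to confirm that missing values are handled as intended.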

Johannes Breuer, Stefan Jünger

Introduction to R for Data Analysis
