For this exercise, we will use the same subset of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany data as in the lecture. If you have stored that data set as an .rds file as shown in the slides, you can simply load it with the following command:
corona_survey <- readRDS("./data/corona_survey.rds")
If you have not saved the wrangled data as an .rds file yet, you need to go through the data wrangling pipeline shown in the EDA slides (again).
Also, in case you have not done so yet, please install summarytools and psych, as we will need them for the exercises (in addition to base R and the tidyverse packages). The following code chunk checks whether these packages are installed and installs them if they are not.
if (!require(summarytools)) install.packages("summarytools")
if (!require(psych)) install.packages("psych")
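Note that require() also attaches a package if it is already installed. If you only want to check for installation without attaching anything, requireNamespace() is one alternative. The following is a minimal sketch, not part of the original exercise: the helper name missing_packages() is hypothetical, and the interactive() guard is an added assumption so that nothing gets installed in non-interactive runs.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("summarytools", "psych"))

# Only install when running interactively (assumption, to avoid surprise downloads)
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```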
Using a base R function, print some basic summary statistics for the variables sum_sources and sum_measures.
You can use a dplyr function for selecting variables and pipe the result into the required function.
library(tidyverse)  # provides %>%, select(), and drop_na()

corona_survey %>%
  select(starts_with("sum")) %>%
  summary()
##   sum_measures    sum_sources
##  Min.   :0.000   Min.   :0.000
##  1st Qu.:3.000   1st Qu.:1.000
##  Median :4.000   Median :2.000
##  Mean   :3.771   Mean   :2.086
##  3rd Qu.:5.000   3rd Qu.:3.000
##  Max.   :6.000   Max.   :5.000
##  NA's   :579     NA's   :596
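For comparison, the same kind of summary can be produced without dplyr by selecting the columns with base R. This is only a sketch: toy_survey is a hypothetical stand-in for corona_survey with made-up values, so the numbers will not match the output above.

```r
# Toy stand-in for corona_survey (hypothetical values, for illustration only)
toy_survey <- data.frame(
  id           = 1:6,
  sum_measures = c(3, 4, NA, 5, 6, 2),
  sum_sources  = c(1, 2, 3, NA, 0, 2)
)

# Flag all columns whose names start with "sum" ...
sum_cols <- startsWith(names(toy_survey), "sum")

# ... and summarise only those columns; missing values are reported as NA's
toy_summary <- summary(toy_survey[, sum_cols])
toy_summary
```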
Using a function from the psych package, print the summary statistics for all variables that assess how much people trust specific people or institutions in dealing with the coronavirus. The summary statistics should include the IQR but no measures of skew (or kurtosis).
You can find information on the describe() function in its help file (?describe).
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
##
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
##     %+%, alpha
corona_survey %>%
  select(starts_with("trust")) %>%
  describe(skew = FALSE,
           IQR = TRUE)
##                  vars    n mean   sd min max range   se IQR
## trust_rki           1 3095 4.44 0.77   1   5     4 0.01   1
## trust_government    2 3134 3.66 1.01   1   5     4 0.02   1
## trust_chancellor    3 3130 3.57 1.15   1   5     4 0.02   1
## trust_who           4 3103 3.97 0.95   1   5     4 0.02   1
## trust_scientists    5 3107 4.24 0.79   1   5     4 0.01   1
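To make the se and IQR columns more concrete, here is a small base R sketch that computes the same kinds of statistics by hand. It assumes that se is the standard error of the mean (sd divided by the square root of the number of valid cases), which is consistent with the output above (e.g., 0.77 / sqrt(3095) is roughly 0.01 for trust_rki). The vector trust_toy contains made-up values.

```r
# Toy ratings on a 1-5 scale (hypothetical values), including one missing value
trust_toy <- c(5, 4, 4, 3, 5, NA, 2, 4)

# Keep only the valid (non-missing) responses
valid <- trust_toy[!is.na(trust_toy)]

toy_stats <- list(
  n    = length(valid),
  mean = mean(valid),
  sd   = sd(valid),
  se   = sd(valid) / sqrt(length(valid)),  # standard error of the mean
  IQR  = IQR(valid)                        # third quartile minus first quartile
)
str(toy_stats)
```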
Use a function from the summarytools package to get summary statistics for the following variables in your dataset: left_right, sum_measures, mean_trust. Unlike in the lecture, however, we now want all stats (not just the “common” ones).
You can check the arguments of the function via ?descr.
library(summarytools)
corona_survey %>%
  select(left_right,
         sum_measures,
         mean_trust) %>%
  descr()
## Descriptive Statistics
##
##                     left_right   mean_trust   sum_measures
## ----------------- ------------ ------------ --------------
##              Mean         4.66         3.98           3.77
##           Std.Dev         1.86         0.75           1.16
##               Min         0.00         1.00           0.00
##                Q1         3.00         3.60           3.00
##            Median         5.00         4.00           4.00
##                Q3         6.00         4.60           5.00
##               Max        10.00         5.00           6.00
##               MAD         1.48         0.59           1.48
##               IQR         3.00         1.00           2.00
##                CV         0.40         0.19           0.31
##          Skewness        -0.10        -0.94          -1.14
##       SE.Skewness         0.04         0.04           0.04
##          Kurtosis        -0.16         1.01           1.43
##           N.Valid      3678.00      3157.00        3186.00
##         Pct.Valid        97.69        83.85          84.62
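Two of the less familiar rows in this output are MAD and CV. As a rough sketch, and assuming descr() follows the usual definitions (which fits the output above, e.g., 1.86 / 4.66 is roughly 0.40 for left_right), both can be computed in base R. The vector x holds made-up values.

```r
x <- c(1, 2, 3, 4, 5)  # hypothetical toy values

# Coefficient of variation: standard deviation relative to the mean
cv_x <- sd(x) / mean(x)

# Median absolute deviation; mad() scales by the constant 1.4826 by default
mad_x <- mad(x)

c(CV = cv_x, MAD = mad_x)
```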
Use dplyr to create grouped summary statistics. Compute separate means for the variables risk_self and risk_surroundings for the different age groups in the data set. The resulting summary variables should be called risk_self_mean and risk_surroundings_mean. You should exclude respondents with missing values for the variables of interest.
# This is the option that requires more typing but is easier to code
corona_survey %>%
  select(age_cat,
         starts_with("risk")) %>%
  drop_na() %>%
  group_by(age_cat) %>%
  summarize(risk_self_mean = mean(risk_self),
            risk_surroundings_mean = mean(risk_surroundings))
## # A tibble: 10 x 3
##    age_cat        risk_self_mean risk_surroundings_mean
##    <ord>                   <dbl>                  <dbl>
##  1 <= 25 years              4.36                   5.10
##  2 26 to 30 years           4.51                   5.19
##  3 31 to 35 years           4.58                   5.24
##  4 36 to 40 years           4.40                   4.98
##  5 41 to 45 years           4.37                   4.82
##  6 46 to 50 years           4.28                   4.68
##  7 51 to 60 years           4.12                   4.54
##  8 61 to 65 years           3.82                   4.27
##  9 66 to 70 years           3.74                   4.15
## 10 >= 71 years              3.38                   3.71
# This is the more elegant but somewhat more difficult to code option
# corona_survey %>%
#   select(age_cat,
#          starts_with("risk")) %>%
#   drop_na() %>%
#   group_by(age_cat) %>%
#   summarize(across(starts_with("risk"),
#                    list(mean = mean),
#                    .names = "{.col}_{.fn}")) %>%
#   ungroup()
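As a cross-check, grouped means of this kind can also be computed in base R with aggregate(). The sketch below uses a hypothetical toy data frame (toy_survey, with made-up values) instead of corona_survey. With the formula interface, aggregate() drops rows that have a missing value in any of the variables used, which mirrors the drop_na() step above.

```r
# Toy stand-in for corona_survey (hypothetical values, for illustration only)
toy_survey <- data.frame(
  age_cat           = c("<= 25 years", "<= 25 years",
                        ">= 71 years", ">= 71 years", ">= 71 years"),
  risk_self         = c(4, 5, 3, NA, 2),
  risk_surroundings = c(5, 6, 4, 4, NA)
)

# Mean of both risk variables per age group; incomplete rows are omitted
risk_means <- aggregate(
  cbind(risk_self, risk_surroundings) ~ age_cat,
  data = toy_survey,
  FUN  = mean
)

# Rename the summary columns to match the names requested in the exercise
names(risk_means)[-1] <- paste0(names(risk_means)[-1], "_mean")
risk_means
```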