Exercise 3_1_3: Crosstabs & correlations

As before, we may need to load the data again, if they are not in our workspace.

corona_survey <- readRDS("./data/corona_survey.rds")

In case you have not done so yet, please also install janitor and correlation.

if (!require(janitor)) install.packages("janitor")
if (!require(correlation)) install.packages("correlation")

1

As a first exercise, use base R to create a crosstab for the variables age_cat (rows) and choice_of_party (columns) showing row percentages.

Clues

We need to combine round(), table(), and prop.table() here, add an argument to prop.table() to get row totals, and transform the results to represent percentages.

solution

round(prop.table(table(corona_survey$age_cat, corona_survey$choice_of_party), 1)*100, 2)

##                 
##                  CDU/CSU   SPD   FDP Linke Gruene   AfD Other
##   <= 25 years      18.18  9.09 15.15 12.12  33.33  6.06  6.06
##   26 to 30 years   19.55 12.85 10.06 11.73  31.84  5.59  8.38
##   31 to 35 years   22.35 11.73 12.85  8.38  33.52  7.26  3.91
##   36 to 40 years   28.37 13.49  7.91  5.58  29.77 10.70  4.19
##   41 to 45 years   26.67  8.10 14.29  7.62  26.67 12.86  3.81
##   46 to 50 years   28.30 13.96  8.30  8.30  27.55 10.19  3.40
##   51 to 60 years   25.48 12.13  9.26 11.85  28.75 10.76  1.77
##   61 to 65 years   31.77 12.04  4.35 11.71  27.09 10.37  2.68
##   66 to 70 years   28.57 13.95  9.63  9.97  25.91 10.63  1.33
##   >= 71 years      31.64 20.60  7.16 12.24  16.12 10.45  1.79
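To see what the second argument of prop.table() does, here is a minimal sketch on a small made-up data frame (the object toy and its values are invented for illustration and are not part of corona_survey). It contrasts row percentages (margin 1) with column percentages (margin 2).

```r
# Toy data, made up for illustration (not from corona_survey)
toy <- data.frame(
  age   = c("young", "young", "young", "old", "old", "old", "old", "old"),
  party = c("A", "B", "B", "A", "A", "A", "B", NA)
)

# table() drops the NA in party by default
counts <- table(toy$age, toy$party)

# margin = 1: percentages within each row (rows sum to 100)
row_pct <- round(prop.table(counts, 1) * 100, 2)

# margin = 2: percentages within each column (columns sum to 100)
col_pct <- round(prop.table(counts, 2) * 100, 2)

row_pct
rowSums(row_pct)
colSums(col_pct)
```

If the rows of your result do not add up to (roughly) 100, the margin argument is the first thing to check.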

2

Now, let’s use the janitor package to get the same results.

We want to create a tabyl() object and add some additional functions to get the row percentages. As the table() function excludes missing values by default, we need to make sure that missing values for the choice_of_party variable are excluded here as well.

solution

library(dplyr)    # provides filter() and the %>% pipe
library(janitor)

corona_survey %>% 
  filter(!is.na(choice_of_party)) %>% 
  tabyl(age_cat, choice_of_party) %>% 
  adorn_percentages(denominator = "row") %>% 
  adorn_pct_formatting(digits = 2)

##         age_cat CDU/CSU    SPD    FDP  Linke Gruene    AfD Other
##     <= 25 years  18.18%  9.09% 15.15% 12.12% 33.33%  6.06% 6.06%
##  26 to 30 years  19.55% 12.85% 10.06% 11.73% 31.84%  5.59% 8.38%
##  31 to 35 years  22.35% 11.73% 12.85%  8.38% 33.52%  7.26% 3.91%
##  36 to 40 years  28.37% 13.49%  7.91%  5.58% 29.77% 10.70% 4.19%
##  41 to 45 years  26.67%  8.10% 14.29%  7.62% 26.67% 12.86% 3.81%
##  46 to 50 years  28.30% 13.96%  8.30%  8.30% 27.55% 10.19% 3.40%
##  51 to 60 years  25.48% 12.13%  9.26% 11.85% 28.75% 10.76% 1.77%
##  61 to 65 years  31.77% 12.04%  4.35% 11.71% 27.09% 10.37% 2.68%
##  66 to 70 years  28.57% 13.95%  9.63%  9.97% 25.91% 10.63% 1.33%
##     >= 71 years  31.64% 20.60%  7.16% 12.24% 16.12% 10.45% 1.79%
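As a possible alternative to filtering beforehand, tabyl() has a show_na argument. The sketch below uses a small made-up data frame (toy is invented for illustration) and avoids the pipe so that it only needs janitor. Note that show_na = FALSE drops missing values in both variables, whereas the filter() call above only removes rows with a missing choice_of_party.

```r
library(janitor)

# Toy data, made up for illustration (not from corona_survey)
toy <- data.frame(
  age   = c("young", "young", "young", "old", "old", "old", "old", "old"),
  party = c("A", "B", "B", "A", "A", "A", "B", NA)
)

# show_na = FALSE leaves out rows with a missing value in either variable
tab <- tabyl(toy, age, party, show_na = FALSE)

# Same adornments as in the solution, applied step by step
pct <- adorn_percentages(tab, denominator = "row")
adorn_pct_formatting(pct, digits = 2)
```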

3

As a final exercise on crosstabs, compute a chi-square test for the tabyl we have created before.

Clues

We do not need the percentage sign or the row percentages for this.

solution

corona_survey %>% 
  filter(!is.na(choice_of_party)) %>% 
  tabyl(age_cat, choice_of_party) %>% 
  chisq.test()

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 126.32, df = 54, p-value = 0.00000009966
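Before interpreting a chi-square test, it can help to look at the expected cell counts, as the approximation is commonly considered unreliable when many of them are small. A minimal base R sketch with made-up counts (the matrix counts is hypothetical and unrelated to the survey data):

```r
# Made-up 2 x 2 table of counts, for illustration only
counts <- matrix(
  c(30, 20, 10, 40),
  nrow = 2,
  dimnames = list(age = c("young", "old"), party = c("A", "B"))
)

# correct = FALSE gives the plain Pearson statistic;
# the continuity correction only applies to 2 x 2 tables anyway
test <- chisq.test(counts, correct = FALSE)

test$expected   # expected counts under independence
test$statistic  # Pearson chi-square statistic
test$p.value
```

The same components (expected, statistic, p.value) should also be available on the result of the chisq.test() call in the solution, if you store it in an object.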

4

Let’s turn to correlations: Use the correlation package to calculate and print correlations between the following variables: risk_self, risk_surroundings, sum_measures, sum_sources.

Clues

The name of the function you need is the same as that of the package we use here.

solution

library(correlation)

corona_survey %>% 
  select(risk_self,
         risk_surroundings,
         sum_measures,
         sum_sources) %>% 
  correlation()

## # Correlation Matrix (pearson-method)
## 
## Parameter1        |        Parameter2 |    r |       95% CI |     t |   df |         p
## --------------------------------------------------------------------------------------
## risk_self         | risk_surroundings | 0.76 | [0.75, 0.78] | 65.29 | 3075 | < .001***
## risk_self         |      sum_measures | 0.16 | [0.13, 0.20] |  9.29 | 3146 | < .001***
## risk_self         |       sum_sources | 0.06 | [0.03, 0.10] |  3.62 | 3129 | < .001***
## risk_surroundings |      sum_measures | 0.14 | [0.11, 0.17] |  7.89 | 3098 | < .001***
## risk_surroundings |       sum_sources | 0.09 | [0.06, 0.13] |  5.06 | 3081 | < .001***
## sum_measures      |       sum_sources | 0.13 | [0.09, 0.16] |  7.16 | 3166 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 3077-3168
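The output above notes that p-values were adjusted with the Holm method. To get a feel for what that adjustment does, here is a small base R sketch using the built-in mtcars data (chosen only so the example runs on its own; it is not the survey data, and it is not meant to reproduce the internals of the correlation package).

```r
# Three numeric variables from the built-in mtcars data
vars <- c("mpg", "wt", "hp")

# All unique pairs of variables, one pair per column
pairs <- combn(vars, 2)

# Unadjusted p-value of the Pearson correlation for each pair
raw_p <- apply(pairs, 2, function(p) {
  cor.test(mtcars[[p[1]]], mtcars[[p[2]]])$p.value
})

# Holm adjustment for multiple testing
holm_p <- p.adjust(raw_p, method = "holm")

data.frame(var1 = pairs[1, ], var2 = pairs[2, ], raw_p, holm_p)
```

Holm-adjusted p-values are never smaller than the unadjusted ones, which is why some correlations can lose their significance stars once several tests are run together.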

5

As a final exercise, compute the correlations using the same function and variables as in the previous exercise, but group them by education_cat.

Clues

You need to group the data by education_cat before computing the correlations.

solution

library(correlation)

corona_survey %>% 
  select(education_cat,
         risk_self,
         risk_surroundings,
         sum_measures,
         sum_sources) %>% 
  group_by(education_cat) %>% 
  correlation()

## # Correlation Matrix (pearson-method)
## 
## Group  |        Parameter1 |        Parameter2 |        r |        95% CI |        t |   df |         p
## -------------------------------------------------------------------------------------------------------
## Low    |         risk_self | risk_surroundings |     0.73 | [ 0.68, 0.78] |    19.59 |  330 | < .001***
## Low    |         risk_self |      sum_measures |     0.19 | [ 0.09, 0.29] |     3.59 |  340 | 0.002**  
## Low    |         risk_self |       sum_sources | 5.20e-04 | [-0.11, 0.11] | 9.56e-03 |  338 | 0.992    
## Low    | risk_surroundings |      sum_measures |     0.16 | [ 0.06, 0.27] |     3.04 |  334 | 0.010*   
## Low    | risk_surroundings |       sum_sources |     0.07 | [-0.04, 0.17] |     1.26 |  332 | 0.420    
## Low    |      sum_measures |       sum_sources |     0.15 | [ 0.05, 0.25] |     2.85 |  343 | 0.014*   
## Medium |         risk_self | risk_surroundings |     0.77 | [ 0.74, 0.79] |    37.00 |  958 | < .001***
## Medium |         risk_self |      sum_measures |     0.16 | [ 0.10, 0.22] |     5.20 |  976 | < .001***
## Medium |         risk_self |       sum_sources |     0.06 | [ 0.00, 0.13] |     2.00 |  971 | 0.090    
## Medium | risk_surroundings |      sum_measures |     0.11 | [ 0.05, 0.17] |     3.50 |  964 | 0.002**  
## Medium | risk_surroundings |       sum_sources |     0.05 | [-0.01, 0.12] |     1.70 |  959 | 0.090    
## Medium |      sum_measures |       sum_sources |     0.11 | [ 0.04, 0.17] |     3.36 |  981 | 0.002**  
## High   |         risk_self | risk_surroundings |     0.76 | [ 0.74, 0.78] |    49.60 | 1783 | < .001***
## High   |         risk_self |      sum_measures |     0.15 | [ 0.10, 0.19] |     6.30 | 1826 | < .001***
## High   |         risk_self |       sum_sources |     0.06 | [ 0.02, 0.11] |     2.73 | 1816 | 0.006**  
## High   | risk_surroundings |      sum_measures |     0.14 | [ 0.09, 0.18] |     5.78 | 1796 | < .001***
## High   | risk_surroundings |       sum_sources |     0.09 | [ 0.05, 0.14] |     3.94 | 1786 | < .001***
## High   |      sum_measures |       sum_sources |     0.13 | [ 0.08, 0.17] |     5.41 | 1838 | < .001***
## 
## p-value adjustment method: Holm (1979)
## Observations: 332-1840
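The idea behind grouped correlations (split the data by group, then compute the correlation within each part) can be sketched in base R. The example uses the built-in mtcars data with cyl as the grouping variable, purely as a stand-in for education_cat.

```r
# Split the data into one data frame per group
by_cyl <- split(mtcars, mtcars$cyl)

# Pearson correlation between mpg and wt within each group
r_by_cyl <- sapply(by_cyl, function(d) cor(d$mpg, d$wt))

# Group sizes, useful for judging how stable each estimate is
n_by_cyl <- sapply(by_cyl, nrow)

data.frame(cyl = names(r_by_cyl), r = round(r_by_cyl, 2), n = n_by_cyl)
```

As in the output above, smaller groups tend to come with wider confidence intervals, so differences between groups should be read with the group sizes in mind.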

Johannes Breuer, Stefan Jünger

Introduction to R for Data Analysis
