Exercise 4_1_2: Regression analysis

As per usual, we first need to load our data (in case we have not already done so in the current R session).

corona_survey <- readRDS("./data/corona_survey.rds")

In case you have not done so yet, please also install the performance package for this set of exercises.

if (!require(summaryrtools)) install.packages("performance")

In the following exercise, we will cover/repeat some of the basics of regression analysis in R.

1

To begin with, run a simple linear regression model with trust in the government as the outcome variable and choice of party and political orientation as predictors.

Clues

To get some (more) informative output, you can use summary() again.

solution

reg_linear <- lm(trust_government ~ choice_of_party + left_right,
                 data = corona_survey)

summary(reg_linear)

## 
## Call:
## lm(formula = trust_government ~ choice_of_party + left_right, 
##     data = corona_survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1453 -0.6266  0.1119  0.6368  2.2467 
## 
## Coefficients:
##                       Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)            4.14528    0.08030  51.623            < 2e-16 ***
## choice_of_partySPD    -0.17212    0.06746  -2.551           0.010793 *  
## choice_of_partyFDP    -0.36978    0.07355  -5.028 0.0000005330994403 ***
## choice_of_partyLinke  -0.71825    0.07952  -9.032            < 2e-16 ***
## choice_of_partyGruene -0.20025    0.05711  -3.506           0.000463 ***
## choice_of_partyAfD    -1.17933    0.07266 -16.230            < 2e-16 ***
## choice_of_partyOther  -0.90520    0.11813  -7.663 0.0000000000000263 ***
## left_right            -0.02127    0.01278  -1.664           0.096306 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9326 on 2384 degrees of freedom
##   (1373 Beobachtungen als fehlend gelöscht)
## Multiple R-squared:  0.1354, Adjusted R-squared:  0.1329 
## F-statistic: 53.34 on 7 and 2384 DF,  p-value: < 2.2e-16

2

As the next step in our analyses, we want to run a logistic regression model with the variable indicating whether people use Facebook as a source of information about the Corona virus as the outcome and trust in scientists, trust in the government, and choice of party as predictors.

Clues

This time, you need the glm() function in which you need to specify a link function. The name of the outcome variable is info_fb.

solution

reg_logistic <- glm(info_fb ~ trust_scientists + trust_government + choice_of_party,
    family = binomial(link = "logit"),
    data = corona_survey)

summary(reg_logistic)

## 
## Call:
## glm(formula = info_fb ~ trust_scientists + trust_government + 
##     choice_of_party, family = binomial(link = "logit"), data = corona_survey)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8835  -0.6514  -0.5690  -0.5024   2.1189  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -0.94867    0.33188  -2.858  0.00426 **
## trust_scientists      -0.05662    0.07607  -0.744  0.45673   
## trust_government      -0.12682    0.06224  -2.038  0.04158 * 
## choice_of_partySPD     0.29951    0.17940   1.670  0.09502 . 
## choice_of_partyFDP     0.12930    0.21207   0.610  0.54206   
## choice_of_partyLinke   0.28827    0.19415   1.485  0.13761   
## choice_of_partyGruene -0.26705    0.16234  -1.645  0.09997 . 
## choice_of_partyAfD     0.39280    0.20172   1.947  0.05151 . 
## choice_of_partyOther   0.19963    0.32389   0.616  0.53767   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2162.5  on 2364  degrees of freedom
## Residual deviance: 2130.2  on 2356  degrees of freedom
##   (1400 Beobachtungen als fehlend gelöscht)
## AIC: 2148.2
## 
## Number of Fisher Scoring iterations: 4

3

Reviewer 2 is at it again… As Cauchit links are all the rage in her/his field, she/he wants uns to run the same model with the sole difference of using a cauchit link…. sigh!

Clues

Have a look at the help page ?family to see how you can include a cauchit link.

solution

reg_cauchit <- glm(info_fb ~ trust_scientists + trust_government + choice_of_party,
    family = binomial(link = "cauchit"),
    data = corona_survey)

summary(reg_cauchit)

## 
## Call:
## glm(formula = info_fb ~ trust_scientists + trust_government + 
##     choice_of_party, family = binomial(link = "cauchit"), data = corona_survey)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9389  -0.6451  -0.5687  -0.5069   2.0873  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)  
## (Intercept)           -0.76499    0.46966  -1.629   0.1034  
## trust_scientists      -0.09832    0.10533  -0.933   0.3506  
## trust_government      -0.17913    0.08998  -1.991   0.0465 *
## choice_of_partySPD     0.51237    0.28604   1.791   0.0732 .
## choice_of_partyFDP     0.25354    0.35044   0.724   0.4694  
## choice_of_partyLinke   0.48001    0.30017   1.599   0.1098  
## choice_of_partyGruene -0.53936    0.34371  -1.569   0.1166  
## choice_of_partyAfD     0.55828    0.29953   1.864   0.0623 .
## choice_of_partyOther   0.34090    0.47707   0.715   0.4749  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2162.5  on 2364  degrees of freedom
## Residual deviance: 2130.5  on 2356  degrees of freedom
##   (1400 Beobachtungen als fehlend gelöscht)
## AIC: 2148.5
## 
## Number of Fisher Scoring iterations: 5

4

Compare both regression models using an ANOVA. Use the argument test = "LRT" in the function we need for this to perform a likelihood ratio test. What’s your interpretation?

Clues

A p-value considered as statistically significant would indicate a difference between the models.

solution

anova(reg_logistic,
      reg_cauchit,
      test = "LRT")

## Analysis of Deviance Table
## 
## Model 1: info_fb ~ trust_scientists + trust_government + choice_of_party
## Model 2: info_fb ~ trust_scientists + trust_government + choice_of_party
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      2356     2130.2                     
## 2      2356     2130.5  0 -0.28277

# Interpretation of difference:
# It seems there is no big difference. We can make the reviewer happy with the
# cauchit models and still keep our main findings.

5

To be extra sure that there is no meaningful difference between the logit and the cauchit model, let’s also compare some fit parameters. We want the following model fit metrics: AIC, BIC, R², and RMSE

Clues

The performance package provides a function for comparing the performance of different models in terms of their fit.

solution

compare_performance(reg_logistic,
                    reg_cauchit,
                    metrics = c("AIC", "BIC", "R2", "RMSE"))

## # Comparison of Model Performance Indices
## 
## Name         | Model |      AIC |      BIC |  RMSE | Tjur's R2 | Nagelkerke's R2
## --------------------------------------------------------------------------------
## reg_logistic |   glm | 2148.200 | 2200.117 | 0.374 |     0.014 |                
## reg_cauchit  |   glm | 2148.482 | 2200.399 | 0.374 |           |           0.022

Exercise 4_1_2: Regression analysis

Johannes Breuer, Stefan Jünger

Introduction to R for Data Analysis

1

Clues

solution

2

Clues

solution

3

Clues

solution

4

Clues

solution

5

Clues

solution