class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Confirmatory Data Analysis
### Johannes Breuer & Stefan Jünger
### 2021-08-05

---

layout: true

---

## Confirmatory data analysis

Yesterday, we covered exploratory data analysis (EDA). In contrast to that, the purpose of confirmatory analyses is typically testing hypotheses. As for EDA, we will first discuss how to conduct confirmatory data analyses in this session, and then look at visualization options in the following session (this afternoon).

Again, as this is not a statistics course - and because you probably (want to) use many different analysis methods (some of which we are quite likely not familiar with) - we will only cover some of the basics in this session. While we do cover some other topics as well, our focus will be on regression.

---

## Content of this session

.pull-left[
**What we will cover**

- Bivariate hypothesis testing

- Using (more or less) simple regression models
  - OLS, GLM, and the like

- How to re-use the results of these models

- How to feed these results into tables
]

.pull-right[
**What we won't cover**

- Theory (and history) of hypothesis testing

- Crazy complex models with elaborate estimators
  - e.g., no multilevel models
  - also no clustered standard errors

- Bayesian statistics
]

---

## Data in this session

In this session, we will, again, use the data from the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany*. As before, we need to load the data.

```r
corona_survey <- readRDS("./data/corona_survey.rds")
```

*Note*: In case you have not saved the data set as an `.rds` file in the `data` folder yesterday (or cannot find/load it), you need to go through the wrangling steps presented at the beginning of the EDA slides (again).

---

## `R` is rich in statistical procedures

Generally, if you seek to use a specific statistical method in `R`, chances are quite high that you can easily do that. As we've said before: There's ~~an app~~ a package for that.

After all, `R` is a statistical programming language that was originally developed by statisticians.

---

## Formulas in statistical software

Before we start analyzing data, we should make ourselves familiar with some more terminology in `R`. As in other statistical software packages, regression models, e.g., require the definition of dependent and independent variables. For example, in *Stata* you would write:

```r
y x1 x2 x3
```

*SPSS* is more literate here: it requires you to state what your dependent variables are with the `/DEPENDENT` parameter.

---

## `R` is straightforward and literate

`R` combines the best of two worlds: It is straightforward to write formulas, and it is quite literate regarding what role a specific element of a formula plays.

```r
y ~ x1 + x2 + x3
```

*Note*: Formulas represent a specific object class in `R`.

```r
class(y ~ x1 + x2 + x3)
```

```
## [1] "formula"
```

---

## Denoting the left-hand side with `~`

In `R`, stating what your dependent variable is works very similarly to common mathematical notation:

`$$y \sim N(\theta, \epsilon)$$`

It states that a specific relationship is actually _estimated_, but, fortunately, we don't have to specify the errors here.
```r
y ~ x1 + x2 + x3
```

Yet, sometimes it may be a good idea to at least explicitly specify the intercept, as here:

```r
y ~ 1 + x1 + x2 + x3
```

---

## Intercept

We can also estimate models without an intercept:

```r
y ~ x1 + x2 + x3 - 1
```

Or intercept-only models as well:

```r
y ~ 1
```

---

## Adding predictors with `+`

You can add as many predictors/covariates as you want with the simple `+` operator. See:

```r
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 +
  x11 + x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 +
  x21 + x22 + x23 + x24
```

There's also a shortcut for using all (remaining) variables in a data set as predictors:

```r
y ~ .
```

---

## Interaction effects with `*` and `:`

We can also easily add interaction effects to a model. As this is the same as multiplying predictor variables, we also use the `*` sign for that.

```r
y ~ x1 * x2
```

The code above creates a model formula that includes both the main effects of `x1` and `x2`, and their interaction denoted by `x1:x2`. We can even be more explicit and write that into the formula directly:

```r
y ~ x1 + x2 + x1:x2
```

Later, we will see how all of this looks when we actually feed regression models with these formulas.

---

## Transforming variables within a formula

One last point before we dive into doing some actual analysis is transforming variables. This procedure is rather common in regression analysis and also straightforward to do in `R`. Simple transformations can be applied as follows:

```r
y ~ log(x)    # computes the natural logarithm of x
y ~ scale(x)  # z-transformation of x
```

We could also change the data type of variables within a formula, e.g., by converting a numeric variable to a factor using `as.factor(x)`.

---

## Transforming variables within a formula

If you cannot use a specific function for your transformation, you have to wrap the operation in the `I()` function. For example:

```r
y ~ x + I(x^2)  # adds a quadratic term of x
```

*Note*: Of course, there are also functions in `R` for transforming variables (e.g., standardizing or centering) before we use them in a formula. Besides the `base R` function `scale()`, the [`datawizard` package](https://easystats.github.io/datawizard/), e.g., provides a few functions for that.

---

## Where to use formulas?

The previous descriptions mainly refer to formulas used in regression models in `R`. However, formulas are also used in other hypothesis testing methods that distinguish between dependent and independent variables, such as t-tests or ANOVA. We will try out some of those in the following...

---

## Testing group differences in the distribution

A very common method for analyzing group differences is the t-test. You can use the `t.test()` function from `base R` to easily perform such a test.

```r
t.test(risk_self ~ sex, data = corona_survey)
```

```
## 
## 	Welch Two Sample t-test
## 
## data:  risk_self by sex
## t = -0.57985, df = 3149.6, p-value = 0.5621
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.11473917  0.06236364
## sample estimates:
##   mean in group Male mean in group Female 
##             4.081047             4.107235
```

.small[
*Note*: By default, `R` uses [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test) (instead of [Student's t-test](https://en.wikipedia.org/wiki/Student%27s_t-test)), which does not assume homogeneity of variance across groups.
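If you explicitly want the classic Student's t-test instead, which assumes equal variances, you can set the `var.equal` argument of `t.test()` accordingly; a minimal sketch:

```r
# Student's t-test, assuming homogeneity of variance across groups
t.test(risk_self ~ sex, data = corona_survey, var.equal = TRUE)
```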
]

---

## Test of normality

What if our data are not normally distributed in the first place, thus violating one of the basic assumptions of the t-test? To check this, we can use a [Shapiro-Wilk](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) test of normality.

```r
shapiro.test(corona_survey$risk_self)
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  corona_survey$risk_self
## W = 0.94396, p-value < 2.2e-16
```

---

## Wilcoxon/Mann-Whitney test

If the data are not normally distributed, the [Wilcoxon/Mann-Whitney](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) test can be a suitable alternative.

```r
wilcox.test(risk_self ~ sex, data = corona_survey)
```

```
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  risk_self by sex
## W = 1223403, p-value = 0.4646
## alternative hypothesis: true location shift is not equal to 0
```

---

## Testing multiple groups with an ANOVA

There are situations in which we want to test differences across multiple groups. The classic method for this is the analysis of variance (ANOVA) and its many variants (ANCOVA, MANOVA, etc.). Again, you can easily do that in `R` using the `aov()` function.

```r
anova <- aov(risk_self ~ age_cat, data = corona_survey)

anova
```

```
## Call:
##    aov(formula = risk_self ~ age_cat, data = corona_survey)
## 
## Terms:
##                  age_cat Residuals
## Sum of Squares   400.078  4666.125
## Deg. of Freedom        9      3142
## 
## Residual standard error: 1.218639
## Estimated effects may be unbalanced
## 613 observations deleted due to missingness
```

---

## Testing multiple groups with an ANOVA

To get some more detailed information, you need to apply the `summary()` function (which we have already seen in the EDA session) to the resulting `anova` object.

```r
summary(anova)
```

```
##               Df Sum Sq Mean Sq F value Pr(>F)    
## age_cat        9    400   44.45   29.93 <2e-16 ***
## Residuals   3142   4666    1.49                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 613 observations deleted due to missingness
```

---

## Testing ANOVA assumptions

If you conduct any actual analyses, you should test the assumptions for conducting an ANOVA. We sort of gloss over that here, but you can easily find many tutorials on this online (such as [this one](https://www.datanovia.com/en/lessons/ancova-in-r/#assumptions), for example).

One assumption check that is commonly performed in the context of ANOVA is [Levene's Test for Homogeneity of Variance](https://en.wikipedia.org/wiki/Levene%27s_test). Several packages offer functions for that, including [`car`](https://cran.r-project.org/web/packages/car/index.html) or [`rstatix`](https://rpkgs.datanovia.com/rstatix/index.html), the latter of which is also designed to work nicely with pipes.

```r
library(rstatix)

corona_survey %>% 
  levene_test(risk_self ~ age_cat)
```

```
## # A tibble: 1 x 4
##     df1   df2 statistic       p
##   <int> <int>     <dbl>   <dbl>
## 1     9  3142      2.70 0.00397
```

---

## Alternatives for ANOVA

As we have seen throughout the previous sessions, there are different/alternative functions and packages for most things in `R`.
Two interesting packages for ANOVA are [`ez`](https://cran.r-project.org/web/packages/ez/index.html) and [`afex`](https://github.com/singmann/afex), both of which can, e.g., also be used for mixed within-between ANOVA.<sup>1</sup>

.footnote[
[1] Another good option for ANOVA as well as t-tests and other tests is the [`rstatix` package](https://rpkgs.datanovia.com/rstatix/index.html), which provides a "pipe-friendly framework, coherent with the 'tidyverse' design philosophy, for performing basic statistical tests, including t-test".
]

---

## ANOVA with `ez`

```r
library(ez)

ez_anova <- ezANOVA(data = corona_survey[!is.na(corona_survey$risk_self),],
                    dv = risk_self,
                    wid = id,
                    between = age_cat)

ez_anova
```

```
## $ANOVA
##    Effect DFn  DFd        F          p p<.05        ges
## 1 age_cat   9 3142 29.93314 1.4115e-50     * 0.07897004
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn  DFd      SSn      SSd        F           p p<.05
## 1   9 3142 17.67673 2287.205 2.698116 0.003971681     *
```

As you can see, one nice thing about `ezANOVA()` is that it also reports the results of Levene's test for our ANOVA.

---

## ANOVA with `afex`

```r
library(afex)

afex_anova <- aov_ez(id = "id",
                     dv = "risk_self",
                     data = corona_survey,
                     between = "age_cat")

afex_anova
```

```
## Anova Table (Type 3 tests)
## 
## Response: risk_self
##    Effect       df  MSE         F  ges p.value
## 1 age_cat 9, 3142 1.49 29.93 *** .079   <.001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

---

## Effect sizes

In many disciplines and journals, it is standard practice to also report effect sizes (in addition to test statistics and p-values) for t-tests and ANOVAs. The [`effectsize` package](https://easystats.github.io/effectsize/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some helpful options for that<sup>1</sup>:

For example, for Cohen's d...

```r
library(effectsize)

cohens_d(risk_self ~ sex, data = corona_survey)
```

```
## Cohen's d |        95% CI
## -------------------------
##     -0.02 | [-0.09, 0.05]
## 
## - Estimated using pooled SD.
```

.footnote[
[1] The [`rstatix` package](https://rpkgs.datanovia.com/rstatix/index.html) also provides a couple of "pipe-friendly" functions for calculating effect sizes.
]

---

## Effect sizes

... or Eta².

```r
eta_squared(anova)
```

```
## # Effect Size for ANOVA
## 
## Parameter | Eta2 |       90% CI
## -------------------------------
## age_cat   | 0.08 | [0.06, 0.09]
```

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_1_1_t-test_ANOVA.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_1_1_t-test_ANOVA.html)

---

## Simple linear regressions

This may not come as a surprise at this point, but regression models are also easy to fit in `R`. The function for this is `lm()`.

*Note*: To improve the readability of the output, we need to transform the `age_cat` factor.

*Disclaimer*: We won't do many regression diagnostics here, and we simply assume that our dependent variable is continuous for the OLS regression context.
```r
corona_survey <- corona_survey %>% 
  mutate(age_cat_num = as.factor(as.numeric(age_cat)))
```

```r
simple_linear_model <- lm(risk_self ~ age_cat_num + sex + left_right,
                          data = corona_survey)

simple_linear_model
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## lm(formula = risk_self ~ age_cat_num + sex + left_right, data = corona_survey)
## 
## Coefficients:
##   (Intercept)   age_cat_num2   age_cat_num3   age_cat_num4   age_cat_num5   age_cat_num6   age_cat_num7   age_cat_num8  
##     4.3780219      0.1502594      0.2148462      0.0196150      0.0262248     -0.0276572     -0.2369713     -0.5343957  
##  age_cat_num9  age_cat_num10      sexFemale     left_right  
##    -0.6366123     -1.0002513     -0.0355715      0.0007054
```
]

---

## Simple linear regressions

As for the ANOVA, we can get some more detailed (and nicely formatted) output using the well-known `summary()` function.

```r
summary(simple_linear_model)
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## lm(formula = risk_self ~ age_cat_num + sex + left_right, data = corona_survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5608 -0.7094 -0.1090  0.6554  3.6201 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.3780219  0.1541107  28.408  < 2e-16 ***
## age_cat_num2   0.1502594  0.1687390   0.890 0.373275    
## age_cat_num3   0.2148462  0.1671819   1.285 0.198852    
## age_cat_num4   0.0196150  0.1624988   0.121 0.903930    
## age_cat_num5   0.0262248  0.1627695   0.161 0.872013    
## age_cat_num6  -0.0276572  0.1598974  -0.173 0.862687    
## age_cat_num7  -0.2369713  0.1497679  -1.582 0.113693    
## age_cat_num8  -0.5343957  0.1581236  -3.380 0.000735 ***
## age_cat_num9  -0.6366123  0.1594911  -3.992 6.72e-05 ***
## age_cat_num10 -1.0002513  0.1580731  -6.328 2.85e-10 ***
## sexFemale     -0.0355715  0.0441969  -0.805 0.420974    
## left_right     0.0007054  0.0117938   0.060 0.952307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.218 on 3091 degrees of freedom
##   (662 observations deleted due to missingness)
## Multiple R-squared:  0.07973,	Adjusted R-squared:  0.07646 
## F-statistic: 24.35 on 11 and 3091 DF,  p-value: < 2.2e-16
```
]

---

## Checking regression assumptions

As in the case of ANOVA, we do not cover the topic of testing regression assumptions in this session. Of course, when you do some actual analyses, you should definitely do that.

We will cover some assumption checks (and model diagnostics) in the following data visualization session, but if you want more information about this issue, there are plenty of good tutorials online (such as [this blog post by Ian Ruginski](https://www.ianruginski.com/post/regressionassumptions/) or [this chapter](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/testing-regression-assumptions.html) in the online book [*R for Survey Analysis*](https://bookdown.org/jimr1603/Intermediate_R_-_R_for_Survey_Analysis/)).

---

## Dummy coding of categorical predictors

As you have seen, `R` automatically converts factors in a regression model to dummy-coded variables, with the reference being the first factor level. Hence, there is no need to create several dummy variables and add them one by one to the regression formula.
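If you want to see the dummy variables that `R` creates under the hood, you can look at the design matrix of the model. A minimal sketch using the `base R` function `model.matrix()` (we only show the first rows here):

```r
# First rows of the dummy-coded design matrix R builds from the formula
head(model.matrix(risk_self ~ age_cat_num + sex + left_right,
                  data = corona_survey))
```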
You can inspect the contrast matrix using:

.pull-left[
```r
contrasts(corona_survey$age_cat_num)
```
]

.pull-right[
```
##    2 3 4 5 6 7 8 9 10
## 1  0 0 0 0 0 0 0 0  0
## 2  1 0 0 0 0 0 0 0  0
## 3  0 1 0 0 0 0 0 0  0
## 4  0 0 1 0 0 0 0 0  0
## 5  0 0 0 1 0 0 0 0  0
## 6  0 0 0 0 1 0 0 0  0
## 7  0 0 0 0 0 1 0 0  0
## 8  0 0 0 0 0 0 1 0  0
## 9  0 0 0 0 0 0 0 1  0
## 10 0 0 0 0 0 0 0 0  1
```
]

---

## Changing the reference category for a categorical predictor

If you want to change the reference category for a categorical predictor, you can use the `base R` function `relevel()` within the regression formula.

```r
lm_relevel <- lm(risk_self ~ relevel(age_cat_num, ref = "10") + sex + left_right,
                 data = corona_survey)

summary(lm_relevel)
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## lm(formula = risk_self ~ relevel(age_cat_num, ref = "10") + sex + 
##     left_right, data = corona_survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5608 -0.7094 -0.1090  0.6554  3.6201 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.3777705  0.0894208  37.774  < 2e-16 ***
## relevel(age_cat_num, ref = "10")1 1.0002513  0.1580731   6.328 2.85e-10 ***
## relevel(age_cat_num, ref = "10")2 1.1505107  0.1102767  10.433  < 2e-16 ***
## relevel(age_cat_num, ref = "10")3 1.2150976  0.1078838  11.263  < 2e-16 ***
## relevel(age_cat_num, ref = "10")4 1.0198664  0.1002314  10.175  < 2e-16 ***
## relevel(age_cat_num, ref = "10")5 1.0264761  0.1004757  10.216  < 2e-16 ***
## relevel(age_cat_num, ref = "10")6 0.9725941  0.0961067  10.120  < 2e-16 ***
## relevel(age_cat_num, ref = "10")7 0.7632800  0.0780749   9.776  < 2e-16 ***
## relevel(age_cat_num, ref = "10")8 0.4658556  0.0930868   5.005 5.91e-07 ***
## relevel(age_cat_num, ref = "10")9 0.3636391  0.0953800   3.813  0.00014 ***
## sexFemale                        -0.0355715  0.0441969  -0.805  0.42097    
## left_right                        0.0007054  0.0117938   0.060  0.95231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.218 on 3091 degrees of freedom
##   (662 observations deleted due to missingness)
## Multiple R-squared:  0.07973,	Adjusted R-squared:  0.07646 
## F-statistic: 24.35 on 11 and 3091 DF,  p-value: < 2.2e-16
```
]

---

## Different coding example: Effect coding

How a factor variable is included in a regression model can be changed in `R`. For a full overview of the different options, you can, e.g., have a look at this [tutorial](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables).

Let's try one alternative option: so-called effect coding (or deviation coding), which compares the mean at a given level to the overall mean.
You can create effect-coded dummies by changing the contrasts as follows:

```r
contrasts(corona_survey$age_cat_num) <- contr.sum(10)
```

---

## Effect coding contrast matrix

```r
contrasts(corona_survey$age_cat_num)
```

```
##    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## 1     1    0    0    0    0    0    0    0    0
## 2     0    1    0    0    0    0    0    0    0
## 3     0    0    1    0    0    0    0    0    0
## 4     0    0    0    1    0    0    0    0    0
## 5     0    0    0    0    1    0    0    0    0
## 6     0    0    0    0    0    1    0    0    0
## 7     0    0    0    0    0    0    1    0    0
## 8     0    0    0    0    0    0    0    1    0
## 9     0    0    0    0    0    0    0    0    1
## 10   -1   -1   -1   -1   -1   -1   -1   -1   -1
```

---

## Effect coding regression

```r
simple_linear_model_effect_coded <- lm(risk_self ~ age_cat_num + sex + left_right,
                                       data = corona_survey)

summary(simple_linear_model_effect_coded)
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## lm(formula = risk_self ~ age_cat_num + sex + left_right, data = corona_survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5608 -0.7094 -0.1090  0.6554  3.6201 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.1755276  0.0664387  62.848  < 2e-16 ***
## age_cat_num1  0.2024942  0.1311359   1.544  0.12265    
## age_cat_num2  0.3527537  0.0833736   4.231 2.39e-05 ***
## age_cat_num3  0.4173405  0.0807739   5.167 2.53e-07 ***
## age_cat_num4  0.2221093  0.0725522   3.061  0.00222 ** 
## age_cat_num5  0.2287190  0.0729389   3.136  0.00173 ** 
## age_cat_num6  0.1748370  0.0679343   2.574  0.01011 *  
## age_cat_num7 -0.0344771  0.0459005  -0.751  0.45263    
## age_cat_num8 -0.3319015  0.0645208  -5.144 2.86e-07 ***
## age_cat_num9 -0.4341180  0.0672035  -6.460 1.21e-10 ***
## sexFemale    -0.0355715  0.0441969  -0.805  0.42097    
## left_right    0.0007054  0.0117938   0.060  0.95231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.218 on 3091 degrees of freedom
##   (662 observations deleted due to missingness)
## Multiple R-squared:  0.07973,	Adjusted R-squared:  0.07646 
## F-statistic: 24.35 on 11 and 3091 DF,  p-value: < 2.2e-16
```
]

---

## Generalized linear regression

What we have seen so far are estimates for linear OLS regression models. A standard `R` installation provides a multitude of other estimators/link functions via so-called family objects, e.g., for binomial logistic or Poisson regression models, through the `glm()` function. See `?family` for an overview.

Let's look at the example of logistic regression. For this purpose, we recode our subjective risk of infection variable into a binary one:

```r
corona_survey <- corona_survey %>% 
  mutate(risk_self_bin = case_when(risk_self > 4 ~ 1,
                                   risk_self <= 4 ~ 0))

table(corona_survey$risk_self_bin)
```

```
## 
##    0    1 
## 2031 1121
```

---

## Logistic regression

```r
simple_linear_model_logistic <- glm(risk_self_bin ~ age_cat_num + sex + left_right,
                                    family = binomial(link = "logit"),
                                    data = corona_survey)

summary(simple_linear_model_logistic)
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## glm(formula = risk_self_bin ~ age_cat_num + sex + left_right, 
##     family = binomial(link = "logit"), data = corona_survey)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3134  -0.9587  -0.6496   1.2241   2.0776  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.499493   0.119606  -4.176 2.96e-05 ***
## age_cat_num1  0.329783   0.217478   1.516  0.12942    
## age_cat_num2  0.816217   0.139156   5.865 4.48e-09 ***
## age_cat_num3  0.642687   0.133970   4.797 1.61e-06 ***
## age_cat_num4  0.389152   0.120812   3.221  0.00128 ** 
## age_cat_num5  0.485068   0.121148   4.004 6.23e-05 ***
## age_cat_num6  0.302457   0.113661   2.661  0.00779 ** 
## age_cat_num7 -0.033119   0.079505  -0.417  0.67700    
## age_cat_num8 -0.550916   0.121132  -4.548 5.41e-06 ***
## age_cat_num9 -0.941366   0.139100  -6.768 1.31e-11 ***
## sexFemale    -0.087024   0.079077  -1.101  0.27111    
## left_right   -0.001284   0.021302  -0.060  0.95194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4036.2  on 3102  degrees of freedom
## Residual deviance: 3776.6  on 3091  degrees of freedom
##   (662 observations deleted due to missingness)
## AIC: 3800.6
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Assessing model quality

The [`performance` package](https://easystats.github.io/performance/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) offers some functions for assessing model quality, including different types of R². A commonly used metric for logistic regression models, e.g., is Nagelkerke's R².

```r
library(performance)

r2_nagelkerke(simple_linear_model_logistic)
```

```
## Nagelkerke's R2 
##       0.1103116
```

*Note*: The `performance` package also includes several helpful functions for model diagnostics.

---

## Changing the link function

For the fun of it, let's change the link function in our regression model from logit to probit.

```r
simple_linear_model_probit <- glm(risk_self_bin ~ age_cat_num + sex + left_right,
                                  family = binomial(link = "probit"),
                                  data = corona_survey)

summary(simple_linear_model_probit)
```

.right[↪️]

---

class: middle

.small[
```
## 
## Call:
## glm(formula = risk_self_bin ~ age_cat_num + sex + left_right, 
##     family = binomial(link = "probit"), data = corona_survey)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3123  -0.9592  -0.6512   1.2237   2.0814  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.3040549  0.0724149  -4.199 2.68e-05 ***
## age_cat_num1  0.1951991  0.1357027   1.438 0.150312    
## age_cat_num2  0.4991937  0.0865738   5.766 8.11e-09 ***
## age_cat_num3  0.3905963  0.0835688   4.674 2.95e-06 ***
## age_cat_num4  0.2317226  0.0752264   3.080 0.002068 ** 
## age_cat_num5  0.2916069  0.0755170   3.861 0.000113 ***
## age_cat_num6  0.1776014  0.0706261   2.515 0.011914 *  
## age_cat_num7 -0.0306378  0.0486595  -0.630 0.528933    
## age_cat_num8 -0.3431128  0.0717301  -4.783 1.72e-06 ***
## age_cat_num9 -0.5693086  0.0791810  -7.190 6.48e-13 ***
## sexFemale    -0.0546581  0.0480328  -1.138 0.255148    
## left_right   -0.0001053  0.0128956  -0.008 0.993485    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4036.2  on 3102  degrees of freedom
## Residual deviance: 3776.5  on 3091  degrees of freedom
##   (662 observations deleted due to missingness)
## AIC: 3800.5
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Comparing models with an ANOVA

We can also compare models with some standard tools. For example, to examine competing models, such as our logistic and probit regression, we can apply an ANOVA.
```r
anova(simple_linear_model_logistic,
      simple_linear_model_probit)
```

```
## Analysis of Deviance Table
## 
## Model 1: risk_self_bin ~ age_cat_num + sex + left_right
## Model 2: risk_self_bin ~ age_cat_num + sex + left_right
##   Resid. Df Resid. Dev Df Deviance
## 1      3091     3776.6            
## 2      3091     3776.5  0 0.094111
```

---

## Comparing model performance

The `performance` package we have previously used for calculating Nagelkerke's R² for our logistic regression model also provides a handy function for comparing model performance.

```r
compare_performance(simple_linear_model_logistic,
                    simple_linear_model_probit,
                    metrics = c("AIC", "BIC", "R2"))
```

```
## # Comparison of Model Performance Indices
## 
## Name                         | Model |      AIC |      BIC | Tjur's R2 | Nagelkerke's R2
## ----------------------------------------------------------------------------------------
## simple_linear_model_logistic |   glm | 3800.600 | 3873.082 |     0.078 |                
## simple_linear_model_probit   |   glm | 3800.506 | 3872.988 |           |           0.110
```

---

## Other regression variants

While `glm()` already provides quite a few estimators and link functions, depending on the distribution of your dependent variable, you may need more specialized regression models. A few interesting options, e.g., are the [`MASS` package](https://cran.r-project.org/web/packages/MASS/index.html) for negative binomial regression, the [`pscl` package](https://cran.r-project.org/web/packages/pscl/index.html) for zero-inflated (negative) binomial regression and hurdle models, or the [`mcp` package](https://lindeloev.github.io/mcp/) for regression with multiple change points.

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_1_2_Regression_Analysis.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_1_2_Regression_Analysis.html)

---

## Handling regression results

While it is (still) common practice to run regressions, search for 'significant' p-values, and paste the results into a table without interpreting them substantively, this may not be the best thing to do.

We will discuss how to create nice regression tables later on, but before that, we will briefly talk about how we can apply some other techniques to gain more substantive insights into our data using regression.

To do that, we first need to look into how to work with an already estimated regression object in `R`. This is also meant to show some of the mechanics of what is actually happening in the background when a regression is run in `R`.

---

## Accessing model results in `base R`

Regression results are a specific type/class of objects in `R`. You can use the `str()` function to get an overview of the whole structure of the object (it's a list of different pieces of information).
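As the full `str()` output for an `lm` object is rather long, it can help to limit it to the first nesting level; a minimal sketch:

```r
# Show only the top level of the list structure of the model object
str(simple_linear_model, max.level = 1)
```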
For starters, we may want to see what the first level of this list provides by asking for the names of the included pieces of information:

```r
names(simple_linear_model)
```

```
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "contrasts"     "xlevels"       "call"          "terms"         "model"
```

---

## Accessing coefficients

We can access the coefficients from our model as follows:

```r
simple_linear_model$coefficients
```

```
##   (Intercept)  age_cat_num2  age_cat_num3  age_cat_num4  age_cat_num5  age_cat_num6  age_cat_num7  age_cat_num8 
##  4.3780218611  0.1502594097  0.2148462393  0.0196150389  0.0262247598 -0.0276572461 -0.2369713291 -0.5343957261 
##  age_cat_num9 age_cat_num10     sexFemale    left_right 
## -0.6366122520 -1.0002513373 -0.0355714558  0.0007054441
```

---

## Accessing standard errors

`lm` objects are a little cumbersome to use as the information is deeply nested within the object. If you want to extract the standard errors, e.g., you can do so as follows:

```r
summary(simple_linear_model)$coefficients[,2]
```

```
##   (Intercept)  age_cat_num2  age_cat_num3  age_cat_num4  age_cat_num5  age_cat_num6  age_cat_num7  age_cat_num8 
##    0.15411071    0.16873903    0.16718186    0.16249883    0.16276953    0.15989742    0.14976793    0.15812356 
##  age_cat_num9 age_cat_num10     sexFemale    left_right 
##    0.15949107    0.15807305    0.04419694    0.01179375
```

---

## Accessing confidence intervals

The standard `summary()` doesn't supply confidence intervals. We can use the `confint()` command to produce them. For example, for the logistic regression:

```r
confint(simple_linear_model_logistic)
```

```
##                    2.5 %      97.5 %
## (Intercept)  -0.73477302 -0.26576666
## age_cat_num1 -0.10174634  0.75435977
## age_cat_num2  0.54462150  1.09087948
## age_cat_num3  0.38028449  0.90610455
## age_cat_num4  0.15157308  0.62559600
## age_cat_num5  0.24720885  0.72255407
## age_cat_num6  0.07876006  0.52466107
## age_cat_num7 -0.18943519  0.12230533
## age_cat_num8 -0.79247625 -0.31710976
## age_cat_num9 -1.22151746 -0.67530037
## sexFemale    -0.24214913  0.06788039
## left_right   -0.04303617  0.04049153
```

---

## Alternatives for accessing model information

Add-on packages, e.g., for creating tables, usually gather such information about models (such as standard errors or confidence intervals) for you. However, there are also other, less tedious ways of directly accessing model parameters. One such option is the [`parameters` package](https://easystats.github.io/parameters/) that is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/).
```r
library(parameters)

model_parameters(simple_linear_model)
```

.right[↪️]

---

class: middle

.small[
```
## Parameter        | Coefficient |   SE |         95% CI | t(3091) |      p
## -------------------------------------------------------------------------
## (Intercept)      |        4.38 | 0.15 | [ 4.08,  4.68] |   28.41 | < .001
## age_cat_num [2]  |        0.15 | 0.17 | [-0.18,  0.48] |    0.89 | 0.373 
## age_cat_num [3]  |        0.21 | 0.17 | [-0.11,  0.54] |    1.29 | 0.199 
## age_cat_num [4]  |        0.02 | 0.16 | [-0.30,  0.34] |    0.12 | 0.904 
## age_cat_num [5]  |        0.03 | 0.16 | [-0.29,  0.35] |    0.16 | 0.872 
## age_cat_num [6]  |       -0.03 | 0.16 | [-0.34,  0.29] |   -0.17 | 0.863 
## age_cat_num [7]  |       -0.24 | 0.15 | [-0.53,  0.06] |   -1.58 | 0.114 
## age_cat_num [8]  |       -0.53 | 0.16 | [-0.84, -0.22] |   -3.38 | < .001
## age_cat_num [9]  |       -0.64 | 0.16 | [-0.95, -0.32] |   -3.99 | < .001
## age_cat_num [10] |       -1.00 | 0.16 | [-1.31, -0.69] |   -6.33 | < .001
## sex [Female]     |       -0.04 | 0.04 | [-0.12,  0.05] |   -0.80 | 0.421 
## left_right       |    7.05e-04 | 0.01 | [-0.02,  0.02] |    0.06 | 0.952 
```
]

---

## (Simple) model predictions

It is also quite straightforward to produce simple predictions from an estimated model using the `predict()` function in `R`. We can feed the `predict()` function with our own data to figure out what our model actually predicts when something changes (in the example below, that is the age category). This can provide additional insights into our models.

*Note*: For the following code to work, we need to change the `sex` variable to numeric and then fit our simple linear model again, using that as a predictor.

.small[
```r
corona_survey <- corona_survey %>% 
  mutate(sex_num = as.numeric(sex))

simple_linear_model_new <- lm(risk_self ~ age_cat_num + sex_num + left_right,
                              data = corona_survey)

predictions_data <- data.frame(left_right = rep(mean(corona_survey$left_right, na.rm = TRUE), 2),
                               sex_num = rep(mean(corona_survey$sex_num), 2),
                               age_cat_num = as.factor(c(1, 10)))

predict(object = simple_linear_model_new,
        newdata = predictions_data,
        interval = "confidence")
```

```
##        fit      lwr      upr
## 1 4.364001 4.082243 4.645759
## 2 3.363749 3.235150 3.492348
```
]

---

## More advanced post-estimation techniques

In an OLS context, predictions of this kind are straightforward to interpret. For non-linear models, such as logistic regression, this is more difficult:

```r
simple_linear_model_logistic_new <- glm(risk_self_bin ~ age_cat_num + sex_num + left_right,
                                        family = binomial(link = "logit"),
                                        data = corona_survey)

predictions <- predict(object = simple_linear_model_logistic_new,
                       newdata = predictions_data)
```

Predictions on the link scale have to be converted into probabilities:

```r
exp(predictions) / (1 + exp(predictions))
```

```
##         1         2 
## 0.4457056 0.1204915
```

---

## Average marginal effects (AME)

AMEs provide an interpretation of a one-unit change similar to coefficients in OLS models: the average change in the dependent variable when all other variables are held constant (at their empirical values). The [`margins` package](https://cran.r-project.org/web/packages/margins/index.html) provides a simple function for AMEs.
```r
library(margins)

margins(simple_linear_model_logistic)
```

.right[↪️]

---

class: middle

```
##  left_right age_cat_num2 age_cat_num3 age_cat_num4 age_cat_num5 age_cat_num6 age_cat_num7 age_cat_num8 age_cat_num9 age_cat_num10
##  -0.0002708       0.1209      0.07796       0.0147      0.03859    -0.006737     -0.08695      -0.1957      -0.2616       -0.3251
##  sexFemale
##   -0.01835
```

---

## Regression tables

As we have seen in the session on *Exploratory Data Analysis*, there are different options for creating tables as output in `R`. The ones we have looked at in the EDA session, [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) and [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/), can also be used for creating regression tables. The functions from these packages can likewise be used for comparing multiple regression models in one table.

---

## Regression tables with the `stargazer` package

```r
library(stargazer)

stargazer(simple_linear_model,
          simple_linear_model_effect_coded,
          simple_linear_model_logistic,
          simple_linear_model_probit,
          type = "text",
          dep.var.labels = c("Infection risk", "Infection risk binary"),
          covariate.labels = c("Age category 1", "Age category 2",
                               "Age category 3", "Age category 4",
                               "Age category 5", "Age category 6",
                               "Age category 7", "Age category 8",
                               "Age category 9", "Sex (Female)",
                               "Political orientation (left-right)"))
```

.right[↪️]

---

class: center, middle

.small[
```
## 
## ==============================================================================
##                                                  Dependent variable:         
##                                     ------------------------------------------
##                                        Infection risk   Infection risk binary
##                                             OLS          logistic    probit  
##                                        (1)       (2)       (3)        (4)    
## ------------------------------------------------------------------------------
## Age category 1                                0.202       0.330      0.195   
##                                              (0.131)     (0.217)    (0.136)  
## 
## Age category 2                        0.150   0.353***    0.816***   0.499***
##                                      (0.169) (0.083)     (0.139)    (0.087)  
## 
## Age category 3                        0.215   0.417***    0.643***   0.391***
##                                      (0.167) (0.081)     (0.134)    (0.084)  
## 
## Age category 4                        0.020   0.222***    0.389***   0.232***
##                                      (0.162) (0.073)     (0.121)    (0.075)  
## 
## Age category 5                        0.026   0.229***    0.485***   0.292***
##                                      (0.163) (0.073)     (0.121)    (0.076)  
## 
## Age category 6                       -0.028   0.175**     0.302***   0.178** 
##                                      (0.160) (0.068)     (0.114)    (0.071)  
## 
## Age category 7                       -0.237  -0.034      -0.033     -0.031   
##                                      (0.150) (0.046)     (0.080)    (0.049)  
## 
## Age category 8                      -0.534*** -0.332***  -0.551***  -0.343***
##                                      (0.158) (0.065)     (0.121)    (0.072)  
## 
## Age category 9                      -0.637*** -0.434***  -0.941***  -0.569***
##                                      (0.159) (0.067)     (0.139)    (0.079)  
## 
## Sex (Female)                        -1.000***                                
##                                      (0.158)                                 
## 
## Political orientation (left-right)   -0.036  -0.036      -0.087     -0.055   
##                                      (0.044) (0.044)     (0.079)    (0.048)  
## 
## left_right                            0.001   0.001      -0.001     -0.0001  
##                                      (0.012) (0.012)     (0.021)    (0.013)  
## 
## Constant                             4.378*** 4.176***   -0.499***  -0.304***
##                                      (0.154) (0.066)     (0.120)    (0.072)  
## 
## ------------------------------------------------------------------------------
## Observations                          3,103   3,103       3,103      3,103   
## R2                                    0.080   0.080                          
## Adjusted R2                           0.076   0.076                          
## Log Likelihood                                         -1,888.300 -1,888.253 
## Akaike Inf. Crit.                                       3,800.600  3,800.506 
## Residual Std. Error (df = 3091)       1.218   1.218                          
## F Statistic (df = 11; 3091)         24.346*** 24.346***                      
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01
```
]

---

## Regression tables with the `stargazer` package

As for the summary table, the `stargazer()` output can also be saved in various formats, such as `LaTeX`.
.small[
```r
# If you have not done so before, you need to create a directory for output files first
# dir.create("./output")

stargazer(simple_linear_model,
          simple_linear_model_effect_coded,
          simple_linear_model_logistic,
          simple_linear_model_probit,
          type = "latex",
          dep.var.labels = c("Infection risk", "Infection risk binary"),
          covariate.labels = c("Age category 1", "Age category 2",
                               "Age category 3", "Age category 4",
                               "Age category 5", "Age category 6",
                               "Age category 7", "Age category 8",
                               "Age category 9", "Sex (Female)",
                               "Political orientation (left-right)"),
          out = "./output/stargazer_regression.tex")
```
]

---

## Regression tables with `gtsummary`

The `gtsummary` package also provides a function for creating regression tables.

*Note*: Remember that, by default, the output of the `gtsummary` functions will be shown in the `Viewer` tab in *RStudio*.

```r
library(gtsummary)

tbl_regression(simple_linear_model,
               label = list(sex ~ "Sex",
                            age_cat_num ~ "Age category",
                            left_right ~ "Political orientation (left-right)"))
```

---

## Regression tables with `gtsummary`

We can also export the `gtsummary` output into a *Word* document using a function from the `flextable` package.

```r
# If you have not done so before, you need to create a directory for output files first
# dir.create("./output")

library(flextable)

tbl_regression(simple_linear_model,
               label = list(sex ~ "Sex",
                            age_cat_num ~ "Age category",
                            left_right ~ "Political orientation (left-right)")) %>% 
  as_flex_table() %>% 
  save_as_docx("Results of the OLS regression model" = .,
               path = "./output/gt_regression.docx")
```

---

## Generating text output

Instead of or in addition to tables, we often want to report the results of our statistical analyses in textual form. To save us from typing and copy-pasting (too much), the [`report` package](https://easystats.github.io/report/) from the `easystats` collection offers some convenient functions. For example, we can easily produce some template text for the t-test we computed earlier on.

```r
library(report)

t.test(risk_self ~ sex, data = corona_survey) %>% 
  report()
```

```
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The Welch Two Sample t-test testing the difference of risk_self by sex (mean in group Male = 4.08, mean in group Female = 4.11) suggests that the effect is positive, statistically not significant, and very small (difference = 0.03, 95% CI [-0.11, 0.06], t(3149.60) = -0.58, p = 0.562; Cohen's d = -0.02, 95% CI [-0.09, 0.05])
```

.small[
*Note*: While `report` is format-agnostic as it produces plain-text output, a nice tool for getting your output from `R` into *Microsoft Word* is the [`tidystats` Word add-in](https://www.tidystats.io/word-add-in.html).
]

---

## Generating text output

We can do the same for the ANOVA...

```r
anova %>% 
  report()
```

```
## The ANOVA (formula: risk_self ~ age_cat) suggests that:
## 
##   - The main effect of age_cat is statistically significant and medium (F(9, 3142) = 29.93, p < .001; Eta2 = 0.08, 90% CI [0.06, 0.09])
## 
## Effect sizes were labelled following Field's (2013) recommendations.
```

---

## Generating text output

... or our regression models.

```r
simple_linear_model_logistic %>% 
  report()
```

.right[↪️]

---

class: center, middle

.small[
```
## We fitted a logistic model (estimated using ML) to predict risk_self_bin with age_cat_num, sex and left_right (formula: risk_self_bin ~ age_cat_num + sex + left_right). The model's explanatory power is weak (Tjur's R2 = 0.08).
## The model's intercept, corresponding to age_cat_num = , sex = Male and left_right = 0, is at -0.50 (95% CI [-0.73, -0.27], p < .001). Within this model:
## 
##   - The effect of age_cat_num [1] is statistically non-significant and positive (beta = 0.33, 95% CI [-0.10, 0.75], p = 0.129; Std. beta = 0.33, 95% CI [-0.10, 0.75])
##   - The effect of age_cat_num [2] is statistically significant and positive (beta = 0.82, 95% CI [0.54, 1.09], p < .001; Std. beta = 0.82, 95% CI [0.54, 1.09])
##   - The effect of age_cat_num [3] is statistically significant and positive (beta = 0.64, 95% CI [0.38, 0.91], p < .001; Std. beta = 0.64, 95% CI [0.38, 0.91])
##   - The effect of age_cat_num [4] is statistically significant and positive (beta = 0.39, 95% CI [0.15, 0.63], p = 0.001; Std. beta = 0.39, 95% CI [0.15, 0.63])
##   - The effect of age_cat_num [5] is statistically significant and positive (beta = 0.49, 95% CI [0.25, 0.72], p < .001; Std. beta = 0.49, 95% CI [0.25, 0.72])
##   - The effect of age_cat_num [6] is statistically significant and positive (beta = 0.30, 95% CI [0.08, 0.52], p = 0.008; Std. beta = 0.30, 95% CI [0.08, 0.52])
##   - The effect of age_cat_num [7] is statistically non-significant and negative (beta = -0.03, 95% CI [-0.19, 0.12], p = 0.677; Std. beta = -0.03, 95% CI [-0.19, 0.12])
##   - The effect of age_cat_num [8] is statistically significant and negative (beta = -0.55, 95% CI [-0.79, -0.32], p < .001; Std. beta = -0.55, 95% CI [-0.79, -0.32])
##   - The effect of age_cat_num [9] is statistically significant and negative (beta = -0.94, 95% CI [-1.22, -0.68], p < .001; Std. beta = -0.94, 95% CI [-1.22, -0.68])
##   - The effect of sex [Female] is statistically non-significant and negative (beta = -0.09, 95% CI [-0.24, 0.07], p = 0.271; Std. beta = -0.09, 95% CI [-0.24, 0.07])
##   - The effect of left_right is statistically non-significant and negative (beta = -1.28e-03, 95% CI [-0.04, 0.04], p = 0.952; Std. beta = -2.40e-03, 95% CI [-0.08, 0.08])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
```
]

---

## tidy models with `broom`

.pull-left[
We have already entered the area of reporting statistical results. We will have a separate session on reporting with `R Markdown` tomorrow.

One thing to note at this point is that more and more `R` developers have been unsatisfied with the heterogeneous output that some of the standard regression procedures provide. These outputs may be helpful to look at, but they're usually not great for further processing. For that purpose, we need data frames/tibbles.
]

.pull-right[
<img src="img/broom.png" width="70%" style="display: block; margin: auto;" />
]

---

## 3 functions of `broom`

The [`broom`](https://broom.tidymodels.org/) package provides only 3, but very powerful, main functions:

- `tidy()`: creates a `tibble` from your model
- `glance()`: information about your model as a `tibble` ('model fit')
- `augment()`: adds information, e.g., individual fitted values, to your data

*Note*: With `broom` you can also create standardized output for other types of models, such as Latent Class Analysis (LCA) (with the package [`poLCA`](https://cran.r-project.org/web/packages/poLCA/index.html)).
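Because all three functions return regular tibbles, their output plugs directly into tidyverse pipelines. As a small sketch of what this enables (using the `simple_linear_model` object from before; `tidy()` itself is shown on the next slide):

```r
library(broom)
library(dplyr)

# Keep only the 'significant' coefficients and sort them by absolute size
tidy(simple_linear_model) %>% 
  filter(p.value < .05) %>% 
  arrange(desc(abs(estimate)))
```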
---

## `tidy()`

```r
library(broom)

tidy(simple_linear_model)
```

```
## # A tibble: 12 x 5
##    term           estimate std.error statistic   p.value
##    <chr>             <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)    4.38        0.154    28.4    6.24e-158
##  2 age_cat_num2   0.150       0.169     0.890  3.73e-  1
##  3 age_cat_num3   0.215       0.167     1.29   1.99e-  1
##  4 age_cat_num4   0.0196      0.162     0.121  9.04e-  1
##  5 age_cat_num5   0.0262      0.163     0.161  8.72e-  1
##  6 age_cat_num6  -0.0277      0.160    -0.173  8.63e-  1
##  7 age_cat_num7  -0.237       0.150    -1.58   1.14e-  1
##  8 age_cat_num8  -0.534       0.158    -3.38   7.35e-  4
##  9 age_cat_num9  -0.637       0.159    -3.99   6.72e-  5
## 10 age_cat_num10 -1.00        0.158    -6.33   2.85e- 10
## 11 sexFemale     -0.0356      0.0442   -0.805  4.21e-  1
## 12 left_right     0.000705    0.0118    0.0598 9.52e-  1
```

---

## `glance()`

```r
glance(simple_linear_model)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC deviance df.residual  nobs
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
## 1    0.0797        0.0765  1.22      24.3 8.66e-49    11 -5008. 10042. 10121.    4583.        3091  3103
```

---

## `augment()`

```r
augment(simple_linear_model)
```

```
## # A tibble: 3,103 x 11
##    .rownames risk_self age_cat_num sex    left_right .fitted .resid    .hat .sigma    .cooksd .std.resid
##    <chr>         <dbl> <fct>       <fct>       <dbl>   <dbl>  <dbl>   <dbl>  <dbl>      <dbl>      <dbl>
##  1 2                 5 7           Female          5    4.11  0.891 0.00158   1.22 0.0000707      0.732 
##  2 3                 5 8           Male            5    3.85  1.15  0.00326   1.22 0.000245       0.948 
##  3 4                 4 4           Male            7    4.40 -0.403 0.00457   1.22 0.0000420     -0.331 
##  4 6                 3 10          Male           10    3.38 -0.385 0.00549   1.22 0.0000462     -0.317 
##  5 7                 4 4           Female          5    4.37 -0.366 0.00416   1.22 0.0000315     -0.301 
##  6 8                 4 7           Female          6    4.11 -0.110 0.00179   1.22 0.00000122    -0.0902
##  7 10                7 1           Male            7    4.38  2.62  0.0150    1.22 0.00594        2.17  
##  8 11                4 6           Female          6    4.32 -0.319 0.00387   1.22 0.0000223     -0.263 
##  9 12                5 8           Male            7    3.85  1.15  0.00369   1.22 0.000277       0.947 
## 10 14                6 6           Male            6    4.35  1.65  0.00378   1.22 0.000580       1.35  
## # ... with 3,093 more rows
```

---

## Outlook: Other modeling options

As we've said at the beginning of this session, we only looked at some of the basic confirmatory analysis methods. As you can imagine, however, `R` offers plenty of options for more complex statistical models as well. A few examples:

- Structural equation models with [`lavaan`](https://lavaan.ugent.be/)
- Multilevel/mixed-effects models with [`lme4`](https://cran.r-project.org/web/packages/lme4/index.html)
- Bayesian regression models using [`brms`](https://paul-buerkner.github.io/brms/)

---

class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_4_1_3_Regression_Reporting.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_4_1_3_Regression_Reporting.html)

---

## Extracurricular activities

As it is such a simple and useful package, we recommend exploring the `broom` package a bit further. Alex Hayes, one of the authors of the package, [gave an interesting talk about it at the *RStudio* Conference in 2019](https://rstudio.com/resources/rstudioconf-2019/solving-the-model-representation-problem-with-broom/).