# Introduction to R for Data Analysis
## Data Wrangling Advanced
### Johannes Breuer & Stefan Jünger
### 2021-08-03

---

## Data wrangling continued 🤠

While in the last sessions we focused on the bread-and-butter tasks of the data preparation business, in this part we will focus on the more 'programmy' side of things. The things we will cover in this context are:

- altering the content of a whole set of variables
- conditional variable transformation
- formulating logical requests to our data
- writing loops

---

## Load the data

Again, we will work with the `.csv` version of the *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany*.

```r
library(tidyverse) # provides read_csv2(), the pipe %>%, and the dplyr verbs used below
gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv")
```

---

## Quickly define missing values

In the previous session, we discussed how to define missing values. Here we will use the `set_na()` function from the `sjlabelled` package.

```r
library(sjlabelled)

gp_covid <-
  gp_covid %>%
  set_na(na = c(-99, -77, -33, 97, 98))
```

*Note*: This is a quick-and-dirty approach because 97 and 98 are valid values of the `id` variable, which are now set to missing as well (this does not affect the examples in this session).

---

## Variables of interest

Say, we are interested in the (dis)trust towards several authorities during the early stages of the COVID-19 pandemic in Germany. There are 9 items on this topic included in the data set.

What if we want to use some data reduction method (e.g., PCA) and need the items to be reverse-coded for interpretation purposes?

---

## Recode data `across()` defined variables

The `dplyr` package provides a handy tool for applying transformations (such as recoding) across a set of variables: `across()`.

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    across(
      hzcy044a:hzcy052a,
      ~recode(
        .x,
        `5` = 1, # `old value` = new value
        `4` = 2,
        # 3 is the scale midpoint and stays unchanged
        `2` = 4,
        `1` = 5
      )
    )
  )
```
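
For a 5-point scale, the same reversal can also be written arithmetically as `6 - .x`. The following is a minimal sketch on made-up data; the tibble `toy` and its columns `item_a` and `item_b` are hypothetical and not part of the GESIS Panel data.

```r
library(dplyr)

# Hypothetical toy data: two 5-point items, one of them with a missing value
toy <- tibble(
  item_a = c(1, 2, 3, 4, 5),
  item_b = c(5, 4, NA, 2, 1)
)

# Reverse-code both items: 1 becomes 5, 2 becomes 4, ..., NA stays NA
toy_reversed <-
  toy %>%
  mutate(across(item_a:item_b, ~ 6 - .x))

toy_reversed
```

Compared to `recode()`, this shortcut only works if all items share the same numeric scale, so it is worth checking the value range first.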

---

## Using the function `across()` with logical conditions

Sometimes we want to transform only those variables that meet certain conditions. For example, for some analyses, we might want to z-standardize all numeric variables in a data set. Let's create a temporary subset and convert the `id` variable to a character vector for this example.

```r
gp_covid_tmp <-
  gp_covid %>% 
  select(id, hzcy044a:hzcy052a) %>% 
  mutate(id = as.character(id))

gp_covid_tmp %>% 
  sample_n(5)  # randomly sample 5 cases from the df
```

```
## # A tibble: 5 x 10
## id hzcy044a hzcy045a hzcy046a hzcy047a hzcy048a hzcy049a hzcy050a hzcy051a hzcy052a
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3086 1 2 2 1 2 2 2 2 2
## 2 3167 NA 2 3 2 2 2 3 2 2
## 3 821 2 NA 3 2 4 4 4 4 2
## 4 3156 NA NA NA NA NA NA NA NA NA
## 5 3012 3 4 4 1 2 1 2 1 1
```

---

## Example: z-standardize all numeric variables

The `base R` function for z-standardizing a variable is `scale()`.

```r
gp_covid_tmp <-
  gp_covid_tmp %>% 
  mutate(
    across(
      where(is.numeric), # select all columns for which is.numeric() is TRUE
      ~scale(.x)
    )
  )

gp_covid_tmp %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 10
## id hzcy044a[,1] hzcy045a[,1] hzcy046a[,1] hzcy047a[,1] hzcy048a[,1] hzcy049a[,1] hzcy050a[,1] hzcy051a[,1] hzcy052a[,1]
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2591 0.164 -1.33 -0.582 -0.722 -1.32 -1.24 -1.18 -1.09 0.302
## 2 952 -0.920 -1.33 -1.63 -0.722 -1.32 -1.24 -1.18 -1.09 -0.962
## 3 1961 -0.920 NA -0.582 -0.722 -1.32 -1.24 -1.18 -0.0341 0.302
## 4 3685 1.25 -0.214 -0.582 0.570 0.655 0.499 0.824 -0.0341 0.302
## 5 3450 -0.920 -0.214 -0.582 0.570 0.655 0.499 0.824 1.02 0.302
```
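
The `[,1]` in the column names above appears because `scale()` returns a one-column matrix rather than a plain vector. If we prefer ordinary numeric columns, we can wrap the call in `as.numeric()`. A minimal sketch on hypothetical toy data (the tibble `toy` and its columns are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data: one character column and two numeric columns
toy <- tibble(
  id    = c("a", "b", "c"),
  item1 = c(1, 2, 3),
  item2 = c(2, 4, 6)
)

# z-standardize only the numeric columns and drop the matrix structure
toy_z <-
  toy %>%
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

toy_z
```

The `id` column is left untouched because it is not numeric.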

---

## `dplyr::across()`

Illustration of `dplyr::across()`. Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Aggregate variables `c_across()` rows

Something we might want to do for our analyses is to create aggregate variables, such as sum or mean scores for a set of items. As `dplyr` operations are applied to columns, whereas such aggregations relate to rows (i.e., respondents), we need to make use of the function `rowwise()`. Say, we want to compute a sum score across the nine trust items for each respondent.

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>%
  mutate(
    sum_trust = 
      sum(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---

## Aggregate variables

Three things to note here:

1. `c_across()` is a special version of `across()` for rowwise operations.

2. We use the `ungroup()` function at the end to ensure that `dplyr` verbs will operate the default way when we further work with the `gp_covid` object. We will discuss grouping in the session on *Exploratory Data Analysis*, but you can also check out the [documentation for `group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) to learn more about this.

3. If you only need sums or means, a somewhat faster alternative is using the base `R` functions `rowSums()` and `rowMeans()` in combination with `mutate()` (and possibly also `across()` plus selection helpers). For an explanation why this can be faster, you can read the [online documentation for `rowwise()`](https://dplyr.tidyverse.org/articles/rowwise.html).
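
The `rowSums()`/`rowMeans()` alternative can be sketched as follows. The tibble `toy` and its `trust_*` columns are hypothetical; here, `across()` without a function simply hands the selected columns to `rowSums()` (newer `dplyr` versions also offer `pick()` for this purpose).

```r
library(dplyr)

# Hypothetical toy data: three items, the second respondent answered none
toy <- tibble(
  trust_1 = c(1, NA, 3),
  trust_2 = c(2, NA, 4),
  trust_3 = c(3, NA, NA)
)

# No rowwise() needed: rowSums()/rowMeans() already work row by row
toy <-
  toy %>%
  mutate(
    sum_trust  = rowSums(across(trust_1:trust_3), na.rm = TRUE),
    mean_trust = rowMeans(across(trust_1:trust_3), na.rm = TRUE)
  )

toy
```

As with `sum()` and `mean()`, a respondent without any valid answer gets a sum of `0` and a mean of `NaN`.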

---

## Aggregate variables

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a, sum_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2~
## $ hzcy045a <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA,~
## $ hzcy046a <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2~
## $ hzcy047a <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2,~
## $ hzcy048a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3,~
## $ hzcy049a <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5~
## $ hzcy050a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2,~
## $ hzcy051a <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2,~
## $ hzcy052a <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2,~
## $ sum_trust <dbl> 0, 16, 18, 18, 0, 18, 23, 15, 0, 21, 28, 28, 0, 19, 23, 0, 0, 12, 0, 18, 0, 22, 14, 12, 18, 15, 18, 21, 17, 0, 0~
```

---

## Example: Aggregate variables based on means

Rowwise transformations work the same way for means. Here, we create a mean score for the items that ask how much people trust specific people or institutions in dealing with the coronavirus.

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>% 
  mutate(
    mean_trust = 
      mean(
        c_across(hzcy044a:hzcy052a), 
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a, mean_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, ~
## $ hzcy045a <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA~
## $ hzcy046a <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, ~
## $ hzcy047a <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2~
## $ hzcy048a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3~
## $ hzcy049a <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, ~
## $ hzcy050a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2~
## $ hzcy051a <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2~
## $ hzcy052a <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2~
## $ mean_trust <dbl> NaN, 1.777778, 2.000000, 2.000000, NaN, 2.000000, 2.555556, 1.666667, NaN, 2.333333, 3.111111, 3.111111, NaN, 2~
```
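
Note that respondents without any valid answer end up with a sum of `0` and a mean of `NaN`. If we would rather have `NA` in both cases, one option is to count the valid answers first. This is only a sketch under that assumption; the tibble `toy` and the names `n_valid`, `sum_score`, and `mean_score` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: the second respondent has no valid answers
toy <- tibble(
  item_1 = c(1, NA, 3),
  item_2 = c(2, NA, NA)
)

toy <-
  toy %>%
  rowwise() %>%
  mutate(
    # number of non-missing answers per respondent
    n_valid    = sum(!is.na(c_across(item_1:item_2))),
    # return NA instead of 0 / NaN if there is nothing to aggregate
    sum_score  = if (n_valid == 0) NA_real_ else sum(c_across(item_1:item_2), na.rm = TRUE),
    mean_score = if (n_valid == 0) NA_real_ else mean(c_across(item_1:item_2), na.rm = TRUE)
  ) %>%
  ungroup()

toy
```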

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_2_1_Across_the_Tidyverse.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_2_1_Across_the_Tidyverse.html)

---

## Becoming a data wrangling pro

Sometimes, things are a bit more complicated when it comes to creating new variables. Simple recoding can be insufficient when we need to make the values of a new variable conditional on values of (multiple) other variables. Such cases require conditional transformations.

---

## Simple conditional transformation

The simplest version of a conditional variable transformation uses the `ifelse()` function.

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    high_education = 
      ifelse(education_cat == 3, "high", "not so high")
  )

gp_covid %>% 
  select(education_cat, high_education) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
## education_cat high_education
## <dbl> <chr> 
## 1 3 high 
## 2 3 high 
## 3 1 not so high 
## 4 3 high 
## 5 3 high
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]
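
One detail worth knowing is how missing values are treated: if the condition is `NA`, `ifelse()` returns `NA`. The stricter `dplyr` variant `if_else()` additionally lets us choose a value for such cases via its `missing` argument. A minimal sketch with a made-up vector (the label `"unknown"` is just an example):

```r
library(dplyr)

# Hypothetical education codes, including one missing value
education_cat <- c(1, 2, 3, NA)

# base R: a missing condition yields NA
base_version <- ifelse(education_cat == 3, "high", "not so high")

# dplyr: same logic, but missing conditions get an explicit label
strict_version <- if_else(education_cat == 3, "high", "not so high", missing = "unknown")

base_version
strict_version
```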

---

## Advanced conditional transformation

For more flexible (or complex) conditional transformations, the `case_when()` function from `dplyr` is a powerful tool.

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_cat = 
      case_when(
        between(political_orientation, 0, 3) ~ "left",
        between(political_orientation, 4, 7) ~ "center",
        political_orientation > 7 ~ "right"
      )
  )

gp_covid %>% 
  select(political_orientation, pol_leaning_cat) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
## political_orientation pol_leaning_cat
## <dbl> <chr> 
## 1 8 right 
## 2 4 center 
## 3 6 center 
## 4 2 left 
## 5 3 left
```

---

## Conditional transformation based on multiple values

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_edu = 
      case_when(
        between(political_orientation, 0, 3) & high_education == "high" ~ "left high",
        between(political_orientation, 4, 7) & high_education == "high" ~ "center high",
        political_orientation > 7 & high_education == "high" ~ "right high",
        TRUE ~ "not so high"
      )
  )

gp_covid %>% 
  select(political_orientation, high_education, pol_leaning_edu) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 3
## political_orientation high_education pol_leaning_edu
## <dbl> <chr> <chr> 
## 1 7 high center high 
## 2 4 not so high not so high 
## 3 5 high center high 
## 4 2 not so high not so high 
## 5 5 high center high
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:
- you can combine multiple conditions per value (e.g., with `&` or `|`)
- conditions are evaluated in the order in which they are written, and the first condition that is met determines the value
- when none of the specified conditions are met for an observation, by default, the new variable will have a missing value `NA` for that case
- if you want some other value in the new variable when the specified conditions are not met, you need to add `TRUE ~ value` as the last argument of the `case_when()` call
- to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html) or run `?case_when()` in `R`/*RStudio*
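
The effect of the evaluation order can be illustrated with a small made-up vector (the object names `score`, `too_greedy`, and `ordered_well` are hypothetical):

```r
library(dplyr)

# Hypothetical scores, including one missing value
score <- c(2, 5, 9, NA)

# The first condition already catches everything above 1,
# so the second condition is never reached
too_greedy <- case_when(
  score > 1 ~ "above 1",
  score > 4 ~ "above 4",
  TRUE      ~ "other"
)

# More specific conditions first; missing values are kept as NA explicitly
ordered_well <- case_when(
  score > 4    ~ "above 4",
  score > 1    ~ "above 1",
  is.na(score) ~ NA_character_,
  TRUE         ~ "other"
)

too_greedy
ordered_well
```

In the first version, the missing score ends up in `"other"` because a comparison with `NA` is never `TRUE`, so only the final `TRUE ~ "other"` applies.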

---

## `dplyr::case_when()` & `NA`s

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_edu_2 = 
      case_when(
        between(political_orientation, 0, 3) & high_education == "high" ~ "left high",
        between(political_orientation, 4, 7) & high_education == "high" ~ "center high",
        political_orientation > 7 & high_education == "high" ~ "right high",
        age_cat == 1 ~ NA_character_,
        TRUE ~ "not so high"
      )
  )

table(gp_covid$pol_leaning_edu, useNA = "always")
```

```
## 
## center high left high not so high right high <NA> 
## 1379 663 1624 99 0
```

```r
table(gp_covid$pol_leaning_edu_2, useNA = "always")
```

```
## 
## center high left high not so high right high <NA> 
## 1379 663 1606 99 18
```

---

## `dplyr::case_when()`

Illustration of `dplyr::case_when()`. Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_2_2_Define_your_Cases.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_2_2_Define_your_Cases.html)

---

## Getting a bit more "programmy"

So far, all of the previous tasks share two characteristics:
- they were based on the structure of the whole data set
- the output is, again, the whole data set

As we will talk about in more detail tomorrow and the day after that, in data analysis, our aim is often to extract information from a data set (e.g., summary statistics, regression estimates). We now will learn a bit more about
- writing functions
- if-else statements
- for-loops and the like
- modern `tidyverse` implementations

---

## Functional Programming: In `R`, everything's a function (more or less)

You should already be familiar with using functions in `R` (at least we have used them heavily so far). In general, functions are applied as follows:

```r
fancy_function(data)
```

They can be nested, for example:

```r
log(sum(c(1, 2, 3)))
```

```
## [1] 1.791759
```

---

## Defining your own function is straightforward

First, let's create a simple function that adds `1` to an entered number.

```r
add_one <- function(a_number) {
  a_number + 1
}
```

Now, we can simply apply it to some data as with any other `R` function.

```r
add_one(2)
```

```
## [1] 3
```

```r
add_one(99)
```

```
## [1] 100
```

---

## Example: extending the sum function

In some of the previous slides, you may have noticed that one issue with the `sum()` function is that it returns `NA` by default when missing values are present in the data. So we always have to set the `na.rm = TRUE` option. We could define our own function to circumvent this.

```r
sum_na <- function(x) {
  sum(x, na.rm = TRUE)
}
```
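
If we want to keep the option of switching the behavior back, we can expose `na.rm` as an argument with a changed default. The function name `sum_flexible()` is hypothetical and only used for this sketch:

```r
# Like sum(), but missing values are removed unless we say otherwise
sum_flexible <- function(x, na.rm = TRUE) {
  sum(x, na.rm = na.rm)
}

sum_flexible(c(1, NA, 3))                # missing value is ignored
sum_flexible(c(1, NA, 3), na.rm = FALSE) # behaves like plain sum()
```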

---

## Feeding it into `mutate()` and `c_across()`

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>%
  mutate(
    new_sum_trust = 
      sum_na(c_across(hzcy044a:hzcy052a))
  ) %>% 
  ungroup()

gp_covid %>% 
  select(hzcy044a:hzcy052a, new_sum_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, N~
## $ hzcy045a <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1,~
## $ hzcy046a <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, N~
## $ hzcy047a <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA~
## $ hzcy048a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA~
## $ hzcy049a <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, N~
## $ hzcy050a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA~
## $ hzcy051a <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA~
## $ hzcy052a <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA~
## $ new_sum_trust <dbl> 0, 16, 18, 18, 0, 18, 23, 15, 0, 21, 28, 28, 0, 19, 23, 0, 0, 12, 0, 18, 0, 22, 14, 12, 18, 15, 18, 21, 17, ~
```

---

## if-else statements in `R`

There may be cases in which we only want to apply a function if certain conditions are met. For such cases, we can use if-else statements (similar to what we have already seen in the example of simple conditional variable transformation).

---

## if-else architecture in `R`

Using if-else statements in `R` requires at least 3 steps:
1. Start the statement with `if()`
2. Add the condition to be tested in the parentheses: `if (condition)`
3. Write the code that should run if the condition is `TRUE` in the curly brackets: `if (condition) { ... }`

For example:

```r
if (1 < 2) {
 1 + 2
}
```

```
## [1] 3
```

---

## Adding else statements
In a fourth step, we can add an `else { ... }` branch, which is run if the condition is not met:

```r
if (1 > 2) {
  1 + 2
} else {
  2 + 5
}
```

```
## [1] 7
```

So, the general architecture is like this:

```r
if (condition) {
  function_to_apply(data)
} else {
  other_function_to_apply(data)
}
```

*Note*: We could also test for another condition within the else statements using `else if()`.
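
A chain with `else if()` could look like this. The function `classify_number()` is a made-up example and assumes a single, non-missing number as input:

```r
# Hypothetical helper: describe the sign of a single number
classify_number <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

classify_number(-3)
classify_number(0)
classify_number(42)
```

The conditions are checked from top to bottom, and only the code of the first matching branch is run.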

---

## Example: adding it to our function

We can now use this new skill to broaden the scope of our `sum_na()` function and introduce more statistics as a feature.

```r
descriptives_na <- function(x, statistic) {
  if (statistic == "sum") {
    sum(x, na.rm = TRUE)
  } else if (statistic == "mean") {
    mean(x, na.rm = TRUE)
  } else {
    stop("no valid statistic provided!")
  }
}
```

---

## Trying it out

```r
descriptives_na(c(1, 2), statistic = "sum")
```

```
## [1] 3
```

```r
descriptives_na(c(1, 2), statistic = "mean")
```

```
## [1] 1.5
```

```r
descriptives_na(c(1, 2), statistic = "mode")
```

```
## Error in descriptives_na(c(1, 2), statistic = "mode"): no valid statistic provided!
```
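
When all branches depend on the value of a single character argument, base `R`'s `switch()` is a compact alternative to a chain of `else if()` statements. The following sketch uses the hypothetical name `descriptives_switch()`; the last, unnamed argument of `switch()` acts as the fallback.

```r
# Same idea as descriptives_na(), written with switch()
descriptives_switch <- function(x, statistic) {
  switch(
    statistic,
    sum  = sum(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    stop("no valid statistic provided!") # fallback for all other values
  )
}

descriptives_switch(c(1, 2, NA), statistic = "sum")
descriptives_switch(c(1, 2, NA), statistic = "mean")
```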

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_2_3_If_I_had_a_Function.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_2_3_If_I_had_a_Function.html)

---

## `for()` loops
(Simple) loops using `for()` are some of the most useful tools in programming.

For example, they enable iterating through input data and applying functions to each element of the data
- what counts as an element depends on the specific purpose
  - the elements can be rows, columns, list elements, etc.
- hence, it is crucial to think about the iterator of the specific call

---

## Architecture of for-loops

for-loops follow a straightforward structure, which is always the same:

```r
for (iterator_name in data) {
  function_to_apply(iterator_name)
}
```
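
If we want to keep the results instead of just printing them, a common pattern is to create an empty container first and fill it inside the loop. A minimal sketch with made-up numbers (the names `values` and `squares` are hypothetical):

```r
values <- c(2, 4, 6)

# Create a numeric vector of the right length to store the results
squares <- numeric(length(values))

# Iterate over the positions 1, 2, 3 rather than over the values themselves
for (i in seq_along(values)) {
  squares[i] <- values[i]^2
}

squares
```

Here, the iterator `i` is the position in the vector, which lets us read from `values` and write to `squares` in the same step.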

---

## Calculating means for all trust variables

```r
variables_vector <- 
  c(
    "hzcy044a", "hzcy045a", "hzcy046a", "hzcy047a", "hzcy048a",
    "hzcy049a", "hzcy050a", "hzcy051a", "hzcy052a"
  )

for (variable in variables_vector) {
  print(
    descriptives_na(
      gp_covid[[variable]], 
      statistic = "mean"
    )
  )
}
```

```
## [1] 1.84894
## [1] ...
## [1] ...
## [1] 1.558643
## [1] 2.336631
## [1] 2.427157
## [1] 2.178297
## [1] 2.032227
## [1] 1.761184
```

---

## The apply family
The apply family can make your life a bit easier when writing `base R` loops:
- it provides a friendly interface for entering your data
- data come out in a standard format
- it is often more concise (and can sometimes be faster) than, e.g., writing a `for()` loop

We can't cover all members of this family of functions. We will (briefly) cover:
- **`apply()`**
- **`lapply()`**
- **`sapply()`**
- **`tapply()`**
- `mapply()`, `rapply()`, & `vapply()` are left out

---

## apply()
The `apply()` function is useful if you want to fire up a short command across either all columns (option `MARGIN = 2`) _or_ rows (option `MARGIN = 1`).

```r
# means across columns/variables
apply(gp_covid[,20:24], 2, function (x) descriptives_na(x, statistic = "mean"))
```

```
##   hzcy007a   hzcy008a   hzcy009a   hzcy010a   hzcy011a 
## 0.80288763 0.46547395 0.01851852 0.08600126 0.91054614
```

```r
# means across rows/observations
apply(gp_covid[1:10,20:24], 1, function (x) descriptives_na(x, statistic = "mean"))
```

```
##  [1] NaN 0.2 0.4 0.4 NaN 0.0 0.4 0.4 NaN 0.6
```

.small[
*Note*: While there are plenty of functions for building descriptive tables available via different packages (many of which we will cover in the session on *Exploratory Data Analysis*), this comes in handy if you want/need to create them yourself.
]
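
The meaning of `MARGIN` is easiest to see with a small matrix. The object `m` below is made up for illustration:

```r
# A 2 x 3 matrix; the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

row_means <- apply(m, 1, mean) # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean) # MARGIN = 2: one result per column

row_means
col_means
```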

---

## lapply()
`lapply()` is for more elaborate operations. However, there is no `MARGIN` argument, so let's see what happens when we use it in a similar way as we did before:

```r
lapply(gp_covid[,20:24], function (x) descriptives_na(x, statistic = "mean"))
```

```
## $hzcy007a
## [1] 0.8028876
## 
## $hzcy008a
## [1] 0.4654739
## 
## $hzcy009a
## [1] 0.01851852
## 
## $hzcy010a
## [1] 0.08600126
## 
## $hzcy011a
## [1] 0.9105461
```

---

## lapply() returns lists

This may be a little inconvenient, but `lapply()` returns each result of an iterated operation as a list element. Thus, the output of applying the function is a list.
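
If every list element contains just one number, the list can be flattened with `unlist()`. A minimal sketch with a made-up data frame (`toy`, `means_list`, and `means_vector` are hypothetical names):

```r
# Hypothetical toy data with one missing value
toy <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30))

# lapply() iterates over the columns and returns a named list
means_list <- lapply(toy, mean, na.rm = TRUE)
means_list$a

# unlist() turns the list into a named numeric vector
means_vector <- unlist(means_list)
means_vector
```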

---

## sapply()

`sapply()` is similar to `lapply()`. The small but important difference is that it simplifies the result to a vector (or matrix) where possible instead of returning a list. This comes in handy if you want to add the results as a new column to your existing data.

```r
sapply(gp_covid[,20:24], function (x) descriptives_na(x, statistic = "mean"))
```

```
##   hzcy007a   hzcy008a   hzcy009a   hzcy010a   hzcy011a 
## 0.80288763 0.46547395 0.01851852 0.08600126 0.91054614
```

---

## tapply()
Finally, `tapply()` is useful if you want to perform an action across different groups in your data.

```r
tapply(
  gp_covid$political_orientation, 
  gp_covid$sex,
  function (x) descriptives_na(x, statistic = "mean")
  )
```

```
##        1        2 
## 4.893819 4.412325
```

*Note*: Again, there are plenty of functions for creating descriptive statistics available already (which we will discuss in the session on *Exploratory Data Analysis*). However, at some point you may want/need to create your own functions and the members of the `apply()` family can come in handy there.
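
To see what `tapply()` does step by step, here is a minimal sketch with two made-up vectors (`orientation` and `group` are hypothetical): the first vector holds the values, the second one defines the groups.

```r
orientation <- c(2, 5, 4, 4, 6)
group       <- c("a", "b", "a", "b", "b")

# Split orientation by group and compute the mean within each group
group_means <- tapply(orientation, group, mean)
group_means
```

The result has one named element per group, similar to what we saw for `sex` above.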

---

## Modern iteration options from the `purrr` package

.pull-left[
Thus far, our examples have not been that complicated
- we had one specific task to perform and the input data were not complex
]

.pull-right[
Sometimes, things are a bit more complicated
- for example, the data may have to be wrangled before the actual loop
]

.pull-left[
*(`purrr` logo)*
]

.pull-right[
**`purrr` provides a collection of functions that also integrate nicely into a
`%>%` workflow** 
]

---

## A simple `map()` example

We can use `map()` to apply our `descriptives_na()` function to multiple list elements at once.

```r
library(purrr)

gp_covid %>% 
  select(sex, hzcy044a:hzcy052a) %>% 
  group_by(sex) %>% 
  group_split(.keep = FALSE) %>% # the data are already grouped by sex
  map(~as.matrix(.x)) %>% 
  map_dbl(~descriptives_na(.x, statistic = "mean"))
```

```
## [1] 2.147828 2.046260
```

*Note*: The `dplyr` package contains some helpful functions for summarizing data (which we will cover in the session on *Exploratory Data Analysis*). In addition to increasing your `R` programming skills, the above example can help in understanding how these work.

---

## `purrr::map()`

A few things to note about `map()`:
- `map()` expects a list (or vector) as input
  - this is why we split our data into a list of two data frames
- the function to apply to each list element can be written as a formula with a preceding `~`, in which `.x` stands for the current element
- by default, `map()` also returns the results as a list
  - yet, there are pre-defined `map()` variants that return other data types (e.g., `map_dbl()`, which we used above)
  - you may want to have a look at the help page using `?map` for a comprehensive overview
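
The difference between the `map()` variants can be sketched with a small made-up list (`number_list` and the result names are hypothetical):

```r
library(purrr)

# Hypothetical input: a named list with two numeric vectors
number_list <- list(first = c(1, 2, 3), second = c(10, NA, 30))

# map() returns a list with one element per input element
means_list <- map(number_list, ~ mean(.x, na.rm = TRUE))

# map_dbl() returns a named numeric vector instead
means_dbl <- map_dbl(number_list, ~ mean(.x, na.rm = TRUE))

# map_chr() returns a character vector
sizes_chr <- map_chr(number_list, ~ paste(length(.x), "values"))

means_dbl
sizes_chr
```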
  
**We will re-use the `purrr` capabilities later this week when we wrangle multiple regression models at the same time.**

---

## `purrr::map()`

Illustration of `purrr::map()`. Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Overview of looping functions in `R`

| Name           | For what?                                  | Belongs to  |
|----------------|--------------------------------------------|-------------|
| `for()`        | raw interface to repeated tasks            | `base R`    |
| `apply()` etc. | convenience functions for repeated tasks   | `base R`    |
| `map()` etc.   | integrates into `%>%` workflow             | `tidyverse` |
| ...            | ...                                        | ...         |
| `while()`      | do something as long as a condition is met | `base R`    |
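
As we have not used `while()` so far, here is a minimal sketch with made-up counters (`counter` and `total` are hypothetical names): the loop keeps adding the next integer until the running total reaches at least 10.

```r
counter <- 0
total   <- 0

# Repeat as long as the condition in the parentheses is TRUE
while (total < 10) {
  counter <- counter + 1
  total   <- total + counter
}

counter
total
```

Unlike `for()`, the number of iterations is not fixed in advance, so we need to make sure that the condition eventually becomes `FALSE`.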

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_2_4_Purrr_Joy_of_Writing_Loops.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_2_4_Purrr_Joy_of_Writing_Loops.html)

---

# Extracurricular activities
`R` can also be used for creating text-based adventure games. Play the fun short text adventure ["Castle of R"](https://github.com/gsimchoni/CastleOfR) which was designed to test your programming skills using `base R`.

Also check out the [background](http://giorasimchoni.com/2017/09/10/2017-09-10-you-re-in-a-room-the-castleofr-package/) of the programming of the game/package.