class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Data Wrangling Basics
### Johannes Breuer & Stefan Jünger
### 2021-08-03

---
layout: true

---

## Data wrangling 🤠

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\data_cowboy.png" width="95%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## What is data wrangling?

Data wrangling is the process of "getting the data into shape", so that you can then explore and analyze them.

Common data wrangling steps when working with tabular data in the social & behavioral sciences (e.g., from surveys) include:

- **renaming** variables
- **selecting** a subset of variables
- **filtering** a subset of cases
- **recoding** variables/values (incl. missing values)
- **creating/computing** new variables

--

The (in)famous **80/20-rule**: 80% wrangling, 20% analysis (of course, this ratio relates to the time required for writing the code, not the computing time).

---

## The `tidyverse`

> The `tidyverse` is an .highlight[opinionated collection of R packages designed for data science]. All packages share an .highlight[underlying design philosophy, grammar, and data structures] ([Tidyverse website](https://www.tidyverse.org/)).

> The `tidyverse` is a .highlight[coherent system of packages for data manipulation, exploration and visualization] that share a .highlight[common design philosophy] ([Rickert, 2017](https://rviews.rstudio.com/2017/06/08/what-is-the-tidyverse/)).

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\hex-tidyverse.png" width="25%" style="display: block; margin: auto;" />

---

## Benefits of the `tidyverse`

Data wrangling can also be done with `base R`. However, the `base R` syntax for this is typically more verbose and less intuitive and, hence, harder to learn, remember, and read. In addition, many `tidyverse` operations are faster than their `base R` equivalents.

---

## Benefits of the `tidyverse`

`Tidyverse` syntax is designed to increase **human-readability**. This makes it especially **attractive for `R` novices** as it can facilitate the experience of **self-efficacy** (see [Robinson, 2017](http://varianceexplained.org/r/teach-tidyverse/)).

The `tidyverse` also aims for **consistency** (e.g., data frame as first argument and output) and uses **smarter defaults** (e.g., no partial matching of data frame and column names).

---

## The 'dark side' of the `tidyverse`

`tidyverse` is not `R` as in `base R`

- some routines are like using a whole different language, which...
  - ... can be nice when learning `R`
  - ... can get difficult when searching for solutions to certain problems

Often, `tidyverse` functions are under heavy development

- they change and can potentially break your code
- Example: [Converting tables into long or wide format](https://tidyr.tidyverse.org/news/index.html#pivoting)
- to learn more about the `tidyverse` lifecycle you can watch this [talk by Hadley Wickham](https://www.youtube.com/watch?v=izFssYRsLZs) or read the corresponding [documentation](https://lifecycle.r-lib.org/articles/stages.html#deprecated)

---

## `Base R` vs. `tidyverse`

Similar to other fierce academic debates over, e.g., `R` vs. `Python` or Frequentism vs. Bayesianism, people have argued [for](http://varianceexplained.org/r/teach-tidyverse/) and [against](https://blog.ephorie.de/why-i-dont-use-the-tidyverse) using/teaching the `tidyverse`.

Our personal experience with teaching the `tidyverse` is something like this...
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\tidyverse_meme.png" width="50%" style="display: block; margin: auto;" />

.center[
<small><small>Source: https://s.unhb.de/ReoyN</small></small>
]

---

## Data wrangling alternatives

As with almost all tasks in `R`, there are more than two options for data wrangling. Two alternatives (or additions) to `base R` and the `tidyverse` are:

- [`data.table`](https://rdatatable.gitlab.io/data.table/index.html)
- [`datawizard`](https://easystats.github.io/datawizard/)

---

## `data.table`

The `data.table` package is also a powerful tool for data wrangling, especially if you work with large data sets.

The reason we do not discuss `data.table` in this course is that neither of us has extensive experience with it, and comparing all three options (`base R`, `tidyverse`, and `data.table`) side-by-side would be enough for a separate workshop/course.

There is, however, a very detailed [blog post by Jason Mercer](https://wetlandscapes.com/blog/a-comparison-of-r-dialects/) that compares the functionalities of `base R`, `tidyverse`, and `data.table` for data wrangling and [another one by Atrebas](https://atrebas.github.io/post/2019-03-03-datatable-dplyr/) that focuses on a comparison between `data.table` and [`dplyr`](https://dplyr.tidyverse.org/), which is a key package for data manipulation from the `tidyverse`.

---

## `datawizard` 🧙

`datawizard` is a fairly new contender in the data wrangling game that also offers quite a few handy and easy-to-use functions.

`datawizard` is part of the [`easystats` collection of `R` packages](https://easystats.github.io/easystats/). These packages offer many helpful functionalities for data preparation, analysis, and reporting, and can nicely extend or complement the `tidyverse`. We will discuss some of the `easystats` packages again in the sessions on exploratory and confirmatory data analysis.
---

## Structure & focus of this session

For most of the data wrangling tasks we discuss in this section, we will show how to do them with `base R` and the `tidyverse`, so that you can get a sense of the differences.

Our main focus, however, will be on the use of packages (and functions) from the `tidyverse` and how they can be used to clean and transform your data.

Of course, it is possible to combine `base R` and `tidyverse` code. However, in the long run, you should try to aim for consistency.

---

## Lift-off into the `tidyverse` 🚀

**Install all `tidyverse` packages** (for the full list of `tidyverse` packages see [https://www.tidyverse.org/packages/](https://www.tidyverse.org/packages/))

```r
install.packages("tidyverse")
```

**Load core `tidyverse` packages** (NB: To save time and reduce namespace conflicts you can also load `tidyverse` packages individually)

```r
library("tidyverse")
```

---

## `tidyverse` vocabulary 101

While there is much more to the `tidyverse` than this, three important concepts that you need to be familiar with, if you want to use it, are:

1. Tidy data
2. Tibbles
3. Pipes

We already discussed tibbles in the session on *Data Import & Export*, so we will focus on tidy data and pipes here.

---

## Tidy data

The 3 rules of tidy data:

1. Each **variable** is in a separate **column**.
2. Each **observation** is in a separate **row**.
3. Each **value** is in a separate **cell**.

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\tidy_data.png" width="2560" style="display: block; margin: auto;" />

<small><small>Source: https://r4ds.had.co.nz/tidy-data.html</small></small>

*Note*: In the `tidyverse` terminology 'tidy data' usually also means data in long format (where applicable).

---

## Wide vs. long format

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\wide-long.png" width="70%" style="display: block; margin: auto;" />

<small><small>Source: https://github.com/gadenbuie/tidyexplain#tidy-data</small></small>

.small[
*Note*: The functions `pivot_wider()` and `pivot_longer()` from the [`tidyr` package](https://tidyr.tidyverse.org/) are easy-to-use options for changing data from long to wide format and vice versa.
]

---

### What's a pipe?

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\pipe_office_decoration.jpg" width="50%" style="display: block; margin: auto;" />

.small[
Source: Johannes' GESIS office wall
]

---

## Pipes

Usually, in `R` we apply functions as follows:

```r
f(x)
```

In the logic of pipes this function is written as:

```r
x %>% f(.)
```

Here, object `x` is piped into function `f`, becoming (by default) its first argument (but by using the placeholder `.` it can also be fed into other arguments).

--

We can use pipes with more than one function:

```r
x %>% 
  f_1() %>% 
  f_2() %>% 
  f_3()
```

.small[
More about pipes: https://r4ds.had.co.nz/pipes.html
]

---

## Pipes

The `%>%` pipe used in the `tidyverse` is part of the [`magrittr` package](https://magrittr.tidyverse.org/), which also includes other specialized types of pipes.

*RStudio* offers a keyboard shortcut for inserting the `%>%` pipe: <kbd>Ctrl + Shift + M</kbd> (*Windows* & *Linux*)/<kbd>Cmd + Shift + M</kbd> (*Mac*)

Since [version 4.1.0](https://cran.r-project.org/bin/windows/base/NEWS.R-4.1.0.html), `base R` also offers its own pipe `|>`, which is similar to but not the same as the `%>%` pipe.

---

## Data set

For the examples and exercises in this session we will, again, use data from the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany*.
Remember that to code along and for the exercises the *GESIS Panel* files should be in a folder called `data` in the same folder as the other materials for this course.

---

## Interlude 1: Citing data

If you (re-)use existing data sets, please cite them in your publications, theses, teaching materials, etc. Data repositories normally provide information on how to cite the data. For example, the APA-style citation for *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* is:

GESIS Panel Team (2020). GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany. *GESIS Datenarchiv, Köln. ZA5667 Datenfile Version 1.1.0*, https://doi.org/10.4232/1.13520.

---

## Interlude 2: Citing FOSS

You should also make sure to cite the free and open-source software that you use, such as `R` packages and `R` itself. There is a function in `R` that tells you how to cite it or any of the packages you have installed.

```r
citation()
```

```
## 
## To cite R in publications use:
## 
##   R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical
##   Computing, Vienna, Austria. URL https://www.R-project.org/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2021},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
```

---

## Interlude 3: Codebook

It is always advisable to consult the codebook (if there is one) before starting to work with a data set.
The *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* comes with a very [detailed codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378).

Side note: If you want to (semi-)automatically generate a codebook for your own dataset, there are several options in `R`:

- The [`codebook` package](https://rubenarslan.github.io/codebook/) which includes an *RStudio*-Addin and also offers a [web app](https://rubenarslan.ocpu.io/codebook/www/)
- the `makeCodebook()` function from the [`dataMaid` package](https://github.com/ekstroem/dataMaid) (see this [blog post](http://sandsynligvis.dk/articles/18/codebook.html) for a short tutorial)
- the `codebook()` function from the [`memisc` package](https://github.com/melff/memisc)

---

## Load the data

The first step, of course, is loading the data into `R`. The *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* is available in different formats. We will work with the `.csv` file.

```r
gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv")
```

*Note*: `read_csv2()` is used to load files that use "; for the field separator and , for the decimal point" (from the function help file), which is the format that the `.csv` version of this data set is in.

---

## Note: Tidy vs. untidy data

As a lot of work (by many people) has already gone into this data set, the *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* is already tidy. If you collect data yourself, this may not be the case (at least for the raw data). For example, cells may hold more than one value or a variable that should be in one column is spread across multiple columns (e.g., parts of a date or name).
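
To make these two kinds of untidiness more concrete, here is a minimal sketch. It assumes the `tidyr` and `dplyr` packages are installed, and the tibble `raw_data` with its column names is made up purely for illustration (it is not part of the GESIS Panel data). `separate()` splits a cell that holds two values, and `unite()` combines a variable that is spread across several columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw data (made up for illustration):
# 'name' holds two values per cell, and the date is spread across three columns
raw_data <- tibble(
  name  = c("Doe, Jane", "Roe, Richard"),
  year  = c(2020, 2021),
  month = c(3, 11),
  day   = c(17, 5)
)

tidy_data <- raw_data %>% 
  # split the 'name' column into two separate columns
  separate(name, into = c("last_name", "first_name"), sep = ", ") %>% 
  # combine the three date parts into a single column
  unite("date", year, month, day, sep = "-")

tidy_data
```

Other separators or column names would work the same way; which columns need splitting or combining depends entirely on how your raw data were recorded.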
If you need to make your data tidy or change it from wide to long format or vice versa (which may, e.g., be necessary if you work with longitudinal survey data from multiple waves), the [`tidyr` package](https://tidyr.tidyverse.org/) from the `tidyverse` is a good option.

---

## `dplyr`

The `tidyverse` examples in the following will make use of functions from the [`dplyr` package](https://dplyr.tidyverse.org/):

- `dplyr` functions are verbs that signal an action
- first argument = a data frame
- the output normally also is a data frame (tibble)
- columns (= variables in a tidy data frame) can be referenced without quotation marks (non-standard evaluation)
- actions (verbs) can be applied to columns (variables) and rows (cases/observations)

---

## First look 👀

The `dplyr` package provides a function for getting a good first look at your data, which is especially helpful when working with data sets that contain many columns/variables. The function `glimpse()` prints a data frame/tibble in a way that represents columns as rows and rows as columns and also provides some additional information about the data frame and its columns.
```r
gp_covid %>% 
  glimpse()
```

.right[↪️]

---
class: middle

.tinyisher[
```
## Rows: 3,765
## Columns: 137
## $ za_number             <chr> "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", "ZA5667", ~
## $ version               <chr> "v1-1-0 2020-04-27", "v1-1-0 2020-04-27", "v1-1-0 2020-04-27", "v1-1-0 2020-04-27", "v1-1-0 2020-04-~
## $ doi                   <chr> "10.4232/1.13520", "10.4232/1.13520", "10.4232/1.13520", "10.4232/1.13520", "10.4232/1.13520", "10.4~
## $ id                    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2~
## $ cohort                <dbl> 3, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 2, 1, 2, 1~
## $ sex                   <dbl> 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2~
## $ age_cat               <dbl> 7, 7, 8, 4, 1, 10, 4, 7, 8, 1, 6, 8, 2, 6, 2, 2, 2, 7, 4, 8, 1, 7, 4, 3, 5, 7, 7, 6, 6, 5, 7, 7, 5, ~
## $ education_cat         <dbl> 3, 2, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2~
## $ intention_to_vote     <dbl> 2, 2, 2, 2, -33, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -77, 2, 2, 2, 2, 2, -99~
## $ choice_of_party       <dbl> 1, 5, 1, 1, -33, 6, 6, 5, 1, 2, 1, 6, 98, 1, 7, 1, 5, 1, 98, 1, 1, 1, 7, 1, 7, -77, 1, 5, 6, 3, 1, 1~
## $ political_orientation <dbl> 6, 5, 5, 7, 4, 10, 5, 6, 6, 7, 6, 7, 5, 6, 6, 3, 5, 5, 6, 6, 4, 6, 5, 5, 7, -77, 6, 4, 6, 8, 4, 6, 3~
## $ marstat               <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2, 3, 2, 1, 1~
## $ household             <dbl> 1, 2, 2, 3, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 1, 3, 3, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 3, 2, 1, 2, 2, 3, 2~
## $ hzcy001a              <dbl> -33, 5, 5, 4, -33, 3, 4, 4, -33, 7, 4, 5, -33, 6, 4, -33, -33, 6, -33, 3, -33, 6, 5, 5, 6, 5, 7, 6, ~
## $ hzcy002a              <dbl> -33, 5, 6, 4, -33, 3, 3, 4, -33, 5, 6, 6, -33, 6, 5, -33, -33, 6, -33, 5, -33, 6, 6, 97, 6, 6, 7, 6,~
## $ hzcy003a              <dbl> -33, 2, 3, 2, -33, -99, 3, 3, -33, 3, 3, 7, -33, 3, 2, -33, -33, 4, -33, 3, -33, 1, 3, 4, 3, 3, 7, 3~
## $ hzcy004a              <dbl> -33, 5, 6, 4, -33, 3, 3, 3, -33, 4, 5, 6, -33, 7, 3, -33, -33, 5, -33, 4, -33, 3, 4, 5, 6, 3, 7, 3, ~
## $ hzcy005a              <dbl> -33, 5, 6, 3, -33, 3, 4, 4, -33, 2, 4, 6, -33, 4, 2, -33, -33, 6, -33, 3, -33, 6, 4, 3, 5, 4, 7, 5, ~
## $ hzcy006a              <dbl> -33, 1, 1, 0, -33, 1, 1, 1, -33, 1, 1, 1, -33, 1, 1, -33, -33, 1, -33, 1, -33, 1, 0, 1, 1, 1, 0, 1, ~
## $ hzcy007a              <dbl> -33, 0, 1, 0, -33, 0, 1, 1, -33, 1, 1, 1, -33, 1, 1, -33, -33, 1, -33, 1, -33, 1, 0, 1, 0, 1, 0, 1, ~
## $ hzcy008a              <dbl> -33, 0, 0, 1, -33, 0, 0, 0, -33, 1, 1, 0, -33, 1, 1, -33, -33, 0, -33, 1, -33, 0, 1, 1, 1, 1, 0, 0, ~
## $ hzcy009a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy010a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 1, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy011a              <dbl> -33, 1, 1, 1, -33, 0, 1, 1, -33, 1, 1, 1, -33, 1, 1, -33, -33, 1, -33, 0, -33, 1, 0, 1, 1, 1, 1, 1, ~
## $ hzcy012a              <dbl> -33, 1, 0, 1, -33, 1, 0, 1, -33, 1, 1, 0, -33, 0, 0, -33, -33, 1, -33, 0, -33, 1, 0, 1, 1, 1, 1, 0, ~
## $ hzcy013a              <dbl> -33, 1, 0, 0, -33, 0, 0, 0, -33, 1, 1, 1, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 1, 0, 0, 1, 0, 1, ~
## $ hzcy014a              <dbl> -33, 0, 1, 1, -33, 0, 1, 1, -33, 0, 1, 1, -33, 1, 1, -33, -33, 1, -33, 1, -33, 1, 1, 1, 0, 1, 1, 1, ~
## $ hzcy015a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy016a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy018a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy019a              <dbl> -33, 5, 5, 3, -33, 4, 5, 5, -33, 4, 3, 5, -33, 4, 4, -33, -33, 5, -33, 5, -33, 5, 4, 5, 4, 4, 5, 4, ~
## $ hzcy020a              <dbl> -33, 5, 5, 4, -33, 4, 5, 5, -33, 5, 4, 5, -33, 4, 5, -33, -33, 5, -33, 5, -33, 5, 4, 4, -99, 5, 5, 4~
## $ hzcy021a              <dbl> -33, 5, 5, 4, -33, 4, 5, 5, -33, 4, 4, 5, -33, 5, 5, -33, -33, 5, -33, 5, -33, 5, 4, -99, 3, 5, 5, 5~
## $ hzcy022a              <dbl> -33, 5, 4, 3, -33, 4, 5, 5, -33, 5, 4, 4, -33, 2, 4, -33, -33, 4, -33, 4, -33, 5, 2, 3, 3, 5, 5, 4, ~
## $ hzcy023a              <dbl> -33, 5, 5, 4, -33, 4, 5, 3, -33, 5, 3, 5, -33, 5, 5, -33, -33, 5, -33, 4, -33, 5, 5, 4, 5, 5, 5, 5, ~
## $ hzcy024a              <dbl> -33, 5, 5, 2, -33, 2, 5, 3, -33, 5, 3, 2, -33, 5, 5, -33, -33, 5, -33, 3, -33, 3, 4, 4, 5, 5, 5, 5, ~
## $ hzcy025a              <dbl> -33, 5, 4, 2, -33, 3, 3, 3, -33, 3, 3, 2, -33, 2, 4, -33, -33, 4, -33, 3, -33, 5, 4, 3, 5, 5, 5, 5, ~
## $ hzcy026a              <dbl> -33, 4, 1, 1, -33, 1, 1, 1, -33, 1, 1, 1, -33, 1, 1, -33, -33, 1, -33, 1, -33, 1, 4, 1, 1, 1, 1, 4, ~
## $ hzcy027a              <dbl> -33, -88, 5, 4, -33, 5, 5, 5, -33, 4, 3, 1, -33, 4, 5, -33, -33, 5, -33, 5, -33, 4, -88, 4, 5, 4, 5,~
## $ hzcy028a              <dbl> -33, -88, 1, 2, -33, 2, 3, 1, -33, 2, 3, 2, -33, 2, 2, -33, -33, 2, -33, 1, -33, 4, -88, 2, 2, 2, 2,~
## $ hzcy029a              <dbl> -33, -88, 4, 3, -33, 4, 5, 5, -33, 4, 4, 5, -33, 2, 3, -33, -33, 4, -33, 4, -33, 3, -88, 3, 5, 5, 5,~
## $ hzcy030a              <dbl> -33, -88, 5, 3, -33, 4, 5, 5, -33, 5, 4, 5, -33, 2, 4, -33, -33, 5, -33, 3, -33, 5, -88, 4, 5, 5, 5,~
## $ hzcy031a              <dbl> -33, -88, 5, 3, -33, 4, 5, 5, -33, 4, 4, 5, -33, 2, 5, -33, -33, 5, -33, 3, -33, 5, -88, 4, 5, 5, 5,~
## $ hzcy032a              <dbl> -33, -88, 5, 3, -33, 4, 5, 5, -33, 4, 4, 4, -33, 4, 5, -33, -33, 5, -33, 4, -33, 5, -88, 5, 5, 5, 5,~
## $ hzcy033a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy034a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy035a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy036a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy037a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy038a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy039a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, ~
## $ hzcy040a              <dbl> -33, 3, 2, 3, -33, 2, 3, 3, -33, 1, 3, 1, -33, 3, 2, -33, -33, 3, -33, 3, -33, 2, 3, 3, 3, 2, 2, 2, ~
## $ hzcy041a              <dbl> -33, 3, 3, 4, -33, 3, 3, 3, -33, 3, 3, 1, -33, 4, 3, -33, -33, 3, -33, 3, -33, 2, 4, 3, 4, 3, 3, 3, ~
## $ hzcy042a              <dbl> -33, 3, 3, 3, -33, 3, 3, 3, -33, 2, 2, 2, -33, 2, 2, -33, -33, 2, -33, 2, -33, 2, 1, 2, 2, 3, 2, 2, ~
## $ hzcy043a              <dbl> -33, 3, 3, 3, -33, 2, 3, 3, -33, 2, 2, 1, -33, 3, 2, -33, -33, 2, -33, 3, -33, 1, 2, 3, 2, 3, 3, 2, ~
## $ hzcy044a              <dbl> -33, 5, 4, 4, -33, 4, 4, 4, -33, 5, 3, 5, -33, 4, 3, -33, -33, 5, -33, 4, -33, 5, 5, 98, 4, 5, 4, 5,~
## $ hzcy045a              <dbl> -33, 4, 4, 4, -33, 5, 4, 4, -33, 4, 4, 3, -33, 3, 3, -33, -33, 5, -33, 2, -33, 4, 5, 98, 4, 98, 98, ~
## $ hzcy046a              <dbl> -33, 4, 5, 4, -33, 4, 4, 4, -33, 3, 2, 2, -33, 3, 3, -33, -33, 5, -33, 2, -33, 3, 4, 98, 4, 4, 3, 3,~
## $ hzcy047a              <dbl> -33, 5, 5, 4, -33, 5, 4, 5, -33, 4, 4, 5, -33, 4, 4, -33, -33, 4, -33, 5, -33, 5, 5, 4, 4, 4, 4, 4, ~
## $ hzcy048a              <dbl> -33, 4, 4, 4, -33, 4, 3, 4, -33, 3, 2, 1, -33, 4, 2, -33, -33, 4, -33, 5, -33, 3, 4, 4, 4, 4, 4, 4, ~
## $ hzcy049a              <dbl> -33, 4, 3, 4, -33, 2, 3, 4, -33, 2, 1, 2, -33, 4, 2, -33, -33, 4, -33, 98, -33, 3, 4, 4, 4, 4, 4, 4,~
## $ hzcy050a              <dbl> -33, 4, 4, 4, -33, 4, 3, 4, -33, 2, 2, 1, -33, 4, 4, -33, -33, 5, -33, 4, -33, 3, 4, 4, 4, 4, 4, 4, ~
## $ hzcy051a              <dbl> -33, 4, 2, 4, -33, 3, 2, 5, -33, 5, 4, 3, -33, 4, 5, -33, -33, 5, -33, 3, -33, 2, 5, 4, 4, 4, 4, 3, ~
## $ hzcy052a              <dbl> -33, 4, 5, 4, -33, 5, 4, 5, -33, 5, 4, 4, -33, 5, 5, -33, -33, 5, -33, 5, -33, 4, 4, 4, 4, 4, 3, 3, ~
## $ hzcy053a              <dbl> -33, 1, 5, 1, -33, 5, 1, 1, -33, 6, 1, 5, -33, 1, 1, -33, -33, 2, -33, 1, -33, 1, 1, 2, 1, 1, 2, 1, ~
## $ hzcy054a              <dbl> -33, 0, -88, 0, -33, -88, 0, 0, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 0, 0, -88, ~
## $ hzcy055a              <dbl> -33, 1, -88, 0, -33, -88, 0, 0, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 0, 0, -88, ~
## $ hzcy056a              <dbl> -33, 0, -88, 0, -33, -88, 0, 0, -33, -88, 1, -88, -33, 1, 1, -33, -33, -88, -33, 1, -33, 0, 0, -88, ~
## $ hzcy057a              <dbl> -33, 0, -88, 0, -33, -88, 0, 0, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 0, 0, -88, ~
## $ hzcy058a              <dbl> -33, 0, -88, 0, -33, -88, 0, 0, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 0, 0, -88, ~
## $ hzcy059a              <dbl> -33, 0, -88, 0, -33, -88, 0, 0, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 0, 0, -88, ~
## $ hzcy060a              <dbl> -33, 0, -88, 1, -33, -88, 1, 1, -33, -88, 0, -88, -33, 0, 0, -33, -33, -88, -33, 0, -33, 1, 1, -88, ~
## $ hzcy061a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy062a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 1, -33, -88, -3~
## $ hzcy063a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy064a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 1, -33, -88, -3~
## $ hzcy065a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy066a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy067a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy068a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 1, -33, -88, -3~
## $ hzcy069a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy070a              <dbl> -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, 0, -33, -88, -3~
## $ hzcy071a              <dbl> -33, 2, 2, 1, -33, 2, 1, 2, -33, 2, 2, 2, -33, 1, 2, -33, -33, 2, -33, 2, -33, 2, 2, 2, 1, 2, 2, 1, ~
## $ hzcy072a              <dbl> -33, -88, -88, 1, -33, -88, 1, -88, -33, -88, -88, -88, -33, 1, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy073a              <dbl> -33, -88, -88, 0, -33, -88, 1, -88, -33, -88, -88, -88, -33, 1, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy074a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 1, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy075a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy076a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy077a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 1, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy078a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy079a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy080a              <dbl> -33, -88, -88, 0, -33, -88, 1, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy081a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy083a              <dbl> -33, -88, -88, 0, -33, -88, 0, -88, -33, -88, -88, -88, -33, 0, -88, -33, -33, -88, -33, -88, -33, -~
## $ hzcy084a              <dbl> -33, 1, 1, 1, -33, 1, 0, 1, -33, 1, 1, 1, -33, 1, 0, -33, -33, 1, -33, 1, -33, 1, 1, 1, 1, 1, 1, 1, ~
## $ hzcy085a              <dbl> -33, 1, 1, 0, -33, 0, 1, 0, -33, 0, 0, 0, -33, 0, 1, -33, -33, 0, -33, 0, -33, 1, 0, 0, 0, 0, 0, 1, ~
## $ hzcy086a              <dbl> -33, 0, 0, 0, -33, 0, 0, 1, -33, 1, 1, 1, -33, 1, 0, -33, -33, 0, -33, 0, -33, 1, 0, 1, 0, 1, 0, 0, ~
## $ hzcy087a              <dbl> -33, 0, 1, 0, -33, 0, 0, 1, -33, 1, 0, 1, -33, 0, 0, -33, -33, 1, -33, 0, -33, 0, 0, 1, 1, 1, 1, 1, ~
## $ hzcy088a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 1, -33, 0, -33, 0, 0, 1, 0, 0, 0, 0, ~
## $ hzcy089a              <dbl> -33, 1, 1, 0, -33, 1, 0, 1, -33, 0, 0, 1, -33, 0, 0, -33, -33, 0, -33, 1, -33, 0, 1, 1, 0, 1, 1, 0, ~
## $ hzcy090a              <dbl> -33, 1, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 1, 0, 0, 0, 0, ~
## $ hzcy091a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 1, -33, -33, 0, -33, 0, -33, 0, 0, 1, 0, 0, 0, 0, ~
## $ hzcy092a              <dbl> -33, 0, 1, 0, -33, 0, 0, 1, -33, 0, 0, 1, -33, 1, 1, -33, -33, 0, -33, 0, -33, 1, 1, 1, 1, 1, 1, 1, ~
## $ hzcy093a              <dbl> -33, 1, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 1, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy095a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzcy096a              <dbl> -33, 4, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, -3~
## $ hzcy097a              <dbl> -33, 0, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, -3~
## $ hzcy098a              <dbl> -33, 0, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, -3~
## $ hzcy099a              <dbl> -33, 1, -88, -88, -33, -88, -88, -88, -33, -88, -88, -88, -33, -88, -88, -33, -33, -88, -33, -88, -3~
## $ hzza001a              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ hzza002a              <dbl> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1~
## $ hzza003a              <dbl> 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1~
## $ hzzq009a              <dbl> -33, 4, 5, 4, -33, 4, 3, 5, -33, 4, 4, 5, -33, 4, 4, -33, -33, 4, -33, -99, -33, 4, 5, 5, 4, 4, 4, 4~
## $ hzzq016b              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 1, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 1, 0, 0, 0, ~
## $ hzzq023a              <dbl> -33, 5, 5, 4, -33, 4, 4, 5, -33, 5, 4, 5, -33, 5, 4, -33, -33, 4, -33, 4, -33, 5, 5, 5, 4, 4, 4, 5, ~
## $ hzzp201a              <dbl> -33, 31, 31, 31, -33, 31, 31, 31, -33, 31, 31, 31, -33, 31, 31, -33, -33, 31, -33, 31, -33, 31, 31, ~
## $ hzzp204a              <dbl> -33, 210, 377, 309, -33, 429, 586, 366, -33, 283, 248, 703, -33, 466, 332, -33, -33, 223, -33, 306, ~
## $ hzzp207a              <dbl> -33, 1584549879, 1584469614, 1584525461, -33, 1584461540, 1584823080, 1584543510, -33, 1584823044, 1~
## $ hzzr001a              <dbl> -33, 3, 34, 4, -33, 3, 7, 2, -33, 2, 2, 3, -33, 2, 65, -33, -33, 6, -33, 2, -33, 6, 9, 2, 16, 5, 16,~
## $ hzzr002a              <dbl> -33, 24, 83, 35, -33, 41, 67, 39, -33, 40, 33, 57, -33, 50, 142, -33, -33, 39, -33, 43, -33, 67, 74,~
## $ hzzr003a              <dbl> -33, 48, 117, 67, -33, 90, 121, 140, -33, 75, 71, 112, -33, 74, 158, -33, -33, 62, -33, 72, -33, 107~
## $ hzzr004a              <dbl> -33, 71, 161, 101, -33, 143, 212, 176, -33, 115, 97, 177, -33, 137, 188, -33, -33, 91, -33, 116, -33~
## $ hzzr005a              <dbl> -33, 82, 175, 110, -33, 159, 230, 188, -33, 123, 101, 196, -33, 199, 204, -33, -33, 96, -33, 123, -3~
## $ hzzr006a              <dbl> -33, 0, 206, 140, -33, 209, 264, 222, -33, 150, 128, 250, -33, 257, 220, -33, -33, 118, -33, 151, -3~
## $ hzzr007a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 0, 0, 0, 0, ~
## $ hzzr008a              <dbl> -33, 101, 245, 166, -33, 288, 319, 238, -33, 194, 154, 293, -33, 305, 248, -33, -33, 138, -33, 178, ~
## $ hzzr009a              <dbl> -33, 130, 293, 190, -33, 340, 388, 278, -33, 232, 188, 347, -33, 388, 266, -33, -33, 160, -33, 218, ~
## $ hzzr010a              <dbl> -33, 145, 0, 216, -33, 0, 438, 310, -33, 0, 210, 0, -33, 410, 292, -33, -33, 0, -33, 245, -33, 388, ~
## $ hzzr011a              <dbl> -33, 150, 312, 222, -33, 366, 446, 315, -33, 248, 221, 376, -33, 413, 307, -33, -33, 191, -33, 250, ~
## $ hzzr012a              <dbl> -33, 0, 0, 246, -33, 0, 509, 0, -33, 0, 0, 0, -33, 426, 0, -33, -33, 0, -33, 0, -33, 0, 0, 0, 966, 0~
## $ hzzr013a              <dbl> -33, 189, 345, 266, -33, 412, 558, 355, -33, 267, 240, 427, -33, 458, 325, -33, -33, 214, -33, 293, ~
## $ hzzr014a              <dbl> -33, 193, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 396, 0, 0, 0,~
## $ hzzr015a              <dbl> -33, 200, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 0, -33, 0, -33, 0, 0, 415, 0, 0, 0,~
## $ hzzr016a              <dbl> -33, 209, 360, 307, -33, 424, 576, 363, -33, 277, 247, 445, -33, 465, 331, -33, -33, 222, -33, 303, ~
## $ hzzr017a              <dbl> -33, 210, 377, 309, -33, 429, 586, 366, -33, 283, 248, 703, -33, 466, 332, -33, -33, 223, -33, 306, ~
## $ hzzr018a              <dbl> -33, 138, 307, 206, -33, 360, 416, 293, -33, 245, 200, 369, -33, 396, 283, -33, -33, 168, -33, 233, ~
## $ hzzr019a              <dbl> -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, 0, -33, 0, 0, -33, -33, 187, -33, 0, -33, 0, 0, 361, 0, 0, 44~
```
]

---

## Selecting variables

We might want to reduce our data frame (or create a new one) to only include a subset of specific variables.
Say, for example, we want to select only the variables that measure the risk of becoming infected with or spreading the Corona virus from our full data set. There are two options for doing this with `base R`: Option 1 .small[ ```r gp_covid_risk <- gp_covid[, c("hzcy001a", "hzcy002a", "hzcy003a", "hzcy004a", "hzcy005a")] # When subsetting with [], the first value refers to rows, the second to columns # [, c("var1", "var2", ...)] means we want to select all rows but only some specific columns. ``` ] Option 2 .small[ ```r gp_covid_risk <- subset(gp_covid, TRUE, select = c(hzcy001a, hzcy002a, hzcy003a, hzcy004a, hzcy005a)) # Again, here the 2nd argument refers to the rows. # Setting it to TRUE means that we want to include all rows in the subset. ``` ] --- ## Selecting variables You can also select variables based on their numeric index. ```r gp_covid_demo <- gp_covid[, 6:13] names(gp_covid_demo) ``` ``` ## [1] "sex" "age_cat" "education_cat" "intention_to_vote" "choice_of_party" ## [6] "political_orientation" "marstat" "household" ``` --- ## Selecting variables In the `tidyverse`, we can create a subset of variables with the `dplyr` verb `select()`. ```r gp_covid_risk <- gp_covid %>% select(hzcy001a, hzcy002a, hzcy003a, hzcy004a, hzcy005a) head(gp_covid_risk) ``` ``` ## # A tibble: 6 x 5 ## hzcy001a hzcy002a hzcy003a hzcy004a hzcy005a ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -33 -33 -33 -33 -33 ## 2 5 5 2 5 5 ## 3 5 6 3 6 6 ## 4 4 4 2 4 3 ## 5 -33 -33 -33 -33 -33 ## 6 3 3 -99 3 3 ``` --- ## Selecting a range of variables There also is a shorthand notation for selecting a set of consecutive columns with `select()`. 
```r gp_covid_risk <- gp_covid %>% select(hzcy001a:hzcy005a) head(gp_covid_risk) ``` ``` ## # A tibble: 6 x 5 ## hzcy001a hzcy002a hzcy003a hzcy004a hzcy005a ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -33 -33 -33 -33 -33 ## 2 5 5 2 5 5 ## 3 5 6 3 6 6 ## 4 4 4 2 4 3 ## 5 -33 -33 -33 -33 -33 ## 6 3 3 -99 3 3 ``` *Note*: You can also use this shorthand notation for the `select` argument of the `base R` function `subset()`. --- ## Selecting a range of variables Same as for `base R`, you can also use the numeric index of variables in combination with `select()` from `dplyr`. ```r gp_covid_demo <- gp_covid %>% select(6:13) names(gp_covid_demo) ``` ``` ## [1] "sex" "age_cat" "education_cat" "intention_to_vote" "choice_of_party" ## [6] "political_orientation" "marstat" "household" ``` --- ## Unselecting variables If you just want to exclude one or a few columns/variables, it is easier to unselect those than to select all others. Again, there's two ways to do this with `base R`. Option 1 .small[ ```r gp_covid_cut <- gp_covid[!(names(gp_covid) %in% c("za_number", "version", "doi"))] # The ! operator means "not" (i.e., it negates a condition) # The %in% operator means "is included in" (in this case the following character vector) dim(gp_covid_cut) ``` ``` ## [1] 3765 134 ``` ] Option 2 .small[ ```r gp_covid_cut <- subset(gp_covid, TRUE, select = -c(za_number, version, doi)) dim(gp_covid_cut) ``` ``` ## [1] 3765 134 ``` ] --- ## Unselecting variables You can also use `select()` from `dplyr` to exclude one or more columns/variables. ```r gp_covid_cut <- gp_covid %>% select(-c(za_number, version, doi)) dim(gp_covid_cut) ``` ``` ## [1] 3765 134 ``` --- ## Advanced ways of selecting variables `dplyr` offers several helper functions for selecting variables. For a full list of those, you can check the [documentation for the `select()` function](https://dplyr.tidyverse.org/reference/select.html). 
```r gp_covid_cy <- gp_covid %>% select(starts_with("hzcy")) gp_covid_cat <- gp_covid %>% select(ends_with("_cat")) glimpse(gp_covid_cat) ``` ``` ## Rows: 3,765 ## Columns: 2 ## $ age_cat <dbl> 7, 7, 8, 4, 1, 10, 4, 7, 8, 1, 6, 8, 2, 6, 2, 2, 2, 7, 4, 8, 1, 7, 4, 3, 5, 7, 7, 6, 6, 5, 7, 7, 5, 7, 5, 2,~ ## $ education_cat <dbl> 3, 2, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, ~ ``` --- ## Advanced ways of selecting variables Another particularly useful selection helper is `where()`. You can, e.g., use `where()` to select only a specific type of variables. ```r gp_covid_num <- gp_covid %>% select(where(is.numeric)) ``` --- ## What's in a name? One thing that we need to know - and might want to change - are the names of the variables in the dataset. ```r names(gp_covid) ``` ``` ## [1] "za_number" "version" "doi" "id" "cohort" ## [6] "sex" "age_cat" "education_cat" "intention_to_vote" "choice_of_party" ## [11] "political_orientation" "marstat" "household" "hzcy001a" "hzcy002a" ## [16] "hzcy003a" "hzcy004a" "hzcy005a" "hzcy006a" "hzcy007a" ## [21] "hzcy008a" "hzcy009a" "hzcy010a" "hzcy011a" "hzcy012a" ## [26] "hzcy013a" "hzcy014a" "hzcy015a" "hzcy016a" "hzcy018a" ## [31] "hzcy019a" "hzcy020a" "hzcy021a" "hzcy022a" "hzcy023a" ## [36] "hzcy024a" "hzcy025a" "hzcy026a" "hzcy027a" "hzcy028a" ## [41] "hzcy029a" "hzcy030a" "hzcy031a" "hzcy032a" "hzcy033a" ## [46] "hzcy034a" "hzcy035a" "hzcy036a" "hzcy037a" "hzcy038a" ## [51] "hzcy039a" "hzcy040a" "hzcy041a" "hzcy042a" "hzcy043a" ## [56] "hzcy044a" "hzcy045a" "hzcy046a" "hzcy047a" "hzcy048a" ## [61] "hzcy049a" "hzcy050a" "hzcy051a" "hzcy052a" "hzcy053a" ## [66] "hzcy054a" "hzcy055a" "hzcy056a" "hzcy057a" "hzcy058a" ## [71] "hzcy059a" "hzcy060a" "hzcy061a" "hzcy062a" "hzcy063a" ## [76] "hzcy064a" "hzcy065a" "hzcy066a" "hzcy067a" "hzcy068a" ## [81] "hzcy069a" "hzcy070a" "hzcy071a" "hzcy072a" "hzcy073a" ## [86] "hzcy074a" "hzcy075a" "hzcy076a" "hzcy077a" "hzcy078a" ## [91] 
"hzcy079a" "hzcy080a" "hzcy081a" "hzcy083a" "hzcy084a" ## [96] "hzcy085a" "hzcy086a" "hzcy087a" "hzcy088a" "hzcy089a" ## [101] "hzcy090a" "hzcy091a" "hzcy092a" "hzcy093a" "hzcy095a" ## [106] "hzcy096a" "hzcy097a" "hzcy098a" "hzcy099a" "hzza001a" ## [111] "hzza002a" "hzza003a" "hzzq009a" "hzzq016b" "hzzq023a" ## [116] "hzzp201a" "hzzp204a" "hzzp207a" "hzzr001a" "hzzr002a" ## [121] "hzzr003a" "hzzr004a" "hzzr005a" "hzzr006a" "hzzr007a" ## [126] "hzzr008a" "hzzr009a" "hzzr010a" "hzzr011a" "hzzr012a" ## [131] "hzzr013a" "hzzr014a" "hzzr015a" "hzzr016a" "hzzr017a" ## [136] "hzzr018a" "hzzr019a" ``` --- ## What's in a name? As you can see, only a few of the variable names in the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* data set are self-explanatory. The other variable names are composed of codes representing the study wave, study name, variable number and whether they are original or derived variables (have a look at the [*GESIS Panel* cheatsheet](https://www.gesis.org/fileadmin/upload/forschung/programme_projekte/Drittmittelprojekte/GESIS_Panel/gesis_panel_cheatsheet.pdf) if you want to know more), but they are not intuitive to understand. Hence, for analyzing them, especially if you want to create tables and/or plots, it can make sense to rename them. This is also a common step if you work with your own data. Depending on what method or tool(s) you used to collect the data, the variable names in your raw data may also not be what you want or need them to be. --- ## Renaming variables It is good practice to use consistent naming conventions. Since `R` is case-sensitive, we might, e.g., want to only use lowercase letters. 
As spaces in variable names can cause problems, we could, e.g., decide to use 🐍 *snake_case* (🐫 *camelCase* is a common alternative; for a good brief discussion of options for avoiding spaces in variable names, see this [Medium post by Patrick Divine](https://medium.com/@pddivine/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841)). --- # Become an ace of case <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\coding_cases.png" width="90%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) </small></small> --- ## Renaming variables You can rename individual columns/variables in `base R` as follows: ```r colnames(gp_covid)[colnames(gp_covid) == "hzcy048a"] <- "trust_government" ``` As for subsetting, you can also rename variables based on their numeric index. ```r colnames(gp_covid)[4] <- "respondent_id" ``` --- ## Renaming variables An easier to use and more versatile option for renaming columns/variables is the `dplyr` function `rename()`. ```r gp_covid_risk <- gp_covid_risk %>% rename(risk_self = hzcy001a, # new_name = old_name risk_surroundings = hzcy002a, risk_hospital = hzcy003a, risk_quarantine = hzcy004a, risk_infect_others = hzcy005a) names(gp_covid_risk) ``` ``` ## [1] "risk_self" "risk_surroundings" "risk_hospital" "risk_quarantine" "risk_infect_others" ``` --- ## Renaming variables For some more advanced renaming options, you can use the `dplyr` function `rename_with()`. ```r gp_covid_risk %>% rename_with(toupper) %>% names() ``` ``` ## [1] "RISK_SELF" "RISK_SURROUNDINGS" "RISK_HOSPITAL" "RISK_QUARANTINE" "RISK_INFECT_OTHERS" ``` *Note*: The [`janitor` package](https://sfirke.github.io/janitor/) (which is `tidyverse`-oriented) can be used to facilitate several common data cleaning tasks. 
Among other things, it contains the function `clean_names()` that takes a data frame and creates column names that "are unique and consist only of the _ character, numbers, and letters" (from the help file for this function), with the default being 🐍 snake_case (but support for many other types of cases). --- ## Renaming variables We can, e.g., use `rename_with()` in combination with `gsub()` (which we've already encountered in the session on *Getting Started*) to remove (or change) prefixes in variable names. ```r gp_covid %>% select(hzcy001a:hzcy005a) %>% rename_with(~ gsub("hzcy", "risk", .x, fixed = TRUE)) %>% names() ``` ``` ## [1] "risk001a" "risk002a" "risk003a" "risk004a" "risk005a" ``` --- ## Re~~wind~~name selecta A nice thing about the `dplyr` verb `select` is that you can use it to select and rename variables in one step. .small[ ```r gp_covid_risk <- gp_covid %>% select(risk_self = hzcy001a, risk_surroundings = hzcy002a, risk_hospital = hzcy003a, risk_quarantine = hzcy004a, risk_infect_others = hzcy005a) head(gp_covid_risk) ``` ``` ## # A tibble: 6 x 5 ## risk_self risk_surroundings risk_hospital risk_quarantine risk_infect_others ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -33 -33 -33 -33 -33 ## 2 5 5 2 5 5 ## 3 5 6 3 6 6 ## 4 4 4 2 4 3 ## 5 -33 -33 -33 -33 -33 ## 6 3 3 -99 3 3 ``` ] --- ## Moving columns Although the positions of columns in a data frame do not matter for analyses or plotting (unless you want to select columns using their numerical index), you might want to change them. For this purpose, `dplyr` provides the `relocate()` function. 
```r gp_covid_risk <- gp_covid_risk %>% relocate(risk_infect_others, .after = risk_surroundings) glimpse(gp_covid_risk) ``` ``` ## Rows: 3,765 ## Columns: 5 ## $ risk_self <dbl> -33, 5, 5, 4, -33, 3, 4, 4, -33, 7, 4, 5, -33, 6, 4, -33, -33, 6, -33, 3, -33, 6, 5, 5, 6, 5, 7, 6, 5, ~ ## $ risk_surroundings <dbl> -33, 5, 6, 4, -33, 3, 3, 4, -33, 5, 6, 6, -33, 6, 5, -33, -33, 6, -33, 5, -33, 6, 6, 97, 6, 6, 7, 6, 6,~ ## $ risk_infect_others <dbl> -33, 5, 6, 3, -33, 3, 4, 4, -33, 2, 4, 6, -33, 4, 2, -33, -33, 6, -33, 3, -33, 6, 4, 3, 5, 4, 7, 5, 2, ~ ## $ risk_hospital <dbl> -33, 2, 3, 2, -33, -99, 3, 3, -33, 3, 3, 7, -33, 3, 2, -33, -33, 4, -33, 3, -33, 1, 3, 4, 3, 3, 7, 3, 3~ ## $ risk_quarantine <dbl> -33, 5, 6, 4, -33, 3, 3, 3, -33, 4, 5, 6, -33, 7, 3, -33, -33, 5, -33, 4, -33, 3, 4, 5, 6, 3, 7, 3, 2, ~ ``` *Note*: You can also move a column before a specific other column by providing a variable name to the `.before` argument (instead of `.after`). --- ## `dplyr::relocate()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\dplyr_relocate.png" width="85%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_1_1_Select_Rename.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_1_1_Select_Rename.html) --- ## Filtering rows In `R`, you can filter rows/observations dependent on one or more conditions. To filter rows/observations you can use... - **comparison operators**: - **<** (smaller than) - **<=** (smaller than or equal to) - **==** (equal to) - **!=** (not equal to) - **>=** (larger than or equal to) - **>** (larger than) - **%in%** (included in) --- ## Filtering rows ... 
and combine comparisons with - **logical operators**: - **&** (and) - **|** (or) - **!** (not) - **xor** (either or, not both) --- ## Filtering rows Similar to selecting columns/variables, there are two options for filtering rows/observations with `base R`. Option 1 ```r gp_covid_male <- gp_covid[gp_covid$sex == 1, ] dim(gp_covid_male) ``` ``` ## [1] 1933 137 ``` Option 2 ```r gp_covid_male <- subset(gp_covid, sex == 1) dim(gp_covid_male) ``` ``` ## [1] 1933 137 ``` --- ## Filtering rows The `dplyr` solution for filtering rows/observations is the verb `filter()`. ```r gp_covid_male <- gp_covid %>% filter(sex == 1) dim(gp_covid_male) ``` ``` ## [1] 1933 137 ``` --- ## Filtering rows based on multiple conditions ```r gp_covid_old_men <- gp_covid %>% filter(sex == 1, age_cat > 7) dim(gp_covid_old_men) ``` ``` ## [1] 626 137 ``` --- ## `dplyr::filter()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\dplyr_filter.jpg" width="95%" style="display: block; margin: auto;" /> <small><small>Illustration by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) </small></small> --- ## `dplyr::filter` - multiple conditions By default, multiple conditions in `filter()` are added as & (and). You can, however, also specify multiple conditions differently. **or** (cases for which at least one of the conditions is true) ```r gp_covid_old_andor_male <- gp_covid %>% filter(sex == 1 | age_cat > 7) dim(gp_covid_old_andor_male) ``` ``` ## [1] 2432 137 ``` --- ## `dplyr::filter` - multiple conditions **xor** (cases for which only one of the two conditions is true) ```r gp_covid_old_or_male <- gp_covid %>% filter(xor(sex == 1, age_cat > 7)) dim(gp_covid_old_or_male) ``` ``` ## [1] 1806 137 ``` --- ## Advanced ways of filtering rows Similar to `select()` there are some helper functions for `filter()` for advanced filtering of rows. For example, you can... 
- Filter rows based on a range in a numeric variable ```r gp_covid_centrist <- gp_covid %>% filter(between(political_orientation, 4, 6)) dim(gp_covid_centrist) ``` ``` ## [1] 2049 137 ``` *Note*: The range specified in `between()` is inclusive (on both sides). --- ## Advanced ways of filtering rows - Filter rows based on the values of specific variables matching certain criteria ```r gp_covid_risk_low <- gp_covid_risk %>% filter(if_all(everything(), ~ . < 4)) # read: if the values of all vars in this df are < 4 dim(gp_covid_risk_low) ``` ``` ## [1] 926 5 ``` *Note*: The helper function `if_any()` can be used to specify that at least one of the variables needs to match a certain criterion. --- ## Selecting columns + filtering rows Of course, you can also combine the selection of columns and the filtering of rows. `Base R` option 1 ```r gp_covid_risk_male <- gp_covid[gp_covid$sex == 1, c("hzcy001a", "hzcy002a", "hzcy003a", "hzcy004a", "hzcy005a")] dim(gp_covid_risk_male) ``` ``` ## [1] 1933 5 ``` `Base R` option 2 ```r gp_covid_risk_male <- subset(gp_covid, sex == 1, select = c(hzcy001a, hzcy002a, hzcy003a, hzcy004a, hzcy005a)) dim(gp_covid_risk_male) ``` ``` ## [1] 1933 5 ``` --- ## Selecting columns + filtering rows The `tidyverse` approach for combining the selection of columns and the filtering of rows is chaining these steps together in a pipe (note that the order of the pipe steps matters here: `filter()` has to come first because `sex` is not among the selected columns). ```r gp_covid_risk_male <- gp_covid %>% filter(sex == 1) %>% select(hzcy001a:hzcy005a) dim(gp_covid_risk_male) ``` ``` ## [1] 1933 5 ``` --- ## (Re-)Arranging the order of rows Again, while this does not directly matter for analyses or plotting (unless you want to filter rows by their numeric index), you can rearrange the order of rows in a data set. 
In `base R` this can be achieved as follows: ```r gp_covid <- gp_covid[order(gp_covid$age_cat),] head(gp_covid[, 6:13]) ``` ``` ## # A tibble: 6 x 8 ## sex age_cat education_cat intention_to_vote choice_of_party political_orientation marstat household ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2 1 3 -33 -33 4 2 3 ## 2 1 1 3 2 2 7 2 3 ## 3 2 1 3 2 1 4 2 2 ## 4 2 1 3 2 5 2 2 3 ## 5 1 1 3 2 3 4 2 2 ## 6 1 1 3 2 3 6 2 3 ``` --- ## (Re-)Arranging the order of rows Of course, it is also possible to sort a data frame in descending order of a variable. ```r gp_covid <- gp_covid[order(gp_covid$age_cat, decreasing = TRUE),] head(gp_covid[, 6:13]) ``` ``` ## # A tibble: 6 x 8 ## sex age_cat education_cat intention_to_vote choice_of_party political_orientation marstat household ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 10 2 2 6 10 1 2 ## 2 2 10 1 2 98 5 1 2 ## 3 1 10 3 2 3 7 1 2 ## 4 2 10 1 2 2 5 1 2 ## 5 1 10 2 2 2 4 1 2 ## 6 1 10 1 2 2 2 4 2 ``` --- ## (Re-)Arranging the order of rows You can also sort your data frame by more than one variable. ```r gp_covid <- gp_covid[order(gp_covid$age_cat, gp_covid$education_cat),] head(gp_covid[, 6:13]) ``` ``` ## # A tibble: 6 x 8 ## sex age_cat education_cat intention_to_vote choice_of_party political_orientation marstat household ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 1 1 1 1 8 2 3 ## 2 2 1 1 1 98 5 1 3 ## 3 2 1 1 2 5 7 2 3 ## 4 1 1 1 -33 -33 5 2 2 ## 5 1 1 1 2 7 5 2 3 ## 6 1 1 1 2 3 6 2 3 ``` --- ## (Re-)Arranging the order of rows The `dplyr` verb for changing the order of rows in a data set is `arrange()` and you can use it in the same ways as the `base R` equivalent: Sorting by a single variable in ascending order, ... 
```r gp_covid %>% arrange(age_cat) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2~ ## $ age_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~ ## $ education_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3~ ## $ intention_to_vote <dbl> 1, 1, 2, -33, 2, 2, 2, -99, 2, 2, -33, 2, 2, 2, 2, 1, -33, 2, 2, 2, 2, 2, 2, 2, -33, -33, 2, 2, -33,~ ## $ choice_of_party <dbl> 1, 98, 5, -33, 7, 3, 4, -99, 4, 98, -33, 98, 6, 98, 4, -99, -33, 2, 1, 5, 3, 3, 5, 5, -33, -33, 3, 3~ ## $ political_orientation <dbl> 8, 5, 7, 5, 5, 6, 2, 5, 2, 6, -33, 5, 8, 5, 2, 6, 4, 7, 4, 2, 4, 6, 5, 3, 2, 1, 5, 6, 2, 3, 4, 3, 3,~ ## $ marstat <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~ ## $ household <dbl> 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3~ ``` --- ## (Re-)Arranging the order of rows ... sorting by a single variable in descending order, ... 
```r gp_covid %>% arrange(desc(age_cat)) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1~ ## $ age_cat <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, ~ ## $ education_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~ ## $ intention_to_vote <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -99, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, -99, 2, 2, ~ ## $ choice_of_party <dbl> 98, 2, 2, 2, 4, 2, 2, 98, 1, 1, 1, 2, 2, 5, 6, 6, 2, 2, 1, 2, 98, 2, 1, 2, 2, 1, 6, 98, 1, 2, 1, 1, ~ ## $ political_orientation <dbl> 5, 5, 2, 0, 0, 5, 2, 5, 5, 5, 5, 5, 6, 3, 8, 10, 4, 5, 3, 3, 5, 2, 6, 5, 2, 6, 8, 6, 5, 4, 8, 8, 8, ~ ## $ marstat <dbl> 1, 1, 4, 1, 4, 1, 1, 4, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 2, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1~ ## $ household <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 2, 3, 2, 2, 2, 2, 2, 2, 2~ ``` --- ## (Re-)Arranging the order of rows ... sorting by more than one variable. 
```r gp_covid %>% arrange(age_cat, education_cat) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2~ ## $ age_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~ ## $ education_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3~ ## $ intention_to_vote <dbl> 1, 1, 2, -33, 2, 2, 2, -99, 2, 2, -33, 2, 2, 2, 2, 1, -33, 2, 2, 2, 2, 2, 2, 2, -33, -33, 2, 2, -33,~ ## $ choice_of_party <dbl> 1, 98, 5, -33, 7, 3, 4, -99, 4, 98, -33, 98, 6, 98, 4, -99, -33, 2, 1, 5, 3, 3, 5, 5, -33, -33, 3, 3~ ## $ political_orientation <dbl> 8, 5, 7, 5, 5, 6, 2, 5, 2, 6, -33, 5, 8, 5, 2, 6, 4, 7, 4, 2, 4, 6, 5, 3, 2, 1, 5, 6, 2, 3, 4, 3, 3,~ ## $ marstat <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~ ## $ household <dbl> 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3~ ``` --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_1_2_Filter_Arrange.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_1_2_Filter_Arrange.html) --- ## Creating & transforming variables The simplest case of adding a new variable is creating a constant. You might, e.g., want to do that to indicate the number of the survey wave in a longitudinal data set. This is how you can do this in `base R`: ```r gp_covid$wave <- 1 ``` *Note*: By default, new variables are added after the last column in the data set. 
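As a minimal sketch of the point above, the following uses a small made-up data frame (`toy` is hypothetical and only stands in for `gp_covid`) to show that a single assigned value is repeated for every row and that the new column ends up last:

```r
# Toy data frame (hypothetical), standing in for gp_covid
toy <- data.frame(id = 1:3, sex = c(1, 2, 2))

# Assigning a single value recycles it across all rows
toy$wave <- 1

names(toy)  # the new column "wave" is appended after the existing ones
toy$wave    # the constant is repeated once per row
```

If the new column should sit elsewhere, it can be moved afterwards, e.g., with `relocate()` as shown earlier.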
--- ## Creating & transforming variables Another simple variable transformation is adding a constant to or subtracting a constant from its values, which, in `base R`, you can do as follows: ```r gp_covid$sex_new <- gp_covid$sex - 1 ``` --- ## Creating & transforming variables We can also add new variables by changing the data type of a variable. ```r gp_covid$id_char <- as.character(gp_covid$id) ``` *Note*: In case you want to overwrite a variable, you can do so by giving the new variable the same name as the old one. --- ## Creating & transforming variables The `dplyr` package provides a very versatile function for creating and transforming variables: `mutate()`, which you can also use to create a new variable that is a constant, ... ```r gp_covid <- gp_covid %>% mutate(wave = 1) ``` --- ## Creating & transforming variables ... applies a simple transformation to an existing variable, ... ```r gp_covid <- gp_covid %>% mutate(sex_new = sex - 1) ``` --- ## Creating & transforming variables ... or changes the data type of an existing variable. ```r gp_covid <- gp_covid %>% mutate(id_char = as.character(id)) ``` --- ## Creating & transforming variables Notably, however, `mutate()` can be used for much more complex variable transformations. We will go through a few examples of those in this session, and discuss even more of them in the following session on advanced data wrangling operations. One situation in which we might want to transform variables or create new ones, e.g., is when we want to recode their values. *Note*: We could, of course, also do this in `base R`, but the code for that can get quite convoluted. --- ## Recoding values Say, for example, we want to recode the item from the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* that measures trust in scientists with regard to dealing with the Coronavirus so that it represents distrust instead. In that case, we could combine the two `dplyr` functions `mutate()` and `recode()`. 
.small[ ```r gp_covid <- gp_covid %>% mutate(hzcy052aR = recode(hzcy052a, `5` = 1, # `old value` = new value `4` = 2, `2` = 4, `1` = 5)) table(gp_covid$hzcy052a, gp_covid$hzcy052aR) ``` ``` ## ## -99 -77 -33 1 2 3 4 5 98 ## -99 36 0 0 0 0 0 0 0 0 ## -77 0 58 0 0 0 0 0 0 0 ## -33 0 0 527 0 0 0 0 0 0 ## 1 0 0 0 0 0 0 0 25 0 ## 2 0 0 0 0 0 0 79 0 0 ## 3 0 0 0 0 0 303 0 0 0 ## 4 0 0 0 0 1422 0 0 0 0 ## 5 0 0 0 1278 0 0 0 0 0 ## 98 0 0 0 0 0 0 0 0 37 ``` ] --- ## Missing values A particular reason why we may want to recode specific values of one or more variables is if we have missing data in our data set. --- ## Missing values Most of the real data sets we work with have missing data. As the data can be missing for various reasons, we often use codes (and labels) to distinguish between different types of missing data. If you look at the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) of the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* or the [*GESIS Panel* Cheatsheet](https://www.gesis.org/fileadmin/upload/GESIS_Panel/Cheatsheet/gesis_panel_cheatsheet.pdf), you will see that there are quite a few types of and codes for missing data. Some types of missing values are the same across variables, while some variables also have additional types of missing data (and, hence, additional codes for missings). --- ## Missing values In `R`, missing values are represented by `NA`. `NA` is a reserved term in `R`, meaning that you cannot use it as a name for anything else (this is also the case for `TRUE` and `FALSE`). 
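A short sketch of how `NA` behaves in practice, using a made-up vector `x` (not part of the survey data): comparisons with `NA` do not return `TRUE` or `FALSE`, which is why `is.na()` is used for testing, and many summary functions return `NA` unless told to drop missing values.

```r
# Made-up example vector with one missing value
x <- c(4, NA, 2)

x == NA                # every comparison with NA yields NA
is.na(x)               # the appropriate way to test for missing values
mean(x)                # missing values propagate by default
mean(x, na.rm = TRUE)  # drop missing values before computing the mean
```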
When we prepare our data for analysis there are generally two things we might want/have to do with regard to missing values: - define specific values as missings (i.e., set them to `NA`) - recode `NA` values into something else (typically to distinguish between different types of missing values) --- ## Recode values as `NA` With `base R` you can set values to `NA` for specific variables as follows: .small[ ```r sum(is.na(gp_covid$hzcy006a)) ``` ``` ## [1] 0 ``` ```r gp_covid$hzcy006a[gp_covid$hzcy006a == -99] <- NA gp_covid$hzcy006a[gp_covid$hzcy006a == -77] <- NA gp_covid$hzcy006a[gp_covid$hzcy006a == -33] <- NA sum(is.na(gp_covid$hzcy006a)) ``` ``` ## [1] 579 ``` ] --- ## Recode values as `NA` The `tidyverse` option for setting specific values of individual variables to `NA` is the `dplyr` function `na_if()` combined with `mutate()`. ```r gp_covid <- gp_covid %>% mutate(hzcy006a = na_if(hzcy006a, -99)) %>% mutate(hzcy006a = na_if(hzcy006a, -77)) %>% mutate(hzcy006a = na_if(hzcy006a, -33)) ``` --- ## Recode values as `NA` The `na_if()` function can also be used to recode specific values as `NA` across a whole data set. ```r gp_covid <- gp_covid %>% na_if(-99) %>% na_if(-77) %>% na_if(-33) ``` *Note*: `na_if()` only takes single values as its second argument (i.e., the value to replace with `NA`). *Note*: This data-frame-wide call works with the `dplyr` version used for these slides; as far as we are aware, from `dplyr` 1.1.0 onwards `na_if()` expects a vector as its first argument, so it has to be applied column by column there. --- ## Recode values as `NA` While `na_if()` can be applied to a specified selection of variables if combined with another `dplyr` function that we will cover in the following session on advanced data wrangling, the `base R` and `tidyverse` options for recoding values as `NA` are somewhat difficult to use when they should be used for a selection or range of values. 
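To illustrate, here is one possible `base R` sketch for setting several codes to `NA` in several columns at once. The data frame `toy` and the names `missing_codes` and `items` are made up for this example; the approach works, but it is fairly verbose.

```r
# Toy data (hypothetical), using the same missing codes as above
toy <- data.frame(
  id     = 1:4,
  item_a = c(3, -99, 5, -33),
  item_b = c(-77, 2, 4, 1)
)

missing_codes <- c(-99, -77, -33)   # values to treat as missing
items <- c("item_a", "item_b")      # columns to recode

# Replace every listed code with NA, one column at a time
toy[items] <- lapply(toy[items], function(x) {
  x[x %in% missing_codes] <- NA
  x
})

toy
```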
There are, however, functions from two other packages that come in handy here: - `set_na()` from the [`sjlabelled` package](https://strengejacke.github.io/sjlabelled/index.html) - `replace_with_na()` and its scoped variants, such as `replace_with_na_all()`, from the [`naniar` package](http://naniar.njtierney.com/index.html) 🦁 --- ## The missings of `naniar` 🦁 The `naniar` package provides many useful functions for handling missing data in `R` (and works very well in combination with the `tidyverse`). If we have a list of values we want to code as `NA` for all variables in our data set, we can do that with a function from `naniar` the following way: ```r library(naniar) missings <- c(-99, -77, -33, -22) gp_covid <- gp_covid %>% replace_with_na_all(condition = ~.x %in% missings) ``` --- ## The missings of `naniar` 🦁 We could, e.g., also use the same function to code every value < 0 as `NA`. ```r gp_covid <- gp_covid %>% replace_with_na_all(condition = ~.x < 0) ``` Using the functions `replace_with_na_at()` and `replace_with_na_if()`, we can also recode values as `NA` for a selection or specific type of variables (e.g., all numeric variables). --- ## `set_na()` from `sjlabelled` Another easy-to-use option for recoding values to `NA` (for individual variables or full data frames) is the function `set_na()` from the `sjlabelled` package. We can, e.g., use it the same way we have used the `naniar` function `replace_with_na_all()` in the previous example. ```r library(sjlabelled) gp_covid <- gp_covid %>% set_na(na = c(-99, -77, -33, -22)) ``` --- ## Excluding cases with missing values If you want to exclude observations with missing values for individual variables, you can use `!is.na(variable_name)` with your filtering method of choice. However, there are also methods for only keeping complete cases (i.e., cases without missing data). 
The `base R` function for that is `na.omit()`. ```r gp_covid_complete <- na.omit(gp_covid) ``` --- ## Excluding cases with missing values The `tidyverse` equivalent of `na.omit()` is `drop_na()` from the `tidyr` package. You can use this function to remove cases that have missings on any variable in a data set or only on specific variables. ```r gp_covid %>% drop_na() %>% nrow() ``` ``` ## [1] 2347 ``` ```r gp_covid %>% drop_na(choice_of_party) %>% nrow() ``` ``` ## [1] 3552 ``` --- ## Recode `NA` into something else An easy option for replacing `NA` with another value for a single variable is the `replace_na()` function from the `tidyr` package in combination with `mutate()`. ```r gp_covid <- gp_covid %>% mutate(hzcy006a = replace_na(hzcy006a, -99)) ``` **NB**: This particular example does not make much sense. You can, however, specify different values for different types of missing values. To do this, you probably need to make the recoding dependent on other variables, which is what we will discuss in the next session on advanced data wrangling operations. --- ## Other variable types In the examples in this session, we only worked with numeric variables. There are, however, other variable types that occur frequently in data sets in the social sciences: - factors - strings - times and dates Working with strings in `R` is a topic that would require its own workshop (and the same is essentially true for times and dates). Hence, we will only briefly discuss the basics of factors in this session (also because we will meet them again in the following session). --- ## Factors Factors are a special type of variable in `R` that represent categorical data. Before `R` version `4.0.0`, the default for `base R` was that all character variables are imported as factors. Internally, factors are stored as integers, but they have (character) labels (so-called *levels*) associated with them. 
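A small sketch of this internal representation, using a made-up factor `edu` (not part of the survey data): the levels hold the labels, while the values themselves are integer codes pointing to those levels.

```r
# Made-up factor with an explicitly specified level order
edu <- factor(c("low", "high", "medium", "low"),
              levels = c("low", "medium", "high"))

levels(edu)      # the character labels (levels)
as.integer(edu)  # the underlying integer codes
typeof(edu)      # the storage type of a factor
```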
Hence, if you are not working with the special class of labelled data (e.g., via the packages [`haven`](https://haven.tidyverse.org/), [`labelled`](https://larmarange.github.io/labelled/index.html), or [`sjlabelled`](https://strengejacke.github.io/sjlabelled/index.html)), factors come closest to having variables with value labels as you might know from *SPSS*. --- ## Factors Factors in `R` can be **unordered** - in which case they are similar to **nominal** level variables in *SPSS* - or **ordered** - in which case they are similar to **ordinal** level variables in *SPSS*. Using factors can be necessary for certain statistical analyses and plots (e.g., if you want to compare groups). Working with factors in `R` is a big topic, and we will only briefly touch upon it in this workshop. For a more in-depth discussion of factors in `R` you can, e.g., have a look at the [chapter on factors](https://r4ds.had.co.nz/factors.html) in *R for Data Science*. --- ## Factors 4 🐱s There are many functions for working with factors in `base R`, such as `factor()` or `as.factor()`. However, a generally more versatile and easier-to-use option is the [`forcats` package](https://forcats.tidyverse.org/) from the `tidyverse`. <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\forcats.png" width="25%" style="display: block; margin: auto;" /> *Note*: There is a good [introduction to working with factors using `forcats` by Vebash Naidoo](https://sciencificity-blog.netlify.app/posts/2021-01-30-control-your-factors-with-forcats/) and *RStudio* also offers a [`forcats` cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf). --- ## From numeric to factor Using the `recode_factor()` function (together with `mutate()`) from `dplyr`, you can create a factor from a numeric (or a character) variable. In the example below, we want an ordered factor. 
```r
gp_covid %>% 
  mutate(edu_cat = recode_factor(education_cat,
                                 `1` = "Low",
                                 `2` = "Medium",
                                 `3` = "High",
                                 .ordered = TRUE)) %>% 
  select(education_cat, edu_cat) %>% 
  sample_n(5) # randomly sample 5 cases from the df
```

```
## # A tibble: 5 x 2
##   education_cat edu_cat
##           <dbl> <ord>  
## 1             3 High   
## 2             3 High   
## 3             3 High   
## 4             3 High   
## 5             2 Medium 
```

---

## Working with strings in `R`

As stated before, we won't be able to cover the specifics of working with strings in `R` in this course. However, it may be good to know that the `tidyverse` package [`stringr`](https://stringr.tidyverse.org/index.html) offers a collection of convenient functions for working with strings.

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\stringr.png" width="25%" style="display: block; margin: auto;" />

The `stringr` package provides a good [introduction vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html), the book *R for Data Science* has a whole section on [strings with `stringr`](https://r4ds.had.co.nz/strings.html), and there also is an [*RStudio* Cheat Sheet for `stringr`](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf).

---

## Sidenote: Regular expressions

If you want (or have) to work with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression), you should also check out the [`rebus` package](https://github.com/richierocks/rebus) which allows you to create regular expressions in `R` in a human-readable way.

Another helpful tool is the *RStudio* addin [`RegExplain`](https://www.garrickadenbuie.com/project/regexplain/).

---

## Times and dates

If you want/need to work with times and dates in `R`, you may want to look into the [`lubridate` package](https://lubridate.tidyverse.org/) which is part of the `tidyverse`, and for which *RStudio* also provides a [cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf).
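
To give a rough idea of what working with `lubridate` looks like, here is a minimal sketch. The object `some_date` and the date string are made up for illustration and are not part of the workshop data.

```r
library(lubridate)

# Hypothetical date string, parsed as year-month-day into a Date object
some_date <- ymd("2021-08-03")

year(some_date)       # extract the year
month(some_date)      # extract the month as a number
some_date + days(7)   # add one week to the date
```
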
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\lubridate.png" width="25%" style="display: block; margin: auto;" />

*Note*: If you work with time series data, it is also worth checking out the [`tsibble` package](https://tsibble.tidyverts.org/) for your wrangling tasks.

---
class: center, middle

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_2_1_3_Mutate_Recode_Missings.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_2_1_3_Mutate_Recode_Missings.html)

---

## Extracurricular activities

Check out the [appendix slides for today](https://jobreu.github.io/r-intro-gesis-2021/slides/2_3_Appendix_Relational_Data.html) which cover the topic of relational data (i.e., combining multiple data sets).

Have a look at the [*Tidy Tuesday* repository on *GitHub*](https://github.com/rfordatascience/tidytuesday), listen to a few of the very short episodes of the [*Tidy Tuesday* Podcast](https://www.tidytuesday.com/), check out the [#tidytuesday Twitter hashtag](https://twitter.com/hashtag/tidytuesday?lang=en), or watch one (or more) of the [*Tidy Tuesday* screencasts on *YouTube* by David Robinson](https://www.youtube.com/watch?v=E2amEz_upzU&list=PL19ev-r1GBwkuyiwnxoHTRC8TTqP8OEi8).