class: center, middle, inverse, title-slide # Introduction to R for Data Analysis ## Exploratory Data Analysis ### Johannes Breuer & Stefan Jünger ### 2021-08-04 --- layout: true --- ## What is Exploratory Data Analysis (EDA)? After wrangling our data, the next thing we should do is exploring them. In practice, of course, these steps are often done iteratively. There is no single definition of what exploratory data analysis is and which steps and methods belong to this category. For example, while descriptive (or summary) statistics are almost always part of EDA, analysis methods like correlation, chi-square tests, or principal component analysis (PCA) can be part of exploratory or confirmatory analyses (which we will cover in the next session). --- ## What is Exploratory Data Analysis (EDA)? For this course, we specifically decided not to separate the analysis sessions into descriptive vs. inferential statistics (in parts also because, as we pointed out in the Introduction, this is not a statistics workshop), so we will also cover some inferential statistics in the EDA part. Instead, the distinction we make between EDA and confirmatory analysis is more akin to that between exploratory and confirmatory research: While the former can be employed to develop hypotheses, the goal of the latter typically is to test hypotheses.<sup>1</sup> .small[ .footnote[ [1] Of course, this does not mean that EDA techniques cannot be used to answer (exploratory) research questions. ] ] --- ## Exploring EDA As stated before, exploratory data analysis can take many shapes and forms. In this session, we will focus on the following: - summary statistics (for numeric variables) - frequencies & proportions (for categorical variables) - cross-tabulations & correlations - exploring missing values and checking for outliers *Note*: In this session, we will focus on numeric summaries and descriptions of our data. Notably, data visualization typically also is a big part of EDA. We will cover that in the following session on data visualization with `R`. --- ## Disclaimer: A flood 🌊 of packages 📦 In the previous sessions, we have tried to focus on a small set of packages (mostly from `base R` and the `tidyverse`) and took a deep dive 🥽 exploring their functionalities. In this session, however, we will browse through quite a few packages for EDA. Admittedly, this can be a bit overwhelming. However, there is no need to worry: You do not need to remember all of the options and their differences right away. The topic of EDA is a very suitable example for demonstrating that there often are many `R` packages for the same tasks. --- ## Choosing packages 📦 for EDA Many of the EDA options that we are going to discuss in this session are quite similar across the different packages. However, there are differences, e.g., regarding the type of output they produce, their ease-of-use, and their flexibility. In practice, you will probably not need more than 1 to 3 packages for EDA most of the time. It is, however, worth exploring the different options to find packages that (best) suit your needs. Of course, what the best options are strongly depends on your goals and the output you want to get. For example: Do you only want to quickly check the data for yourself, create a report or documentation for internal use, or do you want to produce publication-ready output (tables, plots, etc.)? --- ## Data As using the full data set can become somewhat unwieldy for the examples in this section, we will create/use a subset of the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* data. We will select a subset of variables on the following: - demographics - political orientation - risk perceptions regarding infection with and spreading of the Corona virus - personal prevention measures - trust in people and institutions with regard to dealing with the Corona virus - sources of information about the Corona virus --- ## Repetition: Data wrangling pipeline As a repetition and reminder, we will quickly go through a wrangling pipeline for these data in the following. *Note*: Of course, it is possible to do the whole wrangling in one pipe. However, to check if everything worked it is advisable to break up the pipe into smaller chunks (a nice tool for checking and debugging pipes that also provides an *RStudio* Addin is the package [`ViewPipeSteps`](https://github.com/daranzolin/ViewPipeSteps)). Also, splitting up the wrangling pipe steps allows us to show them on the slides. --- ## Wrangling pipeline: Select & rename .small[ ```r gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv") ``` ```r library(tidyverse) corona_survey <- gp_covid %>% select(id, sex:education_cat, choice_of_party, left_right = political_orientation, risk_self = hzcy001a, risk_surroundings = hzcy002a, avoid_places = hzcy006a, keep_distance = hzcy007a, wash_hands = hzcy011a, stockup_supplies = hzcy013a, reduce_contacts = hzcy014a, wear_mask = hzcy015a, trust_rki = hzcy047a, trust_government = hzcy048a, trust_chancellor = hzcy049a, trust_who = hzcy051a, trust_scientists = hzcy052a, info_nat_pub_br = hzcy084a, info_nat_np = hzcy086a, info_loc_np = hzcy089a, info_fb = hzcy090a, info_other_sm = hzcy091a) ``` ] --- ## Wrangling pipeline: Missing values If you look at the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) for the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany*, you will see that some of the variables we have selected have specific values that we probably want to code as missing values before we do any analyses (exploratory or otherwise). .small[ ```r library(naniar) corona_survey <- corona_survey %>% replace_with_na_all(condition = ~.x < 0) %>% replace_with_na(replace = list(choice_of_party = c(97,98), risk_self = c(97), risk_surroundings = c(97), trust_rki = c(98), trust_government = c(98), trust_chancellor = c(98), trust_who = c(98), trust_scientists = c(98))) ``` ] --- ## Wrangling pipeline: Change variable types & recode values .small[ ```r corona_survey <- corona_survey %>% mutate(sex = recode_factor(sex, `1`= "Male", `2` = "Female"), education_cat = recode_factor(education_cat, `1` = "Low", `2` = "Medium", `3`= "High", .ordered = TRUE), age_cat = recode_factor(age_cat, `1`= "<= 25 years", `2`= "26 to 30 years", `3` = "31 to 35 years", `4` = "36 to 40 years", `5` = "41 to 45 years", `6` = "46 to 50 years", `7` = "51 to 60 years", `8` = "61 to 65 years", `9`= "66 to 70 years", `10` = ">= 71 years", .ordered = TRUE), choice_of_party = recode_factor(choice_of_party, `1`= "CDU/CSU", `2`= "SPD", `3` = "FDP", `4` = "Linke", `5` = "Gruene", `6` = "AfD", `7` = "Other")) ``` ] --- ## Wrangling pipeline: Compute new aggregate variables ```r corona_survey <- corona_survey %>% rowwise() %>% mutate(sum_measures = sum(c_across(avoid_places:wear_mask)), sum_sources = sum(c_across(info_nat_pub_br:info_other_sm)), mean_trust = mean(c_across(trust_rki:trust_scientists), na.rm = T)) %>% ungroup() ``` --- ## Saving the wrangled data As we will use this reduced and wrangled data set again in the next sessions, we should save it (as an `.rds` file). ```r saveRDS(corona_survey, "./data/corona_survey.rds") ``` We can then (later on) load the data set again with ```r corona_survey <- readRDS("./data/corona_survey.rds") ``` --- ## Explore your data: First look 👀 To get a first impression of the data set, we can use some of the functions we discussed in the sessions on *Data Import & Export* and *Data Wrangling Basics*, such as `dim()`, `head()`, or `str()` from `base R`, `glimpse()` from `dplyr`, or `View()` in *RStudio*. While looking at the the full data set can give us a general understanding of the data and their format and also show if (and how) we may need to wrangle them (further), it is difficult to make sense of the data just by looking at it. --- ## Summary statistics To make sense of quantitative data we can reduce their information to unique values. -- .center[ ~ **That's a simple definition of summary statistics** ~] -- As such, we can use summarizing functions of - location (e.g., the mean), - spread (e.g., standard deviation), - the shape of the distribution (e.g., skewness), and - relations between variables (e.g., correlation coefficients) --- ## Summary statistics: `summary()` A quick and easy way to check some summary statistics for your data set is the `base R` function `summary()` which can be applied to individual variables... ```r summary(corona_survey$left_right) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 0.00 3.00 5.00 4.66 6.00 10.00 87 ``` as well as whole data frames: ```r summary(corona_survey[, 2:19]) ``` .right[↪️] --- class: middle .small[ ``` ## sex age_cat education_cat choice_of_party left_right risk_self ## Male :1933 51 to 60 years:978 Low : 423 Gruene :756 Min. : 0.00 Min. :1.000 ## Female:1832 61 to 65 years:386 Medium:1154 CDU/CSU:753 1st Qu.: 3.00 1st Qu.:3.000 ## >= 71 years :382 High :2188 SPD :369 Median : 5.00 Median :4.000 ## 46 to 50 years:367 Linke :287 Mean : 4.66 Mean :4.094 ## 66 to 70 years:357 AfD :281 3rd Qu.: 6.00 3rd Qu.:5.000 ## 36 to 40 years:328 (Other):337 Max. :10.00 Max. :7.000 ## (Other) :967 NA's :982 NA's :87 NA's :613 ## risk_surroundings avoid_places keep_distance wash_hands stockup_supplies reduce_contacts ## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 ## 1st Qu.:4.000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 ## Median :5.000 Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000 ## Mean :4.552 Mean :0.8446 Mean :0.8029 Mean :0.9105 Mean :0.3214 Mean :0.8547 ## 3rd Qu.:6.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 ## Max. :7.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 ## NA's :661 NA's :579 NA's :579 NA's :579 NA's :579 NA's :579 ## wear_mask trust_rki trust_government trust_chancellor trust_who trust_scientists ## Min. :0.0000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 ## 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:4.000 ## Median :0.0000 Median :5.000 Median :4.000 Median :4.000 Median :4.000 Median :4.000 ## Mean :0.0367 Mean :4.441 Mean :3.663 Mean :3.573 Mean :3.968 Mean :4.239 ## 3rd Qu.:0.0000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000 ## Max. :1.0000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 ## NA's :579 NA's :670 NA's :631 NA's :635 NA's :662 NA's :658 ``` ] --- ## Summary statistics: `psych::describe()` For more detailed summary statistics for the numeric variables you can use the `describe()` function from the [`psych` package](https://cran.r-project.org/web/packages/psych/index.html). ```r library(psych) corona_survey %>% select(where(is.numeric), -id) %>% describe() ``` .right[↪️] --- class: middle .tinyish[ ``` ## vars n mean sd median trimmed mad min max range skew kurtosis se ## left_right 1 3678 4.66 1.86 5 4.68 1.48 0 10 10 -0.10 -0.16 0.03 ## risk_self 2 3152 4.09 1.27 4 4.11 1.48 1 7 6 -0.05 -0.12 0.02 ## risk_surroundings 3 3104 4.55 1.40 5 4.59 1.48 1 7 6 -0.29 -0.21 0.03 ## avoid_places 4 3186 0.84 0.36 1 0.93 0.00 0 1 1 -1.90 1.62 0.01 ## keep_distance 5 3186 0.80 0.40 1 0.88 0.00 0 1 1 -1.52 0.32 0.01 ## wash_hands 6 3186 0.91 0.29 1 1.00 0.00 0 1 1 -2.88 6.27 0.01 ## stockup_supplies 7 3186 0.32 0.47 0 0.28 0.00 0 1 1 0.76 -1.42 0.01 ## reduce_contacts 8 3186 0.85 0.35 1 0.94 0.00 0 1 1 -2.01 2.05 0.01 ## wear_mask 9 3186 0.04 0.19 0 0.00 0.00 0 1 1 4.92 22.25 0.00 ## trust_rki 10 3095 4.44 0.77 5 4.59 0.00 1 5 4 -1.68 3.54 0.01 ## trust_government 11 3134 3.66 1.01 4 3.75 1.48 1 5 4 -0.79 0.24 0.02 ## trust_chancellor 12 3130 3.57 1.15 4 3.68 1.48 1 5 4 -0.72 -0.22 0.02 ## trust_who 13 3103 3.97 0.95 4 4.09 1.48 1 5 4 -1.02 1.01 0.02 ## trust_scientists 14 3107 4.24 0.79 4 4.35 1.48 1 5 4 -1.15 1.83 0.01 ## info_nat_pub_br 15 3169 0.90 0.30 1 1.00 0.00 0 1 1 -2.66 5.07 0.01 ## info_nat_np 16 3169 0.35 0.48 0 0.31 0.00 0 1 1 0.63 -1.60 0.01 ## info_loc_np 17 3169 0.50 0.50 0 0.50 0.00 0 1 1 0.00 -2.00 0.01 ## info_fb 18 3169 0.19 0.39 0 0.11 0.00 0 1 1 1.61 0.60 0.01 ## info_other_sm 19 3169 0.15 0.36 0 0.06 0.00 0 1 1 1.94 1.78 0.01 ## sum_measures 20 3186 3.77 1.16 4 3.92 1.48 0 6 6 -1.14 1.43 0.02 ## sum_sources 21 3169 2.09 0.95 2 2.04 1.48 0 5 5 0.41 0.35 0.02 ## mean_trust 22 3157 3.98 0.75 4 4.04 0.59 1 5 4 -0.94 1.01 0.01 ``` ] --- ## Summary statistics: `summarytools::descr()` The [`summarytools` package](https://github.com/dcomtois/summarytools) provides a lot of functionalities for EDA, including the `descr()` function for summary statistics, which also provides quite a few options for customizing the output. ```r library(summarytools) corona_survey %>% select(left_right, starts_with("trust"), sum_measures, sum_sources, mean_trust) %>% descr(stats = "common") ``` .right[↪️] --- class: middle .small[ ``` ## Descriptive Statistics ## corona_survey ## N: 3765 ## ## left_right mean_trust sum_measures sum_sources trust_chancellor ## --------------- ------------ ------------ -------------- ------------- ------------------ ## Mean 4.66 3.98 3.77 2.09 3.57 ## Std.Dev 1.86 0.75 1.16 0.95 1.15 ## Min 0.00 1.00 0.00 0.00 1.00 ## Median 5.00 4.00 4.00 2.00 4.00 ## Max 10.00 5.00 6.00 5.00 5.00 ## N.Valid 3678.00 3157.00 3186.00 3169.00 3130.00 ## Pct.Valid 97.69 83.85 84.62 84.17 83.13 ## ## Table: Table continues below ## ## ## ## trust_government trust_rki trust_scientists trust_who ## --------------- ------------------ ----------- ------------------ ----------- ## Mean 3.66 4.44 4.24 3.97 ## Std.Dev 1.01 0.77 0.79 0.95 ## Min 1.00 1.00 1.00 1.00 ## Median 4.00 5.00 4.00 4.00 ## Max 5.00 5.00 5.00 5.00 ## N.Valid 3134.00 3095.00 3107.00 3103.00 ## Pct.Valid 83.24 82.20 82.52 82.42 ``` ] --- ## Summary statistics with the `datawizard` package 🧙 The [`datawizard` package](https://easystats.github.io/datawizard/) which we introduced as an interesting alternative or addition for data wrangling before also provides a function for summary statistics. Similar to `describe()` from `summarytools`, this function also offers several options for customizing the output. ```r library(datawizard) corona_survey %>% select(left_right, starts_with("trust"), sum_measures, sum_sources, mean_trust) %>% describe_distribution(quartiles = TRUE) ``` .right[↪️] --- class: middle .small[ ``` ## Variable | Mean | SD | IQR | Range | Quartiles | Skewness | Kurtosis | n | n_Missing ## ----------------------------------------------------------------------------------------------------------- ## left_right | 4.66 | 1.86 | 3.00 | [0.00, 10.00] | 3.00, 6.00 | -0.10 | -0.16 | 3678 | 87 ## trust_rki | 4.44 | 0.77 | 1.00 | [1.00, 5.00] | 4.00, 5.00 | -1.69 | 3.55 | 3095 | 670 ## trust_government | 3.66 | 1.01 | 1.00 | [1.00, 5.00] | 3.00, 4.00 | -0.79 | 0.24 | 3134 | 631 ## trust_chancellor | 3.57 | 1.15 | 1.00 | [1.00, 5.00] | 3.00, 4.00 | -0.72 | -0.22 | 3130 | 635 ## trust_who | 3.97 | 0.95 | 1.00 | [1.00, 5.00] | 4.00, 5.00 | -1.02 | 1.02 | 3103 | 662 ## trust_scientists | 4.24 | 0.79 | 1.00 | [1.00, 5.00] | 4.00, 5.00 | -1.15 | 1.84 | 3107 | 658 ## sum_measures | 3.77 | 1.16 | 2.00 | [0.00, 6.00] | 3.00, 5.00 | -1.14 | 1.44 | 3186 | 579 ## sum_sources | 2.09 | 0.95 | 2.00 | [0.00, 5.00] | 1.00, 3.00 | 0.41 | 0.36 | 3169 | 596 ## mean_trust | 3.98 | 0.75 | 1.00 | [1.00, 5.00] | 3.60, 4.60 | -0.94 | 1.02 | 3157 | 608 ``` ] --- ## Summary statistics with `dplyr` `dplyr` provides a helpful function for creating summary statistics: `summarize()` `summarize()` is a [vectorized](https://win-vector.com/2019/01/03/what-does-it-mean-to-write-vectorized-code-in-r/) function that can be used to create summary statistics for variables using functions like... - `mean()` - `sd()` - `min()` - `max()` - etc. --- ## Summary statistics with `dplyr` While creating summary statistics using `summarize()` from `dplyr()` requires writing more code, it is the most flexible option. Another nice benefit of `summarize()` is that it produces a `tibble` which can be used for further analyses or for creating plots or tables. --- ## `dplyr::summarize()` .small[ ```r corona_survey %>% summarize( mean_trust_gov = mean(trust_government, na.rm = TRUE), sd_trust_gov = sd(trust_government, na.rm = TRUE), var_trust_gov = var(trust_government, na.rm = TRUE), min_trust_gov = min(trust_government, na.rm = TRUE), max_trust_gov = max(trust_government, na.rm = TRUE) ) ``` ``` ## # A tibble: 1 x 5 ## mean_trust_gov sd_trust_gov var_trust_gov min_trust_gov max_trust_gov ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 3.66 1.01 1.03 1 5 ``` ] --- ## `dplyr::group_by()` The `dplyr` function `group_by()` creates data frames (tibbles) that are grouped by one or more variables. This can, e.g., be used to produce grouped summary statistics. As we have seen for the `rowwise()` function in the session on advanced data wrangling, we can end/undo the grouping via the `ungroup()` function. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% group_by(choice_of_party) %>% summarize( mean_trust_gov = mean(trust_government, na.rm = TRUE), sd_trust_gov = sd(trust_government, na.rm = TRUE), var_trust_gov = var(trust_government, na.rm = TRUE), min_trust_gov = min(trust_government, na.rm = TRUE), max_trust_gov = max(trust_government, na.rm = TRUE) ) %>% ungroup() ``` .right[↪️] --- class: middle ``` ## # A tibble: 7 x 6 ## choice_of_party mean_trust_gov sd_trust_gov var_trust_gov min_trust_gov max_trust_gov ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 CDU/CSU 4.03 0.838 0.702 1 5 ## 2 SPD 3.89 0.910 0.829 1 5 ## 3 FDP 3.66 1.06 1.11 1 5 ## 4 Linke 3.37 1.06 1.13 1 5 ## 5 Gruene 3.87 0.804 0.647 1 5 ## 6 AfD 2.83 1.21 1.46 1 5 ## 7 Other 3.14 1.07 1.14 1 5 ``` --- ## `dplyr::across()` To produce grouped summary statistics for multiple variables we can use the `dplyr` function `across()` which we already saw in action in the session on advanced data wrangling. *Note*: We only use cases without missing data for any of the variables here (= listwise deletion). ```r corona_survey %>% select(choice_of_party, starts_with("trust")) %>% drop_na() %>% group_by(choice_of_party) %>% summarize(across(starts_with("trust"), list(mean = mean, sd = sd), .names = "{col}_{fn}")) ``` .right[↪️] --- class: middle ``` ## # A tibble: 7 x 11 ## choice_of_party trust_rki_mean trust_rki_sd trust_government_mean trust_government_sd trust_chancellor_mean ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 CDU/CSU 4.57 0.648 4.02 0.837 4.04 ## 2 SPD 4.57 0.647 3.87 0.910 3.81 ## 3 FDP 4.46 0.705 3.62 1.05 3.38 ## 4 Linke 4.36 0.858 3.35 1.05 3.30 ## 5 Gruene 4.52 0.696 3.86 0.802 3.84 ## 6 AfD 4.10 0.923 2.82 1.19 2.42 ## 7 Other 4.24 0.964 3.16 1.05 3 ## # ... with 5 more variables: trust_chancellor_sd <dbl>, trust_who_mean <dbl>, trust_who_sd <dbl>, ## # trust_scientists_mean <dbl>, trust_scientists_sd <dbl> ``` --- ## `dplyr::across()` <img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\dplyr_across.png" width="95%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations) </small></small> --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_1_Summary_Statistics.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_1_Summary_Statistics.html) --- ## Frequencies: `table()` A simple way of looking at frequencies (e.g., for categorical variables) is the `base R` function `table()`. ```r table(corona_survey$choice_of_party) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## 753 369 254 287 756 281 83 ``` If you also want to include `NA` in the frequency counts, you need to specify the argument `useNA = "always"`. ```r table(corona_survey$choice_of_party, useNA = "always") ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 753 369 254 287 756 281 83 982 ``` --- ## Proportions with `prop.table()` If you want proportions instead of raw counts, you can use the `base R` function `prop.table()`. **NB**: You need to apply this function to an output produced by `table()`. .small[ ```r prop.table(table(corona_survey$choice_of_party)) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## 0.27057133 0.13259073 0.09126842 0.10312612 0.27164930 0.10097018 0.02982393 ``` ```r prop.table(table(corona_survey$choice_of_party, useNA = "always")) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 0.20000000 0.09800797 0.06746348 0.07622842 0.20079681 0.07463479 0.02204515 0.26082337 ``` ] --- ## Proportions with `prop.table()` If you want fewer decimals places in the output, you can wrap the the `prop.table()` function in a `round()` call. ```r round(prop.table(table(corona_survey$choice_of_party, useNA = "always")), 3) # rounded to 3 decimal places ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 0.200 0.098 0.067 0.076 0.201 0.075 0.022 0.261 ``` Or if you want percentages... ```r round((prop.table(table(corona_survey$choice_of_party, useNA = "always")) * 100), 2) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other <NA> ## 20.00 9.80 6.75 7.62 20.08 7.46 2.20 26.08 ``` --- ## Frequencies and proportions with `summarytools::freq()` The `summarytools` package which we used before for summary statistics also includes the `freq()` function for frequency tables. .small[ ```r library(summarytools) freq(corona_survey$choice_of_party) ``` ``` ## Frequencies ## corona_survey$choice_of_party ## Type: Factor ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## ------------- ------ --------- -------------- --------- -------------- ## CDU/CSU 753 27.06 27.06 20.00 20.00 ## SPD 369 13.26 40.32 9.80 29.80 ## FDP 254 9.13 49.44 6.75 36.55 ## Linke 287 10.31 59.76 7.62 44.17 ## Gruene 756 27.16 86.92 20.08 64.25 ## AfD 281 10.10 97.02 7.46 71.71 ## Other 83 2.98 100.00 2.20 73.92 ## <NA> 982 26.08 100.00 ## Total 3765 100.00 100.00 100.00 100.00 ``` ] --- ## Frequencies and proportions with `janitor::tabyl()` The [`janitor` package](https://github.com/sfirke/janitor) that we briefly mentioned in the session on *Data Wrangling Basics* also provides the `tabyl()` function for creating frequency and proportion tables: .small[ ```r library(janitor) party_stats <- corona_survey %>% tabyl(choice_of_party) %>% adorn_pct_formatting(digits = 2, affix_sign = TRUE) party_stats ``` ``` ## choice_of_party n percent valid_percent ## CDU/CSU 753 20.00% 27.06% ## SPD 369 9.80% 13.26% ## FDP 254 6.75% 9.13% ## Linke 287 7.62% 10.31% ## Gruene 756 20.08% 27.16% ## AfD 281 7.46% 10.10% ## Other 83 2.20% 2.98% ## <NA> 982 26.08% - ``` ] --- ## Frequencies and proportions with `janitor::tabyl()` A nice thing about `tabyl()` is that is produces a data frame which we can, e.g., use for plotting or creating tables (which we will discuss later on). ```r class(party_stats) ``` ``` ## [1] "tabyl" "data.frame" ``` --- ## Frequencies and proportions with `dplyr` We can also use `group_by()` and `summarize()` to get frequencies and proportions for variables in our data set. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% group_by(choice_of_party) %>% summarize(n = n()) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## # A tibble: 7 x 3 ## choice_of_party n proportion ## <fct> <int> <dbl> ## 1 CDU/CSU 753 0.271 ## 2 SPD 369 0.133 ## 3 FDP 254 0.0913 ## 4 Linke 287 0.103 ## 5 Gruene 756 0.272 ## 6 AfD 281 0.101 ## 7 Other 83 0.0298 ``` --- ## Frequencies and proportions with `dplyr` Instead of using `group_by` and `summarize()` to get frequency counts, we can also use `count()` from `dplyr` as a shorthand. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% count(choice_of_party) %>% mutate(proportion = n/sum(n)) %>% ungroup() ``` ``` ## # A tibble: 7 x 3 ## choice_of_party n proportion ## <fct> <int> <dbl> ## 1 CDU/CSU 753 0.271 ## 2 SPD 369 0.133 ## 3 FDP 254 0.0913 ## 4 Linke 287 0.103 ## 5 Gruene 756 0.272 ## 6 AfD 281 0.101 ## 7 Other 83 0.0298 ``` --- ## Tables in `R` There are typically two types of outputs you can produce with `R` for further use in reports or publications: tables and plots. We will cover the creation of plots in the following session(s) on data visualization. However, as summary statistics, frequencies, and proportions are often presented in tables, we will briefly discuss how to create tables as output in `R` in the following. As with almost everything in `R`, there are many options (read: packages) for creating tables. We will show examples from three popular options: - [`stargazer`](https://cran.r-project.org/web/packages/stargazer/index.html) - [`gtsummary`](http://www.danieldsjoberg.com/gtsummary/) + [`flextable`](https://davidgohel.github.io/flextable/index.html) --- ## Summary tables with `stargazer` While there is an ever-growing list of `R` packages for creating tables with many new (promising) contenders (such as [`flextable`](https://davidgohel.github.io/flextable/index.html) or [`gt`](https://gt.rstudio.com/index.html)), `stargazer` is an established and widely-used tool for creating ACSII (plain text), `LaTeX`, and `HTML` table output. --- ## Summary tables with `stargazer` If we, e.g., want to create a summary statistics table as text output (e.g., for printing in the `R` console), we can use `stargazer` for that. **NB**: As the main `stargazer()` function does not work with tibbles, we need to convert our data to a regular data frame. ```r library(stargazer) corona_survey %>% select(starts_with("trust")) %>% as.data.frame() %>% stargazer(type = "text", digits = 2, title="Descriptive statistics") ``` .right[↪️] --- class: middle ``` ## ## Descriptive statistics ## ================================================================ ## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max ## ---------------------------------------------------------------- ## trust_rki 3,095 4.44 0.77 1.00 4.00 5.00 5.00 ## trust_government 3,134 3.66 1.01 1.00 3.00 4.00 5.00 ## trust_chancellor 3,130 3.57 1.15 1.00 3.00 4.00 5.00 ## trust_who 3,103 3.97 0.95 1.00 4.00 5.00 5.00 ## trust_scientists 3,107 4.24 0.79 1.00 4.00 5.00 5.00 ## ---------------------------------------------------------------- ``` --- ## Summary tables with `stargazer` `stargazer` also supports `HTML` and `LaTeX` as output formats. We can export the output for further use (e.g., with our `LaTeX` editor of choice). ```r # We create a directory for output files first dir.create("./output") corona_survey %>% select(starts_with("trust")) %>% as.data.frame() %>% stargazer(type = "latex", digits = 2, out = "./output/stargazer_summary.tex", title="Descriptive statistics") ``` *Note*: If you look at the help file for the `stargazer()` function (via `?stargazer`), you will see that it provides a lot of customization options. --- ## Summary tables with `gtsummary` The `tbl_summary()` function from the [`gtsummary` package](http://www.danieldsjoberg.com/gtsummary/) can, e.g., be used to produce frequency table output. If you run the following code in *RStudio*, the table will be displayed in the `Viewer` pane and can then be exported from there as an image or an `HTML` file. ```r corona_survey %>% select(sex, age_cat, education_cat) %>% tbl_summary() ``` *Note*: As with `stargazer`, `tbl_summary()` also provides various customization options and you can save its output in different formats, including `.html`, `.rtf`, and `.tex`. --- ## Summary tables with `gtsummary` You can also save the output of `tbl_summary()` directly in a *Word* file. For that to work, however, you also need to use a function from the [`flextable` package](https://davidgohel.github.io/flextable/index.html). ```r library(flextable) corona_survey %>% select(sex, age_cat, education_cat) %>% tbl_summary(label = list(sex ~ "Sex", age_cat ~ "Age category", education_cat ~ "Education category")) %>% as_flex_table() %>% save_as_docx("Sample descriptives" = ., path = "./output/gt_summary.docx") ``` *Note*: If you have not done this before, for this code to work, you need to create an `output` folder in your current working directory using `dir.create("./output")`. --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_2_Frequencies_Proportions.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_2_Frequencies_Proportions.html) --- ## Relationships between variables In addition to checking summary statistics for individual variables, another thing that you quite possibly also want to look at as part of your exploratory data analysis (EDA) are the relationships between (specific) variables in your data set. There are many ways to do so and the appropriate choice of methods, of course, depends on the types of variables you want to explore. In the following, we will briefly discuss some options for two methods of exploring relationships between variables: - crosstabulation (for categorical variables) - correlations (for numeric and/or binary variables) --- ## Crosstabs You can also use the `base R` functions `table()` and `prop.table()` we have used before for creating univariate frequency and proportion tables to generate crosstabs. ```r table(corona_survey$sex, corona_survey$choice_of_party) # rows, columns ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 413 221 160 170 315 204 52 ## Female 340 148 94 117 441 77 31 ``` ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party))*100, 2) ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 14.84 7.94 5.75 6.11 11.32 7.33 1.87 ## Female 12.22 5.32 3.38 4.20 15.85 2.77 1.11 ``` --- ## Crosstabs We can also calculate row or column percentages using these `base R` functions. ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party), 1)*100, 2) # row percentages ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 26.91 14.40 10.42 11.07 20.52 13.29 3.39 ## Female 27.24 11.86 7.53 9.38 35.34 6.17 2.48 ``` ```r round(prop.table(table(corona_survey$sex, corona_survey$choice_of_party), 2)*100, 2) # column percentages ``` ``` ## ## CDU/CSU SPD FDP Linke Gruene AfD Other ## Male 54.85 59.89 62.99 59.23 41.67 72.60 62.65 ## Female 45.15 40.11 37.01 40.77 58.33 27.40 37.35 ``` .small[ *Note*: If you want to generate tables based on more than two variables, the `base R` function `ftable()` is a good option for prettier printing of results. ] --- ## Crosstabs with `dplyr` We can use functions from `dplyr` to create crosstabs including frequencies. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% count(sex, choice_of_party) %>% pivot_wider(names_from = choice_of_party, values_from = n) ``` ``` ## # A tibble: 2 x 8 ## sex `CDU/CSU` SPD FDP Linke Gruene AfD Other ## <fct> <int> <int> <int> <int> <int> <int> <int> ## 1 Male 413 221 160 170 315 204 52 ## 2 Female 340 148 94 117 441 77 31 ``` .small[ *Note*: We have only briefly mentioned the function in the session on *Data Wrangling Basics* (when introducing the concept of tidy data), but - as you might guess from its name - the function `pivot_wider()`, which is part of the [`tidyr` package](https://tidyr.tidyverse.org/), changes the format of a data set from long to wide. ] --- ## Crosstabs with `dplyr` We can also use functions from `dplyr` in a similar fashion as we have done for a single variable to create crosstabs including percentages. ```r corona_survey %>% filter(!is.na(choice_of_party)) %>% count(sex, choice_of_party) %>% mutate(proportion = n/sum(n)*100) %>% select(-n) %>% pivot_wider(names_from = choice_of_party, values_from = proportion) ``` ``` ## # A tibble: 2 x 8 ## sex `CDU/CSU` SPD FDP Linke Gruene AfD Other ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Male 14.8 7.94 5.75 6.11 11.3 7.33 1.87 ## 2 Female 12.2 5.32 3.38 4.20 15.8 2.77 1.11 ``` --- ## Other options for crosstabulation in `R` As with most things in `R`, there are many options for creating crosstabs. Some alternatives to the ones we presented before include the `CrossTable()` and `crosstab()` functions from the [`descr` package](https://cran.r-project.org/web/packages/descr/index.html) or the `ctable()` function from the [`summarytools` package](https://github.com/dcomtois/summarytools). A very versatile option for crosstabs is the `tabyl()` function from the `janitor` package that we have introduced before. --- ## Crosstabs with the `janitor` package The `tabyl()` function from the `janitor` package provides quite a few options for crosstabs. We will only show one example here, but you can learn more in the [`tabyl` vignette](https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html). .small[ ```r library(janitor) corona_survey %>% filter(!is.na(choice_of_party)) %>% tabyl(sex, choice_of_party) %>% adorn_totals(where = c("row","col")) %>% adorn_percentages(denominator = "row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns(position = "front") ``` ``` ## sex CDU/CSU SPD FDP Linke Gruene AfD Other Total ## Male 413 (26.91%) 221 (14.40%) 160 (10.42%) 170 (11.07%) 315 (20.52%) 204 (13.29%) 52 (3.39%) 1535 (100.00%) ## Female 340 (27.24%) 148 (11.86%) 94 (7.53%) 117 (9.38%) 441 (35.34%) 77 (6.17%) 31 (2.48%) 1248 (100.00%) ## Total 753 (27.06%) 369 (13.26%) 254 (9.13%) 287 (10.31%) 756 (27.16%) 281 (10.10%) 83 (2.98%) 2783 (100.00%) ``` ] --- ## Chi-Square Test We can use the `summary()` function from `base R` in combination with `table()` to perform a chi-square test. ```r summary(table(corona_survey$sex, corona_survey$choice_of_party)) ``` ``` ## Number of cases in table: 2783 ## Number of factors: 2 ## Test for independence of all factors: ## Chisq = 103.67, df = 6, p-value = 4.292e-20 ``` --- ## Chi-Square Test The other packages that include functions for crosstabs mentioned before can also be used for chi-square tests: For example, the `janitor` package. ```r library(janitor) corona_survey %>% filter(!is.na(choice_of_party)) %>% tabyl(sex, choice_of_party) %>% chisq.test() ``` ``` ## ## Pearson's Chi-squared test ## ## data: . ## X-squared = 103.67, df = 6, p-value < 2.2e-16 ``` --- ## Correlations Again, as with the crosstabs examples, there are many different options for calculating and displaying correlations in `R`. In addition to the `base R` functions, we will look at two packages in this part: [`corrr`](https://corrr.tidymodels.org/) and [`correlation`](https://github.com/easystats/correlation). *Note*: While we will not cover that in this session, `corrr` and `correlation` also offer some nice options for plotting correlations. --- ## Correlations with `base R` The `base R` function `cor()` computes the correlation coefficient(s) between two or more variables. This function can be used to calculate *Pearson's r*, *Kendall's tau*, and *Spearman's rho*. We also need to specify how we want to deal with missing values (e.g., use pairwise complete observations). For example, let's look at the correlations between the trust variables in our data set: .small[ ```r trust <- corona_survey %>% select(starts_with("trust")) cor(trust, use = "pairwise.complete.obs", method = "pearson") ``` ``` ## trust_rki trust_government trust_chancellor trust_who trust_scientists ## trust_rki 1.0000000 0.5233628 0.4566379 0.5234133 0.4919817 ## trust_government 0.5233628 1.0000000 0.8700270 0.5686134 0.4475961 ## trust_chancellor 0.4566379 0.8700270 1.0000000 0.5322833 0.3899243 ## trust_who 0.5234133 0.5686134 0.5322833 1.0000000 0.5117994 ## trust_scientists 0.4919817 0.4475961 0.3899243 0.5117994 1.0000000 ``` ] --- ## Correlations with `base R` With `corr.test()` you can display the results of a significance test for a correlation. ```r cor.test(trust$trust_rki, trust$trust_scientists, method = "pearson") ``` ``` ## ## Pearson's product-moment correlation ## ## data: trust$trust_rki and trust$trust_scientists ## t = 31.25, df = 3058, p-value < 2.2e-16 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## 0.4646480 0.5183788 ## sample estimates: ## cor ## 0.4919817 ``` --- ## The `corrr` package The [`corrr` package](https://corrr.tidymodels.org/) is part of the [`tidymodels` suite of packages](https://www.tidymodels.org/) and provides various functions for displaying correlations. The main function is `correlate()` which produces a `tibble` as output. .small[ ```r library(corrr) correlate(trust) ``` ``` ## # A tibble: 5 x 6 ## term trust_rki trust_government trust_chancellor trust_who trust_scientists ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 trust_rki NA 0.523 0.457 0.523 0.492 ## 2 trust_government 0.523 NA 0.870 0.569 0.448 ## 3 trust_chancellor 0.457 0.870 NA 0.532 0.390 ## 4 trust_who 0.523 0.569 0.532 NA 0.512 ## 5 trust_scientists 0.492 0.448 0.390 0.512 NA ``` ] --- ## The `corrr` package The `corrr` package provides several functions for tweaking/optimizing the output of the `correlate()` function. Here's one example: .small[ ```r trust %>% correlate() %>% rearrange() %>% shave() %>% fashion() ``` ``` ## term trust_scientists trust_rki trust_who trust_government trust_chancellor ## 1 trust_scientists ## 2 trust_rki .49 ## 3 trust_who .51 .52 ## 4 trust_government .45 .52 .57 ## 5 trust_chancellor .39 .46 .53 .87 ``` ] --- ## The `correlation` package The [`correlation` package](https://github.com/easystats/correlation) is part of the [`easystats` collection of packages](https://easystats.github.io/easystats/) (also called the `Easyverse`). It provides a much wider range of correlation types than the `base R` function `cor()` and the `correlate()` function from the `corrr` package (e.g., biserial and tetrachoric correlations for factors). Its core is the `correlation()` function. ```r library(correlation) correlation(trust) ``` .right[↪️] --- class: middle ``` ## # Correlation Matrix (pearson-method) ## ## Parameter1 | Parameter2 | r | 95% CI | t | df | p ## ------------------------------------------------------------------------------------ ## trust_rki | trust_government | 0.52 | [0.50, 0.55] | 34.06 | 3076 | < .001*** ## trust_rki | trust_chancellor | 0.46 | [0.43, 0.48] | 28.45 | 3072 | < .001*** ## trust_rki | trust_who | 0.52 | [0.50, 0.55] | 33.94 | 3053 | < .001*** ## trust_rki | trust_scientists | 0.49 | [0.46, 0.52] | 31.25 | 3058 | < .001*** ## trust_government | trust_chancellor | 0.87 | [0.86, 0.88] | 98.46 | 3113 | < .001*** ## trust_government | trust_who | 0.57 | [0.54, 0.59] | 38.38 | 3083 | < .001*** ## trust_government | trust_scientists | 0.45 | [0.42, 0.48] | 27.82 | 3089 | < .001*** ## trust_chancellor | trust_who | 0.53 | [0.51, 0.56] | 34.89 | 3079 | < .001*** ## trust_chancellor | trust_scientists | 0.39 | [0.36, 0.42] | 23.52 | 3084 | < .001*** ## trust_who | trust_scientists | 0.51 | [0.49, 0.54] | 32.99 | 3066 | < .001*** ## ## p-value adjustment method: Holm (1979) ## Observations: 3055-3115 ``` --- ## The `correlation` package Among other things, the `correlation` package allows to calculate grouped/stratified correlations. ```r corona_survey %>% select(sex, starts_with("trust")) %>% group_by(sex) %>% correlation() ``` .right[↪️] --- class: center, middle .small[ ``` ## # Correlation Matrix (pearson-method) ## ## Group | Parameter1 | Parameter2 | r | 95% CI | t | df | p ## --------------------------------------------------------------------------------------------- ## Male | trust_rki | trust_government | 0.52 | [0.48, 0.55] | 23.98 | 1577 | < .001*** ## Male | trust_rki | trust_chancellor | 0.45 | [0.41, 0.49] | 19.88 | 1575 | < .001*** ## Male | trust_rki | trust_who | 0.49 | [0.45, 0.53] | 22.40 | 1566 | < .001*** ## Male | trust_rki | trust_scientists | 0.46 | [0.42, 0.49] | 20.25 | 1566 | < .001*** ## Male | trust_government | trust_chancellor | 0.86 | [0.85, 0.87] | 67.81 | 1596 | < .001*** ## Male | trust_government | trust_who | 0.56 | [0.53, 0.59] | 27.06 | 1584 | < .001*** ## Male | trust_government | trust_scientists | 0.41 | [0.37, 0.45] | 17.87 | 1583 | < .001*** ## Male | trust_chancellor | trust_who | 0.54 | [0.50, 0.57] | 25.50 | 1582 | < .001*** ## Male | trust_chancellor | trust_scientists | 0.36 | [0.31, 0.40] | 15.26 | 1580 | < .001*** ## Male | trust_who | trust_scientists | 0.49 | [0.46, 0.53] | 22.49 | 1572 | < .001*** ## Female | trust_rki | trust_government | 0.53 | [0.49, 0.57] | 24.29 | 1497 | < .001*** ## Female | trust_rki | trust_chancellor | 0.47 | [0.43, 0.51] | 20.56 | 1495 | < .001*** ## Female | trust_rki | trust_who | 0.57 | [0.53, 0.60] | 26.75 | 1485 | < .001*** ## Female | trust_rki | trust_scientists | 0.54 | [0.50, 0.57] | 24.45 | 1490 | < .001*** ## Female | trust_government | trust_chancellor | 0.88 | [0.87, 0.89] | 72.30 | 1515 | < .001*** ## Female | trust_government | trust_who | 0.57 | [0.54, 0.61] | 27.15 | 1497 | < .001*** ## Female | trust_government | trust_scientists | 0.50 | [0.46, 0.53] | 22.12 | 1504 | < .001*** ## Female | trust_chancellor | trust_who | 0.51 | [0.47, 0.55] | 23.06 | 1495 | < .001*** ## Female | trust_chancellor | trust_scientists | 0.43 | [0.39, 0.47] | 18.66 | 1502 | < .001*** ## Female | trust_who | trust_scientists | 0.55 | [0.51, 0.58] | 25.38 | 1492 | < .001*** ## ## p-value adjustment method: Holm (1979) ## Observations: 1487-1598 ``` ] --- class: center, middle # [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_3_1_3_Crosstabs_Correlations.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_3_1_3_Crosstabs_Correlations.html) --- ## Guilty by ~~association~~ correlation While correlation coefficients are useful for exploring relationships between variables, they can also be misleading. For example, if we do correlation analysis and encounter a (Pearson's) correlation coefficient close to 0, we often think of relationships as pictured below. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-1-1.png" width="45%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation This data set has **the same correlation coefficient (Pearson's r of -0.06)** as the one on the previous slide: <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-2-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation 🦖 So does this one... <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-3-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Guilty by ~~association~~ correlation We could go on... The previous three examples all come from the [`datasauRus` package](https://github.com/lockedata/datasauRus) which essentially is an extension of [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) and includes 13 data sets with the same (Pearson) correlation between x and y. <img src="data:image/png;base64,#3_1_Exploratory_Data_Analysis_files/figure-html/dino-plot-4-1.png" width="50%" style="display: block; margin: auto;" /> --- ## Trust no singular value! Importantly, the x- and y-variables in these `datasaurus_dozen` data set also all have the same means and standard deviations. ```r datasaurus_dozen %>% group_by(dataset) %>% summarize( mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), corr = cor(x, y, method = "pearson") ) ``` .right[↪️] --- class: middle .small[ ``` ## # A tibble: 13 x 6 ## dataset mean_x mean_y sd_x sd_y corr ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 away 54.3 47.8 16.8 26.9 -0.0641 ## 2 bullseye 54.3 47.8 16.8 26.9 -0.0686 ## 3 circle 54.3 47.8 16.8 26.9 -0.0683 ## 4 dino 54.3 47.8 16.8 26.9 -0.0645 ## 5 dots 54.3 47.8 16.8 26.9 -0.0603 ## 6 h_lines 54.3 47.8 16.8 26.9 -0.0617 ## 7 high_lines 54.3 47.8 16.8 26.9 -0.0685 ## 8 slant_down 54.3 47.8 16.8 26.9 -0.0690 ## 9 slant_up 54.3 47.8 16.8 26.9 -0.0686 ## 10 star 54.3 47.8 16.8 26.9 -0.0630 ## 11 v_lines 54.3 47.8 16.8 26.9 -0.0694 ## 12 wide_lines 54.3 47.8 16.8 26.9 -0.0666 ## 13 x_shape 54.3 47.8 16.8 26.9 -0.0656 ``` ] --- ## Plot your data! The message from the `datasaurus_dozen` examples should be clear. Relying only on singular values that summarize the location or spread of a single variable or the association of two variables is not a good idea. To avoid reducing a ~~mountain to a molehill~~ dinosaur to a lack of correlation, it is important to plot your data to explore: - univariate distributions - grouped univariate distributions (if you want to compare groups) - bivariate distributions For that reason (and because it is fun and `R` has a lot to offer in that regard), we will dive 🥽 right into data visualization in the next session... --- ## Missings & outliers Two things that can influence summary statistics as well as univariate and bivariate distributions are missings and outliers. Hence, before we can analyze our data, we should check whether we have clear patterns of missingness or extreme outliers in our data. --- ## Missing data In the *Data Wrangling Basics* session we have already discussed how we can define specific values as missings (`NA`) and how we can recode `NA` into something else. As we have seen in that session, the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* data set contains quite a few codes for different types of missing data. However, when we collect data ourselves or if we (re-)use data sets that are not as well-documented as the *GESIS Panel* data, we may need to explore potential patterns of missingness to see if there may be identifiable reasons why data for certain variables and/or observations are missing. --- ## Summarizing missingness The [`naniar` package](http://naniar.njtierney.com/index.html) that we have used for recoding values as `NA` in the *Data Wrangling Basics* session also provides a [collection of functions for summarizing missingness](http://naniar.njtierney.com/reference/miss_var_summary.html) in variables or whole data sets, of which we will show three examples in the following. ```r library(naniar) corona_survey %>% select(starts_with("trust")) %>% miss_var_summary(order = TRUE) ``` ``` ## # A tibble: 5 x 3 ## variable n_miss pct_miss ## <chr> <int> <dbl> ## 1 trust_rki 670 17.8 ## 2 trust_who 662 17.6 ## 3 trust_scientists 658 17.5 ## 4 trust_chancellor 635 16.9 ## 5 trust_government 631 16.8 ``` --- ## Summarizing missingness with `naniar` ```r corona_survey %>% pct_complete_case() ``` ``` ## [1] 58.91102 ``` --- ## Summarizing missingness with `naniar` ```r corona_survey %>% miss_case_table() ``` ``` ## # A tibble: 16 x 3 ## n_miss_in_case n_cases pct_cases ## <int> <int> <dbl> ## 1 0 2218 58.9 ## 2 1 796 21.1 ## 3 2 102 2.71 ## 4 3 23 0.611 ## 5 4 8 0.212 ## 6 5 5 0.133 ## 7 6 12 0.319 ## 8 7 8 0.212 ## 9 12 8 0.212 ## 10 13 5 0.133 ## 11 14 2 0.0531 ## 12 19 2 0.0531 ## 13 20 2 0.0531 ## 14 21 350 9.30 ## 15 22 187 4.97 ## 16 23 37 0.983 ``` --- ## Outliers There are [many ways of identifying and dealing with outliers](https://ouzhang.me/2020/11/16/outliers-part4/). Here, we will only look at two methods (one univariate, one multivariate): [Interquartile range (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) and [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance). --- ## Outliers As outlier detection works better with numeric data on an [interval scale](https://en.wikipedia.org/wiki/Level_of_measurement#Interval_scale), we will use data from [*Gapminder*](https://www.gapminder.org/) on life expectancy GDP per capita in 2018 for identifying outliers. .small[ ```r life_exp_2018 <- read_csv("./data/life_expectancy_years.csv") %>% pivot_longer(-country, names_to = "year", values_to = "life_exp") %>% filter(year == "2018") %>% select(-year) gdp_pc_2018 <- read_csv("./data/gdppercapita_us_inflation_adjusted.csv") %>% pivot_longer(-country, names_to = "year", values_to = "gdp_percap") %>% filter(year == "2018") %>% select(-year) gapminder_2018 <- life_exp_2018 %>% left_join(gdp_pc_2018, by = "country") ``` ] --- ## Detecting outliers: IQR For identifying univariate outliers we can specify lower and upper cutoffs, e.g., using the formula 25th percentile - 1.5 x interquartile range (IQR) for the lower and 75th percentile + 1.5 x IQR for the upper limit. In `R` this can be done as follows: .small[ ```r q2575 <- quantile(gdp_pc_2018$gdp_percap, probs=c(.25, .75), na.rm = TRUE) iqr <- IQR(gdp_pc_2018$gdp_percap, na.rm = TRUE) ul <- q2575[2]+1.5*iqr ll <- q2575[1]-1.5*iqr gdp_pc_2018_cut <- gdp_pc_2018 %>% filter(gdp_percap <= ul) nrow(gdp_pc_2018) - nrow(gdp_pc_2018_cut) ``` ``` ## [1] 31 ``` ] .small[ **NB**: The value of 1.5 x IQR is a rule of thumb. You should always check whether this makes sense for your data. In the above example with the *Gapminder* data, e.g., it most likely does not. An alternative approach is using mean + 2 x SD (or 3 x SD), but, again, this is nothing more than a rule of thumb. ] --- ## Detecting outliers: Mahalanobis distance A common method for identifying multivariate outliers is Mahalanobis distance. While there is a `base R` function called `mahalanobis()` that you can use to calculate Mahalanobis distance, a more convenient option is the `mahalanobis_distance()` function from the [`rstatix` package](https://rpkgs.datanovia.com/rstatix/index.html). ```r library(rstatix) md <- gapminder_2018 %>% drop_na() %>% mahalanobis_distance() %>% select(-c(life_exp, gdp_percap)) names(md) gapminder_2018 %>% drop_na() %>% bind_cols(md) %>% arrange(desc(mahal.dist)) ``` .right[↪️] --- class: middle ``` ## [1] "mahal.dist" "is.outlier" ``` ``` ## # A tibble: 176 x 5 ## country life_exp gdp_percap mahal.dist is.outlier ## <chr> <dbl> <dbl> <dbl> <lgl> ## 1 Luxembourg 81.8 111000 31.4 TRUE ## 2 Norway 82.5 92100 18.7 TRUE ## 3 Switzerland 84.1 79200 11.8 FALSE ## 4 Central African Republic 52.4 379 11.7 FALSE ## 5 Ireland 82.1 76700 11.4 FALSE ## 6 Lesotho 55.5 1410 8.32 FALSE ## 7 Denmark 80.9 63900 7.06 FALSE ## 8 Qatar 80.3 63300 7.03 FALSE ## 9 Papua New Guinea 58.7 2420 5.41 FALSE ## 10 Singapore 85 58200 5.26 FALSE ## # ... with 166 more rows ``` --- ## Other packages for EDA While we have covered quite a few packages (that can be used) for EDA in this session, there still are plenty of others that also provide some interesting functionalities. Some of these additional/alternative options are: - [`inspectdf`](https://alastairrushworth.github.io/inspectdf/) - [`skimr`](https://docs.ropensci.org/skimr/) - [`descriptr`](https://descriptr.rsquaredacademy.com/index.html) - [`DataExplorer`](https://boxuancui.github.io/DataExplorer/) - [`explore`](https://github.com/rolkra/explore) - [`dataReporter`](https://github.com/ekstroem/dataReporter) --- ## Automated EDA reports Some of the EDA packages provide functions for generating automated EDA reports, for example: - the `dfSummary()` function from the `summarytools` package - the `skim()` function from the [`skimr` package](https://docs.ropensci.org/skimr/) - the `make_report()` function from the [`dataReporter` package](https://github.com/ekstroem/dataReporter) - the `report()` function from the [`explore` package](https://github.com/rolkra/explore) The function `explore`() from the `explore` package also allows for interactive data exploration. --- # Extracurricular activities Further explore the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* data set and develop some ideas for interesting research questions that you could answer by analyzing it with `R`.