# Introduction to R for Data Analysis
## Data Types, Import & Export
### Johannes Breuer & Stefan Jünger
### 2021-08-02

---

---

## Getting data into `R`
So far, we have learned what `R` and `RStudio` are. This course is about starting to use `R` and feeling prepared to use it for statistical analyses. There's one essential prerequisite: getting your data into `R`.

---

## Content of this session

- What are `R`'s internal data types?
- How to work with different data types?
- How to import data in different formats?
- How to export data in different formats?

---

## Data we use in this course

During the course, we use several different data sets. Especially in this session, where we apply different import functions, we use quite a few data sets, from data about the Titanic to data about unicorns. However, we will also use data that are more interesting for social and behavioral scientists.

---

## It all boils down to...

.pull-left[
**How your data are stored (data types)**
- 'Numbers' (Integers & Doubles)
- Character Strings
- Logical
- Factors
- ...
- There's more, e.g., expressions, but let's leave it at that
]

.pull-right[
**Where your data are stored (data formats)**
- Vectors
- Matrices
- Arrays
- Data frames / Tibbles
- Lists
]

---

## Numeric data
.small[
*Integers* are whole numbers, i.e., values without a decimal part. To explicitly create an integer in `R`, place an `L` directly after the value.

```r
1L
```

```
## [1] 1
```

By contrast, *doubles* are values with a decimal part.

```r
1.1
```

```
## [1] 1.1
```

We can check data types by using the `typeof()` function.

```r
typeof(1L)
```

```
## [1] "integer"
```

```r
typeof(1.1)
```

```
## [1] "double"
```
]
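
As a small sketch of how the two numeric types behave (the object name `converted_value` is only used for illustration here): a number typed without the `L` suffix is stored as a double, even if it has no decimal part, and `as.integer()` converts a double into an integer by dropping the decimal part.

```r
# a plain number is a double, even without a decimal part
typeof(1)
is.integer(1)

# both integers and doubles count as "numeric"
is.numeric(1L)
is.numeric(1.1)

# as.integer() drops the decimal part (it does not round)
converted_value <- as.integer(1.9)
converted_value
typeof(converted_value)
```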

---

## Character strings
At first glance, a *character* is a letter somewhere between a-z, and a *string* is a series of characters. However, numbers and other symbols can also be part of a *character string*, which can then be, e.g., part of a text. In `R`, character strings are wrapped in quotation marks.

```r
"Hi. I am a character string, the 1st of its kind!"
```

```
## [1] "Hi. I am a character string, the 1st of its kind!"
```

*Note*: There are no values associated with the content of character strings unless we change that, e.g., with factors.

---

## Factors

If you're a *Stata* (or *SPSS*) user, you may already be familiar with factors. Factors are data types that assume that their values are not continuous, e.g., as in [ordinal](https://en.wikipedia.org/wiki/Level_of_measurement#Ordinal_scale) or [nominal](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level) data.

```r
factor(1.1)
```

```
## [1] 1.1
## Levels: 1.1
```

```r
factor("Hi. I am a character string, the 1st of its kind!")
```

```
## [1] Hi. I am a character string, the 1st of its kind!
## Levels: Hi. I am a character string, the 1st of its kind!
```

Factors take numeric data or character strings as input and convert them into so-called levels. This concept may be a bit abstract for the time being. For now, it is enough to have heard about factors before you learn more about them.
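
To make the idea of levels a bit more concrete, here is a minimal sketch with made-up answer categories (the objects `answers` and `answers_factor` are not part of the course data). It shows that we can set the order of the levels ourselves and that each value is stored internally as the position of its level.

```r
# a character vector with repeated, ordinal answers (made-up example data)
answers <- c("low", "high", "medium", "low", "high")

# levels are sorted alphabetically unless we specify their order
answers_factor <- factor(answers, levels = c("low", "medium", "high"))

answers_factor
levels(answers_factor)

# internally, each value is stored as the integer position of its level
as.integer(answers_factor)

# count how often each level occurs
table(answers_factor)
```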

---

## Logical values

Logical values are either `TRUE` or `FALSE`. They are produced by making logical comparisons on your data.

```r
2 > 1
```

```
## [1] TRUE
```

```r
2 < 1
```

```
## [1] FALSE
```

Logical values are at the heart of creating loops. For this purpose, however, we need more logical operators that return `TRUE` or `FALSE` values.

---

## Logical operators

There are quite a few logical operators in `R`:

.pull-left[
- `<` less than
- `<=` less than or equal to
- `>` greater than
- `>=` greater than or equal to
- `==` exactly equal to
- `!=` not equal to
]

.pull-right[
- `!x` not x
- `x | y` x OR y
- `x & y` x AND y
- `isTRUE(x)` test if x is TRUE
- `isFALSE(x)` test if x is FALSE
]

Moreover, there are some more `is.PROPERTY_ASKED_FOR()` functions, such as `is.numeric()`, which also return `TRUE` or `FALSE` values.
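
The following sketch applies some of these operators to a small made-up vector (`ages` is an assumption for illustration, not course data). Note that comparisons work element by element, whereas `isTRUE()` expects a single value.

```r
ages <- c(14, 35, 38, 16) # made-up example values

# comparisons are applied to each element of a vector
ages >= 18

# combine conditions with & (AND) and | (OR), negate with !
ages >= 18 & ages < 36
ages < 16 | ages > 36
!(ages >= 18)

# isTRUE() expects a single value
isTRUE(ages[1] < 18)

# is.*() functions check properties of an object
is.numeric(ages)
is.character(ages)
```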

---

## `R`'s data formats

`R`'s different data types can be put into 'containers'.

---

## Vectors

Vectors are built by enclosing your content with `c()` ("c" for "concatenate").

```r
numeric_vector   <- c(1, 2, 3, 4)
character_vector <- c("a", "b", "c", "d")

numeric_vector
```

```
## [1] 1 2 3 4
```

```r
character_vector
```

```
## [1] "a" "b" "c" "d"
```

Vectors are really like vectors in mathematics. Initially, it doesn't matter if you look at them as column or row vectors.
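
As in mathematics, we can calculate with whole vectors at once. A minimal sketch, re-creating `numeric_vector` so that the example runs on its own:

```r
numeric_vector <- c(1, 2, 3, 4)

# arithmetic is applied element by element
numeric_vector * 2
numeric_vector + numeric_vector

# many functions take a whole vector as input
length(numeric_vector)
sum(numeric_vector)
mean(numeric_vector)
```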

---

## ...but it matters when you combine vectors

Using the functions `cbind()` and `rbind()`, you can combine vectors column-wise or row-wise, respectively. They then become matrices.

```r
cbind(numeric_vector, character_vector)
```

```
##      numeric_vector character_vector
## [1,] "1"            "a"             
## [2,] "2"            "b"             
## [3,] "3"            "c"             
## [4,] "4"            "d"
```

```r
rbind(numeric_vector, character_vector)
```

```
##                  [,1] [,2] [,3] [,4]
## numeric_vector   "1"  "2"  "3"  "4" 
## character_vector "a"  "b"  "c"  "d"
```

.small[
*Note*: The numeric values are [coerced](https://www.oreilly.com/library/view/r-in-a/9781449358204/ch05s08.html) into strings here.
]

---

## Matrices

Matrices are the basic rectangular data format in `R`.

```r
fancy_matrix <- matrix(1:16, nrow = 4)

fancy_matrix
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
```

You cannot store multiple data types, such as strings and numeric values, in the same matrix. If you try, your data will be coerced to a common type, as seen on the previous slide. This already happens within vectors:

```r
c(1, 2, "evil string")
```

```
## [1] "1"           "2"           "evil string"
```
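
Coercion follows a fixed order: logical values become integers, integers become doubles, and everything becomes a character string if at least one string is involved. The sketch below illustrates this with `typeof()`. The last line shows that converting back with `as.numeric()` only works for strings that look like numbers.

```r
# logical < integer < double < character
typeof(c(TRUE, 1L))
typeof(c(1L, 2.5))
typeof(c(2.5, "evil string"))

# TRUE becomes 1 and FALSE becomes 0 when combined with numbers
c(TRUE, FALSE, 3)

# converting back only works if the strings look like numbers;
# everything else becomes NA (with a warning, suppressed here)
suppressWarnings(as.numeric(c("1", "2", "evil string")))
```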

---

## Data frames

While matrices are used, e.g.,--\*drumroll\*-- for matrix operations, data frames more closely resemble the data formats most of you are probably already familiar with. We can build data frames by hand, like this:

```r
library(randomNames) # a name generator package

fancy_data <-
  data.frame( 
    who = 
      randomNames(n = 10, which.names = "first"),
    age = 
      sample(14:49, 10, replace = TRUE), # you see what we are doing here?   
    salary_2018 = 
      sample(15:100, 10, replace = TRUE),  
    salary_2019 = 
      sample(15:100, 10, replace = TRUE)
  )
 
fancy_data
```

---
class: middle

```
##          who age salary_2018 salary_2019
## 1     Westen  14          15          58
## 2     Waseef  35          37          24
## 3      Logan  38          59          54
## 4     Derick  16          95         100
## 5    Patrick  23          71          95
## 6     Robert  43          93          94
## 7       Alix  22          69          93
## 8       Anna  38          78          84
## 9  Nicolette  46          91          71
## 10   Stephon  33          64          20
```

---

## Tibbles

.pull-left[
Tibbles are data frames with a few conveniences (they come from the `tibble` package):

- only the first ten observations are printed
  - the output is tidier!
- you get some additional metadata about rows and columns that you would normally only get when using `dim()` and other functions

You can check the [tibble vignette](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) for technical details.
]

.pull-right[
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\tibble.png" width="60%" style="display: block; margin: auto;" />
]

---

## Tibble conversion

```r
library(tibble)
as_tibble(fancy_data)
```

```
## # A tibble: 10 x 4
##    who         age salary_2018 salary_2019
##    <chr>     <int>       <int>       <int>
##  1 Westen       14          15          58
##  2 Waseef       35          37          24
##  3 Logan        38          59          54
##  4 Derick       16          95         100
##  5 Patrick      23          71          95
##  6 Robert       43          93          94
##  7 Alix         22          69          93
##  8 Anna         38          78          84
##  9 Nicolette    46          91          71
## 10 Stephon      33          64          20
```

---

## One last type you should know: lists

Lists are perfect for storing numerous and potentially diverse pieces of information in one place.

```r
fancy_list <- 
  list(
    numeric_vector,
    character_vector,
    fancy_matrix,
    fancy_data
  )

fancy_list
```

---
class: middle
.tinyish[

```
## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "a" "b" "c" "d"
## 
## [[3]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## [[4]]
##          who age salary_2018 salary_2019
## 1     Westen  14          15          58
## 2     Waseef  35          37          24
## 3      Logan  38          59          54
## 4     Derick  16          95         100
## 5    Patrick  23          71          95
## 6     Robert  43          93          94
## 7       Alix  22          69          93
## 8       Anna  38          78          84
## 9  Nicolette  46          91          71
## 10   Stephon  33          64          20
```
]

---

## Nested lists

```r
fancy_nested_list <-
  list(
    fancy_vectors = list(numeric_vector, character_vector),
    data_stuff = list(fancy_matrix, fancy_data)
  )

fancy_nested_list
```

---
class: middle
.tinyish[

```
## $fancy_vectors
## $fancy_vectors[[1]]
## [1] 1 2 3 4
## 
## $fancy_vectors[[2]]
## [1] "a" "b" "c" "d"
## 
## 
## $data_stuff
## $data_stuff[[1]]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
## 
## $data_stuff[[2]]
##          who age salary_2018 salary_2019
## 1     Westen  14          15          58
## 2     Waseef  35          37          24
## 3      Logan  38          59          54
## 4     Derick  16          95         100
## 5    Patrick  23          71          95
## 6     Robert  43          93          94
## 7       Alix  22          69          93
## 8       Anna  38          78          84
## 9  Nicolette  46          91          71
## 10   Stephon  33          64          20
```
]
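
To get information back out of a (nested) list, we can step through its levels. The following is a minimal sketch with a small made-up list (the object and element names are assumptions for illustration): `$` selects an element by name, and `[[ ]]` selects an element by position.

```r
# a small nested list (object and element names are made up for this sketch)
small_nested_list <-
  list(
    some_vectors = list(c(1, 2, 3, 4), c("a", "b", "c", "d")),
    some_matrix = matrix(1:4, nrow = 2)
  )

# names of the top-level elements
names(small_nested_list)

# step down one level with $, then pick an element with [[ ]]
small_nested_list$some_vectors[[2]]

# ...and index into that element as usual
small_nested_list$some_vectors[[2]][3]

# str() gives a compact overview of the whole structure
str(small_nested_list)
```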

---

## Accessing elements by index

Generally, the logic of `[index_number]` is used in `R` to access only a subset of information in an object, no matter if we have vectors or data frames.

Say we want to extract the second element of our `character_vector` object. We can do that like this:

```r
character_vector[2]
```

```
## [1] "b"
```
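
The index does not have to be a single number. As a short sketch (re-creating `character_vector` so that the example runs on its own), we can also pass several positions, negative positions, or a logical vector:

```r
character_vector <- c("a", "b", "c", "d")

# several elements at once
character_vector[c(1, 3)]
character_vector[2:4]

# a negative index drops elements
character_vector[-1]

# a logical vector keeps the elements where it is TRUE
character_vector[character_vector != "b"]
```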

---

## More complicated cases: matrices

Matrices have two dimensions, so you often want information from a specific row and column.

```r
# general pattern: a_wonderful_matrix[number_of_row, number_of_column]
fancy_matrix[2, 3] # element in row 2, column 3
```

*Note*: You can do the same indexing with `data.frame`s. We will talk more about this in the session on *Data Wrangling Basics*.

---

## Matrices and subscripts (as in mathematical notation)

Identifying rows, columns, or elements using subscripts is similar to matrix notation:

```r
fancy_matrix[, 4] # 4th column of matrix
fancy_matrix[3, ] # 3rd row of matrix
fancy_matrix[2:4, 1:3] # rows 2, 3, 4 of columns 1, 2, 3
```

It's really like in math, and you can perform standard mathematical operations, such as matrix multiplications.

```r
fancy_matrix[2:4, 1:3] %*% fancy_matrix[1:3, 2:4]
```

```
##      [,1] [,2] [,3]
## [1,]  116  188  260
## [2,]  134  218  302
## [3,]  152  248  344
```
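
One detail that is easy to miss: `*` multiplies matrices element by element, whereas `%*%` performs matrix multiplication. A minimal sketch with a small 2 x 2 matrix (`small_matrix` is made up for illustration):

```r
small_matrix <- matrix(1:4, nrow = 2)

# * multiplies element by element
small_matrix * small_matrix

# %*% is matrix multiplication
small_matrix %*% small_matrix

# t() transposes, dim() returns the number of rows and columns
t(small_matrix)
dim(small_matrix)
```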

---

## The case of data frames

A nice feature of `data.frames` or `tibbles` is that their columns are named, just like variables in ordinary data sets. It would be cumbersome to use index numbers to extract a specific column/variable, right? Do not fear:

```r
fancy_data$who
```

```
##  [1] "Westen"    "Waseef"    "Logan"     "Derick"    "Patrick"   "Robert"    "Alix"      "Anna"      "Nicolette" "Stephon"
```

Just place a `$`-sign between the data object and the variable name.

---

## `[]` in data frames

Sometimes we also have to rely on character strings as input, e.g., when iterating over data. We can use `[]` to access variables by position or by name.

```r
fancy_data[1]
```

```
##          who
## 1     Westen
## 2     Waseef
## 3      Logan
## 4     Derick
## 5    Patrick
## 6     Robert
## 7       Alix
## 8       Anna
## 9  Nicolette
## 10   Stephon
```

```r
fancy_data["who"]
```

```
##          who
## 1     Westen
## 2     Waseef
## 3      Logan
## 4     Derick
## 5    Patrick
## 6     Robert
## 7       Alix
## 8       Anna
## 9  Nicolette
## 10   Stephon
```
 
---

## Difference between `[]` and `[[]]`

<img src="data:image/png;base64,#1_2_Data_Types_Import_Export_files/figure-html/hadley-tweet-1.png" width="80%" style="display: block; margin: auto;" />
https://twitter.com/hadleywickham/status/643381054758363136
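
In code, the difference can be sketched as follows: `[]` returns a smaller container of the same kind, whereas `[[]]` (like `$`) returns the content itself. The objects `small_data` and `small_list` are made up for this example.

```r
small_data <- data.frame(
  who = c("Anna", "Logan", "Alix"),
  age = c(38, 38, 22),
  stringsAsFactors = FALSE # keep strings as characters in older R versions
)

# [ ] keeps the container: the result is still a data frame
class(small_data["who"])

# [[ ]] and $ extract the content: the result is a character vector
class(small_data[["who"]])
identical(small_data[["who"]], small_data$who)

# the same logic applies to lists
small_list <- list(numbers = c(1, 2, 3), letters = c("a", "b"))
class(small_list["numbers"]) # a list with one element
class(small_list[["numbers"]]) # the numeric vector itself
```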

---

## Data frame check 1, 2, 1, 2!

Once you start working with data in `R`, a good first thing to do is to have a quick look at them. The most high-level information you can get is about the object type and its dimensions.

```r
# object type
class(fancy_data)
```

```
## [1] "data.frame"
```

```r
# number of rows and columns
dim(fancy_data)
```

```
## [1] 10  4
```

```r
# number of rows
nrow(fancy_data)
```

```
## [1] 10
```

```r
# number of columns
ncol(fancy_data)
```

```
## [1] 4
```

---

## Data frame check 1, 2, 1, 2!

You can also print the first 6 rows of the data frame with `head()`. You can change the number of rows by providing it as the second argument to `head()`.

```r
head(fancy_data, 3)
```

```
##      who age salary_2018 salary_2019
## 1 Westen  14          15          58
## 2 Waseef  35          37          24
## 3  Logan  38          59          54
```

---

## Data frame check 1, 2, 1, 2!

If we want some more (detailed) information about the data set or object, we can use the `base R` function `str()`.

```r
str(fancy_data)
```

```
## 'data.frame':	10 obs. of  4 variables:
##  $ who        : chr  "Westen" "Waseef" "Logan" "Derick" ...
##  $ age        : int  14 35 38 16 23 43 22 38 46 33
##  $ salary_2018: int  15 37 59 95 71 93 69 78 91 64
##  $ salary_2019: int  58 24 54 100 95 94 93 84 71 20
```

---

## Data frame check 1, 2, 1, 2!

If you want to have a look at your full data set, you can use the `View()` function. In *RStudio*, this will open a new tab in the source pane through which you can explore the data set (including a search function). You can also click on the small spreadsheet symbol on the right side of the object in the environment tab to open this view.

```r
View(fancy_data)
```

---

## Viewing and changing names

We can print all names of an object using the `names()` function...

```r
names(fancy_data)
```

```
## [1] "who"         "age"         "salary_2018" "salary_2019"
```

...and we can also change names with it.

```r
names(fancy_data) <- c("name", "age", "salary_2018", "salary_2019")

names(fancy_data)
```

```
## [1] "name"        "age"         "salary_2018" "salary_2019"
```

However, there are more flexible ways of doing this, as we will see in the session on *Data Wrangling Basics* tomorrow.
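
If you only want to rename a single column, you do not have to retype all names. One possible `base R` approach, sketched here with a small made-up data frame (`small_data` is an assumption for illustration), is to combine `names()` with the indexing logic from above:

```r
small_data <- data.frame(who = c("Anna", "Logan"), age = c(38, 38))

# rename only the column that is currently called "who"
names(small_data)[names(small_data) == "who"] <- "name"

names(small_data)
```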

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_1_Data_Types.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_1_Data_Types.html)

---

## GESIS Panel Data on the Coronavirus Outbreak
.left-column[
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\gesis_panel_logo_web.jpg" width="372" style="display: block; margin: auto;" />
]

.right-column[
For most of the examples and exercises in this course we will use the [Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany](https://www.gesis.org/gesis-panel/coronavirus-outbreak/public-use-file-puf). You can [download the data set in different formats as well as the codebook and the questionnaire (in German) from the *GESIS* Data Archive](https://search.gesis.org/research_data/ZA5667) (note: you need to have/create a user account).

The *GESIS Panel* website provides [detailed documentation](https://www.gesis.org/gesis-panel/documentation), including a [cheatsheet](https://www.gesis.org/fileadmin/upload/GESIS_Panel/Cheatsheet/gesis_panel_cheatsheet.pdf).
]

---

## Gapminder Data
.left-column[ 
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\gapminder_logo.png" width="1200" style="display: block; margin: auto;" />
]

.right-column[
We will also use [data from *Gapminder*](https://www.gapminder.org/data/). During the course and the exercises, we work with data we have downloaded from their website. There also is an `R` package that bundles some of the *Gapminder* data: `install.packages("gapminder")`.

This `R` package provides ["[a]n excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007."](https://cran.r-project.org/web/packages/gapminder/index.html)
]

---

## How to use the data in general

To code along and be able to do the exercises, you should store the data files for the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* in a folder called `data` in the same folder as the other materials for this course.
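
If you are unsure whether `R` can find your `data` folder, a quick check along the following lines may help. This is only a sketch: it assumes that your working directory is the folder with the course materials, and the results depend on your own setup.

```r
# where is R currently looking for files?
getwd()

# does a folder called "data" exist there, and what does it contain?
dir.exists("data")
list.files("data")

# file.path() builds a path from its parts
file.path("data", "titanic.csv")
```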

---

## `R` is data-agnostic
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\Datenimport.PNG" width="65%" style="display: block; margin: auto;" />

---

## Data formats & packages
.pull-left[
**What you will learn**
- Getting the most common data formats into `R`
  - e.g., CSV, *Stata*, *SPSS*, or *Excel* spreadsheets
- Using the different methods of doing that
- We will rely a lot on packages and functions from the `tidyverse` instead of using `base R`
]

.pull-right[
**What you won't learn**
- Getting old & obscure binary data formats into `R`
  - ... although [that is possible](https://cran.r-project.org/doc/manuals/r-release/R-data.html)
]

---

## Before writing any code: *RStudio* functionality for importing data
You can use the *RStudio* GUI for importing data via `Environment - Import data set - Choose file type`.

---

## Where to find data

**Browse Button in `RStudio`**
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\importBrowse.PNG" width="75%" style="display: block; margin: auto;" />

**Code preview in `RStudio`**
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\codepreview.PNG" width="75%" style="display: block; margin: auto;" />

---

## Honestly, after some time you will write the code directly

.center[
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\coding_cat.gif" style="display: block; margin: auto;" />
.footnote[[Source](https://media.giphy.com/media/LmNwrBhejkK9EFP504/source.gif)]
]

---

## Honestly, after some time you will write the code directly

.center[
<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\hadley-typing.gif" style="display: block; margin: auto;" />
[Source](https://tenor.com/view/hadley-wickham-rstats-typing-rcode-gif-11365139)
]

---

## Simple vs. not so simple file formats

Basic file formats, such as CSV (comma-separated values), can be imported into `R` directly
- they are 'flat'
- little metadata
- basically text files

Other file formats, particularly the proprietary ones, require the use of additional packages
- they are complex
- a lot of metadata (think of all the labels in an *SPSS* file)
- they are binary (1110101)
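
To see what "basically text files" means, here is a minimal sketch that exports a tiny made-up data frame as a CSV file and then looks at the raw text. It writes to a temporary file, so nothing in your course folder is changed. All object names are assumptions for illustration.

```r
# a tiny made-up data set
small_data <- data.frame(id = 1:3, name = c("a", "b", "c"))

# write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(small_data, csv_path, row.names = FALSE)

# a CSV file is plain text: we can read it line by line
readLines(csv_path)

# read.csv() turns the text back into a data frame
reimported_data <- read.csv(csv_path)
reimported_data
```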

---

## File format wars

<img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\r-intro-gesis-2021\content\img\norm_normal_file_format.png" width="30%" style="display: block; margin: auto;" />
https://xkcd.com/2116/

---

## Disclaimer

**In the following slides, we'll jump right into importing data. We use a lot of different packages for this purpose, and you don't have to remember everything. It's just for making a point of how agnostic `R` actually is regarding the file type. Later on, we will dive more into the specifics of importing.**

---

## Importing a CSV file using `base R`

```r
titanic <- read.csv("./data/titanic.csv")

titanic
```

```
##    PassengerId Survived Pclass                                                      Name    Sex   Age SibSp Parch           Ticket
## 1            1        0      3                                   Braund, Mr. Owen Harris   male 22.00     1     0        A/5 21171
## 2            2        1      1       Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.00     1     0         PC 17599
## 3            3        1      3                                    Heikkinen, Miss. Laina female 26.00     0     0 STON/O2. 3101282
## 4            4        1      1              Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.00     1     0           113803
## 5            5        0      3                                  Allen, Mr. William Henry   male 35.00     0     0           373450
## 6            6        0      3                                          Moran, Mr. James   male    NA     0     0           330877
## 7            7        0      1                                   McCarthy, Mr. Timothy J   male 54.00     0     0            17463
## 8            8        0      3                            Palsson, Master. Gosta Leonard   male  2.00     3     1           349909
## 9            9        1      3         Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.00     0     2           347742
## 10          10        1      2                       Nasser, Mrs. Nicholas (Adele Achem) female 14.00     1     0           237736
## 11          11        1      3                           Sandstrom, Miss. Marguerite Rut female  4.00     1     1          PP 9549
## 12          12        1      1                                  Bonnell, Miss. Elizabeth female 58.00     0     0           113783
## 13          13        0      3                            Saundercock, Mr. William Henry   male 20.00     0     0        A/5. 2151
## 14          14        0      3                               Andersson, Mr. Anders Johan   male 39.00     1     5           347082
## 15          15        0      3                      Vestrom, Miss. Hulda Amanda Adolfina female 14.00     0     0           350406
## 16          16        1      2                          Hewlett, Mrs. (Mary D Kingcome)  female 55.00     0     0           248706
## 17          17        0      3                                      Rice, Master. Eugene   male  2.00     4     1           382652
## 18          18        1      2                              Williams, Mr. Charles Eugene   male    NA     0     0           244373
## 19          19        0      3   Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) female 31.00     1     0           345763
## 20          20        1      3                                   Masselmani, Mrs. Fatima female    NA     0     0             2649
## 21          21        0      2                                      Fynney, Mr. Joseph J   male 35.00     0     0           239865
## 22          22        1      2                                     Beesley, Mr. Lawrence   male 34.00     0     0           248698
## 23          23        1      3                               McGowan, Miss. Anna "Annie" female 15.00     0     0           330923
## 24          24        1      1                              Sloper, Mr. William Thompson   male 28.00     0     0           113788
## 25          25        0      3                             Palsson, Miss. Torborg Danira female  8.00     3     1           349909
## 26          26        1      3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) female 38.00     1     5           347077
## 27          27        0      3                                   Emir, Mr. Farred Chehab   male    NA     0     0             2631
## 28          28        0      1                            Fortune, Mr. Charles Alexander   male 19.00     3     2            19950
## 29          29        1      3                             O'Dwyer, Miss. Ellen "Nellie" female    NA     0     0           330959
## 30          30        0      3                                       Todoroff, Mr. Lalio   male    NA     0     0           349216
## 31          31        0      1                                  Uruchurtu, Don. Manuel E   male 40.00     0     0         PC 17601
## 32          32        1      1            Spencer, Mrs. William Augustus (Marie Eugenie) female    NA     1     0         PC 17569
## 33          33        1      3                                  Glynn, Miss. Mary Agatha female    NA     0     0           335677
## 34          34        0      2                                     Wheadon, Mr. Edward H   male 66.00     0     0       C.A. 24579
## 35          35        0      1                                   Meyer, Mr. Edgar Joseph   male 28.00     1     0         PC 17604
## 36          36        0      1                            Holverson, Mr. Alexander Oskar   male 42.00     1     0           113789
## 37          37        1      3                                          Mamee, Mr. Hanna   male    NA     0     0             2677
## 38          38        0      3                                  Cann, Mr. Ernest Charles   male 21.00     0     0       A./5. 2152
## 39          39        0      3                        Vander Planke, Miss. Augusta Maria female 18.00     2     0           345764
## 40          40        1      3                               Nicola-Yarred, Miss. Jamila female 14.00     1     0             2651
## 41          41        0      3            Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.00     1     0             7546
## 42          42        0      2  Turpin, Mrs. William John Robert (Dorothy Ann Wonnacott) female 27.00     1     0            11668
## 43          43        0      3                                       Kraeff, Mr. Theodor   male    NA     0     0           349253
## 44          44        1      2                  Laroche, Miss. Simonne Marie Anne Andree female  3.00     1     2    SC/Paris 2123
## 45          45        1      3                             Devaney, Miss. Margaret Delia female 19.00     0     0           330958
## 46          46        0      3                                  Rogers, Mr. William John   male    NA     0     0  S.C./A.4. 23567
## 47          47        0      3                                         Lennon, Mr. Denis   male    NA     1     0           370371
## 48          48        1      3                                 O'Driscoll, Miss. Bridget female    NA     0     0            14311
## 49          49        0      3                                       Samaan, Mr. Youssef   male    NA     2     0             2662
## 50          50        0      3             Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.00     1     0           349237
## 51          51        0      3                                Panula, Master. Juha Niilo   male  7.00     4     1          3101295
## 52          52        0      3                              Nosworthy, Mr. Richard Cater   male 21.00     0     0       A/4. 39886
## 53          53        1      1                  Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.00     1     0         PC 17572
## 54          54        1      2        Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson) female 29.00     1     0             2926
## 55          55        0      1                            Ostby, Mr. Engelhart Cornelius   male 65.00     0     1           113509
## 56          56        1      1                                         Woolner, Mr. Hugh   male    NA     0     0            19947
## 57          57        1      2                                         Rugg, Miss. Emily female 21.00     0     0       C.A. 31026
## 58          58        0      3                                       Novel, Mr. Mansouer   male 28.50     0     0             2697
## 59          59        1      2                              West, Miss. Constance Mirium female  5.00     1     2       C.A. 34651
## 60          60        0      3                        Goodwin, Master. William Frederick   male 11.00     5     2          CA 2144
## 61          61        0      3                                     Sirayanian, Mr. Orsen   male 22.00     0     0             2669
## 62          62        1      1                                       Icard, Miss. Amelie female 38.00     0     0           113572
## 63          63        0      1                               Harris, Mr. Henry Birkhardt   male 45.00     1     0            36973
## 64          64        0      3                                     Skoog, Master. Harald   male  4.00     3     2           347088
## 65          65        0      1                                     Stewart, Mr. Albert A   male    NA     0     0         PC 17605
## 66          66        1      3                                  Moubarek, Master. Gerios   male    NA     1     1             2661
## 67          67        1      2                              Nye, Mrs. (Elizabeth Ramell) female 29.00     0     0       C.A. 29395
## 68          68        0      3                                  Crease, Mr. Ernest James   male 19.00     0     0        S.P. 3464
## 69          69        1      3                           Andersson, Miss. Erna Alexandra female 17.00     4     2          3101281
## 70          70        0      3                                         Kink, Mr. Vincenz   male 26.00     2     0           315151
## 71          71        0      2                                Jenkin, Mr. Stephen Curnow   male 32.00     0     0       C.A. 33111
## 72          72        0      3                                Goodwin, Miss. Lillian Amy female 16.00     5     2          CA 2144
## 73          73        0      2                                      Hood, Mr. Ambrose Jr   male 21.00     0     0     S.O.C. 14879
## 74          74        0      3                               Chronopoulos, Mr. Apostolos   male 26.00     1     0             2680
## 75          75        1      3                                             Bing, Mr. Lee   male 32.00     0     0             1601
## 76          76        0      3                                   Moen, Mr. Sigurd Hansen   male 25.00     0     0           348123
## 77          77        0      3                                         Staneff, Mr. Ivan   male    NA     0     0           349208
## 78          78        0      3                                  Moutal, Mr. Rahamin Haim   male    NA     0     0           374746
## 79          79        1      2                             Caldwell, Master. Alden Gates   male  0.83     0     2           248738
## 80          80        1      3                                  Dowdell, Miss. Elizabeth female 30.00     0     0           364516
## 81          81        0      3                                      Waelens, Mr. Achille   male 22.00     0     0           345767
## 82          82        1      3                               Sheerlinck, Mr. Jan Baptist   male 29.00     0     0           345779
## 83          83        1      3                            McDermott, Miss. Brigdet Delia female    NA     0     0           330932
##        Fare       Cabin Embarked
## 1    7.2500                    S
## 2   71.2833         C85        C
## 3    7.9250                    S
## 4   53.1000        C123        S
## 5    8.0500                    S
## 6    8.4583                    Q
## 7   51.8625         E46        S
## 8   21.0750                    S
## 9   11.1333                    S
## 10  30.0708                    C
## 11  16.7000          G6        S
## 12  26.5500        C103        S
## 13   8.0500                    S
## 14  31.2750                    S
## 15   7.8542                    S
## 16  16.0000                    S
## 17  29.1250                    Q
## 18  13.0000                    S
## 19  18.0000                    S
## 20   7.2250                    C
## 21  26.0000                    S
## 22  13.0000         D56        S
## 23   8.0292                    Q
## 24  35.5000          A6        S
## 25  21.0750                    S
## 26  31.3875                    S
## 27   7.2250                    C
## 28 263.0000 C23 C25 C27        S
## 29   7.8792                    Q
## 30   7.8958                    S
## 31  27.7208                    C
## 32 146.5208         B78        C
## 33   7.7500                    Q
## 34  10.5000                    S
## 35  82.1708                    C
## 36  52.0000                    S
## 37   7.2292                    C
## 38   8.0500                    S
## 39  18.0000                    S
## 40  11.2417                    C
## 41   9.4750                    S
## 42  21.0000                    S
## 43   7.8958                    C
## 44  41.5792                    C
## 45   7.8792                    Q
## 46   8.0500                    S
## 47  15.5000                    Q
## 48   7.7500                    Q
## 49  21.6792                    C
## 50  17.8000                    S
## 51  39.6875                    S
## 52   7.8000                    S
## 53  76.7292         D33        C
## 54  26.0000                    S
## 55  61.9792         B30        C
## 56  35.5000         C52        S
## 57  10.5000                    S
## 58   7.2292                    C
## 59  27.7500                    S
## 60  46.9000                    S
## 61   7.2292                    C
## 62  80.0000         B28         
## 63  83.4750         C83        S
## 64  27.9000                    S
## 65  27.7208                    C
## 66  15.2458                    C
## 67  10.5000         F33        S
## 68   8.1583                    S
## 69   7.9250                    S
## 70   8.6625                    S
## 71  10.5000                    S
## 72  46.9000                    S
## 73  73.5000                    S
## 74  14.4542                    C
## 75  56.4958                    S
## 76   7.6500       F G73        S
## 77   7.8958                    S
## 78   8.0500                    S
## 79  29.0000                    S
## 80  12.4750                    S
## 81   9.0000                    S
## 82   9.5000                    S
## 83   7.7875                    Q
##  [ reached 'max' / getOption("max.print") -- omitted 808 rows ]
```
]

---

## A `readr` example: `CSV` files

```r
library(readr)

titanic <- read_csv("./data/titanic.csv")
```

---
class: middle

```r
titanic
```

```
## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name                                        Sex      Age SibSp Parch Ticket          Fare Cabin Embarked
##          <dbl>    <dbl>  <dbl> <chr>                                       <chr>  <dbl> <dbl> <dbl> <chr>          <dbl> <chr> <chr>   
##  1           1        0      3 Braund, Mr. Owen Harris                     male      22     1     0 A/5 21171       7.25 <NA>  S       
##  2           2        1      1 Cumings, Mrs. John Bradley (Florence Brigg~ female    38     1     0 PC 17599       71.3  C85   C       
##  3           3        1      3 Heikkinen, Miss. Laina                      female    26     0     0 STON/O2. 3101~  7.92 <NA>  S       
##  4           4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Pee~ female    35     1     0 113803         53.1  C123  S       
##  5           5        0      3 Allen, Mr. William Henry                    male      35     0     0 373450          8.05 <NA>  S       
##  6           6        0      3 Moran, Mr. James                            male      NA     0     0 330877          8.46 <NA>  Q       
##  7           7        0      1 McCarthy, Mr. Timothy J                     male      54     0     0 17463          51.9  E46   S       
##  8           8        0      3 Palsson, Master. Gosta Leonard              male       2     3     1 349909         21.1  <NA>  S       
##  9           9        1      3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmin~ female    27     0     2 347742         11.1  <NA>  S       
## 10          10        1      2 Nasser, Mrs. Nicholas (Adele Achem)         female    14     1     0 237736         30.1  <NA>  C       
## # ... with 881 more rows
```

Note the column specifications: `readr` 'guesses' them based on the first 1000 observations (we will come back to this later).

---

## Importing *Excel* files with `readxl`

```r
library(readxl)

unicorns <- read_xlsx("./data/observations.xlsx")
```

No output ☹️

---
class: middle

```r
unicorns
```

```
## # A tibble: 42 x 3
##    countryname  year   pop
##    <chr>       <dbl> <dbl>
##  1 Austria      1670    85
##  2 Austria      1671    83
##  3 Austria      1674    75
##  4 Austria      1675    82
##  5 Austria      1676    79
##  6 Austria      1677    70
##  7 Austria      1678    81
##  8 Austria      1680    80
##  9 France       1673    70
## 10 France       1674    79
## # ... with 32 more rows
```

---

## *Stata* files with `haven`

```r
library(haven)

gp_covid <- 
  read_stata("./data/ZA5667_v1-1-0_Stata14.dta")

gp_covid
```

---

```
## # A tibble: 3,765 x 137
##    za_number version  doi       id cohort      sex   age_cat education_cat intention_to_vote   choice_of_party political_orien~ marstat
##    <chr>     <chr>    <chr>  <dbl>  <dbl> <dbl+lb> <dbl+lbl>     <dbl+lbl>         <dbl+lbl>         <dbl+lbl>        <dbl+lbl> <dbl+l>
##  1 ZA5667    v1-1-0 ~ 10.42~     1      3 1 [Männ~  7 [51 b~    3 [Hoch]     2 [Ja, ich wür~   1 [CDU/CSU]        6 [6]         2 [Led~
##  2 ZA5667    v1-1-0 ~ 10.42~     2      1 2 [Weib~  7 [51 b~    2 [Mittel]   2 [Ja, ich wür~   5 [BÜNDNIS 90/~    5 [5]         1 [Ver~
##  3 ZA5667    v1-1-0 ~ 10.42~     3      3 1 [Männ~  8 [61 b~    2 [Mittel]   2 [Ja, ich wür~   1 [CDU/CSU]        5 [5]         1 [Ver~
##  4 ZA5667    v1-1-0 ~ 10.42~     4      2 1 [Männ~  4 [36 b~    3 [Hoch]     2 [Ja, ich wür~   1 [CDU/CSU]        7 [7]         1 [Ver~
##  5 ZA5667    v1-1-0 ~ 10.42~     5      1 2 [Weib~  1 [<=25~    3 [Hoch]   -33 [Unit nonres~ -33 [Unit nonres~    4 [4]         2 [Led~
##  6 ZA5667    v1-1-0 ~ 10.42~     6      1 1 [Männ~ 10 [>=71~    2 [Mittel]   2 [Ja, ich wür~   6 [Alternative~   10 [10 Rechts] 1 [Ver~
##  7 ZA5667    v1-1-0 ~ 10.42~     7      1 2 [Weib~  4 [36 b~    2 [Mittel]   2 [Ja, ich wür~   6 [Alternative~    5 [5]         1 [Ver~
##  8 ZA5667    v1-1-0 ~ 10.42~     8      2 2 [Weib~  7 [51 b~    3 [Hoch]     2 [Ja, ich wür~   5 [BÜNDNIS 90/~    6 [6]         1 [Ver~
##  9 ZA5667    v1-1-0 ~ 10.42~     9      1 1 [Männ~  8 [61 b~    1 [Gering]   2 [Ja, ich wür~   1 [CDU/CSU]        6 [6]         1 [Ver~
## 10 ZA5667    v1-1-0 ~ 10.42~    10      1 1 [Männ~  1 [<=25~    3 [Hoch]     2 [Ja, ich wür~   2 [SPD]            7 [7]         2 [Led~
## # ... with 3,755 more rows, and 125 more variables: household <dbl+lbl>, hzcy001a <dbl+lbl>, hzcy002a <dbl+lbl>, hzcy003a <dbl+lbl>,
## #   hzcy004a <dbl+lbl>, hzcy005a <dbl+lbl>, hzcy006a <dbl+lbl>, hzcy007a <dbl+lbl>, hzcy008a <dbl+lbl>, hzcy009a <dbl+lbl>,
## #   hzcy010a <dbl+lbl>, hzcy011a <dbl+lbl>, hzcy012a <dbl+lbl>, hzcy013a <dbl+lbl>, hzcy014a <dbl+lbl>, hzcy015a <dbl+lbl>,
## #   hzcy016a <dbl+lbl>, hzcy018a <dbl+lbl>, hzcy019a <dbl+lbl>, hzcy020a <dbl+lbl>, hzcy021a <dbl+lbl>, hzcy022a <dbl+lbl>,
## #   hzcy023a <dbl+lbl>, hzcy024a <dbl+lbl>, hzcy025a <dbl+lbl>, hzcy026a <dbl+lbl>, hzcy027a <dbl+lbl>, hzcy028a <dbl+lbl>,
## #   hzcy029a <dbl+lbl>, hzcy030a <dbl+lbl>, hzcy031a <dbl+lbl>, hzcy032a <dbl+lbl>, hzcy033a <dbl+lbl>, hzcy034a <dbl+lbl>,
## #   hzcy035a <dbl+lbl>, hzcy036a <dbl+lbl>, hzcy037a <dbl+lbl>, hzcy038a <dbl+lbl>, hzcy039a <dbl+lbl>, hzcy040a <dbl+lbl>,
## #   hzcy041a <dbl+lbl>, hzcy042a <dbl+lbl>, hzcy043a <dbl+lbl>, hzcy044a <dbl+lbl>, hzcy045a <dbl+lbl>, hzcy046a <dbl+lbl>,
## #   hzcy047a <dbl+lbl>, hzcy048a <dbl+lbl>, hzcy049a <dbl+lbl>, hzcy050a <dbl+lbl>, hzcy051a <dbl+lbl>, hzcy052a <dbl+lbl>,
## #   hzcy053a <dbl+lbl>, hzcy054a <dbl+lbl>, hzcy055a <dbl+lbl>, hzcy056a <dbl+lbl>, hzcy057a <dbl+lbl>, hzcy058a <dbl+lbl>,
## #   hzcy059a <dbl+lbl>, hzcy060a <dbl+lbl>, hzcy061a <dbl+lbl>, hzcy062a <dbl+lbl>, hzcy063a <dbl+lbl>, hzcy064a <dbl+lbl>,
## #   hzcy065a <dbl+lbl>, hzcy066a <dbl+lbl>, hzcy067a <dbl+lbl>, hzcy068a <dbl+lbl>, hzcy069a <dbl+lbl>, hzcy070a <dbl+lbl>,
## #   hzcy071a <dbl+lbl>, hzcy072a <dbl+lbl>, hzcy073a <dbl+lbl>, hzcy074a <dbl+lbl>, hzcy075a <dbl+lbl>, hzcy076a <dbl+lbl>,
## #   hzcy077a <dbl+lbl>, hzcy078a <dbl+lbl>, hzcy079a <dbl+lbl>, hzcy080a <dbl+lbl>, hzcy081a <dbl+lbl>, hzcy083a <dbl+lbl>,
## #   hzcy084a <dbl+lbl>, hzcy085a <dbl+lbl>, hzcy086a <dbl+lbl>, hzcy087a <dbl+lbl>, hzcy088a <dbl+lbl>, hzcy089a <dbl+lbl>,
## #   hzcy090a <dbl+lbl>, hzcy091a <dbl+lbl>, hzcy092a <dbl+lbl>, hzcy093a <dbl+lbl>, hzcy095a <dbl+lbl>, hzcy096a <dbl+lbl>,
## #   hzcy097a <dbl+lbl>, hzcy098a <dbl+lbl>, hzcy099a <dbl+lbl>, hzza001a <dbl+lbl>, hzza002a <dbl+lbl>, hzza003a <dbl+lbl>, ...
```

---

## *SPSS* files with `haven`

The `haven` package also offers the function `read_spss()` for importing *SPSS* files.

The package also offers capabilities for handling *SPSS*-defined missing values by setting the option `user_na = TRUE` (default is `FALSE`).
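
As a minimal sketch of what `user_na` does, the following example writes a small toy *SPSS* file to a temporary location and reads it back twice. The data, the variable name `answer`, and the missing code `-99` are made up for illustration.

```r
library(haven)

# Hypothetical toy variable: -99 is declared as a user-defined missing value
answer <- labelled_spss(
  c(1, 2, -99),
  labels = c(yes = 1, no = 2, refused = -99),
  na_values = -99
)
toy <- tibble::tibble(answer = answer)

# Write the toy data to a temporary .sav file
sav_path <- tempfile(fileext = ".sav")
write_sav(toy, sav_path)

# Default (user_na = FALSE): user-defined missings are converted to NA
default_import <- read_spss(sav_path)

# user_na = TRUE: the original codes are kept and flagged as missing
user_na_import <- read_spss(sav_path, user_na = TRUE)

default_import
user_na_import
```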

*Note*: The [`sjlabelled` package](https://cran.r-project.org/web/packages/sjlabelled/index.html) can also be used for [working with user-defined missings from *SPSS* files](https://cran.r-project.org/web/packages/sjlabelled/vignettes/intro_sjlabelled.html).

**We will come back to *Stata* and *SPSS* files in a bit as they represent a specific type of data in `R`: labelled data.**

---

## Other data import options

These were just some very first examples of applying functions for data import from the different packages. There are many more...

.pull-left[
`readr`
- `read_csv()`
- `read_tsv()`
- `read_delim()`
- `read_fwf()`
- `read_table()`
- `read_log()`
]

Not to mention all the helper functions and options. For example, we can define the cells to read from an *Excel* file by specifying the option `range = "C1:E4"` in `read_excel()`.
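
A short sketch of the `range` option, using the example workbook that ships with `readxl` (so this is not one of the course data sets):

```r
library(readxl)

# Path to an example workbook bundled with the readxl package
xlsx_path <- readxl_example("datasets.xlsx")

# Read only the cells C1:E4 from the first sheet:
# row 1 supplies the column names, rows 2 to 4 the data
iris_part <- read_excel(xlsx_path, range = "C1:E4")

iris_part
```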

---

## Data type specifications for `tibbles`

- characters
  - indicated by `<chr>`
  - specified by `col_character()`
- integers
  - indicated by `<int>`
  - specified by `col_integer()`
- doubles
  - indicated by `<dbl>`
  - specified by `col_double()`
- factors
  - indicated by `<fct>`
  - specified by `col_factor()`
- logical
  - indicated by `<lgl>`
  - specified by `col_logical()`
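
---

To see how these specifications show up in a tibble, here is a small sketch with a made-up toy file (the file contents and column names are assumptions for illustration only):

```r
library(readr)

# Hypothetical toy file written to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "name,siblings,fare,sex,survived",
    "Braund,1,7.25,male,FALSE",
    "Cumings,1,71.28,female,TRUE",
    "Heikkinen,0,7.92,female,TRUE"
  ),
  csv_path
)

# One col_*() function per column
toy <- read_csv(
  csv_path,
  col_types = cols(
    name = col_character(),   # shown as <chr>
    siblings = col_integer(), # shown as <int>
    fare = col_double(),      # shown as <dbl>
    sex = col_factor(),       # shown as <fct>
    survived = col_logical()  # shown as <lgl>
  )
)

toy
```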

---

## Changing variable types

As mentioned before, `read_csv` 'guesses' the variable types by scanning the first 1000 observations. **NB**: This can go wrong!

Luckily, we can change the variable type...
- before/while loading the data
- and after loading the data
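
---

Before changing anything, it can help to check what `readr` actually guessed. One way to do this is the `spec()` function; the toy file below is made up for illustration.

```r
library(readr)

# Hypothetical toy file written to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id,ticket",
    "1,A/5 21171",
    "2,PC 17599",
    "3,113803"
  ),
  csv_path
)

# No col_types given, so readr guesses the column types
guessed <- read_csv(csv_path)

# Show the column specification that was used for the import
spec(guessed)
```

If a guess is not what you want, the printed specification can serve as a starting point for your own `col_types` definition.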

---

## While loading the data in `read_csv`

```r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_character(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

---


## While loading the data in `read_csv`

```r
titanic <-
  read_csv(
    "./data/titanic.csv",
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(), # This one changed!
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )

titanic
```

---

```
## # A tibble: 891 x 12
##    PassengerId Survived Pclass Name                                        Sex      Age SibSp Parch Ticket          Fare Cabin Embarked
##          <dbl>    <dbl>  <dbl> <chr>                                       <fct>  <dbl> <dbl> <dbl> <chr>          <dbl> <chr> <chr>   
##  1           1        0      3 Braund, Mr. Owen Harris                     male      22     1     0 A/5 21171       7.25 <NA>  S       
##  2           2        1      1 Cumings, Mrs. John Bradley (Florence Brigg~ female    38     1     0 PC 17599       71.3  C85   C       
##  3           3        1      3 Heikkinen, Miss. Laina                      female    26     0     0 STON/O2. 3101~  7.92 <NA>  S       
##  4           4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Pee~ female    35     1     0 113803         53.1  C123  S       
##  5           5        0      3 Allen, Mr. William Henry                    male      35     0     0 373450          8.05 <NA>  S       
##  6           6        0      3 Moran, Mr. James                            male      NA     0     0 330877          8.46 <NA>  Q       
##  7           7        0      1 McCarthy, Mr. Timothy J                     male      54     0     0 17463          51.9  E46   S       
##  8           8        0      3 Palsson, Master. Gosta Leonard              male       2     3     1 349909         21.1  <NA>  S       
##  9           9        1      3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmin~ female    27     0     2 347742         11.1  <NA>  S       
## 10          10        1      2 Nasser, Mrs. Nicholas (Adele Achem)         female    14     1     0 237736         30.1  <NA>  C       
## # ... with 881 more rows
```

---

## After loading the data

```r
titanic <-
  type_convert(
    titanic,
    col_types = cols(
      PassengerId = col_double(),
      Survived = col_double(),
      Pclass = col_double(),
      Name = col_character(),
      Sex = col_factor(),
      Age = col_double(),
      SibSp = col_double(),
      Parch = col_double(),
      Ticket = col_character(),
      Fare = col_double(),
      Cabin = col_character(),
      Embarked = col_character()
    )
  )
```
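
---

Keep in mind that `type_convert()` only re-parses character columns; columns that already have another type are left as they are. A minimal sketch with made-up toy data:

```r
library(readr)

# Hypothetical toy data in which everything was imported as character
toy <- data.frame(
  sex = c("male", "female", "female"),
  age = c("22", "38", "26"),
  stringsAsFactors = FALSE
)

# Convert the character columns to the desired types
converted <- type_convert(
  toy,
  col_types = cols(
    sex = col_factor(),
    age = col_double()
  )
)

str(converted)
```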

---

## Beyond flat files: labelled data

A lot of data comes in some sort of flat file format, such as `CSV`. In the social sciences, however, we often deal with proprietary file formats, such as *SPSS*'s `.sav` or *Stata*'s `.dta` files.

What these data typically include are labels. These labels are used to describe variables or variable values. They comprise some specific metadata inherent in these proprietary file formats.

*If you were able to travel back ten years in time and ask an `R` geek, she'd say that you cannot use labels in `R`. You'd either have to import, e.g., value labels as character strings or use their codes as factors. However, these days...*

---

## Not being able to use labelled data is a thing of the past

Nowadays, if you use the `haven` package, labels are built-in for the corresponding file types. For example:

```r
gp_covid <-
  haven::read_sav("./data/ZA5667_v1-1-0.sav")

gp_covid["age_cat"]
```

```
## # A tibble: 3,765 x 1
##                 age_cat
##               <dbl+lbl>
##  1  7 [51 bis 60 Jahre]
##  2  7 [51 bis 60 Jahre]
##  3  8 [61 bis 65 Jahre]
##  4  4 [36 bis 40 Jahre]
##  5  1 [<=25 Jahre]     
##  6 10 [>=71 Jahre]     
##  7  4 [36 bis 40 Jahre]
##  8  7 [51 bis 60 Jahre]
##  9  8 [61 bis 65 Jahre]
## 10  1 [<=25 Jahre]     
## # ... with 3,755 more rows
```

---

## Advantages of using labelled data

One could rejoice in not having to consult a codebook anymore, just like in *SPSS* (although glimpsing at the data via code output feels much more... data-geeky).

A clear advantage is that you can re-use the labels in figures and plots. Some `R` packages, such as [`sjPlot`](https://strengejacke.github.io/sjPlot/), even do that automatically.

In addition, when you exchange your data with colleagues who do not use `R` or when you plan to publish your data (which you always should if that is possible), being able to export data you have manipulated in `R` in different formats is great.

**However, be aware of the missing values hell that you may enter due to different missing value definitions in *Stata* and *SPSS*.**

---

## Getting labels

For variables:

```r
sjlabelled::get_label(gp_covid$age_cat)
```

```
## [1] "Alter, kategorisiert"
```

For values:

```r
sjlabelled::get_labels(gp_covid$age_cat)
```

```
##  [1] "<=25 Jahre"      "26 bis 30 Jahre" "31 bis 35 Jahre" "36 bis 40 Jahre" "41 bis 45 Jahre" "46 bis 50 Jahre" "51 bis 60 Jahre"
##  [8] "61 bis 65 Jahre" "66 bis 70 Jahre" ">=71 Jahre"
```

---

## Setting labels: Variables

```r
gp_covid$age_cat <- 
  sjlabelled::set_label(gp_covid$age_cat, label = "Age, categorized")

sjlabelled::get_label(gp_covid$age_cat)
```

```
## [1] "Age, categorized"
```

---

## Setting labels: Values
.tinyish[

```r
gp_covid$age_cat <- 
  sjlabelled::set_labels(
    gp_covid$age_cat,
    labels = 
      c(
        "<=25 years", "26 to 30 years", "31 to 35 years", "36 to 40 years",
        "41 to 45 years", "46 to 50 years", "51 to 60 years", 
        "61 to 65 years", "66 to 70 years", ">=71 years"
      )
  )

sjlabelled::get_labels(gp_covid$age_cat)
```

```
##  [1] "<=25 years"     "26 to 30 years" "31 to 35 years" "36 to 40 years" "41 to 45 years" "46 to 50 years" "51 to 60 years"
##  [8] "61 to 65 years" "66 to 70 years" ">=71 years"
```
]
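
---

If you do not have the survey data at hand, getting and setting labels can also be tried out on a small labelled vector. This is only a sketch: the toy vector below is made up and merely mimics the structure of `age_cat`.

```r
# Hypothetical toy vector with a variable label and value labels
age_cat <- haven::labelled(
  c(1, 2, 2, 3),
  labels = c(
    "<=25 Jahre" = 1,
    "26 bis 30 Jahre" = 2,
    "31 bis 35 Jahre" = 3
  ),
  label = "Alter, kategorisiert"
)

# Get the variable label and the value labels
sjlabelled::get_label(age_cat)
sjlabelled::get_labels(age_cat)

# Replace the variable label
age_cat <- sjlabelled::set_label(age_cat, label = "Age, categorized")

# Replace the value labels (assigned to the values in ascending order)
age_cat <- sjlabelled::set_labels(
  age_cat,
  labels = c("<=25 years", "26 to 30 years", "31 to 35 years")
)

sjlabelled::get_label(age_cat)
sjlabelled::get_labels(age_cat)
```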

---

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_2_Flat_Files.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_2_Flat_Files.html)

---

## Exporting data

Sometimes our data have to leave `R`, for example, if we...
- share data with colleagues who do not use `R`
- want to continue where we left off
  - particularly if data wrangling took a long time
  
For such purposes, we also need a way to export our data.

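The `readr` and `haven` packages we have discussed in this session also have designated functions for that (`readxl` can only read *Excel* files; writing them requires another package, such as `writexl`).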

---

## Examples: CSV and Stata files

```r
write_csv(titanic, "titanic_own.csv")
```

```r
write_dta(titanic, "titanic_own.dta")
```
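
---

To check that an export worked as intended, you can read the file back in right away. The following sketch uses a made-up toy data set and temporary files instead of the `titanic` data:

```r
library(readr)
library(haven)

# Hypothetical toy data
toy <- data.frame(
  id = 1:3,
  fare = c(7.25, 71.28, 7.92)
)

# Temporary file paths so that nothing is written to the project folder
csv_path <- tempfile(fileext = ".csv")
dta_path <- tempfile(fileext = ".dta")

# Export
write_csv(toy, csv_path)
write_dta(toy, dta_path)

# Re-import to check the result
from_csv <- read_csv(csv_path, col_types = cols())
from_dta <- read_dta(dta_path)

from_csv
from_dta
```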

---

## `R`'s native file formats

If you plan to continue to work with `R` (something we would always recommend 😜), there are at least two native 'file formats' to choose from. The advantage of using them is that they are compressed files, so that they don't occupy unnecessarily large disk space. These two formats are `.Rdata`/`.rda` and `.rds`.
The key difference between them is that `.rds` can only hold one object, whereas `.Rdata`/`.rda` can also be used for storing several objects in one file.

---

## `.Rdata`/`.rda`

Saving

```r
save(mydata, file = "mydata.RData")
```

Loading

```r
load("mydata.RData")
```

---

## `.rds`

Saving

```r
saveRDS(mydata, "mydata.rds")
```

Loading

```r
mydata <- readRDS("mydata.rds")
```
  
*Note*: A nice property of `saveRDS()` is that it just saves a representation of the object, which means you can name it whatever you want when loading.
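
---

The difference between the two formats can be sketched as follows. The object names and the temporary file paths are made up for illustration.

```r
# Hypothetical toy objects
mydata <- data.frame(x = 1:3)
other_object <- c("a", "b")

# .RData can hold several objects in one file
rdata_path <- tempfile(fileext = ".RData")
save(mydata, other_object, file = rdata_path)

# Remove the objects, then restore them under their original names
rm(mydata, other_object)
load(rdata_path)

# .rds holds exactly one object
rds_path <- tempfile(fileext = ".rds")
saveRDS(mydata, rds_path)

# The object can get any name when it is read back in
renamed_data <- readRDS(rds_path)

identical(renamed_data, mydata)
```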

---

## Saving just everything

If you have not changed the General Global Options in *RStudio* as suggested in the *Getting Started* session, you may have noticed that, when closing *RStudio*, by default, the program asks you whether you want to save the workspace image.

You can also do that whenever you want using the `save.image()` function:

```r
save.image(file = "my_fancy_workspace.RData")
```

.small[
*Note*: As we've said before, though, this is not something we'd recommend as a workflow. Instead, you should (explicitly and separately) save your `R` scripts and data sets (in appropriate formats).
]

---

## Additional packages

Besides `readr`, `haven` and `readxl`, there also are some other packages that facilitate importing specific data types as tibbles:

- [`sjlabelled`](https://cran.r-project.org/web/packages/sjlabelled/index.html) for labelled data, e.g., from *SPSS* or *Stata*

- [`sf`](https://github.com/r-spatial/sf) for geospatial data

---

## Other packages for data import

For data import (and export) in general, there are even more options, such as...

- `base` R

- the [`foreign` package](https://cran.r-project.org/web/packages/foreign/index.html) for *SPSS* and *Stata* files

- [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) or [`fst`](https://www.fstpackage.org/) for large data sets

- [`jsonlite`](https://cran.r-project.org/web/packages/jsonlite/index.html) for `.json` files

- [`datapasta`](https://github.com/MilesMcBain/datapasta) for copying and pasting data into tribbles (e.g., from websites, *Excel* or *Word* files)

---

## Reminder regarding file paths

In general, you should avoid using absolute file paths to keep your code reproducible and future-proof. We already talked about this in the introduction, but this is particularly important for importing and exporting data.

As a reminder: Absolute file paths look like this (on different OS):

```r
# Windows
load("C:/Users/cool_user/data/fancy_data.Rdata")

# Mac
load("/Users/cool_user/data/fancy_data.Rdata")

# GNU/Linux
load("/home/cool_user/data/fancy_data.Rdata")
```

---

## Use relative paths

Instead of using absolute paths, it is recommended to use relative file paths. The general principle is to start from the directory where your current script is located and navigate to your target location from there. Say we are in the "C:/Users/cool_user/" location on a Windows machine. To load our data, we would use:

```r
load("./data/fancy_data.Rdata")
```

If we were in a different folder, e.g., "C:/Users/cool_user/cat_pics/mittens/", we would use:

```r
load("../../data/fancy_data.Rdata")
```
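
---

The two relative paths can be tried out with a small sketch that rebuilds the folder structure in a temporary directory. The folder and file names follow the example above; `setwd()` is used here only to simulate being in a different folder and is not meant as a workflow recommendation.

```r
# Rebuild the example folder structure in a temporary directory
root <- file.path(tempdir(), "cool_user")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(
  file.path(root, "cat_pics", "mittens"),
  recursive = TRUE,
  showWarnings = FALSE
)

# Hypothetical toy data saved in the data folder
fancy_data <- data.frame(x = 1:3)
save(fancy_data, file = file.path(root, "data", "fancy_data.Rdata"))

# Case 1: we are in "cool_user"
old_wd <- setwd(root)
found_from_root <- file.exists("./data/fancy_data.Rdata")

# Case 2: we are in "cool_user/cat_pics/mittens"
setwd(file.path(root, "cat_pics", "mittens"))
found_from_mittens <- file.exists("../../data/fancy_data.Rdata")

# Go back to where we started
setwd(old_wd)

found_from_root
found_from_mittens
```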

---

Please first download the [Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany](https://search.gesis.org/research_data/ZA5667) as .sav, .dta, and .csv files.

# [Exercise](https://jobreu.github.io/r-intro-gesis-2021/exercises/Exercise_1_2_3_Statistical_Software_Files.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/r-intro-gesis-2021/solutions/Exercise_1_2_3_Statistical_Software_Files.html)