class: center, middle, inverse, title-slide .title[ # Tools and Workflows for Reproducible Research in the Quantitative Social Sciences ] .subtitle[ ## Introduction ] .author[ ### Johannes Breuer, Bernd Weiss, & Arnim Bleier ] .date[ ### 2022-11-17 ] --- layout: true --- ## About us **Johannes Breuer** .small[ - Senior researcher in the team *Data Augmentation*, department [*Survey Data Curation*](https://www.gesis.org/en/institute/departments/survey-data-curation) at *GESIS* - digital trace data for social science research - data linking (surveys + digital trace data) - (Co-)leader of the team *Research Data & Methods* at the [*Center for Advanced Internet Studies* (CAIS)](https://www.cais-research.de/) - Ph.D. in Psychology, University of Cologne - Research interests - Use and effects of digital media - Computational methods - Data management - Open science [johannes.breuer@gesis.org](mailto:johannes.breuer@gesis.org), [@MattEagle09](https://twitter.com/MattEagle09), [personal website](https://www.johannesbreuer.com/) ] --- ## About us **Bernd Weiß** .small[ - Head of team [*GESIS Panel*](https://www.gesis.org/en/gesis-panel/gesis-panel-home) and deputy head of the department [*Survey Design and Methodology*](https://www.gesis.org/en/institute/departments/survey-design-and-methodology) at *GESIS* - Obtained a doctorate (Sociology) from the University of Cologne in 2008 - Research interests: methods of empirical research in the social sciences, survey methodology, family sociology and juvenile delinquency - Approach to this workshop (and a disclaimer): Open Science and reproducible research are tools, but not part of my research agenda and I do not claim to be an expert in any of the things I will be talking about... 
[bernd.weiss@gesis.org](mailto:bernd.weiss@gesis.org), [ORCID: 0000-0002-1176-8408](https://orcid.org/0000-0002-1176-8408), [@berndweiss](https://twitter.com/berndweiss) ] --- ## About us **Arnim Bleier** .small[ - Senior researcher in the team *Designed Digital Data* in the department [*Computational Social Science*](https://www.gesis.org/en/institute/departments/computational-social-science) at *GESIS* - natural language processing - probabilistic graphical models - Ph.D. in statistics / machine learning, Leipzig University - Other research interests - Structural equation modeling - Distributed databases - Right to replicate [arnim.bleier@gesis.org](mailto:arnim.bleier@gesis.org), [@arnimb](https://twitter.com/arnimb), [Google Scholar](https://scholar.google.de/citations?user=_fof5_EAAAAJ) ] --- ## About you - What's your name? - Where do you work & what is your field? - What are your experiences with reproducible research practices (and the tools we cover in this course)? Please try to keep it brief. --- ## Goals of this course After this course you should be... - familiar with key concepts of reproducible research workflows - able to (start) work(ing) with tools for reproducible research, such as `Git`, *GitHub*, `R Markdown`, `Jupyter Notebooks`, and `Binder` - able to publish reproducible computational analysis pipelines with `R` --- ## Prerequisites For this course (esp. 
the exercises) you should have the following things installed on your computer:

- A version of [`R`](https://www.r-project.org/) that is >= 4.0.0
- The following `R` packages: `rmarkdown`, `usethis`, `gitcreds`

```r
# check if packages are installed and install missing ones
packages <- c("rmarkdown", "usethis", "gitcreds")
install.packages(setdiff(packages, rownames(installed.packages())))
```

- A recent version of [*RStudio*](https://posit.co/downloads/)
- [`Git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)

In addition, you should also have/create a [*GitHub*](https://github.com/) account.

---

## Prerequisites

Did you have any trouble with the setup for this workshop?

Installing/setting up...

- [`git`](https://git-scm.com/)
- [`R`](https://www.r-project.org/)
- [*RStudio*](https://posit.co/downloads/)
- the required `R` packages [`rmarkdown`](https://rmarkdown.rstudio.com/index.html), [`usethis`](https://usethis.r-lib.org/), & [`gitcreds`](https://gitcreds.r-lib.org/)
- a [*GitHub*](https://github.com/) account

?
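---

## Prerequisites

If you are unsure whether your setup matches the prerequisites listed above, a quick check from the `R` console can help. The following is only a minimal sketch: the names `required_pkgs` and `check_setup` are made up for illustration and are not part of the workshop materials, and the check assumes that `Git` is available on your system `PATH`.

```r
# Minimal setup check (a sketch; `required_pkgs` and `check_setup` are
# illustrative names, not part of the workshop materials)
required_pkgs <- c("rmarkdown", "usethis", "gitcreds")

check_setup <- function(pkgs = required_pkgs, min_r = "4.0.0") {
  list(
    # TRUE if the running R version is at least `min_r`
    r_version_ok = getRversion() >= min_r,
    # packages from `pkgs` that are not installed (character(0) if none)
    missing_pkgs = setdiff(pkgs, rownames(installed.packages())),
    # TRUE if a `git` executable is found on the PATH (assumption: Git is
    # exposed on the PATH rather than only bundled with another program)
    git_found = unname(nzchar(Sys.which("git")))
  )
}

# Print a compact overview of the three checks
str(check_setup())
```

If `missing_pkgs` is not empty or `git_found` is `FALSE`, that points to the part of the setup that still needs attention. A `FALSE` for `git_found` can also simply mean that `Git` is installed in a location that `R` does not see.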
---

## Workshop Structure & Materials

- The workshop consists of a combination of lectures and hands-on exercises

- Slides and other materials are available at https://github.com/jobreu/reproducible-research-gesis-2022

- The workshop repository on the [*GESIS ILIAS*](https://ilias.gesis.org) contains some literature on tools and workflows for reproducibility as well as a timetable for this workshop

---

## Online format

- If possible, we invite you to turn on your camera

- Feel free to ask questions anytime

- If you have an immediate question during the lecture parts, please send it via text chat, publicly or privately (ideally to a person who is currently not presenting)

- If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it at the end of a lecture part or during the exercises (please use the "raise hand" function in *Zoom* for this)

- We would kindly ask you to mute your microphones when you are not asking (or answering) a question

---

## Course schedule - Day 1

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Topic </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;color: gray !important;"> 10:00 - 11:00 </td> <td style="text-align:left;font-weight: bold;"> Introduction </td> </tr>
<tr> <td style="text-align:left;color: gray !important;"> 11:00 - 12:00 </td> <td style="text-align:left;font-weight: bold;"> Computer literacy for reproducible research </td> </tr>
<tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 12:00 - 13:00 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Lunch Break </td> </tr>
<tr> <td style="text-align:left;color: gray !important;"> 13:00 - 15:00 </td> <td style="text-align:left;font-weight: bold;"> Introduction to R Markdown </td> </tr>
<tr> <td style="text-align:left;color: gray
!important;color: gray !important;"> 15:00 - 15:15 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 15:15 - 17:00 </td> <td style="text-align:left;font-weight: bold;"> Git & GitHub - Part 1 </td> </tr> </tbody> </table> --- ## Course schedule - Day 2 <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: gray !important;"> 09:30 - 10:30 </td> <td style="text-align:left;font-weight: bold;"> Git & RStudio </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 10:30 - 10:45 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 10:45 - 11:45 </td> <td style="text-align:left;font-weight: bold;"> Jupyter Notebooks & Binder </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 11:45 - 12:30 </td> <td style="text-align:left;font-weight: bold;"> Build your own Binder </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 12:30 - 13:30 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Lunch Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 13:30 - 14:00 </td> <td style="text-align:left;font-weight: bold;"> Sharing & Publishing </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 14:00 - 14:15 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 14:15 - 15:45 </td> <td style="text-align:left;font-weight: bold;"> Git & GitHub - Part 2 </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 15:45 - 17:00 </td> <td 
style="text-align:left;font-weight: bold;"> Recap & Outlook </td> </tr> </tbody> </table>

---

## Disclaimer

We will cover several different tools that can be used for reproducible research in the quantitative social sciences.

We can only cover the basics of these tools. If you want to keep using them, especially in more advanced ways, you will probably need to "dig deeper" eventually and consult further resources (documentation, further tutorials, blog posts, or other publications).

---

## Disclaimer

As you probably already know, there are a lot of different tools and workflows that can be employed for increasing the reproducibility of research. We will introduce you to some of them, but there are many more. In the end, which tools and workflows you employ depends on your personal preferences and needs.<sup>1</sup>

In this course, we will focus on reproducible research using `R`. However, there are also solutions for reproducible research with other programming languages, such as `Python` or `Julia`, as well as statistical software packages, such as *SPSS* and *Stata*.

.small[
[1] As you will see, we instructors also all have different preferences in our workflows and tool use.
]

---
class: center, middle

# Any questions so far?

---
class: center, middle

# Next up: What is reproducibility and why does it matter?

---

## Why should we aim for reproducibility?

<img src="data:image/png;base64,#Introduction_files/figure-html/lazy-tweet-1.png" width="55%" style="display: block; margin: auto;" />

.small[
https://twitter.com/hadleywickham/status/598532170160873472?ref_src=twsrc%5Etfw
]

---

## What is reproducibility?

As with (almost) everything in science, there are different definitions of reproducibility.

We will discuss some of them in the following.
--- ## Defining dimensions 3-dimensional concept space <img src="data:image/png;base64,#https://dh-trier.github.io/trr/img/trr-cube.jpg" width="50%" style="display: block; margin: auto;" /> <small><small>By Christof Schöch. Source: https://dh-trier.github.io/trr/#/2/1</small></small> --- ## Defining dimensions <img src="data:image/png;base64,#https://the-turing-way.netlify.app/_images/reproducibility.jpg" width="75%" style="display: block; margin: auto;" /> <small><small>[*The Turing Way Project*](https://the-turing-way.netlify.app/welcome.html) illustration by Scriberia. DOI: [10.5281/zenodo.3332807](https://doi.org/10.5281/zenodo.3332807)</small></small> --- ## *The Turing Way* definition <img src="data:image/png;base64,#https://the-turing-way.netlify.app/_images/reproducible-matrix.jpg" width="80%" style="display: block; margin: auto;" /> <small><small>Source: https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html</small></small> --- ## Replication or reproduction? <img src="data:image/png;base64,#https://dh-trier.github.io/trr/img/replication-typology-a.jpg" width="50%" style="display: block; margin: auto;" /> <small><small>By Christof Schöch. Source: https://dh-trier.github.io/trr/#/2/2</small></small> --- ## Defining reproducibility > A minimum standard on a spectrum of activities (“reproducibility spectrum”) for assessing the value or accuracy of scientific claims .highlight[based on the original methods, data, and code] [...] In some fields, this meaning is, instead, associated with the term “replicability” or ‘repeatability’ ([FORRT Glossary - Reproducibility](https://forrt.org/glossary/reproducibility/)) --- ## Computational reproducibility There also are distinctions between different types (or components) of reproducibility. 
One type/component that is especially relevant for this workshop is "Computational reproducibility":

> Ability to .highlight[recreate the same results as the original study] (including tables, figures, and quantitative findings), .highlight[using the same input data, computational methods, and conditions of analysis]. The availability of code and data facilitates computational reproducibility, as does preparation of these materials (annotating data, delineating software versions used, sharing computational environments, etc). Ideally, computational reproducibility should be achievable by another second researcher (or the original researcher, at a future time), using only a set of files and written instructions

([FORRT Glossary - Computational reproducibility](https://forrt.org/glossary/computational-reproducibility/))

---

## Tools & workflows 🛠 📋

In our case, **tools** are programming languages, programs, and other pieces of software that we can use to make our research (more easily) reproducible.

**Workflows** are the ways in which we combine these tools to achieve our goal.

---

## Tools & workflows 🛠 📋

Choosing tools and establishing workflows are somewhat idiosyncratic processes that depend on...

- the requirements of your project (methods, data types...)
- the availability of tools
- your skills and knowledge
- the preferences of collaborators

...

---

## Reproducible research workflows

> being an open scientist means .highlight[adopting a few straightforward research management practices, which lead to less error-prone, reproducible research workflows]

([Klein et al., 2018](https://doi.org/10.1525/collabra.158), p. 11)

---

## Research management practices

There are quite a few practices that researchers can adopt to increase the reproducibility of their work.
- [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)

- [Clear folder structures](https://psych-transparency-guide.uni-koeln.de/folder-structure.html)

- [Naming things](https://betterprogramming.pub/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841?gi=3295916e039)

- ...

All of those practices, as well as the use of the tools we cover in this workshop, require a certain degree of **computer literacy**.
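---

## Research management practices

To make the practices listed above a bit more concrete, here is a minimal sketch of how a project skeleton with a clear folder structure and consistent names could be set up from `R`. The function name `create_project_skeleton`, the folder names, and the file name `session-info.txt` are assumptions chosen for illustration, not a layout prescribed by this workshop. The sketch also stores the output of `sessionInfo()`, as one simple way of documenting the software versions used.

```r
# Sketch of a project skeleton; the folder names and the function name
# `create_project_skeleton` are illustrative assumptions, not a standard
create_project_skeleton <- function(root,
                                    folders = c("data/raw", "data/processed",
                                                "code", "output", "docs")) {
  # Create the project root first, then all subfolders
  dir.create(root, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(root, folders)
  for (p in paths) {
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  # Record the R version and attached packages alongside the project
  writeLines(capture.output(sessionInfo()),
             file.path(root, "session-info.txt"))
  invisible(paths)
}

# Try it in a temporary directory so nothing in your working directory changes
project_root <- file.path(tempdir(), "my-project")
create_project_skeleton(project_root)
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```

Lowercase folder names without spaces (here in kebab-case where needed) are one possible naming convention. What matters most is picking one and sticking to it throughout the project.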