class: center, middle, inverse, title-slide .title[ # Workflows for Reproducible Research with R & Git ] .subtitle[ ## Introduction ] .author[ ### Johannes Breuer, Bernd Weiss, & Arnim Bleier ] .date[ ### 2023-11-16 ] --- layout: true --- ## About us **Johannes Breuer** .small[ - Senior researcher in the team *Digital Society Observatory*, department [*Computational Social Science*](https://www.gesis.org/en/institute/departments/computational-social-science) at *GESIS* - digital trace data for social science research - data linking (surveys + digital trace data) - Leader of the team *Research Data & Methods* at the [*Center for Advanced Internet Studies* (CAIS)](https://www.cais-research.de/) - Ph.D. in Psychology, University of Cologne - Other research interests - Use and effects of digital media - Computational methods - Data management - Open science [johannes.breuer@gesis.org](mailto:johannes.breuer@gesis.org), [personal website](https://www.johannesbreuer.com/) ] --- ## About us **Bernd Weiß** .small[ - Head of team [*GESIS Panel*](https://www.gesis.org/en/gesis-panel/gesis-panel-home) and deputy head of the department [*Survey Design and Methodology*](https://www.gesis.org/en/institute/departments/survey-design-and-methodology) at *GESIS* - Obtained a doctorate (Sociology) from the University of Cologne in 2008 - Research interests: methods of empirical research in the social sciences, survey methodology, family sociology and juvenile delinquency - Approach to this workshop (and a disclaimer): Open Science and reproducible research are tools, but not part of my research agenda and I do not claim to be an expert in any of the things I will be talking about... [bernd.weiss@gesis.org](mailto:bernd.weiss@gesis.org), [ORCID: 0000-0002-1176-8408](https://orcid.org/0000-0002-1176-8408) ] --- ## About us **Arnim Bleier** .small[ - Senior researcher in the team *Designed Digital Data* in the department [*Computational Social Science*](https://www.gesis.org/en/institute/departments/computational-social-science) at *GESIS* - natural language processing - probabilistic graphical models - Ph.D. in statistics / machine learning, Leipzig University - Other research interests - Structural equation modeling - Distributed databases - Right to replicate [arnim.bleier@gesis.org](mailto:arnim.bleier@gesis.org), [Google Scholar](https://scholar.google.de/citations?user=_fof5_EAAAAJ) ] --- ## About you - What's your name? - Where do you work & what is your field? - What are your experiences with reproducible research practices (and the tools we cover in this course)? - What do you hope to get out of this course? Please try to keep it brief. --- ## Goals of this course After this course you should be... - familiar with key concepts of reproducible research workflows - able to work with frameworks and tools that can be used for maximizing reproducibility, such as `Git`, `R` packages for dependency management, or *Binder* - able to publish reproducible computational analysis pipelines with `R` --- ## Prerequisites For this course (esp. the exercises) you should have the following things installed on your computer: - A version of [`R`](https://www.r-project.org/) that is >= 4.0.0 - the following `R` packages: `usethis`, `gitcreds`, `groundhog` ```r # check if packages are installed and install missing ones packages = c("usethis", "gitcreds", "groundhog") install.packages(setdiff(packages, rownames(installed.packages()))) ``` - A recent version of [*RStudio*](https://posit.co/downloads/) - [`Git`](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) In addition, you should also have/create a [*GitHub*](https://github.com/) account. --- ## Prerequisites Did you have any trouble with the setup for this workshop? Installing/setting up... - [`git`](https://git-scm.com/) - [`R`](https://www.r-project.org/) - [*RStudio*](https://posit.co/downloads/) - the required `R` packages [`usethis`](https://usethis.r-lib.org/), [`gitcreds`](https://gitcreds.r-lib.org/), & [`groundhog`](https://groundhogr.com/) - a [*GitHub*](https://github.com/) account ? --- ## Workshop Structure & Materials - The workshop consists of a combination of lectures and hands-on exercises - Slides and other materials are available at https://github.com/jobreu/reproducible-research-gesis-2023 - The workshop repository on the [*GESIS ILIAS*](https://ilias.gesis.org) contains some literature on tools and workflows for reproducibility as well as a timetable for this workshop --- ## Online format - If possible, we invite you to turn on your camera - Feel free to ask questions anytime - If you have an immediate question during the lecture parts, please send it via text chat, publicly or privately (ideally to a person who is currently not presenting) - If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it at the end of a lecture part or during the exercises (please use the use the "raise hand" function in *Zoom* for this) - We would kindly ask you to mute your microphones when you are not asking (or answering) a question --- ## Course schedule - Day 1 <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: gray !important;"> 09:30 - 10:45 </td> <td style="text-align:left;font-weight: bold;"> Introduction </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 10:45 - 11:00 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Coffee Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 11:00 - 12:00 </td> <td style="text-align:left;font-weight: bold;"> Computer literacy </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 12:00 - 13:00 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Lunch Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 13:00 - 15:00 </td> <td style="text-align:left;font-weight: bold;"> Git & GitHub - Part 1 </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 15:00 - 15:15 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Coffee Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 15:15 - 16:30 </td> <td style="text-align:left;font-weight: bold;"> Git & GitHub - Part 2 </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 16:30 - 17:00 </td> <td style="text-align:left;font-weight: bold;"> Q & A </td> </tr> </tbody> </table> --- ## Course schedule - Day 2 <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Time </th> <th style="text-align:left;"> Topic </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;color: gray !important;"> 09:00 - 09:30 </td> <td style="text-align:left;font-weight: bold;"> Recap Day 1 </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 09:30 - 11:00 </td> <td style="text-align:left;font-weight: bold;"> Dependency management </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 11:00 - 11:15 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Coffee Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 10:15 - 12:00 </td> <td style="text-align:left;font-weight: bold;"> Build your own Binder </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 12:00 - 13:00 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Lunch Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 13:00 - 14:30 </td> <td style="text-align:left;font-weight: bold;"> Binder & Notebooks </td> </tr> <tr> <td style="text-align:left;color: gray !important;color: gray !important;"> 14:30 - 14:45 </td> <td style="text-align:left;font-weight: bold;color: gray !important;"> Coffee Break </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 14:45 - 16:00 </td> <td style="text-align:left;font-weight: bold;"> Saving computational environments </td> </tr> <tr> <td style="text-align:left;color: gray !important;"> 16:00 - 17:00 </td> <td style="text-align:left;font-weight: bold;"> Recap & Outlook </td> </tr> </tbody> </table> --- ## Disclaimer We will cover several different tools that can be used for reproducible research in the quantitative social sciences. We will only be able to cover the basics of those tools, so if you want to continue to use them and use them in more advanced ways, you will probably need to "dig deeper" eventually and consult further resources (documentation, further tutorials, blog posts, or other publications). -- Our goal is that people with no or only very limited experience with the tools we will cover in this workshop are able to follow and keep up. For those who already have quite some experience with one or more of the tools we will cover, feel free to try out some additional things or options (you can, e.g., check some of the additional topics and tools from our [Outlook slides](https://jobreu.github.io/reproducible-research-gesis-2022/slides/Recap_outlook.html)), or have a closer look at documentation of the tools. This also applies if the exercises are too easy for you and you are done earlier with them). --- ## Disclaimer As you probably already know, there are a lot of different tools and workflows that can be employed for increasing the reproducibility of research. We will introduce you to some of those, but there is more, and, in the end, it depends on your personal preferences and needs what tools and workflows you employ.<sup>1</sup> In this course, we will focus on free and open-source software (FOSS).<sup>2</sup> We will also focus on `R`, but there are solutions for reproducible research with other programming languages, such as `Python` or `Julia`, as well as statistical software packages, such as *SPSS* and *Stata*. .small[ [1] As you will see, we instructors also all have different preferences in our workflows and tool use. [2] The only exceptions are *Docker* and *GitHub* which belongs to *Microsoft*. We chose *GitHub* over [*GitLab*](https://about.gitlab.com/), however, as it is generally more widely used and available to everybody, regardless of whether your institution maintains its own *GitLab* server or not. ] --- class: center, middle # Any questions so far? --- class: center, middle # Next up: What is reproducibility and why does it matter? --- ## Defining reproducibility > A minimum standard on a spectrum of activities (“reproducibility spectrum”) for assessing the value or accuracy of scientific claims .highlight[based on the original methods, data, and code] [...] In some fields, this meaning is, instead, associated with the term “replicability” or ‘repeatability’ ([FORRT Glossary - Reproducibility](https://forrt.org/glossary/reproducibility/)) --- ## *The Turing Way* definition <img src="data:image/png;base64,#https://the-turing-way.netlify.app/_images/reproducible-matrix.jpg" width="80%" style="display: block; margin: auto;" /> <small><small>Source: https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html</small></small> --- ## Defining dimensions 3-dimensional concept space <img src="data:image/png;base64,#https://dh-trier.github.io/trr/img/trr-cube.jpg" width="50%" style="display: block; margin: auto;" /> <small><small>By Christof Schöch. Source: https://dh-trier.github.io/trr/#/2/1</small></small> --- ## Computational reproducibility There also are distinctions between different types (or components) of reproducibility. One type/component that is especially relevant for this workshop is "Computational reproducibility": > Ability to .highlight[recreate the same results as the original study] (including tables, figures, and quantitative findings), .highlight[using the same input data, computational methods, and conditions of analysis]. The availability of code and data facilitates computational reproducibility, as does preparation of these materials (annotating data, delineating software versions used, sharing computational environments, etc). Ideally, computational reproducibility should be achievable by another second researcher (or the original researcher, at a future time), using only a set of files and written instructions ([FORRT Glossary - Computational reproducibility](https://forrt.org/glossary/computational-reproducibility/)) --- ## Reproducibility as a continuum <img src="data:image/png;base64,#../img/repro_spectrum.jpeg" width="95%" style="display: block; margin: auto;" /> <small><small>Source: Peng, R. D. (2011). Reproducible Research in Computational Science. *Science*, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847</small></small> --- ## Why reproducibility matters? Reproducibility ensures that research is .highlight[transparent, verifiable, and trustworthy]. Studies have repeatedly shown suboptimal reproducibility of research in the social sciences (see, e.g., [Artner et al., 2021](https://doi.org/10.1037/met0000365), [Hardwicke et al., 2020](https://doi.org/10.1098/rsos.190806), [Krähmer et al., 2023](https://doi.org/10.1371/journal.pone.0289380), [Trisovic et al., 2022](https://doi.org/10.1038/s41597-022-01143-6)). --- ## Motivations for reproducibility - Increasing the robustness and trustworthiness of your own research - Facilitating collaboration (through the use of common tools and standards) - Being kind to future you: - Editing and reusing your own code (e.g., for a paper revision or a follow-up study) - Easily being able to pick up a project after a break or finding and understanding things again after a longer time --- ## Tools & workflows 🛠 📋 In our case, **tools** are programming languages, programs, and other pieces of software that we can use to make our research (more easily) reproducible. **Workflows** are the ways in which we combine these tools to achieve our goal. --- ## Tools & workflows 🛠 📋 Choosing tools and establishing workflows are somewhat idiosyncratic processes that depend on... - the requirements of your project (methods, data types...) - the availability of tools - your skills and knowledge - the preferences of collaborators ... --- ## Reproducible research workflows > being an open scientist means .highlight[adopting a few straightforward research management practices, which lead to less error-prone, reproducible research workflows] ([Klein et al., 2018](https://doi.org/10.1525/collabra.158), p. 11) --- ## Research management practices There are quite a few practices that researchers can adopt to increase the reproducibility of their work. - Using free and open source software (FOSS) - [Project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) - [Clear folder structures](https://psych-transparency-guide.uni-koeln.de/folder-structure.html) - [Naming things](https://betterprogramming.pub/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841?gi=3295916e039) - ... Notably, all of those practices as well as the use of the tools we cover in this workshop require a certain degree of **Computer literacy**.