class: center, middle, inverse, title-slide .title[ # Workflows for Reproducible Research with R & Git ] .subtitle[ ## Dependency Management ] .author[ ### Johannes Breuer, Bernd Weiss, & Arnim Bleier ] .date[ ### 2023-11-17 ] --- layout: true --- ## Dependencies in `R` Most `R` packages depend on other `R` packages. All `R` packages depend on `R`. -- Both `R` and `R` packages have versions. Different versions of `R` packages may depend on different versions of `R` and different versions of other packages. --- ## Dependencies in `R` <img src="data:image/png;base64,#Dependency_Management_files/figure-html/dep-graph-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Dependencies in `R` <img src="data:image/png;base64,#../img/cran_usethis.png" width="175%" style="display: block; margin: auto;" /> <small><small>Source: https://cran.r-project.org/web/packages/usethis/index.html</small></small> --- class: center, middle ["It's ~~turtles~~ software all the way down"](https://en.wikipedia.org/wiki/Turtles_all_the_way_down) 🐢 -- ... or is it??? --- ## Digging ⛏ into your tool stack 🛠 Your full `R` setup consists of: 1. Specific versions of `R` packages<sup>1</sup> 2. A specific version of `R`<sup>2</sup> 3. A specific version of your operating system 4. Specific hardware .small[ [1] You can have different libraries with different (versions of) `R` packages. [2] You can also have different versions of `R` installed on your machine. ] --- ## Further dependencies <img src="data:image/png;base64,#../img/toppling-tower.jpg" width="30%" style="display: block; margin: auto;" /> <small><small>*Note*: We could also add system libraries between `R` and the OS (which are especially relevant in the [Linux/Unix world](https://www.tutorialspoint.com/operating_system/os_linux.htm)).</small></small> --- ## The danger of dependencies <img src="data:image/png;base64,#https://imgs.xkcd.com/comics/dependency.png" width="50%" style="display: block; margin: auto;" /> <small><small>https://xkcd.com/2347/</small></small> --- ## What to "ship" 🚢? - `R` code (+ underlying data) - this should include information about the packages used - information about the version of `R` and the used packages -- - your whole computational environment (focus of the next session) - overall goal: preventing what is known as "code rot" & "works-on-my-machine errors" (WOMME) --- ## Dependency management solutions As with almost everything in the `R` ecosystem, there are multiple solutions for dependency management: - a manual approach - [~~`checkpoint`~~](https://github.com/RevolutionAnalytics/checkpoint)<sup>1</sup> - [**`groundhog`**](https://groundhogr.com/) - [`renv`](https://rstudio.github.io/renv/)<sup>2</sup> - [`rang`](https://github.com/gesistsa/rang)<sup>3</sup> .small[ [1] Not an option anymore as it relied on the *CRAN Time Machine snapshots* from the *Microsoft R Application Network* (MRAN) which was [retired in July 2022](https://techcommunity.microsoft.com/t5/azure-sql-blog/microsoft-r-application-network-retirement/ba-p/3707161). [2] `renv` is the successor of [`packrat`](https://github.com/rstudio/packrat) (which is not maintained anymore). [3] Developed by members of the team [Transparent Social Analytics](https://www.gesis.org/en/institute/staff/orga/tiles/5/74?cHash=8fbd330b798c8dd7cb84097ddfd82054) at GESIS. ] --- ## Manual approach to dependency management There is an easy-to-use manual solution for providing information about the packages and `R` version used in your project: .small[ ```r sessionInfo() ``` ``` ## R version 4.3.2 Patched (2023-10-31 r85451 ucrt) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 19045) ## ## Matrix products: default ## ## ## locale: ## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 ## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C ## [5] LC_TIME=German_Germany.utf8 ## ## time zone: Europe/Berlin ## tzcode source: internal ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] depgraph_0.1.0 emo_0.0.0.9000 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 ## [6] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ## [11] ggplot2_3.4.4 tidyverse_2.0.0 knitr_1.45 ## ## loaded via a namespace (and not attached): ## [1] tidyselect_1.2.0 viridisLite_0.4.2 farver_2.1.1 ## [4] fastmap_1.1.1 miniCRAN_0.2.16 xaringan_0.28 ## [7] promises_1.2.1 easypackages_0.1.0 digest_0.6.33 ## [10] timechange_0.2.0 mime_0.12 lifecycle_1.0.4 ## [13] ellipsis_0.3.2 processx_3.8.2 magrittr_2.0.3 ## [16] compiler_4.3.2 rlang_1.1.2 sass_0.4.7 ## [19] tools_4.3.2 igraph_1.5.1 utf8_1.2.4 ## [22] yaml_2.3.7 labeling_0.4.3 prettyunits_1.2.0 ## [25] htmlwidgets_1.6.2 bit_4.0.5 pkgbuild_1.4.2 ## [28] xml2_1.3.5 pkgload_1.3.3 miniUI_0.1.1.1 ## [31] withr_2.5.2 grid_4.3.2 fansi_1.0.5 ## [34] urlchecker_1.0.1 profvis_0.3.8 xtable_1.8-4 ## [37] colorspace_2.1-0 scales_1.2.1 woRkshoptools_0.1.0 ## [40] cli_3.6.1 rmarkdown_2.25 crayon_1.5.2 ## [43] generics_0.1.3 remotes_2.4.2.1 rstudioapi_0.15.0 ## [46] httr_1.4.7 tzdb_0.4.0 sessioninfo_1.2.2 ## [49] cachem_1.0.8 rvest_1.0.3 assertthat_0.2.1 ## [52] parallel_4.3.2 vctrs_0.6.4 devtools_2.4.5 ## [55] webshot_0.5.5 jsonlite_1.8.7 callr_3.7.3 ## [58] hms_1.1.3 ggrepel_0.9.4 bit64_4.0.5 ## [61] archive_1.1.6 systemfonts_1.0.5 fontawesome_0.5.2 ## [64] jquerylib_0.1.4 glue_1.6.2 ps_1.7.5 ## [67] xaringanExtra_0.7.0.9000 stringi_1.8.1 gtable_0.3.4 ## [70] later_1.3.1 munsell_0.5.0 pillar_1.9.0 ## [73] ggnetwork_0.5.12 htmltools_0.5.7 R6_2.5.1 ## [76] vroom_1.6.4 evaluate_0.23 shiny_1.7.5.1 ## [79] kableExtra_1.3.4 highr_0.10 memoise_2.0.1 ## [82] httpuv_1.6.12 bslib_0.5.1 uuid_1.1-1 ## [85] Rcpp_1.0.11 svglite_2.1.2 xfun_0.41 ## [88] fs_1.6.3 usethis_2.2.2 pkgconfig_2.0.3 ``` ] --- ## Manual approach to dependency management You can, e.g., provide the output of `sessionInfo()` in a `.txt` file. We have created a short `R` script which does that. You can find the `session_info_to_txt.R` script in the folder `content/R/` within the workshop materials. An exemplary output of that `session_info.txt` can be found in the root directory of the workshop materials. --- ## `groundhog` [`groundhog`](https://groundhogr.com/) is a lightweight package that allows you to increase the reproducibility of your `R` scripts. It does so by installing and loading "packages & their dependencies as available on chosen date on CRAN". --- ## Using `groundhog` All you need to do to use `groundhog` is specifying the packages you want to use in your script and a date. ```r install.packages("groundhog") library(groundhog) pkgs <- c("tidyverse", "janitor", "sjPlot") groundhog.library(pkgs, date = "2023-11-17") ``` --- ## How `groundhog` works From the [package website](https://groundhogr.com/back-end/): "groundhog relies on a database that contains virtually all package versions ever uploaded to CRAN, the date when they were published, and all dependencies." `groundhog` also used to rely on MRAN. Since that has been retired, however, it now uses its own package repository: [GRAN: Groundhog R Archive Neighbor](https://groundhogr.com/gran/). --- ## How `groundhog` works `groundhog` uses its own package library to install the packages you specified. From the documentation of the `restore.library()` function: "When groundhog installs a package, it installs it into groundhog's library." ```r library(groundhog) get.groundhog.folder() ``` ``` ## [1] "e:\\home/R_groundhog/groundhog_library/" ``` "Groundhog then immediately moves the installed package(s) (and their dependencies) to the default personal library." ```r .libPaths()[1] ``` ``` ## [1] "D:/Programme/R/library" ``` .small[ *Note*: You can find a lot more technical information on the [`groundhog` website](https://groundhogr.com/) and within the help files for the `groundhog` functions. ] --- ## Reversing changes made by `groundhog` You can reverse the changes made by `groundhog` to your default personal `R` package library: ```r restore.library() ``` --- ## Pros and cons of `groundhog` *Pros:* - can be easily used to make existing `R` scripts more reproducible - does not require a "project-based workflow"<sup>1</sup> - does not require any specific knowledge on the reproducer's side *Cons:* - works with package snapshots from specific dates, not with specific package versions - in reality, our installed packages are very often not up-to-date - installs specific package versions, but not specific versions of `R` .small[ [1] While you as someone who highly values reproducibility, of course, do use a project-based workflow, the people who want to reproduce your analysis might not 😉 ] --- ## Choosing a date The recommendation by the `groundhog` package authors for choosing a date is that "a good default is the first day of the month when starting your project". Once you have added `groundhog.library()` to your script, re-run it to make sure it produces the expected results. You can also update all of the packages you use and then specify the current date as the date for `groundhog.library()`. --- ## `groundhog` and `R` versions The `groundhog` database also includes information on base `R` releases. As `groundhog` does not install specific versions of `R`, you should still specify the version of `R` you used in your script. ```r R.version.string ``` ``` ## [1] "R version 4.3.2 Patched (2023-10-31 r85451 ucrt)" ``` You can find the release dates for all versions of R via the [CRAN archive page](https://cran.r-project.org/src/base/). --- ## Excursus: Updating `R` As you probably know, you can download the most recent version of `R` from the [CRAN website](https://cran.r-project.org/). You can download older versions of `R` via the CRAN archive packages for [*Windows*](https://cran.r-project.org/bin/windows/base/old/) and [*Mac OS X*](https://cran-archive.r-project.org/bin/macosx/).<sup>1</sup> .small[ [1] How you install and update `R` on Linux depends on your [distribution](https://cran.r-project.org/bin/linux/). ] --- ## Excursus: Updating `R` On Windows, you can also use the [`installr` package](https://talgalili.github.io/installr/) to update `R`.<sup>1</sup> ```r install.packages("installr") library(installr) updateR() ``` .small[ [1] The `installr` package also offers some interesting other functionalities, such as installing `Git` via the `install.git()` function. ] --- ## Excursus: Updating packages The easiest way of updating packages is simply using `install.packages()`. To update all packages or a specific set of packages, you can use the `update.packages()` function. ```r # update all installed packages update.packages() # update specific packages update.packages(oldPkgs = c("tidyverse", "janitor", "sjPlot")) ``` --- ## Excursus: Updating packages With the following code, you can detach and update all of the packages you have currently loaded in your `R` session (excluding core `R` packages): ```r loaded_pkgs <- search() loaded_pkgs <- loaded_pkgs[grep("^package:", loaded_pkgs)] exclude_pkgs <- c("package:base", "package:stats", "package:graphics", "package:grDevices", "package:utils", "package:datasets", "package:methods", "package:utils") loaded_pkgs <- loaded_pkgs[!loaded_pkgs %in% exclude_pkgs] for (pkg in loaded_pkgs) { detach(pkg, character.only = TRUE, unload = TRUE) } update_pkgs <- gsub("^package:", "", loaded_pkgs) install.packages(update_pkgs) ``` --- ## Excursus: Updating packages If you want to install a specific version of a package (not the most recent one), the easiest option is to use a function from the `remotes` package. ```r library(remotes) install_version("tidyverse", version = "1.3.0") ``` --- class: center, middle # [Exercise](https://jobreu.github.io/reproducible-research-gesis-2023/exercises/Exercise_Dependency_Management.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/reproducible-research-gesis-2023/solutions/Exercise_Dependency_Management.html) --- class: center, middle # More comprehensive dependency management options in `R` --- ## `renv` and its advantages<br> compared to `groundhog` Let's talk about another option: `renv` - `renv` is integrated with *RStudio* - `renv` is developed by professional software engineers (rather than academic researchers) - `renv` works with more repositories than `groundhog` For more information on `groundhog` vs. `renv`, see: https://groundhogr.com/renv/ --- ## What is `renv`? The package `renv` helps to create reproducible environments for *R projects* --- ## What is an `R` project? > "R experts keep all the files associated with a project together — input data, > R scripts, analytical results, figures. This is such a wise and common > practice that RStudio has built-in support for this via projects." .small[Source: https://r4ds.had.co.nz/workflow-projects.html] --- ## Why? Use `renv` to make your `R` projects more isolated, portable and reproducible. - Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because `renv` gives each project its own private library. - Portable: Easily transport your projects from one computer to another, even across different platforms. `renv` makes it easy to install the packages your project depends on. - Reproducible: `renv` records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go. .small[Source: https://rstudio.github.io/renv/] --- ## Let's talk about libraries<br> (and packages) - When talking about a "reproducible R environment", this mostly (excluding `R` itself) refers to all the R packages that are being used - Most of us have *one* "system library" for `R` packages, where all installed packages can be found (for example: `D:/Programme/R/library`) - "The directories in `R` where the packages are stored are called the libraries" [1] .small[[1] https://hbctraining.github.io/Intro-to-R-flipped/lessons/04_introR_packages.html] --- ## Is one R system library a good idea? However, having *one* central system library for `R` packages may be a bad idea, especially when things break -- and things will break! ("dependency hell") <img src="data:image/png;base64,#https://imgs.xkcd.com/comics/dependency.png" width="38%" style="display: block; margin: auto;" /> <small><small>https://xkcd.com/2347/</small></small> --- ## renv offers "project libraries" - Instead of using *one* central system library for all installed `R` packages, `renv` sets up a "project library", i.e., each `R` project has its own library - Project libraries means that each project has its own independent collection of `R` packages, i.e., multiple projects can have the same `R` package but, e.g., different versions of it --- ## Where all the `R` packages<br> grow... repositories - The usual ways to install an `R` package is to download the package from an repository - Most famous is CRAN, the Comprehensive R Archive Network (other sources include Bioconductor or GitHub) - I am currently using: ```r getOption("repos") ``` ``` ## CRAN ## "https://cran.rstudio.com/" ## attr(,"RStudio") ## [1] TRUE ``` --- ## `renv` and repositories - `renv` creates an `R` project-related library and installs packages into that specific library - Hence, it needs to know where (a) package(s) can be found; not all packages can be found on CRAN --- ## How does `renv` work? - It is assumed that your work is based on the model of an "`R` project" (in *RStudio*) - Similar to `Git`, you first have to initialize your `R` project, letting `renv` know that it should take care of all things that are related to a reproducible `R` environment - At some moment in time, when everything is working fine, and you want to preserve this state of your `R` project, you can take an `R`-related "snapshot" of your project - This snapshot can be shared with others or your future self, guaranteeing that you can always restore this moment in time when everything in your `R` project was working well (meaning: in terms of a "reproducible `R` environment") --- ## Workflow in `renv` <img src="data:image/png;base64,#../img/bw/fig_renv.png" width="90%" style="display: block; margin: auto;" /> .smaller[Source: https://rstudio.github.io/renv/articles/renv.html] --- ## `renv::init()` - `renv::init()` is run once and adds a few new directories and files to your `R` project - `renv/library`: This directory is the project library that contains all packages currently used by your project - `renv.lock`: Records metadata about every package ensuring that it can be re-installed on a new machine - `.Rprofile`: Runs automatically every time you start `R` (**in that project folder**), renv uses this file to configure your `R` session to use the project library .small[Source: https://rstudio.github.io/renv/articles/renv.html#getting-started] --- ## `renv` in *RStudio* <img src="data:image/png;base64,#../img/bw/fig_renv-rstudio.png" width="75%" style="display: block; margin: auto;" /> --- ## `renv::snapshot()` and `renv::restore()` - Running `renv::snapshot()` updates `renv.lock` and documents the current collection (and their specific versions) of `R` packages that are being used in your `R` project - To share an `R` project, you need to provide access to `renv.lock`, `.Rprofile`, `renv/settings.json` and `renv/activate.R` -- or, from a Git perspective, that's all you need to commit into your `Git` repository[1] - Next, using `renv::restore()` a collaborator can exactly reproduce (download and install specific `R` packages) your `R` environment .smaller[[1] Conveniently, the initial `renv::init()` run also creates a `.gitignore` file in the `renv` folder excluding everything that does not need to go into the `Git` repository] --- ## Installing additional `R` packages Use `install.packages()` or `renv::install()` --- ## Updating `R` packages in your `R` project Use `renv::update()`... --- class: center, middle # More options for dependency management in `R` --- ## `rang` > "The goal of rang (Reconstructing Ancient Number-crunching Gears) is to > obtain the dependency graph of R packages at a specific time point. > Although this package can also be used to ensure the **current R computational > environment** can be reconstructed by future researchers, this package gears > towards **reconstructing historical R computational environments** which have not > been completely declared. For the former purpose, **packages such as renv, > groundhog, miniCRAN, and Require should be used**. One can think of rang as an > archaeological tool" (emphasis added). .small[Source: https://github.com/gesistsa/rang] --- ##`rang` (cont'd.) For more information, see: - https://chainsawriot.github.io/gesis2023_rang/#/title-slide - https://github.com/gesistsa/rang