How to create a simple R Package to easily share your data

This post steps you through how to create a simple R package to share your data.

By Harriet Goers in R packages, Data, Strategies of Resistance Data Project

March 11, 2022

This post will help you write a package that provides people with easy access to your data. Increasing the ease with which people can access your data should increase its impact. I will be writing this post as I develop my package, so it will change over time.

I am developing a package that provides R users with easy access to the Strategies of Resistance Data Project. SRDP seeks to advance our understanding of how self-determination movements unfold over time, and the conditions under which self-determination organizations use conventional politics, violent tactics, non-violent tactics, or some combination of these. Feel free to use the package to access the data!

Whenever I develop an R package, I keep Hadley Wickham and Jenny Bryan’s R Packages open. It is an incredibly thorough resource, and I highly recommend that you give it a read.

STEP 2: Edit the DESCRIPTION

  1. Head over to the DESCRIPTION file.

  2. Edit the relevant fields. Here is what mine looks like:

Package: sRdpData
Type: Package
Title: Strategies of Resistance Data Project
Version: 0.1.0
Author: Harriet Jane Goers
Maintainer: Harriet Jane Goers <hgoers@umd.edu>
Description: This package provides you with easy, programmatic access to SRDP data.
License: `use_mit_license()`
Encoding: UTF-8
LazyData: true
URL: https://github.com/hgoers/sRdpData
BugReports: https://github.com/hgoers/sRdpData/issues
Depends: 
    R (>= 2.10)
RoxygenNote: 7.1.2

  3. Commit your changes to your git repo.

STEP 3: Add the data to the package

I am now going to clean the raw data and add it to my package. I will start with the organization-level data.

usethis::use_data_raw("orgs")

This creates a script in which I can clean up the raw data.

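  1. I will first store a copy of the raw data in the data-raw folder that has been created. I then clean it in the new script and save the result to the package:

library(tidyverse)

# Read the raw Stata file, drop Stata labels, and standardize column names
orgs <- rio::import(here::here("data-raw", "SRDP_Org_2019_release.dta")) %>%
  labelled::remove_labels() %>%
  janitor::clean_names() %>%
  # Keep the identifiers, the year, and violence_state through political_nocoop
  select(kgcid, group_name = group, facid, fac_name = facname, year, violence_state:political_nocoop) %>%
  # Store each of those variables as a factor
  mutate(across(violence_state:political_nocoop, factor))

# Save the object as data/orgs.rda; the object name becomes the dataset name
usethis::use_data(orgs, overwrite = TRUE)

  2. I now do the same for the group-level data.

usethis::use_data_raw("groups")

  3. And in the resulting script:

library(tidyverse)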

# Read the raw Stata file, drop Stata labels, and standardize column names
groups <- rio::import(here::here("data-raw", "SRDP_Mvmt_2019_release.dta")) %>%
  labelled::remove_labels() %>%
  janitor::clean_names() %>%
  select(kgcid, group_name = group, year, country) %>%
  # Standardize country names, then add ISO 3166-1 alpha-3 codes
  mutate(country = countrycode::countrycode(country, "country.name", "country.name"),
         country_iso3c = countrycode::countrycode(country, "country.name", "iso3c"))

# Save the object as data/groups.rda
usethis::use_data(groups, overwrite = TRUE)

  4. I now need to document the data. To do this, I create a new script in the R folder.

usethis::use_r("data")

  5. I then document the datasets in the resulting script using roxygen. For details on how to document datasets, head over to the External Data chapter of R Packages.

  6. I then create the documentation by running the following:

devtools::document()

  7. I then preview the documents by calling ?orgs and ?groups. As you can see, this documentation serves the same role as the variables section in your codebook. This provides people with very easy access to the variable descriptions and units.

  8. As usual, commit your changes to your git repo.
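
For reference, a dataset's roxygen block in R/data.R might look roughly like the sketch below. The column names come from the cleaning script above, but the descriptions are placeholders that I have made up for illustration; the real definitions should be copied from the SRDP codebook.

```r
#' Organization-level SRDP data
#'
#' One row per organization-year. The column descriptions below are
#' placeholders (assumptions), not the official codebook definitions.
#'
#' @format A data frame with one row per organization-year and,
#'   among others, the following columns:
#' \describe{
#'   \item{kgcid}{Group identifier (placeholder description).}
#'   \item{group_name}{Name of the group.}
#'   \item{facid}{Organization identifier (placeholder description).}
#'   \item{fac_name}{Name of the organization.}
#'   \item{year}{Year of the observation.}
#' }
#' @source Strategies of Resistance Data Project.
"orgs"
```

The string on the last line names the dataset being documented, so it should match the object saved with use_data().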

STEP 4: Create the functions to call the datasets

I would like to provide people with some functions that will allow them to access the datasets in specific formats:

  • One observation (row) for each group/organization.

  • One observation (row) for each group/organization-year dyad.

  1. To do this, I call the following:
usethis::use_r("access_data.R")
  2. In the resulting script, I write my function:

srdp_orgs <- function(wide = FALSE) {

  if (isTRUE(wide)) {

    # One row per organization, with its first and last observed year
    sRdpData::orgs %>%
      dplyr::select(kgcid:year) %>%
      dplyr::group_by(kgcid, group_name, facid, fac_name) %>%
      dplyr::summarise(start_year = min(year),
                       end_year = max(year),
                       .groups = "drop") %>%
      # Years at the edge of the 1960-2005 coverage window are recoded to NA
      dplyr::mutate(start_year = dplyr::na_if(start_year, 1960),
                    end_year = dplyr::na_if(end_year, 2005))

  } else {

    # One row per organization-year
    sRdpData::orgs %>%
      dplyr::select(kgcid:year) %>%
      tibble::as_tibble()

  }

}

  3. I then test the function by calling load_all(). I should now be able to use the function srdp_orgs().

  4. As usual, commit your changes to your git repo.
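
To make the wide format concrete, here is a minimal base R sketch of the same collapsing logic on a toy data frame. The values are made up, and the sketch uses tapply() rather than the dplyr calls in the function itself, so treat it as an illustration of the idea rather than the package code.

```r
# Toy organization-year data; all values are made up for illustration.
toy <- data.frame(
  facid = c(1, 1, 1, 2, 2),
  year  = c(1960, 1961, 1962, 2004, 2005)
)

# First and last observed year for each organization.
first_year <- tapply(toy$year, toy$facid, min)
last_year  <- tapply(toy$year, toy$facid, max)

wide <- data.frame(
  facid      = as.numeric(names(first_year)),
  start_year = as.vector(first_year),
  end_year   = as.vector(last_year)
)

# A year on the edge of the 1960-2005 window may be censored rather than a
# true start or end, so it is recoded to NA (this mirrors dplyr::na_if()).
wide$start_year[wide$start_year == 1960] <- NA
wide$end_year[wide$end_year == 2005] <- NA

print(wide)
```

Organization 1 is first observed in 1960, so its start year becomes NA; organization 2 is last observed in 2005, so its end year becomes NA.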

STEP 5: Check your work

You should regularly check your work for errors.

  1. Run check().

  2. Carefully read the final section of the output. At this stage, you should have some warnings and notes.

  3. One warning will look something like this:

  '::' or ':::' imports not declared from:
    ‘dplyr’ ‘tibble’

Adding dependencies

This is telling you that you need to declare the packages upon which your package depends. When you use any functions outside of base R, you will need to do this. Users of your package will need to have these packages on their system. To declare a package, run the following:

usethis::use_package("package")

Now, you can rerun check() and your warning should be gone.

I often use pipes in my code. A handy way of ensuring that you document this dependency is: `usethis::use_pipe()`.

  1. As usual, commit your changes to your git repo.
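
If you want to double check which packages ended up declared, a small base R sketch can read the Imports field back out of DESCRIPTION. The helper name declared_imports() is hypothetical (it is not part of usethis or devtools), and the demo writes a throwaway DESCRIPTION so that it runs anywhere; in practice you would point it at your package's own file.

```r
# Hypothetical helper: list the packages declared under Imports in a
# DESCRIPTION file, using base R only.
declared_imports <- function(path = "DESCRIPTION") {
  desc <- read.dcf(path, fields = "Imports")
  if (is.na(desc[1, "Imports"])) return(character(0))
  pkgs <- strsplit(desc[1, "Imports"], ",")[[1]]
  # Drop any version constraint such as "(>= 1.0.0)" and stray whitespace
  trimws(sub("\\(.*\\)", "", pkgs))
}

# Demo on a temporary file that mimics the layout usethis writes.
tmp <- tempfile()
writeLines(c(
  "Package: sRdpData",
  "Version: 0.1.0",
  "Imports:",
  "    dplyr,",
  "    magrittr,",
  "    tibble"
), tmp)

imports <- declared_imports(tmp)
print(imports)

# The packages named in the check() warning should now be listed.
all(c("dplyr", "tibble") %in% imports)
```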

STEP 6: Document your functions

  1. I now put my cursor anywhere in the function I just wrote.

  2. I then head up to the RStudio menu bar and go to Code > Insert Roxygen Skeleton.

  3. I now fill out the relevant fields:

#' Access to SRDP organization-level data.
#'
#' This function provides a dataset of all organizations, their groups, and
#' their start and end dates. It covers the period between 1960 and 2005. You
#' can use this function to access a long dataframe (one observation for each
#' organization-year dyad), or a wide dataframe (one observation for each
#' organization, with their start and end years provided in specific columns).
#'
#' Please note, if an organization started before 1960, its start year is listed
#' as \emph{NA}. If an organization ended after 2005, its end year is listed as \emph{NA}.
#'
#' @param wide Logical. When FALSE (default), provides a dataframe with one observation for every organization-year dyad. When TRUE, provides a dataframe with one observation for every organization.
#'
#' @return A tibble with each organization's kgcid, group name, facid, and faction name, plus the year (long format) or the start and end years (wide format).
#' @export
#'
#' @examples
#' orgs <- srdp_orgs(wide = TRUE)
srdp_orgs <- function(wide = FALSE) {

  if (isTRUE(wide)) {

    # One row per organization, with its first and last observed year
    sRdpData::orgs %>%
      dplyr::select(kgcid:year) %>%
      dplyr::group_by(kgcid, group_name, facid, fac_name) %>%
      dplyr::summarise(start_year = min(year),
                       end_year = max(year),
                       .groups = "drop") %>%
      # Years at the edge of the 1960-2005 coverage window are recoded to NA
      dplyr::mutate(start_year = dplyr::na_if(start_year, 1960),
                    end_year = dplyr::na_if(end_year, 2005))

  } else {

    # One row per organization-year
    sRdpData::orgs %>%
      dplyr::select(kgcid:year) %>%
      tibble::as_tibble()

  }

}

  4. I now call document() to add these comments into our package documentation.

  5. I can see the resulting documentation by calling ?srdp_orgs.

  6. As usual, commit your changes to your git repo.

STEP 7: Testing

This is a very simple script. Nonetheless, it is always worthwhile to add some tests to make sure your functions are behaving as you want them to.

  1. Call use_testthat(). This sets up testing infrastructure in your package.

  2. Next, call use_test("function_you_want_to_test"). This will create a new script in which you will write your test. I like to name my test scripts after the function scripts that I am testing.

  3. Now, write a test for your function. Here’s mine:

test_that("srdp_orgs() returns the organization-level dataframe", {

  expect_equal(length(srdp_orgs()), 5)

})

  4. Load library(testthat).

  5. Call load_all().

  6. Call test(). Carefully read the output and adjust your test as needed.

  7. Write tests for your other functions (as needed).
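
As a rough idea of what a slightly stricter test could look like, the sketch below checks the column names rather than only the column count. It assumes testthat is installed, and it uses a made-up stand-in function, srdp_orgs_stub(), so that it runs outside the package; in the real test file you would call srdp_orgs() instead.

```r
library(testthat)

# Stand-in for srdp_orgs() so this sketch runs outside the package.
# The values are made up; only the column names follow the post.
srdp_orgs_stub <- function() {
  data.frame(
    kgcid      = c(1, 1),
    group_name = c("A", "A"),
    facid      = c(10, 10),
    fac_name   = c("A1", "A1"),
    year       = c(1960, 1961)
  )
}

test_that("the long format has the expected columns", {
  out <- srdp_orgs_stub()
  expect_s3_class(out, "data.frame")
  expect_named(out, c("kgcid", "group_name", "facid", "fac_name", "year"))
})
```

Checking names instead of a bare count means the test fails with a more informative message if a column is renamed or dropped.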

STEP 8: Create a README to provide users with a quick reference guide for your package

A README serves as the homepage for your package. It should contain all kinds of useful information, including an introduction to your package, some examples of your functions, and instructions on how to download it for use.

  1. Call use_readme_rmd().

  2. Add some interesting examples and content.

  3. Once you are happy with it, call build_readme().

STEP 9: Check and then install

  1. Run check(). Address any errors, warnings, or notes (as needed).

  2. When you are happy with your package, run install().

  3. Now, you can load your package like any other! Call library(sRdpData).
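
A quick way to confirm the install worked is sketched below. It uses base R's requireNamespace(), so it runs whether or not the package is installed yet; the package and function names are the ones used in this post.

```r
# Check for the package without attaching it, then try the accessor function.
if (requireNamespace("sRdpData", quietly = TRUE)) {
  orgs_long <- sRdpData::srdp_orgs()
  print(head(orgs_long))
} else {
  message("sRdpData is not installed yet; run install() from the package root.")
}
```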

STEP 10: Share your package

  1. If your GitHub repo is public, people should now be able to install your package following the instructions automatically included in your README file.

  2. Next up, I will submit the package to CRAN. I will let you know how that goes…
