Highlighted Selections from:

Archiving Reproducible Research with R and Dataverse


Leeper, Thomas. “Archiving Reproducible Research with R and Dataverse.” The R Journal (2014): 1–8. Print. Accepted, to appear in the next volume. http://journal.r-project.org/archive/accepted/leeper.pdf

p.1: Reproducible research and data archiving are increasingly important issues in research involving statistical analyses of quantitative data. This article introduces the dvn package, which allows R users to publicly archive datasets, analysis files, codebooks, and associated metadata in Dataverse Network online repositories, an open-source data archiving project sponsored by Harvard University. In this article I review the importance of data archiving in the context of reproducible research, introduce the Dataverse Network, explain the implementation of the dvn package, and provide example code for archiving and releasing data using the package. -- Highlighted apr 20, 2014

p.1: A reproducible research workflow is most valuable when it allows others to replicate results and reproduce published results precisely. Thus complete records of all data and analyses need to be publicly and persistently available via a stable, freely available archive. -- Highlighted apr 20, 2014

p.1: Thus researchers remain in need of a service dedicated to the persistent preservation of data with sophisticated metadata support to allow others to easily find, understand, and use archived files. One such option is The Dataverse Network, a free-to-use, open-source data archive sponsored and developed by the Institute for Quantitative Social Science at Harvard University (Altman et al., 2001; King, 2007). -- Highlighted apr 20, 2014

p.1: The Dataverse Network is an increasingly prominent repository of data and associated files and is the default archive for many journals that currently require public archiving of replication files as a condition of publication. -- Highlighted apr 20, 2014

p.2: Each study archived in a dataverse contains, at a minimum, an identifying title, but may additionally include an array of metadata and files. This means The Dataverse Network can be used to store not just a free-standing data file, but associated files in almost any format; one might include files such as data, codebooks, analysis replication files, statistical packages, questionnaires, experimental materials, and so forth. -- Highlighted apr 20, 2014

p.2: Once created, a study is given a global identifier in the form of either a Handle or a DataCite DOI, digital object identifiers that uniquely and globally identify the study. Studies are version controlled when new files are added or when metadata is modified, and previous versions remain persistently accessible. Furthermore, when data files are modified, the study identifier is updated with additional version information in the form of a Universal Numeric Fingerprint (UNF) (Altman and King, 2007; Altman, 2008). -- Highlighted apr 20, 2014

p.2: This means that data stored in a dataverse can be cited (e.g., in scholarly publications) not only by their handle but also by their specific version using the UNF. -- Highlighted apr 20, 2014

p.2: Reproducible research efforts can thus be enhanced by use not only of the same named dataset but the exact same version of that dataset, a major advantage over storing data in other types of archive. -- Highlighted apr 20, 2014
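
Note: the version-pinning idea in the p.2 highlights can be illustrated with a minimal sketch. This is not the UNF algorithm and not part of dvn; it substitutes a plain SHA-256 digest of the file bytes as a stand-in fingerprint, and the class name, function names, and the identifier "hdl:EXAMPLE/0000" are all hypothetical. It only shows why a citation that carries a fingerprint lets a reader confirm they hold the exact version that was cited.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCitation:
    """Hypothetical citation record: study identifier plus a version fingerprint."""
    identifier: str   # persistent study identifier (placeholder value used below)
    fingerprint: str  # fingerprint of the data as it was when cited


def fingerprint(data: bytes) -> str:
    # Stand-in for a UNF: a SHA-256 hex digest of the raw bytes (an assumption,
    # chosen only because it is in the standard library).
    return hashlib.sha256(data).hexdigest()


def matches_citation(data: bytes, citation: DatasetCitation) -> bool:
    # True only when the data in hand is byte-for-byte the cited version.
    return fingerprint(data) == citation.fingerprint


# Two versions of the "same named" dataset, differing in one value.
original = b"id,score\n1,0.5\n2,0.7\n"
revised = b"id,score\n1,0.5\n2,0.8\n"

# Cite the original version under a made-up identifier.
citation = DatasetCitation(
    identifier="hdl:EXAMPLE/0000",
    fingerprint=fingerprint(original),
)

print(matches_citation(original, citation))
print(matches_citation(revised, citation))
```

The identifier alone would match both versions; only the fingerprint tells them apart. A byte-level hash like the one above changes whenever the file encoding changes, so it should be read as an analogy for the versioning role the highlights describe, not as a substitute for a UNF.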

p.7: With the popularity of The Dataverse Network increasing, it is a logical choice for archiving one’s reproducible data and analysis files. dvn thus provides R users with the tools necessary to quickly and easily integrate data archiving into the reproducible research workflow. -- Highlighted apr 20, 2014