Highlighted Selections from:

Committing to Data Quality Review

Peer, Limor; Green, Ann; and Stephenson, Elizabeth. “Committing to Data Quality Review.” Practice Paper for the 9th International Digital Curation Conference (2014): 1–27. Pre-Print.

p.1: Amid the pressure and enthusiasm for researchers to share data, a rapidly growing number of tools and services have emerged. What do we know about the quality of these data? Why does quality matter? And who should be responsible for their quality? We believe an essential measure of data quality is the ability to engage in informed reuse, which requires that data are independently understandable (CCSDS, 2012). In practice, this means that data must undergo quality review, a process whereby data and associated files are assessed, and required actions are taken, to ensure files are independently understandable for informed reuse. This paper explains what we mean by data quality review, what measures can be applied to it, and how it is practiced in three domain-specific archives. We explore a selection of other data repositories in the research data ecosystem, as well as the roles of researchers, academic libraries, and scholarly journals in regard to their application of data quality measures in practice. We end with thoughts about the need to commit to data quality and who might be able to take on those tasks. -- Highlighted May 4, 2014

p.2: Judgments about the quality of data are often tied to specific goals such as authenticity, verity, openness, transparency, and trust (Altman, 2012; Bruce and Hillman, 2013). Data quality might also consist of a combination of goals. The categories of data quality as defined by Wang and Strong (1996) are often in competition with each other or prioritized differently by stakeholders. As Kevin Ashley (Ashley, 2013) recently observed, some may prize the completeness of the data while others their accessibility. He urges that curation practices “be explicit about quality metrics and curation processes in domain-independent ways.” -- Highlighted May 4, 2014

p.3: The concept of “informed use” has also made its way into recent efforts to establish common citation principles; among the “first principles” for data citation is the following: “Citations should facilitate access both to the data themselves and to such associated metadata and documentation as are necessary for both humans and machines to make informed use of the referenced data.” (CODATA, 2013) -- Highlighted May 4, 2014

p.3: One type of reuse – reproducing original analysis and results – sets an even higher bar for quality. When viewed through the lens of “really reproducible research,” data as well as code need to be made available to allow regeneration of the published results (Peer, 2013). -- Highlighted May 4, 2014

p.3: Victoria Stodden, speaking at the same conference, talked about the central role of algorithms and code in the reproducibility and credibility of science (Stodden, 2013b). The goal is to reduce the risk of having a less-than-perfect scientific record – for example, insufficient information about variables, values, coding, scales, and weighting, or lack of transparency in descriptions of methodology, sampling, and instrumentation – which makes it difficult to reuse data and to validate results, and hampers the transfer of knowledge and the progress of science. -- Highlighted May 4, 2014

p.6: A data quality review also involves some processing – examining and enhancing – of the actual data. These actions require performing various checks on the data, which can be both automated and manual procedures. The United Kingdom Data Archive (UKDA) provides a comprehensive list: “double-checking coding of observations or responses and out-of-range values, checking data completeness, adding variable and value labels where appropriate, verifying random samples of the digital data against the original data, double entry of data, statistical analyses such as frequencies, means, ranges or clustering to detect errors and anomalous values, correcting errors made during transcription.” In addition, data need to be reviewed for risk of disclosure of research subjects’ identities, of sensitive data, and of private information (Lyle, Alter & Green, 2014) and potentially altered to address confidentiality or other concerns. -- Highlighted May 4, 2014
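The automated portion of these checks lends itself to simple scripting. The sketch below illustrates three items from the UKDA list – out-of-range values, completeness, and frequency counts for spotting anomalous codes – in plain Python. The variable names, valid ranges, and sample records are hypothetical, not drawn from the paper.

```python
# Minimal sketch of automated data-quality checks of the kind the UKDA
# lists: range checks, completeness checks, and simple frequency counts.
# The codebook entries and sample records below are hypothetical.
from collections import Counter

def out_of_range(records, var, lo, hi):
    """Return row indices whose value for `var` falls outside [lo, hi]."""
    flagged = []
    for i, row in enumerate(records):
        v = row.get(var)
        if v is None:
            continue  # missing values are reported by completeness(), not here
        if (lo is not None and v < lo) or (hi is not None and v > hi):
            flagged.append(i)
    return flagged

def completeness(records, var):
    """Fraction of rows with a non-missing value for `var`."""
    present = sum(1 for row in records if row.get(var) is not None)
    return present / len(records)

def frequencies(records, var):
    """Value frequencies for `var`, useful for spotting anomalous codes."""
    return Counter(row[var] for row in records if row.get(var) is not None)

# Hypothetical sample: one out-of-range age, one missing income.
data = [
    {"age": 34,  "income": 52000},
    {"age": 150, "income": 61000},  # out of range: flagged by the range check
    {"age": 29,  "income": None},   # missing: lowers the completeness score
]

print(out_of_range(data, "age", 0, 120))  # -> [1]
print(completeness(data, "income"))       # 2 of 3 rows present
```

Checks like these cover only the mechanical part of the review; the UKDA items involving the original source data (double entry, verifying random samples) remain manual.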

p.6: As Victoria Stodden, a long-time advocate for code disclosure, put it, “A research process that uses computational tools and digital data introduces new potential sources of error: Were the methods described in the paper transcribed correctly into computer code? What were the parameter settings and input data files? How were the raw data filtered and prepared for analysis? Are the figures and tables produced by the code the same as reported in the published article? The list goes on.” -- Highlighted May 4, 2014
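Stodden’s last question – are the figures and tables produced by the code the same as those reported? – can be checked mechanically once data and code are deposited. A minimal, hedged sketch, in which the analysis, data values, and published statistic are all hypothetical:

```python
# Sketch of a reproducibility check: recompute a published summary
# statistic from the deposited data and compare it, within a numeric
# tolerance, to the value reported in the article. All values here
# are hypothetical.

def recompute_mean(values):
    """Stand-in for re-running the deposited analysis code on the data."""
    return sum(values) / len(values)

def matches_published(recomputed, published, tol=1e-6):
    """True if the regenerated result agrees with the published one."""
    return abs(recomputed - published) <= tol

deposited_data = [2.0, 4.0, 6.0]  # hypothetical raw values from the deposit
published_mean = 4.0              # value reported in the (hypothetical) article

print(matches_published(recompute_mean(deposited_data), published_mean))  # True
```

A tolerance is used rather than exact equality because floating-point results can differ slightly across platforms even when the analysis is faithfully reproduced.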

p.12: Research culture and habit seem to play a significant role (Sandve et al., 2013). Could research teams themselves take on more of the curatorial tasks, similar to those done by data archives? Part of the solution might be to incorporate the right training, guidance, and tools that support data quality into the habits of researchers as part of the efforts to make their data independently understandable over time. -- Highlighted May 4, 2014

p.12: These collaborative research environments do not provide long-term preservation, but ideally would develop seamless integration with long-lived repositories. The advantage of considering virtual research environments as essential components for data quality is that many of the data quality review tasks are performed before files are deposited in repositories. Capturing the ‘workflow’ of the research team could go a long way in addressing the challenges of producing data that are independently usable, especially if guidelines are followed in regard to documentation, file formats, persistent identifiers, and the inclusion of methodology statements and documents explaining research methods or decisions about sampling. -- Highlighted May 4, 2014
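The guideline elements named above can be turned into a pre-deposit checklist that a research environment might run before handing files to a repository. The following is only an illustrative sketch; the required fields and the sample deposit record are hypothetical, loosely modeled on the items the text mentions.

```python
# Hedged sketch: a pre-deposit checklist of the kind a virtual research
# environment might run before transferring files to a repository.
# The required fields are illustrative, drawn from the guideline items
# the text names (documentation, file formats, persistent identifiers,
# methodology statements).

REQUIRED_FIELDS = [
    "documentation",          # codebook or README describing the files
    "file_format",            # preferably an open, preservation-friendly format
    "persistent_identifier",  # e.g. a DOI assigned on deposit
    "methodology_statement",  # sampling and other methods decisions
]

def missing_fields(deposit):
    """Return the required metadata fields a deposit record lacks."""
    return [f for f in REQUIRED_FIELDS if not deposit.get(f)]

# Hypothetical deposit that is missing a methodology statement.
deposit = {
    "documentation": "codebook.pdf",
    "file_format": "CSV",
    "persistent_identifier": "doi:10.xxxx/example",  # hypothetical DOI
}

print(missing_fields(deposit))  # -> ['methodology_statement']
```

Running such a check before deposit front-loads part of the quality review, in the spirit of the workflow-capture argument above.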

p.18: This endeavor requires more and better tools, as well as smart, effective partnership among the various stakeholders. “The social nature of science and the network of interested stakeholders in the future of access to scientific data,” says Gold (2007), “make it essential to develop social and policy tools to support this future.” As Jones et al. (2006) observe, the key is “to find the balance of responsibility for documenting data between individual researchers and trained data stewards who have advanced expertise with appropriate metadata standards and technologies.” And the National Digital Stewardship Alliance recently urged the scientific community to “work together to raise the profile of digital preservation and campaign for more resources and higher priority given to digital preservation, and to highlight the importance of digital curation and the real costs of ensuring long term access” (NDSA, 2014). -- Highlighted May 4, 2014