Data quality
Note: Currently, this page is under construction...
Introduction
Data quality in the social sciences is often discussed under the headings of validity and reliability when talking about concepts, measurement and data. However, few discuss transparency and data quality, and there is no consistent understanding of what data quality means in the social sciences. In their famous book "Designing Social Inquiry", King et al. (1994, 24) set out five guidelines for "improving data quality" encompassing
- valid measures,
- reliable data collections,
- replicable analyses,
- a thorough documentation of the data generating process, and
- the aim to "collect data on as many of [a theory’s] observable implications as possible".
Nowadays, much more data is available, which poses new challenges: counterfeit data, typos, and unintended mistakes in data collection, for example, go beyond invalid and unreliable measures. The structured documentation of each indicator (including metadata) in WeSISpedia, common coding rules and formatting standards, and the resources provided in WeSIS (for example, public data sets or community notebooks) directly speak to King et al.'s suggestions. Still, we also opted for a broader concept based on data quality definitions commonly used in information systems design (cf. Pipino et al. 2002). Concise and consistent representation and "free-of-errorness", for example, are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide).
While such automated checks improve, among other things, the compatibility with other social science data collections and thus strengthen the "I" and "R" in the FAIR concept of research data (cf. Wilkinson et al. 2016) -- namely "interoperability" and "re-usability" -- there are vaguer aspects of data quality, like believability, reputation or objectivity, that go beyond the idea of statistical confidence and uncertainty measures. Seldom addressed, or even made explicit, in social science research, they are still at the core of many scientists' common-sense understanding of "data quality".
Initial discussions in the CRC on implementing a "traffic light system" for rating "data accuracy" highlighted the complexity of designing a scheme that is abstract and intuitive enough to immediately reveal this metadata to external users while at the same time being simple to code and broad enough to cover all, or at least the "typical", instances that occurred or may occur in the "data generation process".
Further workshops focused on a proposal for rating the quality of data in WeSIS rooted in a "provenance approach" (see e.g., Prat and Madnick 2007 or Buneman et al. 2001), which puts the origin of the data in the spotlight. The proposal entailed rating three dimensions (trustworthiness of a source, reasonableness of data, and confidence in data), each connected to a suggested rating scale. The main outcome of the discussion was that the first two dimensions were quite uncontroversial, although the suggested scales needed refinement. However, the discussion also showed that, despite a common understanding of the value of such ratings, social science is still marked by a resistance to "judging" the data. The reasons include the work necessary to define a feasible scale and wording, concerns about "devaluing one's own work", and concerns about deviations and bias across people and projects.
Two subsequent workshops underlined the difficulty of balancing the need for a rating scheme against the prospect of creating added value and the (additional) workload required to provide and code the necessary information. Ultimately, an agreement was reached that goes beyond the "typical" social science concepts of validity and reliability. It entails (1) labelling each indicator, at the indicator level, as to whether it was "created" by CRC researchers, and (2) coding notable concerns at the data point/value level.
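To make the two levels of this agreement more tangible, the following is a minimal sketch in Python of how such information might be represented. It is not the WeSIS data model: the class names, the fields `created_by_crc` and `concern`, the indicator name, and the example values are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataPoint:
    country_code: str
    year: int
    value: float
    # Data point/value level: free-text note on a notable concern.
    # None means that no concern was coded (hypothetical convention).
    concern: Optional[str] = None


@dataclass
class Indicator:
    name: str
    # Indicator level: was the indicator "created" by CRC researchers?
    created_by_crc: bool
    points: List[DataPoint] = field(default_factory=list)

    def flagged(self) -> List[DataPoint]:
        """Return the data points that carry a coded concern."""
        return [p for p in self.points if p.concern is not None]


# Purely illustrative values, not taken from WeSIS.
indicator = Indicator(
    name="example_indicator",
    created_by_crc=True,
    points=[
        DataPoint("DEU", 1990, 1.0),
        DataPoint("DEU", 1991, 1.0, concern="sources disagree on the year"),
    ],
)
flagged = indicator.flagged()
```

Keeping the "created" label on the indicator and the concern on the individual value mirrors the two levels named above: the label is coded once per indicator, whereas concerns are only coded where they are notable.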
Dimensions of data quality in WeSIS
Against this background, "data quality" in WeSIS encompasses three main elements:
Automated validation checks
Concise and consistent representation and "free-of-errorness" are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide). The check not only requires data providers to comply with the agreed coding rules and file formatting standards but also looks for "common" mistakes like typos and country code--name or value--scale mismatches.
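The kind of check described here can be sketched as follows. This is a simplified illustration, not the actual WeSIS upload validation: the column names (`country_code`, `country_name`, `value`), the small reference table, and the scale bounds are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical reference table; the code list actually used by WeSIS
# is not described on this page.
COUNTRY_NAMES = {"DEU": "Germany", "FRA": "France", "BRA": "Brazil"}


@dataclass
class Issue:
    row: int      # 1-based row number in the uploaded data
    field: str    # column in which the problem was found
    message: str  # human-readable description


def validate_rows(rows, scale_min, scale_max):
    """Check rows (a list of dicts) and return a list of Issue objects."""
    issues = []
    for i, row in enumerate(rows, start=1):
        code = row.get("country_code", "")
        name = row.get("country_name", "")
        expected = COUNTRY_NAMES.get(code)
        if expected is None:
            issues.append(Issue(i, "country_code", f"unknown code {code!r}"))
        elif name != expected:
            # Country code--name mismatch, often caused by a typo.
            issues.append(Issue(i, "country_name",
                                f"{name!r} does not match {code} ({expected})"))
        try:
            value = float(row.get("value", ""))
        except (TypeError, ValueError):
            issues.append(Issue(i, "value", "not a number"))
            continue
        if not scale_min <= value <= scale_max:
            # Value--scale mismatch: value lies outside the documented scale.
            issues.append(Issue(i, "value",
                                f"{value} outside scale [{scale_min}, {scale_max}]"))
    return issues


rows = [
    {"country_code": "DEU", "country_name": "Germany", "value": "3"},
    {"country_code": "FRA", "country_name": "Frnace", "value": "2"},  # typo in name
    {"country_code": "BRA", "country_name": "Brazil", "value": "7"},  # outside a 0-5 scale
]
issues = validate_rows(rows, scale_min=0, scale_max=5)
for issue in issues:
    print(issue.row, issue.field, issue.message)
```

In this sketch all problems are collected and reported together rather than stopping at the first one, so that a data provider could correct a file in a single pass. Whether the real check behaves this way is not stated here.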
References
- Buneman, Peter, Sanjeev Khanna, and Wang-Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. University of Pennsylvania, Pennsylvania: Department of Computer & Information Science. Departmental Papers (CIS).
- King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press.
- Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. 2002. "Data quality assessment." Communications of the ACM 45 (4): 211–18. https://doi.org/10.1145/505248.506010
- Prat, Nicolas, and Stuart E. Madnick. 2007. Measuring Data Believability: A Provenance Approach. Massachusetts Institute of Technology, Cambridge: Composite Information Systems Laboratory (CISL), Sloan School of Management.
- Wilkinson, Mark D., Michel Dumontier, I. J. J. Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz B. da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A. C. 't Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 2016. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific Data 3: 160018. https://doi.org/10.1038/sdata.2016.18