Data quality


Note: Currently, this page is under construction...

Introduction

Data quality in the social sciences is usually discussed under the headings of validity and reliability when talking about concepts, measurement and data. However, transparency and data quality as such are rarely discussed together, and there is no consistent understanding of what data quality means in the social sciences. In their seminal book "Designing Social Inquiry", King et al. (1994, 24) set out five guidelines for "improving data quality", encompassing

  1. valid measures,
  2. reliable data collections,
  3. replicable analyses,
  4. a thorough documentation of the data generating process, and
  5. to "collect data on as many of [a theory’s] observable implications as possible".

Nowadays, however, much more data is available, posing new challenges: counterfeit data, typos and unintended mistakes in the data collection, for example, go beyond the problems of invalid and unreliable measures. The structured documentation of each indicator (including metadata) in WeSISpedia, common coding rules and formatting standards, and the resources provided in WeSIS, for example public data sets, community notebooks or an indicator's version control, directly speak to King et al.'s suggestions. Still, we also opted for a broader concept based on data quality definitions commonly used in information systems design (cf. Pipino et al. 2002). Concise and consistent representation or "free-of-errorness", for example, are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide).

While such automated checks improve, among other things, the compatibility with other social science data collections and thus strengthen the "I" and "R" in the FAIR principles for research data (cf. Wilkinson et al. 2016), namely "interoperability" and "re-usability", there are vaguer aspects of data quality such as believability, reputation or objectivity that go beyond the idea of statistical confidence and uncertainty measures. Seldom addressed, or even made explicit, in social science research, they are nevertheless at the core of many scientists' common-sense understanding of "data quality".

Initial discussions in the CRC on implementing a "traffic light system" for rating "data accuracy" highlighted the complexity of designing a scheme that is abstract and intuitive enough to immediately convey this metadata to external users while at the same time being simple to code and broad enough to cover all, or at least "typical", instances that occurred or may occur in the "data generation process". At the same time, there are few attempts in the social sciences to rate data in this way, let alone a "gold standard" framework.

Further workshops therefore focused on a proposal for rating the quality of data in WeSIS rooted in a "provenance approach" (see e.g., Prat and Madnick 2007 or Buneman et al. 2001), which puts the origin of the data in the spotlight. The proposal entailed rating three dimensions (trustworthiness of a source, reasonableness of data, and confidence in data), each connected to a suggested rating scale drawn from other disciplines, e.g., the likelihood scale from the IPBES "Guide on the production of assessments" (IPBES 2018). The main outcome of the discussion was that the first two dimensions were largely uncontroversial, although the suggested scales needed refinement and adaptation to the peculiarities of the CRC's data collections. However, the discussion also showed that, despite a shared appreciation of the value of such ratings, social science is still marked by a reluctance to "judge" the data, given the work necessary to define a feasible scale and wording, concerns about "devaluing one's own work", and possible deviations and bias across people and projects.

Two subsequent workshops underlined the difficulty of balancing the need to implement a rating scheme, the prospect of creating added value, and the (additional) workload required to provide and code the necessary information. Ultimately, an agreement was reached that goes beyond the "typical" social science concepts of validity and reliability and entails (1) labeling each indicator, at the indicator level, according to whether it was "created" by domain experts of the CRC, and (2) coding notable concerns at the data point/value level.

Dimensions of data quality in WeSIS

Against this background, "data quality" in WeSIS encompasses three main elements:


1. Automated validation checks

Concise and consistent representation and "free-of-errorness" are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide). The check not only requires data providers to comply with the agreed coding rules and file formatting standards but also looks for "common" mistakes such as typos, mismatches between country codes and names, or mismatches between values and scales.
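
To illustrate the kind of checks this step performs, the following minimal Python sketch flags country code/name and value/scale mismatches. The column names, lookup tables and error handling are illustrative assumptions for this example only; the actual WeSIS validation is described in the file upload guide.

  import pandas as pd

  # Assumed lookup tables for this sketch; the real WeSIS scheme is more extensive.
  COUNTRY_CODES = {"DEU": "Germany", "FRA": "France"}
  SCALE_RANGES = {"percent": (0.0, 100.0), "ratio": (0.0, 1.0)}

  def validate_upload(df: pd.DataFrame) -> list:
      """Collect human-readable error messages instead of failing on the first issue."""
      errors = []
      for idx, row in df.iterrows():
          # Country code vs. country name mismatch
          expected_name = COUNTRY_CODES.get(row["country_code"])
          if expected_name is None:
              errors.append(f"row {idx}: unknown country code {row['country_code']!r}")
          elif expected_name != row["country_name"]:
              errors.append(f"row {idx}: code {row['country_code']} does not match name {row['country_name']!r}")
          # Value vs. scale mismatch
          low, high = SCALE_RANGES.get(row["scale"], (float("-inf"), float("inf")))
          if not low <= row["value"] <= high:
              errors.append(f"row {idx}: value {row['value']} outside the {row['scale']} range [{low}, {high}]")
      return errors

  if __name__ == "__main__":
      sample = pd.DataFrame(
          {"country_code": ["DEU", "FRA"],
           "country_name": ["Germany", "Frankreich"],  # deliberate code/name mismatch
           "value": [42.0, 150.0],                     # second value exceeds the percent range
           "scale": ["percent", "percent"]}
      )
      for message in validate_upload(sample):
          print(message)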


2. Data origin

Beyond the structured documentation of each indicator detailing its sources, each indicator is labeled with one of three categories describing the origin of the data at the most abstract level:

  • Original CRC data: Data gathered, compiled and coded by CRC researchers, e.g., qualitative judgments, text codings or annotations.
  • CRC compiled data: Data gathered, harmonized and/or extended by CRC researchers, i.e., merely statistical data aggregated via reproducible workflows (scripts), incl. imputations.
  • 3rd party data: Data taken "as is" from third-party sources without further treatment, except for matching the source's country codes to the WeSIS scheme and adding the necessary metadata required for upload to WeSIS.
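
As an illustration only, the three categories could be represented at the indicator level along the lines of the following Python sketch; the class and attribute names are assumptions for this example and not the actual WeSIS data model.

  from dataclasses import dataclass
  from enum import Enum

  class DataOrigin(Enum):
      ORIGINAL_CRC_DATA = "Original CRC data"   # gathered, compiled and coded by CRC researchers
      CRC_COMPILED_DATA = "CRC compiled data"   # harmonized/extended via reproducible workflows
      THIRD_PARTY_DATA = "3rd party data"       # taken "as is" from third-party sources

  @dataclass
  class Indicator:
      name: str
      origin: DataOrigin

  example = Indicator(name="example_indicator", origin=DataOrigin.CRC_COMPILED_DATA)
  print(example.origin.value)  # prints "CRC compiled data"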


3. Plausibility concerns

The workshops made clear that a single numerical rating, such as a likelihood attached to a value, would not fit the heterogeneity of concerns encountered in the CRC's data collections. Instead, the rating captures the concerns researchers may have regarding the consistency or plausibility of a data value, i.e., it is coded at the data point level. The guiding question and the corresponding nominal response categories are:

Q: "Are there notable concerns about the plausibility of this data point?"

  • No: (the default)
  • Yes, from a time-series perspective: Concerns refer to implausible values such as breaks or jumps in a time series, or large deviations for which there is no obvious explanation.
  • Yes, from a cross-sectional perspective: Concerns refer to implausible values when comparing the focal value to the values/levels of other entities at the focal time point.
  • Yes, due to limited sources: Concerns refer to values for which there is only limited information from uncertain or unsound sources.
  • Yes, due to contradicting sources: Concerns refer to values for which at least two sources exist but contain contradicting or otherwise mismatching information.
  • Other: Reasons not covered by the other response categories.

If multiple concerns are raised for the focal value, only the most prevalent one is coded. If a researcher chooses "Other" as the response, an explanation must be provided in the additional "comment" column of the WeSIS upload file template.
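
The following minimal Python sketch illustrates how the response categories and the "Other requires a comment" rule could be checked in an upload file; the column names ("plausibility_concern", "comment") are assumptions for this example and may differ from the actual WeSIS upload template.

  import pandas as pd

  CONCERN_CATEGORIES = {
      "No",
      "Yes, from a time-series perspective",
      "Yes, from a cross-sectional perspective",
      "Yes, due to limited sources",
      "Yes, due to contradicting sources",
      "Other",
  }

  def check_concerns(df: pd.DataFrame) -> list:
      """Return error messages for invalid or incompletely coded plausibility concerns."""
      errors = []
      for idx, row in df.iterrows():
          concern = row["plausibility_concern"]
          if concern not in CONCERN_CATEGORIES:
              errors.append(f"row {idx}: unknown response category {concern!r}")
          # "Other" is only valid if accompanied by an explanation in the comment column
          comment = row.get("comment", "")
          if concern == "Other" and (pd.isna(comment) or not str(comment).strip()):
              errors.append(f"row {idx}: 'Other' requires an explanation in the comment column")
      return errors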



References

  • Buneman, Peter, Sanjeev Khanna, and Wang-Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. University of Pennsylvania, Pennsylvania: Department of Computer & Information Science. Departmental Papers (CIS).
  • Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). 2018. The IPBES Guide on the production of assessments. Core version. IPBES/6/INF/17. https://www.ipbes.net/document-library-catalogue/ipbes6inf17 (Accessed January 17, 2023)
  • King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press.
  • Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. 2002. "Data quality assessment." Communications of the ACM 45 (4): 211–18. https://doi.org/10.1145/505248.506010
  • Prat, Nicolas, and Stuart E. Madnick. 2007. Measuring Data Believability: A Provenance Approach. Massachusetts Institute of Technology, Cambridge: Composite Information Systems Laboratory (CISL), Sloan School of Management.
  • Wilkinson, Mark D., Michel Dumontier, I. J. J. Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz B. da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A. C. 't Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 2016. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific data 3: 160018. https://doi.org/10.1038/sdata.2016.18