Data quality
Data quality in the social sciences is often discussed under the headings of validity and reliability when talking about concepts, measurement and data. However, transparency is rarely discussed in connection with data quality, and there is no consistent understanding of what "data quality" means in the social sciences. This page describes the process, the rationale and the measures we undertook to improve data quality in WeSIS.
Introduction
In their famous book "Designing Social Inquiry", King et al. (1994, 24) set out five guidelines for "improving data quality" encompassing
- valid measures,
- reliable data collections,
- replicable analyses,
- a thorough documentation of the data generating process, and
- to "collect data on as many of [a theory’s] observable implications as possible".
Nowadays, however, much more data is available, which poses new challenges: counterfeit data, typos, and unintended mistakes in the data collection, for example, create problems beyond invalid and unreliable measures. The structured documentation of each indicator (including metadata) in WeSISpedia, the common coding rules and formatting standards, and the resources provided in WeSIS (for example, public data sets, community notebooks or an indicator's version control) directly speak to King et al.'s suggestions. Still, we also opted for a broader concept based on data quality definitions commonly used in information systems design (cf. Pipino et al. 2002). Concise and consistent representation or "free-of-errorness", for example, are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide).
While such automated checks improve, among other things, compatibility with other social science data collections and thus strengthen the "I" and "R" of the FAIR principles for research data (cf. Wilkinson et al. 2016), namely "interoperability" and "re-usability", there are vaguer aspects of data quality, such as believability, reputation or objectivity, that go beyond statistical confidence and uncertainty measures. Seldom addressed, or even made explicit, in social science research, they are nonetheless at the core of many scientists' common-sense understanding of "data quality", particularly given their domain-specific knowledge.
Initial discussions in the CRC on implementing a "traffic light system" for rating "data accuracy" revealed how difficult it is to design a scheme that is abstract and intuitive enough to share the knowledge about the data and to convey this metadata immediately to external users, while at the same time being simple to code and broad enough to cover all, or at least the "typical", instances that occurred or may occur in the data generation process. Moreover, there are few attempts in the social sciences to rate data at all, let alone a "gold standard" framework to build on.
Further workshops therefore focused on a proposal for rating the quality of data in WeSIS rooted in a "provenance approach" (see, e.g., Prat and Madnick 2007 or Buneman et al. 2001), which puts the origin of the data in the spotlight. The proposal entailed rating three dimensions (trustworthiness of a source, reasonableness of data, and confidence in data), each connected to a suggested rating scale drawn from other disciplines, e.g., the likelihood scale from the IPBES "Guide on the production of assessments" (IPBES 2018). The main outcome of the discussion was that the first two dimensions were largely uncontroversial, although the suggested scales needed refinement and adaptation to the peculiarities of the CRC's data collections. The discussion also showed, however, that despite a shared appreciation of the value of such ratings, social science is still marked by a resistance to "judging" the data, given the extra work necessary to define a feasible scale and wording and to apply it. Furthermore, concerns about "devaluing one's own work" and about deviations and bias across people and projects were raised.
Two subsequent workshops underlined the difficulty of balancing the commitment to implementing a rating scheme against the prospect of creating added value and against the (additional) workload required to provide and code the necessary information. Ultimately, an agreement was reached that goes beyond the "typical" social science concepts of validity and reliability: (1) each indicator is labeled, at the indicator level, according to whether it was "created" by domain experts of the CRC, and (2) notable concerns are coded at the data point/value level.
Dimensions of data quality in WeSIS
Against this background, "data quality" in WeSIS encompasses three main elements:
1. Automated validation checks
Concise and consistent representation and "free-of-errorness" are taken care of by the validation check performed when uploading data to WeSIS (for more details see the file upload guide). The check not only requires data providers to comply with the agreed coding rules and file formatting standards but also screens for "common" mistakes such as typos, mismatches between country codes and country names, or values outside the indicator's scale.
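The following sketch illustrates the kind of structural checks described above. It is not the actual WeSIS validator; the column names, the reference mapping of country codes, and the value range are illustrative assumptions.

```python
# Minimal sketch of upload-time consistency checks; column names, the country
# code mapping, and the value range are illustrative assumptions, not the
# actual WeSIS implementation.
import csv

# Hypothetical reference mapping of ISO3 codes to country names
COUNTRY_CODES = {"DEU": "Germany", "FRA": "France", "SWE": "Sweden"}

def check_upload(path, value_min=0.0, value_max=100.0):
    """Collect simple consistency problems found in an upload file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.DictReader(fh), start=2):  # line 1 = header
            code = row.get("country_code", "").strip()
            name = row.get("country_name", "").strip()
            expected = COUNTRY_CODES.get(code)
            if expected is None:
                problems.append(f"line {i}: unknown country code '{code}'")
            elif expected != name:
                problems.append(f"line {i}: code '{code}' does not match name '{name}'")
            try:
                value = float(row.get("value", ""))
            except ValueError:
                problems.append(f"line {i}: value '{row.get('value')}' is not numeric")
                continue
            if not value_min <= value <= value_max:
                problems.append(f"line {i}: value {value} outside scale [{value_min}, {value_max}]")
    return problems
```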
2. Data origin
Beyond the structured documentation of each indicator detailing its sources, every indicator is labeled with one of three categories that describe the origin of the data at the most abstract level.
| Category | Meaning |
|---|---|
| Original CRC Data | Data gathered, compiled and coded by CRC researchers, e.g., qualitative judgments, text codings or annotations. |
| CRC Compiled Data | Data gathered, harmonized and/or extended by CRC researchers, i.e., foremost statistical data aggregated via reproducible workflows (scripts) incl. imputations. |
| 3rd Party Data | Data taken "as is" without further treatment from third-party sources, except for matching the source's country codes to the WeSIS scheme and adding the necessary metadata required for upload to WeSIS. |
As the labels refer to an entire indicator, they are input via the mandatory column "data_origin" of each indicator table page (e.g., Old age and survivors). In WeSIS, the information is shown as a label (or sticker) on each indicator page directly below the variable name.
3. Plausibility concerns
The workshops made clear that a numerical rating, such as a likelihood value, would not fit the heterogeneity of concerns encountered in the CRC's data collections. Instead, the rating captures the concerns researchers may have, based on their domain knowledge, regarding the consistency or plausibility of a data value, i.e., at the data point level. The guiding question and the corresponding nominal response categories are:
Q: "Are there notable concerns about the plausibility of this data point?"
| Response | Meaning |
|---|---|
| No | (The default) |
| Yes, from a time-series perspective | Concerns refer to implausible values such as breaks or jumps in a time series, or large deviations for which there is no obvious explanation. |
| Yes, from a cross-sectional perspective | Concerns refer to implausible values when comparing the focal value to the values/levels of other entities at the focal time point. |
| Yes, due to limited sources | Concerns refer to values for which only limited information from uncertain or unsound sources is available. |
| Yes, due to contradicting sources | Concerns refer to values for which at least two sources exist but contain contradicting or otherwise mismatched information. |
| Other | Reasons not covered by the other response categories. |
If multiple concerns are raised for the focal value, only the most prevalent one is coded.
If a researcher chooses "Other" as the response, it is recommended to provide an explanation in the optional column "comment" of the WeSIS upload file template.
As the concerns are expressed at the value level, they are input together with the actual data via the mandatory column "plausibility_concerns" of the file upload templates. In WeSIS, the information is available in the data table of each indicator page.
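As an illustration, a check of the "plausibility_concerns" column could look like the following sketch; the exact response strings stored in the upload template and the column names are assumptions derived from the table above.

```python
# Sketch of a check for the "plausibility_concerns" column; the response
# strings and column names are assumptions based on the table above.
ALLOWED_CONCERNS = {
    "No",
    "Yes, from a time-series perspective",
    "Yes, from a cross-sectional perspective",
    "Yes, due to limited sources",
    "Yes, due to contradicting sources",
    "Other",
}

def check_concerns(rows):
    """rows: iterable of dicts with 'plausibility_concerns' and an optional 'comment'."""
    problems = []
    for i, row in enumerate(rows, start=2):  # line 1 = header
        concern = row.get("plausibility_concerns", "").strip()
        if concern not in ALLOWED_CONCERNS:
            problems.append(f"line {i}: unknown response '{concern}'")
        elif concern == "Other" and not row.get("comment", "").strip():
            problems.append(f"line {i}: 'Other' chosen but no comment provided")
    return problems
```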
Technical implementation
To implement the agreed solution, the upload workflow was modified with the release of WeSIS version 2.5. After discussing the pros and cons of different variants (e.g., parsing from the indicator table in WeSISpedia, adding a mandatory column to the upload file template, or exploiting the very first line of the csv file), it was decided to take the information for the label from the indicator table pages in WeSISpedia. As these are also used for the validation check during the upload, simply adding a column "data_origin" to the table of each policy field served the purpose. This column is now mandatory, standardized, and of type "String", allowing only the three categories mentioned above for coding the origin of data; if the entry is missing or invalid, nothing is displayed in WeSIS. The database (DB) structure was affected only in that one additional column called "data_origin" was added to the DB table "indicators".
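A minimal sketch of how the "data_origin" entry could be validated before the label is displayed is given below; the dictionary representation of a parsed indicator table row is an assumption, not the actual WeSIS code.

```python
# Illustrative sketch only: validating the "data_origin" entry of an indicator
# table row before the label is shown in WeSIS. The row structure is assumed.
DATA_ORIGIN_LABELS = {"Original CRC Data", "CRC Compiled Data", "3rd Party Data"}

def data_origin_label(indicator_row):
    """Return the label to display, or None if the entry is missing or invalid
    (in which case nothing is displayed in WeSIS)."""
    origin = (indicator_row.get("data_origin") or "").strip()
    return origin if origin in DATA_ORIGIN_LABELS else None
```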
For the plausibility concerns, the file upload templates were extended by a new column "plausibility_concerns". This column is now mandatory, standardized, and of type "String", allowing only the six categories mentioned above for coding researchers' concerns. The DB structure was affected in that one additional column called "plausibility_concerns" was added to the database tables "ds_values" and "ds_dyadic_values".
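Purely as an illustration of the described schema changes, a migration could look like the following sketch; the actual WeSIS database engine and column types are not documented here, so sqlite3 serves only as a stand-in.

```python
# Illustrative migration sketch; sqlite3 is a stand-in, as the actual WeSIS
# database engine and column types are not documented here.
import sqlite3

def add_quality_columns(db_path):
    con = sqlite3.connect(db_path)
    with con:
        # indicator-level data origin label
        con.execute("ALTER TABLE indicators ADD COLUMN data_origin TEXT")
        # value-level plausibility concerns
        con.execute("ALTER TABLE ds_values ADD COLUMN plausibility_concerns TEXT")
        con.execute("ALTER TABLE ds_dyadic_values ADD COLUMN plausibility_concerns TEXT")
    con.close()
```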
The first version of the file upload template, in place until November 2023, already included two placeholders for a data quality "rating", namely the then mandatory columns "data_quality" and "data_quality_confidence", intended for a "traffic light" or some other scheme. While the codings remained undetermined at that time, some data providers made use of them to input their own remarks, reasoning and quality judgements. To not lose this information,
- the information contained in the DB was kept,
- but both columns are now marked as deprecated;
- the validation check ignores whether the upload file (still) contains the old columns and treats them as optional column names (see the sketch following this list); this way, researchers are still able to update "their" information for data that already existed before the above agreement was reached.
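The sketch below illustrates the backward-compatibility rule from the list above: the deprecated columns may still appear in an upload file but are no longer required. The required and optional column sets are assumptions for illustration only.

```python
# Sketch of the header check under the backward-compatibility rule described
# above; the required and optional column sets are illustrative assumptions.
REQUIRED_COLUMNS = {"country_code", "year", "value", "plausibility_concerns"}
DEPRECATED_COLUMNS = {"data_quality", "data_quality_confidence"}  # tolerated, ignored
OPTIONAL_COLUMNS = {"comment"} | DEPRECATED_COLUMNS

def check_header(columns):
    """Report missing mandatory columns and unexpected column names."""
    present = set(columns)
    return {
        "missing": sorted(REQUIRED_COLUMNS - present),
        "unknown": sorted(present - REQUIRED_COLUMNS - OPTIONAL_COLUMNS),
    }
```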
For existing data already in the DB, the plausibility rating is not available (and will not be unless the data is updated). Until the codings gradually become available with indicator updates, it was agreed that WeSIS will display "Not Available" in the data table for the time being.
References
- Buneman, Peter, Sanjeev Khanna, and Wang-Chiew Tan. 2001. Why and Where: A Characterization of Data Provenance. University of Pennsylvania, Pennsylvania: Department of Computer & Information Science. Departmental Papers (CIS).
- Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES). 2018. The IPBES Guide on the production of assessments. Core version. IPBES/6/INF/17. https://www.ipbes.net/document-library-catalogue/ipbes6inf17 (Accessed January 17, 2023)
- King, Gary, Robert O. Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton: Princeton University Press.
- Pipino, Leo L., Yang W. Lee, and Richard Y. Wang. 2002. "Data quality assessment." Communications of the ACM 45 (4): 211–18. https://doi.org/10.1145/505248.506010
- Prat, Nicolas, and Stuart E. Madnick. 2007. Measuring Data Believability: A Provenance Approach. Massachusetts Institute of Technology, Cambridge: Composite Information Systems Laboratory (CISL), Sloan School of Management.
- Wilkinson, Mark D., Michel Dumontier, I. J. J. Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz B. da Silva Santos, Philip E. Bourne, Jildau Bouwman, Anthony J. Brookes, Tim Clark, Mercè Crosas, Ingrid Dillo, Olivier Dumon, Scott Edmunds, Chris T. Evelo, Richard Finkers, Alejandra Gonzalez-Beltran, Alasdair J. G. Gray, Paul Groth, Carole Goble, Jeffrey S. Grethe, Jaap Heringa, Peter A. C. 't Hoen, Rob Hooft, Tobias Kuhn, Ruben Kok, Joost Kok, Scott J. Lusher, Maryann E. Martone, Albert Mons, Abel L. Packer, Bengt Persson, Philippe Rocca-Serra, Marco Roos, Rene van Schaik, Susanna-Assunta Sansone, Erik Schultes, Thierry Sengstag, Ted Slater, George Strawn, Morris A. Swertz, Mark Thompson, Johan van der Lei, Erik van Mulligen, Jan Velterop, Andra Waagmeester, Peter Wittenburg, Katherine Wolstencroft, Jun Zhao, and Barend Mons. 2016. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific data 3: 160018. https://doi.org/10.1038/sdata.2016.18