Evaluation of Contradictions as a Data Quality Measure for Real-world and Clinical Research Data
Doctoral thesis
Date of Examination: 2025-09-22
Date of issue: 2025-10-07
Advisor: Prof. Dr. Dagmar Krefting
Referee: Prof. Dr. Dagmar Krefting
Referee: Prof. Dr. Ramin Yahyapour
Files in this item
Name: PhD_Dissertation.pdf
Size: 6.69Mb
Format: PDF
Description: Content of thesis
Abstract
English
In health research discourse, it is commonly assumed that controlled, planned clinical research naturally produces cleaner datasets, while real-world data suffers from quality deficiencies because it is collected spontaneously, without anticipation of secondary research use. However, this binary view oversimplifies a complex reality. Both clinical research data and real-world data exhibit quality issues, challenging the convention that data origin determines data integrity. What proves more vital is the way data quality (DQ) is assessed, particularly when examining the relationships between data items that reflect the true complexity of healthcare information. Health data is inherently interconnected: for example, height relates to weight and age, and medication dosages connect to patient characteristics and medical conditions. Yet most data quality assessment (DQA) studies limit their logical evaluations to only a handful of interdependent items, missing the broader web of relationships that could reveal critical quality issues. Contradictions within datasets serve as an important DQ indicator (a measurable attribute of data) that extends far beyond simple data entry errors. When values contradict each other, they reveal misunderstandings of domain knowledge, failures in data capture infrastructure, and problems in data transformation processes. Given the complexity of contradiction patterns, a comprehensive assessment becomes essential. Comprehensive DQA in health research requires evaluating data along multiple dimensions, such as completeness, conformance, and plausibility, to determine the suitability of data for its intended use. These dimensions group related measurable attributes of data, referred to as indicators. While some DQ indicators are implemented as generic rules (e.g., completeness), contradictions, which denote pairs of measurements or facts that cannot logically coexist, are multi-dimensional.
Some contradictions can be evaluated with common knowledge, e.g., a disparity between the age information recorded at baseline and during follow-up visits. Contradictions in the pre-analytic states of blood samples, however, require expert knowledge. Although various tools have been developed to assess contradictions in health data, the evolving nature of contradiction rules presents the following challenges for existing tools: 1) limited reusability across diverse health datasets due to incompatible predefined rules, 2) insufficient contextual information required for comprehensive contradiction assessment, and 3) lack of a consistent representation of contradiction patterns to support transparent rule implementations and computational efficiency. To address these gaps systematically, the first step of this thesis was to adapt an existing tool, itself built on an existing framework, to assess the broader contradiction rules introduced in the German Corona Consensus (GECCO) dataset. Building on the experience gained from modifying Schmidt et al.'s tool, this thesis developed a new framework that generalizes the DQA tool to accept custom contradiction rules compatible with diverse health datasets. This is intended to empower domain experts to define effective contradiction rules on an ad-hoc basis while maintaining alignment with established DQ indicators. Although the primary dependencies between health data items form the basis for contradiction rule definition, the context required for a comprehensive analysis is equally relevant. For example, diabetes mellitus (DM) and insulin medication (INS) are two interdependent items; however, it is the timestamps that indicate whether INS was administered before the DM diagnosis, which is a prerequisite for a conclusive contradiction finding.
Through a qualitative grading scheme for analyzing the metadata defined in a study database, this work created a mechanism to measure the availability of the contextual information required to support context-aware contradiction analysis. However, access to rich contextual information is only part of the solution: the way contradiction rules are structured and implemented is equally critical for effective DQA. While researchers have worked extensively on harmonizing DQ indicators, the challenge extends beyond categorization to implementation efficiency. Although taxonomies for contradictions, such as logical and empirical contradictions, reflect the domain description, the rules implemented in DQA systems do not rely on semantics. Extracting the structure of the varied contradiction patterns ensures transparency and consistency in rule implementation. This work evaluated the performance of different rule implementations and offered an optimization method that integrates the fastest, traceable units of the rules into one unified implementation. This is relevant because broadly defined DQA rules can cause performance degradation in large database infrastructures. By examining contradictions as a DQ indicator from these perspectives, the contributions of this thesis enhance contradiction assessment by improving reusability across diverse health datasets, enabling context-aware assessments, and optimizing rule implementation for transparency and computational efficiency. This systematic approach supports high-quality health data assessment while maintaining traceability to support data cleansing efforts.
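As an illustration of the context-aware rule described in the abstract (DM and INS are interdependent, but only their timestamps allow a conclusive finding), the following is a minimal sketch and not the thesis's implementation. The record fields, the function name, and the four outcome labels are hypothetical, and the rule itself (insulin recorded before the diabetes diagnosis counts as a contradiction) is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # Hypothetical field names; not taken from the thesis or the GECCO dataset.
    dm_recorded: bool              # a diabetes mellitus diagnosis is documented
    ins_recorded: bool             # insulin medication is documented
    dm_date: Optional[date] = None   # diagnosis timestamp, if available
    ins_date: Optional[date] = None  # first insulin administration, if available


def check_ins_before_dm(rec: Record) -> str:
    """Evaluate one illustrative contradiction rule on a single record.

    Returns one of:
      "not_applicable" - the rule needs both items to be recorded
      "inconclusive"   - both items recorded, but a timestamp is missing
      "contradiction"  - insulin is dated before the DM diagnosis
      "consistent"     - insulin is dated on or after the DM diagnosis
    """
    if not (rec.dm_recorded and rec.ins_recorded):
        return "not_applicable"
    if rec.dm_date is None or rec.ins_date is None:
        # Without the contextual timestamps no conclusive finding is possible.
        return "inconclusive"
    if rec.ins_date < rec.dm_date:
        return "contradiction"
    return "consistent"
```

Returning an explicit "inconclusive" result, rather than silently passing records with missing timestamps, keeps the availability of contextual information visible in the assessment output.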
Keywords: data quality; health data; contradictions; rule-based systems
