Aspects of Temporal Patient Similarity in Complex Diseases
Doctoral thesis
Date of Examination: 2024-11-29
Date of issue: 2024-12-09
Advisor: Prof. Dr. Ulrich Sax
Referee: Prof. Dr. Ulrich Sax
Referee: Prof. Dr. Hossein Estiri
Referee: Prof. Dr. Riccardo Bellazzi
Files in this item
Name: Dissertation_Jonas_Huegel_upload.pdf
Size: 28.8 MB
Format: PDF
Description: Dissertation
Abstract
English
When comparing patients with complex chronic diseases, such as cancer or Post COVID-19, in Real-World Data (RWD), it is crucial to consider not only their condition at the time point of analysis but also their disease trajectories over time. Typical data mining approaches and patient similarity measures tend to overlook the inherent temporal dimension of RWD, such as Electronic Health Record (EHR) data. To measure patient similarity, the performance of available metrics must be analyzed to select the one resulting in the most realistic similarity scores. Therefore, I applied 88 combinations of graph and set theory algorithms to the International Statistical Classification of Diseases and Related Health Problems (ICD) code sets of 29 pancreatic cancer patients in a comprehensive benchmark. Introducing a scaling term resulted in a better representation of comorbidities. While this approach showed a significant correlation (0.75) with clinician-derived similarity scores, it did not consider the inherent temporal dimension of RWD. One possibility to integrate the temporal aspects is to use Transitive Sequential Pattern Mining (tSPM). Based on the original tSPM algorithm, I developed the Transitive Sequential Pattern Mining Plus (tSPM+) algorithm to mine temporal representations from clinical data. The tSPM+ algorithm massively outperforms the original tSPM algorithm, reducing memory consumption and runtime by factors of up to 40 and 900, respectively. Furthermore, it provides the duration of the patterns and additional utility functions. I explored encoding sequential patterns mined from EHRs instead of the raw EHR data to render nontemporal Machine Learning (ML) models time-sensitive and to derive temporal patient characteristics.
In the context of precision oncology, I investigated available knowledge bases and data types and contributed to the development of several Extract, Transform, Load (ETL) pipelines in multiple cancer-related research projects to identify available on-premise data. This effort resulted in a pancreatic and a lung cancer cohort, both of which are suitable for applying the tSPM+ workflow. In two proof-of-concept studies, I integrated sequential patterns into downstream ML approaches, such as Random Forest classification, by extending the tSPM+ workflow to extract the temporal characteristics of these cohorts. Subsequent data reviews using a new network visualization approach confirmed that the identified temporal characteristics were clinically sound for both cohorts. Multiple complex diseases, such as Post COVID-19, are defined by complex exclusion criteria, which are challenging to implement on RWD. In a second, highly relevant use case, I demonstrated how sequential patterns, in concert with the utility functions of tSPM+, can be used to curate a Post COVID-19 precision cohort with patient-specific symptoms, achieving a positive predictive value of 0.79. This use case provides significant opportunities for Post COVID-19 research by allowing researchers to build symptom-specific cohorts in large databases. In conclusion, this thesis presents a fundamental approach for integrating the temporal dimension of EHR data into ML tasks for complex chronic diseases, addressing a critical gap in the field of clinical research informatics and precision medicine. This work lays the foundation for further endeavors in modeling temporal disease trajectories, contributing towards a better understanding and treatment of complex chronic diseases.
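To illustrate the general idea behind mining transitive sequential patterns with durations, the following is a minimal sketch and not the tSPM+ implementation itself. It assumes that each patient record is a list of (code, date) pairs, that only the first occurrence of each code is kept, and that a pattern is any ordered pair of codes together with the number of days between them. The function name `mine_transitive_sequences` and the example ICD-10 codes are hypothetical and chosen purely for illustration.

```python
from datetime import date
from itertools import combinations


def mine_transitive_sequences(events):
    """Return (earlier_code, later_code, duration_days) for one patient.

    `events` is a list of (code, date) tuples. This is an illustrative
    sketch under stated assumptions, not the tSPM+ algorithm itself.
    """
    # Assumption: keep only the earliest date on which each code was recorded.
    first_seen = {}
    for code, day in events:
        if code not in first_seen or day < first_seen[code]:
            first_seen[code] = day

    # Order the distinct codes chronologically by their first occurrence.
    ordered = sorted(first_seen.items(), key=lambda item: item[1])

    # Pair every earlier code with every later code, not only direct
    # neighbours; this is the "transitive" part. Codes recorded on the
    # same day yield a duration of 0.
    patterns = []
    for (code_a, day_a), (code_b, day_b) in combinations(ordered, 2):
        patterns.append((code_a, code_b, (day_b - day_a).days))
    return patterns


# Hypothetical patient: a COVID-19 diagnosis followed by two symptoms.
example_events = [
    ("U07.1", date(2021, 1, 10)),  # COVID-19
    ("R53", date(2021, 4, 15)),    # malaise and fatigue
    ("R06.0", date(2021, 3, 1)),   # dyspnea
    ("R53", date(2021, 6, 1)),     # repeated code, ignored by this sketch
]

for pattern in mine_transitive_sequences(example_events):
    print(pattern)
```

Patterns of this shape could then be encoded as features for a nontemporal model such as a Random Forest, or filtered by duration when curating a cohort; how the thesis does this in detail is not described in the abstract.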
Keywords: real-world data; transitive sequential pattern mining; cancer; post COVID-19; temporal characterization; machine learning; EHR data