Toward Trustworthiness of Deep Learning Models for 12-Lead ECGs

Bender, Theresa

by Theresa Bender

Dissertation

Date of oral examination: 2023-11-23

Published: 2024-01-31

Supervisor: Prof. Dr. Ulrich Sax

Reviewer: Prof. Dr. Ulrich Sax

Reviewer: Prof. Dr. Ulrich Parlitz

Reviewer: Prof. Dr. Thomas Tolxdorff

To link/cite: http://dx.doi.org/10.53846/goediss-10315

Files

Name: Dissertation_SUB_Upload.pdf

Size: 7.00Mb

Format: PDF



Abstract

English

A 12-lead electrocardiogram (ECG), a common examination tool in cardiology, represents the electrical activity of the heart as waveforms. Predictions and classifications with deep learning (DL) algorithms show great potential to aid clinicians in the diagnosis and treatment of patients. However, since clinicians are responsible for the treatment and thus the outcome of individual patients, they need to understand the reasoning behind these models' decisions. Important criteria for the acceptance of DL models in clinical settings are covered by aspects of trustworthiness, such as safety and privacy. In this work, new methods and tools are developed to evaluate and quantify technical aspects of trustworthiness on a pre-trained deep neural network (DNN) for 12-lead ECG classification of six clinically relevant abnormalities. The open source DNN by Ribeiro et al. was trained on a large data set and showed good performance on test data. It is systematically analyzed for its reproducibility, explainability, robustness, and generalizability with multiple public and clinical data sets. For this, F1-scores are calculated and evaluated for different groups, and quantitative measurements for relevance scores of post-hoc explainable artificial intelligence (XAI) methods are analyzed. Moreover, raw ECG data recorded in clinical routine is exported and integrated into the local research infrastructure to evaluate the generalizability of the model in clinical settings. The results of the DNN on the original test data set can be reproduced with deviations in the range of rounding errors. The DNN exhibits similarly high performance on the PTB-XL and CPSC 2018 public data sets, as well as on a large export of resting ECGs from Schiller devices acquired at the University Medical Center Göttingen.
Applying XAI to the DNN reveals features similar to cardiological textbook knowledge, such as lead V1 being most important and missing P-waves in atrial fibrillation, and this is validated on all data sets. The noise annotations of PTB-XL are further analyzed regarding their influence on the performance of the pre-trained DNN. The results indicate that the DNN is able to detect atrial fibrillation in 12-lead ECGs with high accuracy, even in the presence of data quality issues as annotated by human experts. The experiments that concern performance and explainability are repeated on roughly 150,000 local recordings and yield similar results on these real-world data. These exemplary analyses of the trustworthiness of a DNN provide promising results and will be further investigated. Considering several aspects of trustworthiness, it is possible to foster trust in DNNs for clinical applications.
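
The group-wise F1 evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the dissertation's code: the function names `f1_score` and `f1_by_group`, the group labels "clean" and "noisy", and the toy labels are all assumptions made up for illustration. It only shows how a binary F1-score for one abnormality might be computed separately per subgroup, for example per signal-quality annotation.

```python
from collections import defaultdict


def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 labels; returns 0.0 if there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_by_group(y_true, y_pred, groups):
    """Split the recordings by group label and compute F1 within each group."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: f1_score(ts, ps) for g, (ts, ps) in buckets.items()}


# Toy data, invented for illustration only (not results from the dissertation).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["clean"] * 4 + ["noisy"] * 4

scores = f1_by_group(y_true, y_pred, groups)
for group, score in scores.items():
    print(f"{group}: F1 = {score:.3f}")
```

Comparing the per-group values side by side is one simple way to see whether performance changes between subgroups; a multi-label setting with six abnormalities would repeat this per class.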

Keywords: Biosignal Processing; Deep Learning; Electrocardiogram; Trustworthiness; Explainability; Robustness
