Machine Learning-based Approaches to Integrate Heterogeneous Data for Biological Knowledge Transfer
Cumulative thesis
Date of Examination: 2024-02-09
Date of issue: 2024-06-13
Advisor: Prof. Dr. Anne-Christin Hauschild
Referee: Prof. Dr. Anne-Christin Hauschild
Referee: Dr. Johannes Soeding
Files in this item
Name: Dissertation_youngjun_submiss.pdf
Size: 18.9 MB
Format: PDF
Description: Main article
Abstract
Recent developments in high-throughput data generation methodologies, such as next-generation sequencing and MALDI-TOF mass spectrometry, have created a strong need for data science in biomedical research. Over the past decade, these technologies have facilitated the accumulation of extensive omics data. Although this advancement has greatly expanded biomedical knowledge, biomedical studies remain limited by three forms of data heterogeneity: batch effects, heterogeneity in data types, and biological heterogeneity across species. These challenges complicate the application of statistical methods and machine learning models to complex analysis scenarios involving multiple datasets. Consequently, there is a growing demand for methodologies that can handle heterogeneous biomedical data. In this thesis, I investigated these three challenges and developed novel methodologies to address them.
The first challenge is batch effects, a systematic non-biological variation introduced into omics datasets during data acquisition. Batch effects are one of the factors hindering the integrative analysis of datasets of the same type in biomedical research. In this thesis, batch effects in MALDI-TOF mass spectrometry data and single-cell RNA sequencing data were investigated. Different hospitals generated large-scale mass spectrometry data from patient samples. Because procedures and protocols differ between hospitals, batch effects exist at different levels in each dataset and impede integrative analysis. I examined these batch effects using three machine learning models: logistic regression, LightGBM, and a neural network. Following recent advances, single-cell RNA sequencing has become widely used in diverse studies to produce large-scale sequencing data for cell populations within tissues. However, the presence of batch effects in single-cell sequencing datasets necessitates appropriate pre-processing when integrating multiple datasets from different studies. I first investigated the impact of batch effects on the integrative analysis of multiple single-cell RNA sequencing datasets. I then proposed a simple approach to batch effect mitigation that combines data transformation with low-dimensional embedding. Data transformations alter the data distribution and therefore have a significant impact on downstream analysis, yet the normalization step that includes data transformation is often neglected. Different data transformation methods were therefore examined on three distinct datasets and evaluated for their effect on batch mitigation in integrative analysis, using dimensionality reduction followed by clustering. The results show that a significant proportion of batch effects can be mitigated by simple data transformation, with performance comparable to that of published deep neural network models.
The second challenge is the variety of data types in biomedical data. Integrating them is difficult because of differences in data formats and underlying hypotheses, and it requires a multi-modal analysis methodology capable of handling diverse data types effectively. In this thesis, a meta-transfer learning approach based on a few-shot learning model was proposed to integrate bulk and single-cell sequencing datasets. This approach was highly effective in single-cell RNA sequencing analysis scenarios with small sample sizes: it was able to mitigate batch effects and predict cell types across different datasets. This result suggests a new way to utilize the large amount of bulk sequencing data available in public databases. By leveraging existing bulk sequencing data, researchers can overcome study size constraints and batch effects in single-cell data analysis.
The last challenge, biological heterogeneity, originates from the variety of species and their unique genomes. This heterogeneity makes data integration difficult and may require a distinct machine learning model for transfer learning. Transfer learning can be classified as homogeneous or heterogeneous, depending on whether the source and target datasets share the same features. In this thesis, transfer learning approaches were examined and developed with two kinds of data: MALDI-TOF mass spectrometry and single-cell RNA sequencing datasets. The analysis of mass spectrometry data showed the potential value of cross-species transfer learning for antimicrobial resistance prediction. Although data from clinical practice showed higher biological heterogeneity because of the variety of species, training machine learning models on aggregated data from different species proved beneficial for predicting antimicrobial resistance in unknown species. Furthermore, a new methodology based on heterogeneous transfer learning was introduced to integrate datasets from different species in a data-driven way. The conventional approach to cross-species transfer learning relies on gene homology, a dependence that severely limits its application to non-model organisms. The new methodology was therefore designed to be independent of gene homology by exploiting labels shared among datasets, such as cell types or experimental conditions. This species-agnostic transfer learning approach successfully integrated single-cell RNA sequencing datasets from different species.
This thesis explores the challenges posed by data heterogeneity in biomedical research and presents corresponding machine learning methods to address them. It offers a comprehensive perspective on the issue of data heterogeneity within the field. Further integrative analyses that leverage diverse datasets can enhance robustness and generalizability, and thus help address the reproducibility crisis.
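As an illustration of the transformation-plus-embedding idea described in the abstract, the following is a minimal sketch on synthetic data, not the implementation used in the thesis. It assumes library-size normalization followed by log1p as the data transformation, PCA as the low-dimensional embedding, and a batch effect modeled purely as a difference in sequencing depth. All function names, parameters, and the separation score are hypothetical choices made for this example.

```python
import numpy as np


def simulate(rng, n_cells=200, n_genes=50):
    """Synthetic counts: two cell types, two batches that differ only in depth."""
    rates = rng.gamma(shape=2.0, scale=2.5, size=(2, n_genes))  # one profile per cell type
    cell_type = np.tile([0, 1], n_cells)        # both types equally present in each batch
    batch = np.repeat([0, 1], n_cells)          # first half batch 0, second half batch 1
    depth = np.where(batch == 0, 1.0, 5.0)      # assumed batch effect: 5x deeper sequencing
    counts = rng.poisson(rates[cell_type] * depth[:, None])
    return counts.astype(float), batch


def log_normalize(counts, target_sum=1e4):
    """Scale each cell to the same total count, then apply log1p."""
    library_size = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / library_size * target_sum)


def pca_embed(x, n_components=10):
    """Project centered data onto its top principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def batch_separation(embedding, batch):
    """Distance between batch centroids relative to the average within-batch spread."""
    a, b = embedding[batch == 0], embedding[batch == 1]
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = 0.5 * (np.sqrt(a.var(axis=0).sum()) + np.sqrt(b.var(axis=0).sum()))
    return gap / spread


rng = np.random.default_rng(0)
counts, batch = simulate(rng)
raw_score = batch_separation(pca_embed(counts), batch)
log_score = batch_separation(pca_embed(log_normalize(counts)), batch)
print(f"raw: {raw_score:.2f}, log-normalized: {log_score:.2f}")
```

A lower score for the log-normalized embedding than for the raw one indicates that the depth-driven batch difference has been reduced. Real batch effects are rarely this simple, so the sketch shows only the mechanics of the comparison.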
Keywords: Machine learning; Bioinformatics; Computational biology; Data integration; Transfer learning
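The abstract's claim about predicting antimicrobial resistance in unknown species implies an evaluation in which the test species is absent from the training data. One common way to set this up is a leave-one-species-out split, sketched below under stated assumptions: the function name and the species labels are hypothetical, and the thesis may have used a different protocol.

```python
from collections import defaultdict


def leave_one_species_out(species_labels):
    """Yield (held_out_species, train_indices, test_indices), one split per species."""
    by_species = defaultdict(list)
    for index, species in enumerate(species_labels):
        by_species[species].append(index)
    for held_out in sorted(by_species):
        test = by_species[held_out]
        # Training data is pooled from every species except the held-out one.
        train = sorted(
            i for name, indices in by_species.items() if name != held_out for i in indices
        )
        yield held_out, train, test


# Hypothetical species labels for six samples (illustrative values only).
species = ["E. coli", "K. pneumoniae", "E. coli", "S. aureus", "S. aureus", "K. pneumoniae"]
splits = list(leave_one_species_out(species))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Each split trains on the aggregated samples of the remaining species and tests on a species the model has never seen, which matches the "unknown species" scenario the abstract describes.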