Knowledge Integration and Representation for Biomedical Analysis
by Halima Alachram
Date of Examination: 2021-02-04
Date of issue: 2021-02-25
Advisor: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Winfried Kurth
Files in this item
Name: Thesis_Halima Alachram.pdf
Size: 5.49Mb
Format: PDF
Description: PhD thesis
Abstract
English
Information-based health systems that aim to improve clinical decision-making are appealing because they help clinicians cope with the growing volume of information they face and provide a framework for incorporating validated expertise into health care. Such systems need biomedical analytical expertise, patient-specific data, and a reasoning system that combines data and knowledge to give clinicians valuable information during care delivery. Biomedical research increasingly exploits high-throughput data profiles that provide insights into the pathogenesis and diagnosis of human disease. Interpreting high-throughput data involves comparing data and knowledge from heterogeneous resources, whether in the biomedical field or in genomics. Enrichment analysis is commonly used for the functional study of gene lists detected by high-throughput techniques such as expression microarray experiments. It uses statistical methods to detect biological characteristics that are represented more often than expected by chance in the gene set under study. In addition, healthcare is seeking closer integration with biomedical data to advance personalized medicine and provide better treatments. Ontologies, which define the entities and relations used in a domain, play a key role in the automated integration of patient data with relevant knowledge to support clinical research and drug discovery. Moreover, biomedical literature provides valuable insights for identifying potential treatments and can support biomedical researchers on their way to new findings. With the enormous amount of biomedical literature and the rapid growth in the number of new publications, the wealth of scientific knowledge represented in free text is increasing dramatically. Extracting relevant information and analyzing text data helps to discover relationships between biological entities and to answer biological questions.
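To make the idea of "more often than expected by chance" concrete, the following is a minimal sketch of a classical over-representation test based on the hypergeometric distribution. This is a common textbook technique and is not the logistic regression approach used in the thesis; the function name and parameter names are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population, category_total, sample, category_hits):
    """Upper-tail hypergeometric p-value for over-representation.

    population     -- number of genes in the background set
    category_total -- background genes annotated with the category
    sample         -- number of genes in the list under study
    category_hits  -- genes in the list annotated with the category

    Returns P(X >= category_hits) when `sample` genes are drawn at
    random, without replacement, from the background.
    """
    upper = min(category_total, sample)
    total = comb(population, sample)
    tail = sum(
        comb(category_total, k) * comb(population - category_total, sample - k)
        for k in range(category_hits, upper + 1)
    )
    return tail / total


# Toy numbers (not from the thesis): 10 background genes, 3 of them in the
# category, and a study list of 3 genes that all fall in the category.
p_value = hypergeom_enrichment_p(10, 3, 3, 3)
```

A small p-value suggests that the category occurs in the gene list more often than random sampling would explain. In practice, the p-values for many categories would also need a multiple-testing correction.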
In this thesis, I developed applications that exploit biomedical knowledge represented in different forms and stored in different resources to deliver helpful information in Systems Medicine. The first application is a Java-based enrichment analysis tool built on an enrichment function from a recent study that uses logistic regression to identify significant categories. I developed a Java command-line interface that calls the logistic regression function in R, so that the tool can be integrated into a Java-based platform and is easier for Java users to work with. Moreover, to facilitate interoperability between the clinical and molecular data held in biomedical resources, I developed a lexical mapping module in Java for mapping biomedical concepts. I used the module to map International Classification of Diseases (ICD) terms, which represent the names of disease phenotypes in clinical systems, to disease concepts in the National Cancer Institute Thesaurus (NCIT) and the Medical Subject Headings (MeSH®) vocabulary. In addition, to deliver the pathway and molecular information integrated into the NCIT ontology, I developed a plugin for the NCIT ontology using the OBA service, a service that facilitates access to ontology structures. Using this plugin, I implemented functions that can model disease pathways based on genes. Furthermore, I used the word2vec implementation in two approaches to generate biomedical embeddings. word2vec is one of the most widely used implementations of word embeddings because of its training performance. For the first approach, I used the Dis2Vec model, a vocabulary-driven word2vec model, to extract disease-drug associations, and I was able to capture visually validated associations. For the second approach, I created and processed a corpus using different preprocessing strategies to obtain embeddings for further comparison.
For example, one strategy replaced synonymous terms with their preferred terms from biomedical databases and assigned type labels to words so that similarities could be filtered by entity type, such as genes, drugs, or human diseases. To ease the exploration of biomedical concepts and their relations in the embeddings, I developed a web service that provides functions to query the embeddings. I validated similarities between entities in the resulting embeddings against existing knowledge in biomedical databases. Comparisons showed that relations between entities, such as known protein-protein interactions (PPIs), common pathways and cellular functions, or narrower disease ontology groups, correlated with higher vector cosine similarity. Word representations produced by text mining algorithms like word2vec therefore capture biologically meaningful relations between entities. Furthermore, I extracted gene-gene networks from two embedding versions and used them as prior knowledge to train graph convolutional neural networks (Graph-CNNs) on breast cancer gene expression data to predict the occurrence of metastatic events. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction networks or with networks derived using other word embedding algorithms. Graph-CNNs trained with word2vec-embedding-derived networks performed best on the metastatic event prediction task, outperforming those trained with PPI or other text mining-based networks.
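As an illustration of how cosine similarity between embedding vectors can be turned into a gene-gene network, the following minimal sketch links two genes whenever their similarity reaches a cutoff. The abstract does not state how the networks were derived, so the thresholding rule, the function names, and the toy vectors are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_network(embeddings, threshold):
    """Return edges (gene1, gene2, similarity) for pairs at or above threshold.

    `embeddings` maps a gene name to its vector. The threshold rule is an
    assumption of this sketch, not a method taken from the thesis.
    """
    edges = []
    for (g1, v1), (g2, v2) in combinations(sorted(embeddings.items()), 2):
        sim = cosine_similarity(v1, v2)
        if sim >= threshold:
            edges.append((g1, g2, sim))
    return edges


# Hypothetical two-dimensional vectors; real embeddings have far more dimensions.
toy_embeddings = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [1.0, 0.0],
    "GENE_C": [0.0, 1.0],
}
toy_edges = similarity_network(toy_embeddings, threshold=0.5)
```

With these toy vectors, only GENE_A and GENE_B are connected, because their vectors point in the same direction while GENE_C is orthogonal to both. An edge list of this kind could then serve as the prior-knowledge graph for a graph-based model.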
Keywords: Data integration; Knowledge representation; Biomedical ontologies; Text mining; Word embedding; Machine learning; Biomedical analysis