Knowledge Integration and Representation for Biomedical Analysis

Alachram, Halima

dc.contributor.advisor	Wingender, Edgar Prof. Dr.
dc.contributor.author	Alachram, Halima
dc.date.accessioned	2021-02-25T13:48:36Z
dc.date.available	2021-02-25T13:48:36Z
dc.date.issued	2021-02-25
dc.identifier.uri	http://hdl.handle.net/21.11130/00-1735-0000-0005-158D-5
dc.identifier.uri	http://dx.doi.org/10.53846/goediss-8464
dc.language.iso	eng	de
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc	510	de
dc.title	Knowledge Integration and Representation for Biomedical Analysis	de
dc.type	doctoralThesis	de
dc.contributor.referee	Wingender, Edgar Prof. Dr.
dc.date.examination	2021-02-04
dc.description.abstracteng	Information-based health systems aimed at improving clinical decision-making are appealing because they can cope with the rising amount of information that clinicians face and provide a framework for incorporating validated expertise in health care. Such systems need biomedical analytical expertise, patient-specific data, and a reasoning system that combines data and knowledge to provide clinicians with valuable information during care delivery. Biomedical research has evolved to exploit high-throughput data profiles that provide insights into human disease pathogenesis and diagnosis. The interpretation of high-throughput data involves the comparison of data and knowledge from heterogeneous resources, whether in the biomedical field or in genomics. Enrichment analysis is commonly used for the functional study of gene lists detected by high-throughput techniques like expression microarray experiments. It uses statistical methods to detect biological characteristics that are represented more often than expected by chance in a gene set under study. Healthcare is also seeking closer integration with biomedical data to boost personalized medicine and to provide better treatments. Ontologies, which identify entities and relations used in a domain, play a key role in the automated integration of patient data with relevant knowledge to support clinical research and drug discovery. Moreover, biomedical literature provides valuable insights into the identification of potential treatments, and it can support biomedical researchers on their way to new findings. With the enormous amount of biomedical literature and the rapid growth of the number of new publications, the wealth of scientific knowledge represented in free text is increasing dramatically.
Extracting relevant information and analyzing text data helps to discover relationships between biological entities and to answer biological questions. In this thesis, I developed applications that exploit biomedical knowledge represented in different forms and existing in different resources to deliver helpful information in Systems Medicine. The first application is a Java-based enrichment analysis tool built on an enrichment function from a recent study that uses logistic regression to identify significant categories. I developed a Java command-line interface that calls the logistic regression function in R, in order to integrate the tool into a Java-based platform and to make it easier for Java users to use. Moreover, to support interoperability between clinical and molecular data in biomedical resources, I developed a lexical mapping module in Java for mapping biomedical concepts. I used the module to map the International Classification of Diseases (ICD) terms that represent the names of disease phenotypes in clinical systems to disease concepts in the National Cancer Institute Thesaurus (NCIT) and the Medical Subject Headings (MeSH®) vocabulary. In addition, to deliver the pathway and molecular information integrated into the NCIT ontology, I developed a plugin for the NCIT ontology using the OBA service, which facilitates access to ontology structures. Using this plugin, I implemented functions that can model disease pathways based on genes. Furthermore, I used the word2vec implementation in two approaches to generate biomedical embeddings. word2vec is one of the most widely used implementations of word embeddings due to its training performance. For the first approach, I used the Dis2Vec model, a vocabulary-driven word2vec model, to extract disease-drug associations, and I was able to capture visually validated associations.
For the second approach, I created a corpus and processed it with different preprocessing strategies to obtain embeddings for further comparison. For example, one strategy replaced synonymous terms with their preferred terms from biomedical databases and assigned type labels to words in order to filter similarities by entity type, such as genes, drugs, or human diseases. To ease the exploration of biomedical concepts and their relations in the embedding, I developed a web service that provides functions to query the embeddings. I validated similarities between entities in the resulting embeddings using existing knowledge in biomedical databases. Comparisons showed that relations between entities such as known protein-protein interactions (PPIs), common pathways and cellular functions, or narrower disease ontology groups correlated with higher vector cosine similarity. Word representations produced by text mining algorithms like word2vec therefore capture biologically meaningful relations between entities. Furthermore, I extracted gene-gene networks from two embedding versions and used them as prior knowledge to train graph convolutional neural networks (Graph-CNNs) on breast cancer gene expression data to predict the occurrence of metastatic events. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction networks or with networks derived using other word embedding algorithms. Graph-CNNs trained with word2vec-embedding-derived networks performed best on the metastatic event prediction task compared to PPI or other text mining-based networks.	de
dc.contributor.coReferee	Kurth, Winfried Prof. Dr.
dc.subject.eng	Data integration	de
dc.subject.eng	Knowledge representation	de
dc.subject.eng	Biomedical ontologies	de
dc.subject.eng	Text mining	de
dc.subject.eng	Word embedding	de
dc.subject.eng	Machine learning	de
dc.subject.eng	Biomedical analysis	de
dc.identifier.urn	urn:nbn:de:gbv:7-21.11130/00-1735-0000-0005-158D-5-7
dc.affiliation.institute	Fakultät für Mathematik und Informatik	de
dc.subject.gokfull	Informatik (PPN619939052)	de
dc.identifier.ppn	174948420X
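
The abstract reports that known relations between entities correlated with higher vector cosine similarity in the word2vec embeddings. As a rough illustration of that measure only, and not of the thesis's web service or its actual embeddings, the following Python sketch computes cosine similarity and ranks the nearest neighbours of a query entity. The function names `cosine_similarity` and `most_similar`, the entity labels, and the three-dimensional vectors are all hypothetical placeholders chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_n=3):
    """Rank all other entities by cosine similarity to `query`, highest first."""
    query_vec = embeddings[query]
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up toy vectors (assumption): real word2vec embeddings typically
# have hundreds of dimensions and are learned from a text corpus.
toy_embeddings = {
    "gene_a": [1.0, 0.0, 0.0],
    "gene_b": [0.9, 0.1, 0.0],
    "drug_x": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for name, score in most_similar("gene_a", toy_embeddings, top_n=2):
        print(f"{name}\t{score:.3f}")
```

In this toy setup, `gene_b` ranks above `drug_x` for the query `gene_a` because its vector points in almost the same direction. Filtering the candidates by an entity type label before ranking would mirror the type-based filtering mentioned in the abstract.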

Files

Name: Thesis_Halima Alachram.pdf

Size: 5.494Mb

Format: PDF

Description: PhD thesis

This item appears in:

Fakultät für Mathematik und Informatik (inkl. GAUSS) [518]
