dc.description.abstracteng | Information-based health systems aimed at improving clinical decision-making are appealing because they
help clinicians cope with the rising amount of information they face and provide a
framework for incorporating validated expertise in health care. Such systems need biomedical analytical
expertise, patient-specific data, and a reasoning system that combines data and knowledge to
produce valuable information and deliver it to clinicians during care delivery. Biomedical research has
evolved to exploit high-throughput data profiles that provide insights into human disease
pathogenesis and diagnosis. The interpretation of high-throughput data involves the comparison of data
and knowledge from heterogeneous resources, whether in the biomedical field or in genomics.
Enrichment analysis is commonly used for the functional study of gene lists detected by high-throughput techniques like expression microarray experiments. It utilizes statistical methods to detect
biological characteristics that are over-represented, relative to chance, in a gene set under study.
Healthcare is also seeking closer integration with biomedical data to boost personalized
medicine and to provide better treatments. Ontologies, which identify entities and relations used in a
domain, play a key role in the automated integration of patient data with relevant knowledge to support
clinical research and drug discovery. Moreover, biomedical literature provides valuable insights into
the identification of potential treatments, and it can support biomedical researchers on their way to
new findings. With the enormous amount of biomedical literature and the rapid growth of the number
of new publications, the wealth of scientific knowledge represented in free text is increasing
dramatically. Extracting relevant information and analyzing text data helps to discover relationships
between biological entities and to answer biological questions.
In this thesis, I developed applications that exploit biomedical knowledge represented in different forms
and existing in different resources to deliver helpful information in Systems Medicine. The first
application is a Java-based enrichment analysis tool built on an enrichment function
from a recent study that uses logistic regression to identify significant categories.
I developed a Java command-line interface that calls the logistic regression function in R, both to integrate
the tool into a Java-based platform and to make it easier for Java users to use.
Moreover, to improve interoperability between clinical and molecular data in biomedical
resources, I developed a lexical mapping module in Java for mapping biomedical
concepts. I used the module to map the International Classification of Diseases (ICD) terms that
represent the names of disease phenotypes in clinical systems to disease concepts in the National Cancer
Institute Thesaurus (NCIT) and the Medical Subject Headings (MeSH®) vocabulary. In addition, to
deliver the pathway and molecular information integrated into the NCIT ontology, I developed a plugin
for the NCIT ontology using the OBA service, which facilitates access to ontology
structures. Using this plugin, I implemented functions that can model disease pathways based on genes.
Furthermore, I used the word2vec implementation in two approaches to generate biomedical
embeddings. The word2vec model is one of the most widely used implementations of word embeddings owing
to its training performance. For the first approach, I used the Dis2Vec model, a vocabulary-driven
word2vec model, to extract disease-drug associations, and I was able to capture visually validated
associations. For the second approach, I created and processed a corpus using different preprocessing
strategies to obtain embeddings for further comparison. For example, one preprocessing pass replaced synonymous terms
with their preferred terms in biomedical databases and assigned type labels to words in order to filter
similarities for entity types like genes, drugs, or human diseases. To ease the exploration of biomedical
concepts and their relations in the embedding, I developed a web service that exposes functions for querying
the embeddings. I validated similarities between entities in the resulting embeddings using existing
knowledge in biomedical databases. Comparisons showed that relations between entities such as known
protein-protein interactions (PPIs), common pathways and cellular functions, or narrower disease
ontology groups correlated with higher vector cosine similarity. Word representations produced by
text mining algorithms such as word2vec therefore capture biologically meaningful relations between
entities. Furthermore, I extracted gene-gene networks from two embedding versions and used them as
prior knowledge to train graph convolutional neural networks (Graph-CNNs) on breast cancer gene expression
data to predict the occurrence of metastatic events. The performance of the resulting models was compared
with that of Graph-CNNs trained with protein-protein interaction networks or with networks derived using other
word embedding algorithms. Graph-CNNs trained with word2vec-embedding-derived networks
performed best on the metastatic event prediction task, ahead of those trained with PPI or other text-mining-based
networks. | de |