Kernel-Based Pathway Approaches for Testing and Selection
by Stefanie Friedrichs
Date of oral examination: 2017-09-25
Supervisor: Prof. Dr. Heike Bickeböller
Referee: Prof. Dr. Thomas Kneib
Referee: Prof. Dr. Tim Beißbarth
As the number of single nucleotide polymorphisms (SNPs) available in genetic data continues to increase, the evaluation of SNP sets has become a successful approach toward elucidating the genetic influence on various complex diseases. The joint investigation of multiple SNPs increases the probability of detecting moderate and weak association signals and bypasses the multiple testing problem inherent to testing procedures on the genome-wide scale. Furthermore, this approach assists in the biological interpretation of analysis results, particularly when the SNP set represents a pathway, here denoting a set of genes that jointly fulfil a particular biological function. The association between a pathway-representing SNP set and a phenotype may be analysed appropriately with the kernel machine approach, which evaluates the genotypes of multiple SNPs jointly by transforming them into a kernel matrix comprising the genetic similarity measures for every pair of individuals in the study. The kernel matrix is calculated by a predefined kernel function. Multiple kernel functions have been proposed, some of which can integrate further biological knowledge on a pathway and allow for varying types of effect. The network kernel function enables the direct incorporation of a pathway's network structure, while at the same time considering additive as well as interaction effects in the investigated SNP set. A multitude of databases are now available, offering an increasing amount of biologically meaningful information on pathways, genes, and genetic markers. The initial work in this thesis investigates the possibilities and the impact of integrating additional biological information into existing approaches to the analysis of genetic data. The impact of marker density, SNP-set aggregation with respect to linkage disequilibrium structures, and knowledge sources was considered.
In this context, the software package kangar00 was developed in R and made freely available. It offers a wide range of functions for data download, pre-processing, transformation, and evaluation for single-pathway testing in the logistic kernel machine framework. The identification of specific biological processes influencing disease risk is still very challenging, despite the integration of growing amounts of biological data. Single-pathway methods usually cannot discriminate causal processes influencing disease susceptibility from isolated genetic effects included in a pathway as a result of gene overlaps. Moreover, they usually lack the ability to predict a trait of interest. The main objective of this thesis is the development of a new method for the evaluation of SNP sets, focussing on the analysis of those representing pathways. The resulting analysis approach enables the joint investigation of multiple sets of SNPs through the adaptation of a boosting algorithm. Boosting originates from the field of machine learning, in which it was developed as a classification approach. Its main idea is to iteratively combine functions with poor classification performance into a strong classifier. If the functions considered depend only on a subset of the available explanatory variables, variable selection may be performed while the model is fitted. We made use of this to perform selection on a set of pathways by employing a kernel function dependent on SNP sets representing pathways. Since all pathways of interest are investigated jointly in the boosting algorithm, correlations between them are also considered. We may therefore discriminate biological processes influencing disease susceptibility from single-effect genes included in a pathway as a result of gene overlap. Our software package kangar00 includes an interface to a boosting algorithm, so that all functionality necessary to apply kernel boosting is available.
Thanks to its inherent properties and the freely available software implementation, kernel boosting has great potential to elucidate key biological functions involved in disease risk, while creating a directly interpretable model to predict disease status.
Keywords: Kernel Approaches; Boosting; GWAS Analysis; SNP-set analysis; Analysis of Case-Control Studies; Integration of biological information; Pathway analysis; Multiple-pathway method; Logistic Regression; Gene-network overlap
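
The two ideas summarised in the abstract, a kernel matrix of pairwise genetic similarities and pathway selection through boosting, can be illustrated with a minimal Python sketch. It is not the kangar00 implementation and does not use its API. The assumptions are: a plain linear kernel stands in for the network kernel, each pathway's base learner is a kernel ridge smoother with a hypothetical penalty `lam`, the loss is the binomial log-likelihood for a 0/1 disease status, and the data are simulated. The function names `linear_kernel` and `kernel_boost` are illustrative only.

```python
import numpy as np

def linear_kernel(Z):
    """Pairwise genetic similarity from an (individuals x SNPs) matrix coded 0/1/2."""
    Z = np.asarray(Z, dtype=float)
    return Z @ Z.T

def kernel_boost(kernels, y, n_iter=50, nu=0.1, lam=1.0):
    """Component-wise gradient boosting with one kernel base learner per pathway.

    Assumption: each base learner is a kernel ridge smoother S = K (K + lam*I)^-1.
    Returns the fitted log-odds and the pathway index chosen in each iteration.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # K and (K + lam*I)^-1 commute, so solve() yields the same smoother matrix.
    smoothers = [np.linalg.solve(K + lam * np.eye(n), K) for K in kernels]
    p_bar = y.mean()
    f = np.full(n, np.log(p_bar / (1.0 - p_bar)))  # start at the marginal log-odds
    selected = []
    for _ in range(n_iter):
        # Negative gradient of the binomial loss with respect to f.
        grad = y - 1.0 / (1.0 + np.exp(-f))
        fits = [S @ grad for S in smoothers]
        sse = [np.sum((grad - g) ** 2) for g in fits]
        best = int(np.argmin(sse))   # pathway whose learner fits the gradient best
        f += nu * fits[best]         # small step towards that pathway's fit
        selected.append(best)
    return f, selected

# Simulated example: three pathways of 10 SNPs each, only pathway 0 carries signal.
rng = np.random.default_rng(0)
genotypes = [rng.binomial(2, 0.3, size=(200, 10)) for _ in range(3)]
y = (genotypes[0][:, :3].sum(axis=1) >= 3).astype(float)
kernels = [linear_kernel(Z) for Z in genotypes]
f_hat, selected = kernel_boost(kernels, y)
print(np.bincount(selected, minlength=3))  # how often each pathway was selected
```

Because only the best-fitting pathway is updated in each iteration, pathways that are never chosen drop out of the model. This is the variable selection property the abstract refers to. The selection counts printed at the end give a rough indication of each pathway's relevance in the simulated data.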