Kernel Methods for Genes and Networks to Study Genome-Wide Associations of Lung Cancer and Rheumatoid Arthritis
by Saskia Freytag
Date of Examination: 2014-01-08
Date of issue: 2014-02-28
Advisor: Prof. Dr. Heike Bickeböller
Referee: Prof. Dr. Thomas Kneib
Referee: Prof. Dr. Martin Schlather
Files in this item
Name: FinalPhDThesisFreytag.pdf
Size: 428.Kb
Format: PDF
Description: Cumulative PhD Thesis
Abstract
The search for genetic causes of common complex diseases has been revolutionized by the ability to genotype exceptionally large numbers of single nucleotide polymorphisms (SNPs) in hundreds of individuals at an affordable cost. Statistical analysis of the data generated in hundreds of such genome-wide association studies has identified genetic risk variants with differing degrees of success. Overall, these genetic risk variants account for only a fraction of the observed genetic heritability. Reasons suggested for this shortcoming range from statistical problems with conventional analysis tools to the failure to model the complexity of the human organism properly. One proposal for uncovering a portion of the ‘missing heritability’ is the analysis of biologically meaningful SNP sets. Methods based on SNP sets are typically powerful and aid the interpretation of results through the incorporation of biological knowledge. Kernel methods, in particular the logistic kernel machine test, are a popular approach to identifying associations between an investigated disease and SNP sets. Such methods formulate the estimation problem in a reproducing kernel Hilbert space of functions, which is uniquely defined by a positive semi-definite kernel. This has the benefit of facilitating the construction and estimation of a wide variety of genetic effect models. However, this immense flexibility can also prove problematic. The kernel most suitable for a particular problem is seldom obvious, and the choice made strongly affects the ability of the kernel method to discover genuine associations. One of the main objectives of this thesis is the development of appropriate kernels for the analysis of SNP sets, such as genes or pathways. Here, a pathway is defined as a network of interacting genes responsible for achieving a specific cell function or regulation.
In this thesis, I introduce a kernel that corrects for the bias arising from pathways that differ in size, measured by the number of SNPs or genes. This kernel also reflects the basic architecture of a pathway. This concept is expanded by constructing another kernel that integrates specific gene-gene regulations. Through simulation studies and application to real data on rheumatoid arthritis and lung cancer, I demonstrate both the robustness and the practical usefulness of the logistic kernel machine test with the two kernels introduced above. Another main objective of this thesis is to compare kernel methods with other approaches to the analysis of pathways or genes. This includes comparing the performance of various multi-marker methods for ranking genes according to their strength of association with an investigated disease. In many genetic scenarios, kernel methods can be shown to perform better than the alternatives. In addition, this thesis includes a comparative study of chips currently used to genotype SNPs in the human genome. The chips are assessed with regard to their coverage of the genome, price, and efficiency.
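As a rough illustration of the kernel idea described in the abstract, the following Python sketch builds a kernel matrix from a small SNP genotype matrix, checks that it is positive semi-definite, and evaluates a score-type statistic of the form Q = (y - mu0)' K (y - mu0) / 2, which is how the logistic kernel machine test statistic is commonly written. Everything here is an assumption made for illustration: the plain linear kernel stands in for the size-adjusted and network-based kernels developed in the thesis (which are not reproduced), the function names are hypothetical, the data are random toy values, and the null model contains only an intercept. No p-value is computed.

```python
import numpy as np


def linear_kernel(G):
    """Linear kernel K = G G^T for an (individuals x SNPs) genotype matrix."""
    G = np.asarray(G, dtype=float)
    return G @ G.T


def is_positive_semidefinite(K, tol=1e-8):
    """Return True if K is symmetric with no eigenvalue below -tol."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)


def score_statistic(y, mu0, K):
    """Q = (y - mu0)^T K (y - mu0) / 2, with mu0 the null-model probabilities."""
    residual = np.asarray(y, dtype=float) - np.asarray(mu0, dtype=float)
    return float(residual @ K @ residual) / 2.0


# Toy data (assumed, not from the thesis): 6 individuals, 4 SNPs,
# genotypes coded as minor-allele counts 0, 1 or 2.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(6, 4))
y = np.array([1, 0, 1, 0, 1, 0])  # case/control status

# Intercept-only null model: the fitted probability is the case proportion.
mu0 = np.full(y.shape[0], y.mean())

K = linear_kernel(G)
Q = score_statistic(y, mu0, K)
print("K is positive semi-definite:", is_positive_semidefinite(K))
print("Q =", Q)
```

Swapping `linear_kernel` for another positive semi-definite kernel changes the genetic effect model being tested without changing the rest of the computation, which is the flexibility the abstract refers to.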
Keywords: genetic epidemiology; kernel methods; biological pathways; genome-wide association studies; genes; networks; logistic kernel machine test; ranking; SNP chips; missing heritability; efficiency; coverage