Assessment and Advancement of Genotype Imputation for genome-wide Association Studies
Doctoral thesis
Date of Examination: 2025-05-09
Date of issue: 2025-05-20
Advisor: Prof. Dr. Heike Bickeböller
Referee: Prof. Dr. Heike Bickeböller
Referee: Prof. Dr. Thomas Kneib
Files in this item
Name: Doktorarbeit_bib_online_fertig.pdf
Size: 4.25Mb
Format: PDF
Abstract
English
In genome-wide association studies (GWASs), genetic markers across the whole genome are tested independently to find genetic variants associated with a phenotype in a given population. The most commonly used markers for such analyses are single nucleotide polymorphisms (SNPs), which capture a large share of genetic variation in humans. To keep the cost of genotyping low while maintaining power, a common approach is genotype imputation. Instead of sequencing all individuals in a study in depth, only a subset of informative SNPs spread across the genome is genotyped. The gaps between genotyped SNPs are imputed from a reference panel that ideally comprises a large number of fully sequenced individuals from the same population. The imputation algorithm uses the linkage disequilibrium (LD) structure of the genome to find adequate matches between the reference panel and the study data set. To ensure confidence in the results of imputation and any subsequent analysis, imputation quality is estimated, and SNPs that do not meet a set quality threshold are discarded. Since discarded SNPs are not tested for association, this also reduces the multiple testing burden. Imputation quality measures estimate accuracy based on the distribution of the imputed SNPs. As a result, poorly imputed SNPs may go undetected when their imputed genotypes are probable but wrong. Further, there is no definite recommendation for setting the imputation quality threshold: stricter thresholds prioritize discarding possibly wrongly imputed SNPs, while more lenient thresholds prioritize preserving possibly correctly imputed SNPs. Threshold-based filtering also does not consider LD, which plays a major role both in imputation and in the interpretation of GWAS results. One main objective of this thesis is to assess genotype imputation and quality control in GWAS settings.
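The threshold trade-off described above can be illustrated with a minimal sketch. The SNP identifiers, quality scores, and the two cut-offs (0.3 and 0.8) are hypothetical values chosen for illustration only; they are not taken from the thesis, and the function name `filter_by_quality` is an assumption rather than part of any imputation tool.

```python
from typing import List, NamedTuple


class ImputedSnp(NamedTuple):
    snp_id: str
    quality: float  # estimated imputation quality in [0, 1] (hypothetical values)


def filter_by_quality(snps: List[ImputedSnp], threshold: float) -> List[ImputedSnp]:
    """Keep only SNPs whose estimated quality is at least `threshold`."""
    return [snp for snp in snps if snp.quality >= threshold]


# Hypothetical imputed SNPs with made-up quality estimates.
snps = [
    ImputedSnp("snp1", 0.95),
    ImputedSnp("snp2", 0.72),
    ImputedSnp("snp3", 0.41),
    ImputedSnp("snp4", 0.28),
]

# A lenient cut-off preserves more SNPs, including some that may be wrong.
lenient = filter_by_quality(snps, threshold=0.3)

# A strict cut-off discards more SNPs, including some that may be correct.
strict = filter_by_quality(snps, threshold=0.8)

print([snp.snp_id for snp in lenient])
print([snp.snp_id for snp in strict])
```

Note that such a filter treats each SNP in isolation, which is the limitation with respect to LD mentioned above.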
In this thesis, I compare the performance of different imputation tools and of imputation quality control methods, both on simulated data and on real data in which some genotyped SNPs were removed and re-imputed. By directly comparing imputed SNPs with ground truth genotypes, I quantify the accuracy of imputation and the effectiveness of quality control to identify weaknesses and explore solutions. Further, I conduct a simulation study to assess the performance of imputation quality control and introduce a new method for imputation quality control in GWAS, the Midrange Filter. By aggregating SNPs in close proximity to spikes, the Midrange Filter outperforms established imputation quality thresholds in the simulation study, a result supported by a real-data application to the PsyCourse study. An implementation is publicly available. In addition, this thesis includes an analysis of longitudinal phenotypes of healthy controls in the PsyCourse study, many of which are rarely assessed in individuals not diagnosed with psychological diseases on the affective-to-psychotic spectrum. The investigation found no strong evidence against the stability assumption of questionnaires and psychiatric scales. Further, a retest effect was identified in cognitive tests.
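The comparison of re-imputed SNPs against ground truth genotypes can be sketched as follows. This is an illustration under stated assumptions, not the evaluation used in the thesis: genotypes are coded as 0, 1, or 2 copies of an allele, imputed values are dosages between 0 and 2, and the two metrics (best-guess concordance and squared Pearson correlation) are common choices that the abstract does not name. All data values are made up.

```python
import math
from typing import Sequence


def concordance(true_genotypes: Sequence[int], imputed_dosages: Sequence[float]) -> float:
    """Share of individuals whose best-guess imputed genotype equals the truth."""
    # int(d + 0.5) rounds non-negative dosages to the nearest genotype (halves round up).
    best_guess = [int(d + 0.5) for d in imputed_dosages]
    matches = sum(1 for t, g in zip(true_genotypes, best_guess) if t == g)
    return matches / len(true_genotypes)


def squared_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; NaN if either input has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy ** 2 / (sxx * syy)


# Hypothetical masked SNP: true genotypes and re-imputed dosages for six individuals.
true_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.6, 0.2, 1.9]

print(concordance(true_genotypes, imputed_dosages))
print(squared_correlation(true_genotypes, imputed_dosages))
```

Unlike estimated quality scores, both metrics require the true genotypes, which is why they are only available for simulated data or for genotyped SNPs that were deliberately masked.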
Keywords: GWAS; Genotype Imputation; Simulation Study; Quality Control; PsyCourse Study; Midrange Filter