Accounting for Epistasis in Genomic Phenotype Prediction
by Elaheh Vojgani
Date of Examination: 2021-01-22
Date of issue: 2021-06-03
Advisor: Prof. Dr. Henner Simianer
Referee: Prof. Dr. Henner Simianer
Referee: Prof. Dr. Timothy M. Beissinger
Referee: Prof. Dr. Thomas Kneib
Files in this item
Name: Dissertation_Elaheh Vojgani.pdf
Size: 12.5 MB
Format: PDF
Description: PhD Dissertation
Abstract
English
The wide availability of genomic data has had a considerable impact on plant and animal breeding programs, as it enables the study of genotypes and their relationships with phenotypes. Improving genomic prediction accuracy is of great interest in plant and animal breeding for selection purposes. In quantitative genetics, the standard models account for additive genetic effects, while epistatic effects have been widely ignored because of the computational load they entail. In this thesis, the significance of incorporating epistatic interactions in the genomic prediction of phenotypes is investigated. Chapter 1 presents a general introduction to the impact of genomic data in animal and plant studies, in both breeding value prediction and genomic prediction of phenotypes. Different additive and epistasis models are then reviewed, and the challenges that arise when considering epistasis are detailed. Finally, the univariate and multivariate statistical settings for genomic prediction of phenotypes are compared in terms of their predictive abilities. The main chapters of this thesis are the three corresponding articles presented in Chapters 2, 3, and 4. Chapter 2, “Phenotype Prediction under Epistasis”, introduces two epistatic models, Epistatic Random Regression BLUP (ERRBLUP) and selective Epistatic Random Regression BLUP (sERRBLUP), which are implemented in the R package “EpiGP”, developed to process large-scale genomic data in a computationally efficient manner. ERRBLUP is a full epistatic model that incorporates all pairwise SNP interactions, while sERRBLUP is a selective epistatic model that incorporates a subset of pairwise SNP interactions selected according to their absolute effect sizes or their effect variances. These models are compared to GBLUP, an additive model, in a univariate statistical framework, using genotypes from a publicly available wheat dataset and correspondingly simulated phenotypes.
The results indicate that sERRBLUP leads to a considerable increase in predictive ability compared to ERRBLUP and GBLUP when the optimum proportion of SNP interactions is maintained in the model. In Chapter 3, in the article “Accounting for epistasis improves genomic prediction of phenotypes with univariate and bivariate models across environments”, GBLUP, ERRBLUP and sERRBLUP are developed in a bivariate statistical setting in which two environments are modeled as two separate traits in a multi-trait model. The three models are compared in both univariate and bivariate statistical frameworks in a maize dataset derived from 910 doubled haploid lines of two European landraces, Kemater Landmais Gelb and Petkuser Ferdinand Rot, grown in six locations in Germany and Spain in 2017 and phenotyped for eight traits. In the maize dataset, pairwise SNP interactions are selected based on effect variances, because this criterion is more robust in the sERRBLUP model than selection based on effect sizes. Our results indicate the superiority of sERRBLUP over GBLUP and ERRBLUP in both univariate and bivariate statistical settings when the subset of interactions with the highest effect variances is selected. The comparison between univariate and bivariate models also reveals the superior predictive abilities of bivariate models over univariate models. In Chapter 4, we analyze the utility of haplotype blocks in contrast to LD-pruning in the article “Bivariate genomic prediction of phenotypes by selecting epistatic interactions across years based on haplotype blocks and pruned sets of SNPs”. For this, we consider a model in which observations of the same trait in different years (2017 and 2018) are treated as two separate traits in a multivariate model. This is done for the 873 doubled haploid lines of the respective maize dataset grown in four locations in Germany and Spain in both 2017 and 2018.
The results are in line with our findings from the bivariate model that treats two environments as two separate traits, indicating the superiority of bivariate sERRBLUP over GBLUP in most cases. Overall, the prediction accuracies obtained by LD-pruning and haplotype blocks are similar. However, the use of haplotype blocks can significantly reduce the computation time. Moreover, we explore genomic correlation, phenotypic correlation and trait heritability as three factors influencing the bivariate model’s prediction accuracy. The results illustrate the significance of the genomic correlation between growing seasons for the bivariate model’s prediction accuracy. Phenotypic correlation and heritability of the traits also affect this increase in predictive ability to some extent. In this thesis, the main studied trait in the maize dataset is plant height at the V4 growth stage (PH_V4), and the results for a series of other phenotypic traits are presented in the supplementary material of Chapter 3 and Chapter 4. Finally, the general discussion is presented in Chapter 5, in which our proposed selection method in the sERRBLUP model is compared with other methods of variable selection, indicating the superiority of our proposed selection method. Furthermore, the factors influencing the predictive ability of the genomic prediction models are investigated. In this regard, linkage disequilibrium based SNP pruning, a potential approach to reduce the number of SNPs so that the application of epistasis models becomes feasible, is shown to result in predictive abilities as good as or better than those obtained from the full panel of SNPs. Moreover, the cross-validation scenario in bivariate statistical settings is shown to be an important factor affecting the bivariate models’ predictive abilities.
In addition, the level of genotype overlap is found to be significantly correlated with the increase in the bivariate model’s predictive ability under the cross-validation scenario that leads to higher predictive ability. Under the assumption of a high level of genotype overlap, the genomic correlation is significantly correlated with the bivariate models’ predictive abilities for highly heritable traits. Phenotypic correlation is also shown to be an influential factor in this context. Finally, incorporating transcriptomic data into epistasis genomic prediction models, incorporating weather data into epistasis multi-trait genomic prediction models, and exploring single-trait and multi-trait epistasis GWAS are proposed as potential directions for future research in the context of epistasis models.
Keywords: Genomic prediction, Prediction across environments, Prediction across years, Epistasis, GBLUP, ERRBLUP, sERRBLUP, EpiGP, Multi-trait models, Interaction, Genomic correlation, Haplotype blocks, Linkage disequilibrium (LD) based SNP pruning
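Illustrative sketch: the abstract describes a full model built on all pairwise SNP interactions and a selective model that keeps only a subset ranked by absolute effect size. The Python sketch below illustrates that general idea on simulated toy data. It is not the EpiGP implementation: the product coding of interactions, the ridge regression used to estimate effects, the penalty value, and all function names are assumptions made for illustration only.

```python
import numpy as np
from itertools import combinations


def pairwise_interactions(M):
    """Build one feature per SNP pair as the product of the two SNP codes.

    Assumption: product coding is used here for simplicity; the thesis may
    code interactions differently.
    """
    pairs = list(combinations(range(M.shape[1]), 2))
    X = np.column_stack([M[:, i] * M[:, j] for i, j in pairs])
    return X, pairs


def relationship_matrix(X):
    """Similarity between lines: centred cross-product scaled by feature count."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T / X.shape[1]


def ridge_effects(X, y, lam):
    """Estimate feature effects by ridge regression (dual form, n x n solve).

    Assumption: ridge regression stands in for the effect estimation step.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n = X.shape[0]
    return Xc.T @ np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), yc)


def select_top(effects, proportion):
    """Indices of the given proportion of features with largest |effect|."""
    k = max(1, int(round(proportion * effects.size)))
    return np.argsort(-np.abs(effects))[:k]


# Toy data: 60 doubled haploid lines (coded 0/2) and 12 SNPs, all simulated.
rng = np.random.default_rng(0)
M = rng.integers(0, 2, size=(60, 12)) * 2
X, pairs = pairwise_interactions(M)

# Simulated phenotype driven by two interaction features plus noise.
y = 1.5 * X[:, 0] - 1.0 * X[:, 5] + rng.normal(scale=1.0, size=60)

# Full model: relationship matrix from all pairwise interactions.
G_full = relationship_matrix(X)

# Selective model: keep the top 10 % of interactions by absolute effect size.
effects = ridge_effects(X, y, lam=10.0)
selected = select_top(effects, proportion=0.10)
G_sel = relationship_matrix(X[:, selected])
```

Either matrix could then be passed as the covariance structure of a mixed model in place of an additive genomic relationship matrix. In practice, the effects would be estimated on training data only, so that the selection does not use the phenotypes being predicted.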