Methods and software to enhance statistical analysis in large scale problems in breeding and quantitative genetics
by Torsten Pook
Date of Examination: 2019-06-27
Date of issue: 2019-11-08
Advisor: Prof. Dr. Henner Simianer
Referee: Prof. Dr. Henner Simianer
Referee: Prof. Dr. Timothy M. Beissinger
Referee: Prof. Dr. Hans-Peter Piepho
Files in this item
Name: Doktorarbeit_Pook_Torsten.pdf
Size: 8.52Mb
Format: PDF
Description: PhD Thesis
Abstract
English
The aim of this thesis is the development of methods and software to enhance statistical analysis in large-scale problems in breeding and quantitative genetics. Chapter 1 gives a brief introduction to the subject of big data and presents the topics relevant for the following chapters.
Chapter 2 presents a new method (HaploBlocker) for the identification of haplotype blocks and libraries, which is implemented in the associated R-package HaploBlocker. In contrast to commonly applied methods for identifying haplotype blocks, HaploBlocker not only utilizes population-wide measures of linkage disequilibrium (LD), such as the correlation between genetic markers, but also analyzes groups of haplotypes for segments with the same genetic origin (identity-by-descent, IBD). A haplotype block is defined as a sequence of genetic markers that has a predefined minimum frequency in the population, and only haplotypes with a similar sequence of markers are considered to carry that block. Since the identified blocks are subpopulation specific, much longer haplotype blocks can be identified than with conventional methods. This in turn leads not only to a substantial reduction in the number of variables for later analysis, but also to potentially more informative variables than single nucleotide polymorphisms (SNPs). Using HaploBlocker, a dataset of 501 doubled haploid lines of a European maize landrace genotyped at 501'124 SNPs was reduced to 2'991 haplotype blocks with an average length of 2'685 SNPs. Despite the lower number of variables, 94% of the genetic diversity of the original dataset can be explained by the block dataset.
Quality control steps must be performed before genetic data can be analyzed with methods such as HaploBlocker. A central part of any quality control protocol is imputation, which is discussed in Chapter 3. Phasing accuracy is of central importance for HaploBlocker and is therefore a special focus of the analysis. In addition, the applicability of commonly applied imputation software to livestock and crop datasets is evaluated, as commonly used tools were originally developed for use in human genetics. In particular, the software BEAGLE is examined, as it enables the user to adapt the algorithm to the genetic structure of the dataset by tuning parameter settings. Imputation error rates were reduced by up to 98.5% by tuning parameters such as the effective population size. Further factors influencing imputation, such as the construction of a suitable reference dataset and the choice and validation of the reference genome used, were also considered.
Chapter 4 presents the software MoBPS (Modular Breeding Program Simulator), which was developed within the scope of this thesis. MoBPS is an R-package that can assist scientists and breeders in simulating both breeding programs and historical populations. Among other things, the simulated breeding programs can be compared in terms of their economic impact, resulting genetic gain and inbreeding. MoBPS uses a modular and flexible design that allows for the simulation of different breeding programs, but is still very efficient in terms of computing time and memory usage.
In the first part of the discussion (Chapter 5), the influence of imputation on the structure of different haplotyping methods is discussed, and subsequently the use of HaploBlocker for genomic prediction is analyzed. In the second part of the discussion, different breeding programs that can be simulated via MoBPS are showcased, and potential analyses that can be performed based on these simulations are briefly discussed. Particular attention is paid to the use of genome editing to accelerate genetic progress for quantitative traits. In the third and last section of this chapter, an outlook on possible further application areas for HaploBlocker and MoBPS is given.
The supplementary material of this thesis contains the user manuals for the two R-packages developed in this work (Supplementary A and B).
Keywords: haplotype blocks; breeding; simulation; R-package; big data; imputation; quantitative genetics; breeding program