Identification of regulatory SNPs and epistatic SNP pairs using deep learning and information theory
Dissertation
Date of oral examination: 2022-07-12
Published: 2022-08-24
Supervisor: Prof. Dr. Armin Schmitt
Referee: Prof. Dr. Armin Schmitt
Referee: Prof. Dr. Stephan Waack
Referee: Prof. Dr. Murtaza Özgür Yeniay
Files
Name: Publishing_Dissertation_Felix_Heinrich.pdf
Size: 11.8 MB
Format: PDF
Abstract
English
In the last two decades, new technologies have made DNA genotyping and sequencing far more time- and cost-efficient. The resulting tremendous increase in the amount of available genomic data allows for a deeper understanding of the relationship between genotype and phenotype. In this thesis, I present two novel frameworks that analyze specific aspects of this relationship, namely the identification of regulatory SNPs (rSNPs) and the detection of epistatic SNP pairs.
In my first framework, I used deep learning to train a convolutional neural network for the prediction of promoter sequences in the species Vicia faba. By exploiting the conservation of promoter signatures across closely related species, I avoided the expensive and time-consuming task of assembling and annotating a reference genome for the species under study. With the detected promoter regions, I was then able to analyze putative rSNPs in terms of their effects on the binding of transcription factors. Finally, my results revealed two rSNPs that were highly associated with the trait under study, namely the vicine and convicine content (V+C) of the plants. These markers could be used in plant breeding programs that target a low V+C content. I thereby also demonstrated that an annotated reference genome is not always necessary for this type of analysis.
For my second framework, I developed a method named MIDESP for the detection of epistatic interactions between SNP pairs based on mutual information. This method extends existing information theory-based approaches for epistasis detection in two key areas. First, by adopting a kth-nearest-neighbor-based approach for estimating mutual information, it is the first mutual information-based method that can be applied to detect epistasis for qualitative as well as quantitative phenotypes. Second, the method incorporates the average product correction (APC) to deal with possible complications in a genotype-phenotype dataset, which may otherwise give rise to the detection of false-positive interactions. I showcase the performance of MIDESP and its different aspects by means of simulated as well as real datasets, which were related to bovine tuberculosis and the weight of chicken eggs, respectively. Comparing the results with and without the APC showed that the correction is necessary to reduce the prediction of false-positive interactions.
Overall, both of my frameworks provide novel insights into specific mechanisms underlying the relationship between genotype and phenotype and identify important SNPs that participate in these mechanisms.
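The average product correction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which each pairwise score is reduced by the product of the two items' mean scores divided by the overall mean score. The function name, the input layout (a symmetric matrix of pairwise scores such as mutual information values) and the toy numbers are illustrative assumptions and are not taken from MIDESP itself.

```python
def average_product_correction(scores):
    """Return APC-corrected scores for a symmetric pairwise score matrix.

    Hypothetical helper for illustration only: `scores[i][j]` is assumed
    to hold a pairwise score (e.g. a mutual information value) for items
    i and j. Diagonal entries are ignored and set to 0.0 in the result.
    """
    n = len(scores)
    if n < 3 or any(len(row) != n for row in scores):
        raise ValueError("expected a square matrix with at least 3 rows")

    # Mean score of each item over all of its partners (diagonal excluded).
    row_mean = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Mean over all off-diagonal entries; every row has n - 1 of them,
    # so this equals the mean of the row means.
    overall_mean = sum(row_mean) / n

    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # The background term is undefined when the overall mean is
            # zero; in that case this sketch applies no correction.
            apc = row_mean[i] * row_mean[j] / overall_mean if overall_mean else 0.0
            corrected[i][j] = scores[i][j] - apc
    return corrected


# Toy example: a uniform background score of 1.0 with one elevated pair (0, 1).
toy_scores = [
    [0.0, 3.0, 1.0, 1.0],
    [3.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
toy_corrected = average_product_correction(toy_scores)
for row in toy_corrected:
    print([round(value, 3) for value in row])
```

In this toy matrix the elevated pair (0, 1) keeps the highest corrected score, while a matrix with a constant background and no signal is reduced to zeros. This is the intuition behind using the correction to suppress scores that merely reflect a high average level.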
Keywords: information theory; mutual information; epistasis; deep learning; regulatory SNPs; convolutional neural networks; Vicia faba