Alignment-free Phylogeny Reconstruction Based On Quartet Trees
by Thomas Dencker
Date of oral examination: 2020-03-04
Supervisor: Prof. Dr. Burkhard Morgenstern
Referee: Prof. Dr. Burkhard Morgenstern
Referee: Prof. Dr. Stephan Waack
Traditional methods for phylogeny reconstruction are based on multiple sequence alignments and character-based methods. This combination of computationally expensive methods leads to very accurate results, but it is ill-suited to the enormous amount of sequence data available today. As a consequence, very fast alignment-free methods have been developed. These methods calculate pairwise distances in order to build phylogenetic trees. However, current alignment-free methods are generally less accurate than traditional methods. In this thesis, I developed Multi-SpaM, a novel alignment-free approach that tries to combine the best of both worlds. This method quickly finds small gap-free ‘microalignments’ (so-called blocks) involving four sequences. A binary pattern defines the positions at which the nucleotides have to match. At the remaining don’t-care positions, the possibly mismatching nucleotides are first used to remove random matches with a filtering procedure previously introduced by Filtered Spaced-Word Matches (FSWM). Then, the character-based method RAxML is used to find the optimal quartet tree for each block. Subsequently, all quartet trees are amalgamated into a supertree with Quartet MaxCut. This approach can be used to build phylogenetic trees of high quality. Furthermore, I showed multiple ways in which Multi-SpaM could be improved. The distances between two adjacent blocks involving the same four sequences can be used to identify putative insertions and deletions, from which accurate quartet trees can be derived. These trees could be used both on their own and in combination with the quartet trees produced by Multi-SpaM to build or improve phylogenetic trees using Quartet MaxCut. As an alternative, we also used Maximum Parsimony to infer accurate phylogenies from these putative insertions and deletions.
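The role of the binary pattern can be illustrated with a minimal Python sketch. It is not the Multi-SpaM implementation: the function names, the pattern notation ('1' for a match position, '0' for a don't-care position), and the +1/-1 scoring are assumptions chosen for illustration only. The scoring scheme and threshold actually used by FSWM are not stated in this abstract.

```python
def is_block(pattern, seqs, starts):
    """Return True if all sequences carry the same nucleotide at every
    match ('1') position of the pattern; don't-care ('0') positions are
    ignored. `starts` holds one start index per sequence."""
    # Reject windows that would run past the end of a sequence.
    if any(i + len(pattern) > len(s) for s, i in zip(seqs, starts)):
        return False
    for offset, flag in enumerate(pattern):
        if flag == "1":
            column = {s[i + offset] for s, i in zip(seqs, starts)}
            if len(column) != 1:
                return False
    return True


def dont_care_score(pattern, a, i, b, j, match=1, mismatch=-1):
    """Score two windows over the don't-care ('0') positions only.
    The +1/-1 values are an illustrative assumption, not FSWM's scoring."""
    return sum(
        match if a[i + k] == b[j + k] else mismatch
        for k, flag in enumerate(pattern)
        if flag == "0"
    )


# Four hypothetical sequences that agree at pattern positions 0, 1 and 3.
quartet = ["ACGTA", "ACTTA", "ACATA", "ACCTA"]
block_found = is_block("1101", quartet, [0, 0, 0, 0])
pair_score = dont_care_score("1001", "ACGT", 0, "ACTT", 0)
```

In this sketch a candidate block would be kept only if its don't-care score exceeds some threshold. That filtering step stands in for the FSWM procedure mentioned above.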
In other experiments, I weighted the individual quartet trees by SH-like support values and tried Neighbor-Joining as a way to speed up Multi-SpaM. Moreover, I contributed to another extension of the FSWM approach. Here, we used filtered spaced-word matches as anchor points for the genome alignment tool Mugsy. We found that, for more distantly related species, more homologous pairs could be aligned than with the other anchor points used with the same alignment program.
Keywords: alignment-free; sequence comparison; phylogeny reconstruction