
Filtered spaced-word matches: a novel approach to fast and accurate sequence comparison

dc.contributor.advisor: Morgenstern, Burkhard Prof. Dr.
dc.contributor.author: Leimeister, Chris-Andre
dc.title: Filtered spaced-word matches: a novel approach to fast and accurate sequence comparison [de]
dc.title.translated: Filtered spaced-word matches: a novel approach to fast and accurate sequence comparison [de]
dc.contributor.referee: Söding, Johannes Dr.
dc.description.abstracteng: Standard methods for biological sequence comparison and phylogeny reconstruction are traditionally based on sequence alignments. These methods are very accurate but also computationally expensive. Because the amount of biological sequence data is growing exponentially, alignment-free methods have become more important over the past decades. Alignment-free methods are substantially faster than alignment-based methods and are essential for large-scale sequence comparison. One major application of alignment-free methods is whole-genome phylogeny reconstruction. To this end, distances between pairs of genomes are calculated and subsequently clustered. Current alignment-free methods are fast but less accurate than alignment-based approaches. In this thesis, I developed the filtered spaced-word matches (FSWM) approach, a new alignment-free method for fast and accurate whole-genome phylogeny reconstruction. FSWM rapidly identifies spaced-word matches, which are defined by patterns of match and don’t-care positions. The fraction of non-matching nucleotides at the don’t-care positions is used to estimate evolutionary distances. To reduce the noise from random matches, I developed a filtering technique that calculates a similarity score for each spaced-word match and discards matches with a score below a threshold. This filtering removes most of the unwanted background matches, and the distances calculated from the remaining spaced-word matches are very accurate. Moreover, I investigated whether FSWM can be used to identify anchor points for genome alignments. I integrated a slightly modified version of FSWM into Mugsy, a popular multiple-genome-alignment pipeline. If FSWM is used to identify anchor points, more homologies are found and aligned, and the alignments are of higher quality. Furthermore, I transferred the idea of FSWM from genomic sequences to protein sequences.
I developed Prot-SpaM, a fast tool that estimates evolutionary distances between pairs of whole proteomes. Prot-SpaM is the first alignment-free tool that estimates the number of substitutions between pairs of protein sequences without sequence
dc.contributor.coReferee: Beißbarth, Tim Prof. Dr.
dc.subject.eng: sequence comparison [de]
dc.affiliation.institute: Göttinger Zentrum für molekulare Biowissenschaften (GZMB) [de]
dc.subject.gokfull: Molekularbiologie, Gentechnologie (PPN619462973) [de]
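
The procedure outlined in the abstract (find spaced-word matches, score each match, discard low-scoring matches, then estimate a distance from the mismatch fraction at the don't-care positions) can be illustrated with a minimal sketch. This is not the thesis implementation. The function names, the pattern encoding ('1' for a match position, '0' for a don't-care position), the +1/-1 scoring, the default threshold of 0, and the Jukes-Cantor correction are all assumptions made for this example; the abstract only speaks of "a similarity score" and "a threshold".

```python
import math
from collections import defaultdict


def spaced_word_matches(seq1, seq2, pattern):
    """Yield (i, j) where seq1[i:] and seq2[j:] agree at every '1' position
    of the pattern. '0' positions are don't-care positions (assumed encoding)."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index all spaced words of seq1 by the characters at the match positions.
    index = defaultdict(list)
    for i in range(len(seq1) - span + 1):
        index["".join(seq1[i + k] for k in match_pos)].append(i)
    # Look up each spaced word of seq2 in that index.
    for j in range(len(seq2) - span + 1):
        word = "".join(seq2[j + k] for k in match_pos)
        for i in index.get(word, ()):
            yield i, j


def estimate_distance(seq1, seq2, pattern, threshold=0):
    """Estimate substitutions per site from filtered spaced-word matches.

    Scoring (+1 identical, -1 different at each don't-care position) and the
    Jukes-Cantor correction are illustrative assumptions, not the thesis method.
    Returns None if no match passes the filter."""
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = total = 0
    for i, j in spaced_word_matches(seq1, seq2, pattern):
        pairs = [(seq1[i + k], seq2[j + k]) for k in dont_care]
        score = sum(1 if a == b else -1 for a, b in pairs)
        if score < threshold:
            continue  # treat as a random background match and discard it
        mismatches += sum(a != b for a, b in pairs)
        total += len(pairs)
    if total == 0:
        return None
    p = mismatches / total  # mismatch fraction at don't-care positions
    if p >= 0.75:
        return float("inf")  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1 - 4 * p / 3)
```

For example, `estimate_distance(genome_a, genome_b, "1100101011")` would return one pairwise distance for two hypothetical sequence strings; repeating this for all pairs gives a distance matrix that a clustering method could turn into a tree. A real tool would use many more match positions and a substitution-matrix score, so the values from this sketch are only indicative.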
