Fast methods for metagenomic sequence search and annotation
Cumulative thesis
Date of Examination: 2022-02-21
Date of issue: 2022-06-16
Advisor: Dr. Johannes Söding
Referee: Dr. Johannes Söding
Referee: Prof. Dr. Stephan Waack
Referee: Prof. Dr. Antonio Fernandez-Guerra
Abstract
The past two decades have seen the development of metagenomics, the simultaneous study of the genes and genomes of multiple organisms. In contrast to traditional genomic techniques, which require isolating and growing individual organisms in the lab, metagenomic samples are taken directly from the environment, sequenced and then analyzed in silico. Modern sequencing techniques have enabled high-throughput read-out of the DNA and RNA of microorganism communities in marine, soil, gut and many other environments. The plethora of data generated by these techniques poses a major challenge for existing computational methods. This burden translates directly into longer computational run times and higher costs for the resources required to carry out metagenomic analyses. Computational methods developed for metagenomic analysis must therefore be exceptionally efficient and fast. At the same time, metagenomic studies are becoming relevant to more and more fields of research, so these methods must also suit a wide range of scientific disciplines. In this work, I present three methods I developed to address the throughput bottlenecks of data analysis in metagenomics. (1) The MMseqs2 webserver is a user-friendly extension of the popular homology search method MMseqs2, designed for non-expert users. I accelerated MMseqs2 to process single queries much more quickly and introduced an API to enable MMseqs2's use in web applications. (2) MMseqs2 taxonomy is a method for fast and accurate taxonomy assignment of metagenomic contigs. (3) ColabFold makes the groundbreaking AlphaFold2 protein structure predictions widely accessible. It accelerates the generation of AlphaFold2's input sequence alignments and improves prediction accuracy by means of a novel database enriched with metagenomic sequences from a multitude of datasets.
These methods improve upon the state of the art by introducing novel algorithms and accelerating previous ones, such that previously infeasible analyses become possible, and by making our metagenomic toolbox accessible to users of a wide range of skill levels.
Keywords: Proteins; Sequence Analysis; Metagenomics; Protein Structure Prediction; Homology; Webserver