Computational methods for de novo assembly and sequencing error correction of short reads in the era of (viral) metagenomics
Doctoral thesis
Date of Examination: 2022-10-04
Date of issue: 2022-12-15
Advisor: Dr. Johannes Söding
Referee: Dr. Johannes Söding
Referee: Prof. Dr. Burkhard Morgenstern
Files in this item
This file will be freely accessible after 2023-10-03.
English
Viruses can infect all types of living cells, including bacteria, archaea and eukaryotes. Especially in the form of bacteriophages (viruses that infect bacteria), they have a huge impact on their host communities, driving bacterial diversity and shaping composition, interactions, functions and even genomes. Despite their importance, very little is known about the viral component of microbial communities. Recent advances in sequencing technologies and the advent of metagenomics allow a culture-independent analysis of the whole genetic material from an environmental sample. This makes it possible to discover previously uncharacterized and newly emerging viruses within their natural environment. In this work, I address two computational tasks in metagenomic data analysis, with a focus on the viral fraction.
In the first part of this thesis, I introduce PenguiN (protein-guided nucleotide assembler), a new metagenomic de novo assembler. PenguiN utilizes full-read overlaps calculated in linear time on both amino acid and nucleotide sequences within a greedy iterative assembly procedure. PenguiN is built upon the protein-level assembler Plass. In a first stage, six-frame translated reads are assembled into proteins while the underlying nucleotide sequence is assembled simultaneously, resulting in full open reading frames (ORFs). In a second stage, the resulting ORFs are linked with nucleotide reads to bridge intergenic regions as well, enabling the assembly of whole genomes. Additionally, I introduce a new extension strategy that uses a Bayesian model to identify the best overlaps in each iteration, and I describe a strategy to detect circular sequences. By utilizing full-read overlaps in linear time, PenguiN overcomes the sensitivity-specificity trade-off seen in state-of-the-art k-mer-based (de Bruijn graph) metagenomic assemblers, while being much faster than existing overlap-based assemblers.
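The six-frame translation that the first stage starts from can be illustrated with a minimal Python sketch. This is not PenguiN's or Plass's code: the function names are hypothetical, the standard genetic code is assumed, and codons containing anything other than A, C, G or T are simply rendered as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(read):
    """Map (strand, offset) to the protein translation of that reading frame."""
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translate("ATGGCCTAA").items():
        print(frame, protein)
```

Each read yields six candidate protein fragments, three per strand; an assembler working at the protein level can then look for overlaps among these fragments rather than among the raw nucleotides.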
Moreover, focusing on the viral fraction of microbial communities, I show that PenguiN assembles longer contigs and more complete genomes than existing assembly tools and overcomes the typical loss of population diversity seen in metagenomic assemblies. Further, I show that PenguiN can obtain long viral contigs even at very low read coverage. On a simulated metagenome, I obtain a 3- to 11-fold increase in per-nucleotide sensitivity compared to the next best tool at comparable per-nucleotide precision. On a metatranscriptomic dataset from 82 aquatic and activated sludge samples, PenguiN assembles about 75-90% (343-376) more complete ssRNA phage genomes than state-of-the-art tools.
In the second part of the thesis, I introduce CoCo, a new software tool for sequencing error correction. By identifying sequencing errors as discontinuities in spaced k-mer frequencies along a read, CoCo can make local decisions instead of using a global threshold. Together with a very conservative two-sided correction strategy, this makes CoCo more specific for low-frequency variants than tools that rely on global k-mer count statistics. Moreover, I introduce a memory-efficient data structure to store the spaced k-mer counts. This makes it possible to run CoCo on large and complex metagenomic datasets. Using CoCo's corrected sequencing reads for PenguiN's assembly improves the final contigs, which become more contiguous and more accurate.
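The idea of treating a sequencing error as a local discontinuity in spaced k-mer counts can be illustrated with a small Python sketch. This is not CoCo's algorithm or data structure: the mask `11011`, the drop ratio of 0.25, and all function names are assumptions chosen for illustration, and counts are held in a plain `Counter` rather than a memory-efficient structure.

```python
from collections import Counter

# Hypothetical spaced mask: '1' positions are kept, '0' positions are ignored.
MASK = "11011"


def spaced_kmers(read, mask=MASK):
    """Yield the spaced k-mer for every window of len(mask) along the read."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(read) - len(mask) + 1):
        yield "".join(read[start + i] for i in keep)


def count_spaced_kmers(reads, mask=MASK):
    """Count spaced k-mers over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(spaced_kmers(read, mask))
    return counts


def count_profile(read, counts, mask=MASK):
    """Return the count of each window's spaced k-mer, in read order."""
    return [counts[kmer] for kmer in spaced_kmers(read, mask)]


def find_drops(profile, min_ratio=0.25):
    """Return window indices whose count falls below min_ratio times the
    count of the preceding window (a local, per-read criterion)."""
    return [
        i for i in range(1, len(profile))
        if profile[i] < min_ratio * profile[i - 1]
    ]


if __name__ == "__main__":
    genome = "ACGTTGCAAGCTTAGGCATC"
    noisy = genome[:12] + "G" + genome[13:]  # one substitution at position 12
    counts = count_spaced_kmers([genome] * 10 + [noisy])
    print(count_profile(noisy, counts))
    print(find_drops(count_profile(noisy, counts)))
```

Because the criterion compares each window with its neighbour, a rare but consistently covered variant produces a flat low profile and is not flagged, whereas a single erroneous base produces an abrupt drop. A global count threshold cannot make that distinction.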
Keywords: Assembly; Metagenomics; Sequence Analysis; Error correction; Virus