Computational methods for prokaryotic (meta)genomic annotation and function prediction using horizontal information transfer
Doctoral thesis
Date of Examination: 2023-08-24
Date of issue: 2024-08-05
Advisor: Dr. Johannes Söding
Referee: Prof. Dr. Burkhard Morgenstern
Referee: Dr. Juliane Liepe
Files in this item
Name: RuoshiZHANG_PhD_thesis_SUB.pdf
Size: 7.98Mb
Format: PDF
Description: PhD Thesis
Abstract
English
In recent decades, there has been a surge in genomic sequences resulting from large-scale genomic and metagenomic sequencing projects, driven by the significant drop in sequencing costs. More powerful methods to annotate this deluge of metagenomic and genomic sequences are urgently needed. Metagenomic samples consist mostly of viruses and prokaryotes (bacteria and archaea), which live in constant flux and interact in complex ways. This poses new challenges for fully understanding the information brought to light by these new genomic sequences, particularly with regard to three key questions: Who are they? What do they do? With whom do they interact? In this work, I present three projects aimed at shedding light on the latter two questions. To overcome the bottlenecks of detection sensitivity and speed, I improved upon state-of-the-art methods by proposing novel and efficient algorithms designed to run on large-scale genomic and metagenomic data. (1) SpacePHARER is a fast and sensitive method to identify phage-host relationships de novo from CRISPR spacer sequences. Compared to conventional BLASTN approaches, SpacePHARER is 1.4 to 4 times more sensitive and up to 47 times faster, making it suitable for analyzing massive metagenomic datasets. (2) Spacedust is a sensitive method for systematically identifying conserved gene clusters de novo from prokaryotic genomes. It facilitates protein functional annotation, because functionally associated genes tend to cluster close to each other in prokaryotic genomes. By performing sensitive structural similarity searches with Foldseek, Spacedust enables comprehensive and efficient all-vs-all searches of thousands of genomes. (3) The third project is a self-supervised method for accurate and systematic operon prediction that can be applied universally to prokaryotic genomes and metagenomic bins.
This method leverages conserved clusters detected by Spacedust and incorporates intergenic distance information through a self-training approach, providing valuable insights into gene function and regulation. Its performance surpasses that of existing operon prediction techniques on a diverse set of seven genomes. These approaches are particularly valuable because they do not rely on prior knowledge and are not biased towards well-characterized model organisms. Instead, they leverage the diversity of organisms uncovered through metagenomics to fully explore and exploit unannotated genomic sequences. This work underscores the immense potential that often remains unexplored within large-scale prokaryotic genomic data, and represents a significant advancement in our capability to decode it.
Keywords: Metagenomics; Homology; Sequence Analysis; Phage-host relationships; Protein function annotation