|dc.description.abstracteng||In recent years, advances in sequencing techniques have resulted in an explosive increase in sequencing data. Here, computational methods and bioinformatic analyses are presented that help to keep pace with the growing amount of data.
In the post-genomic era, an important step in deriving knowledge from sequence information is to find the protein-coding genes in genomes. Scipio, a tool to reconstruct exon-intron gene structures, was improved for accurate cross-species gene reconstruction. Compared with other tools, it performed best in reconstructing the dynein heavy chain genes in the whole Loxodonta africana (elephant) genome based on human protein sequences. Only eleven of 1,202 exons were missed and six exons were predicted wrongly. Scipio is specialised to cope with sequencing errors and incompletely assembled genomes. The web interface WebScipio provides direct access to almost all publicly available eukaryotic genome sequences (December 2012: ~3,200 genome files of ~1,000 species).
Alternative splicing is a widespread mechanism to increase the protein inventory. About 95% of the multi-exon genes in humans are spliced alternatively. A new computational method was developed to predict a special type of alternatively spliced exons, mutually exclusive exons (MXEs). In mutually exclusive splicing, exactly one exon of a cluster of neighbouring exons is retained in the mRNA. Because these exons code for the same region in the three-dimensional structure of the protein, they are predicted based on similarity and length constraints as well as compatible splice sites. The new algorithm reconstructed the MXEs in diverse genes, for example in a dynein heavy chain gene of the human parasite Schistosoma mansoni, in the myosin heavy chain gene of the water flea Daphnia magna and in the Dscam genes of several Drosophila species. In addition, all but two of 28 MXEs annotated in the Drosophila melanogaster X chromosome were identified correctly. The algorithm was integrated into the WebScipio interface.
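The three criteria named above (similarity, length constraint, compatible splice sites) can be illustrated with a minimal sketch. It is not the WebScipio implementation: the function name `is_mxe_candidate`, the thresholds `max_len_diff_aa` and `min_similarity`, and the use of `difflib.SequenceMatcher` as a stand-in for a proper alignment score are all assumptions made for illustration. Splice-site compatibility is reduced here to the requirement that both exons have the same length modulo 3, so that exchanging them does not shift the reading frame.

```python
from difflib import SequenceMatcher


def is_mxe_candidate(exon_pep, cand_pep, exon_nt_len, cand_nt_len,
                     max_len_diff_aa=20, min_similarity=0.3):
    """Hypothetical filter for a mutually exclusive exon candidate.

    exon_pep / cand_pep: translated peptide of the known exon / candidate.
    exon_nt_len / cand_nt_len: exon lengths in nucleotides.
    The thresholds are illustrative defaults, not values from the source.
    """
    # Frame compatibility: swapping the exons must keep the reading frame.
    if exon_nt_len % 3 != cand_nt_len % 3:
        return False
    # Length constraint: both exons encode the same structural region.
    if abs(len(exon_pep) - len(cand_pep)) > max_len_diff_aa:
        return False
    # Similarity: crude ratio as a stand-in for an alignment score.
    similarity = SequenceMatcher(None, exon_pep, cand_pep,
                                 autojunk=False).ratio()
    return similarity >= min_similarity


if __name__ == "__main__":
    print(is_mxe_candidate("MKTLLVAGG", "MKTLIVAGG", 27, 27))
    print(is_mxe_candidate("MKTLLVAGG", "MKTLIVAGG", 27, 28))
```

The cheap checks (frame and length) run first so that the similarity comparison is only computed for candidates that could be exchanged without disrupting the transcript.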
The continuous progress of whole-genome sequencing paves the way for genome-wide analyses of gene expression mechanisms like mutually exclusive splicing. The database application Kassiopeia was implemented to provide genome-wide analyses of MXEs in several organisms. It contains the mutually exclusive exomes of human, the fruit fly Drosophila melanogaster, eleven additional Drosophila species, the nematode Caenorhabditis elegans, and the thale cress Arabidopsis thaliana. Further datasets of several species are in preparation. For each cluster of mutually exclusive exons, Kassiopeia provides EST validation data, cross-species support data, protein secondary structure predictions, and RNA secondary structure predictions. All gene annotations are searchable by BLAST and linked to organism-specific databases, like FlyBase. Kassiopeia includes diverse parameters to filter the predicted exon candidates.
A detailed analysis of mutually exclusive splicing in the model organism Drosophila melanogaster is presented. The high-quality gene annotation of FlyBase (release r5.36) was used to evaluate the quality of the prediction method. Of 261 annotated MXEs, 218 could be reconstructed, resulting in a sensitivity of 83.5%. The study reports 44 newly predicted exon candidates, of which five are annotated in the current release of FlyBase (r5.48), eight are supported by RNA-Seq or EST data, and 29 seem to be conserved in related arthropods.
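The sensitivity figure follows directly from the two counts given above, treating the reconstructed MXEs as true positives and the remaining annotated MXEs as false negatives. A minimal sketch of that arithmetic, with the helper name `sensitivity` being illustrative only:

```python
def sensitivity(true_positives, annotated_total):
    """Fraction of annotated items that were recovered: TP / (TP + FN)."""
    if annotated_total <= 0:
        raise ValueError("annotated_total must be positive")
    return true_positives / annotated_total


if __name__ == "__main__":
    # Counts taken from the text: 218 of 261 annotated MXEs reconstructed.
    print(f"{100 * sensitivity(218, 261):.1f}%")  # prints "83.5%"
```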
Another algorithm was implemented that reconstructs tandem gene duplicates. Gene duplications play an important role in the origin of new genes. The algorithm is able to identify putative tandem gene duplicates that are encoded on the forward or reverse strand or that are spread over hundreds of thousands of nucleotides. This algorithm has also been integrated into the WebScipio interface.
Meaningful evolutionary information can be derived from genomic sequences alone. An alignment-free method based on Chaos Game Representations (CGRs) was used to derive phylogenetic trees of the Brassicales clade. Two algorithms, Fitch-Margoliash and neighbour joining, and the bootstrapping method were applied to three different kinds of data: whole genome sequences, expressed sequence tag data and mitochondrial genome sequences. The methods gave reasonable results in comparison to reference trees derived from established alignment methods. The study provides a reference to evaluate further alignment-free approaches.||de