A Bioinformatics Pipeline for Identifying Dysregulated Pathways in Cancer from Comparative RNA-Seq Transcriptome Analysis
by Darius Wlochowitz
Date of oral examination: 2022-03-29
Published: 2022-04-07
Supervisor: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Stephan Waack
Files
Name:Darius_Wlochowitz_Thesis.pdf
Size:66.2Mb
Format:PDF
Description:dwl_Main
Abstract
English
Cancer is a multifactorial disease that undergoes genetic and epigenetic changes during invasive tumor growth. Thus, numerous tumor samples have been profiled using high-throughput technologies such as microarrays and RNA sequencing (RNA-Seq) to obtain their transcriptomes. However, disentangling such high-dimensional data to identify dysregulated signaling pathways remains a difficult task. To close this gap, bioinformatics pipelines are needed to uncover gene misregulation by establishing causal regulatory links between transcription factors (TFs) and their target genes. TFs are proteins that control gene expression by recognizing short motifs called transcription factor binding sites (TFBSs) in DNA regulatory regions such as promoters, enhancers, and silencers. To this end, the goal of this thesis was to establish and evaluate a bioinformatics pipeline for comparing phenotypes based on RNA-Seq. The individual workflows of the pipeline comprise methods in RNA-Seq data analysis, promoter analysis, comprehensive functional categorization, and master regulator analysis (MRA), thereby identifying differentially expressed genes (DEGs), TFs, biological processes, and master regulators (MRs). For promoter analysis, a discriminative motif discovery approach using the Boruta feature selection algorithm is proposed, which distinguishes two DEG promoter sequence datasets based on TFBS patterns. In addition, a gene clustering approach is proposed using the Jensen-Shannon divergence (JSD), principal component analysis (PCA), and the k-means algorithm, which groups DEG promoters based on TFBS patterns related to the discriminative motifs. The gene clusters obtained are subjected to Gene Ontology (GO) functional categorization and MRA. The utility of the pipeline was demonstrated using three heterogeneous gene expression studies that are characterized by distinct signaling pathway activity in cancer.
In the course of promoter analysis, the results indicated that Boruta’s ranking-based importance scores can be used to identify biologically relevant TFs. Furthermore, the results indicated clearly separated gene clusters characterized by uniquely significant GO terms and MRs. In conclusion, the pipeline provides a useful bioinformatics framework for the comparative study of phenotypes based on RNA-Seq to reveal variations in transcriptional regulation and pathway repertoire.
Keywords: transcriptional regulation; promoter analysis; transcription factor; master regulator; differential expression analysis
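
Illustrative sketch (not taken from the thesis): the abstract names the Jensen-Shannon divergence (JSD) as the measure used to compare TFBS patterns of DEG promoters before PCA and k-means clustering. The Python snippet below shows only the JSD step, under several assumptions: each promoter is represented by a vector of TFBS hit counts per motif, the logarithm is taken to base 2 (so values lie between 0 and 1), and all function names, gene names, and counts are hypothetical. The thesis may represent promoters and compute the divergence differently.

```python
import math


def normalize(counts):
    """Turn raw TFBS hit counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("profile has no TFBS hits")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(counts_a, counts_b):
    """JSD between two count profiles, using the mixture m = (p + q) / 2."""
    p = normalize(counts_a)
    q = normalize(counts_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_jsd(profiles):
    """Return a dict mapping (name_a, name_b) to the JSD of the two profiles."""
    names = list(profiles)
    return {
        (a, b): jensen_shannon_divergence(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Hypothetical TFBS hit counts for three promoters over three motifs.
profiles = {
    "gene_a": [4, 0, 0],
    "gene_b": [0, 3, 0],
    "gene_c": [8, 0, 0],
}
distances = pairwise_jsd(profiles)
```

In this toy example, `gene_a` and `gene_c` have the same relative motif usage and therefore a divergence of 0, while `gene_a` and `gene_b` share no motifs and reach the maximum of 1. A pairwise matrix of this kind could then be passed to a dimensionality reduction and clustering step such as PCA followed by k-means, which is not shown here.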