Information theoretical approaches for the identification of potentially cooperating transcription factors
by Cornelia Meckbach
Date of Examination: 2019-06-21
Date of issue: 2019-08-08
Advisor: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Edgar Wingender
Referee: Prof. Dr. Stephan Waack
Referee: Prof. Dr. Ralf Hofestädt
Abstract
Transcription factors (TFs) are a special class of proteins that usually bind regulatory DNA regions such as promoters and enhancers in order to control the expression of their target genes. It is well known that in higher organisms the combinatorial interplay between TFs is crucial for flexible and precise gene regulation. This cooperation is highly diverse: it can take place between TFs bound to the same DNA region (intra-regional TF cooperation) or between TFs bound to different DNA regions, i.e. enhancer and promoter regions (inter-regional TF cooperation). The computational identification of these TF cooperations is still a challenging problem in bioinformatics and can be addressed by using predicted transcription factor binding sites (TFBSs) as the basis of the analysis. In this thesis, I present two information theoretical approaches for the identification of cooperating TFs based on their TFBS distributions in regulatory DNA regions. My first approach identifies potentially intra-regional cooperating TFs based on the co-occurrence of their binding sites. To this end, I adapted pointwise mutual information from the field of linguistics to bioinformatics and used it to identify co-occurring TFBSs. I consider the genome as a document, the sequences under study as sentences and the predicted TFBSs as words in these sentences. I successfully applied this approach to a simulated data set and to biological data sets, and compared it with existing methods. Although the results reveal that my approach properly identifies known and novel TF cooperations, it does not distinguish between sequence-set specific pairs and pairs of common, general importance.
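The document/sentence/word analogy can be illustrated with a minimal sketch. It assumes that each sequence is reduced to the set of TFs with at least one predicted binding site, and that probabilities are estimated as simple fractions of sequences; the function name `pmi_cooccurrence` and the toy input are hypothetical, and the estimation used in the thesis may differ.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_cooccurrence(sequences):
    """Pointwise mutual information for co-occurring TFBS pairs.

    `sequences` is a list of iterables; each holds the TF names with a
    predicted binding site in one sequence (a "sentence" of "words").
    Assumption: probabilities are fractions of sequences, not site counts.
    """
    n = len(sequences)
    single = Counter()  # number of sequences containing each TF
    pair = Counter()    # number of sequences containing both TFs of a pair
    for tfs in sequences:
        present = sorted(set(tfs))  # count each TF once per sequence
        single.update(present)
        pair.update(combinations(present, 2))

    scores = {}
    for (a, b), n_ab in pair.items():
        p_ab = n_ab / n
        p_a = single[a] / n
        p_b = single[b] / n
        # PMI = log2( p(a, b) / (p(a) * p(b)) ); > 0 means the pair
        # co-occurs more often than expected under independence.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Toy input (hypothetical, not data from the thesis).
toy_sequences = [
    ["SP1", "NFYA"],
    ["SP1", "NFYA"],
    ["SP1"],
    ["GATA1"],
]
scores = pmi_cooccurrence(toy_sequences)
```

Pairs that never co-occur in any sequence do not appear in the result, because their PMI would be the logarithm of zero. Keys are ordered alphabetically, so the pair above is stored as `("NFYA", "SP1")`.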
To address the missing distinction between sequence-set specific and common pairs, I extended my method: I created background sequence sets to estimate the background co-occurrence of each TFBS pair, incorporated this estimate into the calculation and classified the significant pairs as either sequence-set specific or common. When this extended version is applied to several gene sets, the overlap between the sequence-set specific pairs is considerably smaller than with the original version. To complement my first method, I established a second approach for the determination of inter-regional TF associations that might be involved in the interaction between promoter and enhancer regions. This approach is based on the sequences of known promoter-enhancer interactions and estimates the association between the TFBS distributions of different DNA regions using multivariate mutual information (MMI). For this, I created background sequence sets that preserve the (oligo-)nucleotide composition and incorporated them directly into the MMI computation as a third random variable. Within this approach, I compared the performance of four different mutual information quantities. Finally, I demonstrated the performance of this approach by successfully applying it to simulated and biological data sets and by comparing it with an existing method.
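For orientation, one common textbook form of a three-variable mutual information quantity is written out below. The abstract does not state which four quantities were compared or which sign convention was used, so this is only an illustrative sketch. The assignment of symbols is an assumption: $X$ and $Y$ stand for the TFBS occurrences in the two interacting regions, and $Z$ indicates whether a sequence pair comes from the real or the background set.

```latex
% Mutual information between X and Y
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

% Conditional mutual information given the third variable Z
I(X;Y \mid Z) = \sum_{z} p(z) \sum_{x,y} p(x,y \mid z)
                \log \frac{p(x,y \mid z)}{p(x \mid z)\,p(y \mid z)}

% One convention for multivariate mutual information (co-information);
% other authors use the opposite sign
I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z)
```

Unlike the pairwise mutual information, this three-variable quantity can be negative, which is one reason why the choice of quantity and sign convention matters when results are interpreted.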
Keywords: transcription factor; transcription factor cooperation; information theory; mutual information; gene expression regulation