Probabilistic Models to Detect Important Sites in Proteins
by Truong Khanh Linh Dang
Date of Examination: 2020-09-24
Date of issue: 2021-02-23
Advisor: Prof. Dr. Stephan Waack
Referee: Prof. Dr. Stephan Waack
Referee: Prof. Dr. Carsten Damm
Files in this item
Name: Dissertation_WithoutCV.pdf
Size: 7.25 MB
Format: PDF
Description: PhD dissertation
Abstract
Proteins are molecular machines that play a fundamental role in almost every activity of life. Their biological functions are mostly driven by conformational transitions and by interaction interfaces with other biomolecules such as DNA, proteins, and other ligands. In quest of the mechanisms underlying protein function, I conducted two projects: the first explores structural change in proteins by identifying their rigid bodies, and the second devises new sequence-based features for predicting DNA-binding sites in proteins.
Despite many previous efforts to calculate rigid domains in proteins, there is still a strong need for new segmentation algorithms that can process large numbers of proteins efficiently while avoiding protein-dependent parameter tuning, such as specifying the number of rigid domains. I therefore introduce a new rigid domain segmentation method in which a graph whose vertices are amino acids represents multiple conformational states of a protein. This graph is then reduced by coarse graining, for example with the Louvain clustering algorithm. Afterward, the domain-wise relationships among clusters in the reduced graph are inferred through a binary labeling of its edges, which becomes feasible thanks to the line graph transformation and the generalized Viterbi algorithm. Because of the binary labeling, the method does not require the number of rigid domains as an input parameter, unlike other existing methods. I validate this graph-based method on 487 examples from the DynDom database and compare its segmentations with those of other methods on several proteins whose structural changes range from medium to large and whose molecular motions have been studied extensively in the literature. The code and usage instructions are available at https://github.com/dtklinh/GBRDE.
In the second project, I address the identification of DNA-binding sites in proteins, which can be approached through either structure-based or sequence-based methods. Although structure-based methods achieve good results, they require protein 3D structures, which are expensive and time-consuming to determine. Sequence-based methods, in contrast, can be applied efficiently to entire protein databases but demand carefully designed features. I therefore present a new information-theoretic feature derived from the Jensen–Shannon divergence (JSD), which exploits the differences between the amino acid distributions of binding and non-binding sites. For the evaluation, I ran five-fold cross-validation on 263 proteins with a Random Forest (RF) classifier, using features comprising the new sequence-based feature and several popular ones such as position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). The results show that concatenating the new feature with the existing ones significantly improves the performance of the RF classifier in terms of sensitivity and Matthews correlation coefficient (MCC).
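As a rough illustration of the quantity behind the second project, the sketch below computes the standard Jensen–Shannon divergence between two amino acid distributions. It is not the feature construction used in the dissertation, which the abstract does not specify; the function names, the pseudocount smoothing, and the toy residue strings are all assumptions made for this example.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def distribution(residues, pseudocount=1.0):
    """Return smoothed amino acid frequencies for a collection of residues.

    The pseudocount is an assumption of this sketch; it only keeps
    unseen amino acids from having zero probability.
    """
    counts = Counter(residues)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded by [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical toy data: residues observed at binding and non-binding sites.
binding_residues = "KRKRRKSTNKRGRK"
nonbinding_residues = "ALVLIEDAGLVFAE"

p_binding = distribution(binding_residues)
p_nonbinding = distribution(nonbinding_residues)
score = jsd(p_binding, p_nonbinding)
```

A larger value of `score` means the two residue populations have more clearly separated amino acid compositions, which is the kind of signal a sequence-based feature can pass to a classifier.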
Keywords: Protein structural transition; Graph algorithms; Generalized Viterbi algorithm; Jensen–Shannon divergence; Random Forest; DNA-binding sites