Probabilistic Models to Detect Important Sites in Proteins

dc.contributor.advisor: Waack, Stephan Prof. Dr.
dc.contributor.author: Dang, Truong Khanh Linh
dc.date.accessioned: 2021-02-23T12:34:57Z
dc.date.available: 2021-02-23T12:34:57Z
dc.date.issued: 2021-02-23
dc.identifier.uri: http://hdl.handle.net/21.11130/00-1735-0000-0005-1583-F
dc.identifier.uri: http://dx.doi.org/10.53846/goediss-8461
dc.language.iso: eng [de]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc: 510 [de]
dc.title: Probabilistic Models to Detect Important Sites in Proteins [de]
dc.type: doctoralThesis [de]
dc.contributor.referee: Waack, Stephan Prof. Dr.
dc.date.examination: 2020-09-24
dc.description.abstracteng: Proteins are molecular machines that play a part in almost every fundamental process of life. Their biological functions are mostly driven by conformational transitions and by interaction interfaces with other biomolecules such as DNA, proteins and other ligands. To investigate the mechanisms underlying protein function, I conducted two projects: the first explores structural change in proteins by identifying their rigid bodies, and the second devises new sequence-based features to predict DNA-binding sites in proteins. Despite many previous efforts to calculate rigid domains in proteins, new segmentation algorithms are still needed that can segment large numbers of proteins efficiently while avoiding protein-dependent parameter tuning, such as specifying the number of rigid domains. I therefore introduce a new rigid domain segmentation method in which a graph whose vertices are amino acids represents multiple conformational states of a protein. This graph is then reduced by a coarse-graining step such as the Louvain clustering algorithm. Afterward, the domain-wise relationships among clusters in the reduced graph are inferred through a binary labeling of its edges, which becomes feasible thanks to the line graph transformation and the generalized Viterbi algorithm. Because of the binary labeling, our method, unlike other existing methods, does not require the number of rigid domains as an input parameter. I validate our graph-based method on 487 examples from the DynDom database and compare our segmentations with those of other methods on several proteins whose structural changes range from medium to large and whose molecular motions have been studied extensively in the literature. The algorithm code and usage instructions are available at https://github.com/dtklinh/GBRDE.
In the second project, I address the identification of DNA-binding sites in proteins, which can be approached through either structure-based or sequence-based methods. Although structure-based methods obtain good results, they require protein 3D structures, which are expensive and time-consuming to determine. In contrast, sequence-based methods can be applied efficiently to entire protein databases, yet they demand carefully designed features. I therefore present a new information-theoretic feature derived from the Jensen–Shannon divergence (JSD), which exploits the differences between the amino acid distributions of binding and non-binding sites. For the evaluation, I ran a five-fold cross-validation on 263 proteins with a Random Forest (RF) classifier, using features comprising our new sequence-based feature and several popular ones such as the position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). The results show that concatenating our new feature with the others significantly improves the performance of the RF classifier in terms of sensitivity and Matthews correlation coefficient (MCC). [de]
dc.contributor.coReferee: Damm, Carsten Prof. Dr.
dc.subject.eng: Protein structural transition [de]
dc.subject.eng: Graph algorithms [de]
dc.subject.eng: Generalized Viterbi algorithm [de]
dc.subject.eng: Jensen–Shannon divergence [de]
dc.subject.eng: Random Forest [de]
dc.subject.eng: DNA-binding sites [de]
dc.identifier.urn: urn:nbn:de:gbv:7-21.11130/00-1735-0000-0005-1583-F-4
dc.affiliation.institute: Fakultät für Mathematik und Informatik [de]
dc.subject.gokfull: Informatik (PPN619939052) [de]
dc.identifier.ppn: 1749248425
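
Note on the Jensen–Shannon divergence named in the abstract: the record does not say how the thesis turns the JSD into a per-residue feature, so the following is only a minimal sketch of the divergence itself, applied to two amino acid frequency distributions. The function names, the pseudocount, and the toy residue strings are assumptions made for illustration and are not taken from the thesis or from the GBRDE repository.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_distribution(residues, pseudocount=1.0):
    """Frequency distribution over the 20 amino acids.

    The pseudocount is an illustrative assumption; it keeps every
    probability strictly positive for small samples.
    """
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.

    With base-2 logarithms the value lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical toy data: residues seen at binding vs. non-binding sites.
    binding = residue_distribution("KRKRKRGS")
    non_binding = residue_distribution("LLVVAAGS")
    print(jensen_shannon_divergence(binding, non_binding))
```

The divergence is symmetric and is zero only when the two distributions are identical, which is why it suits a comparison between binding and non-binding residue compositions.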

