Classifiers for Discrimination of Significant Protein Residues and Protein-Protein Interaction Using Concepts of Information Theory and Machine Learning

Asper, Roman Yorick

Klassifikatoren zur Unterscheidung von Signifikanten Protein Residuen und Protein-Protein Interaktion unter Verwendung von Informationstheorie und maschinellem Lernen

by Roman Yorick Asper

Dissertation

Date of oral examination: 2011-10-26

Published: 2012-01-18

Advisor: Prof. Dr. Stephan Waack

Referee: Prof. Dr. Stephan Waack

Referee: Prof. Dr. Carsten Damm

To link/cite: http://dx.doi.org/10.53846/goediss-2515

Files

Name: asper.pdf

Size: 9.28 MB

Format: PDF

Description: Dissertation

Abstract

English

Protein analysis is a major research area in bioinformatics. Predicting important sites in proteins receives particular attention because it can reduce the cost and time of experimental protein analysis. Following our success with theoretical approaches for detecting horizontal gene transfer, we decided to apply a similar approach to the problem of predicting important residues in a protein chain. An efficient predictor requires classifiers that separate the important protein residues from the rest of the protein chain. Developing and refining two such classifiers is the topic of this thesis. The first classifier is based on information theory and uses the concepts of entropy and mutual information to rate protein residues. We use multiple sequence alignments to calculate the entropy of a residue pair and its mutual information. The mutual information is an indicator of the correlation between the two residues and thus an indicator of coevolution. By statistical means, we identify residues that have significant entropy values with respect to coevolution. Using a threshold, the top-rated residues are classified as important sites of the protein. This classifier is very successful in detecting single nucleotide polymorphisms. The second classifier is based on the distribution of amino acids in a protein and focuses on detecting protein interfaces using concepts from machine learning. Based on existing data, we analyze the neighborhood of known interface residues and use a machine learning algorithm to create a hypothesis. This hypothesis is then used to predict interface residues on a selected protein chain. This classifier has very good accuracy, and its focus can easily be adjusted to fit different approaches to protein analysis. Together, the two classifiers offer a good basis for predicting important protein sites and show promising results in experiments. Because of the theoretical concepts involved, they can also be easily adapted for other analytical purposes.
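To make the information-theoretic idea concrete, the following is a minimal sketch of how the mutual information of two alignment columns can be computed as I(X;Y) = H(X) + H(Y) - H(X,Y). It is not the method of the thesis: the function names, the toy alignment, and the use of plain relative frequencies are assumptions made for illustration, and the sketch omits gap handling, the statistical significance test, and the thresholding step mentioned above.

```python
import math
from collections import Counter


def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(msa, i, j):
    """Mutual information of columns i and j: H(i) + H(j) - H(i, j)."""
    col_i = column(msa, i)
    col_j = column(msa, j)
    pairs = list(zip(col_i, col_j))  # joint observations per sequence
    return entropy(col_i) + entropy(col_j) - entropy(pairs)


# Hypothetical toy alignment: columns 0 and 1 vary together,
# column 2 varies independently of column 0.
toy_msa = ["ACD", "ACE", "GTD", "GTE"]
mi_01 = mutual_information(toy_msa, 0, 1)
mi_02 = mutual_information(toy_msa, 0, 2)
```

In this toy alignment, the perfectly covarying pair of columns (0, 1) receives a higher score than the independent pair (0, 2), which is the kind of signal that is read as a hint of coevolution.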

Keywords: Protein; Protein-Protein-Interaction; Machine Learning; Information Theory; Coevolution; Classifier; Patch

Other languages

Protein analysis is one of the main research fields of bioinformatics. A particular focus lies on predicting important protein sites, with the aim of reducing the time and material cost of experimental methods of protein analysis. Because of our success in applying theoretical methods to the identification of horizontal gene transfer, we decided to apply a similar approach to the problem of predicting important sites in a protein chain. Building an efficient prediction tool requires classifiers that identify the important protein residues. Developing and improving two such classifiers is the focus of this thesis. The first classifier is based on concepts from information theory and uses entropy and mutual information to rate protein residues. We use multiple sequence alignments to compute the entropy of a residue pair and its mutual information. The latter is an indicator of the dependence between the two residues and thus also an indication of coevolution. Using statistical methods, we determine the residues that show significant entropy values with respect to coevolution. The significant residues are then rated, and the best residues are selected by means of a threshold. In the next step, these residues are classified as important positions of the protein. The resulting classifier has a very good success rate in detecting single nucleotide polymorphisms. The second classifier is based on the distribution of amino acids in proteins and was developed specifically to detect protein interfaces with the help of machine learning. We use a machine learning algorithm to analyze the neighborhood of known interface residues and to create a hypothesis. This hypothesis can then be used to predict interface residues on arbitrary protein chains. This classifier has very good accuracy, and its focus can be adjusted very easily, which allows for different variants of protein analysis. Both classifiers offer a good basis for predicting important protein positions and have shown very good results in experiments. Owing to the underlying theoretical concepts, they can easily be made usable for other analytical tasks.

Keywords: Protein-Protein Interaction; Machine Learning; Information Theory; Coevolution; Classifier
