dc.description.abstracteng | Transcription factors control an essential step of gene expression by recognizing
over-represented binding sites (or motifs) on the genome. Accurately predicting these binding
sites on the genome is a crucial task for understanding the regulatory mechanisms.
This thesis approaches this task in three parts.
In the first part, I introduce a tool, BaMMmotif2, that I have developed to identify motifs
de novo from DNA sequencing data. Compared to the existing position weight matrix
(PWM)-based motif discovery tools, the higher-order Bayesian Markov models (BaMMs)
have the advantages of learning the interdependence of the nucleotides for transcription factor
binding while being fast and having high predictive accuracy. The core idea of BaMMs is that
the higher-order probability is learned by combining the k-mer counts with the probability
from the model one order lower, with a pseudo-factor α tuning the weight between the two. I
optimize a position- and order-specific pseudo-factor α for higher-order BaMMs. I also
introduce a method to learn the positional preferences of the transcription factors. In
addition, I apply a masking step to the input sequences so that the model is trained only on
the most relevant positions, which helps to distinguish weak motifs when multiple binding
motifs are present in the data.
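The interpolation between k-mer counts and the lower-order probability can be illustrated with a minimal Python sketch. It is not the BaMMmotif2 implementation: the function names, the exact formula (count plus α times the lower-order probability, divided by the context count plus α), the uniform prior at order 0, and the use of a single α instead of a position- and order-specific one are all assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, max_order):
    """Count all k-mers of length 1 .. max_order + 1 (hypothetical helper)."""
    counts = Counter()
    for s in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def interpolated_prob(counts, context, base, alpha):
    """P(base | context), shrunk towards the model one order lower.

    Assumed form: (n(context+base) + alpha * p_lower) / (n(context) + alpha).
    A single alpha is used here; the thesis optimizes position- and
    order-specific values.
    """
    if context == "":
        total = sum(counts[b] for b in ALPHABET)
        # Order 0: counts smoothed towards a uniform distribution (assumption).
        return (counts[base] + alpha / len(ALPHABET)) / (total + alpha)
    # Probability from the model one order lower (drop the oldest context base).
    lower = interpolated_prob(counts, context[1:], base, alpha)
    context_total = sum(counts[context + b] for b in ALPHABET)
    return (counts[context + base] + alpha * lower) / (context_total + alpha)


seqs = ["ACGTACGT", "ACGTTT"]
counts = count_kmers(seqs, max_order=2)
p = {b: interpolated_prob(counts, "AC", b, alpha=2.0) for b in ALPHABET}
```

With many observations of a context, the counts dominate; for a context that was never observed, the estimate falls back entirely to the lower-order probability, which is what keeps higher-order models from overfitting sparse k-mers.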
In the second part, I introduce a new motif performance score, the average recall (AvRec
score), to give users guidance on evaluating motif quality. In addition, to benchmark existing
motif detection tools, I develop a validation scheme comprising (I) N-fold cross-validation,
(II) cross-platform validation, and (III) cross-cell-line validation. In 5-fold
cross-validation, BaMMmotif2 outperforms the selected state-of-the-art tools in this field,
with median increases in the AvRec score of at least 13.6% and 12.2% on in vivo and in
vitro data, respectively. In the cross-cell-line validation on 238 datasets, BaMMmotif2
achieves a median increase of more than 11% in the AvRec score. BaMMs also perform best in
the cross-platform validation on 16 datasets. By applying the BaMM for the CTCF motif to scan
the whole human genome, I discover 1.5 million CTCF binding sites with high accuracy.
This result could lead to a better understanding of the genome 3D structure and its biological
functions.
In the third part, we offer the community an interactive web server with the tool and
database: bammmotif.soedinglab.org. It provides four main functionalities: (I) discovering
motifs de novo from DNA/RNA sequences, (II) finding motif occurrences given a sequence
and a motif model, (III) searching the database for known motifs similar to a novel
motif model, and (IV) offering databases of higher-order BaMMs for different organisms. | de |