Understanding cellular differentiation by modelling of single-cell gene expression data

dc.contributor.advisor: Soeding, Johannes Dr.
dc.contributor.author: Papadopoulos, Nikolaos
dc.date.accessioned: 2019-08-16T08:59:14Z
dc.date.available: 2019-08-16T08:59:14Z
dc.date.issued: 2019-08-16
dc.identifier.uri: http://hdl.handle.net/21.11130/00-1735-0000-0003-C196-9
dc.identifier.uri: http://dx.doi.org/10.53846/goediss-7605
dc.language.iso: eng [lang: de]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc: 570 [lang: de]
dc.title: Understanding cellular differentiation by modelling of single-cell gene expression data [lang: de]
dc.type: cumulativeThesis [lang: de]
dc.contributor.referee: Soeding, Johannes Dr.
dc.date.examination: 2019-08-08
dc.description.abstracteng: Over the course of the last decade, single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, as one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between single-cell transcriptomes it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and to gain access to trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method; (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories; and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross-validation approach for the evaluation of gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it.

Increasingly complex processes like whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that reconstructs trees in sub-quadratic time by utilizing a local averaging strategy and therefore scales well to large datasets. MERLoT compares favorably to the state of the art, both on real data and on a large synthetic benchmark.

The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods to each other. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating comparisons between algorithms. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data.

Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles into cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of average expression and variance parameters can be obtained by trajectory inference.

One way to limit noise in the data is to apply local averaging, using the nearest neighbours of each cell to recover the expression of non-captured mRNA. Our proposal, nearest neighbour smoothing with optimal bias-variance trade-off, optimizes the k-nearest neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation. After reconstructing a trajectory with imputed data, each cell can be projected onto the trajectory using non-overlapping subsets of genes. The robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance.

Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as part of a collaboration. [lang: de]
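As an illustration of the local averaging idea mentioned in the abstract, the following is a minimal sketch of plain k-nearest-neighbour smoothing in Python. It is not the method proposed in the thesis: the bias-variance weighting that reduces the contribution of inappropriate neighbours is not reproduced here. The function name `knn_smooth`, the list-of-lists data layout, and the use of Euclidean distance on raw counts are all assumptions made for this example.

```python
import math


def knn_smooth(counts, k):
    """Average each cell's profile with its k nearest neighbours.

    counts: list of cells, each a list of per-gene counts
            (hypothetical layout chosen for this sketch).
    k:      number of neighbours to average with, besides the cell itself.
    Returns a new list of smoothed profiles as floats.
    """
    n_cells = len(counts)
    smoothed = []
    for i in range(n_cells):
        # Rank all cells by Euclidean distance to cell i.
        # The second key component puts cell i first when distances tie.
        ranked = sorted(
            range(n_cells),
            key=lambda j: (math.dist(counts[i], counts[j]), j != i),
        )
        # Cell i itself plus its k closest neighbours.
        neighbours = ranked[: k + 1]
        n_genes = len(counts[i])
        # Unweighted mean over the neighbourhood, gene by gene.
        smoothed.append([
            sum(counts[j][g] for j in neighbours) / len(neighbours)
            for g in range(n_genes)
        ])
    return smoothed


# Toy input: two pairs of similar cells, two genes each (made-up numbers).
toy_counts = [[0, 4], [2, 4], [10, 0], [12, 0]]
toy_smoothed = knn_smooth(toy_counts, k=1)
```

With `k=1` each toy cell is averaged with its single closest neighbour, so the zero count in the first cell is pulled towards its neighbour's value. The all-pairs distance computation is quadratic in the number of cells and is kept only for clarity.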
dc.contributor.coReferee: Schuh, Melina Dr.
dc.subject.eng: single-cell [lang: de]
dc.subject.eng: transcriptomics [lang: de]
dc.subject.eng: modelling [lang: de]
dc.subject.eng: tool [lang: de]
dc.subject.eng: computational [lang: de]
dc.subject.eng: statistics [lang: de]
dc.subject.eng: bayesian [lang: de]
dc.subject.eng: heuristic [lang: de]
dc.subject.eng: simulation [lang: de]
dc.subject.eng: trajectory inference [lang: de]
dc.identifier.urn: urn:nbn:de:gbv:7-21.11130/00-1735-0000-0003-C196-9-8
dc.affiliation.institute: Biologische Fakultät für Biologie und Psychologie [lang: de]
dc.subject.gokfull: Biologie (PPN619462639) [lang: de]
dc.identifier.ppn: 167230752X

