Understanding cellular differentiation by modelling of single-cell gene expression data

Papadopoulos, Nikolaos

dc.contributor.advisor	Soeding, Johannes Dr.
dc.contributor.author	Papadopoulos, Nikolaos
dc.date.accessioned	2019-08-16T08:59:14Z
dc.date.available	2019-08-16T08:59:14Z
dc.date.issued	2019-08-16
dc.identifier.uri	http://hdl.handle.net/21.11130/00-1735-0000-0003-C196-9
dc.identifier.uri	http://dx.doi.org/10.53846/goediss-7605
dc.language.iso	eng	de
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.ddc	570	de
dc.title	Understanding cellular differentiation by modelling of single-cell gene expression data	de
dc.type	cumulativeThesis	de
dc.contributor.referee	Soeding, Johannes Dr.
dc.date.examination	2019-08-08
dc.description.abstracteng	Over the course of the last decade, single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, as one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between the single-cell transcriptomes it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and gain access to trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with the high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method, (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories, and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross-validation approach for the evaluation of gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it. Increasingly complex processes like whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that can reconstruct trees in sub-quadratic time by utilizing a local averaging strategy, scaling very well on large datasets. MERLoT compares favorably to the state of the art, both on real data and a large synthetic benchmark.
The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods to each other. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating comparisons between algorithms. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data. Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles into cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of average expression and variance parameters can be obtained by trajectory inference. A way to limit noise in the data is to apply local averaging, using the nearest neighbours of each cell to recover expression of non-captured mRNA. Our proposal, nearest neighbour smoothing with optimal bias-variance trade-off, optimizes the k-nearest neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation. After reconstructing a trajectory with imputed data, each cell can be projected to the trajectory using non-overlapping subsets of genes. The robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance. Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as a collaboration.	de
dc.contributor.coReferee	Schuh, Melina Dr.
dc.subject.eng	single-cell	de
dc.subject.eng	transcriptomics	de
dc.subject.eng	modelling	de
dc.subject.eng	tool	de
dc.subject.eng	computational	de
dc.subject.eng	statistics	de
dc.subject.eng	bayesian	de
dc.subject.eng	heuristic	de
dc.subject.eng	simulation	de
dc.subject.eng	trajectory inference	de
dc.identifier.urn	urn:nbn:de:gbv:7-21.11130/00-1735-0000-0003-C196-9-8
dc.affiliation.institute	Biologische Fakultät für Biologie und Psychologie	de
dc.subject.gokfull	Biologie (PPN619462639)	de
dc.identifier.ppn	167230752X

Files

Name: thesis_nocv.pdf

Size: 28.27 MB

Format: PDF

Description: Dissertation

This document appears in:

Fakultät für Biologie und Psychologie (incl. GAUSS) [1621]