# Phylogenetic Forest Spaces

by Jonas Lueg

Date of Examination: 2022-10-04

Date of issue: 2023-01-10

Advisor: Prof. Dr. Stephan F. Huckemann

Referee: Prof. Dr. Stephan F. Huckemann

Referee: Dr. Tom M. W. Nye

## Files in this item

Name: Dissertation_Jonas_Lueg_Revision.pdf

Size: 1.07Mb

Format: PDF

Description: Revised dissertation from Jonas Lueg

## Abstract

### English

Given a fixed set of operational taxonomic units (OTUs), a common way of depicting their evolutionary relationships is to construct a phylogenetic tree: a graph-theoretic tree whose root represents the common ancestor (which one assumes to exist). Each leaf represents a unique OTU, and edges have positive real lengths representing evolutionary distance of some kind, which can be time but also other measures. Nowadays, such trees are generated from measured sequence data such as DNA, RNA or protein sequences, and, given various measurements from the same species, one usually obtains many different proposals for phylogenetic trees representing the evolutionary relationships. As one believes that there exists one true phylogenetic tree, one wishes to combine the different proposals into a single phylogenetic tree. For a statistician, the natural thing to do is to average the proposals in some sense in order to obtain a mean that is again a phylogenetic tree and that allows for further statistical methodology, such as confidence regions or testing. However, this requires an appropriate mathematical structure on the set of phylogenetic trees, preferably with a geometry, i.e. at least a metric space. The Billera-Holmes-Vogtmann tree space (BHV space) is an example of such a structure: it is a metric space that is CAT(0) and, furthermore, a Riemann stratified space of type (B), cf. Billera et al. (2001). This space has favorable properties such as completeness and global non-positive curvature, which imply that any two points are connected by a unique geodesic. Furthermore, there is a polynomial-time algorithm to compute exact geodesics in this space, cf. Owen & Provan (2011).
Nonetheless, the space is constructed artificially by embedding it in a high-dimensional Euclidean space and, as a consequence, does not behave as biological understanding would expect a metric space of phylogenetic trees to behave, as discussed for example in Garba et al. (2018). They consider the distance between two trees with different structures in the case that the edge lengths go to infinity. The same example motivates including disconnected forests in the space. A promising construction that covers this example and behaves as biological intuition suggests is our recently introduced Wald Space, cf. Garba et al. (2021a). It is based on the characterization of phylogenetic trees as covariance matrices, which is a byproduct of a generalization of the popular biological substitution models used to calculate likelihoods for trees given genetic sequence data. These substitution models are the backbone of phylogenetic tree estimation. In other words, the Wald Space is consistent with the tree estimation methods currently in use, up to the generalizations that have been made. More details can be found in Garba et al. (2021a). In this work, we consider the Wald Space purely as a mathematical structure and aim to enable a better understanding of it. To this end, we introduce the mathematical structures required: metric spaces, Riemannian manifolds, Riemann stratified spaces, and various geometries on the manifold of symmetric, strictly positive definite real matrices, which are essential for our construction. Furthermore, we introduce various ways to represent the phylogenetic trees and forests that we consider. Having introduced these more general and known concepts, we briefly introduce the BHV space.
Then we define and describe the Wald Space, which is a topological stratified space, and we investigate its topological features. This part is the core of the thesis. Finally, we equip the Wald Space with a geometry that can be chosen to some degree and find that the resulting spaces are Riemann stratified spaces of type (A). Last but not least, we propose some numerical algorithms to calculate geodesics and distances in the Wald Space equipped with a geometry. These are not tested in this work, but are tested to some extent in Lueg et al. (2021).

**Keywords:** phylogenetics; Riemannian manifold; metric space; geometry; Fréchet mean; phylogenetic tree; stratified space