On Combinatorial Properties of Subsequences
Doctoral thesis
Date of Examination:2024-05-30
Date of issue:2024-08-22
Advisor:Prof. Dr. Florin Manea
Referee:Prof. Dr. Florin Manea
Referee:Prof. Dr. Carsten Damm
Referee:PD Dr. Henning Bordihn
Files in this item
Name:Koss - Dissertation.pdf
Size:1.63Mb
Format:PDF
Abstract
English
A word, also known as string, is a sequence of letters, while a subsequence of a word w is another word u which can be obtained by deleting some of the letters of w. Words in general, and subsequences in particular, play an important in role in many fields of theoretical and applied computer science, being the primary way to represent information: files are strings of bits, while DNA is represented as sequences of nucleotides, that is, letters A,C,G and T. Subsequences can be seen as lossy-representations of a word: they are natural mathematical model for situations where one has to deal with strings with missing symbols, such as it is quite often the case when processing DNA data, when dealing with digital signals communicated through lossy channels, or when dealing with corrupted data files. Natural problems when dealing with subsequences are either related to matching (deciding whether a longer word contains a shorter one as subsequence) and analysis (e.g., identifying the absent subsequences of a given length from a word, or checking whether two words have the same set of subsequences). In this thesis, in Chapters 3 and 4, we first focus on the set of strings which appear as subsequences in given words; and on the set of strings which do not occur in a given word as subsequences, that is, are absent. We provide efficient algorithms to determine whether two words have exactly the same set of subsequences (up to a predefined length k) and compact data structures to store shortest and minimal subsequences not occurring in a word. Further, in Chapter 5, we extend the classic setting to subsequences of formal languages, e.g., regular languages given by a finite automaton. We investigate the lower and upper bounds for the universality index of words occurring in a language and provide a polynomial time algorithm computing the former and, as well, show NP-completeness for the latter (in case this upper bound is finite). In a different line of research, in Chapter 6, we impose constraints on the factors of w bounded by the first and last letter of a subsequence u and, respectively, on the factors occurring between two consecutive letters of u. in this setting, we present optimal algorithms to determine whether u occurs as subsequence or, respectively, whether it is a minimal absent subsequence. Furthermore, we show NP-completeness for the problem to determine whether all words of length k occur as subsequence or to determine whether a given string u is a shortest absent subsequence, which is in stark constrast to the case of unconstrained subsequences presented in the previous chapters. Likewise, in Chapter 7, we investigate the longest common subsequence problem (LCS) in the framework of Chapter 6. Further, we impose length constraints, called gap constraints, on the factors occurring between the embedding of the letters of the subsequences in the the two input words. We define several natural possibilities to define a sequence of gap constraints and investigate the computation of the LCS for each of these possibilities. In Chapter 8, we introduce Soliton Automata and give combinatorial insights leading to sufficient conditions for a Soliton Automaton to be strongly connected and, respectively, complete, answering, in this way, an open question from the literature.
Keywords: Theoretic Computer Science; Combinatorics on Words; Fine Grained Complexity; String Algorithms; Matching & Analysis Problems; Subsequences