Modification Analysis in Historical Paraphrastical Parallel Text
An Empirical Work on Stable and Changing Elements in Historical Text Reuse
by Maria Berger
Date of oral examination: 2019-05-02
Published: 2019-10-08
Supervisor: Dr. Marco Büchler
Reviewer: Dr. Marco Büchler
Reviewer: Prof. Dr. Caroline Sporleder
Files
Name: thesis1.pdf
Size: 1.79Mb
Format: PDF
Abstract
English
Clarifying the genesis of a passed-down text is of utmost importance for many scholarly disciplines within the humanities, such as history, literary studies, and Bible studies. The computational detection of such passed-down texts in the form of historical text reuse, including citations, quotations, allusions, unintended reuse of a saying, or even cross-linguistic reuse in the form of translations, has many applications. It can help trace historical content (that is, lines of transmission), which is essential to the field of textual criticism. In modern literature, it can help assign texts to authors. In the context of massive digitization projects, it can identify relationships between text excerpts referring to the same source. Specifically, detecting copies of the same historical text that have diverged over time is an important task. While detecting reuse in contemporary languages is well understood, given the existence of extensive research, techniques, and corpora, automatically detecting historical text reuse is much more difficult. Corpora of historical languages often encompass various genres, linguistic varieties, and topics. In fact, the automated detection of historical text reuse is much less understood, requiring empirical work to improve its automation. In particular, analyzing text reuse with quantitative methods is crucial to understanding reuse in detail. This work presents a technique for describing text reuse modification on a fine-grained level and collects empirical data based on the application of the technique to several datasets and use cases. In detail, this work presents a linguistic analysis of text reuse in two medieval datasets. In a more comprehensive analysis, it investigates modifications in a monolingual parallel corpus of English Bible translations and a parallel corpus of German Bible translations.
We design and implement an automated technique to analyze how a source text is modified compared to its reuse/parallel version, taking linguistic resources into account to understand how they help characterize the transformation. Specifically, we design an operation set comprising operations based on morphological cognates and lexicon-based operations based on semantic relations, in order to find a mapping between a source text and its reused/parallel version. We apply this operation set on top of a statistical alignment output to learn how precisely and to what extent text is modified. The work is complemented by a manual analysis of subsets of the medieval reuse datasets and a manual evaluation of the alignment precision on subsets of the English Bible corpus. The results show a lack of resources for ancient texts, while lexical databases for modern languages are widely available and can partially enhance the technique presented in this work. However, especially for sufficiently preprocessed historical English text, linguistic resources can effectively support understanding the paraphrastical text reuse modification process. These results can support practitioners and researchers working on detecting historical reuse.
Keywords: modification analysis; historical language; paraphrastical text; text reuse; non-literal text reuse; synset databases
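Illustrative sketch (not taken from the thesis): the abstract describes classifying how each aligned source token is modified in its reused version, using morphological and lexicon-based operations. The short Python example below shows one way such a classification could look. The operation names (`identical`, `case_change`, `same_lemma`, `synonym`, `unclassified`), the helper functions, and the tiny lemma and synonym tables are all hypothetical and stand in for real linguistic resources and for the output of a statistical aligner.

```python
from collections import Counter

# Toy linguistic resources, invented for illustration only.
# A real setup would draw on lemmatizers and synset databases.
LEMMAS = {"saith": "say", "said": "say", "says": "say",
          "hath": "have", "has": "have"}
SYNONYMS = {frozenset({"lord", "master"}), frozenset({"begat", "fathered"})}


def lemma(token):
    """Return the toy lemma of a token, falling back to its lowercase form."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def classify(source, target):
    """Assign one hypothetical operation label to an aligned token pair."""
    if source == target:
        return "identical"
    if source.lower() == target.lower():
        return "case_change"
    if lemma(source) == lemma(target):
        return "same_lemma"      # morphological relation
    if frozenset({lemma(source), lemma(target)}) in SYNONYMS:
        return "synonym"         # lexicon-based semantic relation
    return "unclassified"        # no resource explains the change


def profile(aligned_pairs):
    """Count how often each operation occurs over the aligned pairs."""
    return Counter(classify(s, t) for s, t in aligned_pairs)


# Hypothetical alignment output: (source token, reuse token) pairs.
pairs = [("The", "the"), ("LORD", "Master"), ("saith", "says"),
         ("unto", "to"), ("me", "me")]
counts = profile(pairs)
for operation, n in sorted(counts.items()):
    print(operation, n)
```

The checks run from the most literal operation to the most abstract, so each pair receives the simplest label that explains it. Pairs left as `unclassified` mark the places where the available resources do not cover the change.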