Entity-Centric Text Mining for Historical Documents
by Maria Coll Ardanuy
Date of Examination: 2017-07-07
Date of issue: 2017-11-14
Advisor: Prof. Dr. Caroline Sporleder
Referee: Prof. Dr. Caroline Sporleder
Referee: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Ulrich Heid
Referee: Prof. Dr. Dieter Hogrefe
Referee: Prof. Dr. Gerhard Lauer
Referee: Prof. Dr. Wolfgang May
Abstract
Recent years have seen a marked increase in digitization projects in the cultural heritage domain. As a result, growing efforts have been directed towards the study of natural language processing technologies that support research in the humanities. This thesis is a contribution to the study and development of new text mining strategies that allow a better exploration of contemporary history collections from an entity-centric perspective. In particular, it focuses on the challenging problems of disambiguating two specific kinds of named entities: toponyms and person names. They are approached as two clearly differentiated tasks, each exploiting the inherent characteristics associated with its kind of named entity.
Finding the correct referent of a toponym is a challenging task, and the difficulty is even more pronounced in the historical domain, where it is not uncommon for places to change their names over time. The method proposed in this thesis to disambiguate toponyms, GeoSem, is especially suited to collections of historical texts. It is a weakly supervised model that combines the strengths of toponym resolution and entity linking approaches by exploiting both geographic and semantic features. To do so, the method makes use of a knowledge base built from Wikipedia and complemented with additional knowledge from GeoNames. The method has been tested on a historical toponym resolution benchmark dataset in English, where it improved on the state of the art. Furthermore, five datasets of historical news in German and Dutch have been created from scratch and annotated. On these datasets, the method performs significantly better than two out-of-the-box state-of-the-art entity linking methods when only locations are considered for evaluation.
Person names are likewise highly ambiguous. This thesis introduces a novel method for disambiguating person names in news articles. The method, SNcomp, exploits the relation between the ambiguity of a person name and the number of entities it refers to. The task is modeled as a clustering problem in which the number of target entities is unknown, and the method dynamically adapts its clustering strategy to the most suitable configuration for each person name, depending on how common the name is. SNcomp has a strong focus on social relations and returns sets of automatically created social networks of disambiguated person entities extracted from the texts. The performance of the method has been tested on three person name disambiguation benchmark datasets in two different languages and is on par with the state of the art reported for one of the datasets, while using less specific resources.
This thesis contributes to the fields of natural language processing and digital humanities. Information about entities and their relations is often crucial for historical research. Both methods introduced in this thesis have been designed and developed with the goal of assisting historians in delving into large collections of unstructured text and exploring them through the locations and people mentioned in them.
Keywords: digital humanities; text mining; toponym disambiguation; person name disambiguation; historical text mining