  • Natural Sciences, Mathematics and Computer Science
  • Faculty of Mathematics and Computer Science (incl. GAUSS)

Entity-Centric Text Mining for Historical Documents

by Maria Coll Ardanuy
Doctoral thesis
Date of oral examination: 2017-07-07
Published: 2017-11-14
Supervisor: Prof. Dr. Caroline Sporleder
Referee: Prof. Dr. Caroline Sporleder
Referee: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Ulrich Heid
Referee: Prof. Dr. Dieter Hogrefe
Referee: Prof. Dr. Gerhard Lauer
Referee: Prof. Dr. Wolfgang May
To link to or cite this work: http://dx.doi.org/10.53846/goediss-6563

Files

Name: thesis web.pdf
Size: 5.69 MB
Format: PDF

License terms:


Abstract

English

Recent years have seen a marked increase in digitization projects in the cultural heritage domain. As a result, growing efforts have been directed towards the study of natural language processing technologies that support research in the humanities. This thesis contributes to the study and development of new text mining strategies that allow a better exploration of contemporary history collections from an entity-centric perspective. In particular, it focuses on the challenging problems of disambiguating two specific kinds of named entities: toponyms and person names. They are approached as two clearly differentiated tasks, each of which exploits the inherent characteristics associated with the respective kind of named entity.

Finding the correct referent of a toponym is a challenging task, and this difficulty is even more pronounced in the historical domain, as it is not uncommon for places to change their names over time. The method proposed in this thesis to disambiguate toponyms, GeoSem, is especially suited to collections of historical texts. It is a weakly supervised model that combines the strengths of both toponym resolution and entity linking approaches by exploiting both geographic and semantic features. To do so, the method makes use of a knowledge base built from Wikipedia and complemented with additional knowledge from GeoNames. The method has been tested on a historical toponym resolution benchmark dataset in English, where it improved on the state of the art. Furthermore, five datasets of historical news in German and Dutch have been created from scratch and annotated. On these datasets, the proposed method performs significantly better than two out-of-the-box state-of-the-art entity linking methods when only locations are considered for evaluation.

Person names are likewise highly ambiguous. This thesis introduces a novel method for disambiguating person names in news articles. The method, SNcomp, exploits the relation between the ambiguity of a person name and the number of entities it refers to. The task is modeled as a clustering problem in which the number of target entities is unknown, and the method dynamically adapts its clustering strategy to the most suitable configuration for each person name depending on how common the name is. SNcomp has a strong focus on social relations and returns sets of automatically created social networks of disambiguated person entities extracted from the texts. The performance of the method has been tested on three person name disambiguation benchmark datasets in two different languages and is on par with the state of the art reported for one of the datasets, while using less specific resources.

This thesis contributes to the fields of natural language processing and digital humanities. Information about entities and their relations is often crucial for historical research. Both methods introduced in this thesis have been designed and developed with the goal of assisting historians in delving into large collections of unstructured text and exploring them through the locations and the people mentioned in them.
Keywords: digital humanities; text mining; toponym disambiguation; person name disambiguation; historical text mining
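
The abstract describes SNcomp only at a high level. As a rough illustration of the general idea of adapting a clustering strategy to how common a person name is, the following minimal Python sketch clusters mentions of one name by the people they co-occur with. It is not the implementation from the thesis: the Jaccard similarity, the greedy single-link clustering, the threshold formula, the function names, and the sample data are all assumptions made for this example.

```python
from typing import FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap between two sets of co-occurring person names."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def threshold_for(name_commonness: float) -> float:
    """Hypothetical rule: the more common a name (0.0 = rare, 1.0 = very
    common), the more overlap is required before two mentions are merged."""
    return 0.1 + 0.4 * name_commonness


def cluster_mentions(
    mentions: List[FrozenSet[str]], name_commonness: float
) -> List[List[int]]:
    """Greedy single-link clustering with an unknown number of clusters.

    Each mention is represented by the set of other people named in the same
    article. Returns clusters as lists of mention indices.
    """
    threshold = threshold_for(name_commonness)
    clusters: List[List[int]] = []
    for idx, context in enumerate(mentions):
        best_cluster = None
        best_sim = 0.0
        for cluster in clusters:
            # Single link: compare with the most similar member of the cluster.
            sim = max(jaccard(context, mentions[j]) for j in cluster)
            if sim > best_sim:
                best_cluster, best_sim = cluster, sim
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(idx)
        else:
            clusters.append([idx])  # start a new entity
    return clusters


# Invented sample data: two articles mentioning the same ambiguous name,
# sharing one co-occurring person out of five (Jaccard similarity 0.2).
mentions = [
    frozenset({"Anna Berg", "Karl Vogt", "Lena Roth"}),
    frozenset({"Anna Berg", "Otto Kern", "Paul Lang"}),
]

print(cluster_mentions(mentions, name_commonness=0.0))  # [[0, 1]]
print(cluster_mentions(mentions, name_commonness=1.0))  # [[0], [1]]
```

With the same evidence, the sketch merges the two mentions when the name is treated as rare and keeps them apart when it is treated as very common, which is one simple way to tie the clustering configuration to name ambiguity.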
 
