Big Data Infrastructure for Analysing Digitalized Library Collections
Doctoral thesis
Date of Examination: 2025-07-09
Date of issue: 2025-07-31
Advisor: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Ramin Yahyapour
Referee: Prof. Dr. Bela Gipp
Files in this item
Name: triet-doan-phd-thesis.pdf
Size: 13.7 MB
Format: PDF
Abstract
English
Digital Humanities (DH) is an interdisciplinary field at the intersection of digital technologies and the humanities. It covers a variety of projects, including digital archives, cultural analytics, and online publishing. This thesis focuses on text analysis in DH and addresses two key challenges: data acquisition and large-scale text analysis. The first challenge arises because historical texts are difficult to locate. The second arises because analyzing a large amount of text is not straightforward for DH scientists. Based on interviews with numerous DH scientists and discussions with relevant stakeholders, a list of functional and non-functional requirements was compiled, and the services available on the market were evaluated against it. None of the evaluated services meets these requirements. A new service, MINE, was therefore developed to address both challenges. MINE offers a search engine that lets users locate historical texts from a range of data sources. Users can build corpora from the search results or from uploaded files and then instruct the system to analyze these corpora with the selected text analysis models and parameters. The analyses run on a high-performance computing (HPC) cluster, which allows scientists to perform much larger analyses than would be possible on their personal desktops or laptops. Although MINE is still a prototype, the majority of the defined requirements have already been met. For features that are still under development or discussion, a comprehensive plan for their future implementation is available.
Keywords: digital humanities; search engine; hpc; knowledge graph
