Mathematical Entity Linking Methods and Applications
Doctoral thesis
Date of Examination: 2024-06-28
Date of issue: 2024-08-15
Advisor: Prof. Dr. Bela Gipp
Referee: Prof. Dr. Bela Gipp
Referee: Dr. Martin Klein
Referee: Prof. Dr. Hans Friedrich Witschel
Files in this item
Name: Dissertation_Philipp_Scharpf.pdf
Size: 4.77 MB
Format: PDF
Description: Dissertation
This file will be freely accessible after 2026-06-27.
Abstract
Entity Linking (EL) is a vital component of Information Retrieval (IR) systems that extract abstract representations from text documents. As such, EL plays a crucial role in applications such as semantic search, Recommender Systems, Question Answering, document classification, plagiarism detection, and conversational systems (chatbots). To date, many EL approaches have been developed in academia and industry, but they are designed only for natural language text. Documents from Science, Technology, Engineering, and Mathematics (STEM) disciplines typically contain many mathematical expressions (formulas and identifiers, e.g., variables or constants) alongside text. Mathematical Information Retrieval (MathIR) systems, such as Mathematical Question Answering (MathQA), require classical EL approaches to be generalized to Mathematical Entity Linking (MathEL), which maps mathematical expressions to a semantic knowledge base such as Wikidata. To address this research gap, this thesis proposes, implements, and evaluates methods and applications of MathEL using rule-based Artificial Intelligence, Machine Learning, and Wikidata. The research is guided by the following question: “How can classical EL methods and applications be transferred to enable MathEL?” To answer it, the following research tasks are derived: 1) review the state of the art in classical EL and identify why the reviewed approaches are insufficient for MathEL, 2) design and evaluate supervised and unsupervised methods for MathEL, 3) design and evaluate applications of MathEL, and 4) discuss the achievements and challenges to outline future work.
The MathEL research contributions include: 1) formula classification with up to 90% accuracy, 2) semantic formula search outperforming the Google search engine, 3) acceleration of formula annotation to less than half the time, 4) formula Question Answering outperforming the knowledge engine Wolfram Alpha, and 5) reliable and scalable formula question generation using a Computer Algebra System and a Knowledge Graph. The developed open-source demonstrators of the MathEL methods and applications include: 1) interactive visualizations of Machine Learning methods for Formula Concept (FC) discovery and recognition, 2) a formula and identifier annotation Recommender System for Wikipedia articles and STEM documents (AnnoMathTeX), 3) a semantic formula search and mathematical Question Answering system (MathQA), 4) a physics question generation and test engine (PhysWikiQuiz), and 5) an explainable fine-grained hierarchical classification system for mathematical documents (AutoMSC).
Keywords: Information Retrieval; Mathematical Information Retrieval; Machine Learning; Question Answering; Question Generation; Document Classification; Artificial Intelligence