Aspect-based Document Similarity for Literature Recommender Systems
Doctoral thesis
Date of Examination: 2023-04-25
Date of issue: 2023-06-20
Advisor: Prof. Dr. Bela Gipp
Referee: Prof. Dr. Bela Gipp
Referee: Ph.D. Martin Klein
Referee: Prof. Dr. Sack Harald
Files in this item
Name: thesis_digital_without_cv.pdf
Size: 5.38 MB
Format: PDF
Abstract
English
Literature recommendation systems assist readers in the discovery of relevant documents. Content-based systems recommend documents similar to the currently viewed document. However, the simple distinction between similar and dissimilar documents neglects the many aspects that make documents similar. For instance, two scientific papers may use a similar methodology while covering different research problems. Current document similarity measures are aspect-free, i.e., they cannot differentiate between specific aspects of the document content. To address this limitation, this thesis proposes aspect-based document similarity for literature recommendations. By incorporating aspect information, recommendations can account for specific aspects of the document content. This thesis makes three contributions: First, it evaluates document representations and similarity measures and demonstrates that the lack of aspect information notably impacts recommendations. Second, it designs a new scientific document representation method that improves upon the state of the art. Third, it designs two approaches for aspect-based document similarity that address the limitations of aspect-free similarity.
The thesis evaluates existing document similarity methods, focusing on methods that use graph and text information. The qualitative and quantitative evaluations reveal that although the overall user satisfaction is comparable between the two information sources, users perceive the recommendations from these sources as different. Therefore, the choice of similarity measure affects the generated recommendations, i.e., different measures implicitly address different aspects.
Furthermore, the thesis designs a novel scientific document representation method. The method is called SciNCL and relies on citation graph embeddings to select the most informative samples for the contrastive fine-tuning of a text-based document encoder. SciNCL achieves state-of-the-art results and is applicable to both aspect-free and aspect-based similarity.
Subsequently, the thesis designs a first aspect-based document similarity measure based on a pairwise multi-class classification approach. Aspect-free similarity can be viewed as a pairwise binary document classification (similar or not); extending it to a multi-class classification allows measuring similarity with respect to a given aspect. The pairwise classification approach is implemented and evaluated for Wikipedia articles and scientific literature. The thesis also implements a second approach that uses specialized document representations to further improve the efficiency of aspect-based similarity. By formulating aspect-based similarity as a vector similarity problem in aspect-specific embedding spaces, aspect information is encoded only once per document and aspect rather than for every document pair. This makes the approach scale linearly with the corpus size. Further evaluations reveal that aspect-free representations have an implicit bias towards one aspect, confirming the problem of missing aspect information. The specialized document representations mitigate potential risks from implicit biases by making them explicit and controllable.
Finally, the practicality of aspect-based document similarity is demonstrated with a prototypical research paper recommender system. The prototype provides diverse recommendations from different aspects and recommendations tailored to specific aspects.
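As a minimal illustration of the vector-similarity formulation described in the abstract, the following Python sketch ranks documents by cosine similarity within a single aspect-specific embedding space. The aspect names (`task`, `method`), the toy two-dimensional vectors, and the helper names `cosine` and `recommend` are assumptions made for this example only; they are not taken from the thesis, whose actual encoders and aspects are described in the full text.

```python
import math

# Hypothetical toy data: one vector per (document, aspect) pair.
# In a real system these vectors would come from aspect-specific
# document encoders; here they are hand-picked for illustration.
EMBEDDINGS = {
    "paper_a": {"task": [1.0, 0.0], "method": [0.0, 1.0]},
    "paper_b": {"task": [0.0, 1.0], "method": [0.0, 1.0]},
    "paper_c": {"task": [1.0, 0.0], "method": [1.0, 0.0]},
}


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def recommend(seed, aspect, embeddings, k=1):
    """Rank all other documents by similarity to `seed` in one aspect space."""
    seed_vec = embeddings[seed][aspect]
    scored = [
        (cosine(seed_vec, vecs[aspect]), doc_id)
        for doc_id, vecs in embeddings.items()
        if doc_id != seed
    ]
    # Highest similarity first.
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    print(recommend("paper_a", "method", EMBEDDINGS))
    print(recommend("paper_a", "task", EMBEDDINGS))
```

In this toy data, `paper_a` shares its method vector with `paper_b` and its task vector with `paper_c`, so the top recommendation changes with the requested aspect. Each document is embedded once per aspect, and a query only compares the seed vector against the stored vectors of that aspect.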
Keywords: Document similarity; Recommender systems; Aspect; Literature