Measuring metadata quality
by Péter Király
Date of Examination: 2019-06-24
Date of issue: 2019-07-26
Advisor: Prof. Dr. Gerhard Lauer
Referee: Prof. Dr. Gerhard Lauer
Referee: Dr. Marco Büchler
Referee: Prof. Dr. Ramin Yahyapour
Files in this item
Name: Király, Measuring metadata quality. Thesis, ...pdf
Size: 2.27Mb
Format: PDF
Abstract
English
In the last 15 years different aspects of metadata quality have been investigated. Researchers measured the established metrics on a variety of metadata collections. One common aspect of the majority of these research projects is that the tools they produce as a necessary side effect were not intended to be reused in other projects. This research, while focusing mainly on a specific metadata collection, Europeana, investigates practical aspects of metadata quality measurement such as reusability, reproducibility, scalability and adaptability. Europeana.eu - the European digital platform for cultural heritage - aggregates metadata describing 58 million cultural heritage objects from more than 3200 libraries, museums, archives and audiovisual archives across Europe. The collection is heterogeneous, with objects in different formats and languages and descriptions that are formed by different indexing practices. Often these records are also taken from their original context. In order to develop effective services for accessing and using the data, we should know their strengths and weaknesses, or in other words the quality of these data. The need for metadata quality is particularly motivated by its impact on user experience, information retrieval and data re-use in other contexts. In Chapter 2 the author proposes a method and an open source implementation to measure some structural features of these data, such as completeness, multilinguality and uniqueness. The investigation and exposure of record patterns is another way to reveal quality issues. One of the key goals of Europeana is to enable users to retrieve cultural heritage resources irrespective of their origin and the material's metadata language. The presence of multilingual metadata descriptions is therefore essential for successful cross-language retrieval. Quantitatively determining Europeana's crosslingual reach is a prerequisite for enhancing the quality of metadata in various languages.
Capturing multilingual aspects of the data requires us to take the data aggregation lifecycle into account, including data enhancement processes such as automatic data enrichment. In Chapter 3 the author presents an approach developed together with some members of the Europeana Data Quality Committee for assessing multilinguality as part of data quality dimensions, namely completeness, consistency, conformity and accessibility. The chapter describes the defined and implemented measures, and provides initial results and recommendations. The next chapter (Chapter 4), investigating the applicability of the above-mentioned approach, describes the method and results of validating 16 library catalogues. The format of the catalogue records is Machine Readable Cataloging (MARC21), which is the most popular metadata standard for describing books. The research investigates the structural features of the records and as a result finds and classifies different commonly found issues. The most frequent issues are the usage of undocumented schema elements, improper values used in place of terms from a controlled vocabulary, and the failure to meet other strict requirements. The next chapters describe the engineering aspects of the research. First (Chapter 5), a short account of the structure of an extensible metadata quality assessment framework is given, which supports multiple metadata schemas and is flexible enough to work with new schemas. The software has to be scalable to be able to process huge numbers of metadata records within a reasonable time. Fundamental requirements that need to be considered during the design of such software are i) the abstraction of the metadata schema (in the context of the measurement process), ii) how to address distinct parts within metadata records, iii) the workflow of the measurement, iv) a common and powerful interface for the individual metrics, and v) interoperability with Java and REST APIs.
Second (Chapter 6), the optimal parameter settings are investigated for a long-running, stateless process based on Apache Spark in standalone mode. The chapter measures the effects of four different parameters and compares the application's behaviour on two different servers. The most important lesson learned in this experiment is that allocating more resources does not necessarily imply better performance. Moreover, what we really need in an environment with limited and shared resources is a 'good enough' state which lets other processes run as well. To find the optimal settings, it is suggested to pick a smaller sample, which is similar to the full dataset in important features, and measure performance with different settings. The settings worth checking are the number of cores, memory allocation, compression of the source files, and reading from different file systems (if they are available). As a source of ground truth, Spark's default log, the Spark event log, or measuring points inside the application can be used. The final chapter explains future plans, the applicability of the method to other subdomains, such as Wikicite (the open citation data collection of Wikidata) and research data, and research collaborations with different cultural heritage institutions.
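The tuning procedure summarised for Chapter 6 (take a representative sample, run it under different settings, compare the timings) can be illustrated with a minimal sketch. It is not the thesis's code and does not launch Spark. The names `sweep`, `time_run` and `dummy_job`, the setting names `cores` and `memory_gb`, and the tiny sample are all hypothetical stand-ins chosen for illustration; timing is taken with measuring points inside the application, one of the sources the abstract mentions.

```python
import itertools
import time


def time_run(job, sample, settings):
    """Run the job once on the sample and return the elapsed seconds."""
    start = time.perf_counter()
    job(sample, **settings)
    return time.perf_counter() - start


def sweep(job, sample, grid):
    """Time the job for every combination of settings in the grid.

    Returns a list of (settings, seconds) pairs, fastest first.
    """
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        results.append((settings, time_run(job, sample, settings)))
    return sorted(results, key=lambda pair: pair[1])


# Hypothetical stand-in for the real measurement job. An actual run would
# start the Spark application with these settings instead of ignoring them.
def dummy_job(records, cores, memory_gb):
    return sum(len(record) for record in records)


# Assumed parameter grid; the values are placeholders, not recommendations.
grid = {"cores": [1, 2, 4], "memory_gb": [1, 2]}
sample = ["record one", "record two", "record three"]

results = sweep(dummy_job, sample, grid)
for settings, seconds in results:
    print(settings, f"{seconds:.6f}s")
```

Sorting by elapsed time makes it easy to see where additional cores or memory stop paying off, which is the point at which a 'good enough' setting can be chosen.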
Keywords: metadata; cultural heritage; data science; Big Data