Master's Thesis

Assessing Data Quality in Digital Research Dataset Metadata to Improve Discoverability and Interdisciplinary Reuse

Research Area

Intelligent Information Management

Students

Advisers

Nowadays, scientists are encouraged to publish their research artifacts in established research data repositories so that others can find and reuse them. In this publishing process, researchers typically provide additional information as metadata to describe the characteristics of such a dataset. The data repository then exposes this metadata in a number of common formats together with the research artifact, and it is used for indexing and crawling activities. Nevertheless, the discoverability of such a research dataset is often still limited: it relies mainly on administrative metadata and less on structured descriptive metadata about the content of the dataset.

In this project, we focus on data quality metrics that specifically address the discoverability of published research datasets, and on assessing these metrics on existing metadata descriptions. In a first step, a typical research dataset discovery process and its shortcomings have to be described, together with the metadata description formats and schemas commonly used for it. After a requirements and state-of-the-art analysis, relevant data quality criteria have to be identified that can be measured on research dataset metadata descriptions to assess how well other researchers can find, access, and reuse the dataset across disciplines for a particular use case. In a second step, it has to be checked which of these criteria can be assessed automatically on an existing metadata description, and it has to be shown conceptually how these measurements can be run for a given research dataset. Existing approaches such as FAIRmetrics or the OpenAIRE Guidelines for Data Archives can be taken into consideration. An implementation and evaluation has to show the feasibility and correctness of the approach based on a representative corpus of research datasets from different repositories or application domains.
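
To illustrate what an automated assessment of a metadata description might look like, the following is a minimal sketch in Python. It is not taken from FAIRmetrics or the OpenAIRE guidelines. The field names in `DESCRIPTIVE_FIELDS`, the two criteria, and the keyword threshold are assumptions chosen only for illustration; identifying the actual criteria is part of the thesis work.

```python
from typing import Dict, Sequence

# Hypothetical set of descriptive fields (an assumption for this sketch).
# A real assessment would derive the fields from a concrete metadata schema.
DESCRIPTIVE_FIELDS = ["title", "description", "keywords", "subject", "license"]


def is_filled(value: object) -> bool:
    """Return True if a metadata value carries actual content."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())  # whitespace-only strings count as empty
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def completeness(record: Dict[str, object],
                 fields: Sequence[str] = DESCRIPTIVE_FIELDS) -> float:
    """Share of the given fields that are present and non-empty (0.0 to 1.0)."""
    if not fields:
        return 0.0
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return filled / len(fields)


def has_min_keywords(record: Dict[str, object], minimum: int = 3) -> bool:
    """Check whether the record lists at least `minimum` non-empty keywords.

    The threshold of 3 is an arbitrary illustrative value.
    """
    keywords = record.get("keywords") or []
    if not isinstance(keywords, list):
        return False
    return len([k for k in keywords if is_filled(k)]) >= minimum


def assess(record: Dict[str, object]) -> Dict[str, object]:
    """Run all illustrative checks on one metadata record."""
    return {
        "completeness": completeness(record),
        "has_min_keywords": has_min_keywords(record),
    }


# Made-up example record, not taken from any real repository.
example_record = {
    "title": "Soil moisture measurements",
    "description": "   ",
    "keywords": ["soil", "moisture"],
    "license": "CC-BY-4.0",
}
report = assess(example_record)
```

For the made-up record above, three of the five assumed fields are filled, so the completeness score is 0.6, and the keyword check fails because only two keywords are given. Scores of this kind could then be aggregated over a corpus of records from different repositories to compare them.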