Automatically Assuring Data Quality Aspects in Semantic Research Metadata
In the context of Open Science, researchers are encouraged to publish their research datasets in common data repositories so that others can find and reuse them. To increase the findability, accessibility, interoperability, and reusability (FAIR) of a research dataset, metadata has to be provided that describes all relevant characteristics of the dataset. However, assuring the quality of published research data and metadata in data repositories is not trivial: it often involves human interaction and tedious reviews, or it is not done at all.
This project focuses on an automated metadata quality assessment component for published research datasets. Data quality considerations can be limited to semantic metadata descriptions of research datasets from a particular knowledge domain. In a first step, it has to be identified where research datasets are published and which metadata description formats are commonly used. After a requirements and state-of-the-art analysis, relevant data quality criteria have to be identified that can be measured on research dataset metadata descriptions. In a second step, it has to be checked which of these criteria can be assessed automatically on a semantic metadata description, and it has to be shown conceptually how these measurements can be run for a given research dataset. Existing approaches such as FAIRmetrics can be applied for that. An implementation and evaluation have to show the feasibility and correctness of the approach based on a representative corpus of research datasets.
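To make the idea of automatically measurable criteria more concrete, the following is a minimal sketch, not a prescribed design. It assumes that a metadata description has already been parsed into a flat Python dictionary with keys loosely modelled on schema.org (`name`, `description`, `license`, `identifier`). The four criteria, the `Criterion` class, the `assess` function, and the example record (including its DOI) are hypothetical illustrations; they are not taken from FAIRmetrics or any particular repository.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """A named, automatically checkable quality criterion (hypothetical)."""
    name: str
    check: Callable[[dict], bool]


def has_nonempty(field: str) -> Callable[[dict], bool]:
    """Build a check that passes if `field` holds a non-blank string."""
    def check(metadata: dict) -> bool:
        value = metadata.get(field)
        return isinstance(value, str) and value.strip() != ""
    return check


def has_web_identifier(metadata: dict) -> bool:
    """Pass if the identifier is written as an HTTP(S) URI.

    This only inspects the form of the string; it does not try to
    resolve the identifier over the network.
    """
    identifier = metadata.get("identifier")
    return isinstance(identifier, str) and identifier.startswith(
        ("http://", "https://")
    )


# Illustrative criteria only; a real set would come from the
# requirements and state-of-the-art analysis.
CRITERIA = [
    Criterion("has_title", has_nonempty("name")),
    Criterion("has_description", has_nonempty("description")),
    Criterion("has_license", has_nonempty("license")),
    Criterion("has_web_identifier", has_web_identifier),
]


def assess(metadata: dict, criteria=CRITERIA):
    """Run every criterion and return (per-criterion results, score).

    The score is the fraction of criteria that passed.
    """
    results = {c.name: c.check(metadata) for c in criteria}
    score = sum(results.values()) / len(results)
    return results, score


# Hypothetical record: description is missing and license is blank.
example = {
    "name": "Ocean temperature profiles",
    "license": "   ",
    "identifier": "https://doi.org/10.1234/example",
}

results, score = assess(example)
```

For the example record, two of the four checks pass (title and identifier), so `score` is 0.5. Representing each criterion as a named function keeps the set of checks open for extension, and the per-criterion results can be reported alongside the aggregate score when evaluating a whole corpus.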