Automated Semantic Metadata Extraction from Research Data
In the context of Open Science, researchers are encouraged to publish their research datasets in common data repositories so that others can find and reuse them. To increase the findability of a research dataset, metadata has to be provided that describes the characteristics of the contained data. However, manually annotating research datasets is a tedious process, and the metadata is often provided in an ambiguous, literal form. On the other hand, a variety of tools already exist that are capable of extracting information from an input file of a particular file type. It is therefore assumed that these tools can also be applied to extract metadata that describes certain characteristics of research datasets in a semantic fashion.
The aim of this project is to identify and apply such metadata extractors to research datasets from the computer science knowledge domain. After the term research dataset has been defined, a list of relevant file types that can occur in this knowledge domain has to be identified. Then, a requirements analysis and a state-of-the-art analysis of existing tools have to be carried out to determine how relevant metadata can be extracted from different types of research data, such as tabular files and multimedia files. Next, a concept has to be designed for how appropriate metadata extraction tools can be used to extract metadata from a given input research dataset file. To increase the reusability of the extracted metadata, the information shall be output in an RDF serialization format based on appropriate, well-established ontologies. An implementation and an evaluation have to show the feasibility and correctness of the approach. NLP techniques for documents such as PDF or DOC files can be excluded from this research project.
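To make the intended pipeline more concrete, the following is a minimal sketch of how metadata could be extracted from a tabular file and serialized as RDF in Turtle. It is an illustration under stated assumptions, not the project's actual design: the function names, the example.org namespace and the properties ex:columnName and ex:rowCount are hypothetical placeholders, and the choice of DCAT and Dublin Core terms is only one possible candidate among the well-established ontologies.

```python
import csv
import io

# Namespace prefixes. dcat: and dcterms: are existing vocabularies;
# ex: is a hypothetical placeholder namespace used only for this sketch.
PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    "@prefix ex: <http://example.org/vocab#> .\n"
)


def escape_literal(value):
    """Escape characters that are not allowed inside a Turtle string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def extract_csv_metadata(text):
    """Extract column names and the number of data rows from CSV text.

    Assumes the first row is a header row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    return {"columns": header, "row_count": max(len(rows) - 1, 0)}


def to_turtle(subject_iri, title, meta):
    """Serialize the extracted metadata as a Turtle document."""
    lines = [
        PREFIXES,
        f"<{subject_iri}> a dcat:Distribution ;",
        f'    dcterms:title "{escape_literal(title)}" ;',
        '    dcterms:format "text/csv" ;',
    ]
    for name in meta["columns"]:
        # ex:columnName is a hypothetical property, not part of any standard.
        lines.append(f'    ex:columnName "{escape_literal(name)}" ;')
    # ex:rowCount is likewise hypothetical.
    lines.append(f'    ex:rowCount {meta["row_count"]} .')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "id,temperature\n1,20.5\n2,21.0\n"
    metadata = extract_csv_metadata(sample)
    print(to_turtle("http://example.org/data/sample.csv", "sample.csv", metadata))
```

In a fuller implementation, one such extractor per file type (for example for tabular and multimedia files) could be selected based on the detected file type, and the hypothetical ex: properties would be replaced by terms from the ontologies chosen in the concept phase.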