Master's Thesis

Text Analysis to identify scientific concepts in Research Dataset Abstracts

Completion

2021/05

Research Area

Intelligent Information Management

Students

Egisa Kasemi

student

Advisers

Prof. Dr.-Ing. Martin Gaedke

professor

Room: 1/B319

Phone: +49 371 531 25530

Email: gaedke@informatik.tu-chemnitz.de

Description

In the context of Open Science, researchers are encouraged to publish their research datasets in common data repositories so that others can find and reuse them. As the files contained in such a research dataset are commonly not self-descriptive, the authors have to provide additional meta-information about the nature, format and content of the dataset. This commonly includes the names of the authors, a title, a description text and some keywords. However, keywords often do not cover all characteristics of a dataset. Most of the information is provided in an unstructured way in the description text (abstract) of the metadata description. It would be beneficial if automated means could extract relevant entities from such a text and semantically map them to corresponding concept identifiers in a well-known terminology.

The objective of the Master's thesis project is to apply an appropriate Text Mining approach, such as Natural Language Processing (NLP), to a set of research dataset abstracts in order to identify relevant entities, such as the examined object, research objective, device or software used, file type, scientific method or other measurement characteristics, as long as they are mentioned in the descriptive text.
The use case can be limited to a particular research domain, such as research datasets from human-machine interaction.
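As an illustration of what identifying such entities could look like, the following is a minimal sketch of a dictionary-based (gazetteer) lookup in Python. It is only one simple baseline among many possible Text Mining approaches and is not prescribed by this topic. The `GAZETTEER` terms, the category names, the `Entity` type and the sample abstract are all hypothetical.

```python
import re
from typing import List, NamedTuple

# Hypothetical gazetteer: surface forms mapped to entity categories.
# The terms are illustrative and not taken from any real dataset.
GAZETTEER = {
    "eye tracker": "device",
    "eeg": "scientific_method",
    "csv": "file_type",
    "python": "software",
}


class Entity(NamedTuple):
    text: str       # surface form as it appears in the abstract
    category: str   # entity category from the gazetteer
    start: int      # character offset where the match begins
    end: int        # character offset where the match ends


def extract_entities(abstract: str) -> List[Entity]:
    """Return all gazetteer matches in the abstract, ordered by position."""
    found = []
    for term, category in GAZETTEER.items():
        # Whole-word, case-insensitive match of the gazetteer term.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(abstract):
            found.append(Entity(match.group(0), category, match.start(), match.end()))
    return sorted(found, key=lambda entity: entity.start)


# Invented example abstract.
abstract = "Gaze data were recorded with an eye tracker and stored as CSV files."
entities = extract_entities(abstract)
for entity in entities:
    print(entity.category, "->", entity.text)
```

A lookup of this kind only finds terms that are already listed, so it mainly serves as a baseline against which statistical or neural NLP approaches could be compared.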

To achieve this, a requirements analysis has to be performed first. Then, a state-of-the-art analysis of existing approaches has to be conducted. A concept has to be designed and described in which a semi-structured metadata description of a research dataset from a common data repository is provided as input, the approach analyses the abstract text, and the solution reconciles each identified entity with an appropriate Linked Data identifier. An implementation has to show the feasibility of this approach, and an evaluation has to assess quality parameters such as accuracy in a practical environment.
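The reconciliation step of such a concept could, for example, be sketched as follows. This is a minimal illustration under stated assumptions, not a proposed solution: the `TERMINOLOGY` table, its `http://example.org/...` identifiers, and the function names are hypothetical placeholders, and the fuzzy string matching from Python's standard `difflib` module stands in for whatever reconciliation technique the thesis eventually selects.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical terminology: preferred labels mapped to placeholder identifiers.
# A real solution would query a well-known Linked Data terminology instead.
TERMINOLOGY = {
    "eye tracking": "http://example.org/concept/eye-tracking",
    "electroencephalography": "http://example.org/concept/electroencephalography",
    "comma-separated values": "http://example.org/concept/csv",
}


def reconcile(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Map one entity mention to an identifier, or None if nothing is close enough."""
    # Normalise case and whitespace before comparing labels.
    normalized = " ".join(mention.lower().split())
    if normalized in TERMINOLOGY:
        return TERMINOLOGY[normalized]
    # Fall back to the single most similar label above the cutoff.
    close = difflib.get_close_matches(normalized, list(TERMINOLOGY), n=1, cutoff=cutoff)
    return TERMINOLOGY[close[0]] if close else None


def reconcile_all(mentions: List[str]) -> Dict[str, Optional[str]]:
    """Reconcile every mention; unmatched mentions map to None."""
    return {mention: reconcile(mention) for mention in mentions}


# Mentions assumed to have been extracted from an abstract beforehand.
result = reconcile_all(["Eye  Tracking", "eye-tracking", "spectrometer"])
for mention, identifier in result.items():
    print(repr(mention), "->", identifier)
```

Returning `None` for mentions without a sufficiently similar label keeps unreconciled entities visible, which may help when assessing accuracy during the evaluation.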