Master's Thesis
     Text Analysis to identify scientific concepts in Research Dataset Abstracts
        Completion
2021/05
Research Area
Intelligent Information Management
Students
 
                    Egisa Kasemi
Advisers
 
             
            Prof. Dr.-Ing. Martin Gaedke
Description
In the context of Open Science, researchers are encouraged to publish their research datasets in common data repositories so that others can find and reuse them. Because the files in such a dataset are usually not self-descriptive, the authors have to provide additional metadata about the nature, format and content of the dataset. This commonly includes the names of the authors, a title, a description text and some keywords. However, keywords often do not cover all characteristics of a dataset. Most of the information is provided in unstructured form in the description text (abstract) of the metadata record. It would be beneficial if automated means could extract relevant entities from such a text and semantically map them to corresponding concept identifiers in a well-known terminology.
The objective of this Master's thesis project is to apply an appropriate text mining approach, such as Natural Language Processing (NLP), to a set of research dataset abstracts in order to identify relevant entities, such as the examined object, research objective, device or software used, file type, scientific method or other measurement characteristics, insofar as they are mentioned in the descriptive text.
The use case can be limited to a particular research domain, such as human-machine interaction.
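To illustrate what identifying entities in an abstract could look like, the following is a minimal sketch of a dictionary-based (gazetteer) lookup. It is not the approach prescribed by the thesis topic: the terms, the category labels, the sample abstract and the function name `extract_entities` are all assumptions made for illustration, and a real solution would more likely rely on a trained NLP model.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer: surface term -> entity category.
# Terms and category labels are illustrative assumptions only.
GAZETTEER = {
    "eye tracker": "device",
    "csv": "file_type",
    "questionnaire": "method",
    "python": "software",
}


def extract_entities(abstract: str) -> List[Tuple[str, str, int]]:
    """Return (term, category, start_offset) for each gazetteer hit."""
    hits = []
    for term, category in GAZETTEER.items():
        # Match whole words only, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, abstract, flags=re.IGNORECASE):
            hits.append((term, category, match.start()))
    # Order the hits by their position in the text.
    return sorted(hits, key=lambda hit: hit[2])


# Invented sample abstract from the human-machine interaction domain.
abstract = (
    "Gaze data were recorded with an eye tracker and exported as CSV. "
    "Participants also completed a questionnaire."
)
entities = extract_entities(abstract)
```

A lookup like this only finds terms that are already listed, which is one reason a state-of-the-art analysis of more robust extraction approaches is part of the task.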
To achieve this, a requirements analysis has to be performed first. Then, a state-of-the-art analysis of existing approaches has to be conducted. A concept has to be designed and described in which a semi-structured metadata description of a research dataset from a common data repository is provided as input, the approach analyses the abstract text, and the solution reconciles all identified entities with appropriate Linked Data identifiers. An implementation has to demonstrate the feasibility of this approach, and an evaluation has to assess quality parameters such as accuracy in a practical environment.
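The reconciliation and evaluation steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the `ex:` identifiers are placeholders rather than real Linked Data IRIs, the lookup table stands in for a query against an actual terminology, and precision and recall are used here as one possible way to quantify accuracy.

```python
# Hypothetical lookup from an extracted term to a concept identifier.
# The "ex:" identifiers are placeholders, not real vocabulary IRIs.
CONCEPT_IDS = {
    "eye tracker": "ex:EyeTracker",
    "csv": "ex:CSV",
    "questionnaire": "ex:Questionnaire",
}


def reconcile(terms):
    """Map each term to an identifier; unknown terms map to None."""
    return {term: CONCEPT_IDS.get(term.lower()) for term in terms}


def precision_recall(predicted: set, gold: set):
    """Compare predicted identifiers against a manually annotated gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three extracted terms, one of them unknown to the lookup.
mapping = reconcile(["Eye tracker", "CSV", "heart rate"])
predicted = {cid for cid in mapping.values() if cid is not None}
gold = {"ex:EyeTracker", "ex:CSV", "ex:HeartRate"}
precision, recall = precision_recall(predicted, gold)
```

In this invented example every predicted identifier is correct, but one gold concept is missed, so recall is lower than precision. Reporting both values makes that kind of gap visible.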
 
                    


