Master's Thesis / Bachelor's Thesis / Team-Oriented Project Work
Information Extraction from Scientific Documents
The submission and publication process in science is mainly document-oriented. Full papers for a conference are commonly submitted as PDF documents, whose content is more or less appropriately structured depending on the field of study. However, searching for particular information in those documents is often limited to keyword-based full-text search, because the relevant concepts can only be partially accessed in an automated fashion. As a consequence, a systematic literature review is often time-consuming, as an entire document has to be read to find a particular piece of information.
The proposed topic therefore deals with automating knowledge extraction from PDF documents by applying modern approaches from text mining and natural language processing (NLP). First, a literature review has to establish the state of the art, that is, which strategies and software tools already exist. Next, common data sets of scientific publications have to be analyzed to determine which data is relevant to extract. This can start with simple items such as the title, the authors' names, and the conference or journal information, but has to be extended to more sophisticated concepts such as definitions, the main problem, the methodology, or the evaluation method. All extracted information should then be mapped to a semantic representation so that other systems can process it further. Finally, an evaluation of the proposed method has to show the degree to which the information extraction is successful and how the approach performs in comparison to existing solutions.
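To make the last two steps more concrete, the following is a minimal sketch of how extracted metadata could be mapped to a semantic representation and scored against a manually annotated reference. Everything in it is an assumption made for illustration: the `ExtractedPaper` container, the choice of schema.org terms in a JSON-LD-style dictionary, and the exact-match scoring on normalized field values are hypothetical and not prescribed by the topic. The actual extraction from the PDF is deliberately left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExtractedPaper:
    """Hypothetical container for the fields extracted from one PDF."""
    title: str
    authors: list[str] = field(default_factory=list)
    venue: str = ""


def to_jsonld(paper: ExtractedPaper) -> dict:
    """Map extracted fields to a JSON-LD-style dict (schema.org terms are an assumption)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "name": paper.title,
        "author": [{"@type": "Person", "name": a} for a in paper.authors],
    }
    if paper.venue:
        # The venue is modelled as a generic CreativeWork the article is part of.
        doc["isPartOf"] = {"@type": "CreativeWork", "name": paper.venue}
    return doc


def as_pairs(paper: ExtractedPaper) -> set[tuple[str, str]]:
    """Flatten a paper into normalized (field, value) pairs for comparison."""
    pairs = {("title", paper.title.strip().lower())}
    pairs |= {("author", a.strip().lower()) for a in paper.authors}
    if paper.venue:
        pairs.add(("venue", paper.venue.strip().lower()))
    return pairs


def precision_recall_f1(predicted: ExtractedPaper, gold: ExtractedPaper):
    """Exact-match precision, recall, and F1 over (field, value) pairs."""
    pred, ref = as_pairs(predicted), as_pairs(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example values, not taken from any real data set.
    predicted = ExtractedPaper("An Example Title", ["Ada Lovelace"], "Example Conference")
    gold = ExtractedPaper("An Example Title", ["Ada Lovelace", "Alan Turing"], "Example Conference")
    print(json.dumps(to_jsonld(predicted), indent=2))
    print(precision_recall_f1(predicted, gold))
```

Exact matching on normalized strings is the simplest possible scoring scheme. For more sophisticated concepts such as definitions or the methodology, a fuzzier comparison would probably be needed, and choosing it is part of the evaluation design.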