Bachelor's Thesis / Internship / Student Research Project

Crawling the Internet of Things Scientific Publications to Create a Large Text Corpus

Completion

2021/08

Research Area

Web Engineering

Advisers

The Web has become an increasingly important resource for finding information. Scientific articles in particular capture the core vocabulary and key findings of a field, so a large text corpus built from a domain's articles can be used to identify that domain's key concepts. However, as the number of publications in a domain grows, it becomes nearly impossible to identify the relevant publications and assemble such a corpus by hand. To date, the literature offers no large text corpus that captures the key concepts of the Internet of Things (IoT) domain.

The main aim of this thesis is therefore to develop a solution that crawls scientific publications from existing portals such as IEEE, Scopus, and Elsevier according to a set of keywords and constructs a meaningful text dataset from them. The resulting corpus can then be used for a variety of tasks, such as training machine learning models to identify certain patterns automatically or clustering the main topics researched within the IoT domain. The thesis also includes a review of the state of the art in dataset creation approaches, a demonstration of the solution through a prototypical implementation, and a suitable evaluation.
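To make the keyword-based corpus construction more concrete, the following is a minimal sketch of the filtering and deduplication step only. It assumes that publication metadata has already been retrieved from a portal; the `Publication` record, the helper names, and the sample data are all hypothetical and chosen purely for illustration. It is not a proposed design for the thesis, and it deliberately leaves out the crawling itself, which depends on each portal's access rules.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """Hypothetical record for one already-retrieved publication."""
    title: str
    abstract: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching and deduplication are stable."""
    return re.sub(r"\s+", " ", text).strip().lower()


def matches_keywords(pub: Publication, keywords: list[str]) -> bool:
    """Return True if any keyword occurs as a whole word or phrase."""
    haystack = normalize(f"{pub.title} {pub.abstract}")
    return any(
        re.search(rf"\b{re.escape(normalize(kw))}\b", haystack)
        for kw in keywords
    )


def build_corpus(pubs: list[Publication], keywords: list[str]) -> list[str]:
    """Keep keyword-matching publications, dropping repeated titles."""
    seen_titles: set[str] = set()
    corpus: list[str] = []
    for pub in pubs:
        key = normalize(pub.title)
        if key in seen_titles or not matches_keywords(pub, keywords):
            continue
        seen_titles.add(key)
        corpus.append(normalize(f"{pub.title}. {pub.abstract}"))
    return corpus


# Illustrative sample data (invented, not real publications).
records = [
    Publication("A Survey of the Internet of Things", "We review IoT architectures."),
    Publication("a survey of the  internet of things", "Duplicate entry from a second portal."),
    Publication("Advances in Biotechnology", "Gene editing methods."),
]
corpus = build_corpus(records, ["internet of things", "iot"])
```

In this sketch the second record is dropped as a duplicate title and the third is rejected because "iot" only appears inside "biotechnology", which the word-boundary match excludes. A real solution would likely need stronger deduplication (for example by DOI) and full-text extraction rather than titles and abstracts alone.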

——

If you are interested in this topic, please contact me by email (mahda.noura@informatik.tu-chemnitz.de) to discuss it in detail.
