Leveraging Large Language Models for Automated Web Scraping: A Focus on Academic Career Planning
In the highly competitive academic landscape, where the number of applicants far exceeds the available professorial positions, planning an academic career path is a complex task. Crucial data on relevant job positions is scattered across numerous platforms and job portals, in addition to individual job listings on university homepages. This vast and ever-changing data landscape must be laboriously searched and compiled manually, which often leads to errors and incomplete results. Furthermore, tracking potential future vacancies presents an additional challenge. Therefore, a solution is needed that can automatically collect, organize, and present this information efficiently and in a user-friendly manner, thereby potentially increasing academics' chances of finding a professorship that matches their skills and aspirations.
This thesis aims to design and implement an innovative, automated system for collecting, filtering, and presenting data about academic job positions that can aid academics in career planning. The proposed solution needs to leverage both traditional web scraping approaches and state-of-the-art Large Language Models (LLMs) to extract and consolidate publicly accessible information. The data should be pulled from both structured and unstructured data sources, utilizing only publicly accessible web resources such as DFG GERiT, DBpedia, search engines, and university homepages, with the scope set to academic opportunities within the D-A-CH region (Germany, Austria, Switzerland). The necessary data points for each position include, but are not limited to: Country, State, University, Type, Faculty/Department, Institute, Chair, Professorship, URLs, Job Holder, Appointment Date, Grouping (W1-W3), Vacancy, and Application Deadlines.
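The data points listed above can be thought of as one record per position. As a minimal sketch of such a record (the field names and types are illustrative assumptions derived from the list, not a fixed schema of the system):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProfessorshipRecord:
    """One academic position in the D-A-CH region (illustrative schema only)."""
    country: str                              # "DE", "AT", or "CH"
    state: Optional[str] = None
    university: Optional[str] = None
    position_type: Optional[str] = None       # e.g. professorship, junior professorship
    faculty_department: Optional[str] = None
    institute: Optional[str] = None
    chair: Optional[str] = None
    professorship: Optional[str] = None
    urls: list[str] = field(default_factory=list)
    job_holder: Optional[str] = None
    appointment_date: Optional[str] = None    # ISO date string, if known
    grouping: Optional[str] = None            # salary grouping: "W1", "W2", or "W3"
    vacancy: Optional[bool] = None            # position currently or soon vacant?
    application_deadline: Optional[str] = None
```

Fields other than the country are optional because, depending on the source, many of them can only be filled partially; a record can be created early from a structured source and enriched later by LLM-based extraction from unstructured pages.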
The objective of this thesis is the creation of a solution, or the combination of existing techniques, to solve the problem of automatically providing academic career planning information based on methods from the fields of Web Scraping, Data Science, and Natural Language Processing (NLP) as described above. This comprises an analysis of the state of the art in suitable extraction methods, a demonstration of the solution by implementation, and an appropriate experimental evaluation assessing the accuracy and completeness of the data extraction as well as the usability of the result presentation.
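The accuracy and completeness criteria mentioned above could, for instance, be operationalized at the field level: accuracy as the fraction of manually verified (gold) field values that the system reproduces exactly, and completeness as the fraction of expected fields that are populated at all. The following sketch illustrates one such definition (the field list, function names, and exact-match criterion are assumptions for illustration, not the thesis's prescribed metrics):

```python
# Illustrative field-level evaluation; records are plain dicts here.
FIELDS = ["country", "state", "university", "grouping", "vacancy"]

def field_accuracy(extracted: dict, gold: dict) -> float:
    """Fraction of non-empty gold fields whose extracted value matches exactly."""
    total = sum(1 for f in FIELDS if gold.get(f) is not None)
    hits = sum(
        1 for f in FIELDS
        if gold.get(f) is not None and extracted.get(f) == gold[f]
    )
    return hits / total if total else 0.0

def completeness(extracted: dict) -> float:
    """Fraction of fields populated at all, regardless of correctness."""
    return sum(1 for f in FIELDS if extracted.get(f) is not None) / len(FIELDS)
```

In practice, exact string matching would likely be relaxed (e.g. normalized university names or date formats), but the basic distinction holds: a field can be missing, filled but wrong, or filled and correct.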