Authors:
Arman Arzani; Theodor Josef Vogl; Marcus Handte and Pedro José Marrón
Affiliation:
University of Duisburg-Essen, Essen, Germany
Keyword(s):
Innovation Management, Data Mining, University Structure Extraction, Web Page Classification.
Abstract:
To support innovation coaches in scouting activities such as discovering expertise and trends inside a university and finding potential innovators, we designed INSE, an innovation search engine that automates data gathering and analysis. The primary goal of INSE is to provide comprehensive system support across all stages of innovation scouting, reducing the need for manual data collection and aggregation. To provide innovation coaches with the necessary information on individuals, INSE must first establish the structure of the organization. This includes identifying the associated staff and researchers in order to assess their academic activities. While this could in theory be done manually, the task is error-prone and virtually impossible for large organizations. In this paper, we propose a generic organization mining approach that combines a rule-based algorithm, LLMs, and a fine-tuned sequence-to-sequence classifier on university websites, independent of web technologies, content management systems, or website layout. We implement the approach and evaluate it on four universities, namely Duisburg-Essen, Münster, Dortmund, and Wuppertal. The evaluation indicates that our approach is generic and enables the identification of university aggregator pages with an F1 score above 85%, and of entity landing pages with F1 scores of 100% for faculties and above 78% for institutes and chairs.
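For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed for a set of predicted landing pages against a manually labeled gold set. It is not the authors' evaluation code: the function name `f1_score` and the example URLs are hypothetical, and the paper may match pages differently (for example, after URL normalization).

```python
def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Return the F1 score of a predicted page set against a gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Covers empty inputs and avoids division by zero.
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(gold)          # share of gold pages that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical URLs for illustration only; not taken from the evaluation.
gold = {
    "https://example.edu/faculty-a",
    "https://example.edu/faculty-b",
    "https://example.edu/faculty-c",
}
predicted = {
    "https://example.edu/faculty-a",
    "https://example.edu/faculty-b",
    "https://example.edu/news",
}
score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this toy case two of three predictions are correct and two of three gold pages are found, so precision and recall are both 2/3. An F1 of 100% for faculties would correspond to the predicted and gold sets being identical.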