
ified the LLM and the sequence-to-sequence classi-
fier approach for the identification of landing pages
of high/low-level entities. Finally, we evaluated
our approach on four universities, namely
Duisburg-Essen, Münster, Dortmund, and Wuppertal.
The results indicate that the implemented approach
works across universities and identifies the
university structure and its entities with average
F1 scores of 85% for the aggregator pages, 100% for
faculties, and 78% for institutes/chairs.
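As an illustration of how such an average can be obtained, the following minimal sketch computes an F1 score per site from true-positive, false-positive, and false-negative counts and then takes the unweighted mean. The site names and counts are hypothetical toy values, not the evaluation data, and the unweighted mean is an assumption about the averaging scheme.

```python
from statistics import mean


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2PR / (P + R); 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts per site; NOT the paper's data.
toy_counts = {
    "site_a": (8, 2, 2),
    "site_b": (5, 0, 0),
    "site_c": (3, 1, 3),
}

# Per-site F1, then the unweighted (macro) average across sites.
per_site = {name: f1_score(*counts) for name, counts in toy_counts.items()}
average_f1 = mean(per_site.values())
```

An unweighted mean treats every site equally regardless of how many pages it contains; weighting by page count would be an alternative design choice.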
As part of INSE, we are working to build a graph-
ical user interface around our approach with the ob-
jective of supporting the innovation coaches of our
university in scouting and screening tasks. For future
work, we are planning to investigate a visual-based
approach for the aggregator and landing page identi-
fication via convolutional neural networks.
ACKNOWLEDGEMENTS
This work has been funded by GUIDE REGIO, which
aims to improve the ability of the science support cen-
ter of the University of Duisburg-Essen in the iden-
tification, qualification, and incubation of innovation
potentials.
REFERENCES
Abdelakfi, M., Mbarek, N., and Bouzguenda, L. (2021).
Mining organizational structures from email logs: an
nlp based approach. Procedia Computer Science,
192:348–356.
Adoma, A. F., Henry, N.-M., and Chen, W. (2020). Com-
parative analyses of bert, roberta, distilbert, and xl-
net for text-based emotion recognition. In 2020 17th
international computer conference on wavelet active
media technology and information processing (IC-
CWAMTIP), pages 117–121. IEEE.
Aich, S., Chakraborty, S., and Kim, H.-C. (2019). Convolu-
tional neural network-based model for web-based text
classification. International Journal of Electrical &
Computer Engineering (2088-8708), 9(6).
Arzani, A., Handte, M., and Marrón, P. J. (2023). Challenges
in implementing a university-based innovation
search engine. In KDIR, pages 477–486.
Bartík, V. (2010). Text-based web page classification with
use of visual information. In 2010 International Con-
ference on Advances in Social Networks Analysis and
Mining, pages 416–420. IEEE.
Isogai, S., Ogata, S., Kashiwa, Y., Yazawa, S., Okano, K.,
Okubo, T., and Washizaki, H. (2024). Toward extract-
ing learning pattern: A comparative study of gpt-4o-
mini and bert models in predicting cvss base vectors.
In 2024 IEEE 35th International Symposium on Soft-
ware Reliability Engineering Workshops (ISSREW),
pages 127–134. IEEE.
Kumar, R., Punera, K., and Tomkins, A. (2006). Hierar-
chical topic segmentation of websites. In Proceedings
of the 12th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 257–
266.
Li, W.-S., Kolak, O., Vu, Q., and Takano, H. (2000). Defin-
ing logical domains in a web site. In Proceedings
of the eleventh ACM on Hypertext and hypermedia,
pages 123–132.
Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W.,
Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J.,
et al. (2023). The flan collection: Designing data
and methods for effective instruction tuning. In In-
ternational Conference on Machine Learning, pages
22631–22648. PMLR.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N.,
Chenaghlu, M., and Gao, J. (2021). Deep learning–
based text classification: a comprehensive review.
ACM computing surveys (CSUR), 54(3):1–40.
Ni, Z., Wang, S., and Li, H. (2011). Mining organizational
structure from workflow logs. In Proceeding of the
International Conference on e-Education, Entertain-
ment and e-Management, pages 222–225. IEEE.
Nurek, M. and Michalski, R. (2020). Combining machine
learning and social network analysis to reveal the or-
ganizational structures. Applied Sciences, 10(5):1699.
Rehm, G. (2006). Hypertextsorten: Definition, Struktur,
Klassifikation. PhD thesis, Universitätsbibliothek
Giessen.
Sava, D. (2024). Text-based classification of websites using
self-hosted large language models: An accuracy and
efficiency analysis. B.S. thesis, University of Twente.
Sun, A. and Lim, E.-P. (2003). Web unit mining: finding
and classifying subgraphs of web pages. In Proceed-
ings of the twelfth international conference on Infor-
mation and knowledge management, pages 108–115.
Sun, A. and Lim, E.-P. (2006). Web unit-based mining
of homepage relationships. Journal of the Ameri-
can Society for Information Science and Technology,
57(3):394–407.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., et al. (2023). Llama: Open and efficient
foundation language models. arXiv preprint
arXiv:2302.13971.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C.,
Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz,
M., et al. (2019). Huggingface’s transformers: State-
of-the-art natural language processing. arXiv preprint
arXiv:1910.03771.
Yang, C. C. and Liu, N. (2009). Web site topic-hierarchy
generation based on link structure. Journal of the
American Society for Information Science and Tech-
nology, 60(3):495–508.
A Hybrid Approach for Mining the Organizational Structure from University Websites