
taxonomy, may not always be accessible via CSS or
XPath selectors, as they can be nested within product
descriptions or other unstructured text rather than
explicitly encoded in the HTML source.
As illustrated in Figure 5, the GPC attribute list
for bread products contains detailed attribute values
that are often embedded in unstructured formats.
This complexity suggests that, for such cases, direct
extraction may outperform indirect extraction
approaches.
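The distinction can be illustrated with a minimal sketch. It is not taken from our pipeline: the page fragment, the attribute name "Sliced", and the helper by_selector are hypothetical, and a simple keyword rule stands in for text-level extraction. The sketch only shows that a selector can target an attribute that has its own element, whereas an attribute mentioned solely in the description has no element to select.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment (not from any real shop).
PAGE = """
<div class="product">
  <h1 class="name">Whole Grain Loaf</h1>
  <table class="specs">
    <tr><td class="key">Weight</td><td class="value">500 g</td></tr>
  </table>
  <p class="description">Our sliced loaf is baked daily and sold fresh.</p>
</div>
"""

root = ET.fromstring(PAGE)


def by_selector(root, key):
    """Return the value cell of the spec-table row labelled `key`, or None."""
    for row in root.iterfind(".//table[@class='specs']/tr"):
        cells = row.findall("td")
        if len(cells) == 2 and cells[0].text == key:
            return cells[1].text
    return None


# Explicitly encoded attribute: a fixed path reaches it.
weight = by_selector(root, "Weight")

# Hypothetical attribute with no dedicated element: the selector finds nothing.
sliced = by_selector(root, "Sliced")

# The same information is only recoverable from the free text. A keyword
# rule is used here purely as a stand-in for text-level extraction.
description = root.findtext(".//p[@class='description']")
sliced_from_text = "yes" if re.search(r"\bsliced\b", description, re.I) else None

print(weight, sliced, sliced_from_text)
```

In this toy case, the selector-based lookup returns a value only for the attribute stored in the specification table, while the attribute mentioned in the description requires reading the text itself.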
6 CONCLUSION
Our comparative study shows that both direct and
indirect LLM-based extraction approaches can
effectively automate information retrieval from product
web pages. The indirect approach offers substantial
cost savings and comparable accuracy, particularly
when pages follow largely consistent templates,
even if minor structural variations exist within a
single shop.
Future work will focus on evaluating a wider
range of LLMs, refining the dynamic function
generation process, and extending applicability to more
diverse web structures and attribute types. In particular,
we aim to make the attributes defined in the GPC
taxonomy usable with our methods by reliably classifying
products at the lowest hierarchy level (brick), thereby
improving automated attribute extraction from
complex product pages.
Ultimately, our results suggest that LLM-based
automation could serve as a practical and scalable
alternative to manually defined web scraping, enabling
seamless integration with existing data models and a
wide range of downstream applications.
CODE AVAILABILITY
All Jupyter notebooks, scripts, and data can be found
in repositories within the following group:
https://gitlab.rlp.net/ISS/smartcrawl.
ACKNOWLEDGEMENTS
This work was funded by the German Federal
Ministry of Education and Research, BMBF, FKZ
01IS23060.
Parts of the text have been enhanced and
linguistically revised using artificial intelligence tools. All
concepts and implementations described are the
intellectual work of the authors.
REFERENCES
Dang, M.-H., Pham, T. H. T., Molli, P., Skaf-Molli, H., and
Gaignard, A. (2024). LLM4Schema.org: Generating
Schema.org Markups with Large Language Models.
Guo, Y., Li, Z., Jin, X., Liu, Y., Zeng, Y., Liu, W., Li,
X., Yang, P., Bai, L., Guo, J., and Cheng, X. (2025).
Retrieval-Augmented Code Generation for Universal
Information Extraction. In Wong, D. F., Wei, Z.,
and Yang, M., editors, Natural Language Processing
and Chinese Computing, pages 30–42, Singapore.
Springer Nature.
Gur, I., Nachum, O., Miao, Y., Safdari, M., Huang, A.,
Chowdhery, A., Narang, S., Fiedel, N., and Faust, A.
(2023). Understanding HTML with Large Language
Models.
Huang, W., Gu, Z., Peng, C., Li, Z., Liang, J., Xiao, Y.,
Wen, L., and Chen, Z. (2024). AutoScraper: A
Progressive Understanding Web Agent for Web Scraper
Generation.
Krosnick, R. and Oney, S. (2023). Promises and Pitfalls
of Using LLMs for Scraping Web UIs. Published:
https://public.websites.umich.edu/~rkros/papers/LLMs_webscraping_CHI2023_workshop.pdf,
last visited 29.04.2025.
Li, Z., Zeng, Y., Zuo, Y., Ren, W., Liu, W., Su, M., Guo,
Y., Liu, Y., Lixiang, L., Hu, Z., Bai, L., Li, W.,
Liu, Y., Yang, P., Jin, X., Guo, J., and Cheng, X.
(2024). KnowCoder: Coding Structured Knowledge
into LLMs for Universal Information Extraction. In
Ku, L.-W., Martins, A., and Srikumar, V., editors,
Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers), pages 8758–8779, Bangkok, Thailand.
Association for Computational Linguistics.
Evaluation of LLM-Based Strategies for the Extraction of Food Product Information from Online Shops