Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data

Dawid Plaskowski, Szymon Skwarek, Dominika Grajewska, Maciej Niemir, Agnieszka Ławrynowicz

2024

Abstract

To address the challenge of extracting opinions from semi-structured webpages such as blog posts and product rankings, we employ encoder-decoder transformer models. We enhance the models’ performance by generating synthetic data using large language models such as GPT-3.5 and GPT-4, diversified through prompts featuring various text styles, personas, and product characteristics. We experiment with different fine-tuning strategies, training both with and without domain-adapted instructions as well as on synthetic customer reviews, and targeting tasks such as extracting product names, pros, cons, and opinion sentences. Our evaluation shows a significant improvement in the models’ performance on both the product characteristic and opinion extraction tasks, which validates the effectiveness of synthetic data for fine-tuning and signals the potential of pretrained language models to automate web scraping across diverse web sources.
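The prompt diversification mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual prompts or attribute values, so everything below is an assumption made for illustration: the attribute pools, the product entries, and the helper names `build_prompt` and `sample_prompts` are hypothetical. The sketch only builds prompt strings together with the labels that a fine-tuning target might use; it does not call any language model.

```python
import itertools
import random

# Hypothetical attribute pools (assumed for illustration, not taken from the paper).
STYLES = ["casual blog post", "formal product ranking", "short forum comment"]
PERSONAS = ["budget-conscious student", "professional photographer"]
PRODUCTS = [
    {"name": "Acme X100 headphones", "pros": ["long battery life"], "cons": ["bulky case"]},
    {"name": "Acme Brew 2 coffee maker", "pros": ["fast heating"], "cons": ["loud pump"]},
]


def build_prompt(style, persona, product):
    """Return a generation prompt and the labels known by construction."""
    prompt = (
        f"Write a {style} from the point of view of a {persona} "
        f"reviewing the product '{product['name']}'. "
        f"Mention these advantages: {', '.join(product['pros'])}. "
        f"Mention these drawbacks: {', '.join(product['cons'])}."
    )
    # Because the product attributes are chosen before generation,
    # they can double as extraction targets for the generated text.
    labels = {
        "product_name": product["name"],
        "pros": product["pros"],
        "cons": product["cons"],
    }
    return prompt, labels


def sample_prompts(n, seed=0):
    """Pick up to n distinct (style, persona, product) combinations."""
    rng = random.Random(seed)
    combos = list(itertools.product(STYLES, PERSONAS, PRODUCTS))
    rng.shuffle(combos)
    return [build_prompt(*combo) for combo in combos[:n]]


if __name__ == "__main__":
    for prompt, labels in sample_prompts(3):
        print(prompt)
        print(labels)
```

Crossing the pools gives every combination of style, persona, and product, and a seeded shuffle keeps the sample reproducible. Whether the authors combined attributes this way is not stated in the abstract.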

Paper Citation


in Harvard Style

Plaskowski D., Skwarek S., Grajewska D., Niemir M. and Ławrynowicz A. (2024). Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-680-4, SciTePress, pages 681-688. DOI: 10.5220/0012384900003636


in Bibtex Style

@conference{icaart24,
author={Dawid Plaskowski and Szymon Skwarek and Dominika Grajewska and Maciej Niemir and Agnieszka Ławrynowicz},
title={Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data},
booktitle={Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2024},
pages={681-688},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012384900003636},
isbn={978-989-758-680-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - Automating Opinion Extraction from Semi-Structured Webpages: Leveraging Language Models and Instruction Finetuning on Synthetic Data
SN - 978-989-758-680-4
AU - Plaskowski D.
AU - Skwarek S.
AU - Grajewska D.
AU - Niemir M.
AU - Ławrynowicz A.
PY - 2024
SP - 681
EP - 688
DO - 10.5220/0012384900003636
PB - SciTePress