Authors:
Carlos Rocha; Jonatas Grosman; Fernando Correia; Venicius Rego and Hélio Lopes
Affiliation:
Department of Informatics, PUC-Rio, Marquês de São Vicente, 225 RDC, 4th floor - Gávea, Rio de Janeiro, Brazil
Keyword(s):
Data Annotation, Large Language Model, Visual Question-Answering, Documents, Machine Learning.
Abstract:
Documents are crucial to economic and academic systems, yet extracting information from them can be complex and time-consuming. Visual Question Answering (VQA) models address this challenge by using natural language prompts to extract information. However, their development depends on annotated datasets, which are costly to produce. To address this challenge, we propose a four-step process that combines Computer Vision models and Large Language Models (LLMs) for VQA data annotation in financial reports. The method starts with Document Layout Analysis and Table Structure Extraction to identify document structures. It then uses two distinct LLMs to generate and evaluate question-answer pairs, automating the construction and selection of the best pairs for the final dataset. As a result, we found Mixtral-8x22B and GPT-4o mini to be the most cost-effective for generating pairs, while Claude 3.5 Sonnet performed best for evaluation, aligning closely with human assessments.
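Illustrative note (not part of the paper): the abstract describes a generate-then-evaluate annotation loop over extracted document regions. The minimal Python sketch below only illustrates that flow under stated assumptions; extract_regions, generator_llm, evaluator_llm, and the 0.8 acceptance threshold are hypothetical stand-ins, not the authors' implementation or any specific model API.

    # Sketch of a two-LLM generate-and-evaluate annotation loop (hypothetical stand-ins throughout).
    from dataclasses import dataclass

    @dataclass
    class QAPair:
        question: str
        answer: str
        score: float = 0.0

    def extract_regions(document_path):
        # Placeholder for Document Layout Analysis + Table Structure Extraction.
        return [{"type": "table", "text": "Revenue 2023: 1.2M"}]

    def generator_llm(region):
        # Placeholder for the generation LLM; returns candidate question-answer pairs for a region.
        return [QAPair(question="What was the revenue in 2023?", answer="1.2M")]

    def evaluator_llm(pair, region):
        # Placeholder for the evaluation LLM; returns a quality score for a candidate pair.
        return 0.9

    def build_dataset(document_path, threshold=0.8):
        dataset = []
        for region in extract_regions(document_path):
            for pair in generator_llm(region):
                pair.score = evaluator_llm(pair, region)
                if pair.score >= threshold:  # keep only pairs the evaluator rates highly
                    dataset.append(pair)
        return dataset

    if __name__ == "__main__":
        print(build_dataset("report.pdf"))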