Authors:
Stephen Price¹ and Danielle L. Cote²
Affiliations:
¹Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, U.S.A.; ²Department of Mechanical and Materials Engineering, Worcester Polytechnic Institute, Worcester, MA 01609, U.S.A.
Keyword(s):
Natural Language Processing, Large Language Models, Document Analysis, Decision-Making, Bias, Reproducibility, Nondeterminism.
Abstract:
In recent years, large language models (LLMs) have demonstrated the ability to perform complex tasks such as data summarization, translation, document analysis, and content generation. However, their reliability and efficacy in real-world scenarios remain understudied. This work presents an experimental evaluation of an LLM for document analysis and candidate recommendation using a set of resumes. Llama3.1, a state-of-the-art open-source model, was tested with 30 questions using data from five resumes. On tasks with a direct answer, Llama3.1 achieved an accuracy of 99.56%. However, on more open-ended and ambiguous questions, performance and reliability decreased, revealing limitations such as bias toward particular experience, primacy bias, nondeterminism, and sensitivity to question phrasing.