experience. Nananukul et al. proposed a framework
for enhancing NPC dialogue with a more narrative-
driven approach using large language models,
focusing on games like Final Fantasy VII and
Pokémon. The goal is to enable NPCs to respond with
personality-appropriate reactions and tone in specific
scenarios (Nananukul & Wongkamjan, 2024). By
gathering character details, situational descriptions,
skills, and personality traits, a knowledge graph is
created to structure the data of game characters and
scenarios. Tailored prompt templates are then
developed for different games and characters,
providing character personality, describing specific
contexts, and outlining the dialogue goals and style.
The relevant information from the knowledge graph
is incorporated into the prompts, improving the
contextual relevance of the model’s generated
dialogues. In Final Fantasy VII, battle scenario
dialogues were tested, requiring characters to
generate dynamic responses based on the battle state,
such as enemy health or skill usage. In Pokémon,
NPC dialogues were tested, with Red being given
different personalities (e.g., talkative, confident) to
evaluate the diversity of the generated responses. The
results indicate that the quality of dialogue generation
is relatively high, with GPT-4 being able to
understand the character's behavior and reactions in
specific contexts and generate reasonable and natural
dialogues. The model accurately expresses simple personalities, such as talkative or shy. However, its ability to convey more complex traits, such as maturity or introversion, is limited: the generated content can seem overly positive or superficial when nuance is required. The model also sometimes produces repetitive dialogue or an overly cheerful tone that conflicts with a character's established traits, such as Cloud's cold personality.
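The pipeline described above, gathering character facts into a knowledge graph and filling a prompt template with the retrieved details, can be sketched as follows. This is a minimal illustration: the character record, template wording, and function name are hypothetical, not taken from the original framework.

```python
# Toy "knowledge graph": each character node holds the traits, skills,
# and game facts gathered during data collection. The entries below are
# illustrative stand-ins, not data from Nananukul & Wongkamjan (2024).
CHARACTERS = {
    "Cloud": {
        "game": "Final Fantasy VII",
        "personality": ["cold", "reserved"],
        "skills": ["Braver", "Cross-Slash"],
    },
}

# Tailored prompt template: character personality, specific context,
# and the dialogue goal/style, as the framework describes.
PROMPT_TEMPLATE = (
    "You are {name} from {game}. "
    "Personality traits: {traits}. Known skills: {skills}.\n"
    "Current scenario: {scenario}\n"
    "Respond in character, matching the stated personality and tone."
)

def build_prompt(name: str, scenario: str) -> str:
    """Fill the template with facts retrieved from the character node."""
    node = CHARACTERS[name]
    return PROMPT_TEMPLATE.format(
        name=name,
        game=node["game"],
        traits=", ".join(node["personality"]),
        skills=", ".join(node["skills"]),
        scenario=scenario,
    )

prompt = build_prompt("Cloud", "An enemy at 20% health uses a fire skill.")
print(prompt)
```

The resulting string would then be sent to the language model; incorporating the retrieved graph facts into the prompt is what grounds the generated dialogue in the character's established traits.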
Huang analyzed the performance of GPT-4 and
GPT-3.5 Turbo, using GPT-4 for NPC
dialogue generation in RPG games, leveraging its
large number of parameters and enhanced context
window to generate more natural, coherent, and
highly contextually relevant text (Huang, 2024). By
providing the model with the game's background and
NPC character settings, and by capturing dynamic
in-game information, the system allows each NPC's
dialogue to better reflect its background and the
game's current state.
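The context-assembly step described above can be sketched as follows: static background and NPC settings go into a system message, while dynamic game state is injected each turn. The message format follows the common chat-completion convention; all names and data here are illustrative assumptions, not details from Huang (2024).

```python
# Static settings, fixed when the game loads (illustrative values).
GAME_BACKGROUND = "A frontier town beset by bandits."
NPC_PROFILE = {"name": "Mira", "role": "blacksmith", "mood": "gruff"}

def build_messages(dynamic_state: dict, player_line: str) -> list[dict]:
    """Combine static settings with the current game state for one turn."""
    system = (
        f"Game background: {GAME_BACKGROUND}\n"
        f"You play {NPC_PROFILE['name']}, the town {NPC_PROFILE['role']} "
        f"(default mood: {NPC_PROFILE['mood']}).\n"
        f"Current game state: {dynamic_state}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": player_line},
    ]

messages = build_messages(
    {"time": "night", "quest_stage": "bandits_defeated"},
    "Any work for me?",
)
# `messages` would then be passed to a chat-completion API call;
# no network request is made in this sketch.
```

Rebuilding the system message each turn is what lets the dialogue track the game's current state rather than only the NPC's static profile.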
Chubar designed an RPG game and used GPT to
generate procedural content, creating a game with
rich narratives and dynamic content. The study
compared GPT-3.5 Turbo and GPT-4 by measuring the
number of worlds and total units each model generated
under two temperature settings (Chubar, 2024). With a
temperature of 0.6, GPT-3.5 Turbo generated 15
worlds, totaling 489 units, with an average of
approximately 32.6 units per world. In comparison,
GPT-4 generated 5 worlds, totaling 210 units, with an
average of 42 units per world, showing a higher unit
density. At a temperature of 1.2, the number of
generated worlds was unchanged, but GPT-3.5 Turbo
produced fewer total units (437), suggesting that a
higher temperature may increase content diversity at
the cost of unit density. GPT-4 likewise generated
fewer total units (195), with a slight drop in the
average number of units per world (39). GPT-4's
higher unit density at the lower temperature of 0.6
(42 units per world) indicates that its output
surpasses GPT-3.5 Turbo's in control and detail
richness. As the temperature rose from 0.6 to 1.2,
GPT-4's unit count and density decreased only
slightly and remained relatively high, showing that
it can preserve quality while increasing generation
diversity, whereas GPT-3.5 Turbo's output declined
more sharply.
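The per-world densities quoted above follow directly from the reported world and unit counts; a short script makes the arithmetic explicit (the counts are taken from the comparison, while the dictionary layout is our own):

```python
# (worlds, total_units) per model and temperature, as reported.
results = {
    ("GPT-3.5 Turbo", 0.6): (15, 489),
    ("GPT-4", 0.6): (5, 210),
    ("GPT-3.5 Turbo", 1.2): (15, 437),
    ("GPT-4", 1.2): (5, 195),
}

# Unit density = total units / number of worlds.
density = {key: units / worlds for key, (worlds, units) in results.items()}

for (model, temp), d in sorted(density.items()):
    print(f"{model} @ T={temp}: {d:.1f} units/world")
```

Note that GPT-3.5 Turbo's density drops from about 32.6 to about 29.1 units per world as the temperature rises, a proportionally larger decline than GPT-4's drop from 42 to 39.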
4.2 Task Description
LLMs can also be used to generate task descriptions.
Given their strong text generation capabilities,
using LLMs for quest generation can significantly
reduce developers' workload: by supplying the game's
background and relevant content, high-quality task
descriptions can be produced. Värtinen et al.
explored the
application of GPT-2 and GPT-3 in generating task
descriptions for Role-Playing Games (RPGs). As
players' demand for game content continues to rise,
developers face the challenge of manually designing
tasks. The paper proposes an improved version of
GPT-2 called Quest-GPT-2, designed to
automatically generate RPG task descriptions
(Värtinen, Hämäläinen, & Guckelsberger, 2022). The
study evaluated the model’s performance through
calculated metrics and a large-scale user study
involving 349 players. Task data was extracted from
six classic RPG games—Baldur’s Gate 1 & 2,
Oblivion & Skyrim, Torchlight II, and Minecraft. The
fine-tuned GPT-2 model, Quest-GPT-2, and GPT-3
were used to generate 500 task descriptions. These
descriptions were then rated by the 349 RPG players
based on three criteria: "Does the task description
match the game task?", "Is the text coherent and
logical?", and "Does it match the RPG task style?".
The results showed that the task descriptions