A Multi-Stage Approach to Image Consistency in Zero-Shot Character
Art Generation for the D&D Domain
Gayashan Weerasundara (https://orcid.org/0000-0003-1419-8938) and Nisansa de Silva (https://orcid.org/0000-0002-5361-4810)
Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka
Keywords:
Machine Learning, Image Generation, Consistency, Dungeons & Dragons.
Abstract:
In the evolving landscape of computer graphics, the pursuit of consistency in image generation remains a
compelling challenge. This paper delineates a comprehensive methodology that harnesses the capabilities of
the Stable Diffusion (SD) model, an adept open-source image generation tool, to generate consistent and high-
fidelity imagery. Beginning with the training of a generalized art style for the popular tabletop role-playing game Dungeons & Dragons, our work proceeds through specific character generation and the creation of exhaustive character sheets, culminating in the generation of consistent character images across various poses.
1 INTRODUCTION
The field of computer graphics has recently been transformed by new image generation techniques that emphasize realism and consistency, both of which are crucial in gaming, animation, and virtual reality. This is particularly rele-
vant for the Dungeons & Dragons (D&D) universe,
where consistent, high-fidelity character images are
vital to maintaining its authentic high-fantasy (Peiris
and de Silva, 2022) atmosphere. However, most re-
cent studies (Peiris and de Silva, 2023; Weerasundara
and de Silva, 2023) focused on natural language pro-
cessing aspects of the game rather than image gener-
ation. Recent strides in diffusion models, especially the Stable Diffusion (SD) model (Rombach et al., 2022), which is noted for consistently producing high-quality images (Ho et al., 2020), have not yet been fully explored in the context of D&D character generation.
Our paper explores using the SD model to cre-
ate consistent, high-quality D&D character images.
We discuss training a generalized art style, transition-
ing to character generation, and producing character
sheets and consistent poses. Our approach combines
recent research with the SD model’s capabilities, aim-
ing to advance computer graphics applications, partic-
ularly in the D&D realm. We detail our methodology,
compare it with existing techniques, and analyze our
results. Our goal is to present a new approach to com-
puter graphics, focusing on the D&D universe.
2 RELATED WORK
In the evolving field of text-to-image synthesis and
story visualization, several key studies have made
notable contributions. StoryDALL-E by Maharana
et al. (2022) is pivotal, focusing on enhancing large
transformers for story visualization. They introduced
story continuation and evaluated it on datasets such
as PororoSV (Li et al., 2019), FlintstonesSV (Gupta
et al., 2018), and DiDeMoSV (Anne Hendricks et al.,
2017), achieving success in sequential image genera-
tion.
Jeong et al. (2023) present a novel method for creating storybooks from text using diffusion models. Rahman et al. (2023) contribute to generating visually consistent stories based on visual memory. Pan et al. (2022) explore coherent story creation using latent diffusion models. The work of Cho et al. (2023) delves into the reasoning and biases of generative transformers. Esser et al. (2021) focus on high-resolution image synthesis using transformers, while Liang et al. (2019) discuss content parsing in text-to-image synthesis through CPGAN.
Other studies have also advanced the field, includ-
ing Maharana et al. (2021)’s work on semantic con-
sistency in visual stories, Frans et al. (2022)’s explo-
ration of text-to-drawing synthesis, and Wang et al.
(2018)’s development of high-resolution image syn-
thesis models. These studies collectively form the
foundation of our research, influencing our approach
and objectives in text-to-image synthesis and story vi-
sualization.
3 METHODOLOGY
The process of generating consistent characters can be broken down into five main steps.
3.1 Training a General Art Style
Our research began by embedding a generalized art
style into the diffusion model, crucial for setting the
aesthetic for later stages. We collected a diverse
dataset through web scraping and from official D&D
books, focusing on the unique art style of D&D—a
mix of realism and fantasy.
We curated the dataset to ensure each image matched our aesthetic goal: a look resembling official D&D art that captures the essence of oil and watercolour paintings. This curation was vital for training our model to create an aesthetic embedding for style transfer, which is central to our research.
Figure 1: Comparing official images, style embedding, and
generated images using various artists as basis.
Figure 1 demonstrates the effectiveness of our
training. It compares official and fan-made D&D im-
ages with those generated by our model, using various
artist styles. Our approach avoided negative prompts,
allowing for a natural representation of the chosen
style. The results show our method’s efficiency and
the potential of merging artistic elements with com-
putational techniques.
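For illustration, the listing below gives a minimal sketch of how such a learned style embedding could be applied at inference time with the Hugging Face diffusers library. The base model id, the embedding file name dnd_style.safetensors, and the token <dnd-style> are placeholders rather than the exact artefacts used in our pipeline.

```python
# Minimal sketch: applying a learned D&D style embedding at inference time.
# Assumes a textual-inversion embedding exported as "dnd_style.safetensors"
# with the placeholder token "<dnd-style>"; both names are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the style embedding so "<dnd-style>" becomes usable in prompts.
pipe.load_textual_inversion("dnd_style.safetensors", token="<dnd-style>")

image = pipe(
    "portrait of a tiefling sorcerer, <dnd-style>",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("tiefling_sorcerer.png")
```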
3.2 Training a Specific Character
Generation
Our work then homed in on specific character image generation with the Stable Diffusion (SD) model, utilizing two advanced fine-tuning techniques: DreamBooth (Ruiz et al., 2023) and LoRA (Low-Rank Adaptation) (Hu et al., 2021), each known for its image generation capabilities. We crafted a curated
dataset from web-scraped images and free official
D&D sources such as dndBeyond, focusing on two
D&D characters: Tasha from Tasha’s Cauldron of
Everything (Crawford et al., 2020), and Strahd from
Curse of Strahd (Perkins et al., 2016).
Figure 2: Comparing official character art, and images gen-
erated from LoRA, DreamBooth, and the base model.
The DreamBooth technique was applied directly to the images for training. The LoRA approach, in contrast, required a detailed caption for each image, with the character itself omitted and replaced by a keyword.
For this, we used the BLIP (Li et al., 2022) image cap-
tioning system to create preliminary captions, which
were then carefully refined to align with the images.
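The preliminary captioning step can be sketched as follows with the transformers implementation of BLIP; the checkpoint and the trigger keyword tasha_character are illustrative assumptions.

```python
# Sketch: generating preliminary captions with BLIP, to be refined by hand.
# The checkpoint and the trigger keyword below are illustrative choices.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(path: str, keyword: str = "tasha_character") -> str:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    text = processor.decode(out[0], skip_special_tokens=True)
    # Prepend the unique keyword that stands in for the character itself;
    # the caption is then manually edited to remove character-specific terms.
    return f"{keyword}, {text}"

print(caption("tasha_01.png"))
```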
Figure 3: Character sheets generated by a given prompt using ControlNet.
Figure 4: Character sheets generated using the trained LoRA.
Figure 2 offers a visual representation of our endeavours. It juxtaposes the images generated using
LoRA, DreamBooth, and the base model against the
official character art of Tasha and Strahd. The results were illuminating: the LoRA model, in particular, stood out, producing images that most closely matched the official art.
3.3 Character Sheet Generation
Leveraging a generated character image, our objec-
tive was to create a comprehensive character sheet us-
ing a specialized LORA and a ControlNet (Zhang and
Agrawala, 2023).
3.3.1 ControlNet
We selected controlnet-openpose-sdxl-1.0 (https://bit.ly/3RKSPNa) due to its bias towards generating full-body images. First, we generated an image of a character; then, keeping the seed value, prompt, scale, and negative embedding consistent, we provided a reference pose sheet to
the ControlNet. The system then extrapolated this im-
age, generating a detailed character sheet inclusive of
different poses and variations.
Figure 3 shows character sheets generated using ControlNet. The generated images depict specific characters in various poses on a blank background. The first image shows a sample reference sheet.
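A condensed sketch of this ControlNet step, written against the diffusers API, is shown below; the ControlNet repository path, file names, and prompt wording are placeholders rather than our exact configuration.

```python
# Sketch: generating a character sheet from a pose reference with ControlNet.
# The ControlNet repository path and file names are illustrative placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "path/to/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose_sheet = Image.open("reference_pose_sheet.png")  # multi-pose OpenPose render
# Fix the seed so the same character is reproduced across runs.
generator = torch.Generator(device="cuda").manual_seed(1234)

sheet = pipe(
    prompt="tasha_character, full body, character sheet, white background",
    negative_prompt="distorted, extra limbs",
    image=pose_sheet,
    generator=generator,
    num_inference_steps=30,
).images[0]
sheet.save("character_sheet.png")
```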
3.3.2 Training a LoRA
While ControlNet allows the generation of character sheets, it is not consistent: random generations can easily produce distorted images. To make the generation more consistent, we need to embed the concept of a character sheet into the generated images. For that, we used a LoRA model. Training a LoRA requires a labelled dataset of reference images. We scraped Google Images with keywords such as "character sheet" and "reference sheet" and manually filtered out undesired images. The resulting images were captioned with the WD14 tagger (https://bit.ly/3RuKhc7) to obtain preliminary labels. Next, tags common to most of the images, such as white background, character sheet, reference sheet, and turnover, are removed, and a common unique identifier is added as a trigger word. Finally, a LoRA for that specific character can be trained on the labelled image dataset. Results for the character-sheet LoRA can be seen in Figure 4.
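The tag post-processing described above can be sketched as a short script; the directory layout, trigger word, and the 80% threshold for "common" tags are illustrative assumptions.

```python
# Sketch: post-processing WD14 tag files for LoRA training.
# Assumes one ".txt" caption per image with comma-separated tags;
# the directory and trigger word are illustrative.
from collections import Counter
from pathlib import Path

DATASET_DIR = Path("dataset/character_sheets")
TRIGGER_WORD = "charsheet_style"
COMMON_THRESHOLD = 0.8  # a tag counts as "common" if it appears in >= 80% of captions

caption_files = sorted(DATASET_DIR.glob("*.txt"))
captions = [
    [t.strip() for t in f.read_text().split(",") if t.strip()]
    for f in caption_files
]

# Tags shared by most images (e.g. "white background", "reference sheet")
# carry the concept we want the trigger word to absorb, so remove them.
counts = Counter(tag for tags in captions for tag in set(tags))
common = {t for t, c in counts.items() if c / len(captions) >= COMMON_THRESHOLD}

for f, tags in zip(caption_files, captions):
    kept = [t for t in tags if t not in common]
    f.write_text(", ".join([TRIGGER_WORD] + kept))
```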
3.4 Separation and Training for Specific
Character
Upon obtaining the character sheet, our focus shifted towards extracting undistorted and suitable images that showcase the character in a variety of poses. This extraction process is crucial, as it provides the necessary
training data for developing a dedicated model capa-
ble of generating images of the character in specific,
predefined poses.
3.4.1 Image Processing for Segmentation
The initial step in this process involved a series of
image-processing techniques aimed at isolating each
image from the character sheet. The image was first
padded to ensure that the boundaries of each character
were well-defined. To reduce artifacting in the subse-
quent steps, Gaussian blur was applied to the image.
Following this, the image was converted to a single
colour channel, and Otsu thresholding was applied to
create a binary mask. This mask was then overlaid on
top of the padded image with a low alpha value, cre-
ating a masking effect that highlights each character
distinctly. Figure 5 illustrates the sequence of image
processing techniques employed to segment the im-
ages on the character sheet.
Figure 5: Sequence of image processing techniques em-
ployed for segmenting images.
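A condensed sketch of this masking sequence using OpenCV is given below; the padding size, blur kernel, and overlay alpha are illustrative values rather than our exact settings.

```python
# Sketch: pad, blur, threshold (Otsu), and overlay the mask on the sheet.
# Kernel size, padding, and alpha are illustrative values.
import cv2

sheet = cv2.imread("character_sheet.png")

# Pad so character boundaries never touch the image border.
padded = cv2.copyMakeBorder(sheet, 32, 32, 32, 32,
                            cv2.BORDER_CONSTANT, value=(255, 255, 255))

# Gaussian blur reduces artifacting before thresholding.
blurred = cv2.GaussianBlur(padded, (5, 5), 0)

# Single colour channel + Otsu thresholding gives a binary mask of the characters.
gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Overlay the mask with a low alpha to highlight each character distinctly.
overlay = cv2.addWeighted(padded, 0.85,
                          cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR), 0.15, 0)
cv2.imwrite("masked_sheet.png", overlay)
```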
3.4.2 Object Detection for Image Separation
Following the image processing, a fine-tuned YOLOv8 (Jocher et al., 2022) object detection model was employed as a segmentation model to separate the masked images. The model was adept at identifying and isolating the individual characters highlighted by the mask; the detected regions of interest were then mapped back onto the original image to extract the character representations in various poses.
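The detection-and-cropping step can be sketched with the ultralytics API as follows; the fine-tuned weights file name is an assumption.

```python
# Sketch: detecting each pose on the masked sheet with a fine-tuned YOLOv8
# model and cropping the corresponding region from the original sheet.
# The weights file name is an assumption.
import cv2
from ultralytics import YOLO

model = YOLO("charsheet_yolov8.pt")            # fine-tuned detector weights

# Re-apply the same padding used during masking so box coordinates line up.
original = cv2.imread("character_sheet.png")
padded = cv2.copyMakeBorder(original, 32, 32, 32, 32,
                            cv2.BORDER_CONSTANT, value=(255, 255, 255))

results = model("masked_sheet.png")[0]
for i, box in enumerate(results.boxes.xyxy.cpu().numpy().astype(int)):
    x1, y1, x2, y2 = box
    cv2.imwrite(f"pose_{i:02d}.png", padded[y1:y2, x1:x2])
```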
3.4.3 Model Training and Image Generation
The segmented images served as the training data for
a dedicated model, designed to generate images of
the character in various predefined poses. The train-
ing process was meticulous, ensuring that the model
learned the nuances of the character’s appearance and
pose variations. The trained model demonstrated pro-
ficiency in generating consistent, high-quality images
that accurately represented the character in the speci-
fied poses.
3.5 Generating Consistent Characters
Using the Trained Model
The culmination of our research efforts was the generation of consistent character images using the trained model. We used a combination of trained LoRA models alongside a style embedding trained on D&D images to achieve consistent character generation.
The primary objective was to ensure that the gen-
erated images maintained a high degree of consis-
tency in terms of character details, pose accuracy, and
overall visual quality. Using the trained model, we
tested multiple prompts under different seed values.
The seed values were varied to introduce randomness
in the generation process, ensuring that the model’s
robustness and consistency were thoroughly tested.
Each prompt was designed to elicit a specific response
from the model, be it a particular pose, expression, or
background setting.
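A condensed sketch of this seed-sweep test with diffusers is shown below; the LoRA file name, prompts, and seed values are placeholders, and loading of the style embedding is omitted for brevity.

```python
# Sketch: testing prompt/seed combinations with the trained character LoRA
# loaded. All file names, prompts, and seeds are illustrative.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights(".", weight_name="tasha_character_lora.safetensors")

prompts = [
    "tasha_character casting a spell in a dark forest",
    "tasha_character sitting in a tavern, smiling",
]
for prompt in prompts:
    for seed in (7, 42, 1234):   # varied seeds probe robustness and consistency
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator, num_inference_steps=30).images[0]
        image.save(f"{seed}_{prompt[:20].replace(' ', '_')}.png")
```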
The generated images showcased remarkable con-
sistency as shown in Figure 6. The characters were
homogeneous in style, and the poses were accurate,
confirming the efficacy of our training process. Fur-
thermore, the visual quality of the images was high,
with sharp details and vibrant colours, making them
suitable for professional use in storyboards, anima-
tions, and other media.
3.6 Saving Context for Re-Usability
A pivotal aspect of our methodology is the incor-
poration of a mechanism for saving context, which
significantly enhances the re-usability of the trained
LORA models. This mechanism is designed to cap-
ture and store essential contextual information during
the training of a LORA model, thereby facilitating its
subsequent utilization in generating images that align
with specific character descriptions.
3.6.1 Context Capturing and Storage
During the training of a LORA model, images are
captioned using the WD14 tagger. This process re-
sults in the generation of a set of tags that describe the
common features across every image, such as white
hair, old man, white beard, which are indicative of the
character’s attributes. These common tags are then
replaced with a single unique trigger word, serving as
an identifier for the character’s context.
Figure 6: Examples showcasing the consistency in gener-
ated characters.
The captured context, comprising the trigger
word, the common words (tags), and the trained
LORA SafeTensor, is systematically stored in a ded-
icated database, referred to as the context database
(context db). This structured storage enables the ef-
ficient retrieval and re-use of context for generating
consistent character images.
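One possible minimal layout for such a context db is sketched below with SQLite; the table and column names are our own illustration rather than the exact implementation.

```python
# Sketch: a minimal context db storing trigger word, common tags, and the
# path to the trained LoRA SafeTensor. Schema names are illustrative.
import sqlite3

conn = sqlite3.connect("context.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS character_context (
           trigger_word TEXT PRIMARY KEY,
           common_tags  TEXT NOT NULL,   -- comma-separated tags
           lora_path    TEXT NOT NULL    -- trained LoRA SafeTensor file
       )"""
)
conn.execute(
    "INSERT OR REPLACE INTO character_context VALUES (?, ?, ?)",
    ("strahd_character", "white hair, old man, white beard",
     "loras/strahd_character.safetensors"),
)
conn.commit()
```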
Figure 7: Workflow illustrating the process of context re-
usability.
3.6.2 Context Re-Usability Workflow
Figure 7 depicts the workflow for context re-usability.
Upon receiving text input, the system evaluates
whether the text contains similarities to the common
words stored in the context db. As demonstrated in our previous study on extracting named entities for the D&D domain (Weerasundara and de Silva, 2023), it is possible to identify whether a named character is included in the prompt and match it against the context db. If a match is found, the cor-
responding LORA model, identified by the trigger
word, is utilized to generate images that adhere to
the character’s context. This approach eliminates the
need for re-training a LORA model for known con-
texts, thereby optimizing the image generation pro-
cess.
In scenarios where the text input does not match
any stored context, the system follows the established
pipeline to train a new LORA model. The newly ac-
quired context is then added to the context db, ex-
panding the repository of reusable contexts for future
image generation tasks.
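The matching step of this workflow can be sketched as follows; the naive entity extractor below is only a stand-in for the NER approach of our earlier work, and the schema matches the illustrative context db above.

```python
# Sketch of the context re-use decision in Figure 7: match the prompt against
# stored contexts, otherwise fall back to the LoRA-training pipeline.
# `extract_entities` is a naive stand-in for a proper NER model.
import sqlite3

def extract_entities(prompt: str) -> set[str]:
    # Stand-in: treat capitalized words as candidate named entities.
    return {w.strip(",.").lower() for w in prompt.split() if w[:1].isupper()}

def find_context(prompt: str, db_path: str = "context.db"):
    """Return (trigger_word, lora_path) if the prompt matches a stored context."""
    entities = extract_entities(prompt)
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT trigger_word, common_tags, lora_path FROM character_context"
    )
    for trigger, tags, lora in rows:
        known = {t.strip().lower() for t in tags.split(",")} | {trigger.lower()}
        if any(e in k or k in e for e in entities for k in known):
            return trigger, lora       # re-use the stored LoRA context
    return None                        # no match: train a new LoRA, then store it

print(find_context("Strahd brooding in his castle at night"))
```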
3.6.3 Implications and Efficiency
The implementation of context saving and re-usability
significantly contributes to the efficiency and versatil-
ity of our image generation system. By leveraging
stored contexts, the system can rapidly generate im-
ages that are contextually consistent with known char-
acter descriptions, thereby reducing computational
overhead and response time. Furthermore, the contin-
uous expansion of the context db ensures the adapt-
ability of the system to diverse character contexts,
making it a valuable asset in applications such as
gaming, animation, and virtual reality.
4 EXPERIMENTAL SETUP
In this study, we meticulously designed an experi-
mental framework to validate the efficacy of our pro-
posed methodology. Figure 8 illustrates the com-
prehensive pipeline of our approach, encompassing
the initial character generation, training of character-
specific LORA, and the final generation of consistent
images.
Our experimental setup was configured to en-
sure the reproducibility and reliability of the results.
The models were trained on an RTX 4090 GPU with 24 GB of VRAM, 61 GB of RAM, and 16 vCPUs on the cloud service runpod.io (https://www.runpod.io/), using the Kohya_ss GUI (https://github.com/bmaltais/kohya_ss). The parameters were fine-tuned, and multiple iterations were conducted to optimize the performance of the Stable Diffusion (SD) model and the LoRA model.
Figure 8: Architecture of the pipeline used for the system.
5 RESULTS
The results of our experiments were promising, show-
casing the capability of our methodology in generat-
ing high-fidelity and consistent character images. The
generated images exhibited remarkable consistency in
style, pose accuracy, and visual quality, adhering to
the predefined artistic style of the D&D universe.
To validate the effectiveness of our approach, we
conducted a comparative study against established
benchmarks, focusing on the structural similarity in-
dex measure (SSIM) (Zermani et al., 2021). SSIM is
a widely recognized metric for comparing the simi-
larity between two images, providing insights into the
perceptual changes between the generated images and
the ground truth.
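For reference, an SSIM score between a generated image and its official counterpart can be computed as sketched below with scikit-image; the file names are placeholders, and images are compared in grayscale for simplicity.

```python
# Sketch: SSIM between an official image and a generated one (file names are
# placeholders). The generated image is resized to match before comparison.
import cv2
from skimage.metrics import structural_similarity as ssim

official = cv2.imread("official_tasha.png", cv2.IMREAD_GRAYSCALE)
generated = cv2.imread("generated_tasha.png", cv2.IMREAD_GRAYSCALE)
generated = cv2.resize(generated, (official.shape[1], official.shape[0]))

score = ssim(official, generated)
print(f"SSIM: {score:.3f}")   # 1.0 means identical images
```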
Figure 9 presents the average SSIM comparison between the official images, baseline generations, and images generated using the trained model for each category of images. The comparison indicates greater consistency between the official art and the trained models than between the official art and the base models.
Figure 9: SSIM comparison between ground truth, images generated from the trained LoRA, and the base model.
In our study, we evaluated the consistency and D&D style adherence of generated images through an online survey with 291 participants. They rated the
consistency of characters across multiple images and
the stylistic alignment with traditional D&D imagery
on a scale of 1 to 10.
Figure 10: Density Distribution of Feedback on Generated
Images.
The density plot in Figure 10 synthesizes these ratings. The x-axis shows character-consistency ratings, the y-axis indicates style adherence, and darker areas represent a higher consensus among participants.
The results demonstrate that our method effectively
produces images that are both consistent in character
depiction and faithful to the D&D style, as indicated
by the concentration of responses in the upper regions
of the plot.
The plot reveals a notable clustering of responses
in the high-density regions, suggesting a collective
agreement among participants. The results indicate
a significant trend wherein our image generation ap-
proach has successfully maintained a high level of
consistency in character portrayal and adherence to
Table 1: Training configurations for LoRA models trained for the proposed pipeline.
Parameter | Pre-trained Character LoRA | Zero-Shot Character LoRA | D&D Style LoRA | Character-Sheet LoRA
Number of Images | 40 - 60 | 4 - 12 | 676 | 488
Repeats | 10 - 15 | 20 - 40 | 2 | 4
Epochs | 15 - 25 | 20 - 30 | 12 | 15
Precision | bf16 | bf16 | bf16 | bf16
Learning Rate | 0.0001 | 0.0001 | 0.0003 | 0.0002
Warmup | 0 | 0 | 0 - 5% | 0 - 5%
Resolution | 1024x1024 | 1024x1024 | 1024x1024 | 1024x1024
Optimizer | Adafactor | Adafactor | Adafactor | Adafactor
Batch Size | 2 - 4 | 1 | 4 | 4
Gradient Checkpointing | True | True | True | True
Xformers | False | False | False | False
the D&D style. The darker regions of the plot signify
a concentration of responses indicating both high con-
sistency and high stylistic fidelity, thereby affirming
the effectiveness of our generative model in produc-
ing images that align with the expectations of D&D
imagery.
We also conducted a comparison using the LLaVA (Liu et al., 2023) model and the FID (Heusel et al., 2017) score. LLaVA is an open-source multimodal model that combines vision capabilities with LLM capabilities, making it a suitable agent for comparative tasks. Given a series of images combined into a single image, the LLaVA model can be asked to provide a similarity score (1-10), with an explicit instruction to output only the score. The averages of the resulting scores are reported in Table 2.
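A rough sketch of such a scoring query, written against the transformers LLaVA interface, is given below; the checkpoint id, prompt wording, and the pre-combined comparison image are assumptions on our part.

```python
# Rough sketch: asking LLaVA for a 1-10 similarity score on a combined image.
# The checkpoint id, prompt wording, and input image are assumptions.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

combined = Image.open("official_vs_generated.png")  # images placed side by side
prompt = ("USER: <image>\nRate the similarity of the characters shown in these "
          "images on a scale of 1-10. Reply with only the score. ASSISTANT:")

inputs = processor(images=combined, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
print(processor.decode(output[0], skip_special_tokens=True))
```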
Table 2: Comparison between stable diffusion image to
image conversion, InstructPix2Pix (Brooks et al., 2022)
method and the proposed method.
Metric | Image to Image | InstructPix2Pix | Proposed Method
LLaVA (Liu et al., 2023) similarity (higher is better) | 6.4 | 7.5 | 8.3
FID (Heusel et al., 2017) (lower is better) | 51.7 | 38.9 | 13.2
Table 2 shows that the proposed method outperforms both baselines, achieving the highest LLaVA similarity score and the lowest FID, indicating markedly more consistent generated images.
6 DISCUSSION
Our research focused on leveraging the Stable Dif-
fusion (SD) model to generate consistent and high-
fidelity images of characters within the Dungeons
& Dragons universe. We successfully utilized ad-
vanced techniques like LoRA and ControlNet to cre-
ate visually appealing and consistent character images
across different poses and scenarios, primarily target-
ing applications in gaming, animation, and virtual re-
ality. Comparative analysis using benchmarks such
as SSIM and human validation indicated that our ap-
proach maintained structural similarity with ground
truth and outperformed baseline models, demonstrat-
ing its potential in the field of computer graphics.
However, our research faced challenges in curat-
ing a diverse dataset for style embedding and charac-
ter sheet LoRA. Additionally, the generation of char-
acter sheets sometimes resulted in distortions due to
the high number of variations. Despite these chal-
lenges, our findings hold significant promise for fu-
ture developments in the image generation field.
7 CONCLUSION
In conclusion, our research delineated a comprehen-
sive approach for generating consistent and high-
fidelity character images using the Stable Diffusion
model. The results demonstrated the potential of our
methodology to contribute to the evolving landscape
of computer graphics. While challenges were encoun-
tered, the insights gained and the avenues opened for
future work make this research a valuable addition to
the field.
8 FUTURE WORK
The findings of our research open several avenues
for future exploration. By refining the dataset curation process, particularly through integrating the method proposed by Avrahami et al. (2023), we can more effectively select the images used for training, potentially improving both the quality and consistency of generated images. Additionally,
the application of our methodology to genres beyond
Dungeons & Dragons presents an exciting prospect.
REFERENCES
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J.,
Darrell, T., and Russell, B. (2017). Localizing mo-
ments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812.
Avrahami, O., Hertz, A., Vinker, Y., Arar, M., Fruchter,
S., Fried, O., Cohen-Or, D., and Lischinski, D.
(2023). The chosen one: Consistent characters
in text-to-image diffusion models. arXiv preprint
arXiv:2311.10093.
Brooks, T., Holynski, A., and Efros, A. A. (2022). In-
structpix2pix: Learning to follow image editing in-
structions. arXiv preprint arXiv:2211.09800.
Cho, J., Zala, A., and Bansal, M. (2023). Dall-eval: Probing
the reasoning skills and social biases of text-to-image
generation models. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
3043–3054.
Crawford, J., Dillon, D., Petrisor, B., Schneider, F. W., and
Teague, E. (2020). Tasha’s Cauldron of Everything.
Wizards of the Coast Publishing.
Esser, P., Rombach, R., and Ommer, B. (2021). Tam-
ing transformers for high-resolution image synthesis.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 12873–
12883.
Frans, K., Soros, L., and Witkowski, O. (2022). Clip-
draw: Exploring text-to-drawing synthesis through
language-image encoders. NeurIPS, 35:5207–5218.
Gupta, T., Schwenk, D., Farhadi, A., Hoiem, D., and Kemb-
havi, A. (2018). Imagine this! scripts to compositions
to videos. In Proceedings of the European conference
on computer vision (ECCV), pages 598–613.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
NeurIPS, 30.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion
probabilistic models. Advances in neural information
processing systems, 33:6840–6851.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). Lora: Low-rank
adaptation of large language models. arXiv preprint
arXiv:2106.09685.
Jeong, H., Kwon, G., and Ye, J. C. (2023). Zero-shot gener-
ation of coherent storybook from plain text story using
diffusion models. arXiv preprint arXiv:2302.03900.
Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon,
Y., Michael, K., Fang, J., Yifu, Z., Wong, C., Montes,
D., et al. (2022). ultralytics/yolov5: v7.0 - YOLOv5 SOTA realtime instance segmentation. Zenodo.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). Blip:
Bootstrapping language-image pre-training for unified
vision-language understanding and generation. In In-
ternational Conference on Machine Learning, pages
12888–12900. PMLR.
Li, Y., Gan, Z., Shen, Y., Liu, J., Cheng, Y., Wu, Y., Carin,
L., Carlson, D., and Gao, J. (2019). Storygan: A se-
quential conditional gan for story visualization. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 6329–
6338.
Liang, J., Pei, W., and Lu, F. (2019). Cpgan: full-
spectrum content-parsing generative adversarial net-
works for text-to-image synthesis. arXiv preprint
arXiv:1912.08562.
Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual in-
struction tuning. arXiv preprint arXiv:2304.08485.
Maharana, A., Hannan, D., and Bansal, M. (2021). Improv-
ing generation and evaluation of visual stories via se-
mantic consistency. arXiv preprint arXiv:2105.10026.
Maharana, A., Hannan, D., and Bansal, M. (2022).
Storydall-e: Adapting pretrained text-to-image trans-
formers for story continuation. In European Confer-
ence on Computer Vision, pages 70–87. Springer.
Pan, X., Qin, P., Li, Y., Xue, H., and Chen, W. (2022). Syn-
thesizing coherent story with auto-regressive latent
diffusion models. arXiv preprint arXiv:2211.10950.
Peiris, A. and de Silva, N. (2022). Synthesis and evalua-
tion of a domain-specific large data set for dungeons
& dragons. arXiv preprint arXiv:2212.09080.
Peiris, A. and de Silva, N. (2023). SHADE: semantic hy-
pernym annotator for Domain-Specific entities - DnD
domain use case. In 2023 IEEE 17th International
Conference on Industrial and Information Systems
(ICIIS), page 6, Peradeniya, Sri Lanka.
Perkins, C., Hickman, T., and Hickman, L. (2016). Curse
of Strahd. Wizards of the Coast.
Rahman, T., Lee, H.-Y., Ren, J., Tulyakov, S., Mahajan, S.,
and Sigal, L. (2023). Make-a-story: Visual memory
conditioned consistent story generation. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 2493–2502.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and
Ommer, B. (2022). High-resolution image synthesis
with latent diffusion models. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 10684–10695.
Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M.,
and Aberman, K. (2023). Dreambooth: Fine tuning
text-to-image diffusion models for subject-driven gen-
eration. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
22500–22510.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and
Catanzaro, B. (2018). High-resolution image synthe-
sis and semantic manipulation with conditional gans.
In CVPR, pages 8798–8807.
Weerasundara, G. and de Silva, N. (2023). Comparative
analysis of named entity recognition in the dungeons
and dragons domain.
Zermani, M., Larabi, M.-C., and Fernandez-Maloigne, C.
(2021). A comprehensive assessment of the structural
similarity index. Signal Processing: Image Commu-
nication, 99:116336.
Zhang, L. and Agrawala, M. (2023). Adding conditional
control to text-to-image diffusion models. arXiv
preprint arXiv:2302.05543.