Generating Videos from Stories Using Conditional GAN
Takahiro Kozaki, Fumihiko Sakaue and Jun Sato
Nagoya Institute of Technology, Nagoya 466-8555, Japan
{kozaki@cv., sakaue@, junsato@}nitech.ac.jp
Keywords:
Generative AI, Story, Multiple Sentences, Video, GAN, Captioning.
Abstract:
In this paper, we propose a method for generating videos that represent stories described in multiple sentences.
While research on generating images and videos from single sentences has been advancing, the generation of
videos from long stories written in multiple sentences has not been achieved. In this paper, we use adversarial learning on pairs of multi-sentence stories and videos to generate videos that replicate the flow of the stories. We also introduce a caption loss for generating videos that are more contextually aligned with the stories.
1 INTRODUCTION
In recent years, techniques for generating images from text have progressed significantly, making it possible to generate a high-quality image from a text description of its content (Rombach et al., 2022; Ramesh et al., 2021). Furthermore, research on video generation from text is also progressing (Ho et al., 2022), and it is becoming possible to generate short videos from text descriptions.
However, generating a long video that represents a long story composed of multiple sentences has not yet been achieved. If long story videos could be generated automatically from long stories, they could be used in a variety of fields such as movie production. Thus, in this paper, we propose a method for generating long videos from the multiple sentences of a story. This replicates the human ability to imagine and visualize the scenes described in the text while reading a novel.
For visualizing stories, Li et al. proposed Story-
GAN (Li et al., 2019), which takes the entire text of
a story as input and generates a sequence of keyframe
images that represent the overall flow of the story. The
StoryGAN successfully generates keyframe images
corresponding to each sentence in the story. However,
since only one keyframe image is generated from a
single sentence, the resulting image sequence can-
not represent the motions of characters and moving
scenes described in the sentence.
Thus, in this research, we propose a method for
generating long videos from input stories composed
of multiple sentences, as shown in Fig. 1.

Figure 1: Generating long videos from long story sentences. The proposed method generates long videos with scene changes according to the story sentences. Our method generates multiple short videos from multiple sentences in a story and then combines these videos to form a long video that represents the entire story.

In contrast to StoryGAN, which generates a single image
from each sentence, our method generates a short
video from each sentence and then combines these
videos to generate a long video representing the en-
tire story. In our network, we introduce a caption loss
that focuses on the appropriateness of regenerated sto-
ries from the generated videos, aiming to reproduce
long videos from multiple sentences more accurately.
Our network is trained and tested by using the Pororo
dataset (Li et al., 2019) as in the case of StoryGAN.
2 RELATED WORK
In the field of image generation, Generative Adversar-
ial Networks (GANs) have achieved significant suc-
cess (Goodfellow et al., 2014; Zhu et al., 2017; Brock
et al., 2019). Furthermore, the introduction of con-
ditional GANs (cGANs), which allow the generation
of images that adhere to specific conditions, has further improved the capabilities of GANs (Isola et al., 2017).

Figure 2: Network structure of the proposed method. Our story video generator consists of a GRU, Text2Gist (T2G), and a convolutional image generator (G_I). G_I generates N frames of continuous images from each sentence s_t. As a result, T × N story video images are generated from T sentences. The image discriminator D_I compares a generated image x̂_t^n with the sentence s_t, while the story video discriminator D_S compares the set of T × N images with the set of T sentences. The initial vector h_0 for T2G is obtained from the input story S by using the story encoder proposed in (Li et al., 2019).

Moreover, recent research is also advancing in
generating videos from text (Li et al., 2018). How-
ever, these GANs typically receive very short text in-
puts, often as short as a single sentence, and are not
capable of generating videos from longer texts, such
as stories composed of multiple sentences.
Research on text-based image generation has
made significant progress in recent years (Reed et al.,
2016; Ramesh et al., 2021; Rombach et al., 2022;
Ramesh et al., 2022). One major trend in im-
age generation from text is to combine well-trained
text encoders (Radford et al., 2021) with image
decoders (Ho et al., 2020; Dosovitskiy et al.,
2021), and generate an image representing multi-
ple pieces of text. There has been significant tech-
nological progress in the field of image generation,
and nowadays we can generate quite good high-
resolution images which represent the content of mul-
tiple texts (Ramesh et al., 2021; Rombach et al.,
2022).
Another important task in the field of text-based
image generation is generating videos from text in-
formation (Li et al., 2018; Ho et al., 2022). In this
field, research has been conducted to generate short
videos representing a single sentence. However, un-
like generating a single image, generating a video is
a very difficult problem. For this reason, it has not
yet been possible to generate a long video represent-
ing a long text such as a novel. Thus, generating long videos from long story sentences, such as novels, is a very challenging problem.
For visualizing long stories in multiple images, Li
et al. proposed StoryGAN (Li et al., 2019) which
generates multiple keyframe images from long story
sentences. In StoryGAN, the generator performs ad-
versarial learning with two discriminators, an image
discriminator and a story discriminator, and generates
a keyframe image from each sentence in the story. As
a result, it generates a set of keyframe images that
represent the long story. Furthermore, several im-
provements have been proposed based on this Sto-
ryGAN (Zeng et al., 2019; Li et al., 2020; Maha-
rana et al., 2021). Li et al. (Li et al., 2020) intro-
duced Weighted Activation Degree (WAD) for im-
proving the discriminator in StoryGAN. Maharana et
al. (Maharana et al., 2021) introduced a MART-based
transformer (Lei et al., 2020) to model the correlation
between the text and image in StoryGAN. They also
used video captioning to improve the quality of gen-
erated images.
However, in these methods, only a single
keyframe image is generated from each sentence, and
it is not possible to generate a video from each sen-
tence. Thus, generating long videos from long story
sentences is still a challenging problem.
3 GENERATING VIDEOS FROM
STORIES
3.1 Generating Videos from Multiple
Sentences
In this research, we aim to generate long videos
from stories composed of multiple sentences. Build-
ing upon StoryGAN (Li et al., 2019), we propose
a method for generating contextually aligned videos
from stories with multiple sentences, where, instead of keyframes, we generate a video for each sentence. In the following, we refer to each individual sentence as a "sentence" and to the collection of sentences as a "story".

Figure 3: (a) Image discriminator (D_I) and (b) story video discriminator (D_S). The story video discriminator identifies whether the entire generated video aligns with the story sentences by comparing the video vector obtained from the T × N images [x̂_1^1, ..., x̂_T^N] through a CNN with the story vector obtained from the T sentences [s_1, ..., s_T] by using the sentence encoder (Cer et al., 2018). The image discriminator evaluates each image and sentence in the same way.

Figure 4: The input story sentences S and the sentences Ŝ obtained by captioning the generated story video X̂ are converted into sentence vectors by the sentence encoder (Cer et al., 2018) and are used for computing the caption loss L_caption.
In this study, we utilize data where sentences are
paired with videos for GAN training. By inputting
multiple sentences from this dataset into the gener-
ator, we train it to generate videos corresponding to
each input sentence. By concatenating all the videos
generated from these sentences, we create a long
video representing the flow of the entire story.
The basic network structure of the proposed method is shown in Figure 2. First, the story encoder converts an input story S = [s_1, s_2, ..., s_T] consisting of T sentences into a low-dimensional initial vector h_0. This is achieved by using the story encoder proposed in StoryGAN (Li et al., 2019). Subsequently, the generator generates T sequences of N-frame videos x̂_t = [x̂_t^1, x̂_t^2, ..., x̂_t^N] (t = 1, ..., T) from this initial vector h_0, each sentence s_t, and noise ε. Then, the generated T videos are concatenated in sequence to form the final video X̂ = [x̂_1, x̂_2, ..., x̂_T].
This generator was built by extending the generator proposed in StoryGAN (Li et al., 2019). In StoryGAN, the generator consists of a convolutional image generator, a GRU, and Text2Gist, and generates a single image x̂_t from a single sentence s_t. In this research, we also use a convolutional image generator, a GRU, and Text2Gist, but we generate N frames of continuous images [x̂_t^1, ..., x̂_t^N] from each sentence s_t. This is achieved by iteratively applying the pair of GRU and Text2Gist (T2G) N times with the same sentence s_t as input, as shown in Figure 2. As a result, we generate a total of T × N images from T sentences.
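To make this iteration concrete, the following is a minimal PyTorch sketch of the generation loop, assuming simplified stand-ins for the Text2Gist cell and the convolutional image generator G_I; the hidden size, the 64 × 64 frame resolution, and the way the noise ε is injected are our own assumptions rather than the exact StoryGAN modules.

```python
import torch
import torch.nn as nn

# Minimal sketch of the story video generation loop; Text2GistCell and
# ImageGenerator are simplified stand-ins, not the StoryGAN modules.

class Text2GistCell(nn.Module):
    """Stand-in for Text2Gist: fuses a text feature with the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, h):
        h_new = torch.tanh(self.fuse(torch.cat([text_feat, h], dim=-1)))
        return h_new, h_new              # (gist vector, next hidden state)

class ImageGenerator(nn.Module):
    """Stand-in for G_I: decodes a gist vector into one 64x64 RGB frame."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, gist):
        x = self.fc(gist).view(-1, 128, 8, 8)
        return self.deconv(x)            # (B, 3, 64, 64)

class StoryVideoGenerator(nn.Module):
    def __init__(self, sent_dim=512, hid_dim=128, n_frames=10):
        super().__init__()
        self.n_frames = n_frames
        self.proj = nn.Linear(sent_dim, hid_dim)   # sentence vector -> hidden size
        self.gru = nn.GRUCell(hid_dim, hid_dim)
        self.t2g = Text2GistCell(hid_dim)
        self.g_img = ImageGenerator(hid_dim)

    def forward(self, sentences, h0, noise_std=0.1):
        # sentences: (T, B, sent_dim) sentence vectors; h0: (B, hid_dim) from the story encoder
        frames, h = [], h0
        for s_t in sentences:                      # loop over the T sentences
            text_feat = self.proj(s_t)
            for _ in range(self.n_frames):         # N frames from the same sentence
                eps = noise_std * torch.randn_like(text_feat)
                g = self.gru(text_feat + eps, h)   # GRU step conditioned on the sentence
                gist, h = self.t2g(g, h)           # Text2Gist updates the hidden state
                frames.append(self.g_img(gist))    # convolutional image generator G_I
        return torch.stack(frames)                 # (T*N, B, 3, 64, 64)

# Example: T=5 sentences, N=10 frames per sentence, batch of 2 stories.
gen = StoryVideoGenerator()
video = gen(torch.randn(5, 2, 512), torch.randn(2, 128))
print(video.shape)  # torch.Size([50, 2, 3, 64, 64])
```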
On the other hand, the discriminators receive the generated videos x̂_t (t = 1, ..., T) or the ground truth videos x_t (t = 1, ..., T) corresponding to the sentences s_t (t = 1, ..., T) in the story S, and evaluate the authenticity of the generated videos. We use two types of discriminators: an image discriminator and a story video discriminator. The image discriminator identifies whether each frame image aligns with the corresponding sentence, while the story video discriminator identifies whether the entire video aligns with the story. The image discriminator D_I compares an image vector obtained from a generated image x̂_t^n through a CNN with a sentence vector obtained from the sentence s_t by using a sentence encoder (Cer et al., 2018), while the story video discriminator D_S compares a video vector obtained from the T × N images through a CNN with a story vector obtained from the T sentences by using the sentence encoder, as shown in Figure 3. The comparison of the image vector with the sentence vector, or of the video vector with the story vector, is conducted by a two-layer MLP. These two types of discriminators enhance both the overall consistency of the video and its alignment with the original sentences, improving the overall quality of the generated story videos.
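As a rough illustration, the sketch below implements the two conditional discriminators in PyTorch: each frame (or the whole video, pooled here by simply averaging frame features) is encoded by a CNN and compared with a sentence-encoder vector through a two-layer MLP. The feature dimensions, the shared frame CNN, the mean pooling over frames and sentences, and the omission of the conditioning on the initial vector h_0 are simplifying assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

# Simplified sketch of the image discriminator D_I and the story video
# discriminator D_S; dimensions and the frame-feature pooling are assumptions.

def frame_cnn(feat_dim=256):
    """Shared CNN that maps a 3x64x64 frame to a feature vector."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, feat_dim),
    )

class ImageDiscriminator(nn.Module):
    """D_I: does a single frame match its sentence vector?"""
    def __init__(self, sent_dim=512, feat_dim=256):
        super().__init__()
        self.cnn = frame_cnn(feat_dim)
        self.mlp = nn.Sequential(                  # two-layer MLP comparator
            nn.Linear(feat_dim + sent_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, frame, sent_vec):
        pair = torch.cat([self.cnn(frame), sent_vec], dim=-1)
        return torch.sigmoid(self.mlp(pair))       # probability of "real and matching"

class StoryVideoDiscriminator(nn.Module):
    """D_S: does the whole T*N-frame video match the whole story?"""
    def __init__(self, sent_dim=512, feat_dim=256):
        super().__init__()
        self.cnn = frame_cnn(feat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + sent_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, video, story_vecs):
        # video: (B, T*N, 3, 64, 64); story_vecs: (B, T, sent_dim)
        b, f = video.shape[:2]
        video_vec = self.cnn(video.flatten(0, 1)).view(b, f, -1).mean(dim=1)
        story_vec = story_vecs.mean(dim=1)         # pool the T sentence vectors
        pair = torch.cat([video_vec, story_vec], dim=-1)
        return torch.sigmoid(self.mlp(pair))

# Example shapes for a batch of 2 stories (T=5, N=10).
d_i, d_s = ImageDiscriminator(), StoryVideoDiscriminator()
print(d_i(torch.randn(2, 3, 64, 64), torch.randn(2, 512)).shape)         # (2, 1)
print(d_s(torch.randn(2, 50, 3, 64, 64), torch.randn(2, 5, 512)).shape)  # (2, 1)
```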
The story video generator G is trained by using the image discriminator D_I and the story video discriminator D_S as follows:

  G^{*} = \arg\min_{G} \max_{D_I, D_S} \; \alpha L_{image} + \beta L_{story} + L_{KL}   (1)
Figure 5: Some of the characters in the Pororo dataset.
where L_image represents the loss associated with the image discriminator D_I, while L_story represents the loss associated with the story video discriminator D_S, as follows:

  L_{image} = \sum_{t=1}^{T} \sum_{n=1}^{N} \Big[ \log D_I(x_t^n, s_t, h_0) + \log\big(1 - D_I(G(\varepsilon_t^n, s_t), s_t, h_0)\big) \Big]   (2)

  L_{story} = \log D_S(X, S) + \log\Big(1 - D_S\big([[G(\varepsilon_t^n, s_t)]_{n=1}^{N}]_{t=1}^{T}, S\big)\Big)   (3)
The story video discriminator D_S discriminates between the ground truth story video X and the story video X̂ = [[G(ε_t^n, s_t)]_{n=1}^{N}]_{t=1}^{T} generated by the generator G from the noise ε_t^n and the sentences s_t.
Following StoryGAN (Li et al., 2019), we also use the following KL divergence L_KL between the learned distribution and the standard Gaussian distribution, which helps prevent mode collapse in the generation of the initial vector:

  L_{KL} = \mathrm{KL}\big( \mathcal{N}(\mu(S), \mathrm{diag}(\sigma^{2}(S))) \,\|\, \mathcal{N}(0, I) \big)   (4)

where μ(S) and σ²(S) are the mean and the variance of the initial vectors derived from the input story S.
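Putting Eqs. (1)-(4) together, one generator update could be computed roughly as sketched below. The sketch reuses the generator and discriminators from the earlier sketches, uses the common non-saturating binary cross-entropy form for the generator's adversarial terms, and omits the conditioning of D_I on h_0 for brevity; the weights α and β and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_losses(gen, d_i, d_s, sentences, story_vecs, h0, mu, logvar,
                     alpha=1.0, beta=1.0):
    """Sketch of the generator-side L_image, L_story and L_KL terms.

    sentences:  (T, B, sent_dim) sentence vectors
    story_vecs: (B, T, sent_dim) the same vectors, batch-first, for D_S
    h0:         (B, hid_dim) initial vector from the story encoder
    mu, logvar: (B, hid_dim) parameters of the learned Gaussian for h0
    """
    fake = gen(sentences, h0)                     # (T*N, B, 3, 64, 64)
    tn, b = fake.shape[:2]
    ones = torch.ones(b, 1)

    # L_image: every generated frame should fool D_I given its sentence.
    t_sent = sentences.shape[0]
    n_frames = tn // t_sent
    l_image = 0.0
    for idx in range(tn):
        s_t = sentences[idx // n_frames]          # sentence paired with this frame
        l_image = l_image + F.binary_cross_entropy(d_i(fake[idx], s_t), ones)

    # L_story: the whole concatenated video should fool D_S given the story.
    video = fake.permute(1, 0, 2, 3, 4)           # (B, T*N, 3, 64, 64)
    l_story = F.binary_cross_entropy(d_s(video, story_vecs), ones)

    # L_KL: KL( N(mu, diag(sigma^2)) || N(0, I) ) for the initial vector.
    l_kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar) / b

    return alpha * l_image + beta * l_story + l_kl
```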
3.2 Improving Accuracy by Caption
Loss
Next, we will consider improving the accuracy of
video generation using video captioning. It is not
easy to quantitatively evaluate the appropriateness of
videos generated from text. One way to perform such
difficult evaluations is to input the generated video
into a well-trained captioning network to generate
sentences that represent the video and see how close
the generated sentences are to the original input sen-
tences.
Thus, in this research, we perform captioning on the video X̂ generated by the story video generator using a trained video captioning network (Iashin and Rahtu, 2020), and compute the caption loss L_caption by comparing the generated story sentences Ŝ with the original input story sentences S, as shown in Fig. 4. The loss obtained in this way is a measure of how well the generated video represents the input sentences. Therefore, by adding this loss to the network training, we can further improve the accuracy of the generator.
Table 1: Cosine similarity between input sentences and sentences obtained from generated videos.

                              proposed    w/o caption loss
  cosine similarity (↑)        0.211            0.154
The similarity between the input story sentences S and the sentences Ŝ obtained by captioning the generated story video is evaluated by the cosine similarity between the sentence vectors v and v̂ derived from S and Ŝ using the sentence encoder (Cer et al., 2018). Therefore, the caption loss L_caption is defined as follows:

  L_{caption} = 1 - \cos(v, \hat{v})   (5)
By adding this L_caption to the network training as follows, the video generation accuracy of the generator can be further improved:

  G^{*} = \arg\min_{G} \max_{D_I, D_S} \; \alpha L_{image} + \beta L_{story} + \gamma L_{caption} + L_{KL}   (6)
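A minimal sketch of the caption loss of Eq. (5) is given below: the generated video is captioned, both the input story and the regenerated caption are embedded as sentence vectors, and one minus their cosine similarity is penalized. The `captioner` and `sentence_encoder` callables are placeholders standing in for the trained bi-modal captioning network (Iashin and Rahtu, 2020) and the Universal Sentence Encoder (Cer et al., 2018); they are not those networks' actual APIs.

```python
import torch.nn.functional as F

def caption_loss(generated_video, input_sentences, captioner, sentence_encoder):
    """Sketch of L_caption = 1 - cos(v, v_hat) from Eq. (5).

    captioner:        callable mapping a video tensor to a list of sentences
                      (placeholder for the trained video captioning network)
    sentence_encoder: callable mapping a list of sentences to an embedding tensor
                      (placeholder for the Universal Sentence Encoder)
    """
    regenerated = captioner(generated_video)      # S_hat: captions of the generated video
    v = sentence_encoder(input_sentences)         # vector of the input story S
    v_hat = sentence_encoder(regenerated)         # vector of the regenerated story S_hat
    return 1.0 - F.cosine_similarity(v, v_hat, dim=-1).mean()

# The total generator objective of Eq. (6) then combines all terms, e.g.
# loss_g = alpha * l_image + beta * l_story + gamma * l_caption + l_kl
```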
4 DATASET
Since the proposed method generates videos from sto-
ries consisting of multiple sentences, a dataset con-
sisting of paired stories and videos is required. Thus,
in this research, we construct a dataset based on the
Pororo dataset (Kim et al., 2017), which consists of
short video clips and their description texts.
The Pororo dataset is an anime video dataset with
Pororo as the main character of the stories. Fig. 5
shows some of the characters in the story. The anime
video is divided into short video clips every few sec-
onds, and each short video clip is annotated with a
sentence. Therefore, the Pororo dataset consists of pairs of a short video clip and its description text. The dataset is composed of a series of episodes, so the stories are connected as continuous data.
StoryGAN and other related studies use the Pororo-SV dataset (Li et al., 2019), which is created by randomly extracting a single image from each short video clip, so that the dataset consists of pairs of an image and a sentence. In contrast, in this research, we need to train our network to generate N images from each sentence. Thus, we extracted N consecutive images from each video clip at equal intervals and created pairs of N images and a sentence.
Since the stories are connected in continuous data, we
cut out T consecutive pairs of an image sequence and
a sentence as one set of story data. Therefore, one set
of data is composed of T ×N images and T sentences.
In this research, we set T = 5 and N = 10, and gen-
erated 2995 sets of data. Each story sentence in this
dataset was encoded into a fixed-length sentence vector using the Universal Sentence Encoder (Cer et al., 2018).

Figure 6: Generated video images for (a) story 1, (b) story 2, and (c) story 3. The proposed method can generate a series of short videos from multiple sentences, while StoryGAN can only generate a single keyframe image from each sentence and cannot visualize the motion of the characters and scenes described in each sentence.
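As a rough sketch of the dataset construction described in this section (under the assumption of a simple clip/annotation list format and non-overlapping groups of T clips, neither of which is specified above), one could sample N frames per clip at equal intervals, group T consecutive clip-sentence pairs into one story set, and encode each sentence:

```python
import numpy as np

T, N = 5, 10  # sentences per story, frames per sentence (as in this paper)

def sample_frames(clip_frames, n=N):
    """Pick n frames at (approximately) equal intervals from one video clip."""
    idx = np.linspace(0, len(clip_frames) - 1, n).round().astype(int)
    return [clip_frames[i] for i in idx]

def build_story_sets(clips, sentences, encode_sentence):
    """Group T consecutive (clip, sentence) pairs into one story sample.

    clips:           list of video clips, each a list of frames (placeholder format)
    sentences:       list of annotation sentences, one per clip
    encode_sentence: placeholder for the Universal Sentence Encoder
    """
    stories = []
    for start in range(0, len(clips) - T + 1, T):
        frames = [sample_frames(clips[start + t]) for t in range(T)]   # T x N frames
        sents = sentences[start:start + T]
        vectors = [encode_sentence(s) for s in sents]                  # fixed-length vectors
        stories.append({"frames": frames, "sentences": sents, "vectors": vectors})
    return stories
```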
5 EXPERIMENTS
We next show the results of some experiments on gen-
erating story videos from story sentences. We trained
our network for 120 epochs using the 2500 sets of training data described in Section 4. The trained network was tested on the remaining 495 sets of test data.
We first show video images generated from test
story sentences. Fig. 6 (a), (b) and (c) show three
different story sentences and video images generated
by our method. For comparison, we also show
the results from StoryGAN (Li et al., 2019) and our
method trained without using caption loss. As shown
in this figure, the proposed method can generate a se-
ries of short videos from multiple sentences, while
StoryGAN can only generate a single keyframe im-
age from each sentence and cannot visualize the mo-
tion of the characters and scenes described in the
sentence. We can also see that the characters men-
tioned in the sentences are properly generated in the
videos obtained from the proposed method. Compar-
ing methods with and without the caption loss, the
proposed method using the caption loss can generate
videos with larger changes, indicating that the gener-
ated videos have richer expressions.
To evaluate the accuracy of the generated videos
quantitatively, we captioned the generated video by
using the captioning network (Iashin and Rahtu, 2020)
trained by using the Pororo dataset, and computed
the cosine similarity with the input sentence. Ta-
ble 1 shows the cosine similarity between the input
sentences and the sentences obtained from the gener-
ated videos. For comparison, we also show the co-
sine similarity when we do not use the caption loss in
our method. As shown in Table 1, the caption loss in
our method can improve the quality of the generated
videos.
6 CONCLUSION
In this paper, we proposed a method for generating
videos that represent stories described in multiple sen-
tences.
Existing text-to-video methods can only generate short videos from texts, and generating long story videos from long story sentences remains a challenging problem. In this paper, we extended StoryGAN and showed that story videos can be generated from story sentences. Furthermore, we showed that by using the caption loss, we can further improve the accuracy of the generated story videos.
Our method replicates the human ability to imagine and visualize the scenes described in sentences while reading a novel. On the other hand, a large amount of experience is required to imagine rich scenes from story sentences, so using foundation models may further improve the ability to perform this task.
REFERENCES
Brock, A., Donahue, J., and Simonyan, K. (2019). Large
scale gan training for high fidelity natural image syn-
thesis. In Proc. ICLR.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N.,
John, R. S., Constant, N., Guajardo-Cespedes, M.,
Yuan, S., Tar, C., Sung, Y.-H., Strope, B., and
Kurzweil, R. (2018). Universal sentence encoder. In
arXiv:1803.11175.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Proc. ICLR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In Proc.
Advances in neural information processing systems,
pages 2672–2680.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M.,
and Fleet, D. J. (2022). Video diffusion models. In
arXiv:2204.03458.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition, pages 1125–1134.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. In Proc. Conference on Neural Information Processing Systems.
Kim, K., Heo, M., Choi, S., and Zhang, B. (2017). Deep-
story: video story qa by deep embedded memory net-
works. In Proc. of International Joint Conference on
Artificial Intelligence, page 2016–2022.
Lei, J., Wang, L., Shen, Y., Yu, D., Berg, T. L., and Bansal,
M. (2020). Mart: Memory-augmented recurrent trans-
former for coherent video paragraph captioning. In
arXiv:2005.05402.
Li, C., Kong, L., and Zhou, Z. (2020). Improved-
storygan for sequential images visualization. Journal
of Visual Communication and Image Representation,
73(102956).
Li, Y., Gan, Z., Shen, Y., Liu, J., Cheng, Y., Wu, Y., Carin,
L., Carlson, D., and Gao, J. (2019). Storygan: A se-
quential conditional gan for story visualization. In
Proc. of IEEE Conference on Computer Vision and
Pattern Recognition, pages 6329–6338.
Li, Y., Min, M. R., Shen, D., Carlson, D., and Carin, L.
(2018). Video generation from text. In Proc. AAAI
Conference on Artificial Intelligence.
Maharana, A., Hannan, D., and Bansal, M. (2021). Improv-
ing generation and evaluation of visual stories via se-
mantic consistency. In Proc. Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
page 2427–2442.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., Krueger, G., and Sutskever, I. (2021). Learning
transferable visual models from natural language su-
pervision. In Proc. ICML.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen,
M. (2022). Hierarchical text-conditional image gener-
ation with clip latents. In arXiv:2204.06125.
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Rad-
ford, A., Chen, M., and Sutskever, I. (2021). Zero-shot
text-to-image generation. In arXiv:2102.12092.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B.,
and Lee, H. (2016). Generative adversarial text to im-
age synthesis. In Proc. ICML.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Om-
mer, B. (2022). High-resolution image synthesis with
latent diffusion models. In Proc. IEEE Conference
on Computer Vision and Pattern Recognition, page
10684–10695.
Iashin, V. and Rahtu, E. (2020). A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In Proc. British Machine Vision Conference.
Zeng, G., Li, Z., and Zhang, Y. (2019). Pororogan: An im-
proved story visualization model on pororo-sv dataset.
In Proc. International Conference on Computer Sci-
ence and Artificial Intelligence, page 155–159.
Zhu, J., Park, T., Isola, P., and Efros, A. (2017). Unpaired
image-to-image translation using cycle-consistent ad-
versarial networks. In Proc. International Conference
on Computer Vision, pages 2223–2232.