The three artificial intelligence models introduced
below represent different directions of creative AI in
the field of visual effect generation.
3.1 Stable Diffusion 3 (SD3)
Esser et al. proposed an improved noise-sampling technique (Esser et al., 2024), which is used in SD3 to enhance generation quality. SD3 is a diffusion model focused on image generation; its main function is to generate images that match a given text prompt. Training traditional flow models is computationally expensive. SD3 reformulates generation as the optimization of a new loss function, which allows the model to reach a good solution more quickly and improves training efficiency.
$$\mathcal{L}_w(x_0) = -\frac{1}{2}\,\mathbb{E}_{t\sim u(t),\,\epsilon\sim\mathcal{N}(0,I)}\left[\,w_t\,\lambda_t\,\big\|\epsilon_\Theta(z_t,t)-\epsilon\big\|^2\,\right] \quad (1)$$
Function 1 is the loss function of the model at the initial data point x_0. E_{t~u(t), ε~N(0,I)} denotes the expectation over the time-step distribution u(t) and the noise distribution N(0, I). w_t is a weighting term at time step t, and λ_t is another weighting factor at time step t. ||ε_Θ(z_t, t) − ε||² is the squared error between the model's predicted noise ε_Θ(z_t, t) and the actual noise ε (Esser et al., 2024).
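To make Function 1 concrete, here is a minimal PyTorch sketch of a weighted noise-prediction objective of this form (the constant -1/2 factor is dropped). The toy denoiser, the uniform sampling of t, and the constant placeholder weights w_t and λ_t are illustrative assumptions, not the actual SD3 training code.

import torch
import torch.nn as nn

# Toy stand-in for the denoiser eps_Theta(z_t, t); the real SD3 network is a large transformer.
class ToyDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, z_t, t):
        # Append the timestep to each sample as a very simple conditioning scheme.
        return self.net(torch.cat([z_t, t.unsqueeze(-1)], dim=-1))

def weighted_noise_loss(model, x0, w_t, lam_t):
    # Monte-Carlo estimate of E_{t~u(t), eps~N(0,I)}[ w_t * lam_t * ||eps_Theta(z_t, t) - eps||^2 ].
    t = torch.rand(x0.shape[0])                    # t ~ u(t), here uniform on [0, 1]
    eps = torch.randn_like(x0)                     # eps ~ N(0, I)
    z_t = (1 - t).unsqueeze(-1) * x0 + t.unsqueeze(-1) * eps   # noised sample on a straight path
    sq_err = ((model(z_t, t) - eps) ** 2).sum(dim=-1)          # ||eps_Theta(z_t, t) - eps||^2
    return (w_t(t) * lam_t(t) * sq_err).mean()

x0 = torch.randn(16, 64)                           # a batch standing in for data points x_0
loss = weighted_noise_loss(ToyDenoiser(), x0,
                           w_t=lambda t: torch.ones_like(t),    # placeholder weighting w_t
                           lam_t=lambda t: torch.ones_like(t))  # placeholder weighting lambda_t
loss.backward()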
The flow trajectory is optimized at the same time to improve the rectified flow model, with the aim of making training more effective and the model more accurate and efficient. However, at intermediate time steps (when t is close to 0.5), the error tends to increase. This method is therefore equivalent to a weighted loss:
$$w_t = \frac{t}{1-t}\,\pi(t) \quad (2)$$
Function 2 is a time-step-dependent weighting factor: t/(1−t) gives more weight when t is larger, and π(t) is the density from which the time steps are sampled, allowing the model to focus on recovering data from noise.
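As an illustration of Function 2, the following minimal sketch evaluates a weighting of the form w_t = t/(1−t)·π(t). Choosing π(t) as a logit-normal density, and the parameters m and s below, are assumptions made purely for illustration.

import math
import torch

def logit_normal_density(t, m=0.0, s=1.0, eps=1e-6):
    # pi(t): an illustrative choice of time-step density whose logit is Gaussian.
    t = t.clamp(eps, 1 - eps)
    logit = torch.log(t) - torch.log(1 - t)
    gauss = torch.exp(-0.5 * ((logit - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return gauss / (t * (1 - t))        # change-of-variables factor d(logit)/dt = 1/(t(1-t))

def w(t, pi=logit_normal_density):
    # Function 2: w_t = t / (1 - t) * pi(t); larger t (samples closer to pure noise) gets more weight.
    return (t / (1 - t)) * pi(t)

t = torch.linspace(0.05, 0.95, 5)
print(w(t))                             # inspect how the weight varies across time steps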
The work also proposes a new architecture for text-to-image generation, the Multimodal Diffusion Transformer (MM-DiT), which can handle information from the text and image modalities simultaneously. The model still has some issues: for example, removing the Text-to-Text Transfer Transformer (T5) text encoder improves the efficiency of image generation, but performance drops significantly when rendering written text without it. Nevertheless, the model's scaling trend shows no signs of saturation, and there is still considerable room for improvement. SD3 demonstrates what artificial intelligence has achieved in image generation and shows how, once trained, it can supply a large number of uniformly styled images during the game development phase, saving cost and development time.
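To illustrate the joint text-image processing of the Multimodal Diffusion Transformer described above, the following is a minimal sketch of a single, heavily simplified joint-attention block: each modality keeps its own projections, but text and image tokens attend to each other in one attention pass. The layer sizes and the use of nn.MultiheadAttention are assumptions; the real architecture's timestep modulation, normalization, and MLP sub-blocks are omitted.

import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    # Simplified MM-DiT-style block: modality-specific projections, shared joint attention.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.txt_in, self.img_in = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_out, self.img_out = nn.Linear(dim, dim), nn.Linear(dim, dim)
    def forward(self, txt_tokens, img_tokens):
        n_txt = txt_tokens.shape[1]
        # Concatenate both modalities into one sequence so they can attend to each other.
        x = torch.cat([self.txt_in(txt_tokens), self.img_in(img_tokens)], dim=1)
        attn_out, _ = self.attn(x, x, x)
        # Split the sequence back so each modality keeps its own output projection.
        return self.txt_out(attn_out[:, :n_txt]), self.img_out(attn_out[:, n_txt:])

block = JointAttentionBlock()
txt = torch.randn(2, 77, 256)     # e.g. a batch of 77 text tokens per prompt
img = torch.randn(2, 256, 256)    # e.g. 16x16 latent image patches
txt_out, img_out = block(txt, img)
print(txt_out.shape, img_out.shape)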
3.2 Movie Gen
Polyak et al. proposed Movie Gen, a Transformer-based foundation model focused on video generation (Polyak et al., 2024). Its main feature is generating high-quality videos while keeping the accompanying audio consistent with the visuals. Movie Gen uses a 30B-parameter Transformer model to generate videos from text. The model is first pre-trained on low-resolution images, then jointly pre-trained on high-resolution images and videos, and finally fine-tuned on high-quality videos. This staged approach enables the production of coherent, high-resolution videos.
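As a rough illustration of this staged recipe, the following sketch lays the curriculum out as ordered training stages. The stage names, step counts, and the train_stage stub are hypothetical placeholders rather than Movie Gen's actual pipeline.

# Hypothetical outline of a staged training curriculum like the one described above.
STAGES = [
    {"name": "image_pretrain", "data": "low_resolution_images",             "steps": 100_000},
    {"name": "joint_pretrain", "data": "high_resolution_images_and_videos", "steps": 200_000},
    {"name": "video_finetune", "data": "curated_high_quality_videos",       "steps": 20_000},
]

def train_stage(model_state, stage):
    # Placeholder: a real implementation would run the optimizer over the stage's dataset here.
    print(f"stage {stage['name']}: training on {stage['data']} for {stage['steps']} steps")
    return model_state

model_state = {}
for stage in STAGES:
    model_state = train_stage(model_state, stage)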
By adding conditioning to the pre-trained generative model, Movie Gen can generate personalized videos that reference a given image and text prompt. Another key feature is a separate 13B-parameter model that generates sound effects and music synchronized with the video, based on text prompts.
However, Movie Gen still has many shortcomings. For example, the generated videos may struggle with complex shapes and physics, and the audio may fall out of sync during action-intensive scenes such as tap dancing, or when the sound source is visually obstructed or small, as with footstep sounds. Additionally, it does not support speech generation. Despite these issues, it can still be seen as a significant milestone toward greatly reducing the difficulty and cost of game development.
Game developers can more easily use CG (pre-rendered videos played in games) within their titles, offering more personalized and context-sensitive CGs, such as cutscenes generated to match the player's character appearance, thus enhancing the player's immersion. This provides an option between CG videos, which are more immersive, and real-time rendering, which is less prone to interference, offering a more balanced solution between real-time rendering and CG playback.
3.3 Genie
Genie is a generative interactive environment proposed by Bruce et al. in 2024 (Bruce et al., 2024). The model is built on a spatiotemporal transformer trained on unlabeled internet videos and can be considered a foundational world model. The key feature of Genie is its ability to generate interactive virtual environments through text, images,