and image authenticity needs to be improved. To address these problems, Li et al. proposed an improved CycleGAN network model that replaces the original ResNet backbone with a U-Net to better preserve image details and structure, and integrates a self-attention mechanism into both the generator and the discriminator to further strengthen attention to important details and reconstruction ability, yielding more realistic and refined transfer results (Li, 2023).
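Li (2023) does not include reference code here, but the kind of self-attention block typically inserted into such generators and discriminators is well established; the following is only a minimal PyTorch sketch of a SAGAN-style self-attention layer, with all names illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over the spatial positions of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)  # assumes channels >= 8
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, h*w)
        attn = F.softmax(torch.bmm(q, k), dim=-1)     # (b, h*w, h*w) attention map
        v = self.value(x).flatten(2)                  # (b, c, h*w)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                   # residual connection
```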
AdaIN (Huang, 2017) adopts an encoder-decoder structure that transfers arbitrary styles without training a separate network for each style, but because the method fails to retain the depth information of the content image, its rendering quality is poor. Wu et al. extended AdaIN by integrating a depth-computation module for the content image into the encoder-decoder structure while preserving its overall architecture; the resulting style-enhanced output balances efficiency against depth information, thereby improving rendering quality (Wu, 2020).
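The core AdaIN operation referenced above is straightforward: the content feature maps are renormalized so that their per-channel mean and standard deviation match those of the style feature maps. A minimal PyTorch sketch, assuming feature tensors of shape (N, C, H, W):

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """AdaIN: shift the per-channel statistics of the content features
    to match those of the style features (Huang, 2017)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps  # eps avoids division by zero
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```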
This paper introduces and summarizes the basic concepts of style transfer, the implementation steps of convolutional neural network variants (such as the Visual Geometry Group network, VGG) in style transfer, and the steps of generative adversarial network variants (such as CycleGAN) in style transfer. Finally, the implementation flow of the AdaIN algorithm in style transfer is introduced, and future research directions for style transfer are discussed.
2 IMAGE STYLE TRANSFER BASED ON CONVOLUTIONAL NEURAL NETWORKS
2.1 Introduction to Convolutional Neural Networks
2.1.1 The Basic Mechanism and Principle of Convolutional Neural Networks in Style Transfer
A CNN is made up of five layers: the input layer, convolutional layer, pooling layer, fully connected layer, and output layer.
(1) Input layer: receives the input image information.
(2) Convolutional layer: extracts local features of the image. The convolutional layer contains a set of learnable convolution kernels, each of which detects and extracts certain features of the input image.
(3) Pooling layer: shrinks the size of the feature map while keeping sufficient feature information. Of the two basic pooling techniques, max pooling and average pooling, max pooling is the most widely used.
(4) Fully connected layer: flattens the features from the pooling layer into one-dimensional data and feeds them into the output layer.
(5) Output layer: classifies images or generates target images.
Generally, several convolutional layers are connected to a pooling layer to form a module. After a number of such modules have been connected in turn, the final module links to one or more fully connected layers; once the fully connected layers have extracted the features passed on by the final module, the last fully connected layer connects to the output layer.
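As an illustration of this layout, the following minimal PyTorch sketch stacks two conv-pool modules and two fully connected layers; the layer sizes are illustrative choices, not taken from any cited work:

```python
import torch.nn as nn

# Two conv+pool modules followed by fully connected layers, mirroring the
# structure described above. All sizes are illustrative.
cnn = nn.Sequential(
    # module 1: convolutional layers followed by a pooling layer
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # module 2
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    # flatten the pooled features into one-dimensional data
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 256), nn.ReLU(),  # assumes a 224x224 input image
    nn.Linear(256, 10),  # output layer, e.g. 10 classes
)
```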
2.1.2 The Role of CNN in Style Transfer
Feature extraction: the convolutional layers of a CNN efficiently extract style and content features from the style and content images, allowing the contents of the images to be mined further.
Style learning: by merging the extracted content features with the learned style features, the CNN learns a feature representation of the input style image and transfers its style onto the target image.
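In the classic CNN-based formulation, style features are often summarized by Gram matrices, the channel-wise correlations of a layer's feature maps, while content features are compared directly. A sketch of both representations, with mean-squared losses as one common choice:

```python
import torch

def gram_matrix(feat):
    """Style representation: channel-wise correlations of a feature map."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    # (n, c, c); dividing by the element count is one common normalization
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def content_loss(gen_feat, content_feat):
    return torch.mean((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    return torch.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)
```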
2.2 Image Style Transfer Based on the VGG Network
2.2.1 VGG-19 Network Model
The VGG (Visual Geometry Group) network is a deep convolutional neural network model created by Simonyan in 2014. In deep learning-based image style transfer research, the VGG network performs well at extracting content and style features from images.
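As a concrete illustration, the following torchvision sketch uses the convolutional part of a pretrained VGG-19 as a fixed feature extractor; the chosen layer indices are common choices in the literature, not prescribed by this paper:

```python
import torch
from torchvision import models

# Pretrained VGG-19, convolutional part only, used as a fixed feature extractor.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

content_layers = {21}               # conv4_2, a common content layer
style_layers = {0, 5, 10, 19, 28}   # conv1_1 .. conv5_1, common style layers

def extract_features(img):
    """Run img (N, 3, H, W) through VGG-19, collecting selected activations."""
    feats = {}
    x = img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in content_layers or i in style_layers:
            feats[i] = x
    return feats

# Usage: feats = extract_features(torch.randn(1, 3, 224, 224))
```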
The VGG-19 network consists of sixteen convolutional layers, five pooling layers, and three fully connected layers. All convolutional layers use 3 × 3 kernels, with stride and padding unified to 1, and all pooling layers use 2 × 2 max pooling. Every N convolutional layers together with one pooling layer form a block. As the input image passes through each block, the size of the extracted feature maps gradually decreases and the retained content gradually decreases. Finally, without flattening the