Texture Translation of PBR Materials Based on Pix2pix-Turbo
Jiamu Liu
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
Keywords: Physically Based Rendering, Texture Translation, Pix2pix-Turbo.
Abstract: Physically Based Rendering (PBR), a high-quality method in 3D model rendering, is widely used in modern
games and 3D short films. However, generating corresponding PBR textures is relatively complex and
challenging. This paper proposes a new task called PBR texture translation. The task involves generating
corresponding texture maps such as height, normal, and roughness maps based on the base color image of a
given PBR texture using an image-to-image translation model. Additionally, this paper improves the latest
image translation model, pix2pix-turbo, by incorporating a classifier and expert models and by
experimentally tuning the text-image alignment through the text prompt. After training on the MatSynth
dataset, the model achieved a minimum Mean Squared Error (MSE) of 1181.41 and a maximum Structural
Similarity Index (SSIM) of 0.614 on the height textures of the test set, reducing MSE by 1,443.53 and
improving SSIM by 0.13 compared to the original model. The contributions of this research include proposing
the PBR texture translation task and improving the pix2pix-turbo model to make it more suitable for texture
translation tasks.
1 INTRODUCTION
In the field of 3D modeling and game development,
the quality and effectiveness of textures are critical
factors that determine the outcome of the work.
Specifically, high-quality, realistic texture maps can
provide users with a more immersive, authentic, and
aesthetically pleasing experience. Therefore, creating
high-quality texture maps and appropriately applying
them in 3D modeling software to ensure accurate and
suitable shading on models is a top priority in current
research in this field.
Physically Based Rendering (PBR) is an
important technology for enhancing the realism and
immersive experience of 3D modeling. This
technology was introduced in 2004 by Matt Pharr
(Pharr, Humphreys, 2004). At that time, computer
rendering techniques were not very advanced, leading
to "plastic-like" appearances for metallic objects in
games and 3D short films. However, after over a
decade of effort and collaboration from talents across
various fields, modern 3D engines have seamlessly
integrated PBR technology. This integration has
eliminated the plastic-like appearance, bringing
objects closer to reality and sometimes achieving
near-photorealism. In the field of offline rendering,
the famous "Disney Principled Bidirectional
Reflectance Distribution Function," introduced by
Disney at SIGGRAPH 2012, significantly improved
the usability of PBR. In the same year, Disney applied
this technology to introduce the metallic workflow,
which played a key role in the production of the
critically acclaimed Wreck-It Ralph, marking a major
leap in the depiction of metallic textures. In the realm
of real-time rendering, various game developers
shared their advancements in PBR technology at
SIGGRAPH conferences. Notably, Brian Karis’ talk
Real Shading in Unreal Engine 4 at SIGGRAPH 2013
highlighted Unreal Engine 4 as the first game engine
to use PBR technology, making it an indispensable
tool in the game industry.
PBR technology requires eight different texture
maps that collectively determine how the material in
a 3D engine interacts with light to simulate real-world
physical laws. Currently, the creation of PBR textures
relies on professional artists, who go through a
complex and tedious process of handling image
content. The final quality of the PBR textures is
highly dependent on the artist’s experience and
judgment. Independent game and film developers
also struggle to find suitable PBR textures at the
initial stages of production, which significantly
increases both time and learning costs. Therefore, a
key research area is how to generate high-quality
PBR textures quickly and accurately, aligned with the
user's expectations.
In the field of PBR texture generation, current
research can generally be categorized into three main
approaches. The first approach, such as the method
proposed by Vecchio and Martin, focuses on
automatically extracting corresponding textures from
images. Vecchio and colleagues employed a diffusion
model and introduced rolled diffusion and patched
diffusion, achieving an SSIM of 0.729 and LPIPS of
0.184 (Vecchio, Martin, Roullier, et al., 2023; Martin,
Roullier, Rouffet, et al., 2022). The second approach,
as proposed by Guo and Hu, involves generating PBR
textures based on various conditions and rules. Guo
and colleagues built a MaterialGAN model using
StyleGAN2, optimizing latent space representations
to better generate target textures under constrained
conditions, achieving the lowest LPIPS of 0.071,
significantly outperforming previous models (Guo,
Smith, Hašan, et al., 2020; Hu, Hašan, Guerrero, et
al., 2022). The third approach involves more
convenient methods like text-to-texture, as recently
proposed by Vecchio and Siddiqui. Siddiqui and
colleagues' Meta 3D AssetGen model used multi-view images and signed distance functions to represent 3D
shapes more reliably, improving Chamfer distance by
17% and LPIPS by 40% (Vecchio, 2024; Siddiqui,
Monnier, Kokkinos, et al., 2024).
These methods have undoubtedly improved the
convenience and speed of PBR material creation.
However, since many of these models and
applications rely on textual descriptions or various
constraints to generate textures, the final results may
not be as satisfactory to users as those created from
handpicked or photographed textures. Additionally,
since the results of these models are closely tied to the
quality of the training dataset, previous models may
struggle to generate high-quality PBR textures based
on outdated training data. This paper will make
corresponding adjustments and improvements to the
pix2pix-turbo method, aiming to achieve PBR texture
translation based on an improved pix2pix-turbo
model. The goal is to enhance the quality of the
generated images, enabling the rapid translation of
consistent and high-quality PBR textures.
2 METHOD
2.1 MatSynth Dataset
The MatSynth dataset (Vecchio, Deschaintre, 2024)
is a high-definition PBR texture dataset containing
over 4,000 ultra-high-resolution textures. Curated
and published by Giuseppe Vecchio and Valentin
Deschaintre, the dataset focuses on a variety of
materials under the CC0 and CC-BY licensing
frameworks, sourced from AmbientCG,
CGBookCase, PolyHeaven, ShareTexture,
TextureCan, and part of artist Julio Sillet's materials
released under the CC-BY license. The dataset covers
13 types of materials: ceramic, concrete, fabric,
ground, leather, marble, metal, misc, plaster, plastic,
stone, terracotta, and wood. Each material category
contains over 200 sets of PBR textures, and each set
includes Basecolor, Diffuse, Normal, Height,
Roughness, Metallic, Specular, and Opacity maps.
In addition, the dataset's publishers visually
inspected and filtered out low-quality and low-
resolution PBR textures, and augmented the original dataset by blending semantically compatible
materials. The MatSynth dataset provides an
important data source for the texture generation field,
addressing the scarcity of high-quality datasets over
the past six years, which were plagued by issues such
as low resolution, copyright restrictions, and limited
material variety. Figure 1 shows an example of a
wood texture from the dataset, containing eight
different texture maps.
Figure 1: MatSynth dataset example (Photo/Picture credit:
Original).
2.2 Pix2pix Principle
The pix2pix-turbo model is a successor to pix2pix
(Isola, Zhu, Zhou, et al. 2017). Before the pix2pix
method was introduced, image translation tasks were
a significant and extensive branch of image
processing. Many methods required different model
architectures and loss functions to adapt to the
specific task at hand. Due to the diversity of tasks
(e.g., facade reconstruction versus Monet-style
translation), the resulting model architectures and loss
functions varied greatly, making it difficult to
standardize operations across tasks.
However, just as in the field of Natural Language
Processing (NLP), where all NLP tasks can be
generalized as question-answer tasks, the pix2pix
method introduced a unified approach for image
translation tasks. This method uses Conditional
Generative Adversarial Networks (CGANs) for
image translation, but unlike traditional CGANs, the
discriminator in pix2pix operates on image pairs
rather than single images. The generator uses a U-Net
architecture to retain more details from the original
image, ensuring that the generated image contains
both high-level features (e.g., textures) and low-level
features (edges, corners, contours, colors).
For instance, given an original image 𝑥, noise
input 𝑧, and corresponding target image 𝑦, with the
U-Net generator represented as 𝐺 and the
discriminator as 𝐷, the generated fake image would
be 𝐺(𝑥). Instead of having the discriminator compare
𝐺(𝑥) with 𝑦 , it distinguishes between the pairs
(𝑥,𝐺(𝑥)) and (𝑥,𝑦) . The discriminator does not
directly assess whether the generated image is real or
fake but rather determines whether the generated or
target image forms a valid image pair with the
original image. This strengthens the model's ability to
maintain correspondence between the original and
target images.
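To make the pairing concrete, the following PyTorch sketch shows a discriminator that scores (condition, candidate) pairs by concatenating them along the channel dimension; the layer sizes are illustrative and not the exact published PatchGAN architecture.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Sketch of a pix2pix-style discriminator that scores (condition, image) pairs.
    Layer sizes are illustrative, not the exact published architecture."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: condition and candidate concatenated -> 2 * in_channels channels.
            nn.Conv2d(2 * in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, x, y):
        # The discriminator never sees y alone: it judges whether (x, y) is a valid pair.
        return self.net(torch.cat([x, y], dim=1))

# Usage during training: compare D(x, y) against D(x, G(x, z)), as in Eq. (1).
```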
Based on this setup, the loss functions for the
generator and discriminator are as follows:
\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D\left(x, G(x,z)\right)\right)\right] \quad (1)
Nevertheless, at this stage, the generated image and the target image still do not have a pixel-level
correspondence. Since the cGAN generator
inevitably requires noise data, the generated image
may have slight shifts at the edges. To deceive the
discriminator D, the generator may produce blurry
edges to minimize the loss. However, having blurry
edges is not ideal for a high-quality generated image.
To prevent the generator from producing blurry
edges, an L1 loss is introduced, ensuring that the
generated image closely matches the target image at
the pixel level. L1 is chosen over L2 because minimizing L1 recovers the median of the target
distribution, whereas minimizing L2 recovers the mean; the averaging behavior of L2 therefore tends to
produce blurrier results than L1.
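For reference, the pixel-level term can be written, following the original pix2pix paper, as

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\lVert y - G(x,z) \right\rVert_{1}\right]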
With the introduction of L1 loss, the final loss
function is as follows:
\mathcal{L}_{cGAN}(D,G) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D\left(x, G(x,z)\right)\right)\right] \quad (2)

G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(D,G) + \lambda\,\mathcal{L}_{L1}(G) \quad (3)
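As a minimal PyTorch sketch of how the generator side of Eq. (3) could be computed during training (binary cross-entropy stands in for the adversarial log terms, and lambda_l1 = 100 is the value used in the original pix2pix experiments; everything else is illustrative):

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator, x, y, fake, lambda_l1=100.0):
    """Adversarial + L1 objective for the pix2pix generator, as in Eq. (3).

    x:    input (base color) image batch
    y:    ground-truth target texture batch
    fake: generator output G(x, z)
    """
    # Adversarial term: the generator wants D(x, fake) to be classified as a valid pair.
    pred_fake = discriminator(x, fake)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))

    # Pixel-level L1 term keeps the output close to the target and discourages blur.
    l1 = F.l1_loss(fake, y)

    return adv + lambda_l1 * l1
```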
2.3 Pix2pix-Turbo Principle
Although pix2pix can achieve good results in image
translation tasks, it struggles with tasks that require
precise or complex image descriptions, as it cannot
effectively learn intricate patterns. Additionally, the
original pix2pix requires training the generator from
scratch, which incurs significant time costs, and the
model's inference speed is not optimal.
To address these issues, Gaurav Parmar and colleagues (Parmar, Park, Narasimhan, et al., 2024)
introduced improvements by incorporating a text encoder and leveraging pre-trained diffusion models
(such as SD-Turbo). Instead of training the generator from scratch, they fine-tuned it with LoRA
(Low-Rank Adaptation) using text prompts and input-target image pairs. To align text with images, they
used CLIP (Radford et al., 2021), a model that connects natural language supervision with visual
representations.
Additionally, skip connections were introduced
between the encoder and decoder of the generator
network to balance detail loss caused by generator
changes. Consequently, the model's loss function
includes not only the original pix2pix generator and
discriminator losses but also CLIP similarity loss and
reconstruction loss (including L2 loss and LPIPS loss
to measure differences between the generated and
target images).
With these improvements, the pix2pix-turbo
model allows fine-grained control over generated
content using text prompts. It also achieves faster
training and inference times compared to the original
pix2pix, while producing images with superior
overall quality and better detail retention.
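As a rough sketch of how such a composite objective could be assembled (the weighting coefficients and the precomputed CLIP similarity are placeholders rather than the authors' released implementation; the LPIPS term uses the publicly available lpips package):

```python
import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance between generated and target images

def turbo_generator_loss(fake, target, pred_fake, clip_similarity,
                         w_gan=0.5, w_rec=1.0, w_lpips=1.0, w_clip=0.5):
    """Illustrative composite loss for a pix2pix-turbo-style generator.

    fake / target:   generated and ground-truth images scaled to [-1, 1]
    pred_fake:       discriminator logits for the generated image
    clip_similarity: precomputed CLIP similarity between the text prompt and `fake`
    All weights are assumptions for illustration only.
    """
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    rec = F.mse_loss(fake, target)              # L2 reconstruction term
    perceptual = lpips_fn(fake, target).mean()  # LPIPS perceptual term
    clip_term = 1.0 - clip_similarity           # encourage text-image alignment
    return w_gan * adv + w_rec * rec + w_lpips * perceptual + w_clip * clip_term
```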
2.4 Model Improvement
To apply the pix2pix-turbo method to the domain of
PBR texture generation, several adjustments to the
model are necessary. The original method was
designed for one-to-one correspondence between an
input image and a target image, whereas the task in
this paper requires generating multiple texture
maps—such as height, normal, and roughness—from
a single base color image. This turns the task into a
one-to-many problem, necessitating a modification of
the original model structure.
Additionally, since different materials exhibit
distinct properties, their corresponding texture maps
may vary significantly. For example, the metallic map
for the ceramic category is mostly black, as ceramics
do not exhibit metallic properties. Conversely, metal
textures often contain large white areas, representing
the presence of metallic shine. Therefore, to
distinguish between different material categories and
generate appropriate texture maps for each, the model
needs to incorporate a classifier that can identify the
input material type.
In this experiment, the yolov8m-cls model is
employed for image classification, helping the system
better recognize the material category of the input
image.
At the same time, because the characteristics of different PBR material types overlap in the training
dataset (for example, Ground textures may include small amounts of stone as embellishments), a fallback
mechanism is needed even though an input image can largely be classified into a single category for the
texture translation task. To ensure that the expert model assigned
by the classifier can successfully process the input
image, the system provides a universal expert model
as a secondary option. This fallback guarantees that,
in cases where the classifier makes an error, the input
image won’t be processed by an incorrect expert
model and yield poor results. Instead, the universal
expert model offers an alternative path, allowing the
user to obtain a more reliable output.
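The routing described above can be sketched as follows; the confidence threshold, weight path, and expert lookup table are assumptions for illustration, while the classifier call uses the Ultralytics YOLO interface.

```python
from ultralytics import YOLO

CONFIDENCE_THRESHOLD = 0.6  # assumed value; below this, fall back to the universal expert

classifier = YOLO("runs/classify/train/weights/best.pt")  # trained yolov8m-cls weights (path assumed)

def select_expert(basecolor_path, expert_models, universal_expert):
    """Route an input base color image to a material-specific expert or the universal fallback."""
    result = classifier(basecolor_path)[0]
    top_idx = int(result.probs.top1)
    top_conf = float(result.probs.top1conf)
    material = result.names[top_idx]

    # Use the category expert only when the classifier is confident and an expert exists.
    if top_conf >= CONFIDENCE_THRESHOLD and material in expert_models:
        return expert_models[material]
    return universal_expert
```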
Next is the Text Prompt design. Unlike traditional
image translation tasks, where features are easier to
describe, the specific requirements for PBR texture
maps are more abstract. For example, in a standard
task, if you want to transform a daytime image into a
nighttime one, the text prompt "night" suffices.
Similarly, if you want a circle image to be filled with
violet and have an orange background, a text prompt
like "violet circle with orange background" would
work. This is because CLIP, during training, has
aligned abstract concepts like "night" and colors such
as "violet" or "orange." However, when it comes to
more technical terms like height map, it is uncertain
whether CLIP can adequately align with these
concepts. Experimental validation is needed to
determine how well CLIP handles such specialized
terms.
Additionally, it is crucial to assess the quality of
the generated texture maps. This study uses MSE to
evaluate the overall similarity between the generated
and target images, while SSIM measures the
structural similarity. A subjective evaluation of the
rendered textures after model inference is also
employed to assess the practical performance of the
generated images.
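As a minimal sketch, the two metrics can be computed per image pair with NumPy and scikit-image (assuming 8-bit RGB arrays of identical size):

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate_pair(generated, target):
    """Return (MSE, SSIM) between a generated texture map and its ground truth.
    Both inputs are uint8 RGB arrays of identical shape."""
    gen = generated.astype(np.float64)
    tgt = target.astype(np.float64)
    mse = np.mean((gen - tgt) ** 2)
    ssim = structural_similarity(generated, target, channel_axis=-1)  # multichannel SSIM
    return mse, ssim
```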
Ultimately, the modified pix2pix-turbo
architecture is shown in Figure 2.
2.5 Experimental Procedure
2.5.1 MatSynth Dataset Preprocessing
The MatSynth dataset is provided for download in
Parquet format, with a total size of over 400 GB. To
ensure the training process is both efficient and
manageable, the preprocessing steps involved
downloading the dataset and cropping the material
images from 4096x4096 to 512x512. The images
were then categorized by type and stored accordingly,
compressing the dataset from over 400 GB to 8 GB
for easier data transfer and training.
Figure 2: The architecture of the improved pix2pix-turbo model (Photo/Picture credit: Original).
Additionally, the image formats within the dataset
were converted for consistency. To standardize the
input and output formats, all single-channel grayscale
images were converted to RGB three-channel images.
Similarly, RGBA four-channel images were also
converted to RGB three-channel images to maintain
format uniformity, facilitating the model’s processing
of input and output.
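A minimal sketch of these two steps with Pillow is shown below; the output layout and file naming are assumptions rather than the actual preprocessing script.

```python
from pathlib import Path
from PIL import Image

TILE = 512  # crop size used in this study

def preprocess_map(src_path: Path, out_dir: Path) -> None:
    """Tile one 4096x4096 texture map into 512x512 crops and normalize it to 3-channel RGB."""
    out_dir.mkdir(parents=True, exist_ok=True)
    img = Image.open(src_path)

    # Grayscale (L) and RGBA images are both converted to RGB for a uniform model interface.
    if img.mode != "RGB":
        img = img.convert("RGB")

    w, h = img.size
    for row, top in enumerate(range(0, h, TILE)):
        for col, left in enumerate(range(0, w, TILE)):
            tile = img.crop((left, top, left + TILE, top + TILE))
            tile.save(out_dir / f"{src_path.stem}_{row}_{col}.png")
```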
2.5.2 Text Prompt Design
To select the best Text Prompt for fine-grained
alignment of material representations, this study
designed experiments focused on the Height map to
investigate how different text prompts affect the
generation results. For clarity, the experiment
selected the most distinguishable height conditions
from category 11, Terracotta. This category primarily
consists of brick wall structures, making it suitable for
evaluating the effectiveness of different prompts
using both MSE and SSIM metrics, as well as
subjective visual assessment.
The results of the experiment were obtained by
training the model multiple times with different text
prompts and averaging the evaluation metrics. One
prompt simply required converting the base color to
height, while another provided a detailed description
of the height map characteristics and conversion
requirements. Each set of experiments was run three
times, and the average values for MSE and SSIM
were compared to determine the effectiveness of the
text prompts.
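The exact prompt wording is not reproduced here; the strings below are hypothetical illustrations of the two styles that were compared, a bare conversion instruction versus a detailed description of what a height map should look like.

```python
# Style 1: minimal instruction (hypothetical wording)
simple_prompt = "convert the base color to a height map"

# Style 2: detailed description of the target map's characteristics (hypothetical wording)
detailed_prompt = (
    "convert the base color image into a grayscale height map: brighter pixels mean "
    "higher surface points, darker pixels mean lower ones; bricks should appear as "
    "uniform light regions separated by dark mortar lines, with no color tint"
)
```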
2.5.3 Model Training
First, the image classification task was trained using
the yolov8m-cls model. Although yolov8n-cls is
faster and yolov8x-cls is more accurate, the yolov8m-
cls model was chosen for its balance between
accuracy and time efficiency. The training dataset for
the classification task is a subset of the training
dataset for the texture conversion task, and it only
includes base color images. The final training size
was set to 512x512, with the number of epochs
configured to 100.
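With the Ultralytics API, this classifier training can be launched roughly as follows; the dataset directory name is an assumption, while the image size and epoch count match the settings above.

```python
from ultralytics import YOLO

# Start from the pretrained yolov8m-cls checkpoint and fine-tune on the base color images.
model = YOLO("yolov8m-cls.pt")
model.train(
    data="matsynth_basecolor_cls",  # assumed folder with one subfolder per material class
    imgsz=512,                      # training resolution used in this study
    epochs=100,                     # epoch count used in this study
)
```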
There are 13 material categories in the training
set, with each category providing only three types of
textures for training: Height, Normal, and Roughness.
Diffuse textures were excluded because they are
nearly identical to the base color in most categories,
leaving insufficient training samples. Specular and
Opacity textures were also excluded as they generally
consist of solid colors with minimal variation, making
them less valuable for training. Metallic textures carry meaningful information only for the metal
category, so they were not used in training for the other material types.
Additionally, for categories with too few samples
after splitting by type, a universal expert model was
used as a substitute.
After preparing the text prompts for each material
type, the model was trained using an RTX 4090 24GB
GPU. Each texture was 512x512 in size, with a
maximum of 10,000 training steps.
3 EXPERIMENTAL RESULTS
To test the conversion generation capability of the
model, the improved pix2pix-turbo model was
evaluated using the Mean Squared Error (MSE) and
Structural Similarity Index (SSIM) metrics across 13
different material categories. MSE measures the
mean squared error of the pixel differences between
the target image and the converted image, while
SSIM compares the images in higher-level
dimensions such as brightness and contrast.
Generally, for high-quality generated images, MSE
should be lower and SSIM should be higher.
The experiment first compared the effects of
different prompts on the model's performance. Two
types of prompts were used: one that directly
requested conversion and another that provided a
detailed description of the conversion rules and
material characteristics. For the Terracotta category
(category 11), each prompt was used to train the
model three times, and the final conversion results
were averaged based on their performance on the test
set. The results showed that the model trained with
the first prompt had an average MSE of 2868.36 and
an average SSIM of 0.49, while the model trained
with the second prompt achieved an average MSE of
2672.22 and an average SSIM of 0.51. Both metrics
were better for the second prompt, indicating that the
quality of the generated images improved with more
detailed and specific prompts.
This difference is evident from the height maps
generated, as shown in the images below. The first
prompt did not specify that the height map should be
black and white, which led to a lower penalty from
the CLIP model for non-black-and-white colors in the
generated images. This resulted in some parts of the
generated images not being black and white, directly
affecting the MSE and SSIM scores. The final
generated material effects are shown in Figure 3.
Figure 3: Comparison of Training Results with Different
Text Prompts: The left image shows the target image, the
middle image shows the converted image generated after a
detailed description of conversion rules and characteristics,
and the right image shows the converted image generated
after providing a simple conversion instruction
(Photo/Picture credit: Original).
In addition, the experiment was conducted on the
unmodified pix2pix-turbo method to compare the
image conversion results in the PBR material domain
before and after the model improvement. In Figure 4,
the upper image is a comparison of MSE metrics,
with the x-axis representing material category
numbers and the y-axis representing MSE. The blue
line shows the MSE of the original model for Height,
the orange line shows the MSE of the improved
model for Height, and the green line shows the MSE
of the improved model for Normal. Similarly, the
lower image is a comparison of SSIM metrics, with
the x-axis representing material category numbers
and the y-axis representing SSIM. The blue line
shows the SSIM of the original model for Height, the
orange line shows the SSIM of the improved model
for Height, and the green line shows the SSIM of the
improved model for Normal.
From Figure 4, the MSE comparison results for
categories 5, 6, and 7 show that because the training
set was divided by category, some categories did not
have enough training data to effectively support the
domain-specific expert models. As a result, the full-
domain expert model was used as an alternative,
which led to some material types performing on par
with the original model, while others, such as
categories 1, 3, and 8, achieved better results.
In some cases, the MSE for height maps is relatively poor while the SSIM is good. This is because the
model recovers the relative height variation of the surface but cannot accurately determine the absolute
height differences, which leads to larger pixel-wise errors even though the structural information is
still transferred well.
Additionally, it's worth noting that for the normal
map of category 5, marble, the MSE is lower and the
SSIM is exceptionally high. This is because the
marble surface has fewer protrusions, which leads to
better results in the comparison.
The subjective visual comparison is shown in
Figure 5. Judging from the visual results, the details
produced by the original pix2pix model are the
poorest, with many artifacts and jagged edges, and the
solid color areas are inadequately filled. The original
pix2pix-turbo model, lacking expert and
classification systems, failed to effectively represent
the height variations, resulting in nearly grayscale
outputs. In contrast, the improved pix2pix-turbo
model generates clearer height maps.
Figure 4: Comparison of MSE and SSIM Results between the Improved Model and the Original Model: The upper chart
shows the MSE comparison and the lower chart shows the SSIM comparison. The blue line represents the original model,
while the orange and green lines represent the improved model (Photo/Picture credit: Original).
Figure 5: Comparison of height maps generated by different
models. The left image shows the conversion result using
the pix2pix model, the middle image shows the result from
the original pix2pix-turbo model, and the right image
displays the result from the improved pix2pix-turbo model
(Photo/Picture credit: Original).
Additionally, the dataset includes many examples
where the source and target images have weak
correlations. For instance, in the Ceramic category, a
significant portion of the materials are tiles. This
means that even if the base color image has complex
patterns, the height map might simply consist of
straightforward square segments. These examples do
not provide the model with meaningful variation
patterns, which contributes to a decline in the final
generated quality.
4 CONCLUSIONS
This paper improves the pix2pix-turbo model by
incorporating multi-layer LoRA for material
generation and introducing a classifier along with a
combination of domain-specific expert models and a
general expert model. The improved model achieved
a minimum MSE of 1181.41 and a maximum SSIM
of 0.614 on the MatSynth dataset's height maps,
which represents a reduction in MSE by 1,443.53 and
an increase in SSIM by 0.13 compared to the original
model. The results demonstrate that the improved
model has a strong capability for PBR material image
conversion, allowing for the rapid generation of high-
quality PBR material images from input basecolor
images.
From the experimental results, it is evident that
the modified pix2pix-turbo model for PBR material
conversion performs better than the original model
and the standard pix2pix model. Additionally, the
CLIP text-image alignment tool shows that more
precise input leads to better material generation
results. However, CLIP may not fully understand
certain terms. For example, the quality of images
generated with the terms "rough" and "smooth" for
roughness material does not match the quality
achieved for height and normal maps.
Future research could focus on the semantic
aspects of images, using other models to evaluate
height, normal, and roughness features in specific
areas. Combining models like pix2pix-turbo with
advanced image semantic understanding could
enhance the realism and accuracy of PBR material
conversion effects, addressing the model's current
limitations in roughness material conversion.
REFERENCES
Guo, Y., Smith, C., Hašan, M., Sunkavalli, K. and Zhao, S.,
2020. MaterialGAN: Reflectance capture using a
generative SVBRDF model. arXiv preprint
arXiv:2010.00114.
Hu, Y., Hašan, M., Guerrero, P., Rushmeier, H. and
Deschaintre, V., 2022. Controlling material
appearance by examples. Computer Graphics Forum,
41(4), pp.117-128.
Isola, P., Zhu, J. Y., Zhou, T. and Efros, A. A., 2017. Image-
to-image translation with conditional adversarial
networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp.1125-
1134.
Martin, R., Roullier, A., Rouffet, R., Kaiser, A. and
Boubekeur, T., 2022. MaterIA: Single image high
resolution material capture in the wild. Computer
Graphics Forum, 41(2), pp.163-177.
Parmar, G., Park, T., Narasimhan, S. and Zhu, J. Y., 2024.
One-step image translation with text-to-image models.
arXiv preprint arXiv:2403.12036.
Pharr, M. and Humphreys, G., 2004. Physically based
rendering: From theory to implementation.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S. and Sutskever, I., 2021. Learning
transferable visual models from natural language
supervision. In International Conference on Machine
Learning, pp.8748-8763. PMLR.
Siddiqui, Y., Monnier, T., Kokkinos, F., Kariya, M.,
Kleiman, Y., Garreau, E. and Novotny, D., 2024. Meta
3D AssetGen: Text-to-mesh generation with high-
quality geometry, texture, and PBR materials. arXiv
preprint arXiv:2407.02445.
Vecchio, G., 2024. StableMaterials: Enhancing diversity in
material generation via semi-supervised learning.
arXiv preprint arXiv:2406.09293.
Vecchio, G. and Deschaintre, V., 2024. MatSynth: A
modern PBR materials dataset. In Proceedings of the
IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp.22109-22118.
Vecchio, G., Martin, R., Roullier, A., Kaiser, A., Rouffet,
R., Deschaintre, V. and Boubekeur, T., 2023.
Controlmat: A controlled generative approach to
material capture. ACM Transactions on Graphics.