Image Inpainting on the Sketch-Pencil Domain with Vision Transformers
Jose Luis Flores Campana, Luís Gustavo Lorgus Decker, Marcos Roberto e Souza,
Helena de Almeida Maia and Helio Pedrini
Institute of Computing, University of Campinas, Campinas, SP, 13083-852, Brazil
Keywords:
Image Inpainting, Sketch-Pencil, Image Processing, Transformers.
Abstract:
Image inpainting aims to realistically fill missing regions in images, which requires both structural and tex-
tural understanding. Traditionally, methods in the literature have employed Convolutional Neural Networks
(CNN), especially Generative Adversarial Networks (GAN), to restore missing regions in a coherent and re-
liable manner. However, CNNs’ limited receptive fields can sometimes result in unreliable outcomes due to
their inability to capture the broader context of the image. Transformer-based models, on the other hand,
can learn long-range dependencies through self-attention mechanisms. In order to generate more consistent
results, some approaches have further incorporated auxiliary information to guide the model’s understanding
of structural information. In this work, we propose a new method for image inpainting that uses sketch-
pencil information to guide the restoration of structural, as well as textural elements. Unlike previous works
that employ edges, lines, or segmentation maps, we leverage the sketch-pencil domain and the capabilities of
Transformers to learn long-range dependencies to properly match structural and textural information, resulting
in more consistent results. Experimental results show the effectiveness of our approach, demonstrating either
superior or competitive performance when compared to existing methods, especially in scenarios involving
complex images and large missing areas.
1 INTRODUCTION
Image inpainting is a task that aims to fill unknown re-
gions of a damaged image. Over the years, the signifi-
cance of image inpainting has grown considerably, as
it has found applications in various real-world scenar-
ios such as object removal (Zeng et al., 2020), photo
restoration (Wan et al., 2020), and image editing (Yu
et al., 2019). The major challenge of this task lies
in the necessity to properly restore missing regions
with content that is both visually realistic and seman-
tically plausible. To achieve this, the restoration pro-
cess must encompass not only the structural aspects
but also the textural nuances of the missing regions,
ensuring coherence between them in the global con-
text of the image.
Several approaches have been proposed to pur-
sue realism in image inpainting (Ghorai et al., 2019;
Gamini and Kumar, 2019; Liu et al., 2020; Yu et al.,
2018; Nazeri et al., 2019; Dong et al., 2022; Liao
et al., 2021; Suvorov et al., 2022; Yang et al., 2020).
Classical approaches focus on restoring missing re-
gions using diffusion-based and patch-based mod-
els (Ghorai et al., 2019; Gamini and Kumar, 2019).
However, these approaches struggle to restore plausible structures and realistic textures because they ignore the global context of the image.
More recently, approaches based on convolutional
neural networks (CNN) and generative adversarial
networks (GAN) have emerged to address these prob-
lems. However, these approaches still present some
challenges:
(i) limited receptive fields: this limitation makes it harder to restore semantically coherent structures, since CNNs struggle to capture the broader context of the image;
(ii) complex models: handling large masks often requires models that capture the global context of the image to produce high-quality results; however, such models rely on multiple components, making them more complex and demanding more parameters and training time;
(iii) incomplete structures: CNN-based models can produce incomplete results due to a lack of understanding of structural information, such as edges or lines, that guides the coherent restoration of the image.
In response, some methods employ various convo-
lution and upsampling operators (Liu et al., 2020) or
even dilated convolutions (Yu et al., 2018) to restore
the global context. Unfortunately, these strategies of-
ten result in the creation of duplicate patterns or blurry
artifacts. Other methods have used strategies such as
wavelets (Yu et al., 2021) or contextual attention (Yu
et al., 2018) to capture global context without the need
for multiple convolution operators. However, these
methods can lead to the generation of artifacts for
complex patterns. Some others have employed aux-
iliary information such as edges (Nazeri et al., 2019),
lines (Dong et al., 2022), segmentation maps (Liao
et al., 2021), gradients (Yang et al., 2020) to guide
the structural or texture restoration of the inpainting
model. Nevertheless, applying semantically incorrect
auxiliary information to image inpainting models can
lead to inconsistent results.
Therefore, these challenges motivate the creation
of a model capable of inferring auxiliary structural
and texture information consistently to guide our in-
painting model to restore the damaged regions in a se-
mantically coherent and visually detailed way. More
specifically, we employ a Transformer-based model to
learn the auxiliary information, since Transformers can model long-range dependencies, in contrast to the limited receptive fields of CNNs. In
addition, we use new auxiliary information extracted
from the sketch-pencil domain (also known as hand-
drawn sketch or pencil drawing). This domain helps
us to effectively encapsulate the structural informa-
tion, as well as infer better texture content to guide
the inpainting model to obtain coherent and detailed
results.
In our proposed method, a damaged image is con-
verted into the sketch-pencil domain. This feeds a
Transformer Structure Texture Restoration (TSTR)
model based on the architecture proposed by Cam-
pana et al. (2022). Our TSTR model employs a patch partitioning strategy to capture relevant structural information, such as edges, and texture content from the global context of the image to restore the missing regions. Subsequently, an Efficient Transformer Inpainting (ETI) model (Campana et al., 2022) is utilized to predict the structure and texture guided by the restored sketch-pencil image, generating an inpainted image from (i) the restored sketch-pencil image and (ii) the damaged image (Figure 1). In ad-
dition, both models employ the patch self-attention strategy (Campana et al., 2022) to reduce memory consumption and computational cost compared to the global self-attention approach (Dosovitskiy et al.,
2021).
The main contribution of this work is an image
Figure 1: Approach Overview. Left: Input image with
missing regions. The missing regions are represented in
the white pixels. Center: Inpainted sketch-pencil image.
The input image is converted into the sketch-pencil domain,
resulting in a damaged sketch-pencil image. Our sketch
model inpaints the damaged sketch-pencil regions to obtain
the inpainted sketch-pencil image. Right: Image inpaint-
ing results of the proposed inpainting model. The inpainted
sketch-pencil image combined with the damaged image is
employed to restore the missing region and obtain the in-
painted image.
inpainting method based on Vision Transformers that uses auxiliary information extracted from the sketch-
pencil domain to guide a semantically reliable and
visually realistic restoration. We conducted experi-
ments on Places2, PSV, and CelebA datasets. Our re-
sults outperformed state-of-the-art competitors in per-
ceptual measures, namely FID and LPIPS, for the latter two datasets and achieved competitive results on the
former.
This text is organized as follows. Section 2
presents recent image inpainting methods relevant to
this work, including those that utilize auxiliary information to guide the image inpainting task and those
based on vision transformers. Section 3 thoroughly
describes our approach using two vision transform-
ers for image inpainting. Section 4 presents our re-
sults, along with information about datasets and train-
ing implementation. We conducted quantitative and
qualitative comparisons with state-of-the-art meth-
ods. Section 5 provides an ablation study of the im-
ages in the sketch-pencil domain. Finally, we present
our conclusions in Section 6.
2 BACKGROUND
Image Inpainting based on Auxiliary Information.
Some works used auxiliary information sources to
deal with difficult situations in image inpainting, such
as large masks and complex elements. This additional
information may include edges, lines, gradients, or
segmentation maps, which guide the inpainting pro-
cess. For example, Nazeri et al. (2019) proposed a
two-step network called EdgeConnect. In the first
step, it fills an edge map computed by the Canny edge
detector. Then, it predicts the inpainted image us-
ing both the restored edge map and the original in-
put image. In another approach, Dong et al. (2022)
presented a method called ZITS, which consists of
two main components. The first component, Trans-
former Structural Restoration, completes the edges
and lines predicted from the damaged image. The sec-
ond component, Fourier Texture Restoration, uses the
inpainted edges and lines, along with the damaged im-
age, to achieve the final inpainted image. In this work,
we employ auxiliary information extracted from the
sketch-pencil domain, which provides more structurally consistent information as well as texture details compared to other auxiliary information, such as edges.
Image Inpainting based on Transformers. Vision
Transformers have gained great popularity in com-
puter vision due to their exceptional performance
across a range of tasks, including image classifi-
cation (Dosovitskiy et al., 2021) and semantic seg-
mentation (Strudel et al., 2021), among others. In
the context of image inpainting, transformers have
emerged as powerful alternatives to methods based
on Convolutional and Generative Adversarial Net-
works, primarily due to their self-attention mecha-
nism, which enables the capture of global context.
Li et al. (2022) introduced a Transformer-based ap-
proach that employs dynamic masks to effectively
handle large masks. Meanwhile, Cao et al. (2022)
proposed a method based on the Masked Autoen-
coders (MAE) (He et al., 2022), where features ex-
tracted from the MAE model are utilized in the
Attention-based CNN Restoration (ACR) model to
learn the intricacies of reconstructing missing regions.
Campana et al. (2022) proposed a model based on
Transformers that uses different patch sizes and a variable number of heads in the self-attention mechanism to capture the global context of the image efficiently in terms of training and inference time. Building on the latter, we propose here the use of Transformers in the sketch-pencil domain to infer the structural information by leveraging the global context of the image. This infor-
mation helps our inpainting model to generate coher-
ent and detailed results.
3 PROPOSED METHOD
This section describes the main steps of our sketch-
pencil image inpainting method using Vision Trans-
formers.
Overview. Figure 2 illustrates the proposed
pipeline. Our Transformer Structure Texture Restora-
tion (TSTR) computes the inpainted sketch-pencil
image $\hat{I}_s = \mathrm{TSTR}(I_d, I_s, M)$ (Section 3.1) from the inputs: the damaged image $I_d$, the damaged sketch-pencil image $I_s$, and a binary mask $M$. The Efficient Transformer Inpainting (ETI) model then calculates the inpainted image $I_{out} = \mathrm{ETI}(I_d, \hat{I}_s)$ (Section 3.2), guided by the inpainted sketch-pencil image and taking the damaged image as input.
3.1 Transformer Structure Texture
Restoration
By leveraging the inherent capacity of Transformers
to capture global context information (Dosovitskiy
et al., 2021), we adopted the work proposed by Cam-
pana et al. (2022, 2023) as our baseline. The Trans-
former Structure Texture Restoration framework is
employed to guide the restoration of the inpainting
model by improving the structural and texture infor-
mation in the sketch-pencil domain.
A sketch image is an artistic visual effect that re-
sembles a hand-drawn sketch or a pencil drawing (Qiu
et al., 2019). This effect can be achieved through var-
ious techniques in image processing and computer vi-
sion fields. Figure 3 presents the process of converting an input image into an image in the sketch-pencil domain, or sketch-pencil image¹.
How Much Sketch-Pencil Information Is Needed?
We explored the appropriate amount of edge, line, and texture information that our sketch-pencil image should retain for better final inpainting. This amount is controlled by the Gaussian filter parameter (δ).
Figure 4 shows the impact of δ in the sketch-pencil
image. We chose δ = 21 since it produces darker
and thicker edges and lines, making them suitable for
shading and adding texture to our inpainting model.
This appropriate selection of δ value enabled us to op-
timize the guidance provided by the sketch-pencil do-
main for both structural and textural restoration dur-
ing the image inpainting process.
Proposed Framework. Given the original image
and its inpainting mask M, both with a size of
256×256 pixels, our first step is to compute the
sketch-pencil image and its damaged version $I_s$. Subsequently, TSTR uses $I_s$ and $M$ to compute the inpainted sketch-pencil image $\hat{I}_s$. The TSTR archi-
tecture is composed of an encoder, eight transformer
blocks, and a decoder. A description of these compo-
nents is provided in the following subsections.
¹ https://github.com/rra94/sketchify/tree/master
[Figure 2 diagram labels: the Sketch Transformer Block (×4) contains Patch Self-Attention Blocks with embedding 256, patch sizes (2,2) and (4,4), and 4 heads each; the Inpainting Transformer Blocks (×2 each) contain Patch Self-Attention Blocks with embedding 256, patch sizes (4,4), (8,8), (16,16), and (32,32), with 8, 4, 2, and 1 heads, respectively, followed by a 3×3 convolution; inputs: damaged image, damaged sketch, mask; outputs: inpainted sketch, inpainted image.]
Figure 2: An illustration of our image inpainting method based on vision transformers. Left: TSTR restores the damaged sketch-pencil image from inputs including the damaged image and mask. Right: ETI computes the inpainted image by using the restored sketch-pencil image and the damaged image.
Figure 3: Conversion of an input image into the sketch-
pencil domain using image processing techniques. (1) The
input image is transformed into a gray-scale image. (2) The
gray image is inverted. (3) Gaussian blur is applied to the
inverted image. (4) The smoothed inverted image is inverted back, producing a smoothed gray image. (5) The sketch-pencil image is computed by blending the smoothed gray image with the gray image.
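For concreteness, the conversion above can be reproduced with a few OpenCV calls. The snippet below is a minimal sketch assuming the color-dodge blend used by the sketchify repository referenced in the footnote; the function name and the kernel_size parameter are ours, and the exact blending used in our pipeline may differ slightly.

```python
import cv2

def to_sketch_pencil(image_bgr, kernel_size=21):
    """Convert an image to the sketch-pencil domain (steps 1-5 of Figure 3).

    kernel_size is the Gaussian filter size: a larger value (e.g., 21) keeps
    darker, thicker edges and more shading; a smaller one (e.g., 7) keeps less.
    """
    # (1) Convert the input image to gray-scale.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # (2) Invert the gray image.
    inverted = 255 - gray
    # (3) Apply Gaussian blur to the inverted image.
    blurred = cv2.GaussianBlur(inverted, (kernel_size, kernel_size), 0)
    # (4) Invert the smoothed image back, yielding a smoothed gray image.
    smoothed_gray = 255 - blurred
    # (5) Blend the gray image with the smoothed gray image (color dodge).
    return cv2.divide(gray, smoothed_gray, scale=256.0)
```

Applying this procedure to the damaged image yields the damaged sketch-pencil image $I_s$ that feeds TSTR (Figure 1).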
[Figure 4 panels: input image; GaussianBlur with filter size 7; GaussianBlur with filter size 21.]
Figure 4: Illustration showing the degree of structural and
texture information that can be recovered by using a larger
or smaller Gaussian filter.
3.1.1 Encoder-Decoder
We used two convolutional layers on both the encoder
and decoder to reduce computations and memory us-
age in transformers. For each layer, we employed
LeakyReLU as the activation function and Instance
Normalization to help stabilize the training process
and learn better representations. Furthermore, con-
volutional layers may be particularly advantageous to
effectively capture structural information, leading to
a better representation and optimization (Raghu et al.,
2021).
3.1.2 Sketch-Pencil Transformer Blocks
We used the patch self-attention mechanism (Cam-
pana et al., 2022), which aims to capture the global
image context while reducing memory costs in both
training and inference. We employ four pairs of trans-
former blocks, and adopt a multi-scale patch parti-
tioning strategy in each one (Figure 5).
In the first and second blocks, we strategically
vary the patch size to strike a balance between captur-
ing the global context and enhancing computational
efficiency during both training and inference. This
approach not only optimizes memory usage but also
contributes to more effective structural and texture in-
formation restoration from the sketch domain, which,
in turn, guides our image inpainting model to obtain
coherent and reliable results.
We define the set of patch sizes denoted as P =
{2, 4} for each group of transformer blocks. This
selection ensures that the information belonging to
edges and lines from the sketch domain is adequately
restored, aligning with the requirements of our in-
painting process.
Concerning the number of attention heads in our
model, it is worth noting that a larger number can en-
able simultaneous focus on different regions of the
image, enhancing the model’s capacity to learn intri-
Figure 5: Patch self-attention mechanism is used to attend
to the missing pixels by capturing the global context of the
image based on the distance between valid pixels of each
patch.
cate patterns. However, this also requires an increase
in the number of parameters, resulting in higher com-
putational costs during both training and inference,
including memory usage.
In our model, we maintain a consistent number of
heads, which is denoted as K = {4, 4}. This choice
strikes a balance between model effectiveness and
computational efficiency, ensuring that the proposed
model can effectively learn complex patterns while
remaining manageable in terms of memory and com-
putation.
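To make the patch self-attention idea concrete, the sketch below partitions a feature map into non-overlapping patches, treats each flattened patch as a token, and applies multi-head self-attention over the tokens. It is an illustrative approximation of the block of Campana et al. (2022), not their implementation: the class name, the residual placement, and the use of nn.MultiheadAttention are our assumptions.

```python
import torch
import torch.nn as nn

class PatchSelfAttentionBlock(nn.Module):
    """Illustrative patch self-attention block (hypothetical implementation)."""

    def __init__(self, channels=256, patch_size=4, num_heads=4):
        super().__init__()
        self.p = patch_size
        token_dim = channels * patch_size * patch_size
        self.norm = nn.LayerNorm(token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.p                                         # H and W must be divisible by p
        # Partition the feature map into (H/p * W/p) tokens of size C*p*p.
        tokens = x.unfold(2, p, p).unfold(3, p, p)         # (B, C, H/p, W/p, p, p)
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # Self-attention over all patch tokens captures the global context.
        t = self.norm(tokens)
        attn_out, _ = self.attn(t, t, t)
        tokens = tokens + attn_out                         # residual connection
        # Fold the tokens back into a (B, C, H, W) feature map.
        tokens = tokens.reshape(b, h // p, w // p, c, p, p)
        return tokens.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```

Larger patches reduce the number of tokens, so the attention matrix shrinks quadratically, which is where the savings over global pixel-wise self-attention come from.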
3.2 Efficient Transformer Inpainting
In our approach, we adopt the same model configura-
tion as presented by Campana et al. (2022). Every two Inpainting Transformer Blocks, we increase the patch size, following p = {4, 8, 16, 32}, while decreasing the number of attention heads, following k = {8, 4, 2, 1}.
The inpainting process unfolds as follows. Given
a damaged image $I_d$, a mask $M$, and an inpainted sketch-pencil image $\hat{I}_s$ as input, all at a 256×256 pixel resolution, these components are jointly passed to the Efficient Transformer Inpainting (ETI) model. ETI is responsible for the restoration of both structural and textural information using prior information from the sketch-pencil domain, yielding the inpainted image, denoted as $I_{out}$, that seamlessly integrates visually realistic content.
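The patch-size and head schedule described above can be spelled out explicitly. The snippet below simply enumerates it, assuming two Inpainting Transformer Blocks per scale as depicted in Figure 2 (the variable name is ours).

```python
# Patch size doubles every two Inpainting Transformer Blocks while the
# number of attention heads is halved (Section 3.2).
eti_schedule = []
for patch_size, num_heads in zip((4, 8, 16, 32), (8, 4, 2, 1)):
    eti_schedule.extend([(patch_size, num_heads)] * 2)  # two blocks per scale

print(eti_schedule)
# [(4, 8), (4, 8), (8, 4), (8, 4), (16, 2), (16, 2), (32, 1), (32, 1)]
```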
3.3 Loss Functions
We adopt the same loss functions as those described by Campana et al. (2022), expressed as

$\mathcal{L}_{total} = \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{style}\,\mathcal{L}_{style} + \lambda_{perc}\,\mathcal{L}_{perc} + \lambda_{adv}\,\mathcal{L}_{adv}$  (1)

where $\lambda_{rec} = 1$, $\lambda_{style} = 90$ for TSTR and 360 for ETI, $\lambda_{perc} = 1.5$ for TSTR and 0.9 for ETI, and $\lambda_{adv} = 0.01$. We assigned higher weights to the hole, valid, and perceptual losses for TSTR, aiming to emphasize the structural aspects. In contrast, we set a higher weight for the style loss on ETI to emphasize the restoration of texture details. We define each term in the following paragraphs.
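As a worked instance of Eq. (1) with the weights reported above, a small helper could look like the following (the function and argument names are ours):

```python
def total_loss(l_rec, l_style, l_perc, l_adv, model="ETI"):
    """Weighted sum of Eq. (1) using the lambda values reported in the text."""
    lam_rec, lam_adv = 1.0, 0.01
    lam_style = 360.0 if model == "ETI" else 90.0   # 90 for TSTR
    lam_perc = 0.9 if model == "ETI" else 1.5       # 1.5 for TSTR
    return lam_rec * l_rec + lam_style * l_style + lam_perc * l_perc + lam_adv * l_adv
```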
Reconstruction Loss ($\mathcal{L}_{rec}$). This loss ensures coherence between the inpainted and surrounding known regions, in addition to approaching the ground-truth information as closely as possible. We defined this loss as the sum of the hole loss $\mathcal{L}_{hole}$ and the valid loss $\mathcal{L}_{valid}$:

$\mathcal{L}_{hole} = \frac{1}{N_{I-M}} \left\lVert (I - M) \odot (I_{out} - I_{gt}) \right\rVert_1$  (2)

$\mathcal{L}_{valid} = \frac{1}{N_{M}} \left\lVert M \odot (I_{out} - I_{gt}) \right\rVert_1$  (3)

$\mathcal{L}_{rec} = \mathcal{L}_{hole} + \mathcal{L}_{valid}$  (4)

where $I$ denotes the identity matrix, $\odot$ denotes element-wise multiplication, and $N_{I-M}$ and $N_{M}$ denote the number of hole and valid pixels in $M$, respectively.
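A minimal PyTorch sketch of Eqs. (2)-(4) is shown below. It assumes the mask convention implied by the equations (M equal to 1 on valid pixels and 0 on holes); if the mask marks holes with 1, as in Figure 1, the two terms swap roles.

```python
import torch

def reconstruction_loss(i_out, i_gt, mask):
    """L_rec = L_hole + L_valid (Eqs. 2-4).

    mask: tensor broadcastable to i_out, 1 on valid pixels and 0 on holes
    (assumed convention; flip it if the mask marks holes with 1).
    """
    hole = 1.0 - mask
    l_hole = (hole * (i_out - i_gt)).abs().sum() / hole.sum().clamp(min=1.0)
    l_valid = (mask * (i_out - i_gt)).abs().sum() / mask.sum().clamp(min=1.0)
    return l_hole + l_valid
```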
Perceptual ($\mathcal{L}_{perc}$) and Style ($\mathcal{L}_{style}$) Losses. The perceptual loss ($\mathcal{L}_{perc}$) encourages the inpainted image $I_{out}$ to match the overall visual appearance of the ground-truth image $I_{gt}$. In turn, the style loss ($\mathcal{L}_{style}$) helps to preserve and match the texture characteristics of the ground-truth image $I_{gt}$. We defined the perceptual and style losses, respectively, as

$\mathcal{L}_{perc} = \sum_{i} \frac{\left\lVert \phi_i(I_{out}) - \phi_i(I_{gt}) \right\rVert_1}{N_{\phi_i(I_{gt})}}$  (5)

and

$\mathcal{L}_{style} = \sum_{i} \frac{\left\lVert \omega_i(I_{out}) - \omega_i(I_{gt}) \right\rVert_1}{N_{\omega_i(I_{gt})}},$  (6)

where $\phi_i(\cdot)$, $i = 1, \ldots, 5$, denote the activation maps ReLU1_1, ReLU2_1, ReLU3_1, ReLU4_1, and ReLU5_1 from a pre-trained VGG-16 (Simonyan and Zisserman, 2015); $\omega_i(I) = \phi_i(I)^T \phi_i(I)$ denotes the Gram matrix formed by four activation maps from VGG-16: ReLU2_2, ReLU3_3, ReLU4_3, and ReLU5_2; and $N_{\phi_i(I_{gt})}$ and $N_{\omega_i(I_{gt})}$ denote the dimensions of the feature maps $\phi_i(I_{gt})$ and $\omega_i(I_{gt})$, which are used as normalization factors.
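The sketch below shows one way to realize Eqs. (5)-(6) on top of torchvision's pre-trained VGG-16. The layer indices are our mapping of the named ReLU activations onto torchvision.models.vgg16().features, and the Gram matrix is computed in its usual C×C form; treat both as assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualStyleLoss(nn.Module):
    """Illustrative perceptual (Eq. 5) and style (Eq. 6) losses over VGG-16."""

    PERC_LAYERS = {1: "relu1_1", 6: "relu2_1", 11: "relu3_1", 18: "relu4_1", 25: "relu5_1"}
    STYLE_LAYERS = {8: "relu2_2", 15: "relu3_3", 22: "relu4_3", 27: "relu5_2"}

    def __init__(self):
        super().__init__()
        self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for param in self.vgg.parameters():
            param.requires_grad_(False)

    def _features(self, x):
        feats = {}
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.PERC_LAYERS or idx in self.STYLE_LAYERS:
                feats[idx] = x
        return feats

    @staticmethod
    def _gram(f):                                  # (B, C, H, W) -> (B, C, C)
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2)

    def forward(self, i_out, i_gt):
        f_out, f_gt = self._features(i_out), self._features(i_gt)
        # L1 differences normalized by the size of the ground-truth feature maps.
        l_perc = sum((f_out[i] - f_gt[i]).abs().sum() / f_gt[i].numel()
                     for i in self.PERC_LAYERS)
        l_style = sum((self._gram(f_out[i]) - self._gram(f_gt[i])).abs().sum()
                      / self._gram(f_gt[i]).numel()
                      for i in self.STYLE_LAYERS)
        return l_perc, l_style
```

In practice, $I_{out}$ and $I_{gt}$ should be normalized the way the VGG weights expect (ImageNet mean and standard deviation), which is omitted here for brevity.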
Adversarial Loss ($\mathcal{L}_{adv}$). This loss forces the generated inpainted image $I_{out}$ to be indistinguishable from the ground-truth image $I_{gt}$. We employ the adversarial loss proposed by Goodfellow et al. (2014), defined as

$\mathcal{L}_{G} = -\mathbb{E}_{I_{out}}[\log(D(I_{out}))],$  (7)

$\mathcal{L}_{D} = -\mathbb{E}_{I_{gt}}[\log(D(I_{gt}))] - \mathbb{E}_{I_{out}}[\log(1 - D(I_{out}))],$  (8)

$\mathcal{L}_{adv} = \mathcal{L}_{G} + \mathcal{L}_{D},$  (9)

where $\mathcal{L}_{G}$ and $\mathcal{L}_{D}$ represent the loss functions of the generator and the discriminator, respectively.
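Eqs. (7)-(9) correspond to the standard GAN objective; a compact PyTorch rendering, assuming a discriminator that outputs raw logits, could be:

```python
import torch
import torch.nn.functional as F

def adversarial_losses(d_real, d_fake):
    """Vanilla GAN losses of Eqs. (7)-(8); BCE-with-logits implements the log terms stably."""
    # Generator: minimize -log D(I_out).
    l_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    # Discriminator: minimize -log D(I_gt) - log(1 - D(I_out)).
    l_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
           + F.binary_cross_entropy_with_logits(d_fake.detach(), torch.zeros_like(d_fake)))
    return l_g, l_d
```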
4 EXPERIMENTS
In this section, we present our experimental results.
First, we briefly describe the datasets used and some
implementation details. Then, we report and discuss
both our quantitative and qualitative results.
4.1 Datasets
For our experiments, we utilized three well-
established datasets commonly used in inpainting re-
search: Places2 (Zhou et al., 2017), which encom-
passes images from 365 diverse scene categories;
CelebA (Liu et al., 2015), comprising facial images
of celebrities; and Paris StreetView (PSV) (Doersch
et al., 2015), which includes street views and build-
ings from Paris. In the case of Places2, we em-
ployed the 1.8 million images from the training set
and 36.5 thousand images from the validation set
for training and evaluation, respectively. These sets
were obtained from the Places2 dataset available at http://places2.csail.mit.edu/index.html. CelebA contains 202,599 images and PSV contains 15,000 images in total, so we split each into training and validation sets, and the reported results are based on this single training-validation split. Specifically, for CelebA, we used approximately 162,700 images for training and 19,961 images for validation. For PSV, we employed 14,900 images for training and 100 images for validation.
We used irregular masks generated online during
training. For validation, we employed the mask set
defined by Liu et al. (2018), which consists of 12K ir-
regular masks equally divided into six intervals based
on hole size. In this study, we employed only three
intervals: 20-30%, 30-40%, and 40-50%.
4.2 Implementation Details
Our method was implemented using PyTorch. We
set the batch size as 16 and resized the input image
to 256×256 for both TSTR and ETI. We trained the
TSTR and ETI using the Adam optimizer with $\beta_1 = 0.99$ and $\beta_2 = 0.9$.
We trained TSTR for 75, 50, and 40 epochs and set the initial learning rate to $10^{-5}$, $10^{-4}$, and $10^{-4}$, respectively, for Places2, CelebA, and PSV. Additionally, we decayed these learning rates by a factor of $10^{-1}$ in the last 5, 10, and 10 epochs for Places2, CelebA, and PSV, respectively. For ETI, we used 80, 75, and 75 epochs, respectively, for Places2, CelebA, and PSV. The initial learning rate was set to $10^{-4}$ and was decayed in the same manner as during the training of TSTR.
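For reference, the optimizer and decay schedule of this section translate directly into PyTorch. The placeholder model and the 50/10 epoch split below match the CelebA TSTR setting and are illustrative only.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)  # placeholder for the TSTR/ETI network
# Adam with the betas reported in the text.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.99, 0.9))

# Decay the learning rate by a factor of 10^-1 for the final epochs,
# e.g., 50 total epochs with the last 10 decayed (CelebA, TSTR).
total_epochs, decay_epochs = 50, 10
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[total_epochs - decay_epochs], gamma=0.1)

for epoch in range(total_epochs):
    # ... one training epoch over the dataset would run here ...
    scheduler.step()
```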
4.3 Quantitative Comparison
Inpainting Results. To assess our experiments, we
used four well-established metrics: Peak Signal-
to-Noise Ratio (PSNR), Structural Similarity Index
(SSIM), Fréchet Inception Distance (FID) (Heusel
et al., 2017) and Learned Perceptual Image Patch
Similarity (LPIPS) (Zhang et al., 2018). PSNR and
SSIM are simpler measures that assess image similar-
ity according to the ground truth. On the other hand,
FID and LPIPS are particularly important for evaluat-
ing the perceptual realism of the inpainted regions.
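As an illustration of how these metrics are typically computed per image pair (not the exact evaluation code used here), PSNR and SSIM are available in scikit-image and LPIPS via the lpips package; FID, in contrast, is computed over the full image sets and is usually obtained with a dedicated tool.

```python
import torch
import lpips                      # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred_rgb, gt_rgb, lpips_net):
    """pred_rgb, gt_rgb: uint8 HxWx3 arrays; lpips_net: an lpips.LPIPS instance."""
    psnr = peak_signal_noise_ratio(gt_rgb, pred_rgb, data_range=255)
    ssim = structural_similarity(gt_rgb, pred_rgb, channel_axis=-1, data_range=255)

    def to_tensor(x):
        # LPIPS expects NCHW tensors scaled to [-1, 1].
        return torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    lpips_score = lpips_net(to_tensor(pred_rgb), to_tensor(gt_rgb)).item()
    return psnr, ssim, lpips_score

# Example usage:
# net = lpips.LPIPS(net="alex")
# psnr, ssim, d = evaluate_pair(pred, gt, net)
```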
For the Places2 dataset (Table 1), our method showed better results than our baseline (Campana et al., 2022), pri-
marily due to the incorporation of sketch-pencil do-
main information. However, ZITS achieved the best
results for the perceptual measures, generating high-
quality inpainted images. In addition, LaMa outper-
formed our method by a slight margin in terms of FID
but was outperformed by us in terms of LPIPS.
For CelebA and PSV datasets, our method was
ranked among the best for all metrics, but especially
for perceptual ones, FID and LPIPS. This strong per-
formance emphasizes the efficacy of TSTR in gener-
ating highly realistic inpainted images across various
datasets and scenarios.
Sketch-Pencil Results. Table 2 shows quantitative
results related to the sketch-pencil image inpainting
(TSTR) on the Places2, CelebA, and PSV datasets.
These results highlight the effectiveness of our model
in restoring this information coherently.
To evaluate the quality of the restored edges and
lines, as well as texture, we employ the SSIM and
LPIPS metrics. SSIM assesses the structural similar-
ity between the restored edges and lines, with values
closer to 1 indicating greater similarity. On the other
hand, LPIPS measures the perceptual similarity of the
texture content within the sketch-pencil domain be-
tween the real and restored sketch-pencil image, with
Table 1: Comparison of our method against state-of-the-art approaches on Places2, CelebA and Paris Street View. The first
and second-best results are marked in bold and underline, respectively.
Datasets | Methods | PSNR, SSIM, FID, LPIPS (each reported at mask ratios 20-30%, 30-40%, 40-50%)
Places2
Edge-Connect (Nazeri et al., 2019) 24.9439 22.8172 21.1207 0.8661 0.8043 0.7373 2.8315 5.5362 9.9219 0.0841 0.1253 0.1722
CTSDG (Guo et al., 2021) 25.7374 23.4326 21.6453 0.8817 0.8212 0.7552 3.7493 8.6340 16.8813 0.0911 0.1421 0.1992
WaveFill (Yu et al., 2021) 26.1047 23.7590 21.4553 0.8874 0.8274 0.7422 1.3011 3.2134 11.3293 0.0647 0.1028 0.1697
SPL (Zhang et al., 2021) 27.6768 25.2369 23.2940 0.9105 0.8618 0.8064 2.0407 4.5186 8.8990 0.0722 0.1137 0.1616
MADF (Zhu et al., 2021) 26.9094 24.5930 22.7039 0.8938 0.8430 0.7855 1.2426 2.5276 5.1664 0.0897 0.1214 0.1599
Lama (Suvorov et al., 2022) 26.0241 23.9370 22.2043 0.8770 0.8266 0.7701 1.0391 1.6844 2.6772 0.1165 0.1426 0.1747
Patch-Attn (Campana et al., 2022) 26.4769 24.2554 22.3163 0.8923 0.8368 0.7758 1.1783 2.3969 4.6187 0.0650 0.0995 0.1404
ZITS (Dong et al., 2022) 26.3277 24.0073 22.1937 0.8910 0.8359 0.7746 0.9534 1.7659 3.1039 0.0574 0.0889 0.1261
Ours 26.8025 24.4342 22.5445 0.8948 0.8398 0.7779 1.3245 2.7390 5.2472 0.0629 0.0973 0.1386
CelebA
Edge-Connect (Nazeri et al., 2019) 29.1435 26.5719 24.4178 0.9047 0.8662 0.8211 2.4361 3.6728 5.7569 0.0527 0.0755 0.1040
RFR (Li et al., 2020) 29.8901 27.2036 25.0676 0.9280 0.8886 0.8440 1.7047 2.8320 4.4911 0.0431 0.0645 0.0899
CTSDG (Guo et al., 2021) 30.0308 27.1553 24.9321 0.9330 0.8929 0.8473 2.3009 4.3930 7.4196 0.0515 0.0780 0.1090
SPL (Zhang et al., 2021) 32.6547 29.6495 27.2305 0.9539 0.9249 0.8897 1.2756 2.2643 3.5706 0.0421 0.0641 0.0904
MADF (Zhu et al., 2021) 31.8397 28.7059 26.2538 0.9475 0.9135 0.8729 0.7546 1.4399 2.6177 0.0385 0.0563 0.0787
Patch-Attn (Campana et al., 2022) 31.3763 28.7415 26.5915 0.9420 0.9105 0.8740 0.8072 1.4175 2.4025 0.0335 0.0498 0.0697
Ours 32.5599 29.8027 27.4940 0.9482 0.9187 0.8831 0.5761 0.9274 1.5156 0.0310 0.0450 0.0636
PSV
Edge-Connect (Nazeri et al., 2019) 28.6885 26.3160 24.7027 0.8973 0.8478 0.7943 39.9341 50.4303 67.2686 0.0677 0.1027 0.1404
RFR (Li et al., 2020) 28.8133 26.6124 24.8159 0.8999 0.8519 0.7963 30.1260 41.7321 53.7483 0.0617 0.0912 0.1280
CTSDG (Guo et al., 2021) 29.4851 27.0640 25.0938 0.9095 0.8599 0.8013 38.7129 56.2173 76.6186 0.0808 0.1052 0.1498
WaveFill (Yu et al., 2021) 30.1529 27.1075 26.0107 0.9178 0.8740 0.8222 28.2945 38.0996 50.4732 0.0482 0.0737 0.1078
SPL (Zhang et al., 2021) 30.9665 28.4221 26.3540 0.9294 0.8897 0.8407 35.8653 47.9462 69.6496 0.0639 0.0977 0.1415
MADF (Zhu et al., 2021) 30.6575 28.0885 26.0039 0.9247 0.8820 0.8303 24.9763 37.4429 51.7381 0.0565 0.0836 0.1198
Patch-Attn (Campana et al., 2022) 29.9215 27.6332 25.7936 0.9145 0.8722 0.8208 24.9832 36.6138 47.9300 0.0544 0.0794 0.1135
Ours 30.5096 28.0505 26.0226 0.9188 0.8762 0.8242 23.6015 32.9914 44.7338 0.0488 0.0730 0.1059
values closer to zero signifying greater structural and
textural similarity between the two images.
Capitalizing on the restored structural and textu-
ral information, our ETI model effectively guides the
structural and textural restoration process for the dam-
aged image, as shown in Table 1. These results col-
lectively affirm the success of our approach in seam-
lessly restoring both structural and textural elements,
presenting a high inpainting performance.
4.4 Qualitative Comparison
Inpainting Results. Figure 6 compares our quali-
tative results with classical and state-of-the-art meth-
ods. For the CelebA and PSV datasets, our method
consistently shows better structural restoration than
its competitors. We highlight highly detailed textures
in eyes, flowers, and architectural elements. Our se-
mantic reconstruction is competitive, especially when
compared to Patch-Attn.
On the other hand, Edge-Connect shows poor
structural and textural outcomes. RFR and MADF
achieved better semantic results but performed poorly
in recovering the structure and texture of large masks in face and building regions. SPL and Wave-
fill outperformed the aforementioned methods at the
semantic level, but SPL produces overly smoothed
content, while Wavefill introduces artifacts in high-
structural regions. Finally, Patch-Attn improves se-
mantic and textural reconstruction but also produces
some artifacts in large high-textural regions, such as
the flowers in the CelebA dataset.
For the Places2 dataset, our method achieved
competitive visual results compared to ZITS, espe-
cially for large regions requiring inpainting in com-
plex images. However, ZITS demonstrated possibly
the best results in both semantic and textural restora-
tion among all the methods. Edge-Connect performed
poorly, particularly in images featuring rich semanti-
cal and textural content. LaMa showed improved re-
sults compared to Edge-Connect and Patch-Attn but
exhibited artifacts in high-texture areas, such as is-
lands and trees.
Sketch-Pencil Results. Figure 7 shows our quali-
tative results for sketch-pencil images. These results
highlight the good performance of our TSTR model
on Places2, CelebA, and PSV datasets.
For PSV and Places2, our model efficiently re-
stored edges and lines coherently. In CelebA, it
predicted facial features properly, including the eyes
and face. Our method inpainted sketch-pencil im-
Table 2: Quantitative results on Places2, CelebA and PSV for sketch-pencil inpainting.
Method: Sketch-Pencil
Dataset | SSIM: 20-30%, 30-40%, 40-50% | LPIPS: 20-30%, 30-40%, 40-50%
Places2 0.8586 0.7943 0.7248 0.0879 0.1262 0.1698
CelebA 0.9089 0.8626 0.8107 0.0526 0.0754 0.1022
PSV 0.8879 0.8303 0.7636 0.0623 0.0913 0.1300
[Figure 6 panels: Places2 (Input, GT, Edge-Connect, LaMa, ZITS, Patch-Attn, Ours); CelebA (Input, GT, Edge-Connect, RFR, SPL, Patch-Attn, Ours); Paris Street View (Input, GT, Edge-Connect, MADF, Wavefill, Patch-Attn, Ours).]
Figure 6: Comparison of inpainting details between the proposed method and literature approaches on the Places2, CelebA, and Paris Street View datasets.
ages with rich textures successfully, such as those in
Places2, without introducing artifacts that might con-
fuse the ETI in the subsequent task.
5 ABLATION STUDIES
Sketch-Pencil Domain versus Inpainting Quality.
Table 3 shows the inpainting results with and with-
out sketch-pencil information. Our proposed model
achieved better results for every metric when sketch-
pencil information was incorporated into the inpaint-
ing model.
[Figure 7 rows: Places2, CelebA, Paris; columns: GT, Input, Damaged Sketch, Inpainted Sketch.]
Figure 7: Visual examples of inpainted images in the sketch-pencil domain.
Table 3: Ablation study comparing inpainting results with sketch-pencil information (our full proposed model) and without sketch-pencil information (inpainting model trained without sketch information). We employed masks covering 40-50% of the image to evaluate our models.
Metric | Places2 (No / Yes sketch-pencil) | CelebA (No / Yes) | PSV (No / Yes)
PSNR 22.3563 22.5445 26.3606 27.4940 25.7015 26.0226
SSIM 0.7750 0.7779 0.8700 0.8831 0.8194 0.8242
FID 5.2778 5.2472 1.7700 1.5156 47.5351 44.7338
LPIPS 0.1435 0.1386 0.0705 0.0636 0.1160 0.1059
Loss Function for Image Inpainting. Table 4 presents the impact of multiple loss function combinations, namely the reconstruction loss ($\mathcal{L}_{rec}$), style loss ($\mathcal{L}_{style}$), perceptual loss ($\mathcal{L}_{perc}$), and adversarial loss ($\mathcal{L}_{adv}$), for sketch-pencil prediction, using the CelebA dataset. We verified that we achieved the best results when all the losses were combined, which suggests that this combination contributes to the overall quality and realism of the inpainted images.
Sketch-Pencil Domain versus Edges from Edge-
Connect. We conducted an analysis to compare the
impact of using the Canny edge detector (Nazeri et al.,
2019) versus sketch-pencil information. We experi-
mented with this comparison on the CelebA dataset.
We report these results in Table 5. Using sketch-
pencil information improved the inpainting results,
especially due to the enhanced restoration of structural information and more detailed texture compared to the Canny edge detector.
6 CONCLUSIONS
We proposed a method based on Vision Transformers that establishes consistency between structural and textural information through the use of the sketch-pencil domain. Our approach relies on a model that first restores the semantic structural information using edges and lines extracted from the sketch-pencil domain. Furthermore, this model also guides the restoration of the texture of the damaged image using the texture restored in the sketch-pencil domain.
Quantitative assessments based on experimental
results demonstrate the superiority of our approach.
We have achieved remarkable results on benchmark
datasets such as CelebA and Paris StreetView, and our
performance remains highly competitive on Places2
dataset. Moreover, qualitative evaluations reveal the
compelling ability of our method to consistently and
reliably restore both structural and textural elements
Table 4: Ablation study comparing loss function combinations for sketch-pencil inpainting on CelebA.
Losses | FID: 20-30%, 30-40%, 40-50% | LPIPS: 20-30%, 30-40%, 40-50%
L_rec + L_perc + L_adv 0.8326 1.6155 3.3576 0.0348 0.0521 0.0751
L_rec + L_style + L_adv 0.7143 1.1646 1.7654 0.0339 0.0489 0.0686
L_rec + L_style + L_perc + L_adv 0.5761 0.9274 1.5156 0.0310 0.0450 0.0636
Table 5: Ablation study comparing sketch-pencil with Canny edges on CelebA.
Guidance | PSNR: 20-30%, 30-40%, 40-50% | SSIM: 20-30%, 30-40%, 40-50% | FID: 20-30%, 30-40%, 40-50% | LPIPS: 20-30%, 30-40%, 40-50%
Canny 31.2759 28.5871 26.4022 0.9415 0.9091 0.8713 0.6910 1.1469 1.8917 0.0335 0.0500 0.0703
Sketch-pencil 32.5599 29.8027 27.4940 0.9464 0.9187 0.8793 0.6185 1.0203 1.6966 0.0331 0.0485 0.0685
within the missing regions, culminating in visually
pleasing inpainted images.
ACKNOWLEDGEMENTS
The authors would like to thank CNPq (#309330/2018-1) and FAPESP (#2017/12646-3) for their support.
REFERENCES
Campana, J. L. F., Decker, L. G. L., Souza, M. R., Maia,
H. A., and Pedrini, H. (2022). Multi-Scale Patch Par-
titioning for Image Inpainting Based on Visual Trans-
formers. In 35th Conference on Graphics, Patterns
and Images (SIBGRAPI), volume 1, pages 180–185.
IEEE.
Campana, J. L. F., Decker, L. G. L., Souza, M. R.,
Maia, H. A., and Pedrini, H. (2023). Variable-
Hyperparameter Visual Transformer for Efficient Im-
age Inpainting. Computers & Graphics, 113:57–68.
Cao, C., Dong, Q., and Fu, Y. (2022). Learning Prior Fea-
ture and Attention Enhanced Image Inpainting. In
17th European Conference on Computer Vision, pages
1–8, Tel Aviv, Israel.
Doersch, C., Singh, S., Gupta, A., Sivic, J., and Efros, A. A.
(2015). What Makes Paris Look Like Paris? Commu-
nications of the ACM, 31(4):1–10.
Dong, Q., Cao, C., and Fu, Y. (2022). Incremental Trans-
former Structure Enhanced Image Inpainting with
Masking Positional Encoding. In IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 11358–11368.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An Image Is Worth 16x16 Words: Trans-
formers for Image Recognition at Scale. In 9th In-
ternational Conference on Learning Representations,
pages 1–22.
Gamini, S. and Kumar, S. (2019). Image Inpainting Based
on Fractional-Order Nonlinear Diffusion for Image
Reconstruction. Circuits, Systems, and Signal Pro-
cessing, 15(38):3802–3817.
Ghorai, M., Samanta, S., Mandal, S., and Chanda, B.
(2019). Multiple Pyramids Based Image Inpaint-
ing Using Local Patch Statistics and Steering Ker-
nel Feature. IEEE Transactions on Image Processing,
28(11):5495–5509.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A. C., and
Bengio, Y. (2014). Generative Adversarial Nets. In
Ghahramani, Z., Welling, M., Cortes, C., Lawrence,
N. D., and Weinberger, K. Q., editors, Advances in
Neural Information Processing Systems 27: Annual
Conference on Neural Information Processing Sys-
tems, Montreal, Quebec, Canada.
Guo, X., Yang, H., and Huang, D. (2021). Image Inpainting
via Conditional Texture and Structure Dual Genera-
tion. In IEEE/CVF International Conference on Com-
puter Vision, pages 14134–14143.
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick,
R. B. (2022). Masked Autoencoders Are Scalable Vi-
sion Learners. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 15979–15988,
Orleans, LA, USA.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). GANs Trained by a Two Time-
Scale Update Rule Converge to a Local Nash Equi-
librium. In Neural Information Processing Systems,
pages 1–12.
Li, J., Wang, N., Zhang, L., Du, B., and Tao, D. (2020).
Recurrent Feature Reasoning for Image Inpainting. In
Conference on Computer Vision and Pattern Recogni-
tion, pages 7760–7768.
Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., and Jia, J.
(2022). MAT: Mask-Aware Transformer for Large
Hole Image Inpainting. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
10758–10768.
Liao, L., Xiao, J., Wang, Z., Lin, C., and Satoh, S. (2021).
Image Inpainting Guided by Coherence Priors of Se-
mantics and Textures. In IEEE Conference on Com-
puter Vision and Pattern Recognition. Computer Vi-
sion Foundation / IEEE.
Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and
Catanzaro, B. (2018). Image Inpainting for Irregular
Holes Using Partial Convolutions. In European Con-
ference on Computer Vision, pages 85–100.
Liu, H., Jiang, B., Song, Y., Huang, W., and Yang, C.
(2020). Rethinking Image Inpainting via a Mutual
Encoder-Decoder with Feature Equalizations. Lecture
Notes in Computer Science (including subseries Lec-
ture Notes in Artificial Intelligence and Lecture Notes
in Bioinformatics).
Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep
Learning Face Attributes in the Wild. In IEEE In-
ternational Conference on Computer Vision, pages
3730–3738.
Nazeri, K., Ng, E., Joseph, T., Qureshi, F. Z., and Ebrahimi,
M. (2019). EdgeConnect: Generative Image Inpaint-
ing with Adversarial Edge Learning. arXiv.
Qiu, J., Liu, B., He, J., Liu, C., and Li, Y. (2019). Parallel Fast Pencil Drawing Generation Algorithm Based on GPU. IEEE Access.
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., and
Dosovitskiy, A. (2021). Do Vision Transformers See
Like Convolutional Neural Networks? In Ranzato,
M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and
Vaughan, J. W., editors, Advances in Neural Informa-
tion Processing Systems 34: Annual Conference on
Neural Information Processing Systems, pages 1–8.
Simonyan, K. and Zisserman, A. (2015). Very Deep Con-
volutional Networks for Large-Scale Image Recogni-
tion. In 3rd International Conference on Learning
Representations, pages 1–14.
Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021).
Segmenter: Transformer for Semantic Segmentation.
In IEEE/CVF International Conference on Computer
Vision, pages 7262–7272.
Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A.,
Ashukha, A., Silvestrov, A., Kong, N., Goka, H.,
Park, K., and Lempitsky, V. (2022). Resolution-robust
Large Mask Inpainting with Fourier Convolutions. In
Winter Conference on Applications of Computer Vi-
sion, pages 2149–2159.
Wan, Z., Zhang, B., Chen, D., Zhang, P., Chen, D., Liao,
J., and Wen, F. (2020). Bringing Old Photos Back to
Life. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 2747–2757.
Yang, J., Qi, Z., and Shi, Y. (2020). Learning to incorpo-
rate structure knowledge for image inpainting. In The
Thirty-Fourth AAAI Conference on Artificial Intelli-
gence, AAAI 2020, The Thirty-Second Innovative Ap-
plications of Artificial Intelligence Conference, IAAI
2020, The Tenth AAAI Symposium on Educational
Advances in Artificial Intelligence, EAAI 2020, New
York, NY, USA, February 7-12, 2020.
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T.
(2019). Free-form Image Inpainting with Gated Con-
volution. arXiv 1806.03589.
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S.
(2018). Generative Image Inpainting with Contextual
Attention. In IEEE Computer Society Conference on
Computer Vision and Pattern Recognition.
Yu, Y., Zhan, F., Lu, S., Pan, J., Ma, F., Xie, X., and Miao,
C. (2021). WaveFill: A Wavelet-Based Generation
Network for Image Inpainting. In IEEE/CVF Interna-
tional Conference on Computer Vision, pages 14114–
14123.
Zeng, Y., Lin, Z., Yang, J., Zhang, J., Shechtman, E., and
Lu, H. (2020). High-Resolution Image Inpainting
with Iterative Confidence Feedback and Guided Up-
sampling. In European Conference on Computer Vi-
sion, pages 1–17.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018). The Unreasonable Effectiveness of Deep
Features as a Perceptual Metric. In IEEE Conference
on Computer Vision and Pattern Recognition, pages
586–595.
Zhang, W., Zhu, J., Tai, Y., Wang, Y., Chu, W., Ni, B.,
Wang, C., and Yang, X. (2021). Context-Aware Image
Inpainting with Learned Semantic Priors. In Zhou, Z.,
editor, Thirtieth International Joint Conference on Ar-
tificial Intelligence, pages 1–7.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Tor-
ralba, A. (2017). Places: A 10 Million Image
Database for Scene Recognition. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
40(6):1452–1464.
Zhu, M., He, D., Li, X., Li, C., Li, F., Liu, X., Ding, E., and
Zhang, Z. (2021). Image Inpainting by End-to-End
Cascaded Refinement With Mask Awareness. IEEE
Transactions on Image Processing, 30:4855–4866.