Tackling Data Bias in Painting Classification with Style Transfer
Mridula Vijendran (https://orcid.org/0000-0002-4970-7723), Frederick W. B. Li (https://orcid.org/0000-0002-4283-4228) and Hubert P. H. Shum (https://orcid.org/0000-0001-5651-6039)
Department of Computer Science, Durham University, Durham, U.K.
Keywords:
Data Bias, Style Transfer, Image Classification, Deep Learning, Paintings.
Abstract:
It is difficult to train classifiers on painting collections due to model bias from domain gaps and data bias from
the uneven distribution of artistic styles. Previous techniques like data distillation, traditional data augmentation
and style transfer improve classifier training using task-specific training datasets or domain adaptation.
We propose a system to handle data bias in small paintings datasets like the Kaokore dataset while simulta-
neously accounting for domain adaptation in fine-tuning a model trained on real world images. Our system
consists of two stages which are style transfer and classification. In the style transfer stage, we generate the
stylized training samples per class with uniformly sampled content and style images and train the style trans-
formation network per domain. In the classification stage, we can interpret the effectiveness of the style and
content layers at the attention layers when training on the original training dataset and the stylized images.
We can trade off model performance and convergence by dynamically varying the proportion of augmented
samples in the majority and minority classes. We achieve comparable results to the SOTA with fewer training
epochs and a classifier with fewer training parameters.
1 INTRODUCTION
Painting classification is used in the art history do-
main for knowledge discovery through object and
pose detection in paintings. It also has other uses
in style and technique identification through statis-
tical analysis or image similarity along with artist
identification. It is challenging to train classifiers on
painting collections due to model bias from domain
gaps and data bias from the uneven distribution of
artistic styles. Previous techniques like data distilla-
tion, traditional data augmentation and style transfer improve classifier training using task-specific training
datasets or domain adaptation. We propose a system to handle
data bias in small paintings datasets like the Kaokore
dataset (Tian et al., 2020) while accounting for do-
main adaptation in finetuning a model trained on real-
world images. Our system comprises two stages:
style transfer and classification. During style trans-
fer, we generate the stylized training samples per class
while training the style transformation network’s de-
coder to the training dataset’s domain. At classifica-
tion, we can interpret the effectiveness of the style and
content layers at the attention layers when training on
the original training dataset and the stylized images.
We achieve comparable results to the state-of-the-art
(SOTA) with fewer training epochs and classifier pa-
rameters.
Figure 1: Image samples from the Kaokore dataset.
Previous work has tried to solve data efficiency
in model training for small and uneven datasets in a
variety of ways. Data distillation and condensation
techniques have opted to create a synthetic dataset
that is optimal for the model (Zhao et al., 2021; Li
et al., 2020; Zhao et al., 2020; Wang et al., 2018).
Although it provides a compressed representation of
the training dataset, it overfits to a task distribution.
Traditional data augmentation techniques use heuris-
tics to select transformations on their training data
(Berthelot et al., 2019; Carratino et al., 2020) such
that the synthetic data belong to the training distri-
bution. However, these do not account for domain
adaptation when fine-tuning models, reducing sam-
pling bias solely for the training data. As a possible
solution, the model’s learned features account for the
source data using techniques such as style transfer. It
adapts the style from one input image while preserv-
ing the content or structure in the second image, using
the style and content information from the model’s
features. Style transfer data augmentation techniques
(Hong et al., 2021b; Hong et al., 2021a; Jackson et al.,
2019; Zheng et al., 2019; Wang et al., 2022) transfer
the style information from the target to the source for
domain generalization through style invariance. The
classification performance can vary with the choice
of the style image, with the style set determining the
class of augmentations. The model can learn faster
with augmentations tailored to the learning algorithm.
Although data augmentation techniques have been used to improve classifier training, aid domain adaptation, or reduce data bias from class imbalance, they treat these as independent problems to solve at either the data level or the model level. Our work utilizes the strength of style transfer to tailor the data to the domain learned by the model's backbone, creating a data augmentation that changes the data's style and content from the perspective of the model's features, thereby helping training while also accounting for domain adaptation. By producing style transfer augmentations in different proportions for the majority and minority classes, we can select styles for different parts of the data distribution for classes with different amounts of data. The augmentations for the minority classes form the rare samples, while those of the majority classes form the representative samples.
In this paper, we propose a system that solves these problems by transforming content images into class-preserving stylized images using style transfer with AdaIN (Huang and Belongie, 2017) and training the classifier on the original and stylized images. The first stage mitigates data bias by selecting style images that represent the mean or outliers of a cluster, letting the model overfit on the class in the former case and regularizing it in the latter. The second stage tailors the stylized images to the data per class with domain-specific style transformer decoders. The third stage trains the classifier on the augmented and original training data and uses spatial attention to help identify the data bias at the clustering stage by producing interpretable attention maps.
We conduct a series of experiments to check if
class imbalance is mitigated through qualitative and
quantitative studies. The qualitative studies are the
classifier’s high and low confidence samples along
with the attention map responses for class balanc-
ing and the importance of the style and content lay-
ers. Through the quantitative studies, we can check
the importance of the spatial attention layer and the
data augmentation strategy. We achieve comparable
results on the Kaokore dataset with the SOTA accu-
racy score of 89.04% after 90 epochs using the LOOK
method (Feng et al., 2021) as compared to our system
with 83.22% after 20 epochs and with a model that
requires fewer training parameters. By changing the proportions of p1 and p2, we can achieve 78.67% pre-
cision and 75.3% recall with a ResNet-50 (Shah and
Harpale, 2018) backbone. We analyze trends from
different proportions of augmentations for the major-
ity and minority classes and check its effectiveness for
classifiers with different representation capacities.
Our main contributions include:
- We present a spatial attention classification system that achieves results comparable to the SOTA performance of the LOOK model on the Kaokore dataset with significantly less training time and fewer training parameters.
- We propose to tackle data bias with data balancing using a style transfer based data augmentation method, in which styles are extracted from different levels of deep features.
- We showcase that we can trade off accuracy gain against precision/recall gain by dynamically adjusting the ratio of augmentation between rare and representative classes.
Our code is open sourced for validation and further research: https://github.com/41enthusiast/ST-SACLF
2 RELATED WORK
Our work concentrates on painting classification, a domain with limited data. Under this constraint, data efficiency, or artificially increasing the number of training samples, can prove beneficial. The training data can improve model performance when its representation is transformed towards the model objective. This section discusses training data modification at the distribution level, by synthesizing samples at the data or feature level, and at the data level without a model.
2.1 Data Distribution Manipulation
Previous works have synthesized data augmentations,
modifying the training dataset from the model gradi-
ents (Zhao et al., 2021; Li et al., 2020; Zhao et al.,
2020) to condense and distill data into salient model
representations. Data distillation techniques (Wang
et al., 2018) have the advantage of providing a re-
duced yet efficient representation of the training data.
These techniques summarize the training distri-
bution into a representation that is tailored towards
the model or a shared embedding space between the
training and target data distribution. The proposed
work learns a class-wise transformation for each im-
age from model layer embeddings. It focuses on mit-
igating data bias through style invariance rather than
compression.
2.2 Style Transfer for Data
Augmentation
Style transfer for data augmentation can aid classifi-
cation at the data or feature level. Previously, style transfer relied on slow, iterative optimization (Gatys et al., 2015) that modified the stylized image while leaving the model layers untouched, and the transferred style did not align with the content. However, since the model has a relaxed objective of style invariance, content-specific style transfer is not a priority. Later techniques (Huang and Belongie,
2017; Chandran et al., 2021; Kolkin et al., 2022) in-
cluded a separate transformation network that could
be used in inference to generate the stylized images.
At the data level, style transfer modifies the train-
ing distribution itself, whereas at the feature level,
it modifies the model’s features. Smart Augmenta-
tion uses the model features to blend samples selected
from strategies like clustering (Lemley et al., 2017) to
generalize learned augmentation strategies from one
network to another. Style transfer similarly blends
images corresponding to model features for the style
and content. STDA-inf (Hong et al., 2021b) augments
the training data pool with the variations interpolated
between intraclass or interclass specific styles and the
average of all styles during training. StyleMix and
StyleCutMix (Hong et al., 2021a) explore the degree of the effect of style and content in the synthetic samples and assign the mixed data a label based on the ratio of the source images. Style Augmentation (Jackson et al., 2019) and STaDA (Zheng et al., 2019) explore the technique's effectiveness with different degrees of style in the stylized image for model robust-
ness. STDA-inf and StyleMix are very closely tied to
our work, but they do not address the problem of class
imbalance.
At the feature level, style transfer at the model’s
feature maps helps in domain generalization as well
as model robustness (Wang et al., 2022). It generates
feature maps across multiple source domains for the
feature extractor by injecting style as noise in the lay-
ers. The original features and augmented features are
both used to train the classifier.
2.3 Model Agnostic Data Augmentation
Model agnostic data augmentation techniques modify
the training data independently or interdependently
(Berthelot et al., 2019; Carratino et al., 2020) involv-
ing only the training data itself. MixUp is an image
blending technique with the samples either selected at
random or according to the model. The training data
in MixMatch is independently processed by applying
traditional image augmentation techniques like rota-
tions, normalization, adding or removing noise, recol-
orization along with geometric operations like shear-
ing and translation.
The choice for the augmentation can also be
learned to utilize the model’s inductive bias (Cubuk
et al., 2019; Wang et al., 2017). A style transforma-
tion network using GAN (Wang et al., 2017) achieves
this using meta learning by learning augmentations
on a small network that generalize to a larger net-
work. Autoaugment (Cubuk et al., 2019), on the other
hand, uses policies from reinforcement learning to se-
lect augmentations. The policy based augmentations
are retrieved from sampling a selection pool consist-
ing of traditional image augmentations. The selected
augmentations are indicative of domain level knowl-
edge and induce bias based on the model architecture.
Our system operates at the data level by randomly sampling styles from the same class to preserve the intraclass distribution, and it mitigates sampling bias by adding different amounts of extra data to each class. Our system also differs from competitors that use contrastive learning (Islam et al., 2021; Feng et al., 2021), which exploits similarities and differences in the data to improve model training efficiency, to train all of their model parameters. In contrast to our competitors, our classifier backbone consists of pretrained models (Canziani et al., 2016; Shah and Harpale, 2018) trained on another task, with only its head finetuned for painting classification.
3 METHODOLOGY
The current data augmentation techniques do not con-
sider how to mitigate class imbalance in interclass set-
tings while giving the option to focus on improving
performance or mitigating bias. Nor do the style transfer based data augmentations tune the style and content to the task.
Our proposed system seeks to address the above issues through the following system features:
- We can reduce data bias or promote model performance by adding different proportions of style transfer augmented data to the majority and minority classes. Style transfer augmentations also promote texture invariance through multiple styles per sample, forcing the model to focus on the image content.
- We align the level of detail from the style transfer layer configuration with the model's classification through spatial attention modules. These increase the contribution of local-level features to the classification loss, thereby reducing the difference in model performance across data augmentations with different style transfer configurations.
The system consists of two parts as shown in Fig-
ure 2. The style transfer transforms the training data
into their data augmented counterparts. For each
transformation, it uniformly samples a random pair
of content and style images from a class to form hy-
bridized samples. Finally, the original and augmented
datasets feed into a classifier with a pre-trained net-
work and a head trained on the combination of local
and global spatial attention modules. The style trans-
fer uses the same VGG-19 backbone, while the clas-
sifier can have different pre-trained backbones.
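As a hedged sketch of this per-class augmentation step (not the released code), the pairing could be implemented as below; the stylize(content, style) callable and the (image, label) dataset layout are illustrative assumptions.

```python
import random
from collections import defaultdict

def make_stylized_dataset(dataset, stylize):
    """Generate one stylized counterpart per training image by uniformly
    sampling a (content, style) pair from the same class.

    `dataset` is a sequence of (image, label) pairs and `stylize` wraps a
    trained style transformation network; both names are illustrative."""
    # Group image indices by class so that style images never cross classes.
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset):
        by_class[label].append(idx)

    augmented = []
    for label, indices in by_class.items():
        for content_idx in indices:
            # Uniformly sample a style image (with repetition) from the same class.
            style_idx = random.choice(indices)
            content_img, _ = dataset[content_idx]
            style_img, _ = dataset[style_idx]
            augmented.append((stylize(content_img, style_img), label))
    return augmented
```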
3.1 Data Augmentation from Style
Transfer
Compared to STaDA (Zheng et al., 2019), an automatic method of selecting style images removes the subjectivity in their selection. We propose to use Adaptive Instance Normalization's (Huang and Belongie, 2017) image transformation network for its fast transformation speed, despite certain flaws. The stylized image does not align the textures transferred from the style image with the content image, since the method is not context aware. The transformation network is also configuration specific in its resultant textures and depends on a specially trained VGG-19 backbone.
Style transfer could account for the difference in domains between the original training dataset of real-world images and paintings. These differences can range from low-level details like texture, pattern, and stroke-level information to high-level differences like shape.
Figure 2: The overall system for style based data augmen-
tation to improve model classification.
By
providing style invariance, we can reduce this large
domain gap that can create problems in fine-tuning
and data generalization (Yosinski et al., 2014). We
can utilize style transfer to obfuscate the dataset’s
style and distortions, thereby reducing the domain gap
during transfer learning. The classifier is forced to uti-
lize the content information that is common to both
the source and target datasets considering that convo-
lutional neural networks are more sensitive to texture
information (Von Kügelgen et al., 2021). Data aug-
mentations can separate the content information that
would be shared across these real and abstracted de-
pictions, allowing for the higher-level features to be
better utilized for classification (Geirhos et al., 2019).
Virtusio et al. (2021) corroborate the usefulness of learning style invariance while bypassing artistic semantics like brush details and pattern densities.
Unlike Smart Augmentation (Lemley et al., 2017), we generate the data augmented counterparts of the training data before training and with a single model. The data augmentation method neither requires an encoder such as a GAN to exaggerate details at the chosen feature levels, nor a separate network to train augmentation strategies for the main classification network.
The style transfer model, from Figure 4, optimizes
the style loss with the gram matrix of its feature em-
beddings to account for second-order statistics
Figure 3: The original samples per class followed by good and sub-optimal style transfer augmentations in the second and
third rows, respectively.
Figure 4: The style transfer model generates stylized versions of the input data per class.
corresponding to texture and feature variance. The content
loss is computed at the bottleneck of the image trans-
formation model to incorporate the style modulation
at the Adaptive Instance Normalization (Huang and
Belongie, 2017) layers with the content from the re-
construction loss to train the decoder end of the trans-
formation model. It uses a modified pre-trained VGG-
19 model with normalized weights as the encoder. We
train the style transfer model on uniformly sampled
style data from the entire dataset to expose the model
to more style varieties. Once the decoder has been
trained on the training images in a domain, the style
transfer can be computed quickly at inference with
uniformly sampled content and style images with rep-
etition per class. AdaIN is a technique that modulates
the mean and variance of the content feature map to
that of the style feature map, thereby fusing the infor-
mation from both inputs.
$$c = f(x_c), \qquad s = f(x_s)$$
$$\mathrm{AdaIN}(c, s) = \sigma(s)\,\frac{c - \mu(c)}{\sigma(c)} + \mu(s)$$
$$t = \mathrm{AdaIN}(c, s) \tag{1}$$
where c and s are the content and style features from the feature extractor, σ and µ denote the channel-wise standard deviation and mean, and t is the AdaIN output. It
modulates the content feature by the style statistics at
the style transformation network’s encoder.
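As a minimal PyTorch sketch of Equation 1 (an illustration, not the paper's released code), AdaIN re-normalizes the content feature map with the per-channel statistics of the style feature map:

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization over (N, C, H, W) encoder features:
    whiten the content feature with its own mean/std, then re-colour it with
    the style feature's mean/std (Equation 1)."""
    n, c = content_feat.shape[:2]
    c_flat = content_feat.view(n, c, -1)
    s_flat = style_feat.view(n, c, -1)
    c_mean, c_std = c_flat.mean(-1, keepdim=True), c_flat.std(-1, keepdim=True) + eps
    s_mean, s_std = s_flat.mean(-1, keepdim=True), s_flat.std(-1, keepdim=True) + eps
    normalized = (c_flat - c_mean) / c_std
    return (normalized * s_std + s_mean).view_as(content_feat)
```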
The content loss $L_c$ and the style loss $L_s$ are given as MSE losses and are computed as follows:

$$L_c = \lVert f(g(t)) - t \rVert_2$$
$$L_s = \sum_{i=1}^{L} \lVert \mu(\phi_i(g(t))) - \mu(\phi_i(x_s)) \rVert_2 + \sum_{i=1}^{L} \lVert \sigma(\phi_i(g(t))) - \sigma(\phi_i(x_s)) \rVert_2 \tag{2}$$
where t is the AdaIN output from Equation 1 and the content target, $x_s$ is the style image, f is the encoder, g
is the decoder, and $\phi_i$ are the style layers. The style loss matches the mean and standard deviation statistics between the style image and the stylized image. The content loss matches the stylized features to the target features.
During style transfer, only the weights of the de-
coder are updated in the training process. After en-
coding the style and content features for their respec-
tive selected layers, they are used to create a stylized
tensor using the AdaIN layer. The stylized tensor can
retain more information from the style or the structure
information depending on the alpha value. It is passed
through the decoder to form a hybrid image that re-
tains its structure information by matching its content
embedding against the stylized tensor using the con-
tent loss. It retains the style information by matching
its style embeddings against that of the style image
using the style loss. These two losses influence the
hybrid image learned by the decoder.
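A hedged sketch of one decoder training step, combining the adain function sketched above with Equation 2 and the alpha-controlled content-style trade-off; the encoder/decoder interfaces, the style_layers accessor, and the style loss weight are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch.nn.functional as F

def channel_stats(feat, eps=1e-5):
    # Per-channel mean and std over the spatial dimensions of a (N, C, H, W) map.
    flat = feat.flatten(2)
    return flat.mean(-1), flat.std(-1) + eps

def decoder_step(encoder, decoder, style_layers, x_c, x_s, alpha=1.0, style_weight=10.0):
    """One training step for the decoder g. `encoder` returns the bottleneck
    feature f(x); `style_layers` returns the list of intermediate features
    phi_i(x) used by the style loss."""
    c, s = encoder(x_c), encoder(x_s)
    t = adain(c, s)                         # Equation 1
    t = alpha * t + (1.0 - alpha) * c       # content-style trade-off
    g_t = decoder(t)                        # hybrid (stylized) image

    # Content loss: match the re-encoded stylized image to the AdaIN target t.
    loss_c = F.mse_loss(encoder(g_t), t)

    # Style loss: match mean/std statistics at every style layer phi_i (Equation 2).
    loss_s = 0.0
    for f_gt, f_s in zip(style_layers(g_t), style_layers(x_s)):
        m_gt, s_gt = channel_stats(f_gt)
        m_s, s_s = channel_stats(f_s)
        loss_s = loss_s + F.mse_loss(m_gt, m_s) + F.mse_loss(s_gt, s_s)

    return loss_c + style_weight * loss_s, g_t
```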
Figure 3 shows the quality of the generated sam-
ples per class. Since most of the images are face-
centered, the resultant style transfer transfers the tex-
ture while preserving the content. However, since
there are no constraints on the content transferred, some colors bleed into the stylized images, as shown in the bottom row. In the Kaokore dataset, there are many green backgrounds and characters with green clothing, so green is the common color that bleeds into the samples.
3.2 Spatial Attention Based Image
Classifier
The classifier, depicted in Figure 5, is built from a pre-trained image classification model like VGG-16 or ResNet-50 (Canziani et al., 2016; Shah and Harpale, 2018). We extract the very first layer and select three layers between the first and last layers, corresponding to features with more spatial information, to represent richer features and to balance the contributions of the style and content information to the classification loss. The spatial attention module takes each re-projected layer and computes attention with the global feature from the bottleneck. The attention outputs are concatenated and passed to the head with dense layers and dropout for image classification. The head has no batch norm layer and thus no global training statistics that can be re-utilized at test time, whereas previous work utilizes only these statistics to account for domain adaptation (Frankle et al., 2020). In this manner, the data augmentation can account for the domain adaptation in the model. With the proposed work, we explore a model agnostic way of handling domain adaptation and mitigating the data bias resulting from the Kaokore dataset's class imbalance.
Spatial attention both helps in visualizing the impact of style transfer and retains coarse-to-fine detail present in the image. The learnt attention map is further biased since the input data is already amplified by the selected layers. It serves as a weak supervision signal (Jetley et al., 2018), and the attention mechanism acts as a pseudo memory bank for context retention among the features fed to the module.
Figure 5: The classifier architecture is depicted with the
model flow from the input to the outputs. The blue line
indicates local features while the red line indicates global
features. The outputs from the spatial attention layer to the fully connected layers are global features weighted by the corresponding local features.
The spatial attention module computes the atten-
tion map for the local response map and the global
feature at the end of the feature extractor. This em-
beds both the local and global context of the image.
When processing the concatenated spatial attention
responses at the MLP head, the style transfer layers
are prioritized in the loss computation.
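One possible implementation of such a spatial attention module, in the spirit of Jetley et al. (2018), is sketched below; the 1x1 re-projection and tensor shapes are assumptions of this sketch rather than the paper's exact architecture. In the full classifier, one module is attached to each selected layer, and the resulting descriptors are concatenated with the global feature before the dense head with dropout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Attention between a local response map and the global bottleneck feature."""

    def __init__(self, local_channels: int, global_channels: int):
        super().__init__()
        # Re-project the local map into the global feature's channel space.
        self.project = nn.Conv2d(local_channels, global_channels, kernel_size=1)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat: (N, C_local, H, W); global_feat: (N, C_global)
        local_proj = self.project(local_feat)                                 # (N, C_global, H, W)
        # Compatibility score between the global feature and every location.
        scores = (local_proj * global_feat[:, :, None, None]).sum(dim=1)      # (N, H, W)
        attn = F.softmax(scores.flatten(1), dim=1).view_as(scores)[:, None]   # (N, 1, H, W)
        # Attention-weighted descriptor fed to the classification head.
        return (attn * local_proj).flatten(2).sum(dim=-1)                     # (N, C_global)
```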
Focal loss is the classification loss used for the
spatial attention classifier to help mitigate class im-
balance and is formulated as:
$$p_t = \mathrm{softmax}(y_{pred}), \qquad \mathrm{softmax}(y_{pred}) = \frac{\exp(y_{pred})}{\sum_{j=1}^{c} \exp(y_{pred_j})}$$
$$FL(p_t) = -\alpha\,(1 - p_t)^{\gamma}\, y \log(p_t) \tag{3}$$
In Equation 3, α and γ are hyperparameters that can be
tuned according to the level of class imbalance in the
problem, with higher values for more skewed datasets
with more false positives. We can get $p_t$ by passing a softmax function over the logits output $y_{pred}$ from our spatial attention classifier, with c as the number of classes. y is the target one-hot vector and $p_t$ is the predicted probability.
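For reference, a minimal PyTorch sketch of Equation 3, assuming integer class labels and logits from the classifier head (an illustration rather than the exact implementation):

```python
import torch
import torch.nn.functional as F

def focal_loss(y_pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 2.0, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss over logits y_pred of shape (N, c) and integer labels (N,);
    alpha and gamma default to the values used in Section 4.4."""
    log_p = F.log_softmax(y_pred, dim=1)                   # log softmax probabilities
    log_pt = log_p.gather(1, target[:, None]).squeeze(1)   # log p_t of the true class
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
```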
Figure 6: Class imbalance in the Kaokore dataset.
4 EXPERIMENTS
We describe different experiments with our system as follows. Section 4.1 describes the Kaokore dataset, which is used in the experiments of Sections 4.2 and 4.3. The qualitative experiments (Section 4.3) ex-
plore the interpretability of the style and content lay-
ers, while the quantitative experiments are done with
ablation studies (Section 4.2) to check the effective-
ness of the system modules, the classifier and type of
data augmentation. Finally, Section 4.4 describes the
system configuration.
4.1 Datasets
The Kaokore dataset (Tian et al., 2020) is a collection
of Japanese paintings categorized in two ways accord-
ing to gender and status. It provides diverse faces
within and between classes with different shapes,
poses and colors. Thus, it makes a suitable choice for
improving classification under style invariance. The
gender categorization has the male and female subclasses, while status is subdivided into commoner, noble, incarnation (non-human or avatar) and warrior. The dataset is very class imbalanced, as indicated in Figure 6, and it consists of face-cropped images, as seen in Figure 1. The results mainly focus on the status categorization to better showcase the impact of style transfer in classification, since unlike the gender classification task it requires more finesse than hyperparameter tuning and model regularization techniques. The dataset is fairly small, with 6,756 training images and 845 validation and test images of the same size, and could benefit from transfer learning.
4.2 Quantitative Results
The following experiments were conducted to test the
efficacy of style transfer as a data augmentation tech-
nique. The first is an analysis of the style transfer
effects on models of different capacities and architec-
tures. The second explores the model performance
under different configurations of p1 and p2. This is
followed by a comparison with state-of-the-art meth-
ods.
Table 1: Model performance for different classifier backbones with and without data augmentation (metrics in percentage).

Model architecture | Style transfer data augmentation type | Accuracy | Recall | Precision | F1 score
VGG16    | Optimal rare and representative mix | 82.06 | 71.41 | 75.90 | 73.27
VGG16    | No augmentation                     | 79.91 | 71.08 | 73.09 | 72.00
VGG19    | Optimal rare and representative mix | 80.68 | 71.43 | 74.06 | 72.39
VGG19    | No augmentation                     | 78.84 | 67.67 | 73.34 | 69.80
ResNet34 | Optimal rare and representative mix | 81.38 | 73.83 | 76.40 | 74.88
ResNet34 | No augmentation                     | 80.03 | 71.49 | 75.93 | 73.29
ResNet50 | Optimal rare and representative mix | 83.22 | 73.86 | 76.90 | 75.20
ResNet50 | No augmentation                     | 78.43 | 69.55 | 71.58 | 70.48
Style transfer has better results when the model capacity is larger, as seen for VGG-19 and ResNet-34 in Table 1. The control case in the table is the models trained with no data augmentation. The data augmentation type in the table refers to the case where the model is fed the listed type of style transfer transformed training data. It can be inferred that larger models that overfit to the dataset can benefit from style transfer as a form of model regularization. The rare augmentations work better for models with larger capacities since they offer more visual variation while making it harder for the model to overfit on the dataset. In models with smaller backbones like VGG-16 and VGG-19, the representative samples offer better augmentations, since the excessive visual
Table 2: ResNet-34 model performance metrics (accuracy/precision/recall) for different p1 and p2 configurations, where p1
is the percentage of extra majority class data and p2 is the percentage of extra minority class data.
p1/p2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.1 79.31/76.89/70.55 78.91/75.46/73.23 77.58/76.33/69.54 79.42/74.9/71.5 76.83/73.9/69.43 79.37/76.03/73.34 80.35/75.96/75.21 80.69/76.44/73.23 78.67/73.1/69.13 79.24/73.61/71.61
0.2 79.76/75.93/72.82 78.5/75.36/71.26 76.88/72.01/64.97 79.89/74.56/72.92 78.44/74.16/65.81 79.07/72.84/70.49 80.29/75.77/72.66 76.99/70.92/65.76 78.33/74.4/70.89 80.4/75.47/73.41
0.3 79.26/74.29/72.97 78.9/73.75/71.04 79.94/74.54/73.24 78.32/73.97/69.23 80.05/74.75/73.61 79.54/74.75/73.61 79.54/74.64/71.91 77.87/73.35/70.67 77.4/70.26/64.99 79.19/73.93/71.1
0.4 79.08/75.22/70.69 77.23/74.38/65.6 78.33/73.23/70.98 79.88/75.09/70.38 75.9/72.43/61.49 78.89/72.41/71.15 80.24/76.94/73.6 80.23/76.05/73.82 78.91/73.98/71.69 80.23/75.69/72.14
0.5 77.93/74.17/72.21 80.29/76.2/71.16 79.02/73.37/72.14 79.47/76.59/71.95 79.83/75.52/72.04 79.13/73.64/70.11 78.79/73.61/72.4 78.32/72.86/68.59 80.74/76.19/71.37 79.94/74.82/72.98
0.6 79.95/76.04/71.71 79.24/73.31/71.28 79.3/74.03/70.97 79.66/76.91/73.98 79.36/73.41/71.73 80.64/77.58/73.53 79.36/76.22/70.69 79.9/75.16/72.69 79.25/73.7/71.91 81.32/76.04/71.52
0.7 79.71/74.61/72.64 79.66/76.4/72.43 80.97/76.59/73.26 79.83/74.28/73.45 78.68/75.03/72.38 79.02/75.24/69.63 80.4/75.64/72.17 78.68/73.88/70.29 79.53/73.71/70.04 80.86/75.89/71.19
0.8 80.07/76.38/72.7 81.16/77.31/75.03 80/75.29/73.46 80.17/74.82/72.12 80.05/76.12/70.87 79.25/75.32/70.7 77.69/71.7/68.91 78.38/73.22/66.91 80.34/74.29/71.6 80.23/75.3/72.74
0.9 79.88/75.88/71.76 78.14/72.34/69.05 80.75/77.13/73.84 81.38/76.4/73.83 79.31/74.65/71.06 79.59/75.63/73.29 81.1/77.7/74.61 79.66/73.98/71.43 79.14/74.34/70.97 79.08/74.85/71.52
1.0 77.87/73.32/66.63 79.78/76.16/73.5 78.15/73.47/70.27 79.71/74.37/70.57 79.36/74.72/70.42 78.15/77.43/62.81 80.41/78.62/73.6 80.24/76.39/71.44 80.29/76.37/73.45 78.96/74.33/69.42
Figure 7: Model performance trends with differing amounts of style transfer augmentation: (a) training convergence with differing amounts of rare samples (p2); (b) test accuracy scores and (c) test F1 scores with differing amounts of representative (p1) and rare (p2) augmentations. p1 and p2 indicate the percentages of data in the common classes, noble and warrior, and the rare classes, incarnation and commoner, used as extra training data.
variations in styles can hurt the model performance, as seen in (Zheng et al., 2019).
Changing the proportions of the data augmentations for the rare and representative samples shows
Table 3: A comparative study for the Kaokore dataset. Note that our data augmentation method can be used on top of all existing state-of-the-art methods and boost their performance.

Method | Test accuracy | Number of trainable parameters (in millions)
VGG-11 (Tian et al., 2020) | 78.74% | 9.2 M
AlexNet (Tian et al., 2020) | 78.93% | 62.3 M
DenseNet-121 (Tian et al., 2020) | 79.70% | 7.6 M
Inception-v3 (Tian et al., 2020) | 84.25% | 24 M
ResNet-18 (Tian et al., 2020) | 82.16% | 11 M
MobileNet-v2 (Tian et al., 2020) | 82.35% | 3.2 M
ResNet-34 (Tian et al., 2020) | 84.82% | 21.3 M
SelfSupCon (Islam et al., 2021) | 88.92% | 47 M
CE+SelfSupCon (Islam et al., 2021) | 88.25% | 27.9 M
LOOK (ResNet-50) (Feng et al., 2021) | 89.04% | 23.5 M
Ours (VGG-16 backbone) | 82.06% | 1.2 M
Ours (ResNet-34 backbone) | 81.38% | 1.2 M
Ours (ResNet-50 backbone) | 83.22% | 20.1 M
trends in model convergence and performance, as
seen in Figure 7. p1 and p2 are the percentages of data in the majority and minority classes that are used as extra training data, allowing for stratified sampling.
The test was performed on the spatial attention classifier with a ResNet-34 backbone, since larger-capacity models benefit from more training data. The model's training convergence is faster with fewer rare samples, and there is a consistent trend across different fixed p1 values. The test accuracy increases with more representative and rare samples. On the other hand, the F1 scores mostly benefit from smaller proportions of rare augmentations than of representative augmentations. This trend allows for a trade-off between the F1 score, which represents both precision and recall, and accuracy. It also allows for a trade-off between model convergence, with its potential for overfitting, and the regularization from the added rare samples.
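A small sketch of how the extra stylized samples could be drawn with the p1/p2 proportions; the mapping from class labels to stylized samples is an illustrative assumption.

```python
import random

def add_stylized_extras(stylized_by_class, majority_classes, p1, p2):
    """Draw extra stylized samples per class: a fraction p1 of each majority
    (representative) class and a fraction p2 of each minority (rare) class."""
    extras = []
    for label, samples in stylized_by_class.items():
        proportion = p1 if label in majority_classes else p2
        k = int(round(proportion * len(samples)))
        extras.extend((img, label) for img in random.sample(samples, k))
    return extras

# Example: the best ResNet-50 setting reported below.
# extras = add_stylized_extras(stylized, {"noble", "warrior"}, p1=0.3, p2=0.2)
```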
ResNet-50 gets the best performance improvement over the control with no augmentation at p1 = 0.3 and p2 = 0.2, which can be attributed to its increased capacity of 20 million trainable parameters (mentioned in Table 3). It has better accuracy with smaller rare and representative sample proportions compared to the previous configurations. The larger number of learnable parameters lets the model benefit
from more rare samples since it is prone to overfitting
on smaller datasets.
Table 2 details the model performance metrics
with each cell listing the accuracy, precision and re-
call, respectively. From Figure 7 and Table 2, we can infer the choice of the rare proportion (p2) depending on the percentage of extra representative samples (p1).
From Table 3, our best models with the ResNet and VGG architectures reach results comparable to LOOK (Feng et al., 2021) and the five contrastive methods (Islam et al., 2021) with less computation. Our competitors all fully finetune their models, whereas we only finetune the head of the classifier. The methods with contrastive learning (Islam et al., 2021; Feng et al., 2021) achieve better test accuracy, but they have to be trained longer and completely finetuned to the task. In the settings where they do not do so, they have worse results than our method when trained on a part of the dataset. They also have significantly worse results in a few-shot setting, making them both data intensive and computationally expensive. On the other hand, our method is compatible with the SOTA since it is a pretraining step, possibly achieving better results in tandem with their methods.
Comparing the results of the control to the data augmented case in the tables, the performance is more evenly spread out in the latter, indicating better per-class performance in the precision, recall and F1 score metrics. The data augmentations also seem to provide better results than the control for models with more parameters, and comparable results for smaller models.
4.3 Qualitative Results
The visualization of the spatial attention map in Fig-
ure 8 can be used to highlight what parts of the image
are considered important to the model’s layers. As in
Figure 8a, without data augmentation, the model fo-
cuses on a wider area, with higher levels of responses
at the lower levels of the model. These lower layers
are sensitive to texture, edge and color information.
In the Kaokore dataset, the faces can be classified into
the different statuses by their hair style and clothes as
distinctive features. The faces and certain colors in
this case have very high activation responses.
As in Figure 8b, with data augmentation, we can
see the texture details highlighted more than the color
information at the lowest level. The regions with faces
and background have higher responses and in the later
layers, the areas in the vicinity of the hair and subject
are given more importance. Overall, there are more levels of activity in the response maps with data augmentation.
The most and least confident images, as seen in Figure 9, provide a check on the classes the model is biased towards. They are formed by ranking the model losses and visualizing the corresponding images. The most confident images have the smallest losses from left to right, while the least confident images rank the losses in descending order across the test set. The selection of the VGG-16 model was motivated by style transfer working better with it as a backbone. The
style augmentation version has its least confident im-
ages with noble class examples. This could be due
to the test set’s sampling bias to the noble class. In
the first row’s configuration, the least confident im-
ages are from the commoner class despite its small
test sample size, indicating class imbalance. The re-
maining two configurations have the same images in
the most confident images with different rankings.
These images have backgrounds with less variation
and details. In the case of the system with only spa-
tial attention, the least confident images have com-
plex backgrounds along with subjects with obscured
faces. Style transfer based augmentations account for the latter weakness but do not account for highly complex backgrounds. By providing variations of
styles per sample to promote texture invariance, the
model could ignore image details when ignoring tex-
ture information.
4.4 Implementation Details
During the pre-training phase, the style transfer model
is trained on pairs of uniformly sampled style and
content images from the training dataset for 20,000
iterations. The learned decoder is used in inference to
generate the stylized counterparts by similarly sam-
pling style and content pairs per class. The resultant
dataset retains the same distribution as the training
dataset, having the same number of samples for each
class. It uses the same parameters as the AdaIN style
transfer network (Huang and Belongie, 2017).
The model is trained with a batch size of 64 and a learning rate of 0.0001 for 20 epochs using an Adam
optimizer. Additionally, the model uses dropout with
a probability of 0.23. It uses L2 regularization along
with a focal loss with the gamma and alpha set to 2.
Finally, there are 8 workers for faster data processing.
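Putting these settings together, a hedged sketch of the training loop could look as follows; the L2 regularization strength is an assumption, since its value is not stated, and focal_loss is the function sketched earlier.

```python
from torch import optim
from torch.utils.data import DataLoader

def train_head(classifier, train_set, focal_loss, epochs=20):
    """Train only the unfrozen head of the spatial attention classifier on the
    combined original and stylized dataset with the Section 4.4 settings."""
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)
    optimizer = optim.Adam(
        (p for p in classifier.parameters() if p.requires_grad),  # head parameters only
        lr=1e-4,
        weight_decay=1e-4,  # assumed L2 regularization strength
    )
    classifier.train()  # dropout (p = 0.23) is active inside the head
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = focal_loss(classifier(images), labels, alpha=2.0, gamma=2.0)
            loss.backward()
            optimizer.step()
    return classifier
```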
L2 regularization and focal loss help the model focus on parts of the features, since the style transfer can utilize features at different levels of detail that can get lost through the convolution and pooling operations. Dropout was selected to further strengthen this model regularization.
(a) The attention map response without data augmentation for a random test batch.
(b) The attention map response with data augmentation for a random test batch.
Figure 8: Attention map responses for the style transfer layers in a ResNet architecture. From left to right, they represent the
input images, the lowest, low, middle and end layers. The response levels go from low to high and are indicated from dark
blue to red.
Figure 9: The most and least confident images from the validation subset of the Kaokore dataset for different system configu-
rations in the classifier with a VGG-16 backbone.
A single NVIDIA A100 GPU instance was used to train the model and run inference. It was also used during pretraining to generate the data augmented counterparts. The pre-trained models considered for the classifier are ResNet-34 and VGG (Canziani et al., 2016; Shah and Harpale, 2018) with its variants VGG-16 and VGG-19. The ResNet and VGG architectures provide a comparative study against the benchmarks of the Kaokore dataset (Tian et al., 2020). The VGG variants are used to examine the effect of the augmentations on model capacity. Their weights are frozen for all stages of the system to showcase the strength of the data augmentation rather than the model architecture itself. The fully connected layers are removed and the last layer is replaced by a global average pooling layer to make the model robust to images of any size and better serve as a feature extractor.
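A minimal sketch of preparing such a frozen backbone as a feature extractor, assuming a recent torchvision; the exact layer slicing is illustrative.

```python
import torch.nn as nn
from torchvision import models

# Load a pre-trained backbone and freeze all of its weights.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in backbone.parameters():
    param.requires_grad = False

# Drop the fully connected layers and end with global average pooling so the
# extractor handles images of any size and emits a single global feature.
feature_extractor = nn.Sequential(
    *list(backbone.children())[:-2],   # remove the original avgpool and fc
    nn.AdaptiveAvgPool2d(1),           # global average pooling
    nn.Flatten(),                      # (N, 2048) global feature vector
)
```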
5 CONCLUSIONS
We observe that style transfer for data augmentation, with style images and stylization tailored to the classifier, produces better results per class. It also mitigates data
bias from class imbalance in small datasets of a dif-
ferent domain. The system achieves this by stylizing
images towards the representative and rare clustered
samples to bias the classification loss to a changed
training manifold. We can balance the tradeoff be-
tween accuracy and convergence to recall, precision
and f1-score by changing the proportion of extra data
per minority and majority class. The amount of ex-
tra rare classes to be added range between 20-60% of
the minority classes with more minority classes giv-
ing better recall, precision and f1-scores. In the repre-
sentative classes case, 50-90% more data can improve
all the metrics, with a more pronounced effect on ac-
curacy and model convergence. We conduct qualita-
tive experiments to check class imbalance and inter-
pretability of the backbone at different layers. Next,
we perform quantitative studies to show the weak su-
pervision signal from the spatial attention modules
and the reduced data bias through style transfer aug-
mentations.
While we automate the selection of style images for style transfer through random sampling of style and content images per class, the learned style space is still subjective due to the variations resulting from the selection of different style and content layers. Future work can look into focused sampling of style and content images to make the style transfer more task oriented. Our work has not experimented with varying the extent of style and content in the image, which can also be learned to suit the task at hand.
Furthermore, we can use meta learning on top of
the system to learn hyperparameters as well as ef-
fectively learn the training dataset through the differ-
ent style transfer augmentations as support sets with
fewer samples. Since contrastive learning techniques
are highly dependent on the data augmentation tech-
niques, future work can incorporate them into the model training process. Since the current system al-
lows for flexibility in the choice of model and training
pipeline, the style transfer based data augmentation
can be adapted in a plug and play manner as a pre-
training step.
Lastly, we will explore the model generalization
on other paintings datasets such as PACS (Li et al.,
2017), WikiArt (Saleh and Elgammal, 2015) and Ri-
jksmuseum (Mensink and Van Gemert, 2014). The
PACS dataset is a small dataset with subjects por-
trayed in different media and can be used to check the
model’s performance in domain generalization. The
WikiArt dataset has paintings of different genres and
styles while the Rijksmuseum dataset has a larger col-
lection of data. The two datasets can be used to check
the data efficiency of the model with different training
data sizes.
REFERENCES
Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N.,
Oliver, A., and Raffel, C. (2019). Mixmatch: A holis-
tic approach to semi-supervised learning. In NeurIPS.
Canziani, A., Paszke, A., and Culurciello, E. (2016). An
analysis of deep neural network models for practical
applications. arXiv preprint arXiv:1605.07678.
Carratino, L., Cisse, M., Jenatton, R., and Vert, J.-P. (2020).
On mixup regularization. Technical report, arXiv.
2006.06049.
Chandran, P., Zoss, G., Gotardo, P., Gross, M., and Bradley,
D. (2021). Adaptive convolutions for structure-aware
style transfer. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 7972–7981.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le,
Q. V. (2019). Autoaugment: Learning augmentation
policies from data.
Feng, Y., Jiang, J., Tang, M., Jin, R., and Gao, Y. (2021).
Rethinking supervised pre-training for better down-
stream transferring. arXiv preprint arXiv:2110.06014.
Frankle, J., Schwab, D. J., and Morcos, A. S. (2020). Train-
ing batchnorm and only batchnorm: On the expres-
sive power of random features in cnns. arXiv preprint
arXiv:2003.00152.
Gatys, L. A., Ecker, A. S., and Bethge, M. (2015). A
neural algorithm of artistic style. arXiv preprint
arXiv:1508.06576.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wich-
mann, F. A., and Brendel, W. (2019). Imagenet-
trained CNNs are biased towards texture; increasing
shape bias improves accuracy and robustness. In In-
ternational Conference on Learning Representations.
Hong, M., Choi, J., and Kim, G. (2021a). Stylemix: Sepa-
rating content and style for enhanced data augmenta-
tion. In 2021 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 14857–
14865.
Hong, T., Zou, Y., and Ma, J. (2021b). Stda-inf: Style trans-
fer for data augmentation through in-data training and
fusion inference. In Huang, D.-S., Jo, K.-H., Li, J.,
Gribova, V., and Hussain, A., editors, Intelligent Com-
puting Theories and Application, pages 76–90, Cham.
Springer International Publishing.
Huang, X. and Belongie, S. (2017). Arbitrary style transfer
in real-time with adaptive instance normalization. In
Proceedings of the IEEE international conference on
computer vision, pages 1501–1510.
Islam, A., Chen, C.-F. R., Panda, R., Karlinsky, L., Radke,
R., and Feris, R. (2021). A broad study on the transfer-
ability of visual representations with contrastive learn-
ing. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 8845–8855.
Jackson, P. T., Abarghouei, A. A., Bonner, S., Breckon,
T. P., and Obara, B. (2019). Style augmentation: data
augmentation via style randomization. In CVPR work-
shops, volume 6, pages 10–11.
Jetley, S., Lord, N. A., Lee, N., and Torr, P. H. (2018). Learn
to pay attention. arXiv preprint arXiv:1804.02391.
Kolkin, N., Kucera, M., Paris, S., Sykora, D., Shechtman,
E., and Shakhnarovich, G. (2022). Neural neighbor
style transfer. arXiv e-prints, pages arXiv–2203.
Lemley, J., Bazrafkan, S., and Corcoran, P. (2017). Smart
augmentation learning an optimal data augmentation
strategy. IEEE Access, 5:5858–5869.
Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. (2017).
Deeper, broader and artier domain generalization. In
Proceedings of the IEEE international conference on
computer vision, pages 5542–5550.
Li, Y., Hu, G., Wang, Y., Hospedales, T., Robertson, N. M.,
and Yang, Y. (2020). Dada: Differentiable automatic
data augmentation. arXiv preprint arXiv:2003.03780.
Mensink, T. and Van Gemert, J. (2014). The rijksmuseum
challenge: Museum-centered visual recognition. In
Proceedings of International Conference on Multime-
dia Retrieval, pages 451–454.
Saleh, B. and Elgammal, A. (2015). Large-scale classifica-
tion of fine-art paintings: Learning the right metric on
the right feature. arXiv preprint arXiv:1505.00855.
Shah, U. and Harpale, A. (2018). A review of deep learning
models for computer vision. In 2018 IEEE Punecon,
pages 1–6.
Tian, Y., Suzuki, C., Clanuwat, T., Bober-Irizar, M., Lamb,
A., and Kitamoto, A. (2020). KaoKore: A Pre-modern
Japanese Art Facial Expression Dataset. In Proceed-
ings of the International Conference on Computa-
tional Creativity, pages 415–422.
Virtusio, J. J., Tan, D. S., Cheng, W.-H., Tanveer, M., and
Hua, K.-L. (2021). Enabling artistic control over pat-
tern density and stroke strength. Trans. Multi., 23.
Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. (2021). Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34:16451–16467.
Wang, J., Perez, L., et al. (2017). The effectiveness of data
augmentation in image classification using deep learn-
ing. Convolutional Neural Networks Vis. Recognit,
11:1–8.
Wang, T., Zhu, J.-Y., Torralba, A., and Efros, A. A. (2018).
Dataset distillation. arXiv preprint arXiv:1811.10959.
Wang, Y., Qi, L., Shi, Y., and Gao, Y. (2022). Feature-
based style randomization for domain generalization.
IEEE Transactions on Circuits and Systems for Video
Technology, 32(8):5495–5509.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014).
How transferable are features in deep neural net-
works? Advances in neural information processing
systems, 27.
Zhao, B., Mopuri, K. R., and Bilen, H. (2020). Dataset
condensation with gradient matching. arXiv preprint
arXiv:2006.05929.
Zhao, B., Mopuri, K. R., and Bilen, H. (2021). Dataset
condensation with gradient matching. In International
Conference on Learning Representations.
Zheng, X., Chalasani, T., Ghosal, K., Lutz, S., and Smolic,
A. (2019). Stada: Style transfer as data augmentation.
arXiv preprint arXiv:1909.01056.