Figure 5: Poisoned source accuracy versus the number of poisoning images for the poisoned model with/without the TVM defense against attackers unaware of defenses. Curves: invisible (none), invisible (TVM), adversarial (none), adversarial (TVM), existing (none), existing (TVM).
posed attacks, which generate adaptive triggers, are more robust than the existing method against the input transformation-based defense. Additionally, if the attacker is aware of the defense and uses transformed images to generate poison, the attack is more likely to succeed.
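To make this comparison concrete, the sketch below shows how an input-transformation defense of this kind can be evaluated. It is a minimal illustration rather than the exact setup used in our experiments: scikit-image's denoise_tv_chambolle stands in for the TVM transform, and model, poisoned_inputs, and source_labels are hypothetical placeholders for a trained classifier, the poisoned source-class test images, and their true labels.

import numpy as np
from skimage.restoration import denoise_tv_chambolle

def tv_defense(images, weight=0.1):
    # Input-transformation defense: total-variation denoising applied to each
    # image (float array in [0, 1], channels last) before it reaches the model.
    # A larger weight removes more high-frequency content, including many
    # perturbation-style triggers.
    return np.stack([denoise_tv_chambolle(img, weight=weight, channel_axis=-1)
                     for img in images])

def poisoned_source_accuracy(model, poisoned_inputs, source_labels, defense=None):
    # Fraction of poisoned source-class images that the model still assigns to
    # their true (source) label, i.e. the quantity plotted in Figure 5.
    # `model` is any callable returning an array of class scores (hypothetical).
    x = defense(poisoned_inputs) if defense is not None else poisoned_inputs
    preds = np.asarray(model(x)).argmax(axis=-1)
    return float((preds == np.asarray(source_labels)).mean())

A defense-unaware attacker crafts its poison on the raw images, whereas a defense-aware attacker would run the same crafting procedure on tv_defense(images), which is why the latter is more likely to survive the transformation.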
5 CONCLUSION
We propose a fully hidden dynamic trigger backdoor attack where the trigger is invisible during both testing and training. Our algorithm dynamically generates invisible triggers without flipping labels or changing the victim's model. Experimental results verified the superiority of the proposed algorithms in terms of invisibility and attack success rate. To prevent fully hidden dynamic trigger backdoor attacks in practice, adaptive defensive methods are essential.
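For readers who want a concrete picture of clean-label poison crafting, the following schematic sketch shows the general shape of such a procedure. It is not our exact algorithm: surrogate and target_features are assumed placeholders for an attacker-side feature extractor and a target-class feature vector, and the PGD-style loop is a generic stand-in for the trigger generator; only the norm bound that keeps the perturbation invisible and the unchanged labels reflect the setting described above.

import torch

def craft_clean_label_poison(surrogate, images, target_features,
                             eps=8/255, steps=50, lr=1/255):
    # Schematic clean-label poisoning: perturb source-class images inside an
    # L-infinity ball of radius eps so that their features under an attacker-
    # side surrogate move toward target_features. Labels are never flipped and
    # the victim's model is never modified; the victim simply trains on the
    # returned images with their original labels.
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = ((surrogate(images + delta) - target_features) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()               # signed gradient step
            delta.clamp_(-eps, eps)                       # invisibility constraint
            delta.add_(images).clamp_(0, 1).sub_(images)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (images + delta).detach()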