Vision Transformer Interpretability via Prediction of Image Reflected
Relevance Among Tokens
Kento Sago (https://orcid.org/0009-0005-0935-0845) and Kazuhiro Hotta (https://orcid.org/0000-0002-5675-8713)
Meijo University, Nagoya, Japan
Keywords:
Explainable AI, Vision Transformer, Transformer Attribution.
Abstract:
The Vision Transformer (ViT) has a complex structure. To use it effectively in critical decision-making settings, it is necessary to visualize the areas that affect the model's predictions so that people can understand them. In this paper, we propose a new visualization method based on Transformer Attribution, which is widely used to visualize the areas responsible for ViT's predictions. Our method estimates the influence of each token on the prediction by re-predicting images into which the relevance among tokens has been reflected, and produces saliency maps from the resulting outputs. Compared with the conventional method on the ILSVRC2012 validation data, our method improved deletion and insertion by about 1.28% and 1.61%, and average drop and average increase by about 3.01% and 0.94%.
1 INTRODUCTION
The Vision Transformer (ViT) (Dosovitskiy et al., 2021) is effective for tasks such as image classification (Wang et al., 2021; Liu et al., 2022), object detection (Carion et al., 2020; Caron et al., 2021; Liu et al., 2023), and semantic segmentation (Zheng et al., 2021; Xie et al., 2021). ViT has a complex structure, and in order to use it effectively in critical decision-making settings, it is necessary to visualize the areas that affect the model's predictions so that people can understand them (Zhou et al., 2016; Selvaraju et al., 2017; Petsiuk et al., 2018; Wang et al., 2020). Visualizing the areas that affect the model's predictions allows us to understand trends in the predictions and helps improve the model's performance. We therefore consider interpreting the model to be an important task.
In this paper, we focus on Transformer Attribution (Chefer et al., 2021), which is widely used to interpret ViTs. Although this method visualizes the important areas using relevance scores calculated from the Hadamard product of the model's attention scores and gradients, the relevance scores tend to be locally large in objects that occupy most of the image, so only parts of such objects are highlighted while the rest is not, as shown in Figure 1.
Figure 1: Visualization results by our proposed method and Transformer Attribution. Our method highlights the predicted-class object better than Transformer Attribution.
Transformer Attribution does not leverage the relevance among image tokens. In contrast, our method exploits this discarded relevance. Specifically, the proposed method re-predicts, with the same model, multiple images obtained as the Hadamard product of the input image and the relevance scores produced by Transformer Attribution, and calculates a saliency map from the model's outputs.
In the quantitative experiments, we used the validation set of the ILSVRC2012 dataset. We evaluated our method with four measures: deletion, insertion, average drop, and average increase. Compared with Transformer Attribution, our method improved deletion and insertion by about 1.28% and 1.61%, and average drop and average increase by about 3.01% and 0.94%, respectively.
Figure 2: Overview of Transformer Attribution. Gradients and relevances are propagated through the network, and their matrix product is computed to produce the final relevance scores.
This paper is organized as follows. First, we describe related works in Section 2 and explain the proposed method in Section 3. We then describe the experimental settings in Section 4 and present our experimental results and discussion in Section 5. Finally, we describe conclusions and future work in Section 6.
2 RELATED WORKS
2.1 ViTs Interpretability
Grad-CAM (Selvaraju et al., 2017), a method for interpreting Convolutional Neural Networks (CNNs), computes saliency maps from the deepest feature maps and gradients of the model. Various other methods were subsequently proposed to interpret CNNs (Chattopadhay et al., 2018; Ramaswamy et al., 2020; Jiang et al., 2021), and some of them can also interpret ViTs. In addition, methods utilizing the attention scores (Abnar and Zuidema, 2020) were proposed to interpret ViT, because the attention scores are considered to indicate the areas that affect the ViT's predictions.
However, there remained the challenge of effectively combining scores from different layers. For example, if we simply average the attention scores for each token, the signal is often attenuated. Attention Rollout (Abnar and Zuidema, 2020), which computes the matrix product of the attention scores of all layers, was proposed to remedy this problem. This method showed an improvement over using a single attention layer, but it often highlighted irrelevant tokens.
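As a brief illustration, Attention Rollout combines the per-layer attention maps roughly as follows (a minimal PyTorch sketch assuming head-averaged attention maps of shape (N, N); the residual connection is accounted for by mixing in the identity, as in the original paper):

```python
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of head-averaged attention matrices, each of shape (N, N)."""
    N = attn_maps[0].shape[-1]
    rollout = torch.eye(N)
    for A in attn_maps:
        # Account for the residual connection by mixing in the identity,
        # then renormalize the rows so they sum to 1.
        A_res = 0.5 * A + 0.5 * torch.eye(N)
        A_res = A_res / A_res.sum(dim=-1, keepdim=True)
        # Combine layers by matrix product.
        rollout = torch.matmul(A_res, rollout)
    return rollout  # rollout[0, 1:] relates the class token to the image tokens
```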
Furthermore, Transformer Attribution (Chefer et al., 2021), which additionally uses gradients, is now widely used to interpret ViT. In this paper, Transformer Attribution is used as the baseline method.
2.2 Transformer Attribution
In Transformer Attribution, shown in Figure 2, an input image is first fed into ViT, and the gradients are computed from the output value back through all Transformer blocks to the input image. To calculate a relevance score $R \in \mathbb{R}^{N \times N}$, the method computes the Hadamard product of each Transformer block's attention score $A \in \mathbb{R}^{N \times N}$, represented by the green matrix in the figure, and its gradient $\nabla A \in \mathbb{R}^{N \times N}$, represented by the orange matrix in the figure, and averages these products across all heads, where $N$ is the number of tokens fed into the Transformer block. Only the positive values are kept, since only they should contribute to the prediction. The final relevance scores are then obtained as the matrix product of these per-block products. The component $(i, j)$ of $R$ is the magnitude of the influence of the pair of the $(i-1)$-th and $(j-1)$-th tokens on the prediction, and this is called "relevance". In particular, the component $(0, 0)$ of $R$ is the relevance of the class token with itself. The method visualizes the areas that affect the prediction by producing a saliency map from the relevance between the class token and all image tokens, because the class token is considered to have a high capacity for interpretability. However, the relevance scores tend to be locally large, so for objects that occupy most of the image only parts of the object are highlighted while the rest is not, as shown in Figure 1.
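As a rough illustration of the rule above, a minimal PyTorch sketch might look as follows (our own naming, not the authors' code; we assume the attention maps and their gradients with respect to the target logit have been captured per block as tensors of shape (heads, N, N), and we assume an identity-initialised, additive accumulation for the matrix-product step):

```python
import torch

def transformer_attribution(attn_maps, attn_grads):
    """attn_maps, attn_grads: lists of per-block tensors of shape (heads, N, N)."""
    N = attn_maps[0].shape[-1]
    R = torch.eye(N)  # relevance starts from the identity
    for A, gradA in zip(attn_maps, attn_grads):
        # Hadamard product of gradients and attention scores, keep only the
        # positive contributions, and average over the heads.
        cam = (gradA * A).clamp(min=0).mean(dim=0)   # (N, N)
        # Propagate relevance through the block via a matrix product.
        R = R + torch.matmul(cam, R)
    return R  # R[0, 1:] is the relevance of the class token to the image tokens
```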
3 PROPOSED METHOD
In the proposed method, shown in Figure 3, the relevance score is first obtained by Transformer Attribution. In Figure 3, the blue column in the relevance score indicates the relevance between the class token and the image tokens, while the orange matrix indicates the relevance among image tokens.
Figure 3: Overview of our proposed method. The orange matrix is the relevance score between image tokens obtained by Transformer Attribution, and we cut the diagonal components of those relevance scores. We then reshape each column to a matrix and interpolate it to the same size as the input image. The image given by the Hadamard product between the input image and those relevance scores is fed into ViT, and we obtain the logits of shape $hw \times C$ (green and yellow vectors in the figure) from ViT. We add the values corresponding to the predicted class of the logits (yellow vector in the figure) to the relevance between the class token and the image tokens (blue vector in the figure), and the saliency map is obtained by reshaping the sum to a matrix and interpolating it to the same size as the input image.
For the orange matrix $R \in \mathbb{R}^{(N-1)\times(N-1)}$ and the identity matrix $I \in \mathbb{R}^{(N-1)\times(N-1)}$, with $\odot$ denoting the Hadamard product, the proposed method corrects the magnitude of the relevance score as

$$R' = R \odot (1 - I), \tag{1}$$
$$R'' = R' + \max_i R'_{(i,j)} \times I. \tag{2}$$
When we normalize the relevance scores to reflect them into the input image, the diagonal components have larger values than the other components. Thus, if the scores are normalized as is, the diagonal components take values close to 1 and the other components take values close to 0, so only the relevance of each token to itself is reflected into the input image. We consider that the influence of each token on the prediction can be measured better if not only the relevance of a token to itself but also its relevance to the other tokens is reflected into the image, so we replace the diagonal components with the maximum of the other components and then normalize them.
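As a small illustration, the correction of Eqs. (1) and (2) can be sketched in PyTorch as follows (a minimal sketch with our own naming; R is the (N-1) x (N-1) relevance matrix among image tokens):

```python
import torch

def correct_diagonal(R: torch.Tensor) -> torch.Tensor:
    """Replace each diagonal entry with the maximum of the other entries in its column."""
    n = R.shape[0]
    eye = torch.eye(n, device=R.device)
    R_prime = R * (1.0 - eye)               # Eq. (1): zero out the diagonal
    col_max = R_prime.max(dim=0).values     # max_i R'_(i, j) for each column j
    return R_prime + torch.diag(col_max)    # Eq. (2): place the column maxima on the diagonal
```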
When the relevance scores are simply normalized by the maximum and minimum of all relevance scores, and the maximum is much larger than the other values, the normalized relevance scores within some columns become almost identical, which makes the magnitude relationships within those columns harder to read. We consider that it would be easier to see which tokens are highly relevant to a particular token if the relevance scores were normalized across columns without changing the magnitude relationships within each column. Therefore, we normalized the relevance scores as shown in Figure 4.
Figure 4: Normalization of the relevance scores.
In Figure 4, we take the maximum relevance score $\max_i R_{(i,j)}$ along the $i$-axis and normalize these maxima along the $j$-axis (green vector in the figure). We also normalize the relevance scores along the $i$-axis (red matrix in the figure), and compute the Hadamard product of the normalized relevance scores (red matrix) and the normalized maxima (green vector). In this way, the relevance scores are normalized across columns without changing the magnitude relationships within each column, making it easier to see which tokens are highly relevant to a particular token.
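One possible reading of this normalization, sketched in PyTorch (our own interpretation of Figure 4; we assume min-max normalization wherever the text says "normalize"):

```python
import torch

def normalize_relevance(R: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize each column in place, then rescale by the normalized column maxima."""
    # Column-wise maxima and minima along the i-axis (one value per column j).
    col_max = R.max(dim=0).values
    col_min = R.min(dim=0).values
    # Normalize within each column (red matrix in Figure 4).
    R_cols = (R - col_min) / (col_max - col_min + eps)
    # Normalize the column maxima across columns (green vector in Figure 4).
    scale = (col_max - col_max.min()) / (col_max.max() - col_max.min() + eps)
    # Hadamard product: preserves orderings within columns, keeps columns comparable.
    return R_cols * scale.unsqueeze(0)
```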
Figure 5: Examples of images reflecting the relevance scores.
For the input image $X \in \mathbb{R}^{1\times 3\times H\times W}$, we obtain $N-1$ images that reflect the relevance among the image tokens as

$$X' = X \odot \mathrm{interp}(\mathrm{reshape}(R'')), \tag{3}$$

where $\mathrm{reshape}(\cdot)$ converts the relevance of all tokens to each token from a vector to a matrix, as in $\mathbb{R}^{(N-1)\times(N-1)} \rightarrow \mathbb{R}^{(N-1)\times 1\times\sqrt{N-1}\times\sqrt{N-1}}$, and $\mathrm{interp}(\cdot)$ denotes nearest-neighbor interpolation to the same size as the input image, so the first image of $X'$ reflects the relevance of all the tokens to the first token.
After that, the predictions $L \in \mathbb{R}^{(N-1)\times C}$ are obtained by classifying the images that reflect the relevance among the tokens into the $C$ classes. $L$ represents the magnitude of the influence of each token on the prediction and indicates the important areas that affect the prediction. Thus, we calculate the saliency map by bilinear interpolation of the vector $l \in \mathbb{R}^{(N-1)}$ corresponding to the predicted class of $L$. We also consider that the class token and the other tokens attend to different areas, so we add the relevance score obtained by Transformer Attribution to $l$.
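Putting the pieces together, the re-prediction of Eq. (3) and the saliency construction could be sketched as follows (a rough sketch under our assumptions: model is a ViT classifier returning logits of shape (batch, C), R2 is the corrected and normalized (N-1) x (N-1) relevance matrix, r_cls is the class-token relevance vector from Transformer Attribution, and x is an input of shape (1, 3, H, W)):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def relevance_reflected_saliency(model, x, R2, r_cls, pred_class, batch_size=64):
    n = R2.shape[0]                     # number of image tokens, N - 1
    side = int(n ** 0.5)                # tokens per spatial side
    H, W = x.shape[-2:]

    # reshape(.): one relevance map per token (column j of R2), then
    # interp(.): nearest-neighbour upsampling to the input resolution.
    maps = R2.t().reshape(n, 1, side, side)
    maps = F.interpolate(maps, size=(H, W), mode="nearest")

    # Eq. (3): Hadamard product of the input image with each relevance map,
    # then classify the reflected images with the same model.
    logits = []
    for start in range(0, n, batch_size):
        batch = x * maps[start:start + batch_size]   # broadcast to (b, 3, H, W)
        logits.append(model(batch))
    L = torch.cat(logits, dim=0)                     # (N - 1, C)

    # Take the predicted-class column and add the class-token relevance.
    l = L[:, pred_class] + r_cls                     # (N - 1,)
    sal = l.reshape(1, 1, side, side)
    return F.interpolate(sal, size=(H, W), mode="bilinear", align_corners=False)[0, 0]
```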
As the examples of images reflecting the relevance scores in Figure 5 show, if the relevance of the other tokens to a given token is high, the image reflects not only that token but also the tokens highly relevant to it, which we consider assists the subsequent classification. Our method estimates, from the prediction on such an image, whether the relevance scores obtained by Transformer Attribution correctly indicate the areas that affect the prediction. If the relevance score for an object of the predicted class has even a small value, the prediction on the reflected image is expected to be larger, which is expected to mitigate the problem of Transformer Attribution that, for objects occupying most of the image, only parts of them are highlighted while the rest is not.
4 EXPERIMENTAL SETTINGS
4.1 Dataset
In the following experiments, we evaluated our proposed method with all images in the validation set of ILSVRC2012 (Deng et al., 2009), which consists of 50,000 images from 1,000 classes.
4.2 Baseline
In the following experiments, we used a pretrained ViT-B/16, as in (Chefer et al., 2021). We also used ViT-B/32 pretrained with Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021), as in (Chen et al., 2022). CLIP, published by OpenAI, consists of an image encoder and a text encoder, and is widely used for zero-shot classification because it is trained on very large datasets.
In the comparison experiments, Grad-CAM computes the saliency map from the gradients and attention scores of the class token in the last Transformer block, as in (Chefer et al., 2021). Attention Rollout, Transformer Attribution, and our method compute the saliency map from the gradients and attention scores of all Transformer blocks, as in (Chefer et al., 2021).
4.3 Evaluation Metrics
We evaluated our method with deletion, insertion, average drop, and average increase, which are widely used evaluation metrics for interpretability.
Deletion measures the decrease in the probability of the predicted class when important pixels are deleted, where the importance is obtained from the saliency map. Deletion is the area under the probability curve as a function of the number of deleted pixels, so lower is better. Insertion measures the increase in the probability of the predicted class when important pixels are inserted, where the importance is again obtained from the saliency map. Insertion is the area under the probability curve as a function of the number of inserted pixels, so higher is better.
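For concreteness, a simplified version of these two curves could be computed as follows (a sketch under our assumptions; actual implementations differ in step size, baseline image, and how removed pixels are filled):

```python
import torch

@torch.no_grad()
def deletion_insertion_auc(model, x, saliency, target, steps=50, insert=False):
    """Remove (or insert) pixels in order of saliency and average the
    predicted probability of the target class over the steps (AUC)."""
    _, c, h, w = x.shape
    order = saliency.flatten().argsort(descending=True)    # most important first
    base = torch.zeros_like(x) if insert else x.clone()
    probs = []
    for k in range(steps + 1):
        n_px = int(round(k / steps * h * w))
        img = base.clone()
        idx = order[:n_px]
        rows, cols = idx // w, idx % w
        if insert:
            img[0, :, rows, cols] = x[0, :, rows, cols]     # insert most important pixels
        else:
            img[0, :, rows, cols] = 0.0                     # delete most important pixels
        p = torch.softmax(model(img), dim=-1)[0, target]
        probs.append(p.item())
    return sum(probs) / len(probs)                          # area under the curve
```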
Average drop is computed as follows, and lower is better:

$$\mathrm{AverageDrop} = \frac{1}{N}\sum_{i=1}^{N} \max(0,\, Y_i^c - O_i^c). \tag{5}$$
Figure 6: Visualization results by the proposed method, Grad-CAM, Attention Rollout, and Transformer Attribution. Our
method highlighted the predicted class object better than the others.
Average increase is computed as follows, and higher is better:

$$\mathrm{AverageIncrease} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Sign}(Y_i^c < O_i^c), \tag{6}$$

where $Y_i^c$ is the predicted probability of class $c$ for image $i$ among the $N$ images, $O_i^c$ is the predicted probability of class $c$ for image $i$ when only the top 50% of pixels in the saliency map are inserted, and $\mathrm{Sign}(\cdot)$ returns 1 if its argument is true and 0 otherwise.
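Under the same reading, Eqs. (5) and (6) could be computed as follows (a sketch; Y and O are assumed to be 1-D tensors of predicted-class probabilities for the original images and the images with only the top 50% of pixels inserted):

```python
import torch

def average_drop(Y: torch.Tensor, O: torch.Tensor) -> float:
    # Eq. (5): mean positive drop in the predicted-class probability.
    return torch.clamp(Y - O, min=0).mean().item()

def average_increase(Y: torch.Tensor, O: torch.Tensor) -> float:
    # Eq. (6): fraction of images whose predicted-class probability increases.
    return (Y < O).float().mean().item()
```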
5 EXPERIMENTAL RESULTS
5.1 ViT Results
Table 1 shows the deletion, insertion, average drop, and average increase of our method, Grad-CAM, Attention Rollout, and Transformer Attribution. Our method is the best in all metrics. In particular, compared with Transformer Attribution, deletion, insertion, average drop, and average increase were improved by about 1.28%, 1.61%, 3.01%, and 0.94%, respectively.
Figure 6 shows qualitative comparison results of the saliency maps. The first column shows the input image, and the second column shows the model's prediction and the ground-truth label for the input image; the first row is a false prediction case.
Table 1: Comparison of the proposed method and baseline on the ILSVRC2012 validation dataset.
Method Deletion Insertion Average Drop Average Increase
Grad-CAM 26.48 34.89 62.53 9.52
Attention Rollout 16.80 44.70 45.18 14.80
Transformer Attribution 15.49 46.64 40.86 18.68
Ours 14.21 48.25 37.85 19.62
Table 2: Comparison of the proposed method and baseline for Clip-ViT on the ILSVRC2012 validation set.
Method Deletion Insertion Average Drop Average Increase
Transformer Attribution 11.07 31.21 50.51 15.87
Ours 10.19 32.55 48.41 16.92
Comparing our method with the other methods, the images predicted to be Solar Dish and Pizza are highlighted more completely by our method. This is because the relevance scores among the tokens corresponding to the object are large, and the model correctly predicts the images reflecting those relevance scores. Except for the Gold Dish image, Grad-CAM, Attention Rollout, and Transformer Attribution highlight only one part of the object, which can be interpreted as if the rest of the object does not contribute to the prediction. We consider that this phenomenon occurs because the relevance scores tend to be locally large. In contrast, our method further emphasizes one part of the object while also highlighting the whole object, which can be interpreted as the whole object contributing to the prediction.
In the cases of Gold Dish, Indigo Bunting, Samoyed, and Echidna, our method also highlighted background such as branches, trees, and sand. One of the factors is that the relevance scores between the background and the object are non-zero; if the model correctly predicts images reflecting relevance scores that relate object and background, the saliency map contains noise on the background.
Our method is not perfect, but Table 1 and Figure 6 demonstrate that it is superior to Grad-CAM, Attention Rollout, and Transformer Attribution on various metrics.
5.2 CLIP-ViT Results
Table 2 shows the deletion, insertion, average drop, and average increase of our method and Transformer Attribution for CLIP-ViT. In comparison with Transformer Attribution, our method improved these metrics by approximately 0.88%, 1.34%, 2.1%, and 1.05%, respectively.
Figure 7: Visualization results by our proposed method and Transformer Attribution for CLIP-ViT.
Figure 7 shows qualitative comparison results for CLIP-ViT obtained by our method and Transformer Attribution. In the proposed method, the image predicted to be "Tank" highlights the caterpillar-track portion, and the images predicted to be "Abacus" and "Vestment" highlight the object regions more strongly than Transformer Attribution does.
6 CONCLUSIONS
In this paper, we improved Transformer Attribution, a method for interpreting the Vision Transformer. We calculate saliency maps by estimating, from the predictions on images that reflect the relevance scores, whether the relevance scores correctly indicate the important areas that affect the ViT's prediction.
Transformer Attribution has the problem that it fails to fully highlight objects of the predicted class; the proposed method mitigates this problem. Quantitative and qualitative evaluations demonstrated the effectiveness of the proposed method.
ACKNOWLEDGEMENTS
This research was partially supported by JSPS KAKENHI Grant Numbers 22H04735 and 21K11971.
REFERENCES
Abnar, S. and Zuidema, W. (2020). Quantifying
attention flow in transformers. arXiv preprint
arXiv:2005.00928.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-end object de-
tection with transformers. In European Conference on
Computer Vision, pages 213–229. Springer.
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bo-
janowski, P., and Joulin, A. (2021). Emerging prop-
erties in self-supervised vision transformers. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 9650–9660.
Chattopadhay, A., Sarkar, A., Howlader, P., and Balasub-
ramanian, V. N. (2018). Grad-cam++: Generalized
gradient-based visual explanations for deep convolu-
tional networks. In 2018 IEEE Winter Conference
on Applications of Computer Vision, pages 839–847.
IEEE.
Chefer, H. et al. (2021). Transformer interpretability be-
yond attention visualization. In IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition,
pages 782–791.
Chen, P., Li, Q., Biaz, S., Bui, T., and Nguyen, A. (2022).
gscorecam: What objects is clip looking at? In Pro-
ceedings of the Asian Conference on Computer Vision,
pages 1959–1975.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In IEEE Conference on Computer
Vision and Pattern Recognition, pages 248–255.
Dosovitskiy, A. et al. (2021). An image is worth 16x16
words: Transformers for image recognition at scale.
In International Conference on Learning Representa-
tions.
Jiang, P.-T., Zhang, C.-B., Hou, Q., Cheng, M.-M., and Wei,
Y. (2021). Layercam: Exploring hierarchical class ac-
tivation maps for localization. IEEE Transactions on
Image Processing, 30:5875–5888.
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J.,
Li, C., Yang, J., Su, H., Zhu, J., et al. (2023).
Grounding dino: Marrying dino with grounded pre-
training for open-set object detection. arXiv preprint
arXiv:2303.05499.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning,
J., Cao, Y., Zhang, Z., Dong, L., et al. (2022). Swin
transformer v2: Scaling up capacity and resolution.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 12009–
12019.
Petsiuk, V., Das, A., and Saenko, K. (2018). Rise: Ran-
domized input sampling for explanation of black-box
models. arXiv preprint arXiv:1806.07421.
Radford, A. et al. (2021). Learning transferable visual mod-
els from natural language supervision. In Interna-
tional Conference on Machine Learning, pages 8748–
8763.
Ramaswamy, H. G. et al. (2020). Ablation-cam: Vi-
sual explanations for deep convolutional network
via gradient-free localization. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 983–991.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2017). Grad-cam: Visual
explanations from deep networks via gradient-based
localization. In Proceedings of the IEEE International
Conference on Computer Vision, pages 618–626.
Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S.,
Mardziel, P., and Hu, X. (2020). Score-cam: Score-
weighted visual explanations for convolutional neu-
ral networks. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 24–25.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D.,
Lu, T., Luo, P., and Shao, L. (2021). Pyramid vi-
sion transformer: A versatile backbone for dense pre-
diction without convolutions. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion, pages 568–578.
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M.,
and Luo, P. (2021). Segformer: Simple and efficient
design for semantic segmentation with transformers.
Advances in Neural Information Processing Systems,
34:12077–12090.
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu,
Y., Feng, J., Xiang, T., Torr, P. H., et al. (2021). Re-
thinking semantic segmentation from a sequence-to-
sequence perspective with transformers. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 6881–6890.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Tor-
ralba, A. (2016). Learning deep features for discrim-
inative localization. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 2921–2929.