Predicting Eye Gaze Location on Websites
Ciheng Zhang¹, Decky Aspandi² and Steffen Staab²,³
¹Institute of Industrial Automation and Software Engineering, University of Stuttgart, Stuttgart, Germany
²Institute for Parallel and Distributed Systems, University of Stuttgart, Stuttgart, Germany
³Web and Internet Science, University of Southampton, Southampton, U.K.
Keywords:
Eye-Gaze Saliency, Image Translation, Visual Attention.
Abstract:
The World Wide Web, with websites and webpages as its main interface, facilitates the dissemination of important information. It is therefore crucial to optimize webpage design for better user interaction, which is primarily done by analyzing users' behavior, especially users' eye-gaze locations on the webpage. However, gathering such data is still labor and time intensive. In this work, we enable the development of automatic eye-gaze estimation given webpage screenshots as input by curating a unified dataset that consists of webpage screenshots, eye-gaze heatmaps and website layout information in the form of image and text masks. Our curated dataset allows us to propose a deep learning-based model that leverages both the webpage screenshot and its content information (image and text spatial locations), which are combined through an attention mechanism for effective eye-gaze prediction. In our experiments, we show the benefits of careful fine-tuning on our unified dataset to improve the accuracy of eye-gaze predictions. We further observe the capability of our model to focus on targeted areas (images and text) to achieve accurate eye-gaze area predictions. Finally, a comparison with other alternatives shows the state-of-the-art results of our approach, establishing a benchmark for the webpage-based eye-gaze prediction task.
1 INTRODUCTION
Analysis of users' behavior during their interaction with a website (and its contained webpages) is important for evaluating the website's overall quality. This information can be used to create webpages that are better optimized for interaction, and as such, there is a need to characterize these behaviors. One such characteristic is users' tendency to focus on certain areas (e.g. the upper left corner) during browsing (Shen and Zhao, 2014). Other important characteristics can be derived from users' eye-gaze data, which represents the visual attention of users during website interaction. These data are usually acquired from the location and movement of the human pupil and retina relative to the object of interest, which allows one to pinpoint the exact webpage area where users focus during their interactions and how it relates to the overall webpage structure. Although this information is crucial for creating better-optimized websites, acquiring eye-gaze data from every user over their entire browsing duration is difficult, given the complexity of the acquisition setting. Thus there is currently a need for an automatic approach that predicts these eye-gaze locations given an observation of the webpages.
In the computer vision field, the saliency prediction task, which estimates people's attention given an image, has been extensively investigated. The main task is to estimate a saliency map, which highlights the first location (or region) where observers' eyes focus, with photographs of natural scenes (Zhang et al., 2018; Kroner et al., 2020) commonly used as input. Several methods have been developed to solve this task, including machine learning (Hou and Zhang, 2008; Li et al., 2012) and, more recently, deep learning based approaches (Kroner et al., 2020), with quite high accuracy achieved. With this progress, there is an opportunity to adopt these automatic, visual-based saliency predictors for the eye-gaze prediction task by representing the input observation in the form of a webpage screenshot. This enables the adaptation of the task objective to eye-gaze heatmap prediction, in lieu of saliency map prediction, with both predicted maps representing the areas that most people (or, in our case, users) attend to.
The challenge, however, lies in the differences between the saliency and eye-gaze prediction tasks in terms of the content, structure and layout of the input images. For
instance, there is no concept of depth in webpage screenshots, as opposed to natural images, an aspect that is heavily exploited for general visual saliency estimation. In addition, high-contrast and varied colored areas in natural images are commonly regarded as salient, which may not be the case for webpage screenshots, given that users' eye-gaze locations depend strongly on the type of interaction during website browsing. This problem can be rectified by the use of a data-driven approach (Pan and Yang, 2010), which can be made feasible through fine-tuning with a dataset specialized to the particular task. However, this attempt is currently hampered by the lack of such a dataset.
In this work, we curate a generalized dataset for eye-gaze prediction from webpage screenshots to enable effective training of machine and deep learning approaches. Using this dataset, we propose a deep learning based method that incorporates the image and text locations of a webpage through mask modalities and combines them with an attention fusion mechanism. We then evaluate the impact of transfer learning, including a comparison with other alternatives, to establish a benchmark for this task. Specifically, the contributions of this work are:
1. The establishment of a unified benchmark dataset
for eye-gaze detection from webpage screenshots,
derived from website and user interactions data.
2. A novel deep learning-based and multi-modal
eye-gaze detector with internal attention that
leverages characteristics of input contents and im-
portance of each stream of modality for accurate
eye-gaze predictions.
3. Benchmark results of automatic eye-gaze location
estimators and our state-of-the-art results for eye-
gaze detection task given webpage screenshot in-
puts.
2 RELATED WORK
An early example of work analyzing website content is by Shen and Zhao (Shen and Zhao, 2014), where three types of webpages are analyzed: text-based, pictorial-based, and mixed (a combination of text and pictorial) websites. The authors show that certain attention characteristics of users exist during their interactions with webpages, with the main finding that users usually pay more attention to particular parts of websites (such as the top-left corner) and tend to focus on areas where large images are present. Furthermore, on websites from the 'Text' category, a preference to focus on certain parts of the page (the middle-left and bottom-left regions) is observed. Lastly, the authors propose multi-kernel machine learning inference for eye-gaze heatmap prediction (representing the visual attention of users) on their website eye-gaze dataset, Fixations in Webpage Images (FiWI).
In the computer vision field, the task of locating (or predicting) users' attention in natural images is commonly called saliency prediction. This task is commonly solved using machine learning techniques, given their automation capability. One of the earliest examples of such a method is Incremental Coding Length (ICL) (Hou and Zhang, 2008), which predicts the activation of salient locations by measuring the perspective entropy gain of each input feature (several image patches) as a linear combination of sparse coding basis functions. Another algorithm, Context-Aware Saliency Detection (CASD), capitalises on the concept of dominant objects as additional context to improve its predictions (Goferman et al., 2011). Furthermore, Houx et al. (Houx et al., 2012) propose an approach that aims to solve the figure-ground separation problem for prediction. They use a binary and holistic image descriptor, the Image Signature, defined as the sign function of the Discrete Cosine Transform (DCT) of an image, as additional input. Subsequently, the Hypercomplex Fourier Transform (HFT) (Li et al., 2012) is used to transform the input image to the frequency domain to estimate the saliency map. One relevant line of work utilizes deep learning based methods, given their accurate estimations and ability to leverage the increasingly large datasets that are available. This approach is based on an encoder-decoder structure with a pre-trained VGG network to predict saliency maps (Kroner et al., 2020). Finally, in recent years, several researchers have extended the saliency approach to work with image and depth input. One example is the work of Zhou et al. (Zhou et al., 2021), which uses a Hierarchical Multimodal Fusion Network to process both RGB (red, green and blue) color images and depth maps (one channel) as an additional input modality, resulting in more accurate gaze map predictions.
Even though all of the described methods work for general visual saliency prediction, their capability to predict users' eye-gaze locations on webpages has not yet been investigated, given the lack of available datasets. Thus, in this work, we propose to create a specialized webpage-based eye-gaze prediction dataset to allow for the development of automatic eye-gaze prediction from webpage screenshot inputs, and then utilize it to develop our deep learning based eye-gaze location predictor.
Figure 1: The flow of dataset generation.
3 METHODOLOGY
3.1 Dataset Gathering and Processing
In this work, we search, process and curate a gen-
eralized eye-gaze dataset from available datasets in
literature. Here, we focus on three publicly avail-
able user-website interaction datasets (where eye-
gaze and webpage screenshots are present): GazeMi-
ning (Menges, 2020), Contrastive Website (Aspandi
et al., 2022a), and Fixations in Webpage Images
(FiWI) dataset (Shen and Zhao, 2014). In this sec-
tion, we provide a short description of the datasets and proceed with a description of the pre-processing algorithm that we conduct to produce our unified dataset. The processed dataset is available at https://doi.org/10.18419/darus-3251 upon request.
GazeMining is a dataset of video and interac-
tion recordings on dynamic webpages (Menges,
2020). In this work, the authors focused on web-
sites which contain constantly changing scenes
(thus dynamic). The data was collected in March
2019 from four participants who interacted with
12 different websites.
Contrastive Website dataset is a recent website
interaction-based dataset, where participants were asked to visit two sets of websites (four each) and perform respective tasks (flight planning, route search, news searching and online shopping), resulting in more than 160 sessions (Aspandi et al., 2022a).
FiWI dataset (Shen and Zhao, 2014) is a dataset
that focuses on website-based saliency prediction.
However, it is small in terms of webpage screenshot availability (only 149 images) compared to the two previous datasets, which prevents its use for larger scale evaluations (this is especially true for a deep learning model). Therefore, the FiWI dataset is mostly (and commonly) utilized for comparative evaluation with other models, but not for model training.
Figure 1 shows the flow of our dataset curation,
data pre-processing and the results of the generalized
dataset. The first pre-processing step is the elimination of empty screenshots (which exist in small quantities in the GazeMining dataset), i.e. frames where no screenshot content is present, which can occur due to rapid sampling during acquisition. Secondly, due to the dynamic nature of the observed websites, the webpages' static parts (all rendered webpage elements that remain unchanged during the interaction time) are removed (blackened) in the original datasets, which requires us to find and collect only the recorded screenshot content, thus removing irrelevant observations (black parts). Thirdly,
we remove the duplicate webpage screenshots of both
datasets for efficiency and generate respective loca-
tions of Image and Text as independent layers (called
Image and Text Mask, which we will detail in Sec-
tion 3.2.1). Lastly, eye-gaze heatmap layers are gen-
erated (as ground truth) with respect to the duration of each observed eye-gaze. This pipeline is applied to all datasets, with the exception of the Contrastive Website dataset, where the first and second steps are skipped, and FiWI, where only the last step is necessary. In the end, our
unified dataset includes four sub-folders (data, label,
Imagemask, TextMask) with a total of 3119 screen-
shot examples (1546 samples or 49.6% from GazeMi-
ning, 1424 instances or 45.7% from Contrastive Web-
site dataset and 149 examples or 4.7% from FiWI)
along with the associated ground truths.
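For illustration, a minimal sketch of how the duration-weighted eye-gaze heatmaps of the last step could be rendered, and how empty screenshots could be flagged, is given below; the (x, y, duration) record format, the Gaussian width and the emptiness threshold are assumptions made for illustration rather than the exact released data layout.

```python
import numpy as np

def gaze_heatmap(fixations, height, width, sigma=30.0):
    """Render an eye-gaze heatmap by accumulating one Gaussian blob per fixation,
    weighted by its duration (assumed (x, y, duration) records in pixel coordinates)."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for x, y, duration in fixations:
        heatmap += duration * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    if heatmap.max() > 0:
        heatmap /= heatmap.max()  # normalize to [0, 1] as the ground-truth label
    return heatmap

def is_empty_screenshot(screenshot, threshold=1e-3):
    """Flag (nearly) all-black frames, e.g. missed captures caused by rapid sampling."""
    return np.asarray(screenshot, dtype=np.float32).mean() < threshold
```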
Figure 2: Structure of Multi Mask Input Attentional Network.
3.2 Multi Mask Input Attentional
Network (MMIAN) for Eye Gaze
Prediction
Our pre-processed dataset allows us to propose a deep learning based model that benefits from a sizable number of observations (Rodriguez-Diaz et al., 2021).
Here we propose the Multi Mask Input Attentional Network (MMIAN), which predicts eye-gaze locations given a webpage screenshot as input. Specifically, given input webpage screenshots $X_{1..n}$, where $X$ is a 2D matrix representing a webpage screenshot image and $n$ is the batch size, MMIAN estimates eye-gaze locations $\hat{Y}_{1..n}$, where each $\hat{Y}$ is a 2D eye-gaze heatmap whose cell values represent the probability of eye gaze at that location. The structure of MMIAN follows an encoder-decoder scheme inspired by the Multi-scale Information Network (MSI-Model) (Kroner et al., 2020), consisting of several modules: the Input and Mask Encoders, a Mask Generator, Attentional Feature Fusion, an Atrous Spatial Pyramid Pooling (ASPP) module, and a Decoder. In this work,
we further propose several important modifications:
1. Incorporation of several input masks to include
webpage content information of textual and im-
ages spatial location.
2. Addition of attention mechanisms for effective fu-
sion of input modalities.
The overall architecture of MMIAN can be seen in Figure 2. Specifically, an input webpage screenshot image is passed into the Mask Generator to produce an image mask and a text mask. Both masks are then concatenated and fed into the Mask Encoder to extract relevant features, while the input image is simultaneously fed into the Input Encoder. The Feature Fusion module then fuses the extracted mask features and input features with several attentional modules and combines them into a conjoint block. This feature block is fed into the ASPP part to enlarge the field of view, capturing views of the input at different resolutions and thus producing richer feature representations. These features are then passed through a series of up-sampling layers (the Decoder),
creating an eye-gaze saliency map as the final result. The implementation code of our model is available in our repository at https://github.com/ZackCHZhang/WebToGaze.
3.2.1 Input Masks Generator
It has been observed that webpage layout (which in
principle is arrangement of images and texts) is an
important aspect of webpage (Shen and Zhao, 2014).
Figure 3: One example of mask generation. Figure (a) is an example webpage screenshot; Figures (b) and (c) are the generated text mask and image mask, respectively.
To benefit from this information, we propose to incorporate the spatial locations of both image and text on
websites with mask representations: image and text
masks. Each mask is represented as a matrix whose cells are activated where an image or text, respectively, is present in the webpage screenshot.
Image Mask Generation: For image location recognition, we use the method of Xie et al. (Xie et al., 2020), which produces the bounding-box locations of images in the input as a binary map (i.e. the map value is set to 1 where an image is present, and 0 otherwise).
Text Mask Generation: we use one of the available optical character recognition (OCR) methods, the Efficient and Accurate Scene Text detector (EAST) (Zhou et al., 2017), to locate the text in the webpage screenshot input. Similar to the Image Mask, this process generates the corresponding Text Mask in binary map format.
One example of generated masks from our curated dataset can be seen in Figure 3. We can see that both masks contain quite accurate locations of the images and text in the input webpage screenshot. Even though some mild imperfections occur (e.g. some buttons are recognized as images, the locations of certain images are missed due to their small size, and some text within images is included in the text mask), in general the locations of both images and text in the input webpage screenshots are properly recognized. These masks are then concatenated and fed into the Mask Encoder.
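As a minimal sketch of this step, the snippet below rasterizes detector output into the two binary masks and stacks them for the Mask Encoder; the (x1, y1, x2, y2) box format and the assumption that the image and text detectors (UIED and EAST) are available as bounding-box producers are ours, for illustration only.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize a list of (x1, y1, x2, y2) bounding boxes into a binary mask."""
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask

def build_mask_input(image_boxes, text_boxes, height, width):
    """Stack the image mask and text mask into a 2-channel input for the Mask Encoder."""
    image_mask = boxes_to_mask(image_boxes, height, width)
    text_mask = boxes_to_mask(text_boxes, height, width)
    return np.stack([image_mask, text_mask], axis=0)  # shape: (2, H, W)
```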
3.2.2 Input and Mask Encoder
The Encoder is a series of convolutional neural network layers that extract visual features from the input matrix, which is the original webpage screenshot image for the Input Encoder and the concatenated generated masks for the Mask Encoder. Both encoders are based on the VGG16 (Simonyan and Zisserman, 2014) architecture with the last fully connected layer removed, and a further reduction of convolutional layers (five convolutional layers with ReLU and two max-pooling layers) applied for the Mask Encoder. The Input and Mask Encoders then produce the Screenshot Features and Mask Features, which are combined through the Multi-Modal Attentional Fusion module.
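A rough PyTorch sketch of the two encoders under this description is given below (assuming a recent torchvision); the exact channel widths and pooling positions of the reduced Mask Encoder are assumptions and may differ from our released implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class InputEncoder(nn.Module):
    """VGG16-based feature extractor for the webpage screenshot (classifier head dropped)."""
    def __init__(self):
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features

    def forward(self, x):
        return self.features(x)

class MaskEncoder(nn.Module):
    """Reduced encoder for the 2-channel concatenated masks:
    five convolutional layers with ReLU and two max-pooling layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, masks):
        return self.features(masks)
```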
3.2.3 Multi-Modal Attentional Fusion
We employ multi-modal inputs (consisting of the webpage screenshot and the generated masks) to benefit from an additional stream of information (Zhou et al., 2021; Aspandi et al., 2020). Moreover, we introduce an atten-
tional fusion mechanism for more effective fusion of
these modalities, as opposed to simple fusion opera-
tions of summation or concatenation (Aspandi et al.,
2022b), and further solving the problems of inconsis-
tent input semantics and scales (Dai et al., 2021). An
example of attentional feature fusion block (AFF) can
be seen in Figure 4 where it receives two streams of
inputs (denoted as F1 and F2) and initially fuses these
input features through matrix addition (denoted as ⊕). The internal Multi-scale Channel Attention module (which aggregates channel context through a series of point-wise convolutions), followed by a sigmoid function, generates a fusion weight. This fusion weight and its complement (one minus the weight) are multiplied (denoted as ⊗) with each of the original inputs respectively, and then added together (⊕) to produce their weighted average as the fused feature Z. This fusion process is applied to the received Mask Features and Screenshot Features three times, allowing different scales of the respective features to be observed (these operations are marked with a circled A in Figure 2). The resulting fused features are then concatenated into a conjoint block for the subsequent pipeline.
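The sketch below follows the published AFF formulation (Dai et al., 2021) in PyTorch; the channel reduction ratio and the use of batch normalization are taken from that paper rather than from our own configuration, so it should be read as an approximation of the fusion block described above.

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention: point-wise local context plus global pooled context."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Attentional feature fusion of two feature streams via a learned weight map."""
    def __init__(self, channels):
        super().__init__()
        self.attention = MSCAM(channels)

    def forward(self, f1, f2):
        weight = self.attention(f1 + f2)           # fusion weight in (0, 1)
        return weight * f1 + (1.0 - weight) * f2   # weighted average of the two streams
```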
3.2.4 Atrous Spatial Pyramid Pooling Module
(ASPP)
Figure 4: Left: the Attentional Feature Fusion block; right: the Multi-scale Channel Attention module (Dai et al., 2021).
The ASPP module is made by superimposing atrous convolution filters with different dilation rates (Chen et al., 2017) to obtain features from larger receptive fields. This is beneficial given that webpage layouts are commonly modularized into a variety of different sub-layouts. Increasing the receptive field (or field of view) of the internal convolutional layers (where the results from several filters are combined together) enables us to capture such layout configurations. In our ASPP module, we use dilation rates of 4, 8, and 12, with 256 filters of size 1 × 1 to ensure kernel size compatibility. This module is applied to the conjoint block input, and the resulting features are passed on for the decoding operation.
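A minimal sketch of such an ASPP block with the stated dilation rates is shown below; the way the branch outputs are merged (concatenation followed by a 1 × 1 projection) is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling with dilation rates 4, 8 and 12 (Chen et al., 2017)."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=4, dilation=4),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=8, dilation=8),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=12, dilation=12),
        ])
        # Project the concatenated branch outputs back to a single feature block.
        self.project = nn.Conv2d(4 * out_channels, out_channels, kernel_size=1)

    def forward(self, x):
        features = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(features)
```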
3.2.5 Output Decoder
The Decoder consists of a series of convolutional layers with up-sampling that decode the input features, generating the final prediction of the eye-gaze heatmap location (in the form of a 2D tensor, initially in binary mode). Specifically, a number of up-sampling blocks are used, each consisting of a bilinear scaling operation (which sequentially doubles the spatial size of the input tensor) followed by a convolutional layer with kernel size 3 × 3. With this module, we take the features generated by ASPP as input and obtain the final eye-gaze heatmap predictions. Finally, we convert the binary heatmap to gray-scale format, to conform to the original scale of the eye-gaze heatmaps in the ground truth.
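The sketch below illustrates this decoding scheme; the number of up-sampling blocks and the channel widths are assumptions, since the description above only fixes the bilinear doubling and the 3 × 3 convolutions.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Each block doubles the spatial resolution (bilinear) and applies a 3 x 3 convolution;
    a final 3 x 3 convolution produces the single-channel eye-gaze heatmap."""
    def __init__(self, in_channels=256, num_blocks=4):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_blocks):
            layers += [
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ]
            channels //= 2
        layers.append(nn.Conv2d(channels, 1, kernel_size=3, padding=1))
        self.decode = nn.Sequential(*layers)

    def forward(self, x):
        return self.decode(x)
```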
3.3 Training Procedure
To evaluate the impact of training models on webpage screenshot based eye-gaze prediction datasets, we first perform the training process using the original saliency prediction model of Kroner et al. (MSI-Model) (Kroner et al., 2020). This model has been pre-trained on a general visual saliency dataset (SALICON) (Jiang et al., 2015); we refer to it as P-MSI, and it serves as our baseline. Given this pre-trained model, we perform fine-tuning using our pre-processed datasets (the GazeMining and Contrastive Website datasets), producing FT-MSI-GM and FT-MSI-CW respectively, and evaluate the observed accuracy gains. We fine-tune the model instead of training from scratch to maximize the potential accuracy obtained; we empirically found the overall accuracy of the latter to be substantially lower.
Additionally, we evaluate the impact of training using a combined dataset (i.e. we combine the examples of both GazeMining and Contrastive Website) by further training the MSI-Model with this merged dataset, which we name FT-MSI-CMB. The training is conducted until convergence, when no further accuracy improvement is perceived. Afterwards, we proceed to the training stage of our proposed MMIAN model by first transferring the Encoder and Decoder weights from the best performing models of the previous step, and initializing the weights of both the Mask Encoder and the ASPP module with a zero-mean uniform distribution.
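A sketch of this initialization strategy is given below; the module attribute names (`mask_encoder`, `aspp`) and the uniform bounds are placeholders for illustration, not the names used in our repository.

```python
import torch

def init_mmian(mmian, ft_msi_state_dict):
    """Transfer matching Encoder/Decoder weights from a fine-tuned MSI model and
    randomly initialize the newly added modules (attribute names are placeholders)."""
    own_state = mmian.state_dict()
    transferable = {
        name: weight for name, weight in ft_msi_state_dict.items()
        if name in own_state and own_state[name].shape == weight.shape
    }
    mmian.load_state_dict(transferable, strict=False)  # leave new modules untouched

    # Newly added modules start from a zero-mean uniform distribution.
    for module in (mmian.mask_encoder, mmian.aspp):
        for param in module.parameters():
            torch.nn.init.uniform_(param, a=-0.05, b=0.05)
    return mmian
```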
All of the training is conducted by minimizing the Kullback-Leibler divergence (KLD) loss function shown in Equation 1, with Y indicating the ground truth and ε a regularization constant that guarantees the denominator is not zero. Under this loss, the estimation of saliency maps can be regarded as a probability distribution prediction task, as formulated by Jetley et al. (Jetley et al., 2016). The output of the estimator is normalized to non-negative values, and the KLD value measures the level of difference between predictions and ground truth. Finally, the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $10^{-4}$ is used for overall optimization.
$$ D_{KL}(\hat{Y} \,\|\, Y) = \sum_i Y_i \ln\!\left(\varepsilon + \frac{Y_i}{\varepsilon + \hat{Y}_i}\right) \qquad (1) $$
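A compact PyTorch version of this objective could look as follows; the value of ε and the per-image normalization of both maps to probability distributions are assumptions consistent with Equation 1 but not prescribed by it.

```python
import torch

def kld_loss(pred, target, eps=1e-7):
    """KLD between the ground-truth and predicted gaze maps, following Equation 1.
    Both maps are normalized to sum to one per image before comparison."""
    pred = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    target = target / (target.sum(dim=(-2, -1), keepdim=True) + eps)
    return (target * torch.log(eps + target / (pred + eps))).sum(dim=(-2, -1)).mean()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```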
3.4 Experiment Setting
3.4.1 Dataset Experiment Setting
In this experiment, all three pre-processed datasets (cf. Section 3.1) are utilized, with the numbers of training, validation and test instances shown in Table 1. Specifically, both the GazeMining and Contrastive Website datasets are divided into 60%, 20% and 20% of samples for the training, validation, and test sets respectively, whereas for the FiWI dataset (Shen and Zhao, 2014), all samples (149 images) are used for testing.
Table 1: Training, validation and test set for each dataset.
Splits GazeMining Contrastive Website FiWI
Training 928 868 -
Validation 309 278 -
Test 309 278 149
3.4.2 Quantitative Metrics
We use three quantitative metrics, the Area under the Receiver Operating Characteristic curve (AUC), Normalized Scanpath Saliency (NSS), and Pearson's Correlation Coefficient (CC), to judge the quality of the models' predictions.
NSS is commonly used for general saliency pre-
diction tasks as a direct correspondence mea-
sure between predicted saliency maps and ground
truth, which is computed as average normal-
ized saliency at fixated locations (Peters et al.,
2005). Furthermore, NSS is sensitive to false pos-
itives, relative differences in saliency across eval-
uated image, and general monotonic transforma-
tions (Bylinskii et al., 2018). With $Y^B$ as a binary map of the true fixation locations, $N$ the total number of pixels, and $i$ indexing each pixel, the NSS value can be calculated using the equation shown below:

$$ NSS(\hat{Y}, Y^B) = \frac{1}{N} \sum_i \hat{Y}_i \times Y^B_i \qquad (2) $$
AUC-J evaluates predicted eye-gaze heatmaps as
a classification task, where each prediction pixel
is evaluated through a binary classification setting.
Here, a certain threshold value is used to decide
whether it is correctly predicted as an eye-gaze location, thus emphasizing the frequency of true positives. We use the method described by Judd et al. (Judd et al., 2009) to select all required thresholds.
The CC metric evaluates the level of linear relationship between two input distributions. In the eye-gaze location prediction task, both the generated eye-gaze location map and the ground truth are treated as random variables (Le Meur et al., 2007), and the correlation is calculated from these two map inputs following Equation 3. Thus, with the operator $\sigma$ denoting covariance, the CC value can be calculated as:

$$ CC(\hat{Y}, Y) = \frac{\sigma(\hat{Y}, Y)}{\sigma(Y) \times \sigma(\hat{Y})} \qquad (3) $$
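For reference, a simple NumPy formulation of NSS and CC is sketched below; following common practice (Peters et al., 2005), the prediction is standardized and the NSS average is taken over the fixated pixels, which is our reading of Equation 2 rather than a verbatim transcription of it.

```python
import numpy as np

def nss(pred, fixation_map):
    """Normalized Scanpath Saliency: mean standardized prediction at fixated pixels."""
    pred = (pred - pred.mean()) / (pred.std() + 1e-7)
    return float(pred[fixation_map > 0].mean())

def cc(pred, target):
    """Pearson's correlation coefficient between predicted and ground-truth maps (Eq. 3)."""
    return float(np.corrcoef(pred.ravel(), target.ravel())[0, 1])
```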
4 EXPERIMENT RESULT
In this section, we first present results of both base-
line and our proposed approach using part of our pre-
processed dataset (GazeMining and Contrastive Web-
site), as outlined in Section 3. Then, we provide a
comparison of best results of our approach with al-
ternative saliency prediction models using full set of
our pre-processed dataset, establishing a benchmark
for eye-gaze prediction task given webpage screen-
shot inputs.
Table 2: Results of fine-tuned models on both preprocessed
GazeMining and Contrastive Website datasets.
No. Models GazeMining (NSS AUC-J CC) Contrastive Website (NSS AUC-J CC)
1. P-MSI 1.037 0.699 0.130 0.839 0.683 0.109
2. FT-MSI-GM 1.398 0.753 0.170 0.717 0.651 0.093
3. FT-MSI-CW 0.695 0.655 0.081 2.364 0.777 0.277
4. FT-MSI-CMB 1.447 0.752 0.170 2.269 0.789 0.269
4.1 Impact of Fine-Tuning Using
Independent and Combined Dataset
Table 2 shows the results of four alternative models trained using different datasets. Here we can see that the pre-trained MSI (P-MSI) model performs worse in comparison to the other models, which suggests its inability to generalize to the eye-gaze saliency prediction task. We can further observe that when fine-tuning is applied, the improvement over the original P-MSI is noticeable, especially when the model is tested on the same dataset used for fine-tuning. This can be seen in the higher values achieved on all metrics by FT-MSI-GM and FT-MSI-CW compared to P-MSI on the GazeMining dataset and Contrastive Website dataset respectively.
When the model is trained using the combined dataset, however, the accuracy improvement is smaller. This is indicated by the lower quantitative values achieved by FT-MSI-CMB compared to its counterparts (FT-MSI-GM and FT-MSI-CW) that are trained separately. This phenomenon may stem from the difficulty for the estimator of learning from these two quite distinct datasets, and also from the differing nature of the tasks executed by the users in each dataset.
Figure 5 presents visual prediction examples for the baseline model (P-MSI) and the best performing model (FT-MSI-CW) on the Contrastive Website dataset (from which the example originates). The screenshot example is part of the route search task, where the user has to enter vehicle information to estimate fuel consumption. Thus it is natural for users to focus on the dialog box (displayed in the center of the webpage screenshot), resulting in the users' fixations being located within this area.
Figure 5: Comparison between predictions from P-MSI (baseline) and FT-MSI-CW (fine-tuned). Figure (a) shows the webpage screenshot with ground truth (eye-gaze heatmap); Figures (b) and (c) show the predictions from P-MSI and FT-MSI-CW.
By evaluating the predicted eye-gaze heatmaps of the baseline model P-MSI, we can see that even though it manages to predict parts of the ground-truth (eye-gaze) locations, it also falsely predicts other, less relevant locations as eye-gaze locations (i.e. the advertisement area). These results come from its tendency to treat high color contrast as areas of interest, which is common in natural image datasets (SALICON (Jiang et al., 2015)). However, this leads to more false positives and thus reduces its accuracy. Our fine-tuned model, in contrast, manages to reduce these inaccuracies by absorbing the characteristics of the dataset during fine-tuning,
adapting the model to this specific eye-gaze prediction task. In this example, we see that the predictions of the fine-tuned model are more precise, especially on the dialog box, and that they exclude some of the high-contrast areas that would normally be recognized as salient (e.g. advertisement locations).
From these experimental results, it can be concluded that a saliency detection model trained on natural image datasets does not work well when applied directly to the webpage screenshot scenario. This can be mitigated by careful fine-tuning on a dedicated dataset. Given this result, we use FT-MSI-GM and FT-MSI-CW as the base for MMIAN optimization, and for comparison in the next section.
4.2 Impact of Multi-Modal Masks and
Attention-Based Fusion
Table 3 presents the results of our proposed MMIAN model and the best results of the fine-tuned MSI-Model from the previous section. From these results, it can be seen that MMIAN produces higher values across the quantitative metrics overall, outperforming the fine-tuned MSI-Model. The gains on all of these metrics together suggest that the estimation results of MMIAN are more accurate than the best of the FT-MSI results (from FT-MSI-GM and FT-MSI-CW), with improvements on both true positives (as judged by AUC-Judd) and false positives (as evaluated by the NSS and CC scores).
In order to provide a more comprehensive analysis of the impact of the image and text masks and the attention mechanism, we first show the locations of detected images and text on the webpage screenshot input, to provide a semantic explanation of their relevance for our MMIAN model in producing accurate predictions. Then, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al., 2017) by propagating the error value multiplied by the gradient to each convolutional layer, enabling us to investigate the relevant activations of the convolutional kernels with respect to the image input, indicating the most prominent parts of the webpage screenshot for the prediction. An example of an original webpage screenshot input that includes the image and text locations (drawn as bounding boxes), the associated eye-gaze location ground truth, the generated Grad-CAM heatmaps, and the predictions of the fine-tuned MSI and MMIAN is shown in Figure 6.
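A minimal sketch of this Grad-CAM procedure for a dense prediction model is shown below; the use of a mean-squared-error term as the scalar to back-propagate and the choice of target layer are our assumptions for illustration, not the exact scheme of our implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, target_map):
    """Grad-CAM for a dense predictor: weight each channel of the target layer's
    activation by the spatial mean of its gradient, then sum and rectify."""
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(lambda m, gin, gout: gradients.append(gout[0]))

    loss = F.mse_loss(model(image), target_map)  # assumed scalar objective for attribution
    loss.backward()
    fwd.remove()
    bwd.remove()

    weights = gradients[0].mean(dim=(-2, -1), keepdim=True)   # channel-wise importance
    cam = torch.relu((weights * activations[0]).sum(dim=1))   # weighted activation map
    return cam / (cam.max() + 1e-7)
```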
In Figure 6, we can first see that the prediction of MMIAN is more accurate than that of FT-MSI, with larger and correctly identified eye-gaze areas present. This is especially noticeable in areas where both images and text are present, for example the areas containing a button (recognized as an image) and the product description, which in this case are indeed the locations the user looks at. This higher accuracy can be attributed to the use of the image and text masks during learning, which are combined effectively through the attention mechanism to properly 'guide' the learning process to prioritize these areas. This impact is also apparent from the Grad-CAM heatmaps of FT-MSI and MMIAN, where the Grad-CAM heatmaps of MMIAN are more concentrated on both the image and text areas simultaneously (indicating where the model's attention is) compared to those produced by FT-MSI. Lastly, we also notice more frequent activations in the Grad-CAM heatmaps of MMIAN compared to those from FT-MSI, which may suggest that the larger perceptual field of view of MMIAN helps it to produce more accurate eye-gaze estimations overall.
4.3 State-of-the-Arts Comparison
We compare the best predictions of our approach against seven other available generalized visual saliency models:
1. Context-Aware Saliency Detection
(CASD) (Goferman et al., 2011).
2. Discrete Cosine Transform (DCTS) (Houx et al.,
2012).
3. Hypercomplex Fourier Transform (HFT) (Li
et al., 2012).
4. Incremental Coding Length (ICL) (Hou and
Zhang, 2008).
Figure 6: Example of prediction results of MMIAN and Fine-tuned MSI Model (FT-MSI). First column of (a) and (d) shows
webpage screenshot with outlined text (light green box) and image (purple box) area, along with eye-gaze heatmap ground
truth. Second column (b) and (e) shows predicted result from FT-MSI and MMIAN model, with third column of (c) and (f)
shows respective Grad-CAM heatmap.
5. RARE (Riche et al., 2012).
6. SeoMilanfar (Seo and Milanfar, 2009).
7. Spectral Residual (SR) (Hou and Zhang, 2007).
as well as one specialized eye-gaze location estimator, Multiple Kernel Learning (MKL) (Shen and Zhao, 2014).
Table 4 shows the results of all evaluated approaches. Here we see the state-of-the-art results of MMIAN, which outperforms the other alternatives, including the eye-gaze predictor MKL, by a large margin on the FiWI dataset (note that no training is involved for the FiWI evaluation). This result is
mainly due to large differences in task characteris-
tics between general visual saliency prediction (where
first seven models are trained) and webpage screen-
shot based eye-gaze estimations tasks (where MKL
and MMIAN are specialized). This highlights the in-
ability of visual saliency based models to generalize
on this specific task. Furthermore, our approach per-
forms better than MKL on FiWI, demonstrating the
effectiveness of our overall approach for eye-gaze es-
timation task.
Table 3: Quantitative results of both fine-tuned MSI models
and MMIAN model.
No. Models GazeMining (NSS AUC-J CC) Contrastive Website (NSS AUC-J CC)
1. FT-MSI 1.398 0.753 0.170 2.364 0.777 0.277
2. MMIAN 1.579 0.764 0.184 2.487 0.787 0.292
Figure 7 shows prediction examples from our MMIAN and the four other best-performing models on FiWI (excluding MKL, due to the lack of an available implementation of the model). Based on this figure, we can observe that most of the alternative models mark a large area of the webpage screenshot as potential eye-gaze locations. However, this leads to largely false positive predictions, given the mismatch between the predictions and the actual eye-gaze locations of the users of these webpage screenshots (i.e. most of the area is falsely identified as eye-gaze locations). In contrast, our approach produces more precise estimates, as observed from the more refined (and accurate) predicted eye-gaze areas (which explains the large margin in the NSS metric values of our model compared to the others, as shown in Table 4). This is
mainly due to the tendency of visual saliency models
to focus on pure appearance of webpage screenshot
(i.e. high contrast images), as opposed to MMIAN
that has been conditioned to the specific character-
istics of our pre-processed eye-gaze dataset (which
inherently contains eye-gaze characteristics of users,
given the webpage screenshot). Remarkably, our model is also capable of correctly predicting eye-gaze locations in relevant areas, such as text (third row), images (fourth row) and their combinations (first, second and fifth rows), in comparison with the other approaches.
Here we see that our models’ predictions are consis-
tently more accurate than alternatives, demonstrating
the effectiveness of our approach.
Table 4: Comparison of existing saliency predictors evaluated on test sets of our full pre-processed dataset. The boldface
indicates best results, red color implies second-best results, and third-best results are marked by blue coloured fonts.
No. Models GazeMining (NSS AUC CC) Contrastive Website (NSS AUC CC) FiWI (NSS AUC CC)
1. CASD (Goferman et al., 2011) 0.567 0.653 0.064 0.419 0.614 0.053 0.680 0.732 0.233
2. DCTS (Houx et al., 2012) 0.479 0.618 0.053 0.256 0.552 0.035 0.541 0.671 0.195
3. HFT (Li et al., 2012) 0.707 0.644 0.088 0.534 0.593 0.067 0.740 0.737 0.251
4. ICL (Hou and Zhang, 2008) 0.485 0.518 0.057 0.192 0.490 0.030 0.444 0.618 0.162
5. RARE (Riche et al., 2012) 0.632 0.653 0.072 0.382 0.589 0.052 0.850 0.758 0.280
6. SeoMilanfar (Seo and Milanfar, 2009) 0.393 0.584 0.038 0.350 0.571 0.044 0.445 0.651 0.163
7. SR (Hou and Zhang, 2007) 0.566 0.639 0.062 0.510 0.612 0.062 0.635 0.714 0.216
8. MKL (Shen and Zhao, 2014) - - - - - - 1.200 0.702 0.382
9. MMIAN (proposed) 1.579 0.764 0.184 2.487 0.787 0.292 1.385 0.786 0.397
Figure 7: Five examples of screenshot inputs from the FiWI dataset (with the respective eye-gaze heatmap ground truth overlaid) and the predictions from five saliency predictors (including ours).
5 CONCLUSION
In this work, we enable the development of automatic eye-gaze estimation given webpage screenshot inputs, with the goal of improving webpage layouts and hence the respective user interactions. We do this
by developing a unified eye-gaze dataset from three
available website and user interaction-based datasets:
GazeMining, Contrastive Website and FiWI. We then
pre-process each dataset to produce necessary data
for eye-gaze prediction task, such as webpage screen-
shot and corresponding eye-gaze heatmaps as ground
truth. In addition, we generate image and textual lo-
cations from webpage (in the form of masks) which
can be used for training and modeling. Given our uni-
fied eye-gaze dataset, then we propose a novel deep
learning based and multi-modal attentional network
eye-gaze predictor to benefit from the characteristics
of the dataset. Our proposed approach leverages spa-
tial locations of text and image (in form of masks),
which is further fused with attentional mechanisms to
enhance the prediction results.
During analysis of the impact of fine-tuning using
our pre-processed dataset, and the effects of training
when the combined dataset is used, we found that the
prediction results are indeed improved when careful
fine-tuning is conducted. Then, we evaluate the predictions of our full approach (MMIAN) with respect to the ground truth, the detected text and image locations on the webpage screenshot, and the Grad-CAM activations. Here we observe accurate predictions from our approach, especially on the textual and image areas where users are actually looking, with large and wide Grad-CAM activations concentrated on these locations. This observation demonstrates the benefit of using both image and text masks as input in combination with the attention mechanism.
To assess the competitiveness of our proposed approach, we compare it with other saliency prediction alternatives, establishing a benchmark for the eye-gaze prediction task. In this comparison, we found state-of-the-art results for our model, with high scores across the quantitative metrics and lower false positive rates than the other approaches. Visual analysis further confirms these findings: our approach produces more accurate predictions of eye-gaze locations on relevant website regions, including where text and images are present. This result suggests the superiority of our approach in capturing user behavior. Future work will incorporate other user behavior characteristics (e.g. mouse trajectories) and, if accessible, user attributes such as age and location as additional modalities to further improve prediction accuracy.
ACKNOWLEDGEMENT
This work is funded by the UDeco project under the German BMBF KMU-innovativ programme (01IS20030B).
REFERENCES
Aspandi, D., Doosdal, S., Ülger, V., Gillich, L., and
Staab, S. (2022a). User interaction analysis through
contrasting websites experience. arXiv preprint
arXiv:2201.03638.
Aspandi, D., Mallol-Ragolta, A., Schuller, B., and Binefa,
X. (2020). Latent-based adversarial neural networks
for facial affect estimations. In 2020 15th IEEE Inter-
national Conference on Automatic Face and Gesture
Recognition (FG 2020), pages 606–610. IEEE.
Aspandi, D., Sukno, F., Schuller, B. W., and Binefa, X.
(2022b). Audio-visual gated-sequenced neural net-
works for affect recognition. IEEE Transactions on
Affective Computing.
Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., and Durand,
F. (2018). What do different evaluation metrics tell us
about saliency models? IEEE transactions on pattern
analysis and machine intelligence, 41(3):740–757.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2017). Deeplab: Semantic image seg-
mentation with deep convolutional nets, atrous convo-
lution, and fully connected crfs. IEEE transactions on
pattern analysis and machine intelligence, 40(4):834–
848.
Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., and Barnard, K.
(2021). Attentional feature fusion. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision, pages 3560–3569.
Goferman, S., Zelnik-Manor, L., and Tal, A. (2011).
Context-aware saliency detection. IEEE transac-
tions on pattern analysis and machine intelligence,
34(10):1915–1926.
Hou, X. and Zhang, L. (2007). Saliency detection: A spec-
tral residual approach. In 2007 IEEE Conference on
computer vision and pattern recognition, pages 1–8.
IEEE.
Hou, X. and Zhang, L. (2008). Dynamic visual attention:
Searching for coding length increments. Advances in
neural information processing systems, 21.
Houx, D., Harel, J., and Koch, C. (2012). Image signa-
ture: highlighting sparse salient regions. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
34(1):194–201.
Jetley, S., Murray, N., and Vig, E. (2016). End-to-end
saliency mapping via probability distribution predic-
tion. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 5753–
5761.
Jiang, M., Huang, S., Duan, J., and Zhao, Q. (2015). Sali-
con: Saliency in context. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1072–1080.
Judd, T., Ehinger, K., Durand, F., and Torralba, A. (2009).
Learning to predict where humans look. In 2009
IEEE 12th international conference on computer vi-
sion, pages 2106–2113. IEEE.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Kroner, A., Senden, M., Driessens, K., and Goebel, R.
(2020). Contextual encoder–decoder network for vi-
sual saliency prediction. Neural Networks, 129:261–
270.
Le Meur, O., Le Callet, P., and Barba, D. (2007). Predict-
ing visual fixations on video based on low-level visual
features. Vision research, 47(19):2483–2498.
Li, J., Levine, M. D., An, X., Xu, X., and He, H. (2012). Vi-
sual saliency based on scale-space analysis in the fre-
quency domain. IEEE transactions on pattern analy-
sis and machine intelligence, 35(4):996–1010.
Menges, R. (2020). GazeMining: A dataset of video and interaction recordings on dynamic web pages, labels of visual change, segmentation of videos into stimulus shots, and discovery of visual stimuli.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learn-
ing. IEEE Transactions on Knowledge and Data En-
gineering, 22(10):1345–1359.
Peters, R. J., Iyer, A., Itti, L., and Koch, C. (2005). Compo-
nents of bottom-up gaze allocation in natural images.
Vision Research, 45(18):2397–2416.
Riche, N., Mancas, M., Gosselin, B., and Dutoit, T. (2012).
Rare: A new bottom-up saliency model. In 2012 19th
IEEE International Conference on Image Processing,
pages 641–644. IEEE.
Rodriguez-Diaz, N., Aspandi, D., Sukno, F. M., and Binefa,
X. (2021). Machine learning-based lie detector ap-
plied to a novel annotated game dataset. Future Inter-
net, 14(1):2.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2017). Grad-cam: Visual
explanations from deep networks via gradient-based
localization. In Proceedings of the IEEE international
conference on computer vision, pages 618–626.
Seo, H. J. and Milanfar, P. (2009). Static and space-time
visual saliency detection by self-resemblance. Journal
of vision, 9(12):15–15.
Shen, C. and Zhao, Q. (2014). Webpage saliency. In Eu-
ropean conference on computer vision, pages 33–46.
Springer.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Xie, M., Feng, S., Xing, Z., Chen, J., and Chen, C. (2020).
UIED: A Hybrid Tool for GUI Element Detection,
page 1655–1659. Association for Computing Machin-
ery, New York, NY, USA.
Zhang, D., Fu, H., Han, J., Borji, A., and Li, X. (2018).
A review of co-saliency detection algorithms: Funda-
mentals, applications, and challenges. ACM Trans-
actions on Intelligent Systems and Technology (TIST),
9(4):1–31.
Zhou, W., Liu, W., Lei, J., Luo, T., and Yu, L. (2021). Deep
binocular fixation prediction using a hierarchical mul-
timodal fusion network. IEEE Transactions on Cog-
nitive and Developmental Systems, pages 1–1.
Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., and
Liang, J. (2017). East: An efficient and accurate scene
text detector. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).