Predicting Visual Importance of Mobile UI Using Semantic Segmentation
Ami Yamamoto, Yuichi Sei (https://orcid.org/0000-0002-2552-6717), Yasuyuki Tahara (https://orcid.org/0000-0002-1939-4455) and Akihiko Ohsuga (https://orcid.org/0000-0001-6717-7028)
Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
Keywords:
Deep Learning, Visual Importance, Mobile Interface, User Interface for Design.
Abstract:
When designing a UI, it is necessary to understand what elements are perceived to be important to users. The
UI design process involves iteratively improving the UI based on feedback and eye-tracking results on the
UI created by the designer, but this iterative process is time-consuming and costly. To solve this problem,
several studies have been conducted to predict the visual importance of various designs. However, no studies
specifically focus on predicting the visual importance of mobile UI. Therefore, we propose a method to predict
visual importance maps from mobile UI screenshot images and semantic segmentation images of UI elements
using deep learning. The predicted visual importance maps were objectively evaluated and scored higher
than the baseline. By combining the features of the semantic segmentation images appropriately, the predicted
maps became smoother and more similar to the ground truth.
1 INTRODUCTION
In recent years, mobile terminals, as typified by
smartphones, have spread rapidly, and the rate of In-
ternet usage via mobile terminals has also increased.
In tandem with this growth, the types of applications
available on mobile devices and the number of down-
loads continue to increase, and consumer activities
centered on lifestyle and entertainment, such as shop-
ping, payment, and video streaming, are also expand-
ing. As a result, mobile application developers and
designers need to understand the elements that will
engage users to develop more usable applications.
Designers make improvements based on feedback and
eye tracking results on the UI they create. However,
these methods require user research for each design, which
is time-consuming and costly.
In this study, we focused on visual importance,
rather than visual saliency, which has been widely
studied, as a metric for quantitatively evaluating de-
sign. Visual saliency is estimated from actual eye
gaze information obtained by eye tracking, whereas
visual importance data is created by mapping the ar-
eas that users perceive as important when they look at
a design, regardless of their gaze. Therefore, visual
importance is strongly related to semantic categories
such as text and images, as well as position and hue
(Bylinskii et al., 2017).
Since mobile terminal screens are smaller than PC
screens, the number of objects that can be displayed
on a single screen is smaller, and visual saliency man-
ifests itself differently on mobile terminals than on
PCs (Leiva et al., 2020). Therefore, visual importance on mobile devices is also likely to differ from that on PCs, which makes it worthwhile to predict visual importance specifically for mobile UI. In addition, new design patterns and interface elements are frequently introduced on mobile application platforms. Furthermore,
features such as hover status in PC interfaces are not
applicable in mobile UIs (Swearngin and Li, 2019).
Since mobile applications are finger-operated, more
emphasis is placed on visual importance as a quality
characteristic. In addition, since UI is composed of
various design elements such as text, images, and but-
tons, it is reasonable to use visual importance in the
optimization of UI design and feedback tools. Ac-
curately predicting the visual importance of a mobile
UI can provide designers with real-time feedback and
design optimization.
We propose a method for predicting visual impor-
tance maps from mobile UI screenshot images and
semantic segmentation images of UI elements using
deep learning. We investigated three different feature
combination methods and evaluated the predicted vi-
sual importance maps objectively and subjectively.
The proposed method was rated higher than the base-
line.
This paper is organized as follows. Section 2 de-
scribes related studies, Section 3 explains the pro-
posed method, and Section 4 presents experimental
results and evaluates the proposed method. The re-
sults are discussed in Section 5. Finally, Section 6
concludes the paper and presents future prospects.
2 RELATED WORK
Several studies have been conducted on saliency pre-
diction in mobile UI, and Gupta et al. proposed a
method to predict saliency for each UI element, fo-
cusing on the fact that designers add, remove, and
edit elements for each interface component in mo-
bile UI design (Gupta et al., 2018). By using not
only UI images but also images at different scales
as inputs to the model, local and global features are
combined to predict saliency. Leiva et al. created a dataset of saliency in mobile UI and, building on the Saliency Attentive Model (SAM), developed a saliency prediction model specific to mobile UI (Leiva et al., 2020). They
also conducted a statistical investigation of saliency in
mobile UI and showed a strong bias toward the upper
left of the screen, text, and images.
Several models have been developed to predict
the visual importance of design, and Bylinskii et al.
proposed a method to predict the visual importance
of graphic design and data visualization using deep
learning (Bylinskii et al., 2017). Fosco et al. pro-
posed the Unified Model of Saliency and Importance
(UMSI), an integrated model that predicts the visual
importance of five design classes (web page, movie
poster, mobile UI, infographics, and advertisement)
and saliency in natural images (Fosco et al., 2020).
UMSI is a deep learning model that automatically
classifies classes of input images before predicting
their visual saliency and importance.
However, there are no prediction methods specific
to the visual importance of mobile UI, and no studies
have yet taken into account mobile UI-specific factors
such as the placement and categories of UI elements.
As mentioned earlier, since visual importance maps
have a strong association with UI element categories
such as buttons, images, and text, we hypothesized
that semantic segmentation images representing UI
element placement and categories could improve vi-
sual importance prediction performance. Therefore,
we propose a method to predict visual importance
maps from mobile UI screenshot images and seman-
tic segmentation images of UI elements using deep
learning.
3 APPROACH
In this study, a visual importance prediction model
was built based on MSI-Net (Kroner et al., 2020), a
natural image saliency prediction model, utilizing se-
mantic segmentation images that represent the cate-
gories and locations of UI elements. Figure 1 shows
the overall diagram of the prediction model.
3.1 Model Architecture
3.1.1 MSI-Net
In the architecture of this model, the UI encoder,
ASPP module, and decoder are based on MSI-Net
(Kroner et al., 2020), a saliency prediction model for
natural images. MSI-Net adopts an encoder-decoder structure and incorporates the Atrous Spatial Pyramid Pooling (ASPP) module (Chen et al., 2018), which applies multiple convolution layers with different dilation rates to extract multi-scale features. The ASPP module can estimate saliency over the entire image, and quantitative and qualitative performance improvements have been reported. The encoder is based on VGG-16 with the stride removed from the two pooling layers in its second half, so the feature maps retain a spatial resolution of 1/8 of the original input size. This reduces the downscaling effect and allows for better feature extraction. Since this modified model has the same number of trainable parameters as VGG-16, it can be initialized with weights pre-trained on ImageNet (Deng et al., 2009). Because the visual importance dataset is small, the model must first be pre-trained on a natural image saliency dataset, and we considered this modification effective for efficient pre-training.
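To make the structure concrete, the following is a minimal PyTorch sketch of an ASPP-style module with parallel dilated convolutions, in the spirit of Chen et al. (2018) and MSI-Net. The dilation rates and channel widths are illustrative assumptions, not the exact MSI-Net configuration.

```python
import torch
import torch.nn as nn

class ASPPModule(nn.Module):
    """ASPP-style block: parallel convolutions with different dilation rates
    capture multi-scale context from the encoder feature map."""
    def __init__(self, in_channels, out_channels=256, rates=(4, 8, 12)):
        super().__init__()
        # One 1x1 branch plus one 3x3 dilated branch per rate (rates are assumed).
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1)]
            + [nn.Conv2d(in_channels, out_channels, kernel_size=3,
                         padding=r, dilation=r) for r in rates]
        )
        # Project the concatenated branch outputs back to a single feature map.
        self.project = nn.Conv2d(out_channels * (len(rates) + 1),
                                 out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [self.relu(branch(x)) for branch in self.branches]
        return self.relu(self.project(torch.cat(feats, dim=1)))
```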
3.1.2 Semantic Segmentation Encoder
In this study, a semantic segmentation encoder that
represents the categories and positions of UI elements
was incorporated into MSI-Net to build the model.
The outputs of the UI encoder and the semantic seg-
mentation encoder were concatenated and used as in-
put to the ASPP module. Semantic segmentation im-
ages contain less detail than UI images and can be
processed at a lower resolution. Therefore, the semantic segmentation encoder uses half the input image size and fewer convolution layers than the UI encoder.
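As a rough illustration, the sketch below assumes a small convolutional stack with two pooling steps; because the segmentation input is half the UI input size, two 2x downsamplings bring its output to the same 1/8-of-UI spatial grid as the UI encoder. The layer count and channel widths are assumptions; the 256 output channels follow Table 1.

```python
import torch
import torch.nn as nn

class SegmentationEncoder(nn.Module):
    """Lightweight encoder for half-resolution semantic segmentation images.
    Layer count and channel widths are illustrative assumptions."""
    def __init__(self, in_channels=3, out_channels=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 1/2 of the (already half-size) input
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 1/4 of the input, i.e. 1/8 of the UI image size
            nn.Conv2d(128, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, seg):
        # e.g. a (B, 3, 80, 120) segmentation image -> (B, 256, 20, 30) features
        return self.features(seg)
```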
Figure 1: Visual importance prediction model.
3.1.3 Feature Concatenation
In MSI-Net, the outputs of encoder layers 10, 14, and 18 are concatenated and used as inputs to the ASPP module so that features from different levels of the convolution layers can be exploited. In our model, the UI encoder output fed into the ASPP module is varied among layer 18 only, layers 14 and 18, and layers 10, 14, and 18 to adjust the ratio of UI features to semantic segmentation features. The dimensions of the features input to the ASPP module are shown in Table 1, and a sketch of the concatenation follows the table. With layer 18 only, the UI features are smaller, so this variant places relatively more emphasis on semantic segmentation. With layers 10, 14, and 18, the UI features are larger, so this variant emphasizes the UI itself. Layers 14 and 18 lie between these two, balancing UI features and semantic segmentation features.
Table 1: Dimensions of the features input to the ASPP module.

Output layers    UI      Segmentation    After concatenation
18               512     256             768
14, 18           1024    256             1280
10, 14, 18       1280    256             1536
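The sketch below illustrates one way this concatenation could be implemented for the Ours(14,18) variant, assuming the UI encoder exposes its intermediate feature maps in a dict; resizing all maps to a common spatial size is an assumption, since the paper only states that the outputs are concatenated.

```python
import torch
import torch.nn.functional as F

def fuse_features(ui_feats, seg_feat, layers=(14, 18)):
    """Concatenate selected UI encoder outputs with the segmentation features.

    ui_feats: dict mapping an encoder layer index to its feature map (B, C, H, W).
    seg_feat: output of the segmentation encoder (B, 256, H', W').
    """
    target_size = ui_feats[max(layers)].shape[-2:]
    parts = [F.interpolate(ui_feats[l], size=target_size, mode="bilinear",
                           align_corners=False) for l in layers]
    parts.append(F.interpolate(seg_feat, size=target_size, mode="bilinear",
                               align_corners=False))
    # For layers (14, 18): 512 + 512 UI channels + 256 segmentation channels = 1280,
    # matching the "After concatenation" column of Table 1.
    return torch.cat(parts, dim=1)
```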
4 EXPERIMENTS AND RESULTS
In the experiment, MSI-Net without semantic seg-
mentation is used as the baseline model and compared
to three proposed methods with different feature di-
mensions.
4.1 Dataset
For pre-training, we used 10,000 natural images with semantic segmentation as the training set and 5,000 as the test set, drawn from SALICON (Jiang et al., 2015) and MS COCO, with the weights learned on ImageNet as initial values. For the subsequent fine-tuning, we used the mobile UI data of Imp1k (Fosco et al., 2020) together with the corresponding semantic segmentation published in Rico (Deka et al., 2017), with 160 images for training and 40 images for testing. The baseline MSI-Net was pre-trained on natural images and fine-tuned on the UI data without semantic segmentation.
Imp1k is a dataset annotated with visual importance in five design classes: web pages, movie posters, mobile UI, infographics, and advertisements. For mobile UI, 200 screenshots were randomly sampled from the Rico dataset and annotated using mouse strokes. The design structure of mobile UI and web pages differs significantly from the other design classes and cannot be generalized by models trained to predict the importance of advertisements and posters (Fosco et al., 2020). Therefore, we used the 200 mobile UI samples in the Imp1k dataset for this study.
SALICON is a large dataset annotated with the
saliency of natural images and is often used to train
saliency prediction models. Since the Imp1k dataset contains only a small amount of mobile UI data, we aimed to improve model performance by pre-training on this saliency dataset.
Rico contains not only UI screenshot images, but
also semantic segmentation related to the meaning
and usage of elements on UI screens (Deka et al.,
2017). Therefore, from the Rico dataset we used only the semantic segmentation images corresponding to the UIs contained in Imp1k for training.
4.2 Experimental Settings
In this study, we used KL divergence as the loss func-
tion. KL divergence is suitable for models that aim
at detecting salient targets because it provides a large
penalty for missed predictions. In fine-tuning, the
batch size was set to 4, the learning rate to 1e-4, and
Adam was used as the optimization function. Regard-
ing the input image size, the natural images are horizontal while the mobile UI images are vertical. Therefore, the natural images were resized to 240 × 160 for pre-training, and for the subsequent fine-tuning the mobile UI images were brought to 240 × 160 by adding a margin beside the screenshot. The same procedure was used for the semantic segmentation inputs, whose size was set to 120 × 80.
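The following is a minimal sketch of this preprocessing and of the KL divergence loss, assuming zero (black) padding for the margin and equal-scale resizing; the exact padding value, normalization, and which dimension is width versus height are not specified in the paper and are assumptions here. The fine-tuning settings (batch size 4, Adam, learning rate 1e-4) follow the description above.

```python
import torch
from PIL import Image, ImageOps

def pad_and_resize(img, target=(240, 160)):
    """Scale a portrait mobile UI screenshot to fit target (width, height)
    and add a margin beside it; zero (black) padding is an assumption."""
    w, h = img.size
    scale = min(target[0] / w, target[1] / h)
    resized = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    pad_w = target[0] - resized.size[0]
    pad_h = target[1] - resized.size[1]
    return ImageOps.expand(resized, border=(pad_w // 2, pad_h // 2,
                                            pad_w - pad_w // 2,
                                            pad_h - pad_h // 2), fill=0)

def kl_divergence_loss(pred, gt, eps=1e-7):
    """KL divergence between the normalized ground-truth and predicted maps,
    used here as the training loss (normalization scheme is an assumption)."""
    pred = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    gt = gt / (gt.sum(dim=(-2, -1), keepdim=True) + eps)
    return (gt * torch.log(eps + gt / (pred + eps))).sum(dim=(-2, -1)).mean()

# Fine-tuning settings from the paper: batch size 4, Adam with lr = 1e-4.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```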
4.3 Evaluation Metrics and Results
Various metrics have been used to evaluate the per-
formance of predictive models for saliency and visual
importance maps. In this study, four indices used in
previous studies of visual importance, R², CC, RMSE, and KL, were used for evaluation (Bylinskii et al., 2017; Fosco et al., 2020). R² is the coefficient of determination, CC the correlation coefficient, RMSE the root mean square error, and KL the Kullback-Leibler divergence.
R² measures the fit between the estimated map and the ground truth map. It is calculated from the variability of the data itself and the discrepancy between the predictions and the ground truth. The best fit is 1, and the closer to 1, the better the performance of the model. Given the ground-truth importance map Q and the predicted importance map P, R² is computed as:

R^2(P, Q) = 1 - \frac{\sum_{i=1}^{N} (Q_i - P_i)^2}{\sum_{i=1}^{N} (Q_i - \bar{Q})^2}    (1)

where \bar{Q} = \frac{1}{N} \sum_{i=1}^{N} Q_i.
CC means the correlation between the estimated
map and the ground truth map. The closer CC is to 1,
the stronger the positive correlation, and the closer to
0, the weaker the correlation. CC is computed as:
CC(P, Q) = \frac{\sum_{i=1}^{N} (P_i - \bar{P})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{N} (P_i - \bar{P})^2} \sqrt{\sum_{i=1}^{N} (Q_i - \bar{Q})^2}}    (2)

where \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i.
RMSE is calculated from the square of the error
between the estimated map and the ground truth map.
The closer to 0, the higher the prediction accuracy.
Because the square is used, the indicator is sensitive
to outliers, with a lower rating if the prediction is far
off. RMSE is computed as:
RMSE(P, Q) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Q_i - P_i)^2}    (3)
The importance map can be interpreted as repre-
senting for each pixel the probability that the pixel is
considered visually important. KL is a measure of the
distance between the predicted distribution and the
ground truth, and represents how closely the proba-
bility distribution P approximates the probability dis-
tribution Q. A better approximation of the two maps
results in a smaller KL, and a KL of 0 indicates that
the maps are identical. KL is computed as:
KL(P, Q) = \sum_{i=1}^{N} \left( Q_i \log Q_i - Q_i \log P_i \right) = L(P, Q) - H(Q)    (4)

where H(Q) = -\sum_{i=1}^{N} Q_i \log Q_i is the entropy of the ground truth importance map and L(P, Q) = -\sum_{i=1}^{N} Q_i \log P_i is the cross entropy of the prediction and the ground truth.
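As an illustration, the following NumPy sketch computes the four metrics for a pair of maps according to Eqs. (1)-(4). Flattening the maps and, for KL, normalizing them to sum to one are assumptions about how the maps are treated as distributions.

```python
import numpy as np

def evaluate_maps(pred, gt, eps=1e-12):
    """Compute R^2, CC, RMSE and KL between a predicted and a ground-truth map."""
    p, q = pred.ravel().astype(float), gt.ravel().astype(float)
    r2 = 1.0 - np.sum((q - p) ** 2) / np.sum((q - q.mean()) ** 2)        # Eq. (1)
    cc = np.sum((p - p.mean()) * (q - q.mean())) / (
        np.sqrt(np.sum((p - p.mean()) ** 2)) *
        np.sqrt(np.sum((q - q.mean()) ** 2)))                             # Eq. (2)
    rmse = np.sqrt(np.mean((q - p) ** 2))                                 # Eq. (3)
    # For KL, treat both maps as probability distributions (sum to one).
    pn, qn = p / (p.sum() + eps), q / (q.sum() + eps)
    kl = np.sum(qn * (np.log(qn + eps) - np.log(pn + eps)))               # Eq. (4)
    return {"R2": r2, "CC": cc, "RMSE": rmse, "KL": kl}
```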
The evaluation results are shown in Table 2. The
numbers in parentheses indicate the UI encoder layers whose outputs are concatenated.
Table 2: Performance of visual importance prediction mod-
els for mobile UI.
                 R²      CC      RMSE     KL
MSI-Net          0.505   0.841   0.102    0.151
Ours(10,14,18)   0.548   0.844   0.0959   0.153
Ours(14,18)      0.639   0.845   0.0883   0.151
Ours(18)         0.631   0.835   0.0923   0.163
Examples of visual importance maps predicted by
the baseline and proposed methods are shown in Fig-
ure 2. Figure 3 shows an example where the proposed
method did not predict well. The yellower a pixel is, the higher the visual importance of that area; the bluer a pixel is, the lower its visual importance.
5 DISCUSSION
Table 2 shows that Ours(14,18) was equal to or bet-
ter than the baseline on all the evaluation indices.
Both Ours(10,14,18) and Ours(18) also outperformed
the baseline on the R² and RMSE metrics, but both were slightly worse than the baseline on the KL metric. Thus, a model that balances image features and
Figure 2: Examples of predicted visual importance maps.
semantic segmentation features is most suitable for
predicting the visual importance of UI images, and
the appropriate use of semantic segmentation element
features improves the prediction performance of the
visual importance map.
To analyze the differences between the mobile UI
and the other images, Table 3 shows the results when
the prediction model was pre-trained for saliency on
natural images. Table 3 shows that Ours(10,14,18)
performed best in terms of saliency for natural im-
ages, while Ours(14,18), which was superior for mo-
bile UI, performed slightly worse. Even for natural
images, the use of semantic segmentation contributes
to performance improvement. However, the features
of the image itself play a more important role in the
saliency of natural images than semantic segmenta-
tion. Figure 2 shows that the ground truth of visual
importance in mobile UI tends to be distributed across
the UI element parts. Comparing Table 2 and Table
3, the difference between MSI-Net and Ours is larger
in Table 2, which represents the mobile UI results.
Therefore, the benefits of semantic segmentation are
particularly large for mobile UI, and our method was
effective.
Figure 2 shows that the predicted map is smooth
for Ours(18), but it fails to capture image shapes such
as rhombuses. In Ours(10,14,18), the image shapes
are captured, but the features of the semantic segmen-
tation elements are too small compared to the fea-
tures of the UI elements, and the results are almost
Figure 3: Example of failure to predict well.
Table 3: Performance of saliency prediction models for nat-
ural images.
                 R²      CC      RMSE     KL
MSI-Net          0.521   0.880   0.113    0.224
Ours(10,14,18)   0.569   0.884   0.107    0.219
Ours(14,18)      0.497   0.881   0.116    0.222
Ours(18)         0.482   0.881   0.117    0.226
the same as in MSI-Net. However, Ours(14,18) is
able to predict smooth importance maps while pre-
serving image features and using semantic segmenta-
tion elements. The results show that using appropriate
semantic segmentation features improves the predic-
tion performance of visual importance maps for mo-
bile UIs.
Figure 3 shows an example where the visual im-
portance map could not be predicted well using se-
mantic segmentation. This UI is tiled with images,
but the ground-truth visual importance map reflects the features of each image rather than the structure of the UI. Since the semantic segmentation images represent only the structure and categories of the UI, we
found that the proposed method does not work well
for UIs with strong image features in the visual im-
portance map. As in this example, it is necessary to
build a model that is more robust to image features in
order to deal with a mobile UI that has a complex UI
structure and more prominent image features. How-
ever, such a model may have low prediction accuracy
for simple mobile UIs.
6 CONCLUSION AND FUTURE
WORK
In this study, we proposed a visual importance predic-
tion method that takes UI elements into account with
the aim of accurately predicting the visual importance
of mobile UI. The evaluation compared the proposed
method with a baseline method that does not use se-
mantic segmentation of UI elements. The visual im-
portance map predicted by Ours(14,18) was smoother
and closer to ground truth than the baseline. We also
adapted the proposed method to natural images to see
if semantic segmentation works differently for mobile
UI and natural images. The use of semantic segmen-
tation was also effective for natural images, but its ef-
fect was weaker than for mobile UI. We found that a
balanced use of semantic segmentation features im-
proves the accuracy of predicting the visual impor-
tance of the mobile UI.
Since changes are made to UI elements such as
buttons and images, rather than to pixels, when devel-
oping UI, future research should examine the visual
importance of each UI element. In addition, since
the experiments in this paper were conducted using
only the UI data included in the imp1k dataset, we
would like to verify whether visual importance can
be predicted in the same way for UIs in different lan-
guages. We would also like to apply the visual im-
portance prediction model proposed in this study to
optimize mobile UI design and to provide feedback
tools for designers. Specifically, we are considering
using our predictive model as an objective function
to optimize the color scheme of buttons and text in
mobile UI using a genetic algorithm. Such optimization would allow developers to easily create a UI that increases the importance of the UI components they want to make stand out.
to perform predictions for novel and unique UIs that
are not included in existing datasets and conduct user
experiments to see if the predicted visual importance
maps are appropriate.
For a better user experience, it is also necessary to
analyze where users direct their attention when they
see a UI and whether they can understand that UI cor-
rectly. There is already related research on how users
understand UI, such as icon annotation in mobile UI
(Zang et al., 2021) and predicting mobile UI tappabil-
ity (Swearngin and Li, 2019). We believe that combining this related work with our visual importance
predictions will provide more useful feedback to de-
signers.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Numbers JP21H03496, JP22K12157.
REFERENCES
Bylinskii, Z., Kim, N. W., O’Donovan, P., Alsheikh, S.,
Madan, S., Pfister, H., Durand, F., Russell, B., and
Hertzmann, A. (2017). Learning visual importance for
graphic designs and data visualizations. Proceedings
of the 30th Annual ACM Symposium on User Interface
Software and Technology, pages 57–69.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2018). Deeplab: Semantic image seg-
mentation with deep convolutional nets, atrous convo-
lution, and fully connected crfs. IEEE Transactions
on Pattern Analysis and Machine Intelligence 40(4),
pages 834–848.
Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan,
D., Li, Y., Nichols, J., and Kumar, R. (2017). Rico:
A mobile app dataset for building data-driven design
applications. Proceedings of the 30th Annual ACM
Symposium on User Interface Software and Technol-
ogy, pages 845–854.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. IEEE conference on computer vision
and pattern recognition, pages 248–255.
Fosco, C., Casser, V., Bedi, A. K., O’Donovan, P., Hertz-
mann, A., and Bylinskii, Z. (2020). Predicting visual
importance across graphic design types. Proceedings
of the 33rd Annual ACM Symposium on User Interface
Software and Technology, pages 249–260.
Gupta, P., Gupta, S., Jayagopal, A., Pal, S., and Sinha,
R. (2018). Saliency prediction for mobile user inter-
faces. 2018 IEEE Winter Conference on Applications
of Computer Vision, pages 1529–1538.
Jiang, M., Huang, S., Duan, J., and Zhao, Q. (2015). Sali-
con: Saliency in context. IEEE conference on com-
puter vision and pattern recognition, pages 1072–
1080.
Kroner, A., Senden, M., Driessens, K., and Goebel, R.
(2020). Contextual encoder–decoder network for vi-
sual saliency prediction. Neural Networks 129, pages
261–270.
Leiva, L. A., Xue, Y., Bansal, A., Tavakoli, H. R., Köroğlu, T., Du, J., Dayama, N. R., and Oulasvirta, A. (2020).
Understanding visual saliency in mobile user inter-
faces. 22nd International Conference on Human-
Computer Interaction with Mobile Devices and Ser-
vices, pages 1–12.
Swearngin, A. and Li, Y. (2019). Modeling mobile inter-
face tappability using crowdsourcing and deep learn-
ing. Proceedings of the 2019 CHI Conference on Hu-
man Factors in Computing Systems, pages 1–11.
Zang, X., Xu, Y., and Chen, J. (2021). Multimodal icon
annotation for mobile applications. Proceedings of
the 23rd International Conference on Mobile Human-
Computer Interaction, pages 1–11.