Visual Perception of Obstacles: Do Humans and Machines Focus on the Same Image Features?
Constantinos A. Kyriakides¹,ᵃ, Marios Thoma¹,²,ᵇ, Zenonas Theodosiou¹,³,ᶜ, Harris Partaourides⁴,ᵈ, Loizos Michael²,¹ and Andreas Lanitis¹,⁵,ᵉ

¹ CYENS Centre of Excellence, Nicosia, Cyprus
² Open University of Cyprus, Nicosia, Cyprus
³ Department of Communication and Internet Studies, Cyprus University of Technology, Limassol, Cyprus
⁴ AI Cyprus Ethical Novelties Ltd, Limassol, Cyprus
⁵ Department of Multimedia and Graphic Arts, Cyprus University of Technology, Limassol, Cyprus

ᵃ https://orcid.org/0009-0008-7185-400X
ᵇ https://orcid.org/0000-0001-7364-5799
ᶜ https://orcid.org/0000-0003-3168-2350
ᵈ https://orcid.org/0000-0002-8555-260X
ᵉ https://orcid.org/0000-0001-6841-8065
Keywords: Deep Learning Algorithms, Explainability, Eye Tracking, Heatmaps, Obstacle Recognition.
Abstract: Contemporary cities are fractured by a growing number of barriers, such as ongoing construction and infrastructure damage, which endanger pedestrian safety. Automated detection and recognition of such barriers from visual data has been of particular concern to the research community in recent years. Deep Learning (DL) algorithms are now the dominant approach in visual data analysis, achieving excellent results in a wide range of applications, including obstacle detection. However, explaining the underlying operations of DL models remains a key challenge in understanding how they arrive at their decisions. The use of heatmaps that highlight the focal points in input images which led a model to its prediction has emerged as a form of post-hoc explainability for such models. In an effort to gain insights into the learning process of DL models, we studied the similarities between heatmaps generated by a number of architectures trained to detect obstacles on sidewalks in images collected via smartphones, and eye-tracking heatmaps generated by humans as they detect the corresponding obstacles in the same data. Our findings indicate that the focus points of humans align more closely with those of a Vision Transformer architecture than with those of the other network architectures examined in our experiments.
1 INTRODUCTION
One of the oldest and most rudimentary forms of mobility throughout human history is traveling on foot. According to sociologist Vincent Kaufmann (Kaufmann et al., 2004), the capacity of individuals to move in space and be mobile is partly moderated by access, which is constrained by the conditions and options available in a given environment. Contemporary cities have been fragmented by a growing number of construction barriers and infrastructure damages that generate several problems, putting pedestrians at risk.
Considering the surge of inhabitants in urban areas in modern times, urban planning is becoming increasingly important and critical for creating a safe and efficient environment that is inclusive of those who do not opt for vehicular means of transportation, such as pedestrians. The development of automated methods for detecting and recognizing people, barriers, and damages in visual data to create safe urban environments has been of particular concern to the research community in recent years. In this study, we compare how humans and machines visually perceive obstacles in urban areas using heatmaps, with a specific focus on enhancing the explainability of Deep Learning (DL) models.

Specifically, our methodology entails the fine-tuning of various DL models using the obstacle detection dataset by (Thoma et al., 2023). We subsequently extract heatmaps from a carefully curated subset of
20 images using Grad-CAM (Selvaraju et al., 2017), representing 10 diverse urban obstacles. Concurrently, we conduct a comprehensive eye-tracking experiment involving 35 university students, tasking participants with identifying specific urban obstacles within the same dataset subset. The resulting 20 heatmaps per machine learning model, together with the human heatmaps aggregated across participants, underwent both quantitative and qualitative analyses. To perform the quantitative comparison between the extracted machine learning model heatmaps and the human heatmaps, we employed a multi-grid methodology that assesses the spatial similarity between pairs of heatmaps.
2 LITERATURE REVIEW
The study by (Szarvas et al., 2005) compared the performance of pedestrian detection systems when employing Convolutional Neural Networks (CNNs) versus Support Vector Machines (SVMs), in search of a method that could alleviate the problem of pedestrian accidents. “GLACCESS” is a smartphone application prototype designed to aid pedestrians with visual impairments by identifying pedestrians in their surroundings via images collected from wearable cameras (Lee et al., 2020). In an effort to mitigate accidents associated with distracted pedestrians, (Wang et al., 2012) used machine learning to create a smartphone application that detects vehicles in close proximity to pedestrians who use their smartphones while walking. Similarly, (Tung and Shin, 2018) devised “BumpAlert”, which exploits auditory data from a walker’s surroundings, captured by their smartphone, to detect nearby objects. However, pedestrian safety is not only endangered by nearby objects, but also by structural damage that may be present on a pedestrian’s path. For example, structural problems associated with footpaths and pavements can result in individuals stumbling and falling (International Transport Forum, 2012). To diminish the effect of such problems, (Maeda et al., 2018) used deep neural networks to detect road damage that compromises pedestrian safety.
The issue of pedestrian safety has recently gained further interest in the literature, with studies proposing novel solutions based on state-of-the-art DL methods (Thoma et al., 2021). To the best of our knowledge, (Theodosiou et al., 2020) created the first dataset consisting of pedestrian obstacle images from wearable cameras and successfully trained a classifier capable of distinguishing between 24 distinct types of pedestrian obstacles and barriers. Wearable cameras have the potential to be an important source of image data that can reduce the risk of pedestrian accidents by providing real-time information to city authorities about the current state of the city’s infrastructure. Especially when combined with intelligent detection systems trained to identify possible threats around a city, authorities can take the necessary actions and measures in due time to protect walking citizens. This highlights the importance of datasets such as the one introduced in (Theodosiou et al., 2020), which can be used for fine-tuning DL models.
The use of automated software capable of identifying barriers on sidewalks can facilitate repairs, especially when coupled with continuous incoming streams of data from the local community. (Thoma et al., 2021) published a proof-of-concept study featuring a smartphone application that notifies community members when barriers are detected in images from wearable cameras. The relevance of wearable technology in promoting the health of the wearer has been highlighted multiple times in the past (Doherty et al., 2013; Studer et al., 2018; Prabu et al., 2022).
Although the current state of DL research has made significant leaps in developing accurate models for various tasks, there is still a need to comprehend how these models arrive at their decisions. DL models are inherently hard to explain, due to the multitude of layers and intermediate computations between the input and output layers, which often leads to these models being called “black boxes”.
2.1 Towards Improving Explainability in Deep Learning
Research in recent years has stressed the importance of promoting explainable AI that can provide some type of explanatory logic behind its inferences, in contrast to merely accepting an algorithm’s results without properly understanding how the algorithm arrives at its predictions (Chinu and Bansal, 2023). In this vein, post-hoc explainability methods have become indispensable tools for analyzing visual models and providing insights into their decision-making processes.

These techniques, ranging from gradient-based methods (Zeiler and Fergus, 2014; Selvaraju et al., 2017) to occlusion analysis (Springenberg et al., 2015), provide a post-hoc understanding of why a trained model made a particular prediction. Effectively, they work by highlighting the significant features and regions that influenced a model’s decision, in an effort to decipher the black-box nature of DL models. However, despite the advancements in post-hoc explainability, a crucial aspect remains unexplored: a direct comparison with human-generated heatmaps.
Figure 1: The 20 images used in the experiments.
A direct comparison between model-generated and human-generated heatmaps holds promise for uncovering valuable insights into the alignment of visual attention. Such investigations may shed light on whether the features emphasized by a model in its heatmaps align with the salient aspects recognized by human observers, offering a comprehensive perspective on the interpretability and reliability of visual models. It is important to note that machine-generated and human-generated heatmaps stem from entirely different processes. Consequently, a direct comparison between the two is not straightforward. Machine-generated heatmaps are created using gradient-based techniques, highlighting the features that were influential for a prediction, whereas human-generated heatmaps are derived from eye-tracking data, which capture visual attention based on the duration and location of gazes on specific areas of the image. Another significant distinction lies in how attention is allocated: machine-generated heatmaps are computed at once over the entire input image, whereas human-generated heatmaps unfold sequentially, reflecting the human observer’s time-dependent focus, which may entail positive, negative and even repeated attention before reaching a conclusion. Acknowledging these methodological distinctions is essential for an accurate comparative analysis between model-generated and human-generated heatmaps.
2.2 Eye Tracking
Eye trackers have gained popularity over the years as tools for investigating human attention and collecting eye-tracking data. The primary objective of gathering eye-tracking data is to capture the temporal dynamics of a participant’s gaze, allowing the identification of regions of interest in the displayed image. A scanpath, representing the trajectory of the eye’s movement, is subsequently superimposed on the image, highlighting the participant’s gaze path. Fixations denote specific areas that capture a participant’s attention, summarizing spatial and temporal information about the attention afforded to specific parts of the image. The accumulation of multiple fixations forms a gaze.

Spatial information regarding a participant’s gaze is derived from inspecting the position of dots on a scanpath, while the temporal length of each fixation is indicated by the size of each dot. Larger dots indicate longer fixation durations, offering insights into the perceived significance of a particular region for the given image interpretation task (Blascheck et al., 2014). Fixations are characterized by computing their aggregated number (“fixation count”), position coordinates, and individual durations. A saccade refers to the rapid eye movement occurring when transitioning from one fixation to another. Eye-tracker data are typically stored in EyeLink Data Format (EDF) files or plain text, encoding events such as fixation points and saccades during stimulus exposure (Wang, 2021).

A popular way of interpreting eye movement data is the generation of attention maps (heatmaps) (Wang, 2021; Blascheck et al., 2014). Visual representations of eye-tracking data facilitate qualitative analyses, revealing the spatial positions where individuals concentrate their focus. This information aids researchers in gaining a deeper understanding of how the participants’ visual attention is distributed across image stimuli (Blascheck et al., 2014). For a comprehensive guide on analyzing eye movement data, readers are encouraged to consult (Wang, 2021).
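As an illustration of how such an attention map can be built from raw fixations, the Python sketch below accumulates a duration-weighted Gaussian at each fixation point and rescales the result to an 8-bit grayscale map. The fixation tuples, image size and Gaussian width are illustrative assumptions rather than values taken from the experiments described later.

```python
# Minimal sketch: turn fixation data into a grayscale attention heatmap.
# Fixations are (x, y, duration_ms) tuples; sigma is an assumed kernel width.
import numpy as np

def fixations_to_heatmap(fixations, width=1080, height=1080, sigma=40):
    """Accumulate duration-weighted Gaussians centred on each fixation."""
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width), dtype=np.float64)
    for x, y, duration in fixations:
        heat += duration * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    # Rescale to 8-bit grayscale, matching the [0, 255] range used later.
    if heat.max() > 0:
        heat = 255 * heat / heat.max()
    return heat.astype(np.uint8)

# Example with three hypothetical fixations (pixel coordinates, duration in ms).
example = [(320, 540, 250), (350, 560, 600), (700, 300, 120)]
heatmap = fixations_to_heatmap(example)
```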
3 METHODOLOGY
In order to investigate the correspondence between machine- and human-generated heatmaps in the context of obstacle detection, we devised a methodology comprising three distinct steps (the code used is available at https://constantinos-k.github.io/visual-perception-of-obstacles/):

1. Fine-tuning various deep learning models utilizing the obstacle detection dataset from (Thoma et al., 2023). Following the model refinement, we extract heatmaps from a carefully curated subset of 20 images using Grad-CAM (Selvaraju et al., 2017), representing 10 diverse urban obstacles.
2. Performing a comprehensive eye-tracking experiment involving 35 university students, using the same 20 images from step 1. Participants were tasked with identifying specific urban obstacles within the aforementioned dataset subset.

3. Employing a multi-grid approach to perform a quantitative comparison between the resulting 20 heatmaps per machine learning model and those generated by human participants, with the heatmaps aggregated across all participants for each image.

Figure 2: The human heatmaps (2nd row) and the machine heatmaps (3rd row and downwards) for the seven DL architectures, for a subset of 10 of the images (1st row) used in the experiments.
For the comprehensive comparison of visual similarities between the heatmaps generated by DL models and those derived from human observations, we employed a 4-step multi-grid approach (see Figure 3). In the initial step, the human-generated heatmaps were resized from a resolution of 1080 × 1080 pixels to 224 × 224 pixels, to ensure uniformity in size with the machine-generated heatmaps. Subsequently, both machine and human heatmaps were segmented into blocks of 16 × 16 pixels, resulting in 196 blocks per heatmap by the end of this step. The average brightness value of each block was computed within the range [0, 255]. The next stage involved calculating the brightness difference between corresponding blocks in the human and machine heatmaps. Finally, the 196 individual brightness differences were summed to derive a total visual difference value, providing a quantifiable measure of the dissimilarity between the pair of heatmaps under comparison.
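A minimal sketch of this comparison is given below, assuming both heatmaps are already available as single-channel arrays in the [0, 255] range; the use of absolute per-block differences and of Pillow for resizing are our own assumptions rather than details stated in the text.

```python
# Hedged sketch of the 4-step multi-grid comparison described above.
import numpy as np
from PIL import Image

def multigrid_difference(human_map, machine_map, size=224, block=16):
    """Sum of per-block mean-brightness differences between two heatmaps."""
    # Step 1: resize the human heatmap (1080x1080, uint8) to match the machine one.
    human = np.asarray(
        Image.fromarray(human_map).resize((size, size)), dtype=np.float64)
    machine = np.asarray(machine_map, dtype=np.float64)

    # Step 2: split both maps into 16x16 blocks (14 x 14 = 196 blocks) and
    # compute the mean brightness of each block.
    n = size // block
    human_blocks = human.reshape(n, block, n, block).mean(axis=(1, 3))
    machine_blocks = machine.reshape(n, block, n, block).mean(axis=(1, 3))

    # Steps 3-4: per-block brightness difference, summed into one score.
    return float(np.abs(human_blocks - machine_blocks).sum())
```

A lower score then indicates a closer spatial match between the two attention patterns.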
The devised methodology allows for a detailed
exploration of the similarities in attentional patterns
between machine and human observers in the context
of urban obstacle detection.
Figure 3: The steps followed for comparing the visual similarities between machine and human heatmaps.
4 EXPERIMENTAL RESULTS
In this section, we present the outcomes of our experiments, focusing on the extraction and comparison of machine- and human-generated heatmaps in the domain of urban obstacle detection. In this context, we used transfer learning to fine-tune pre-trained machine learning models for the task of obstacle detection. Specifically, we used images depicting obstacles that affect the safety of pedestrians on city sidewalks (Figure 1). The images are assigned to the following 10 categories: two-wheeled vehicle, four-wheeled vehicle, bench, crowded sidewalk, hole, parking meter, parking prevention barrier, broken pavement, traffic cone, and tree.
To cover a broad spectrum of DL architectures, this study employs a diverse set of models, specifically VGG19 (Simonyan and Zisserman, 2014), ResNet18 and ResNet50 (He et al., 2015), MobileNetV2 (Sandler et al., 2018), EfficientNet-B0 (Tan and Le, 2020), Swin Transformer (Swin-B) (Liu et al., 2021), and ViT-B/16 (Dosovitskiy et al., 2021). Subsequent to the training process, we employ the Grad-CAM algorithm to extract machine-generated heatmaps from a subset of 20 images.
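As a rough sketch of what such transfer learning typically involves (not the authors’ actual training code; the backbone choice, pretrained weights, optimizer and learning rate below are illustrative assumptions), one of the listed CNN backbones could be fine-tuned for the 10 obstacle categories as follows:

```python
# Hedged transfer-learning sketch: fine-tune a pretrained ResNet50 on the
# 10 obstacle categories. Data loading and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # the ten obstacle categories listed above

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_one_epoch(loader, device="cuda"):
    """One pass over a DataLoader yielding (224x224 image batch, label batch)."""
    model.to(device).train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()
```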
4.1 Machine-Generated Heatmaps
Grad-CAM is a DL visualization tool that produces a heatmap identifying which parts of an image contribute most to the output of a model (Selvaraju et al., 2017). At its core, Grad-CAM taps into the gradient information flowing through the layers of the model. By capturing the gradients of the sought class with respect to the desired layer, Grad-CAM assigns importance scores to different spatial locations. These importance scores are then used to generate a weighted combination of the feature maps, creating a heatmap that illustrates the regions where the neural network focused during its decision-making process. In our experiments, we employed the last layer before the output of each model, which generates coherent heatmaps. Due to the model-agnostic nature of Grad-CAM, it can be easily applied to a broad spectrum of DL architectures.
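For a convolutional backbone, the core computation can be sketched in plain PyTorch with forward and backward hooks on the chosen feature layer. The sketch below is our own hedged illustration, not the authors’ implementation; the hooked layer is assumed to output a 4D feature map, and transformer backbones would additionally require reshaping their token sequence into a spatial grid.

```python
# Hedged Grad-CAM sketch using hooks on the last feature layer of a CNN.
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx, target_layer):
    """Return a 224x224 uint8 heatmap for the logit of the given class."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    model.eval()
    score = model(image.unsqueeze(0))[0, class_idx]  # logit of the correct label
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    # Channel weights = gradients averaged over the spatial dimensions.
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return (255 * cam[0, 0]).detach().cpu().numpy().astype("uint8")

# Hypothetical usage for a ResNet-style model:
# heatmap = grad_cam(model, img_tensor, true_label, model.layer4)
```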
In our methodology, heatmaps were generated by providing Grad-CAM with the correct classification label for each image, ensuring a systematic and consistent approach across all seven models. To maintain uniformity, all input images featured a resolution of 224 × 224 pixels. The resultant heatmaps were saved for subsequent comparisons with heatmaps generated from human observations. For visualization purposes, the generated heatmaps were superimposed onto the original input images, visually highlighting the identified regions that contributed to the output predictions. The combined images can be viewed in Figure 2.
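One plausible way to produce such overlays (the colormap choice and blending weight are assumptions; the paper does not state its exact visualization settings) is to colour-map the grayscale heatmap and alpha-blend it over the input image:

```python
# Hedged visualization sketch: colour-map an 8-bit heatmap and blend it
# over the matching 224x224 input image (OpenCV, BGR channel order).
import cv2

def overlay(image_bgr, heatmap_gray, alpha=0.4):
    """Blend a grayscale heatmap over an image of the same spatial size."""
    colored = cv2.applyColorMap(heatmap_gray, cv2.COLORMAP_JET)
    return cv2.addWeighted(colored, alpha, image_bgr, 1 - alpha, 0)
```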
4.2 Human-Generated Heatmaps
Human heatmaps were acquired through eye-tracking experiments conducted in a well-equipped laboratory using a Tobii Nano eye tracker. Thirty-five participants, aged between 21 and 24 years with no reported vision impairments, took part in the experiments. Each participant was presented with the set of 20 images, and their task was to detect the corresponding obstacle in each image. The experimental process involved displaying the images on the screen in sequence, with participants progressing to the next image upon identifying the obstacle. Prior to each experiment, individualized calibration procedures were conducted to ensure accurate eye-tracking data. All experiments were conducted under daylight conditions, with each session lasting approximately 15 minutes.
Figure 4: Quantitative differences between the human eye-tracking heatmaps and the corresponding machine heatmaps for each of the 7 vision models. Each model’s performance is averaged over the two distinct images per obstacle type. The models are sorted from left to right, starting with the model that, on average, deviates least from the human heatmaps (ViT-B/16) and ending with the one with the highest deviation (Swin-B), as depicted by the black dashed trendline. The error bars represent 95% confidence intervals, calculated using bootstrapping to estimate the variability and uncertainty inherent in the sample means.
Figure 5: Detailed comparison of the human vs machine heatmaps for two of the obstacle images, showing the similarity of the ViT-B/16 heatmaps to their human counterparts.
Following the experiments, heatmaps were exported for each image with dimensions of 1080 × 1080 pixels. The information within the heatmaps is represented using shades of gray within the range [0, 255]. The resulting eye-tracking heatmaps for 10 of the images are shown in the second row of Figure 2, superimposed on the original images and using the same colorscale as the vision-model heatmaps.
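The per-image human heatmaps used in the comparison are aggregated across all 35 participants (see Section 3). If this aggregation is performed offline from per-participant exports rather than inside the eye-tracking software, a simple averaging scheme such as the following might be used; the file paths and the choice of plain averaging are our own assumptions.

```python
# Hedged sketch: average per-participant grayscale heatmaps for one image
# into a single aggregate map, rescaled back to the [0, 255] range.
import numpy as np
from PIL import Image

def aggregate_heatmaps(paths):
    """Mean of per-participant 8-bit grayscale heatmaps (same size each)."""
    stack = np.stack([np.asarray(Image.open(p).convert("L"), dtype=np.float64)
                      for p in paths])
    mean_map = stack.mean(axis=0)
    return (255 * mean_map / mean_map.max()).astype(np.uint8)
```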
4.3 Heatmap Comparison
After performing the multi-grid approach, the obtained visual dissimilarity values provide a comprehensive understanding of the convergence and divergence in attentional patterns between the DL models and human observers. This quantitative analysis contributes valuable insights into the explainability and alignment of machine-generated heatmaps with human visual attention, shedding light on the efficacy of these models in the specific task of urban obstacle detection.
Our comparison of the seven machine learning models revealed considerable variability in the extent to which the generated heatmaps resembled the averaged human heatmaps. As shown in Figure 4, ViT-B/16 scored the lowest numerical differences on average across all images among the examined models, indicating the closest resemblance to the human heatmaps (representative examples of the similarity between the ViT-B/16 and human heatmaps can be seen in Figure 5). In contrast, the highest numerical differences were identified for the EfficientNet-B0 and Swin-B models. This outcome is arguably noteworthy, especially considering that the Swin Transformer incorporates an attention mechanism whose conception was inspired by human attention, yet it did not align closely with the human heatmaps.
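The error bars in Figure 4 are 95% confidence intervals obtained by bootstrapping. A minimal percentile-bootstrap sketch over one model’s per-image difference scores is shown below; the number of resamples and the percentile method are our own assumptions, as the paper does not specify them.

```python
# Hedged sketch: percentile-bootstrap 95% CI for the mean difference score.
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Return (sample mean, (lower, upper)) for the bootstrapped CI."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lower, upper = np.percentile(means, [(1 - level) / 2 * 100,
                                         (1 + level) / 2 * 100])
    return scores.mean(), (lower, upper)
```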
The superior performance of ViT-B/16, which also employs an attention mechanism, raises intriguing questions about the specific elements of the ViT architecture that contribute to its closer correlation with human heatmaps and broader patterns of human visual perception. These findings suggest that models with smaller differences may more accurately resemble human perception, although this inference necessitates further research. Such explorations could enhance our
understanding of the interplay between DL algorithms and human processing styles, and potentially drive the development of future DL algorithms that better mirror human visual processing.
5 CONCLUSIONS
In conclusion, our study focused on the extraction and comparison of machine-generated and human-generated heatmaps in the context of urban obstacle detection. The experiments utilized a diverse set of DL models fine-tuned on images depicting various obstacles encountered on pavements that affect pedestrian safety. We employed the Grad-CAM algorithm to extract machine-generated heatmaps, visualizing the features learned by the models during obstacle detection. These heatmaps were systematically compared with human-generated heatmaps obtained through eye-tracking experiments involving 35 participants. The visual dissimilarity values provided insights into the alignment of machine-generated heatmaps with human visual attention. ViT-B/16 demonstrated the closest resemblance to human heatmaps, and its superior performance prompts further investigation into the specific architectural elements contributing to its alignment with human perception.
By pulling back the veil on how these models attribute significance within images, we can better understand and trust their outputs. If machine learning models are designed to more closely resemble human perception, their decision-making processes may become inherently more understandable, sharing common ground with recognized human cognitive patterns. Such an approach could not only improve the interpretability of individual models, but also contribute to a broader understanding of how to design models that are both accurate and explainable, which is a significant goal in the field of artificial intelligence. Similarly, for image interpretation tasks where humans display increased accuracy, the use of network architectures that resemble human perception could lead to more accurate results, whereas for tasks where human performance is inferior, such architectures should be avoided. These findings pave the way for the development of more explainable and more accurate models.
This paper presents the preliminary results of our work that lay the foundations for further investigation. Our future research plans include extracting and comparing the heatmaps of additional DL architectures, as well as investigating how the extracted results can be used to improve the accuracy and explainability of the generated models.
ACKNOWLEDGEMENTS
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 739578, complemented by the Government of the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development.
REFERENCES
Blascheck, T., Kurzhals, K., Raschke, M., Burch, M., Weiskopf, D., and Ertl, T. (2014). State-of-the-Art of Visualization for Eye Tracking Data. In EuroVis - STARs, page 29.

Chinu and Bansal, U. (2023). Explainable AI: To Reveal the Logic of Black-Box Models. New Generation Computing.

Doherty, A. R., Hodges, S. E., King, A. C., Smeaton, A. F., Berry, E., Moulin, C. J. A., Lindley, S., Kelly, P., and Foster, C. (2013). Wearable Cameras in Health: The State of the Art and Future Possibilities. American Journal of Preventive Medicine, 44(3):320–323.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR 2021).

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition.

International Transport Forum (2012). Pedestrian Safety, Urban Space and Health. ITF Research Reports. OECD.

Kaufmann, V., Bergman, M. M., and Joye, D. (2004). Motility: Mobility as Capital. International Journal of Urban and Regional Research, 28(4):745–756.

Lee, K., Sato, D., Asakawa, S., Kacorri, H., and Asakawa, C. (2020). Pedestrian Detection with Wearable Cameras for the Blind: A Two-way Perspective. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, pages 1–12. Association for Computing Machinery.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Maeda, H., Sekimoto, Y., Seto, T., Kashiyama, T., and Omata, H. (2018). Road Damage Detection and Classification Using Deep Neural Networks with Smartphone Images. Computer-Aided Civil and Infrastructure Engineering, 33(12):1127–1141.

Prabu, A., Shen, D., Tian, R., Chien, S., Li, L., Chen, Y., and Sherony, R. (2022). A Wearable Data Collection System for Studying Micro-Level E-Scooter Behavior in Naturalistic Road Environment.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626.

Simonyan, K. and Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net.

Studer, L., Paglino, V., Gandini, P., Stelitano, A., Triboli, U., Gallo, F., and Andreoni, G. (2018). Analysis of the Relationship between Road Accidents and Psychophysical State of Drivers through Wearable Devices. Applied Sciences, 8(8):1230.

Szarvas, M., Yoshizawa, A., Yamamoto, M., and Ogata, J. (2005). Pedestrian Detection with Convolutional Neural Networks. In IEEE Intelligent Vehicles Symposium, 2005, pages 224–229.

Tan, M. and Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.

Theodosiou, Z., Partaourides, H., Atun, T., Panayi, S., and Lanitis, A. (2020). A First-person Database for Detecting Barriers for Pedestrians. In VISIGRAPP 2020 - Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, volume 5, pages 660–666.

Thoma, M., Partaourides, H., Sreedharan, I., Theodosiou, Z., Michael, L., and Lanitis, A. (2023). Performance Assessment of Fine-Tuned Barrier Recognition Models in Varying Conditions. In Tsapatsoulis, N., Lanitis, A., Pattichis, M., Pattichis, C., Kyrkou, C., Kyriacou, E., Theodosiou, Z., and Panayides, A., editors, Computer Analysis of Images and Patterns, Lecture Notes in Computer Science, pages 172–181. Springer Nature Switzerland.

Thoma, M., Theodosiou, Z., Partaourides, H., Tylliros, C., Antoniades, D., and Lanitis, A. (2021). A Smartphone Application Designed to Detect Obstacles for Pedestrians’ Safety. In Paiva, S., Lopes, S. I., Zitouni, R., Gupta, N., Lopes, S. F., and Yonezawa, T., editors, Science and Technologies for Smart Cities, Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 358–371. Springer International Publishing.

Tung, Y.-C. and Shin, K. G. (2018). Use of Phone Sensors to Enhance Distracted Pedestrians’ Safety. IEEE Transactions on Mobile Computing, 17(6):1469–1482.

Wang, T., Cardone, G., Corradi, A., Torresani, L., and Campbell, A. T. (2012). Walksafe: A Pedestrian Safety App for Mobile Phone Users Who Walk and Talk While Crossing Roads. In Proceedings of the Twelfth Workshop on Mobile Computing Systems & Applications, HotMobile ’12, pages 1–6. Association for Computing Machinery.

Wang, Z. (2021). Eye Movement Data Analysis and Visualization. In Wang, Z., editor, Eye-Tracking with Python and Pylink, pages 197–224. Springer International Publishing.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision - ECCV 2014, Lecture Notes in Computer Science, pages 818–833. Springer International Publishing.