Pedestrian’s Gaze Object Detection in Traffic Scene

Hiroto Murakami¹, Jialei Chen¹, Daisuke Deguchi¹, Takatsugu Hirayama²,¹, Yasutomo Kawanishi³,¹ and Hiroshi Murase¹

¹ Graduate School of Informatics, Nagoya University, Nagoya, Japan
² Faculty of Environmental Science, University of Human Environments, Okazaki, Japan
³ Multimodal Data Recognition Research Team, Guardian Robot Project, Riken, Kyoto, Japan
Keywords: Pedestrian’s Gaze Object Detection, Object Detection, Gaze Estimation, Traffic Scene, Dataset.
Abstract:
In this paper, we present a new task, called PEdestrian’s Gaze Object (PEGO) detection, of detecting the object that a target pedestrian is gazing at in a traffic scene. We argue that detecting the gaze object can provide important information for pedestrian behavior prediction and can contribute to the realization of automated vehicles. For this task, we construct a dataset of in-vehicle camera images annotated with the objects that pedestrians are gazing at. We also propose a Transformer-based method called the PEGO Transformer to solve the PEGO detection task. The PEGO Transformer performs gaze object detection directly from whole-body features, without the high-resolution head image or the gaze heatmap that traditional methods rely on. Experimental results showed that the proposed method can estimate a pedestrian’s gaze object accurately even when various objects exist in the scene.
1 INTRODUCTION
Detection of PEdestrian’s Gaze Object (PEGO) aims
to detect an object that a pedestrian is gazing at in a
traffic scene. This is an important task for comput-
ers to predict the future behavior of a pedestrian in
a traffic scene. For example, as shown in Fig. 1, a
pedestrian gazes at an oncoming car and will proba-
bly wait until the car passes without jumping out into
the roadway. Thus, the detection result of the gaze
object is an important clue that reveals what behavior
the person intends to take in the future.
For the gaze detection task, several datasets have
been released. Recasens et al. (Recasens et al., 2015)
have released the GazeFollow dataset for gaze detection in everyday scenes. This pioneering work demonstrates
the importance of gaze detection tasks in person be-
havior prediction. Tomas et al. (Tomas et al., 2021)
have released a Gaze On Objects (GOO) dataset that
aims to find products that customers are gazing at in
the retail store scene.
Several gaze detection models have been devel-
oped using these datasets. Recasens et al. (Recasens
et al., 2015) have proposed a model to detect line
of sight using the GazeFollow dataset. Wang et al. (Wang et al., 2022) used the GOO dataset and proposed GaTector, which detects the objects a person is gazing at.
Figure 1: A pedestrian is gazing at an oncoming vehicle.
In a traffic scene, pedestrian’s gaze detection is
also essential because it contributes to determining
automated driving behavior and implementing tech-
nologies that alert drivers. Belkada et al. (Belkada
et al., 2021) and Hata et al. (Hata et al., 2022) have
proposed a method for detecting “eye contact”, which
indicates whether a pedestrian is gazing at the in-
vehicle camera. However, these methods cannot recognize which object a pedestrian is gazing at when the pedestrian is not gazing at the in-vehicle camera.
A dataset plays an important role in achieving
pedestrian’s gaze object detection in traffic scenes.
Various datasets have been released for traffic-scene
understanding (Caesar et al., 2020; Sun et al., 2020;
Rasouli et al., 2019; Cordts et al., 2016; Geiger et al.,
2013). However, to the best of our knowledge, there
is no dataset consisting of annotations on pedestrian’s
gaze objects in traffic scenes. Since existing datasets and methods focus on daily-life and retail-store scenes, they belong to different domains and cannot be used directly for PEGO detection in traffic scenes.
Therefore, this paper tackles the task of detecting the pedestrian’s gaze object by constructing a new dataset and proposing a new method. In this dataset, we manually
annotate each pedestrian in an in-vehicle camera im-
age with the pedestrian’s gaze object. In addition, we
propose a method termed PEGO Transformer for de-
tecting the pedestrian’s gaze object using this dataset.
The PEGO Transformer consists of four modules: a
backbone to extract features from the input images, a
Deformable Transformer to capture the features cor-
responding to objects, a Projection Layer that uses the refined features to produce the final prediction, and a Label Generator that generates the label index used to train the model from the dataset.
Contributions of this paper are as follows.
1. This paper proposes a novel PEGO Transformer that can estimate the gaze object of each pedestrian in a traffic scene. The proposed method is capable of estimation even without the high-resolution head images or the gaze heatmap required by conventional methods. The PEGO Transformer is trained to capture the relationship between detected objects and pedestrians so that the likelihood of the gaze object for each pedestrian becomes high.
2. This paper proposes a novel task of PEGO detection that estimates the gaze object of each pedestrian in a traffic scene. For this task, we construct a new dataset by extending widely used traffic scene datasets.
2 RELATED WORK
2.1 Human’s Gaze Object Detection
Recasens et al. (Recasens et al., 2015) constructed the GazeFollow dataset and proposed a method to estimate human gaze. The aim of
their dataset is to estimate the direction of the gaze,
whereas our study aims at detecting the gaze object.
Wang et al. (Wang et al., 2022) and Tu et al. (Tu
et al., 2022; Tu et al., 2023) proposed methods for de-
tecting the gaze object of a target human at the object
level. Wang et al. proposed GaTector, which esti-
mates the products customers are gazing at in a store
scene. The gaze heatmap is estimated using a high-
resolution head image of the target person, and ob-
jects that overlap with the estimated gaze heatmap are
considered the gaze objects. Human-Gaze-Target De-
tection with Transformer (HGTTR) (Tu et al., 2022)
and Gaze following detection Transformer (GTR) (Tu
et al., 2023) proposed by Tu et al. detect human gaze
object in more general scenes. As with GaTector, Tu’s
methods are processed in the head detection branch
and the gaze heatmap detection branch, after which
the gaze object is estimated. However, these methods
rely on high-resolution head images. Most pedestrians captured by in-vehicle cameras appear smaller than those in other scenes because of their distance from the camera and the lower image resolution, which makes it difficult to extract the head features these methods require for estimating the region of attention. In addition, since the gaze target is selected based on the estimated gaze heatmap, the selection accuracy is highly dependent on the performance of the gaze heatmap estimator. In contrast, our method can estimate the gaze object without a high-resolution head image by using the pedestrian’s whole-body features. It also estimates the gaze object directly without using a gaze heatmap, which is a performance bottleneck of the previous methods.
2.2 Pedestrian’s Gaze Target Detection
Belkada et al. (Belkada et al., 2021) and Hata et
al. (Hata et al., 2022) worked on pedestrian eye-
contact detection. They use skeletal information to
detect whether a pedestrian is gazing at the in-vehicle
camera because pedestrians captured by in-vehicle
cameras are often small and blurred, and thus exist-
ing eye gaze detection methods cannot be applied di-
rectly. They also constructed new datasets that can
handle the eye contact detection task by extending an
existing traffic scene dataset. The task addressed in
our study is similar to that addressed by Hata et al. in
terms of focusing on pedestrians captured by an in-
vehicle camera. However, their methods cannot recognize the pedestrian’s gaze object when there is no eye contact.
2.3 Dataset Containing Pedestrians
Datasets recorded in real traffic scenes are beneficial
for automated driving tasks. Caesar et al. (Caesar
et al., 2020) have released nuScenes and nuImages
annotated with bounding boxes and object class la-
bels for object detection. In these datasets, bounding
Figure 2: Overview of the PEGO Transformer.
boxes and 23 classes of object labels are annotated
for vehicles, bicycles, and pedestrians captured by in-
vehicle cameras. However, the state of each pedes-
trian, such as the gaze direction and the gaze object,
is not annotated.
Sun et al. (Sun et al., 2020) have released the
Waymo Open Dataset (Waymo), which is annotated
with object bounding boxes and class labels in the
same way as nuScenes and nuImages. However, the state of each pedestrian is not annotated either.
On the other hand, the Pedestrian Intention Esti-
mation dataset (PIE dataset) (Rasouli et al., 2019) is
the dataset constructed for pedestrians’ behavior pre-
diction. In this dataset, 1,842 pedestrians captured
by in-vehicle cameras are annotated with information
such as ID, bounding box, whether they are likely to
cross the road, and whether they are gazing in the
camera direction. However, the pedestrian states nec-
essary for the PEGO detection task, such as the gaze
direction and object, are not annotated.
3 PEDESTRIAN’S GAZE OBJECT
DETECTION
In this section, we propose the PEGO Transformer,
which detects a pedestrian’s gaze object in the image.
Unlike conventional gaze object detection methods,
our method does not rely on a high-resolution head
image, but uses features from a full-body image for
PEGO estimation. Also, instead of relying on a gaze
heatmap, the PEGO Transformer is trained to capture
the relationship between detected objects and pedes-
trians so that the likelihood of the gaze object for each
pedestrian becomes high. The architecture of the
PEGO Transformer is shown in Fig. 2. The architecture consists of four modules: a backbone (CNN) to extract features from the input image, a Deformable Transformer (Zhu et al., 2021) to refine the features from the backbone (transformer encoder and decoder), a Projection Layer that uses the refined features to produce the final prediction, and a Label Generator that generates the label index used to train the model from the dataset. We introduce each of these modules in the following sections.
Figure 3: Object queries.
3.1 Architecture of PEGO Transformer
Backbone. The backbone, consisting of a CNN, aims
to produce features with high-level semantics for in-
put to the Deformable Transformer. Given an input image x ∈ R^{C×H×W}, x is first fed into the CNN backbone (e.g., ResNet (He et al., 2016)) to extract features with high-level semantics.
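A minimal sketch of this backbone stage, assuming torchvision's ResNet-50 truncated before its classification head and a subsequent flattening into a token sequence; the channel width, input size, and truncation point are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Extract a spatial feature map from an input image (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        cnn = resnet50(weights=None)
        # Keep everything up to the last residual stage; drop avgpool and fc.
        self.body = nn.Sequential(*list(cnn.children())[:-2])

    def forward(self, x):
        feat = self.body(x)                      # (B, 2048, H/32, W/32)
        # Flatten spatial positions into a token sequence for the transformer encoder.
        return feat.flatten(2).permute(0, 2, 1)  # (B, H/32 * W/32, 2048)

if __name__ == "__main__":
    img = torch.randn(1, 3, 512, 512)            # an input image x of shape C x H x W
    tokens = Backbone()(img)
    print(tokens.shape)                          # torch.Size([1, 256, 2048])
```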
Deformable Transformer. The Deformable Transformer, consisting of a Deformable Transformer Encoder and a Deformable Transformer Decoder (Zhu et al., 2021), aims to produce features that correspond to each object. The features from the backbone are flattened and combined with positional encoding before being fed into the deformable transformer encoder. In the encoder, which benefits from the deformable self-attention module, the features interact with each other to enhance the output.
As shown in Fig. 3, the Deformable Transformer Decoder takes the output of the feature extractor as input and associates the features corresponding to each object with an object query o ∈ R^C (Carion et al., 2020) using the following procedure. First, the object queries O = {o_1, o_2, ..., o_N} are initialized with random values and input to the deformable transformer decoder. The input queries O are associated with the output of the Transformer Encoder, i.e., the features corresponding to each object. Then, the deformable self-attention module captures the relationships among the queries in O. As a result, each o ∈ O becomes a feature that corresponds one-to-one with an object.
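The sketch below illustrates how N learned object queries attend to the encoder features so that each query ends up representing one object. It substitutes PyTorch's standard multi-head attention for the deformable attention of (Zhu et al., 2021), so it demonstrates only the query mechanism, not the actual Deformable DETR layers; the dimensions and query count are assumptions.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """N learned object queries cross-attend to encoder tokens.
    Standard attention is used here as a stand-in for deformable attention."""
    def __init__(self, num_queries=40, dim=256, heads=8, layers=6):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)   # O = {o_1, ..., o_N}
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, memory):
        # memory: (B, h*w, dim) tokens from the transformer encoder
        b = memory.size(0)
        o = self.queries.weight.unsqueeze(0).expand(b, -1, -1)   # (B, N, dim)
        # Self-attention among queries plus cross-attention to the image tokens.
        return self.decoder(tgt=o, memory=memory)                # (B, N, dim): one feature per object

if __name__ == "__main__":
    memory = torch.randn(2, 256, 256)
    print(QueryDecoder()(memory).shape)   # torch.Size([2, 40, 256])
```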
Projection Layer. The Projection Layer (Vaswani
et al., 2017) consists of a Transformer Encoder and
an MLP layer. The Transformer Encoder captures the
gazing correspondences between pedestrians and ob-
jects with O as input. The MLP aims to produce the
confidence scores, each indicating whether the corresponding object is being gazed at.
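A minimal sketch of the Projection Layer as described: a Transformer encoder over the object queries followed by an MLP that maps each query to a scalar gaze score. The hidden sizes, and the assumption that the target pedestrian's query has already been sorted to the front, are illustrative.

```python
import torch
import torch.nn as nn

class ProjectionLayer(nn.Module):
    """Transformer encoder over object queries + MLP producing one gaze score per object."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, queries):
        # queries: (B, N, dim); the target pedestrian's query is assumed to be at index 0.
        h = self.encoder(queries)        # relate the pedestrian query to candidate objects
        return self.mlp(h).squeeze(-1)   # (B, N): likelihood of each object being gazed at

if __name__ == "__main__":
    scores = ProjectionLayer()(torch.randn(2, 40, 256))
    print(scores.shape)                  # torch.Size([2, 40])
```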
Label Generator. The Label Generator outputs the label index of the pedestrian’s gaze object for training the model. When a pedestrian is gazing at the object corresponding to o_y, the output label index is y.
To generate the label index, we first estimate the bounding box and class probability of the object from each o ∈ O. Next, we select each o_m that has the highest probability for the pedestrian class and whose probability exceeds the threshold δ. Then, we store the pedestrian indices m in M = {m_1, m_2, . . .}. Finally, the gaze object of the pedestrian corresponding to each m ∈ M is obtained from the dataset, and the index y such that o_y corresponds to the gaze object is output. The pedestrian query indices m ∈ M obtained from the output of the Label Generator are also used in the Projection Layer to sort the target pedestrian object queries to the top.
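A sketch of the Label Generator logic under stated assumptions: per-query class probabilities and boxes are given, δ = 0.3 as in Section 5.1, and the annotated gaze object is matched to a query by highest box IoU. The IoU matching criterion and the pedestrian class index are assumptions, since the paper does not specify them.

```python
import torch
from torchvision.ops import box_iou

def generate_labels(class_probs, boxes, gt_gaze_boxes, ped_class=0, delta=0.3):
    """class_probs: (N, C) class scores per query; boxes: (N, 4) xyxy boxes per query;
    gt_gaze_boxes: dict {pedestrian query index m: (4,) annotated gaze-object box}.
    ped_class: index of the pedestrian class (assumed 0 here)."""
    top_prob, top_cls = class_probs.max(dim=1)
    # M: queries confidently detected as pedestrians.
    M = torch.nonzero((top_cls == ped_class) & (top_prob > delta)).flatten().tolist()
    labels = {}
    for m in M:
        gt_box = gt_gaze_boxes.get(m)
        if gt_box is None:
            continue
        # y: the query whose predicted box best matches the annotated gaze object (IoU assumption).
        ious = box_iou(boxes, gt_box.unsqueeze(0)).squeeze(1)   # (N,)
        labels[m] = int(ious.argmax())
    return M, labels   # pedestrian query indices and their gaze-object label indices y

if __name__ == "__main__":
    logits = torch.randn(40, 24)
    logits[0, 0] = 10.0                           # make query 0 a confident pedestrian
    probs = torch.softmax(logits, dim=1)
    boxes = torch.rand(40, 4) * 100
    boxes[:, 2:] += boxes[:, :2]                  # make valid xyxy boxes
    M, labels = generate_labels(probs, boxes, {0: boxes[5]})
    print(M, labels)
```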
3.2 Loss Function
Cross-entropy is used as the loss function. The cross-entropy loss is calculated between the softmax over the per-object gaze likelihoods produced by the Projection Layer and the gaze-object label output by the Label Generator.
3.3 Inference of Pedestrian’s Gaze
Object
During inference, the in-vehicle camera image is first input to the PEGO Transformer to obtain the gaze likelihood of each object. The object with the highest likelihood is then selected as the gaze object.
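At inference time this selection reduces to an argmax (or top-k selection) over the per-object gaze scores, as sketched below with illustrative inputs.

```python
import torch

def predict_gaze_object(gaze_scores, boxes, k=1):
    """gaze_scores: (N,) per-object gaze likelihoods; boxes: (N, 4) detected boxes.
    Returns the boxes of the top-k most likely gaze objects (k=1 is the final prediction)."""
    topk = torch.topk(gaze_scores, k=k).indices
    return boxes[topk]

if __name__ == "__main__":
    scores, boxes = torch.rand(40), torch.rand(40, 4)
    print(predict_gaze_object(scores, boxes))
```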
Table 1: Number of pedestrians and images in the con-
structed dataset.
Source Dataset Pedestrians Images
nuScenes 292 218
nuImages 1,240 870
Waymo 1,193 672
Total 2,725 1,760
4 PEDESTRIAN’S GAZE OBJECT
DATASET
To verify the performance of the PEGO Transformer,
we construct a PEGO Dataset annotated with the
pedestrian’s gaze object. In contrast to existing stud-
ies, we annotate pedestrians’ gaze points in each im-
age. Our dataset contains annotations of the target
pedestrian’s ID, bounding box coordinates, gaze point
coordinates, and pedestrian status. If the gaze object
cannot be identified, we annotate the point at which a
pedestrian is gazing. In addition, when it is difficult to
judge the point being gazed at, such as in the case of
eye contact or backward facing, we record these situa-
tions as an additional annotation in the dataset. Three
annotators annotated the same image to maintain the
quality of the annotation. Details of the dataset are
described in the following sections.
4.1 Image Details
This dataset was constructed based on the existing
datasets: nuScenes, nuImages and Waymo (Caesar
et al., 2020; Sun et al., 2020). These are large open
datasets containing images captured by in-vehicle
cameras and are annotated with the object’s bounding
box and its class label, as described in Section 2.3.
In our dataset, only images satisfying the follow-
ing conditions were collected:
1. The overlap between a pedestrian and other object
is less than 25 %.
2. The height of the pedestrian bounding box is 200
pixels or more.
3. The entire pedestrian bounding box is present in
the in-vehicle camera image.
4. Target pedestrian and the gaze object appear in the
same image.
5. Annotation target frames were sampled every 5 seconds in nuScenes and Waymo.
Consequently, a total of 2,725 pedestrians (1,760
images) were selected for annotation. The detail of
the dataset is shown in Table 1.
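Conditions 1 to 3 above can be checked programmatically; the sketch below is one plausible reading in which the overlap in condition 1 is measured as the intersection area divided by the pedestrian's box area, which is an assumption.

```python
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def keep_pedestrian(ped_box, other_boxes, img_w, img_h):
    """ped_box, other_boxes: xyxy boxes in pixels. Returns True if conditions 1-3 hold."""
    # 1. Overlap with every other object is less than 25% (of the pedestrian box area; assumption).
    area = box_area(ped_box)
    if any(intersection(ped_box, o) / area >= 0.25 for o in other_boxes):
        return False
    # 2. The bounding-box height is at least 200 pixels.
    if ped_box[3] - ped_box[1] < 200:
        return False
    # 3. The whole bounding box lies inside the image.
    x1, y1, x2, y2 = ped_box
    return 0 <= x1 and 0 <= y1 and x2 <= img_w and y2 <= img_h

if __name__ == "__main__":
    print(keep_pedestrian([100, 80, 220, 400], [[400, 100, 600, 380]], 1600, 900))  # True
```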
Figure 4: Example of annotation.
Figure 5: Situations in which annotators are unable to select an object that the pedestrian is gazing at: (a) no object, (b) outside the image, (c) eye-contact, (d) full backward.
4.2 Annotations
Three annotators annotated 2,725 pedestrian’s gaze
objects in the dataset. As shown in Fig. 4, the an-
notators clicked on the gaze object of each pedestrian
using a specialized annotation tool for this task.
In some cases, the annotators could not determine the gaze object for several reasons, such as no gaze object being present in the scene, the gaze object lying outside the image, eye contact, or a fully backward-facing posture. Such pedestrians cannot be annotated directly by the above annotation steps, so the annotators assigned the following special labels:
No Object
As shown in Fig. 5(a), the “no object” label is an-
notated for a pedestrian who is not gazing at any
object. In this case, the annotator clicked on the
area at which the pedestrian was gazing.
Table 2: Annotation results.
Gazing at Pedestrians
Object 1,234
No object 450
Outside the image 370
Eye-contact 256
Full backward 20
Others 395
Outside the Image
As shown in Fig. 5(b), the “outside the image”
label is annotated for a pedestrian whose gaze object is outside the image. We placed a small click-
able area around the image in the annotation tool,
and the annotator clicked within this area while
maintaining the pedestrian’s viewing direction.
Eye Contact
As shown in Fig. 5(c), the “eye-contact” label is
annotated for a pedestrian who was gazing at the
in-vehicle camera.
Full Backward
As shown in Fig. 5(d), the “full backward” label
is annotated for a backward-facing pedestrian, be-
cause it is difficult to determine the gaze object in
the scene.
4.3 Annotation Results
Table 2 and Fig. 6 show the annotation results of the
pedestrian’s gaze objects in our dataset. As previously
mentioned, 2,725 pedestrians were annotated. The
target pedestrians are indicated by the yellow boxes,
the red dots are the points clicked by the annotators,
and the objects indicated by the red boxes are PEGOs
in Fig. 6. For each object, we recorded how many annotators selected it as a PEGO. This allows the ground truth of
the gaze object to be changed according to the PEGO
detection task. In contrast to existing datasets for gaze
estimation, the size of the pedestrian relative to the
image size is small, and the pedestrian and the target
object are far apart.
The following annotations were included in the
dataset.
- Target pedestrian’s ID
- Target pedestrian’s bounding box
- Gaze point coordinates
- Eye-contact or not
- Full backward or not
- Bounding box of the PEGO (only for a pedestrian who gazes at an object)
- Category of the PEGO (only for a pedestrian who gazes at an object)
Figure 6: Annotation results: The target pedestrian in each image is indicated by yellow box and the PEGOs are indicated by
red boxes. Gaze points in the annotation results are indicated by red dots.
5 EXPERIMENT
A trained PEGO Transformer was used to detect the
pedestrian’s gaze object. In this section, we present
the experimental conditions and the results.
5.1 Implementation Details
We trained the PEGO Transformer on the dataset ex-
plained in Section 4. Only pedestrians gazing at an object were used for training. We performed five-fold cross-validation on the dataset. Each fold contains 47,
44, 43, 44, 48 scene images and 92, 91, 89, 112, 130
pedestrians, respectively. In the training step, horizontal
flips were applied to each image as data augmenta-
tion.
In the training procedure, only the parameters of
the projection layer were updated. The feature ex-
tractor and the deformable transformer were initial-
ized with the pre-trained weights of the Deformable
DETR (Zhu et al., 2021). We used the top 40 object
queries (N = 40) whose highest-class probability was
higher than the threshold. The threshold δ was set to
0.3.
5.2 Comparative Methods
To investigate the effectiveness of PEGO Trans-
former, we used two comparative methods to estimate
the gaze object of pedestrians. The first model was
a line-of-sight prediction model based on GazeFollow (Re-
casens et al., 2015). We created an MLP model to es-
timate the pedestrian’s line of sight and trained it on
the constructed PEGO dataset. The distance from the
estimated line of sight to each of the candidate objects
was calculated, and the object closest to the estimated
line of sight was selected as the gaze object.
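A sketch of the selection step of this baseline: the candidate object whose center lies closest to the estimated gaze ray is chosen. The head position, gaze direction, and object centers are illustrative inputs, and the exact distance definition used in the experiment may differ.

```python
import math

def point_ray_distance(p, origin, direction):
    """Perpendicular distance from point p to the ray (origin, unit direction);
    points behind the origin are penalized with infinity."""
    vx, vy = p[0] - origin[0], p[1] - origin[1]
    t = vx * direction[0] + vy * direction[1]      # projection onto the gaze direction
    if t < 0:
        return math.inf                            # object lies behind the estimated gaze
    cx, cy = origin[0] + t * direction[0], origin[1] + t * direction[1]
    return math.hypot(p[0] - cx, p[1] - cy)

def select_gaze_object(head_xy, gaze_dir, object_centers):
    dists = [point_ray_distance(c, head_xy, gaze_dir) for c in object_centers]
    return min(range(len(dists)), key=dists.__getitem__)   # index of the closest object

if __name__ == "__main__":
    centers = [(300, 260), (120, 400), (520, 300)]
    print(select_gaze_object((200, 250), (1.0, 0.0), centers))   # 0: roughly along the gaze ray
```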
The second model was GaTector (Wang et al.,
2022). The model was pre-trained on the GOO
dataset (Tomas et al., 2021) and fine-tuned on the con-
structed PEGO dataset. In this comparison experi-
ment, the energy aggregation loss of GaTector was
calculated for all candidate objects in the image using
the estimated gaze heatmap, and the object with the
smallest loss was selected as the gaze object. The en-
ergy aggregation loss is the ratio of the average of the
estimated gaze heatmap over the entire image to the
average of the estimated heatmap over the bounding
box of the object.
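A sketch of this selection criterion: for each candidate box, the ratio of the heatmap's mean over the whole image to its mean inside the box is computed, and the box with the smallest ratio is chosen. The heatmap resolution and boxes are illustrative.

```python
import numpy as np

def energy_aggregation_loss(heatmap, box):
    """heatmap: (H, W) estimated gaze heatmap; box: (x1, y1, x2, y2) in heatmap pixels.
    Smaller values mean the gaze energy concentrates inside the box."""
    x1, y1, x2, y2 = box
    inside = heatmap[y1:y2, x1:x2]
    return heatmap.mean() / (inside.mean() + 1e-8)

def select_by_heatmap(heatmap, candidate_boxes):
    losses = [energy_aggregation_loss(heatmap, b) for b in candidate_boxes]
    return int(np.argmin(losses))   # index of the object with the smallest loss

if __name__ == "__main__":
    hm = np.zeros((64, 64))
    hm[20:30, 40:50] = 1.0                       # gaze energy concentrated in one region
    boxes = [(40, 20, 50, 30), (0, 0, 10, 10)]   # the first box covers the hot region
    print(select_by_heatmap(hm, boxes))          # 0
```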
5.3 Results and Discussion
Table 3 reports the accuracy of the PEGO detection.
We evaluated whether the detected gaze object with
the highest score for the target pedestrian was the
same as the ground truth of the gaze object in our
dataset, which we refer to as the Top1 accuracy. In ad-
dition, we evaluated whether the correct answer was included among the top 2, 3, 4, and 5 highest-scoring objects, referred to as Top2, Top3, Top4, and Top5 accuracy, respectively.
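A sketch of this Top-k evaluation: a pedestrian is counted as correct if the ground-truth gaze object appears among the k highest-scoring candidates.

```python
import torch

def topk_accuracy(gaze_scores, gt_index, ks=(1, 2, 3, 4, 5)):
    """gaze_scores: (P, N) per-pedestrian object scores; gt_index: (P,) ground-truth indices."""
    order = gaze_scores.argsort(dim=1, descending=True)   # (P, N) objects ranked by score
    hits = order == gt_index.unsqueeze(1)                 # True where the GT object sits
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}

if __name__ == "__main__":
    scores = torch.randn(100, 40)
    gt = torch.randint(0, 40, (100,))
    print(topk_accuracy(scores, gt))   # roughly k/40 for random scores
```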
The PEGO Transformer is able to estimate the
gaze object of pedestrians with higher accuracy than
the chance rate and the comparative methods. Figures 7(a),
(b), and (c) show examples of successful PEGO de-
tection. From these results, the PEGO Transformer
succeeded in estimating the gaze object using full-
body image features. In contrast, in Fig. 7(d), the estimated gaze object was in the opposite direction from the object the pedestrian was actually gazing at.
As seen in Table 3, the proposed PEGO Trans-
former outperformed the comparative method (Ga-
Tector) that explicitly uses head images. GaTec-
tor requires high-resolution head images to estimate
PEGO, but it is difficult to obtain such images of
target pedestrians in traffic scenes because they are far from the camera. Therefore, the performance of GaTector was lower than expected.
Next, it is difficult for the line-of-sight-based method to determine the gaze object when multiple objects exist close to the estimated line of sight. On the other hand, as seen
in Fig. 7(c), the proposed PEGO Transformer can cor-
rectly estimate the gaze object even if a pedestrian
stands very close to the target pedestrian. However,
the proposed PEGO Transformer does not take into
Table 3: Pedestrian’s Gaze Object (PEGO) detection accuracy.
Method Top1(%) Top2(%) Top3(%) Top4(%) Top5(%)
Random guess 3.44 6.78 10.0 13.1 16.2
Comparative (Line of sight) 29.8 51.4 56.8 63.9 65.3
Comparative (GaTector) 19.5 22.3 24.6 29.3 32.0
Proposed (PEGO Transformer) 47.2 70.8 83.3 93.8 97.9
(a) True: She gazes at an approaching car. (b) True: She gazes at an approaching car. (c) True: He gazes at the car in the back. (d) False: She gazes at the man on the motorcycle and not at the car behind her.
Figure 7: Examples of detection: The target pedestrian is indicated by the yellow box and the estimated PEGO by the red
box. Gaze points in the annotation results are indicated by red dots.
account the pedestrian’s pose, which can make it diffi-
cult to determine the gaze object, as seen in Fig. 7(d).
As analyzed by Wang et al. (Wang et al., 2022), the bottleneck of GaTector is that the result of the gaze heatmap estimation affects the gaze object prediction. On the other hand, the PEGO Transformer can
directly detect the gaze object without relying on the
gaze heatmap. Therefore, the PEGO Transformer per-
formed well in scenes where gaze heatmap estimation
was difficult.
6 CONCLUSIONS
In this paper, we present a new task of detecting an
object that a target pedestrian is gazing at in a traffic
scene. Our proposed PEGO Transformer can estimate
a pedestrian’s gaze object without the high-resolution
head image and the gaze heatmap used in conven-
tional gaze detection methods. Unlike existing gaze
detection datasets, which consider daily-life scenes, our
dataset focuses on traffic scenes. This method and
dataset can provide important information for behav-
ior prediction and contribute to the realization of au-
tomated vehicles.
The PEGO dataset proposed in this paper con-
sists of gazed objects of pedestrians that are annotated
from a third-person view. This annotation scheme can
be easily applied to existing large datasets, but the
annotations may differ from the true objects gazed at
by the pedestrians. Thus, future work will include
evaluation of the PEGO transformer in controlled ex-
periments in which pedestrians gaze at predefined tar-
gets.
The previous study by Hata et al. (Hata et al.,
2022) showed that skeleton information is effective in
estimating whether a pedestrian is gazing at an in-
vehicle camera. Therefore, it is expected that the ac-
curacy will be further improved by taking skeletal in-
formation into account in the PEGO Transformer. We
plan to extend the dataset to improve the accuracy of
the proposed method.
ACKNOWLEDGMENT
This work was partially supported by JSPS Grant-in-
Aid for Scientific Research 23H03474. The computa-
tion was carried out using the General Projects on su-
percomputer “Flow” at Information Technology Cen-
ter, Nagoya University.
REFERENCES
Belkada, Y., Bertoni, L., Caristan, R., Mordan, T., and
Alahi, A. (2021). Do pedestrians pay attention? eye
contact detection in the wild.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E.,
Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Bei-
jbom, O. (2020). nuScenes: A multimodal dataset
for autonomous driving. In Proceedings of the 2020
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 11618–11628.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-End object de-
tection with transformers. In Proceedings of the European Conference on Computer Vision, pages 213–
229. Springer.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The Cityscapes Dataset for Semantic Ur-
ban Scene Understanding. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 3213–3223.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. Interna-
tional Journal of Robotics Research (IJRR), pages 1231–1237.
Hata, R., Deguchi, D., Hirayama, T., Kawanishi, Y., and
Murase, H. (2022). Detection of distant eye-contact
using spatio-temporal pedestrian skeletons. In Pro-
ceedings of the IEEE 25th International Conference
on Intelligent Transportation Systems, pages 2730–
2737.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings
of the 2016 IEEE Conference on Computer Vision and
Pattern Recognition, pages 770–778.
Rasouli, A., Kotseruba, I., Kunic, T., and Tsotsos, J. (2019).
PIE: A large-scale dataset and models for pedestrian
intention estimation and trajectory prediction. In Pro-
ceedings of the 2019 IEEE/CVF International Confer-
ence on Computer Vision, pages 6261–6270.
Recasens, A., Khosla, A., Vondrick, C., and Torralba, A.
(2015). Where are they looking? In Proceedings of
the Advances in Neural Information Processing Sys-
tems, volume 28, pages 199–207.
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Pat-
naik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine,
B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Tim-
ofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi,
A., Zhang, Y., Shlens, J., Chen, Z., and Anguelov,
D. (2020). Scalability in perception for autonomous
driving: Waymo Open Dataset. In Proceedings of the
2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2443–2451.
Tomas, H., Reyes, M., Dionido, R., Ty, M., Mirando, J.,
Casimiro, J., Atienza, R., and Guinto, R. (2021).
GOO: A dataset for gaze object prediction in retail
environments. In Proceedings of the 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion Workshops, pages 3119–3127.
Tu, D., Min, X., Duan, H., Guo, G., Zhai, G., and Shen, W.
(2022). End-to-End Human-Gaze-Target Detection
with Transformers. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 2192–2200.
Tu, D., Shen, W., Sun, W., Min, X., Zhai, G., and Chen,
C. (2023). Un-gaze: a unified transformer for joint
gaze-location and gaze-object detection. IEEE Trans-
actions on Circuits and Systems for Video Technology.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is All you Need. In Proceedings of
the 2017 Advances in Neural Information Processing
Systems, volume 30.
Wang, B., Hu, T., Li, B., Chen, X., and Zhang, Z. (2022).
GaTector: A unified framework for gaze object pre-
diction. In Proceedings of the 2022 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 19588–19597.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2021).
Deformable DETR: Deformable transformers for end-
to-end object detection. In Proceedings of the 9th In-
ternational Conference on Learning Representations.