End-to-End Gaze Grounding of a Person Pictured from Behind
Hayato Yumiya¹, Daisuke Deguchi¹, Yasutomo Kawanishi² and Hiroshi Murase¹
¹Institute of Intelligent System, Nagoya University, Japan
²RIKEN GRP, Japan

Keywords: 3D Human Posture, Gaze Grounding, Metric Learning, Person-to-Person Differences.
Abstract: In this study, we address a novel problem of end-to-end gaze grounding, which estimates the area of an object at which a person in an image is gazing, especially focusing on images of people seen from behind. Existing methods usually estimate facial information such as eye gaze and face orientation first, and then estimate the area at which the target person is gazing; they do not work when a person is pictured from behind. In this study, we focus on an individual's posture, which is a feature that can be obtained even from behind. Posture changes depending on where a person is looking, although this varies from person to person. We propose an end-to-end model designed to estimate the area at which a person is gazing from their 3D posture. To minimize differences between individuals, we also introduce the Posture Embedding Encoder Module as a metric learning module. To evaluate the proposed method, we constructed an experimental environment in which a person gazed at a certain object on a shelf and built a dataset consisting of pairs of 3D skeletons and gaze targets. In an evaluation on this dataset, we confirmed that the proposed method can estimate the area at which a person is gazing from behind.
1 INTRODUCTION
Understanding the objects to which a person directs
their gaze plays an important role in understanding
human actions and intentions. The more a person is
attracted to an object, the more they may tend to focus their gaze on it. This information is important
in various applications, such as analyzing degrees of
interest in commercial products. For example, sup-
pose a customer is gazing at a certain product for an
extended period. We can then predict that the customer may be interested in the product and may be likely to purchase it. In this research, we refer to the
task of associating the area at which a person appears
to be gazing with an actual object in the real world as
gaze grounding. Many methods have been proposed
to estimate the area at which a person in an image is
gazing (Jha and Busso, 2018; Fridman et al., 2016).
Fridman et al. proposed a method to accurately es-
timate the area of a person’s attention by extracting
face orientations from images. However, these meth-
ods are typically ineffective in practice because cam-
eras may have a viewpoint situated behind the target
person. Also, in realistic situations such as in retail
stores, installing cameras that capture a person from a frontal view may be awkward or impractical. Therefore, people in such camera images often stand with their backs
to the camera. In these cases, because their face can-
not be extracted from the image, the area at which
they are gazing cannot be estimated by existing meth-
ods. Additionally, most existing methods estimate a
heat map of the gaze area in the image, which does
not directly correspond to the actual object.
In this study, we address the problem of gaze
grounding, which estimates the gaze object area of
a person in an image, especially the person pictured
from behind. Humans can estimate where the person
is looking from the posture, even from behind. This
is the case because we know that posture changes de-
pending on where a person is looking. For example,
Figs. 1 (a) and (b) show examples of a person stand-
ing with their back to the camera. They show that
the posture changes depending on where the person
directs their gaze, with their head turning in different
directions or bending forward to observe lower ob-
jects. This inspired us to estimate the gaze area by
focusing on differences in postures. However, by analyzing the postures more deeply, we found considerable differences between individuals' postures even when they look at the same object from the same location, as shown in Figs. 1 (b) and (c). This indicates that these differences need to be compensated
for in order to estimate the area of the object at which a person is gazing from their posture.

Figure 1: Looking at one of the objects on a shelf. (a) Upper left (Person A). (b) Lower left (Person A). (c) Lower left (Person B).
In this study, we propose an end-to-end method to
estimate the area of the object at which a person is
gazing from their posture while differences between
individuals. Here, human posture is defined as a set of
3D locations of body joints calculated by 3D pose es-
timation. In the proposed method, a posture is embed-
ded into a posture feature space. The feature space is
trained by deep metric learning to differentiate poses
if the gazed objects different while bringing even dif-
ferent persons close together in the feature space if
the objects of their gaze are the same. This empha-
sizes posture differences depending on the gaze tar-
get while minimizing person-to-person differences.
Then, a likelihood map associated with the scene is
generated from the embedded posture feature. By re-
ferring to the object location in the scene, the method
aggregates likelihoods within each target object re-
gion. Finally, the region with the highest likelihood
is selected as the region of the object of the person’s
gaze.
The contributions of this paper are as follows.
• We define the problem of gaze grounding for a person pictured from behind.
• We propose a method to estimate the area of an object at which a person is gazing from a camera viewpoint located behind them. The method generates a likelihood map from posture information and associates the map with regions of objects.
• We propose a deep feature embedding method that can compensate for differences between individuals. This brings posture pairs of different persons looking at the same target object close together in the feature space.
• We provide a method to aggregate likelihoods by referring to object locations. This enables us to estimate gazed objects with good robustness to object locations.
• We also propose an end-to-end training method that jointly trains the deep feature embedding and the likelihood-map generation model.
2 RELATED WORK
2.1 Gaze Estimation
Kellnhofer et al. (Kellnhofer et al., 2019) have pro-
posed a method of gaze estimation with a model
trained on images captured under various situations
and camera viewing directions. They constructed a
dataset called Gaze360 comprising indoor and out-
door videos captured by an omnidirectional camera
and annotated with 3D gaze directions. Because this
dataset contains a large number of individual persons,
it can be widely used to evaluate 3D gaze estimation
methods. The estimation accuracy was also improved
using multiple consecutive frames as input to a long
short-term memory (LSTM). However, this method
cannot be applied to estimate the gaze area of a per-
son with their back to the camera.
Nonaka et al. (Nonaka et al., 2022) focused on
the coordination between gaze, head, and body,
and proposed a gaze estimation method using tem-
poral information of head position and posture. They
constructed a dataset with 3D annotations of gaze di-
rection on videos of multiple situations captured by
surveillance cameras, and modeled the gaze direction
likelihood distribution representing the relationships
between head and body postures. Here, a neural net-
work was used to represent the conditional distribu-
tion of gaze direction. This shows that 3D gaze can
be estimated even for scenes with considerable occlu-
sion. However, this method cannot estimate the 3D
gaze direction from a single frame because it requires
temporal information.
2.2 Gaze Estimation from Behind a Person
Bermejo et al. (Bermejo et al., 2020) proposed a
method to estimate gaze direction from the back of a
person’s head. Their method estimates the gaze direc-
tion using the head region detected by YOLO (Red-
mon and Farhadi, 2018) from a single frame captured
by a third-person view camera. In addition, they cre-
ated 3D models of various people and virtually gen-
erated images of a person pictured from behind in
various environments (varying elements such as light source location, angle, camera distance, and so forth).
By using these images for training, they reduced the
estimation error caused by camera placement, angle,
lighting conditions, resolution, and so forth. Finally,
they achieved an estimation error of about 23 degrees
in the horizontal direction and 26 degrees in the ver-
tical direction, which is relatively accurate for esti-
mating gaze direction from behind. However, it is
difficult to estimate the gaze area because the target
object cannot be accurately determined only by the
gaze direction.
2.3 Gaze Area Estimation from Posture
Information
Kawanishi et al. (Kawanishi et al., 2018) proposed
a method for estimating a gaze target using the pos-
ture of a person in an image. Based on the idea that
posture can vary relative to the gaze target, they esti-
mated the target at which a person was looking by classifying the posture into four areas on a book page. Their results suggested
that the human posture can be used to estimate gaze
area. However, because this is a pre-defined classifi-
cation problem, all the target locations should be fixed
beforehand, and the system cannot estimate other tar-
gets.
2.4 Metric Learning
Metric learning is a method for constructing a fea-
ture space embedding that maps semantically identi-
cal data to nearby locations and semantically different
data to distant locations. A typical approach is to learn
a feature space embedding using anchor data, positive
data of the same class, and negative data of a differ-
ent class. Then, the model is trained so that the dis-
tance between the anchor data and the positive data is
smaller than the distance between the anchor data and
the negative data (Chopra et al., 2005; Wang et al.,
2017). In this study, by using this framework, we ob-
tain an embedding that maps postures of people gazing at the same area to nearby features in a feature space.
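As a rough illustration of this anchor/positive/negative scheme, the following PyTorch sketch applies a triplet margin loss to a small embedding network; the network shape, the margin, and the 63-dimensional input are illustrative assumptions, not tied to any specific method cited above.

```python
import torch
import torch.nn as nn

# Toy embedding network: 63-d input (e.g., a flattened 3D posture) -> 4-d embedding.
embed = nn.Sequential(nn.Linear(63, 128), nn.ReLU(), nn.Linear(128, 4))
loss_fn = nn.TripletMarginLoss(margin=1.0)

# Dummy mini-batch of anchor / positive (same class) / negative (different class) samples.
anchor, positive, negative = torch.randn(3, 8, 63).unbind(0)
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()  # pulls anchor-positive pairs together, pushes anchor-negative pairs apart
```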
3 ESTIMATING GAZE OBJECT
AREA FROM BEHIND
To associate the gaze area with an actual object in
the real world, we propose an end-to-end method that
generates a likelihood map of the gaze area for a given
posture and aggregates likelihoods within each object
region to obtain object-wise likelihoods.
The method estimates the gazed object area from
behind a person using their posture. As may be ob-
served in Fig. 1 (a), humans can easily esti-
mate that Person A is looking at the object located at
the upper left of the shelf. In addition, from the pos-
tures when looking at different areas (Figs. 1 (a) and (b)), we can observe that they have different characteristics
in terms of head orientation, bending of the hips and
legs, and so forth. These indicate that we usually take
a similar posture when looking at the same place and
vice versa. From this characteristic, we consider esti-
mating the gaze object area by focusing on differences
in posture, even from behind.
When we analyze the postures more deeply, as
shown in Figs. 1 (b) and (c), we can note some dif-
ferences between individuals. In the figure, a differ-
ent person is looking at an object placed at the lower
left; they are in different postures even though they
are looking at the same area. To compensate for these
differences, we introduce a deep metric learning tech-
nique into the Posture Embedding Encoder module.
Fig. 2 shows an overview of the proposed method.
The method consists of two neural network modules, a Posture Embedding Encoder module and a Likelihood Map Generator module, followed by a Likelihood Aggregation process. The Posture Embedding Encoder
module is trained to compensate for the person-to-
person differences, while the Likelihood Map Gen-
erator module is trained to generate a likelihood map
from a posture. The whole network is trained in an end-to-end manner, which minimizes the sum L of the losses from the Posture Embedding Encoder module, L_e, and the Likelihood Map Generator module, L_d, as given below:

L = L_e + L_d.    (1)
The Likelihood Aggregation process calculates the
object-wise gazed likelihood from the likelihood map
in reference to object locations.
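A minimal PyTorch-style sketch of this end-to-end structure is shown below; the module interfaces and loss signatures are assumptions for illustration, not the authors' exact implementation.

```python
import torch.nn as nn

class GazeGroundingNet(nn.Module):
    """Posture Embedding Encoder followed by a Likelihood Map Generator (hypothetical interfaces)."""
    def __init__(self, encoder: nn.Module, generator: nn.Module):
        super().__init__()
        self.encoder = encoder      # 3D posture vector -> embedded posture feature
        self.generator = generator  # embedded feature -> K x K likelihood map

    def forward(self, posture):
        f = self.encoder(posture)
        m = self.generator(f)
        return f, m

def total_loss(f, m, labels, gt_map, embedding_loss, map_loss):
    # Eq. (1): L = L_e + L_d, so a single backward pass trains both modules jointly.
    return embedding_loss(f, labels) + map_loss(m, gt_map)
```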
Figure 2: The architecture of the proposed model. A 3D posture is embedded by the Posture Embedding Encoder module (trained with NTXentLoss, L_e), the embedding is decoded into a likelihood map by the Likelihood Map Generator module (trained with MSELoss, L_d), and the likelihood is aggregated for each object region by referring to the object locations (e.g., Obj.1: 0.94, Obj.2: 0.03, Obj.3: 0.01).
3.1 Posture Embedding Encoder
Module
Based on the deep metric learning framework, we use
posture and gazed object area labels to learn the Pos-
ture Embedding Encoder h(·), which embeds the i-th posture p_i into a posture feature space that can absorb person-to-person differences.
The ground truth labels of the gaze object area
are provided in the training data. Here, the labels are
IDs of the gaze target object regions. The encoder is
trained so that the distance between two posture fea-
tures is close if the labels are the same and farther
away if the labels are different. This enables us to
project postures into the embedding space that con-
trols for differences between individuals in the pos-
ture feature. The input of the encoder is 21 three-
dimensional coordinates of human joints; that is, it is a 63-dimensional vector p_i ∈ R^63. A posture is embedded into a posture feature f_i = h(p_i) by the encoder.
Here, h(·) is implemented as multiple fully-connected
layers. In the scene, there are several objects that peo-
ple might be expected to gaze at. We assign object
IDs for each object and use them for the metric learn-
ing. Here, we use NTXentLoss (Chen et al., 2020), which can consider multiple labels simultaneously, for L_e. The NTXentLoss for the i-th sample is defined as

L_e = -\log \frac{\exp(f_i \cdot f_+ / \tau)}{\sum_{k=1}^{N} \mathbb{1}_{[i \neq k]} \exp(f_i \cdot f_k / \tau)},    (2)

where f_i, f_+, and f_k are vectors in the feature embedding space calculated from the input posture vectors by the encoder. Here, f_+ is a vector of the same class as f_i in the mini-batch, and f_k is a vector in the mini-batch. This mini-batch is selected by Easy Positive Triplet Mining (Xuan et al., 2020). 1_{[i ≠ k]} ∈ {0, 1} is an indicator function that outputs 1 if i ≠ k and 0 otherwise. The variable τ is a hyperparameter, and we use τ = 0.07 in the experiment.

Figure 3: An example of a ground-truth likelihood map used in the training.
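A minimal sketch of this loss for a mini-batch of posture embeddings is given below; the cosine-style normalization and the way a single positive is picked per anchor are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(features, labels, tau=0.07):
    """NT-Xent-style loss over a mini-batch of embeddings (Eq. 2, simplified).

    features: (N, D) embeddings f_i; labels: (N,) gaze-area IDs.
    """
    f = F.normalize(features, dim=1)           # cosine-style similarity (assumed)
    sim = f @ f.t() / tau                      # (N, N) scaled similarities
    n = f.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=f.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye

    # log of the denominator of Eq. (2): sum over all k != i
    denom = torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1)

    losses = []
    for i in range(n):
        pos_idx = pos_mask[i].nonzero(as_tuple=True)[0]
        if len(pos_idx) == 0:
            continue                           # no positive for this anchor in the batch
        # pick one positive; easy-positive mining would pick the most similar one
        losses.append(-(sim[i, pos_idx[0]] - denom[i]))
    return torch.stack(losses).mean()
```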
3.2 Likelihood Map Generator Module
This section describes the details of the likelihood
map generator corresponding to the gaze area likeli-
hood based on the embedded feature space.
The generator g(·) also consists of a neural net-
work comprising fully connected layers followed by inverse-convolutional (Deconvolution) layers. The input of the generator is an embedded feature f_i computed by the encoder described in the previous section, and the generator is trained to output a likelihood map m_i = g(f_i) representing the gaze area likelihood over the target space.
The ground truth likelihood map for the training
is prepared as follows. First, an image of the target
scene in 40 × 60 pixels, named object area map, is
prepared. Here, we assume that the target space is
a shelf containing multiple objects in a store. In an
object area map, pixels corresponding to a gazed ob-
ject in the space are filled with 1, and the remainder of the pixels are filled with 0. To make the output of the inverse-convolution (Deconvolution) network square (equal height and width), the object area map is extended to 64 × 64 pixels with zero padding. A Gaussian filter (σ = 3) is then applied to the object area map to obtain smooth contours. Figure 3 shows an example of the heat map created as the target.

Figure 4: Example of dataset acquisition setting.
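The ground-truth map construction can be sketched as follows; the 40 × 60 object area map, the zero padding to 64 × 64, and the Gaussian filter with σ = 3 follow the text, while the padding placement and the final normalization are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_gt_likelihood_map(object_mask_40x60: np.ndarray) -> np.ndarray:
    """Build a 64 x 64 ground-truth likelihood map from a 40 x 60 object area map.

    object_mask_40x60: binary map with 1 inside the gazed object region, 0 elsewhere.
    """
    assert object_mask_40x60.shape == (40, 60)
    padded = np.zeros((64, 64), dtype=np.float32)
    padded[:40, :60] = object_mask_40x60         # zero-pad to a square map (placement assumed)
    smoothed = gaussian_filter(padded, sigma=3)  # soften the region contours
    return smoothed / (smoothed.max() + 1e-8)    # normalize to [0, 1] (assumed)
```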
For the training, the loss L_d is the mean squared error (MSE) between an estimated map and a ground-truth map. The loss for the i-th sample is calculated as

L_d = \mathrm{MSE}(m_i, \hat{m}_i) = \frac{1}{K^2} \sum_{j=1}^{K^2} (y_{ij} - \hat{y}_{ij})^2, \quad y_{ij} \in m_i, \; \hat{y}_{ij} \in \hat{m}_i,    (3)

where \hat{m}_i is the ground-truth map corresponding to the i-th input. Here, the size of a likelihood map is K × K, and K = 64 is used in the experiment.
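Eq. (3) corresponds to a standard per-pixel MSE; a minimal PyTorch expression, assuming batched 64 × 64 maps, is:

```python
import torch
import torch.nn.functional as F

def map_loss(m_est: torch.Tensor, m_gt: torch.Tensor) -> torch.Tensor:
    """L_d: mean squared error over the K x K map pixels (K = 64)."""
    # averages the squared pixel-wise differences over the batch and the K^2 pixels
    return F.mse_loss(m_est, m_gt)
```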
3.3 Likelihood Aggregation Process
In this section, we describe the details of the likeli-
hood aggregation process from the estimated likeli-
hood map.
The gaze target object region is determined from
the likelihood map generated by the Likelihood Map Generator module described above, by referring to
the object area map. For each object region, the av-
erage likelihood within the object region is computed
from the likelihood map. Then, the area having the
highest average likelihood is selected as the gazed ob-
ject area.
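A minimal sketch of this aggregation step is given below, assuming the object area map is provided as an integer label image in which each pixel holds the ID of the object region it belongs to (0 for background); this data layout is an assumption.

```python
import numpy as np

def aggregate_likelihood(likelihood_map: np.ndarray, region_ids: np.ndarray) -> int:
    """Return the ID of the object region with the highest average likelihood.

    likelihood_map: (64, 64) map produced by the generator.
    region_ids:     (64, 64) integer map; pixel value r > 0 marks object region r.
    """
    scores = {}
    for r in np.unique(region_ids):
        if r == 0:
            continue  # skip background pixels
        scores[int(r)] = float(likelihood_map[region_ids == r].mean())
    return max(scores, key=scores.get)  # region with the highest average likelihood
```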
Figure 5: Example of object placement on the shelf.
4 EXPERIMENTS
We evaluated the performance of the proposed
method. To do so, we constructed a new dataset as
described below.
4.1 Dataset
The purpose of this study was to estimate the area at which a person is gazing from behind, using the
3D coordinates representing a human posture. How-
ever, no datasets are publicly available for this task.
Therefore, we constructed a new dataset consisting of
3D human postures and their corresponding gaze area
annotations.
First, we describe the details of the data acquisition setting. In the dataset acquisition process, we simu-
lated a situation in which a surveillance camera cap-
tures a person looking at one of the objects on a shelf
in a convenience store. Figure 4 shows the data acqui-
sition environment that we prepared for capturing a
person looking at one of the objects on the shelf from
a specified position. The shelf is 120 cm high × 180 cm wide, and it is divided into 12 areas, each 30 cm high × 60 cm wide.
We put several types of target objects on the shelf
such as plastic bottles, cans, books, and paper cartons.
There are several kinds of cans and bottles; one type
of object is placed in each area. Fig. 5 shows the shelf
that we used in the dataset acquisition. We annotated the object areas as follows. First, the objects were grouped into 12 groups. The 12 regions on the shelf corresponding to the 12 groups of objects were annotated as segmented regions. Even when multiple objects of the same type are in a segment, we treat the segment as one object area. These 12 object areas were used as the ground truth (GT) corresponding to the postures of the training data.
The subjects were standing 0.5 m away from the
shelf to see each object. The data were collected from
seven participants in the experiment (one female and six males).

Table 1: Results of correct answer rate and estimation error.

Method      Top-1 (%)   Top-2 (%)   Top-3 (%)   Estimation Error (m)
Proposed    34.26       55.02       66.13       0.33

Here, we used Azure Kinect to capture
images at a resolution of 1,280 × 720 pixels and a frame rate of 15 fps. The 3D posture from Azure Kinect is originally composed of 32 3D skeletal coordinates. When the target person is captured from behind, the nose, eyes, thumbs, and ears are difficult to estimate accurately due to occlusion, so these joints were not used. In this dataset, 3D postures were composed of 21 3D skeletal coordinates.
Through this data acquisition process, a total of
15,228 frames were collected as a dataset.
4.2 Experimental Settings
A summary of the proposed method is as follows.

Proposed Method
First, the proposed encoder module is applied to
posture (3D coordinates of human joints) to obtain
posture features in the feature space. Then, a deconvolutional neural network is applied to the em-
bedded features to reconstruct a likelihood map
corresponding to the subject’s gaze area. Finally,
object-wise likelihoods are aggregated for every
object region referring to object locations.
Here, we used five fully-connected layers for the
encoder, which outputs a 4-dimensional vector f ∈ R^4 from a 63-dimensional vector p. As a metric learning
framework, Easy Positive Triplet Mining (Xuan et al.,
2020) was used for sampling triples from the training
data.
For the Likelihood Map Generator, we used 6
fully connected layers and 3 convolutional transpose
layers, and the sigmoid activation function was ap-
plied to the output layer to restrict the output values
within the range [0, 1].
For training the entire network, we used the
AdamW (Loshchilov and Hutter, 2017) optimizer
with the loss L defined in equation (1).
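Putting the pieces of this section together, a hedged sketch of the model configuration might look as follows; the layer counts (five FC layers for the encoder, six FC plus three transposed-convolution layers for the generator), the 63-d input, the 4-d embedding, the sigmoid output, and AdamW follow the text, while the hidden widths, kernel sizes, and the 8 × 8 seed map are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                      # 5 fully-connected layers: 63 -> 4
    nn.Linear(63, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 4),
)

class Generator(nn.Module):                   # 6 FC layers + 3 transposed convolutions
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 16 * 8 * 8), nn.ReLU(),
        )
        self.deconv = nn.Sequential(          # 8x8 -> 16x16 -> 32x32 -> 64x64
            nn.ConvTranspose2d(16, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, f):
        x = self.fc(f).view(-1, 16, 8, 8)
        return self.deconv(x).squeeze(1)      # (N, 64, 64) likelihood maps in [0, 1]

generator = Generator()
params = list(encoder.parameters()) + list(generator.parameters())
optimizer = torch.optim.AdamW(params)         # trained end-to-end with L = L_e + L_d (Eq. 1)
```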
Experiments were conducted in a cross-validation
scheme and the dataset was split with six of the seven
participants as training data and one as testing data.
An evaluation was performed using the following two
evaluation metrics. The first was the correct answer
rate, corresponding to how often the method correctly estimated the gazed object among the 12 areas on the shelf. In the proposed method, the average likelihood for each object was calculated, and the highest one was selected. We evaluated whether the ob-
ject with the highest value was the same as the GT,
which we refer to as the Top-1 correct answer rate.
Also, we evaluated whether the correct answer was within the two or three highest-scoring areas, referred to as the Top-2 and Top-3 correct answer rates, respectively.
The second evaluation metric is an estimation er-
ror, defined as the average Euclidean distance between the center point of the area with the highest value and that of the GT area.
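The two metrics can be sketched as below; the ranking-based Top-k computation and the center-to-center interpretation of the estimation error are assumptions consistent with the description above.

```python
import numpy as np

def topk_rate(pred_scores, gt_ids, k=1):
    """Fraction of samples whose GT area is among the k highest-scoring areas.

    pred_scores: (N, 12) per-area average likelihoods; gt_ids: (N,) GT area indices.
    """
    ranking = np.argsort(-pred_scores, axis=1)[:, :k]          # top-k area indices per sample
    return float(np.mean([g in r for g, r in zip(gt_ids, ranking)]))

def estimation_error(pred_ids, gt_ids, centers):
    """Average Euclidean distance between predicted and GT area centers (in meters).

    centers: (12, 2) physical center coordinates of the 12 shelf areas (assumed known).
    """
    return float(np.linalg.norm(centers[pred_ids] - centers[gt_ids], axis=1).mean())
```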
4.3 Results and Discussions
Figs. 6 (c) and 7 (c) show the likelihood maps generated by the proposed method from postures captured by Azure Kinect. Table 1 shows the average correct
answer rate evaluated via cross-validation for the proposed method. As shown in Table 1, the Top-1 correct answer rate was 34.26%. This is clearly better than the roughly 8% chance rate obtained when this problem is treated as a 12-class classification problem. In addition, the estimation error was 0.33 m, which suggests that even when the estimation failed, the estimate was often in the neighborhood of the correct answer. From these results, we consider that the proposed end-to-end method was able to estimate the gaze area from the posture.
4.4 Ablation Study
To investigate the effectiveness of the Posture Em-
bedding Encoder module, an ablation study was con-
ducted with a model designed to estimate the likeli-
hood map from the posture without the Encoder.
A summary of the model’s characteristics is pro-
vided below.
Ablated Model
A deconvolutional neural network was directly
applied to the posture feature (3D coordinates of
human joints) to reconstruct a likelihood map cor-
responding to the subject’s gaze area. The number
of parameters in this model was adjusted to match that of the proposed model.
Table 2: Results of correct answer rate and estimation error.

Model           Encoder Module   Top-1 (%)   Top-2 (%)   Top-3 (%)   Estimation Error (m)
Ablated model   -                20.94       39.02       52.38       0.47
Full model      ✓                34.26       55.02       66.13       0.33

Figure 6: Result of a likelihood map estimated for person A looking at upper-left. (a) looking upper-left; (b) Ground Truth; (c) Full model; (d) Ablated model.

Figure 7: Result of a likelihood map estimated for person A looking at lower-left. (a) looking lower-left; (b) Ground Truth; (c) Full model; (d) Ablated model.

Table 2 compares the estimation results of the proposed method and the ablated model. We observed
that the ablated model, which estimated the likelihood map without posture embedding, showed a Top-1 correct answer rate that was 13.32 points lower and an estimation error that was 0.14 m larger than those of the proposed method; its Top-2 and Top-3 correct answer rates were also greatly decreased. This indicates that the
area with the highest average likelihood selected by
the ablated model was located far from the GT area.
In addition, as shown in the generated likelihood
maps of Figs. 6 (c) and 6 (d), it may be observed that
the spread of the likelihood map generated by the full model was smaller than that of the ablated model.
Fig. 8 shows a t-SNE visualization of the feature space embedding by the proposed encoder module.
As shown in the figure, each class was clearly sepa-
rated and embedded in the feature space.
These results suggest that the ablated model was
strongly affected by the ambiguity of the gaze area
caused by individual differences in posture. On the
other hand, by estimating the likelihood map from a feature embedding space constructed to compensate for differences between people, the proposed method reduces this ambiguity, stably generates a likelihood map, and improves the estimation accuracy.
5 CONCLUSIONS
In this study, we have addressed the problem of end-to-end gaze grounding, especially targeting a person pictured from behind. We have proposed an end-to-end method to estimate the gazed object from the posture of the person by referring to the object locations.
Figure 8: Visualization of the embedded feature space by t-SNE. Each color corresponds to a gaze area.
In the proposed method, the 3D coordinates of
body joints are first projected into a constructed fea-
ture embedding space to compensate for differences
between individuals. The gaze area likelihood map is
generated from the embedded features using a decon-
volutional neural network. The likelihood is averaged
within each object area by referring to the object lo-
cations, and object-wise likelihoods are calculated.
To confirm the effectiveness of the proposed
method, we constructed a new dataset consisting of
3D coordinates of body joints and a target area to
which a person directed their gaze, and experiments
were conducted using this dataset. The experimental
results showed that the proposed approach was able
to estimate the gaze area from the posture, and that the encoder module plays an important role in the performance of the proposed model.
REFERENCES
Bermejo, C., Chatzopoulos, D., and Hui, P. (2020). Eye-
shopper: Estimating shoppers’ gaze using cctv cam-
eras. In Proceedings of the 28th ACM International
Conference on Multimedia, pages 2765–2774.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A simple framework for contrastive learning of visual
representations. In Proceedings of the 37th Interna-
tional conference on machine learning, pages 1597–
1607. ICML.
Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learn-
ing a similarity metric discriminatively, with applica-
tion to face verification. In Proceedings of the 2005
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition (CVPR’05), volume 1,
pages 539–546. IEEE.
Fridman, L., Langhans, P., Lee, J., and Reimer, B. (2016).
Driver gaze region estimation without use of eye
movement. IEEE Intelligent Systems, 31(3):49–56.
Jha, S. and Busso, C. (2018). Probabilistic estimation of the
gaze region of the driver using dense classification. In
Proceedings of the 21st International Conference on
Intelligent Transportation Systems (ITSC), pages 697–
702. IEEE.
Kawanishi, Y., Murase, H., Xu, J., Tasaka, K., and Yanagi-
hara, H. (2018). Which content in a booklet is he/she
reading? reading content estimation using an indoor
surveillance camera. In 2018 24th International Con-
ference on Pattern Recognition (ICPR), pages 1731–
1736. IEEE.
Kellnhofer, P., Recasens, A., Stent, S., Matusik, W., and
Torralba, A. (2019). Gaze360: Physically uncon-
strained gaze estimation in the wild. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 6912–6921.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight de-
cay regularization. arXiv preprint arXiv:1711.05101.
Nonaka, S., Nobuhara, S., and Nishino, K. (2022). Dynamic
3D gaze from afar: Deep gaze estimation from tempo-
ral eye-head-body coordination. In Proceedings of the
2022 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 2192–2201.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Wang, J., Zhou, F., Wen, S., Liu, X., and Lin, Y. (2017).
Deep metric learning with angular loss. In Proceed-
ings of the 16th IEEE International Conference on
Computer Vision, pages 2593–2601.
Xuan, H., Stylianou, A., and Pless, R. (2020). Improved
embeddings with easy positive triplet mining. In Pro-
ceedings of the 2020 IEEE/CVF Winter Conference on
Applications of Computer Vision, pages 2474–2482.