2.2 Gaze Estimation from Behind a Person
Bermejo et al. (Bermejo et al., 2020) proposed a
method to estimate gaze direction from the back of a
person’s head. Their method estimates the gaze direc-
tion using the head region detected by YOLO (Red-
mon and Farhadi, 2018) from a single frame captured
by a third-person view camera. In addition, they cre-
ated 3D models of various people and virtually gen-
erated images of a person pictured from behind in
various environments (varying elements such as light
source location, angle, camera distance, and so forth).
By using these images for training, they reduced the
estimation error caused by camera placement, angle,
lighting conditions, resolution, and so forth. Finally,
they achieved an estimation error of about 23 degrees
in the horizontal direction and 26 degrees in the ver-
tical direction, which is relatively accurate for esti-
mating gaze direction from behind. However, the gaze area is still difficult to estimate because the target object cannot be accurately determined from the gaze direction alone.
2.3 Gaze Area Estimation from Posture
Information
Kawanishi et al. (Kawanishi et al., 2018) proposed
a method for estimating a gaze target using the pos-
ture of a person in an image. Based on the idea that posture varies with the gaze target, they formulated gaze target estimation as a classification of the person's posture into one of four areas on a book page. Their results suggested that human posture can be used to estimate the gaze area. However, because this is a pre-defined classification problem, all target locations must be fixed beforehand, and the system cannot estimate other targets.
2.4 Metric Learning
Metric learning is a method for constructing a fea-
ture space embedding that maps semantically identi-
cal data to nearby locations and semantically different
data to distant locations. A typical approach uses triplets consisting of anchor data, positive data of the same class, and negative data of a different class; the model is trained so that the distance between the anchor and the positive data is smaller than the distance between the anchor and the negative data (Chopra et al., 2005; Wang et al., 2017). In this study, we use this framework to obtain embeddings that map postures gazing at the same area to nearby features in the feature space.
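As an illustration, the following is a minimal sketch of this triplet-based training scheme in PyTorch; the encoder architecture, input dimensionality, and margin are hypothetical placeholders, not the settings used in this study.

```python
import torch
import torch.nn as nn

# Hypothetical encoder mapping a posture vector to an embedding.
class PostureEncoder(nn.Module):
    def __init__(self, in_dim=34, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = PostureEncoder()
# Triplet margin loss: pulls anchor-positive pairs together and pushes
# anchor-negative pairs apart by at least the margin.
criterion = nn.TripletMarginLoss(margin=1.0)

anchor = encoder(torch.randn(8, 34))    # postures gazing at area A
positive = encoder(torch.randn(8, 34))  # other postures gazing at area A
negative = encoder(torch.randn(8, 34))  # postures gazing at a different area
loss = criterion(anchor, positive, negative)
loss.backward()
```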
3 ESTIMATING GAZE OBJECT
AREA FROM BEHIND
To associate the gaze area with an actual object in
the real world, we propose an end-to-end method that
generates a likelihood map of the gaze area for a given
posture and aggregates likelihoods within each object
region to obtain object-wise likelihoods.
The method estimates the gazed object area from
behind a person using their posture. As seen in Fig. 1 (a), humans can easily estimate that Person A is looking at the object located at
the upper left of the shelf. In addition, comparing the postures of people looking at different areas (Figs. 1 (a) and (b)), we can observe differences in characteristics such as head orientation and the bending of the hips and legs. This indicates that people usually take a similar posture when looking at the same place and different postures when looking at different places. Based on this characteristic, we estimate the gazed object area by focusing on differences in posture, even from behind.
When we analyze the postures more deeply, as
shown in Figs. 1 (b) and (c), we can note some dif-
ferences between individuals. In the figure, a differ-
ent person is looking at an object placed at the lower
left; they are in different postures even though they
are looking at the same area. To compensate for these
differences, we introduce a deep metric learning tech-
nique into the Posture Embedding Encoder module.
Fig. 2 shows an overview of the proposed method.
The method consists of two neural network modules, a Posture Embedding Encoder module and a Likelihood Map Generator module, followed by a Likelihood Aggregation process. The Posture Embedding Encoder
module is trained to compensate for the person-to-
person differences, while the Likelihood Map Gen-
erator module is trained to generate a likelihood map
from a posture. It is trained in an end-to-end manner, which minimizes the sum $L$ of the losses from the Posture Embedding Encoder module, $L_e$, and the Likelihood Map Generator module, $L_d$, as given below:

$$L = L_e + L_d. \tag{1}$$
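Conceptually, one training step under Eq. (1) can be sketched as follows, assuming PyTorch; the module architectures, map resolution, and loss functions are hypothetical placeholders used only to show how the two losses are summed and back-propagated jointly.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two modules; the real architectures differ.
posture_encoder = nn.Linear(34, 64)        # Posture Embedding Encoder
map_generator = nn.Linear(64, 32 * 32)     # Likelihood Map Generator

optimizer = torch.optim.Adam(
    list(posture_encoder.parameters()) + list(map_generator.parameters())
)

postures = torch.randn(8, 34)              # batch of posture vectors
gt_maps = torch.rand(8, 32 * 32)           # ground-truth gaze likelihood maps

embeddings = posture_encoder(postures)
pred_maps = torch.sigmoid(map_generator(embeddings))

L_e = torch.tensor(0.0)                    # metric-learning loss on embeddings (see Sec. 2.4)
L_d = nn.functional.binary_cross_entropy(pred_maps, gt_maps)
L = L_e + L_d                              # Eq. (1): the two losses are summed
optimizer.zero_grad()
L.backward()
optimizer.step()
```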
The Likelihood Aggregation process calculates the object-wise gaze likelihood from the likelihood map with reference to the object locations.
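As a concrete illustration of this step, the sketch below aggregates a pixel-wise likelihood map into object-wise scores, assuming each object region is given as an axis-aligned bounding box; averaging over the region is an illustrative choice, not necessarily the exact operation used here.

```python
import numpy as np

def aggregate_object_likelihoods(likelihood_map, object_boxes):
    """Aggregate a pixel-wise gaze likelihood map into object-wise scores.

    likelihood_map: 2D array (H, W) of per-pixel gaze likelihoods.
    object_boxes: dict mapping object name -> (x1, y1, x2, y2) region.
    Returns a dict mapping object name -> mean likelihood inside its region.
    """
    scores = {}
    for name, (x1, y1, x2, y2) in object_boxes.items():
        region = likelihood_map[y1:y2, x1:x2]
        scores[name] = float(region.mean()) if region.size else 0.0
    return scores

# Usage: the object with the highest aggregated likelihood is the estimate.
lmap = np.random.rand(64, 64)
boxes = {"upper_left_shelf": (0, 0, 32, 32), "lower_left_shelf": (0, 32, 32, 64)}
scores = aggregate_object_likelihoods(lmap, boxes)
estimated = max(scores, key=scores.get)
```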