Limitations of Local-minima Gaze Prediction
Peter A. C. Varley (https://orcid.org/0000-0003-4181-9234), Stefania Cristina (https://orcid.org/0000-0003-4617-7998), Kenneth P. Camilleri (https://orcid.org/0000-0003-0436-6408) and Alexandra Bonnici (https://orcid.org/0000-0002-6580-3424)
Department of Systems and Control Engineering, University of Malta, Msida MSD 2080, Malta
Keywords: Gaze Prediction, Eye Tracking, Feature Location.
Abstract:
We describe a minimal gaze prediction system which is straightforward to implement, can run on everyday
hardware, and does not require high-quality video images. We determine head pose and eye gaze from four
facial landmarks (nose tip, nose bridge, and eye pupils) which can be expressed as local minima of simple
pixel-intensity operations. We assess its stability to variation in subjects' anatomy, facial landmark outliers, and
small systematic errors in facial landmark positions.
1 INTRODUCTION
In this paper, we describe a minimal gaze prediction
system which is straightforward to implement, can
run on everyday hardware, and does not require high-
quality video images. We determine head pose and
eye gaze from four facial landmarks (nose tip, nose
bridge, and eye pupils) which can be expressed as lo-
cal minima of simple pixel-intensity operations. We
assess its stability to variation in subjects' anatomy,
facial landmark outliers, and small systematic errors
in facial landmark positions.
Our interest is in providing tools for design meet-
ings where designers meet to discuss their ideas, and
for client-facing meetings where such designs are dis-
played to customers. We wish to use gaze to commu-
nicate the focus of attention to all participants.
However, many aspects of the current pandemic
are unpredictable: not just its duration, but its long-
term sociological and cultural effects. We can be
fairly sure that indoor meetings will be less fre-
quent, and that when they do take place, the partic-
ipants will be masked, making gaze-tracking diffi-
cult. Outdoor settings are safer and more acceptable,
the already-observed shift towards walking meetings
(Damen et al., 2020) will surely continue, and partici-
pants in such meetings may well have their hands full
and welcome a hands-free input interface.
We can foresee that social distancing will nor-
malise the idea of seeking information from machines
rather than humans, but those seeking information
will be wary of touch-screen input interfaces and
would prefer something which can be operated from
a safe distance, such as the interactive display sug-
gested by (Zhang, 2016).
We can even envisage that, with people spending
more time at home, home automation controlled via a
wall-mounted screen will become more popular. Con-
trolling this interface by gaze from the other side of
the room has its appeal (Tofel, 2020).
Thus, without tying ourselves down to any spe-
cific application, we can see various possibilities for
a flexible, portable system comprising a projector, a
wall-mounted screen, a camera, and a portable com-
puter. The software should be modular—specific
applications will have specific requirements, so up-
grading any particular software component must be
straightforward.
Such a prototype would enable us to determine
which potential applications are realistic, and which
specific components would require upgrading for
these applications to become a reality.
Typical display screens are around 200 × 160cm
to 240 × 180cm—in this investigation, we assume
240 × 180cm. A 3 × 3 grid of virtual buttons will be
sufficient for simple control applications.
Section 2 describes previous work in gaze recog-
nition in general, and more specifically in locating eye
and nose features. Section 3 explains and describes
our own system. Section 4 gives snapshots of our re-
sults. Section 5 discusses system stability. Section
6 presents our conclusions and recommendations for
future work.
2 PREVIOUS WORK
In reviewing previous work, we are interested both
in methods and in applications. Choice of a mini-
mal subset of facial landmarks is important, and, as
we shall see, methods for locating eyes and noses are
of particular interest. Section 2.1 considers applica-
tions. Section 2.2 gives an overview of previous work
in gaze prediction. Section 2.3 discusses choice of
landmarks. Section 2.4 discusses previous work in lo-
cating eyes and pupils. Section 2.5 discusses previous
work in locating nose landmarks.
2.1 Applications
Toolkits for tracking the gaze of the user of a per-
sonal computer are now available commercially—
GazeRecorder (GazeRecorder, 2020) is one such—
so this must be considered a mature technology. It
is nevertheless worth noting that it is strongly range-
dependent—GazeRecorder applications are satisfac-
tory when the user is 70–80cm from the camera, but
the performance deteriorates rapidly with increasing
distance.
Multi-user gaze-tracking applications, and gaze
tracking at a distance (anything over 1m), are much
rarer.
Zhang (Zhang, 2016) considers various public-
facing applications, either outdoors or in shopping
malls, based on the concept of an interactive display
which is intuitive to use and requires no instruction.
For person-independent eye tracking for public
display applications, the accuracy is about one third
of the screen size. Rather than attempt a numer-
ical gaze prediction, the system classifies gazes as
left/centre/right, with N consecutive identical pre-
dictions constituting a command to which the sys-
tem will respond (there is a suggestion that N=6 was
used). This is sufficient for an application which al-
lows the user to choose one of three side-by-side op-
tions. It was noted that several users wore glasses and
used the system without problems, but varying height
was a problem as tall users tended to stoop to use the
system, while shorter users lifted their heels.
The suggested application is an album cover
browser where the user cycles through clockwise or
anticlockwise until the desired album shows up (al-
though we note that, even if the intention is to adver-
tise the products of one label, a typical label will have
between 1,000 and 10,000 albums on the market, and
cycling through all of them is impractical).
A more realistic suggestion is an events calendar
for the coming month—users should be able to de-
duce which way to scroll, and there will not be so
many events that they get bored before reaching the
one they want.
Sidenmark (Sidenmark et al., 2020) attempt to
distinguish natural head movements from intentional
head pose changes. Although initial results suggest
that this is possible in principle, this remains work in
progress.
Mardanbegi (Mardanbegi et al., 2019) address a
fundamental problem in multi-person gaze-tracking:
who is looking at the screen and who is not? Unfor-
tunately, this work includes what to us is a horrible
example of a bad interface paradigm: they use shak-
ing the head to signify select. While the interpretation of head gestures
is culture-dependent, in most cultures with which we
are familiar, nodding signifies acceptance and shaking
the head signifies rejection.
2.2 Gaze Tracking (General)
When discussing how images of faces may be pro-
cessed, the distinction is often made between model
based and appearance based methods. It is not clear
that this distinction is justifiable, let alone helpful, as
there is a large overlap.
Model based methods assume that what is being
processed is a face, and that faces have certain known
properties. Some model based methods hypothesise
things which cannot be seen, such as the centre of
the eyeball. But methods which only use landmarks
which can be seen (eyes, nose, mouth) and label them
as component parts of a face are also model based.
Appearance based methods use only that which
can be seen. A few (but not many) appearance based
methods make no assumptions at all about the face
image, but just feed it straight into an AI machine
(usually MTCNN). Most appearance based methods
compile feature vectors which they then feed into an
AI machine (often MTCNN, but Webgazer (Papout-
saki et al., 2016) used SVM).
There are also methods (Ishikawa et al., 2004)
(Weidenbacher et al., 2006) (Sapienza and Camilleri,
2014) which are model based in that they label the
features they find as eyes, nose, mouth, but also ap-
pearance based in that they work entirely with what
can be seen.
Instead of dividing ideas into camps, it is more
helpful to look at individual methods, see how well
they work, and assess their advantages and disadvan-
tages.
For work prior to 2016, we commend Open-
Face (Baltrusaitis et al., 2016), an open-source toolkit
which implements those gaze tracking ideas current
at the time, including MTCNN, HOG/SVM and Haar
Cascades.
As a representative example of the current state of
the art, we can consider Zhang (Zhang, 2016), which
describes a complete gaze-tracking system, from sys-
tem components, through implementation and inte-
gration, to applications, testing and user assessment.
It also includes a good general overview of the state
of the art at the time.
Zhang’s approach was to use a neural network,
to which the input was an annotated image accom-
panied by a selection of features. This raises ques-
tions: surely the strength of neural networks is that
they can detect which patterns are important; if we al-
ready know what is important in an image, why use a
neural network at all?
More recent developments include:
Zhang (Zhang et al., 2019) present OpenGaze,
an open-source toolkit for appearance-based gaze es-
timation and interaction. OpenGaze is largely a front-
end for OpenFace (Baltrusaitis et al., 2016).
Hagihara (Hagihara et al., 2018) creates a map-
ping between objects in the real world and objects the
user looks at. To this end, they present a 3D gaze
tracker which tracks depth as well as x-y coordinates.
Their implementation requires the user to wear a hel-
met or eye-tracker.
Mardanbegi (Mardanbegi et al., 2019) use
vestibulo-ocular reflex to determine how far away the
face is from whatever the user is looking at. But mea-
surements are made using a virtual reality headset,
and require accuracies which cannot be achieved us-
ing a typical laptop camera.
Although almost all recent systems have been
built around neural networks, it appears that diminish-
ing returns are setting in, with each new development
being based on a more subtle aspect of the human eye,
resulting in a smaller incremental improvement on its
predecessor.
While black-box methods such as neural networks
have their advantages, they are uninformative. In
practice, mere success is insufficient—we want some-
thing which works for reasons which we understand,
in order that, when it doesn’t work, we understand
why and can fix (or work around) the problem. Fur-
thermore, even accepting that predictions from neural
networks will be somewhat more reliable than those
from simpler methods (since the neural network takes
much more information into account when making its
prediction), this does not necessarily mean that a neu-
ral network system will be more reliable. Prediction
is only one component of the overall system: a simpler but
faster component can make far more predictions in
the same time, and statistical analysis of many predic-
tions could well lead to better results than dependence
on one somewhat better prediction. Only experiment
can determine which approach gives better results.
For an alternative approach, we must go back
to (Kazemi and Sullivan, 2014), which implements
(Dollár et al., 2010)'s Cascaded Pose Regression and
(Cao et al., 2012)’s Ferns. This approach has proved
popular with hobbyists, and has been implemented
by (Xu et al., 2015) and (Papoutsaki et al., 2016)
amongst others. The key ideas here are (a) that in-
cremental improvement can turn a good estimate into
a better one and (b) that “anywhere in the image” is
a sufficiently good starting point. While we agree
wholeheartedly with (a), we cannot agree with (b)—
what appears in the background in any image is be-
yond our control, and we cannot predict how it may
disrupt iterative improvement.
2.3 Choice of Landmarks
How many facial landmarks are required?
The Dlib implementation (King, 2009) of
(Kazemi and Sullivan, 2014) locates and tracks 68 fa-
cial landmarks, but this is surely excessive. As an
alternative, Dlib provides an option for detecting just
5 landmarks: four eye points (inside and outside cor-
ners of the left and right eyes) and one nose point (the
base of the nasal septum).
FastHpe (Sapienza and Camilleri, 2014) locates
four facial features: left and right eyes, nose, and
mouth. The precise landmarks are not specified—
features are used to detect motion by comparing one
frame to the next, not to determine head pose.
Clearly, if we are to perform 3D calculations, we
require at least four landmarks, which must not lie
in the same plane (all five of the Dlib landmarks are
coplanar), and which must be in the rigid part of the
face (and not on the mouth, which can move indepen-
dently).
2.4 Eyes
Two methods for obtaining eye regions stand out:
Haar Cascades (Viola and Jones, 2004), and Cascaded
Pose Regression (Dollár et al., 2010) and its deriva-
tives. If we prefer the former, it is because there
are several good and readily-available Haar cascades
for eyes, notably Yu’s left- and right-eye cascades
(OpenCV, 2015). Asteriadis (Asteriadis et al., 2006)
has observed that the lower 60% of the regions re-
turned by Yu’s cascades are centred on the pupil.
FastHpe (Sapienza and Camilleri, 2014) uses Haar
Cascades. Applications based on the 68-point version
of Dlib (King, 2009) use Cascaded Pose Regression.
Both detect eyes but not pupils.
(Timm and Barth, 2011) use Haar Cascades to de-
termine an initial region of interest, and follow this
with a gradient-following method to determine pupil
positions: the pupil is the point at which most gradi-
ent vectors cross.
Recent ideas which are worthy of investigation in-
clude:
Liu (Liu et al., 2017) introduced a geometric re-
formulation which maintains the relationship between
left and right eyes when head pose changes.
Zhang (Zhang, 2016) introduced the Pupil-Canthi-
Ratio, which could usefully be included in any ap-
proach which compiles localised features.
Cheng (Cheng et al., 2020) identifies the user’s
dominant eye, and uses that rather than the other one
(or a combination of the two) for gaze-tracking. Their
results are often (but not always) better than those of
previous methods developed by the same authors.
2.5 Noses
Compared with eyes, noses have received relatively
little attention. One might think that, as noses
are a consistent and readily-identifiable shape, they
would be an ideal application for Haar cascades, but
the reality is otherwise—the only readily-available
Haar nose cascade, that of (Castrillón et al., 2007),
is far from reliable, as it is not clear which nose land-
mark it detects. Nevertheless, FastHpe (Sapienza and
Camilleri, 2014) uses this cascade.
The 68-point version of Dlib (King, 2009), imple-
menting the ideas of (Kazemi and Sullivan, 2014), lo-
cates nine nose points: four in a line from the bridge
to the tip, and a further five in an arc covering the nos-
trils. It can be noted that this method is less successful
in practice for noses than for other facial landmarks:
it is slower to converge, and the results are less accu-
rate. It is also hard-coded, so impossible to modify.
The more recent 5-point version of Dlib locates
just one nose point, the base of the nasal septum.
It is for these reasons that we prefer simple ideas
such as that of Varley (Varley et al., 2021). This max-
imises a pixel-intensity-difference operation 2M − L − N
between three squares: M is centred on the
nose tip, and L and N are below and either side of it—
nose tips protrude from the face and catch the light,
whereas nostrils are concave and dark, as can be seen
in Figure 1. Understanding how it works means that
we are aware of its limitations: although this method
is quite good at finding nose tips, it is even better at
finding ear lobes, so it must be constrained to a region
of interest which includes the centre of the face and
excludes the ears.
Figure 1: Averaged Nose Tips and Surrounding Regions.
3 IMPLEMENTATION
There are four points on the human face which are
mathematically unique: the two pupils, the nose
bridge, and the nose tip. The important implication
of mathematical uniqueness is that, given a reason-
able estimate of where the feature probably is, we can
then use optimisation methods to determine a more
accurate position.
We make use of these four points as follows:
1. Locate faces in the image. See Section 3.1.
2. Locate the mouth region in each face. See Section
3.2.
3. Locate the eye regions in each face. See Section
3.3.
4. Find the pupils for each eye. See Section 3.4.
5. Find the nose tip. See Section 3.5.
6. Find the nose bridge. See Section 3.6.
7. Calculate the head pose: tilt angle, nod and turn,
and eye gaze. See Section 3.7.
8. Predict the gaze target. See Section 3.8.
3.1 Faces
We start our analysis by locating faces in an image.
Starting with an RGB image, we take the red channel,
which leads to slightly better results than the more
usual greyscale. We use Lienhart’s Alt2 Frontal Face
detector (Lienhart and Maydt, 2002), which in prac-
tice we have found to be most reliable, to find the face
regions. If several regions are found and they do not
overlap, we process the best of them (regions with two
eyes are better than regions with one eye, which are
better than regions with no eyes). The face region is
used to constrain mouth, eye and nose regions of in-
terest.
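As a minimal sketch (in Python, with OpenCV's bundled cascade file), this step might look as follows; the image path and detector parameters are illustrative assumptions rather than the values used in our implementation.

    import cv2

    # Sketch: detect face regions on the red channel using Lienhart's Alt2
    # frontal-face cascade, which ships with OpenCV.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_alt2.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    img = cv2.imread("frame.png")      # hypothetical input frame (BGR)
    red = img[:, :, 2]                 # red channel (index 2 in BGR order)
    faces = face_cascade.detectMultiScale(red, scaleFactor=1.1, minNeighbors=4)
    # 'faces' is an array of (x, y, w, h) rectangles; non-overlapping regions
    # would then be ranked by how many eyes are detected inside each of them.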
3.2 Mouths
We locate a mouth region in each face. As a non-
rigid feature, the mouth itself is inappropriate for use
in gaze prediction, so these mouth regions are used
solely to constrain eye and nose regions of interest, for
which mouth regions provide a suitable lower bound.
We use Deniz’s Smile detector (OpenCV, 2015)
and Castrill
´
on’s Mouth detector (Castrill
´
on et al.,
2007) to find the mouth region. Although neither of
these is entirely reliable, by running both and com-
paring the results we can usually determine a reliable
mouth region.
3.3 Eye Regions
We use Yu’s Left Eye and Right Eye detectors
(OpenCV, 2015) to find eye regions, and where pos-
sible we follow Asteriadis’s (Asteriadis et al., 2006)
recommendation of using the lower 60% of this re-
gion.
Ideally, Yu’s cascades will find one left and one
right eye. Sometimes they do not, but by applying
common sense rules for selecting/estimating missing
regions we can usually get a useable result anyway.
These rules are:
If the same detector (left or right eye) detects two
eye regions which overlap, merge them
If a detected left eye overlaps with a detected right
eye, remove the one which is on the wrong side of
the face
If there is one left eye and more than one right eye
(or vice versa), pick the one nearest the reflection
across the face of the unique eye and discard the
others
If there is only one eye, estimate the other one by
reflection across the face.
The eye regions are not used directly in calculat-
ing gazes, but are one of the best predictors of pupil
position, as described next.
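Two of the rules above lend themselves to a short sketch; the rectangle format (x, y, w, h) matches what OpenCV's detectMultiScale returns, and the helper names are our own.

    def rects_overlap(a, b):
        # True if two (x, y, w, h) rectangles overlap.
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    def reflect_across_face(eye_rect, face_rect):
        # Estimate a missing eye by reflecting the detected one across the
        # vertical midline of the face rectangle (the final rule above).
        ex, ey, ew, eh = eye_rect
        fx, _, fw, _ = face_rect
        mirrored_centre_x = 2 * (fx + fw / 2.0) - (ex + ew / 2.0)
        return (int(mirrored_centre_x - ew / 2.0), ey, ew, eh)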
3.4 Pupils
The centre of each eye pupil is at the centre of a re-
gion with approximate rotational symmetry in which
pixel intensity increases with distance from the cen-
tre: pupils are darker than irises, and irises are darker
than sclerae. (Occasionally, specular reflection may
interfere with this general rule.)
Figure 2: Averaged Nose Bridges and Surrounding Regions.
There are two fairly-reliable methods for locating
pupils.
When Yu’s cascades find one unambiguous left or
right eye (in about 86% of faces), the best estimate of
pupil position is the centre of the lower 60% (Asteri-
adis et al., 2006) of the cascade region.
Alternatively, given a region of interest, the pupil
is at the centre of the darkest 5×5 patch in this region.
The reliability is around 64%.
By cascading these methods, we can find pupil lo-
cations with an accuracy of ±1 pixel and a reliability
of around 95%.
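A sketch of the two estimators described above is given below; the 5×5 patch and the lower-60% rule are as stated, while the helper names and the border margin are assumptions of ours.

    import cv2
    import numpy as np

    def pupil_from_cascade_rect(eye_rect):
        # Primary estimate: centre of the lower 60% of the eye cascade
        # rectangle (Asteriadis et al., 2006); eye_rect is (x, y, w, h).
        x, y, w, h = eye_rect
        return (x + w // 2, y + int(0.4 * h) + int(0.6 * h) // 2)

    def pupil_from_darkest_patch(gray_roi):
        # Secondary estimate: centre of the darkest 5x5 patch in the eye
        # region of interest (pupils are darker than irises and sclerae).
        means = cv2.blur(gray_roi.astype(np.float32), (5, 5))
        inner = means[2:-2, 2:-2]          # ignore patches crossing the border
        y, x = np.unravel_index(np.argmin(inner), inner.shape)
        return (x + 2, y + 2)              # coordinates relative to the ROI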
3.5 Nose Tips
We use the method described in (Varley et al., 2021).
As can be seen in Figure 1, in a typical face, the
nose tip is the furthest point on the face from the face
plane and catches the light best, and the nostrils are
always below and either side of the nose tip and are
darker than their surroundings. Even when tightly
constrained, the method sometimes finds other nose
landmarks rather than tips; when it finds the correct
landmark (in about 86% of cases), median accuracy
is ±1 pixel.
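A sketch of the 2M − L − N score from Section 2.5, with a brute-force search standing in for the optimisation described above; the square size and nostril offset are illustrative assumptions, not values taken from (Varley et al., 2021).

    import numpy as np

    def nose_tip_score(gray, x, y, half=4, offset=6):
        # 2*M - L - N for a candidate nose tip at (x, y): M is the mean
        # intensity of a square centred on the candidate (bright tip), L and N
        # of squares below and either side of it (dark nostrils).
        def mean_square(cx, cy):
            return gray[cy - half:cy + half + 1, cx - half:cx + half + 1].mean()
        M = mean_square(x, y)
        L = mean_square(x - offset, y + offset)
        N = mean_square(x + offset, y + offset)
        return 2.0 * M - L - N

    def find_nose_tip(gray, roi):
        # Exhaustive maximisation of the score inside a region of interest
        # (x0, y0, x1, y1) chosen to include the face centre and exclude the ears.
        x0, y0, x1, y1 = roi
        best, best_xy = -np.inf, None
        for y in range(y0, y1):
            for x in range(x0, x1):
                s = nose_tip_score(gray, x, y)
                if s > best:
                    best, best_xy = s, (x, y)
        return best_xy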
3.6 Nose Bridges
By virtue of its central position, the nose bridge is the
reference point from which all gaze predictions start,
as well as contributing to the calculation.
The nose bridge is at the centre of a saddle point,
with eye regions to either side, and skin above and be-
low. As can be seen in Figure 2, eye regions are typ-
ically occluded by foreheads, noses and cheeks, and
are thus dark, while the skin of the forehead and nose
is in front of that of the bridge and thus bright.
To locate nose bridges, we use a method concep-
tually similar to the nose tip method above. We max-
imise the image intensity differences of four rectangles,
(V + W) − (M + N), where V and W are above
and below, and M and N are either side of the nose
bridge.
As there are potentially several saddle points in
a face image, this method must also be constrained
to an appropriate region of interest—in practice, an
initial estimate that it is somewhere between the eyes
is good enough. The method reliably finds the correct
landmark, but median accuracy is only ±2 pixels.
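The nose bridge score can be sketched in the same way as the nose tip score; again, the square size and offsets are illustrative assumptions.

    def nose_bridge_score(gray, x, y, half=3, offset=8):
        # (V + W) - (M + N) for a candidate nose bridge at (x, y): V and W are
        # mean intensities of squares above and below the candidate (bright
        # forehead and nose skin), M and N of squares either side of it (dark,
        # occluded eye regions).
        def mean_square(cx, cy):
            return gray[cy - half:cy + half + 1, cx - half:cx + half + 1].mean()
        V = mean_square(x, y - offset)
        W = mean_square(x, y + offset)
        M = mean_square(x - offset, y)
        N = mean_square(x + offset, y)
        return (V + W) - (M + N)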
3.7 Gazes
We require a system of equations for determining
gaze predictions. How many measurable values are
there, and how many unknowns?
By locating the landmarks described above, we
obtain eight measurable values: the x- and y-
coordinates of the four local-minima landmarks. L
and R are the left and right pupil coordinates, and V
and H are the nose tip and bridge coordinates.
The head is located somewhere in xyz space,
where x and y are horizontal and vertical coordinates
in the image, and z is distance from the camera.
We model the three human head movements (nod-
ding, shaking, tilting) as (pitch, yaw, roll) of a “disem-
bodied head” (Murphy-Chutorian and Trivedi, 2009)
which rotates about a centre point between the eyes
and behind the nose bridge. This centre of rotation is
located at (X,Y,Z) in xyz space. X and Y have to be
determined, but in this analysis we assume that Z can
be estimated from anatomical parameters such as the
inter-eye distance.
We model the two human eye movements (glanc-
ing aside, glancing upwards) as horizontal and verti-
cal translations (in principle, they are pitch and yaw
about the centre of the eyeball, but there is little to be
gained by modelling this added complexity).
For the purposes of simple analysis, with respect
to this centre of rotation, when the head is facing for-
ward (pitch, yaw and roll all 0):
the left pupil is at (+E, 0, 0),
the right pupil is at (−E, 0, 0),
the nose bridge is at (0, 0, B),
the nose tip starts at (0, +D, C).
This gives us four anatomical parameters:
E: inter-eye distance,
B: protrusion of the nose bridge from the face
plane,
C: protrusion of the nose tip from the face plane,
D: vertical distance from the nose bridge to the
nose tip.
Heads have five angular degrees of freedom:
N is nodding (pitch), which rotates the head in the
yz plane, leaving x unchanged,
S is shaking (yaw), which rotates the head in the
xz plane, leaving y unchanged,
T is tilting (roll), which rotates the head in the xy
plane, leaving z unchanged,
P is glancing aside, which in principle rotates the
pupils in the xz plane, leaving y unchanged; we
treat it as a translation along the x-axis,
U is glancing upwards, which in principle rotates
the pupils in the yz plane, leaving x unchanged;
we treat it as a translation along the y-axis.
Table 1: Notation.
Notation type meaning
L point left pupil coordinates
R point right pupil coordinates
V point nose tip coordinates
H point nose bridge coordinates
E length inter-eye distance
B length nose bridge protrusion
C length nose tip protrusion
D length nose height (tip to bridge)
N angle nod (pitch)
S angle shake (yaw)
T angle tilt (roll)
P (angle) glancing aside
U (angle) glancing upwards
X scalar centre of rotation x
Y scalar centre of rotation y
This leaves us with eight equations in eleven un-
knowns (see Table 1 for a full list of data points and
unknowns, and Figure 3 for an illustration). In order
to make the problem tractable, we must remove three
unknowns, and we choose to do this by estimating
other anatomical parameters B,C,D as a fixed propor-
tion of inter-eye distance E. As will be seen in Section
5.1, this can lead to problems when the subjects have
particularly small or large noses.
For simplicity, terms of the form sin(α)sin(β)
have been removed as in most cases angles are small.
V.x = X − C sin(S) cos(N) − D cos(S) sin(T) (1)
Figure 3: Notation: Points, Distances and Angles.
V.y = Y + D cos(N) cos(T) − C sin(N) (2)
H.x = X − B sin(S) cos(N) (3)
H.y = Y − B sin(N) (4)
L.x = X + (P + E/2) cos(S) cos(T) + U cos(S) sin(T) (5)
L.y = Y + (P + E/2) cos(N) sin(T) − U cos(N) cos(T) (6)
R.x = X + (P − E/2) cos(S) cos(T) + U cos(S) sin(T) (7)
R.y = Y + (P − E/2) cos(N) sin(T) − U cos(N) cos(T) (8)
Solving this system analytically might be possi-
ble, but we prefer a simpler approach. We rearrange
the equations into pairs:
L.x − R.x = E cos(S) cos(T) (9)
L.y − R.y = E cos(N) sin(T) (10)
L.x + R.x = 2X + 2P cos(S) cos(T) + 2U cos(S) sin(T) (11)
L.y + R.y = 2Y + 2P cos(N) sin(T) − 2U cos(N) cos(T) (12)
V.x − H.x = (B − C) sin(S) cos(N) − D cos(S) sin(T) (13)
V.y − H.y = D cos(N) cos(T) − (C − B) sin(N) (14)
V.x + H.x = 2X − D cos(S) sin(T) − (B + C) sin(S) cos(N) (15)
V.y + H.y = 2Y + D cos(N) cos(T) − (B + C) sin(N) (16)
This can be solved iteratively (set all cosines to 1
and U to 0 in the first iteration):
E = magnitude(L − R) / (cos(S) cos(T)) (17)
sin(T) = (L.y − R.y) / (E cos(N)) (18)
sin(N) = (V.y − H.y − D cos(N) cos(T)) / (B − C) (19)
sin(S) = ((V.x − H.x) + D cos(S) sin(T)) / ((B − C) cos(N)) (20)
X = (V.x + H.x + (B + C) sin(S) cos(N) + D cos(S) sin(T)) / 2 (21)
Y = (V.y + H.y + (B + C) sin(N) − D cos(N) cos(T)) / 2 (22)
P = ((L.x + R.x)/2 − X − U cos(S) sin(T)) / (cos(S) cos(T)) (23)
U = (Y − (L.y + R.y)/2 + P cos(N) sin(T)) / (cos(N) cos(T)) (24)
We have found that, if implemented as-is, this sequence takes some time to converge as it oscillates.
However, by smoothing the calculation of E (Equation 17) so that E = (E1 + E0)/2 for previous value E0 and
new value E1, it converges very quickly. We use 10 iterations, but 4 should be sufficient for stable predictions.
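A sketch of this iteration in Python is given below; it assumes B, C and D have already been fixed (for example as proportions of an initial estimate of E), and the recovery of each cosine as sqrt(1 − sin²) is our reading of the small-angle treatment rather than something stated explicitly above.

    import math

    def solve_head_pose(L, R, V, H, B, C, D, iters=10):
        # Iterative solution of Equations 17-24. L, R, V, H are (x, y) image
        # coordinates of the left pupil, right pupil, nose tip and nose bridge
        # (the notation of Table 1); B, C, D are the anatomical lengths.
        cosS = cosT = cosN = 1.0           # first iteration: all cosines 1
        sinS = sinT = sinN = 0.0
        U = P = 0.0                        # first iteration: U = 0
        E_prev = None
        for _ in range(iters):
            # Eq 17, with the smoothing E = (E1 + E0)/2 that damps oscillation
            E_new = math.hypot(L[0] - R[0], L[1] - R[1]) / (cosS * cosT)
            E = E_new if E_prev is None else 0.5 * (E_new + E_prev)
            E_prev = E
            sinT = (L[1] - R[1]) / (E * cosN)                              # Eq 18
            sinN = (V[1] - H[1] - D * cosN * cosT) / (B - C)               # Eq 19
            sinS = ((V[0] - H[0]) + D * cosS * sinT) / ((B - C) * cosN)    # Eq 20
            cosT = math.sqrt(max(0.0, 1.0 - sinT * sinT))
            cosN = math.sqrt(max(0.0, 1.0 - sinN * sinN))
            cosS = math.sqrt(max(0.0, 1.0 - sinS * sinS))
            X = (V[0] + H[0] + (B + C) * sinS * cosN + D * cosS * sinT) / 2.0  # Eq 21
            Y = (V[1] + H[1] + (B + C) * sinN - D * cosN * cosT) / 2.0         # Eq 22
            P = ((L[0] + R[0]) / 2.0 - X - U * cosS * sinT) / (cosS * cosT)    # Eq 23
            U = (Y - (L[1] + R[1]) / 2.0 + P * cosN * sinT) / (cosN * cosT)    # Eq 24
        return E, sinN, sinS, sinT, X, Y, P, U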
Figure 4: Original and Processed Images: Looking Right, Down, Left.
3.8 Gaze Target
The overall gaze prediction is a vector, the sum of
the head pose and eye gaze, originating from the nose
bridge.
Geometrically, the head pose (S,N) and pupil dis-
placement (P,U ) are 2D projections of 3D vectors
originating from (X,Y ); we must multiply them by
the distance from the subject to the camera to find
the gaze target. We estimate this distance from E,
the inter-eye distance, as W /E where W is a tune-
able program parameter. A further complication is
that, anatomically, eye movements are more subtle
than head movements, so must be scaled up to obtain
the correct effect; scaling factor F is another tuneable
parameter. We thus calculate the gaze target G:
G = [X, Y] + W ([S, N] + F [P, U]) (25)
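In code, Equation 25 is a one-liner; the function signature is ours, with W and F the tuneable parameters named above.

    def gaze_target(X, Y, S, N, P, U, W, F):
        # Equation 25: head pose (S, N) plus scaled eye gaze (P, U), projected
        # from the centre of rotation (X, Y) onto the screen plane.
        return (X + W * (S + F * P), Y + W * (N + F * U))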
4 RESULTS
The images in the top row of Figure 4 were extracted
from a short test video in which the subject’s intention
was to keep her gaze steady while moving her head,
and processed as described in Section 3. The results
are as shown in the second row of Figure 4. Images
and results have been cropped to remove irrelevant
background.
In the results images:
The pink square shows the face rectangle as found
by the face Haar cascade
The purple rectangle shows the mouth rectangle
as determined by the Haar cascades
The yellow rectangles show the left and right eye
rectangles as found by the two eye Haar cascades
The yellow dots show the predicted positions of
the subject’s left and right pupils
The green rectangle shows the region of interest
which constrains the search for nose tips
The green dots show the predicted positions of the
subject’s nose tip and nose bridge
The red square marks the predicted head pose vec-
tor
The blue square marks the predicted eye gaze vec-
tor
The white square marks the overall gaze predic-
tion.
For example, in the top left image in Figure 4, the
subject has moved her head to the right, and is glanc-
ing to her left so as to keep the camera in view. Thus
the head pose prediction is to her right; the eye gaze
prediction is to her left; and the overall gaze predic-
tion is (relatively) central.
5 STABILITY
We consider three sources of instability which could
disrupt our gaze predictions: variations of anatomy;
outliers such as those caused by failure to detect land-
marks correctly; and small errors such as those im-
posed by the limitations of pixel resolution.
5.1 Anatomy
Figure 5: Image: A Short Nose (Huang et al., 2007).
Figure 6: Processed Image: Head Pose Too High.
We noted in Section 3.7 that estimating anatomical
nose parameters as a fixed proportion of inter-eye distance
could lead to problems when the subjects have
particularly short or long noses. This proves to be the
case in practice: short noses appear to be pointing up-
wards (as in Figures 5 and 6, where D/E = 0.352),
and long noses appear to be pointing downwards (as
in Figures 7 and 8, where D/E = 0.735). (Both im-
ages are taken from the LFW dataset (Huang et al.,
2007).)
This can be overcome by recalibration—the ra-
tios B/E, C/E and D/E are constant for any individ-
ual user—but recalibrating for each new user is time-
consuming and tedious.
If we want a system which works for everyone
straight out of the box, there is no easy solution. In
order to allow for the full variety of human noses,
we shall need more nose landmarks, and our system
and the equations which describe it will inevitably be-
come more complex.
Figure 7: Image: A Long Nose (Huang et al., 2007).
Figure 8: Processed Image: Head Pose Too Low.
5.2 Outliers
Although outliers can occur for any number of rea-
sons, the most common cause in frontal faces is the
nose tip finder described in Section 3.5, which in test-
ing found the wrong landmark in about 14% of im-
ages. For example, when processing Figure 9, it has
found the wing of the nose rather than the nose tip—
see Figure 10.
This result is typical: although an outlier in the
nose tip prediction has caused a large error in the
estimated head pose, the resulting error in eye gaze
prediction often almost compensates for this, and the
overall resulting error is surprisingly small.
Figure 9: Original Image: Looking Up.
Figure 10: Processed Image: Landmark Failure.
When the head pose is to one side (in our implementation, beyond 28°), the Haar cascades used for
finding pupils become unreliable for the more distant eye. Sometimes they fail altogether, and sometimes
they return ambiguous regions of interest which
can include hair or eyebrows, causing the secondary
method for locating pupils to fail too. (Timm and
Barth, 2011) have reported similar problems. In such
cases, the head pose prediction can be very poor.
5.3 Small Errors
When using a single camera, errors of ±1 pixel are
commonplace and, in practice, unavoidable. What ef-
fect do they have on gaze prediction?
The subject in Figure 11 (again taken from the
LFW dataset (Huang et al., 2007)) has a particularly
average nose, D/E = 0.568, very close to the median
ratio of nose length to inter-eye distance, so is a suit-
able subject for a sensitivity analysis. We estimate, by
comparison with other images, that this image corre-
sponds to a face around 123 cm from the camera.
We find by varying the labelled landmark positions that:
a 1-pixel x error in the nose bridge position changes the shake angle by 3.23°,
a 1-pixel y error in the nose bridge position changes the nod angle by 3.35°,
a 1-pixel x error in the nose tip position changes the shake angle by 3.27°,
a 1-pixel y error in the nose tip position changes the nod angle by 3.35°.
Figure 11: A Typical Nose (Huang et al., 2007).
These angles will increase with distance, as the
size of each landmark in the image decreases. Fur-
thermore, the absolute error on the screen for any
given angle error will also increase with distance.
How far from the camera can we go before the error
becomes unacceptable?
Table 2: Absolute Error Estimates vs Distance.
Distance (cm) x-error (cm) y-error (cm)
123 7 7
150 11 10
200 19 19
250 30 29
300 43 42
350 59 57
400 77 75
450 98 95
500 121 118
550 147 143
600 176 171
650 207 202
700 242 236
750 279 272
800 320 311
850 363 354
900 410 399
950 461 448
1000 515 500
Assuming that (a) distance has no other effect on
gaze prediction than reducing the size of the face, (b)
the angle error resulting from a 1-pixel error increases
in proportion to the distance, and (c) the resulting ab-
solute error from a given angle error is in proportion
to the distance, we obtain the figures in Table 2.
For example, at 250cm, an error of ±1 pixel can
change the nod angle by 6.57° and/or the shake angle
by 6.81°, leading to a horizontal error of 30cm and/or
a vertical error of 29cm.
Thus, if the target is an 80 × 60 cm box on a 240 ×
180 cm screen on the wall of a 10 × 6m room, a 1-
pixel error from 300 cm will miss the box, a 1-pixel
error from 450–500 cm will miss the screen, and a
1-pixel error from 750–800 cm will miss the wall.
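As a rough check, under assumptions (b) and (c) the absolute error grows approximately with the square of the distance, so Table 2 can be reproduced from the 123 cm baseline; the short script below uses only the single 3.27° shake figure, so its numbers differ slightly from the tabulated ones.

    import math

    # Rough reconstruction of Table 2: the 3.27 degree shake error measured at
    # 123 cm is scaled linearly with distance (assumption (b)), and the
    # on-screen error is distance times the tangent of that angle (assumption (c)).
    base_dist = 123.0                         # cm
    base_angle = math.radians(3.27)           # 1-pixel shake error at 123 cm
    for dist in range(150, 1001, 50):
        angle = base_angle * (dist / base_dist)
        err = dist * math.tan(angle)
        print(f"{dist:4d} cm -> ~{err:5.0f} cm horizontal error")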
6 CONCLUSIONS
We have shown that a minimal gaze prediction sys-
tem using only four points can make reasonably reli-
able predictions for subjects with average noses who
sit within 2m of the camera. This system is easily im-
plemented, requiring only four or five Haar cascades
(all of which are bundled with OpenCV). It is easy to
modify, or even replace, any of the landmark locators.
This simplicity comes at some cost. There are
places where we could use more data points, most ob-
viously where we have to make assumptions about the
anatomical proportions of the face.
What can be done for people with small or large
noses? We could add calibration to retune the sys-
tem for each new user; the cost is ease of use. Alter-
natively, the methods for locating nose tips and nose
bridges are reasonably reliable, and we could in prin-
ciple use similar methods to identify other landmarks
on the nose, giving us extra equations; the cost is
added complexity.
We would also like to be able to weight our cal-
culations so that, when the head is turned, we give
priority to the nearer eye. This would be particularly
useful in those cases where the head is turned and the
location of the more distant eye has not been deter-
mined correctly. With only four points, there is no
redundancy, and no opportunity to give some points
higher weightings than others.
Although it may appear counter-intuitive, gross
outliers are not usually a serious problem. In a video-
processing system in which landmarks are tracked
from one frame to the next, outliers can be caught and
discarded.
The most serious problem is that of small errors
becoming large errors with increasing distance from
the camera, as this imposes a limit on the distance at
which gaze prediction can be useful.
On this basis, we can assess the potential appli-
cations listed in Section 1. Interactive display boards
used from a distance of 1–2m should cer-
tainly be possible. Multi-user interactive boards may
be restricted in the number of users, as it will be dif-
ficult to place them so that they are less than 2.5m
from the board but more than 2m from one another.
Sadly, gaze-controlled smart homes may not yet be
realistic, as even if the screen is placed on the centre
of the longer wall of a 5 × 3m living room, there will
be locations in the room which are out of range.
At present, it seems that the best workaround is
to improve the hardware: either buy a more expen-
sive camera with higher resolution, or (better still) use
multiple cameras.
The natural progression is from still images to
video sequences. Before we make this leap, we must
ensure that our system is ready for it.
ACKNOWLEDGEMENTS
The authors wish to acknowledge the project: “Set-
ting up of transdisciplinary research and knowledge
exchange (TRAKE) complex at the University of
Malta (ERDF.01.124)”, which is co-financed by the
European Union through the European Regional De-
velopment Fund 2014–2020.
REFERENCES
Asteriadis, S., Nikolaidis, N., Hajdu, A., and Pitas, I.
(2006). An eye detection algorithm using pixel to edge
information. In ICCVW.
Baltrusaitis, T., Robinson, P., and Morency, L.-P. (2016).
Openface: An open source facial behavior analysis
toolkit. In 2016 IEEE Winter Conference on Applica-
tions of Computer Vision (WACV), Lake Placid, NY,
pages 1–10.
Cao, X., Wei, Y., Wen, F., and Sun, J. (2012). Face align-
ment by explicit shape regression. In CVPR, pages
2887–2894.
Castrillón, M., Déniz, O., Hernández, M., and Guerra, C.
(2007). Encara2: Real-time detection of multiple
faces at different resolutions in video streams. In Jour-
nal of Visual Communication and Image Representa-
tion Vol 18 No 2, pages 130–140.
Cheng, Y., Zhang, X., Lu, F., and Sato, Y. (2020). Gaze
estimation by exploring two-eye asymmetry. In IEEE
Transactions on Image Processing (TIP), 29(1), pages
5259–5272.
Damen, I., Lallemand, C., Brankaert, R., Brombacher, A.,
van Wesemael, P., and Vos, S. (2020). Understanding
walking meetings: Drivers and barriers. In ACM Pro-
ceedings of CHI 2020.
Dollár, P., Welinder, P., and Perona, P. (2010). Cascaded
pose regression. In CVPR, pages 1078–1085.
GazeRecorder (2020). Gazerecorder webcam eye tracking.
https://gazerecorder.com/.
Hagihara, K., Taniguchi, K., Abibouraguimane, I., Itoh,
Y., Higuchi, K., Otsuka, J., Sugimoto, M., and Sato,
Y. (2018). Object-wise 3d gaze mapping in physical
workspace. In Proc. Augmented Human 2018, pages
25:1–25:5.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller,
E. (2007). Labeled faces in the wild: A database for
studying face recognition in unconstrained environ-
ments. Technical Report 07-49, University of Mas-
sachusetts, Amherst.
Ishikawa, T., Baker, S., Matthews, I., and Kanade, T.
(2004). Passive driver gaze tracking with active ap-
pearance models. In Proceedings of the 11th World
Congress on Intelligent Transportation Systems.
Kazemi, V. and Sullivan, J. (2014). One millisecond face
alignment with an ensemble of regression trees. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 1867–1874.
King, D. E. (2009). Dlib-ml: A machine learning toolkit.
In Journal of Machine Learning Research 10, pages
1755–1758.
Lienhart, R. and Maydt, J. (2002). An extended set of haar-
like features for rapid object detection. In Proceed-
ings. 2002 International Conference on Image Pro-
cessing volume 1, pages 900–903. IEEE.
Liu, Y., Lee, B.-S., and McKeown, M. (2017). A new re-
construction method in gaze estimation with natural
head movement. In Fifteenth IAPR International Con-
ference on Machine Vision Applications (MVA), May
2017.
Mardanbegi, D., Clarke, C., and Gellersen, H. (2019).
Monocular gaze depth estimation using the vestibulo-
ocular reflex. In Proceedings - ETRA 2019: 2019
ACM Symposium On Eye Tracking Research and Ap-
plications, page 20. ACM.
Murphy-Chutorian, E. and Trivedi, M. M. (2009). Head
pose estimation in computer vision: A survey. In IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence Volume 31 No 4.
OpenCV (2015). Open Source Computer Vision Library.
Papoutsaki, A., Sangkloy, P., Laskey, J., Daskalova, N.,
Huang, J., and Hays, J. (2016). Webgazer: Scalable
webcam eye tracking using user interactions. In Pro-
ceedings of the 25th International Joint Conference
on Artificial Intelligence (IJCAI), pages 3839–3845.
AAAI.
Sapienza, M. and Camilleri, K. P. (2014). Fasthpe: A recipe
for quick head pose estimation. Technical Report TR-
SCE-2014-01, University of Malta.
Sidenmark, L., Mardanbegi, D., Ramirez Gomez, A.,
Clarke, C., and Gellersen, H. (2020). Bimodalgaze:
Seamlessly refined pointing with gaze and filtered ges-
tural head movement. In ETRA ’20 Proceedings of the
12th ACM Symposium on Eye Tracking Research and
Applications. ACM, ACM.
Timm, F. and Barth, E. (2011). Accurate eye centre local-
isation by means of gradients. In Proceedings. 6th
International Conference on Computer Vision, Imag-
ing and Computer Graphics Theory and Applications,
pages 125–130.
Tofel, K. C. (2020). Eye-gaze tracking on a smart
display: The next smart home interface?
https://staceyoniot.com/eye-gaze-tracking-on-a-
smart-display-the-next-smart-home-interface/.
Varley, P. A., Cristina, S., Bonnici, A., and Camilleri, K. P.
(2021). As plain as the nose on your face? In Pro-
ceedings. 16th International Conference on Computer
Vision, Imaging and Computer Graphics Theory and
Applications.
Viola, P. and Jones, M. (2004). Robust real-time face de-
tection. In International Journal of Computer Vision,
57(2), pages 137–154.
Weidenbacher, U., Layher, G., Bayerl, P., and Neumann,
H. (2006). Detection of head pose and gaze direction
for human-computer interaction. In Perception and
Interactive Technologies. PIT.
Xu, P., Ehinger, K. A., Zhang, Y., Finkelstein, A., Kulkarni,
S. R., and Xiao, J. (2015). Turkergaze: Crowdsourc-
ing saliency with webcam based eye tracking. Tech-
nical Report 1504.06755, arXiv preprint.
Zhang, X., Sugano, Y., and Bulling, A. (2019). Evaluation
of appearance-based methods and implications for
gaze-based applications. In Proc. 37th ACM SIGCHI
Conference on Human Factors in Computing Systems
(CHI 2019).
Zhang, Y. (2016). Eye tracking and gaze interface design
for pervasive displays. PhD thesis, University of Lan-
caster.