ATTENTION MODELS FOR VERGENCE MOVEMENTS BASED ON
THE JAMF FRAMEWORK AND THE POPEYE ROBOT
Niklas Wilming, Felix Wolfsteller, Peter König
Institute of Cognitive Science, University of Osnabrück, Albrechtstr. 28, Osnabrück, Germany
Rui Caseiro, João Xavier, Helder Araújo
Institute of Systems and Robotics - ISR, University of Coimbra, Portugal
Keywords:
Stereo vision, Visual attention, Active vision.
Abstract:
In this work we describe a novel setup for implementation and development of stereo vision attention models
in a realistic embodied setting. We introduce a stereo vision robot head, called POPEYE, that provides degrees
of freedom comparable to a human head. We describe the geometry of the robot as well as the characteristics
that make it a good candidate for studying models of visual attention. Attentional robot control is implemented
with JAMF, a graphical modeling framework that makes it easy to implement current state-of-the-art saliency
models. We give a brief overview of JAMF and show implementations of four exemplary attention models
that can control the robot head.
1 INTRODUCTION
In recent years the study of visual attention has be-
come more and more popular. For a review see
(Knudsen, 2007). Besides numerous behavioral and
physiological studies, a number of computational
models of attention have been proposed (Itti and
Koch, 2001). However, most models concentrate on
monocular input in eye-centered coordinates. This is
biologically implausible in at least two ways. First, it constrains salience approaches to monocular features
and neglects depth cues (Jansen et al., 2008). Second, it ignores the role of head movements, which show
distinct patterns in humans (Einhäuser et al., 2007).
Especially with regard to depth cues and head
movements there is a need to evaluate such models
in a realistic and embodied way. In order to fill this
gap the POPEYE stereo robot head (Figures 1 and 2)
was developed. It provides stereo video input and its
degrees of freedom are comparable to the major de-
grees of freedom of the oculomotor system: tilt, pan
and eye vergence.
Implementing attention models on such hardware
can be difficult and requires considerable tech-
nical knowledge. To ease the development of atten-
tion models that can control POPEYE’s movements
we use JAMF (Steger et al., 2008). It is a graphical at-
tention modeling framework that represents attention
models as directed graphs of functional units and is
well suited to express current state-of-the-art saliency
models.
In this work, we introduce a novel setup
that uses POPEYE to study different mod-
els of visual and auditory attention. First we in-
troduce the robot head, then give a brief overview of
JAMF and describe how both are integrated. Finally,
we give four examples of simple saliency map models
that control the robot head: one that considers "red"
as salient, a contrast-based model and two models that
extend the latter with face detection and optical flow.
2 RELATED WORK
In the literature other robotic heads have been de-
scribed. The ISR Multi-degree of freedom robot head
(Batista et al., 1995) supports many degrees of free-
dom, including eye zoom and independent tilts for
each eye. Fellenz and Hartmann present a simpler
robot made from off-the-shelf parts that still has ten
degrees of freedom (Fellenz and Hartmann, 2002).
Other robotic heads are embedded in humanoid robots
Figure 1: A CAD drawing of POPEYE.
that are designed with human interaction in mind,
like the Mertz robot (Aryananda and Weber, 2004),
which was given a friendlier appearance with a mask
that resembles a human face. A stereo head that is de-
signed for precision and robustness much like ours, is
CeDAR (Cable Drive Active vision Robot) (Truong
et al., 2000). Further work that has dealt with con-
trol aspects as well as design of active vision robots
are (Yamato, 1999; Andersen et al., 1996; Grosso and
Tistarelli, 1995). The problems that affect a binoc-
ular active vision head are analysed in (Gasteratos
and Sandini, 2002). Here, the authors conclude that
systems perform optimally when they are initialized
such that the two cameras are perfectly aligned and
perpendicular to the baseline. Small variations in the
vergence angle or small horizontal deviations of the
principal point influence the ability to extract 3D in-
formation from stereo images dramatically. This in-
sight guided the development of the POPEYE system.
3 THE POPEYE STEREO HEAD
The stereo robot head was designed to mimic the ba-
sic abilities of the human head. The human visual sys-
tem has nine degrees of freedom: the mechan-
ical degrees of the neck (pan, tilt and swing), the op-
tical degrees of the eyes (focus and aperture), and the
mechanical degrees of the eyes (tilt, pan and swing).
However, neck pan and tilt are tightly coupled with
eye pan and tilt. This effectively reduces the degrees
of freedom to those needed to fixate a point in 3D
space. More details on the human visual system can
be found in (Carpenter, 1988).
The POPEYE robot has a Helmholtz configura-
tion (Helmoltz, 1925) (tilt axis shared by both eyes),
and consists of four rotational degrees of freedom:
neck pan, neck tilt and individual eye pan. Two more
manually adjustable degrees of freedom or configu-
ration parameters are available, namely the baseline
between the eyes and the translational distance of the
optical center of each eye to the tilt axis. Thus, in
terms of degrees of freedom, POPEYE is comparable
to a human head.
One main advantage of this robot head is that the
robot eyes are not affected by translations other than
the neck-pan movements. This is possible because of
the location of the camera centers, which allows pure
rotation about the eye axes. This greatly simplifies the
calculation of new fixation targets. The head
features two "eye" slots for cameras and two "ear"
slots for microphones (not used in the present study)
at places that roughly match the location of human
eyes and ears.
3.1 Hardware
In order to simulate the performance of the human
visual system, the robot requires large accel-
eration, low friction, high repeatability and minimal
transmission errors. These are primary characteristics
of systems, such as POPEYE, that use motors and
feedback devices mounted directly on
the axes of motion.
Figure 2: A picture of the robot POPEYE.
The dynamic performance and accuracy are
achieved with harmonic drive AC motors. With the
harmonic drive gearboxes, transmission compliance
and backlash, which can cause inaccuracy and oscil-
lations, are almost eliminated. Harmonic drive gears
have several advantages:
They operate with zero backlash, which makes the robot suitable for smooth pursuit movements;
They are available with positioning accuracy of better than one minute of arc and repeatability within a few seconds of arc;
They have high torque capacity, since power is transmitted through multiple tooth engagement. Harmonic drive gearing offers output torque capacity equal to conventional gears twice the size and three times the weight.
The dynamic properties of the robot are summarized
in Table 1. The connection of a standard PC to the
motors of POPEYE is schematized in Figure 3. The
control signal originates from a controller board that
is connected to the PCI bus of the PC and is ampli-
fied in a servo drive before reaching the motors. Encoders
mounted on the motors send position feedback back to the controller board.
Figure 3: The robot control hardware: the controller board (DCX-PCI300), the servo drive (amplifier) and the brushless AC motors with harmonic drive gears and encoders.
Table 1: Dynamic characteristics of the POPEYE robot.

            Precision    Range               Velocity
Neck pan    2.5E-6       [-110°, 110°]       8.4°/s
Neck tilt   2.5E-6       [-35°, 35°]         10°/s
Eye pan     4.16E-6      [-45°, 55°]         18.7°/s
Baseline                 [165 mm, 340 mm]
The cameras used in the robot are Flea2 cam-
eras from Point Grey. They are FireWire color cameras
based on a Sony 1/3” progressive scan CCD that can
capture images with resolutions up to 1024 × 768 at
30Hz. The system can optionally be used with lenses
of different focal lengths.
3.2 Calibrating the Robot
In order to guarantee 3D fixation (i.e. having both
cameras pointing at the same point
in space), calibration of both the cameras and the ge-
ometry of the robot must be performed with good ac-
curacy. We use the algorithm described in (Zhang,
1999) to calibrate the intrinsic and extrinsic parame-
ters of the cameras as well as radial distortion (the tan-
gential distortion is negligible). Having pure rotations
in the eye axes simplifies the image formation geome-
try. Pure rotation is important to implement distance-
independent saccade algorithms and is essential for
algorithms that assume that the relationship between
motion in image space and motion in joint space can be esti-
mated without knowledge of the target distance.
In order to have pure rotations of the camera in the
eye axes, the camera centers have to be adjusted by
displacing the camera along the optical axis. To per-
form this adjustment of the camera center two meth-
ods are proposed:
Parallax based method. Consider two objects at
different depths. If after a rotation of the eye,
the relative positions of the corresponding images
change, then the motion had a translational com-
ponent. To test this effect, we placed a pattern
with black vertical lines on white background on
the wall. To create the parallax effect we placed
between the camera and the pattern a transparent
acrylic sheet with just one thinner vertical black
line. The thickness of this line was adjusted to
create the illusion that it is an extension of one of
the lines of the pattern on the wall. This adjust-
ment has been done by hand, and an edge detector
was used to confirm the straightness of the result-
ing line. If the straightness is not preserved after the
rotation, the motion is not a pure rotation
and the position of the center of projection
must be adjusted by displacing the camera
body along the optical axis.
Homography based method. The homography resulting
from a pure rotation has the same eigenvalues (up
to scale) as the rotation matrix (Hartley and Zis-
serman, 2004). Consequently the angle of rotation
between views may be computed directly from the
phase of the complex eigenvalues of the homog-
raphy matrix. This method can be used to vali-
date the results from the previous one, although
empirical experience demonstrated that it is very
sensitive to noise.
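
As a sketch of how the rotation angle can be read off the homography eigenvalues, the following C++ snippet uses the Eigen library (not part of the original setup; the homography H is assumed to have been estimated beforehand, e.g. from point matches):

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <complex>

// For a conjugate rotation homography H ~ K R K^{-1}, the eigenvalues are
// (up to a common scale) {1, exp(i*theta), exp(-i*theta)}, so the rotation
// angle theta is the phase of the complex eigenvalue pair.
double rotationAngleFromHomography(const Eigen::Matrix3d& H) {
    // Remove the unknown projective scale so that det(Hn) = 1.
    Eigen::Matrix3d Hn = H / std::cbrt(H.determinant());
    Eigen::EigenSolver<Eigen::Matrix3d> solver(Hn);
    double angle = 0.0;
    for (int i = 0; i < 3; ++i) {
        // The real eigenvalue has phase ~0 and drops out of the maximum.
        angle = std::max(angle, std::abs(std::arg(solver.eigenvalues()[i])));
    }
    return angle;  // radians
}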
When using active cameras in a multi-camera con-
figuration, it is convenient and often essential to be
able to define a fiduciary frame in which the cameras
are aligned with each other. The most natural choice,
in the case of a stereo robot head, is for the cam-
eras to be pointing straight ahead such that they are
i) parallel, ii) perpendicular to the elevation axis
and iii) perpendicular to the pan axis, i.e. horizon-
tal. This fiducial aligned frame could be realized us-
ing special mechanical and/or optical devices, but can
also be achieved automatically, a process called self-
alignment. This self alignment was achieved with an
implementation of the methods described in (Knight
and Reid, 2006).
3.3 Kinematics and Fixation
To move the robot head to a new fixation target, cor-
rect motor commands have to be generated that con-
trol neck tilt, neck pan and eye vergence. If a fixa-
tion target is only defined by one 2D point in each
camera, the corresponding 3D point has to be recon-
structed. Given the intrinsics of the cameras and the
corresponding points in both images the coordinates
of the 3D point can be triangulated by intersecting the
rays defined in each camera by the focal point and
the corresponding pixel. Since in most cases the rays
do not intersect, the fixation point is computed as the
middle point of the line segment that minimizes the
distance between the two rays.
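
A minimal sketch of this midpoint computation is given below; the ray representation (camera center c and direction d for each eye) and the function name are illustrative, not taken from the robot's software:

#include <Eigen/Dense>

// Midpoint triangulation: given two viewing rays p(s) = c1 + s*d1 and
// q(t) = c2 + t*d2, find the shortest segment between them and return its
// midpoint as the fixation point.
Eigen::Vector3d midpointTriangulation(const Eigen::Vector3d& c1, const Eigen::Vector3d& d1,
                                      const Eigen::Vector3d& c2, const Eigen::Vector3d& d2) {
    // Minimizing |p(s) - q(t)|^2 over s and t gives a 2x2 linear system.
    const Eigen::Vector3d w = c1 - c2;
    const double a = d1.dot(d1), b = d1.dot(d2), c = d2.dot(d2);
    const double d = d1.dot(w),  e = d2.dot(w);
    const double denom = a * c - b * b;      // close to zero for (near) parallel rays
    const double s = (b * e - c * d) / denom;
    const double t = (a * e - b * d) / denom;
    // Closest points on each ray; the fixation point is their midpoint.
    return 0.5 * ((c1 + s * d1) + (c2 + t * d2));
}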
3.3.1 Direct Kinematics
We make the following approximations in order to simplify the computation:
The center of the head is the cyclopean eye, so D = 0;
The eye rotations are pure (no translation from the original frame is involved), so l = 0 and r = 0;
The fixation is symmetric, so θ_l = θ_r.
Figure 4: The kinematic geometry of POPEYE.
The following transformation relates the pose of the fixation frame {W_f} to the base frame {W_b}, where {}^{W_b}R_{W_f} is the rotation component:

{}^{W_b}T_{W_f} =
\begin{bmatrix}
   &                 &   & B\sin(\theta_p)\cos(\theta_t)/\tan(\theta_r) \\
   & {}^{W_b}R_{W_f} &   & B\sin(\theta_t)/\tan(\theta_r) + D           \\
   &                 &   & B\cos(\theta_p)\cos(\theta_t)/\tan(\theta_r) \\
 0 & 0               & 0 & 1
\end{bmatrix}   (1)
3.3.2 Inverse Kinematics
Given a point in 3D world space, the robot can be made to fixate it using the following equations. The position of the fixation point in the frame {b} is:

x_b = B\sin(\theta_p)\cos(\theta_t)/\tan(\theta_r)
y_b = B\sin(\theta_t)/\tan(\theta_r)
z_b = B\cos(\theta_p)\cos(\theta_t)/\tan(\theta_r)   (2)
The angles of the eyes, θ_r (right eye) and θ_l (left eye), and of the neck, θ_p (pan) and θ_t (tilt), are:

\theta_p = \tan^{-1}\left( x_b / z_b \right)
\theta_t = \tan^{-1}\left( y_b / \sqrt{x_b^2 + z_b^2} \right)
\theta_r = \tan^{-1}\left( B\cos(\theta_t) / \sqrt{x_b^2 + z_b^2} \right)
\theta_l = \theta_r   (3)
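
A minimal sketch of Equation 3 in code, under the assumptions listed above (D = 0, pure eye rotations, symmetric fixation); the struct and function names are illustrative:

#include <cmath>

struct JointAngles {
    double pan;    // theta_p, neck pan
    double tilt;   // theta_t, neck tilt
    double right;  // theta_r, right eye
    double left;   // theta_l, left eye (equal to theta_r for symmetric fixation)
};

// Inverse kinematics: joint angles that fixate the point (xb, yb, zb),
// expressed in the base frame {b}, for a head with baseline B.
JointAngles inverseKinematics(double xb, double yb, double zb, double B) {
    const double ground = std::sqrt(xb * xb + zb * zb);  // distance in the pan plane
    JointAngles q;
    q.pan   = std::atan2(xb, zb);                        // theta_p
    q.tilt  = std::atan2(yb, ground);                    // theta_t
    q.right = std::atan2(B * std::cos(q.tilt), ground);  // theta_r
    q.left  = q.right;                                   // symmetric vergence
    return q;
}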
4 ATTENTIONAL ROBOT
CONTROL
One of the main purposes of POPEYE is to study
and develop different models of attention in a re-
alistic embodied setting. Compared to competing
approaches we emphasize a universal setup to de-
velop and study attention models. Therefore we com-
bine POPEYE with the graphical attention modelling
framework JAMF which is able to express many ex-
isting saliency models. For a more detailed treatment
of the framework see (Steger et al., 2008). In the fol-
lowing we provide a rough sketch of the framework
and argue why it is well suited for attentional robot
control.
4.1 The JAMF Attention Modeling
Framework
JAMF is an open source application that allows rapid
prototyping and development of attention models.
Models are represented as directed graphs that have
functional units, called components, as nodes. Within
such a graph, information is passed in the form of
matrices from one unit to the next along connecting
edges. Thus, each unit processes the output of its pre-
decessors. Executing a graph means to traverse it in
a breadth-first manner and to perform the function of
each node. When a full pass is done, we speak of one
iteration of the graph. To handle non-static environ-
ments (e.g. video camera input) the graph is iterated
multiple times, e.g. processing one frame per itera-
tion.
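
The execution scheme can be pictured with the following sketch; the Component type and the graph bookkeeping are hypothetical and do not reflect JAMF's actual internals:

#include <queue>
#include <vector>

struct Component {
    std::vector<Component*> successors;  // outgoing edges of the model graph
    int pendingInputs = 0;               // set to the node's in-degree before each iteration
    virtual void execute() = 0;          // process the outputs of the predecessors
    virtual ~Component() = default;
};

// One iteration of the graph: components are visited in breadth-first order,
// and a component runs only after all of its predecessors have produced output.
void iterateGraph(const std::vector<Component*>& sources) {
    std::queue<Component*> ready;
    for (Component* c : sources) ready.push(c);  // e.g. camera capture components
    while (!ready.empty()) {
        Component* current = ready.front();
        ready.pop();
        current->execute();
        for (Component* next : current->successors)
            if (--next->pendingInputs == 0) ready.push(next);
    }
}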
Figure 5: A screenshot of the JAMF Client. The boxes
on the white canvas symbolize functional units, the arrows
show the flow of information. The graph depicted receives
input from POPEYE’s cameras, down-samples them and
splits the RGB images into individual channels. With this
information, a saliency map is computed that contains only
information about red colors (S = R − G − B). The result
is fed into the "StereoSaliency" component that takes the
maximum of the saliency map. For clarity, only processing
for one camera is shown. In the full version, the "Stere-
oSaliency" component receives input from the second cam-
era as well. It then compares found maxima in both im-
ages by computing a sum of squared differences and out-
puts two points which are used by the "ISR Head Control"
component to generate motor commands that fixate the cor-
responding 3D point.
At its core, JAMF uses a client-server architecture.
The JAMF client is used for development of models
and control of running simulations. New models can
be developed by dragging available components to a
drawing canvas (see Figure 5) where connections that
specify the flow of information can be drawn. The
client also controls execution of the developed mod-
els. Besides starting, stopping or pausing running
models, it provides means to feed input and param-
eters into the graph and to introspect and visualize
intermediate results. If components expose parame-
ters (e.g. window size), these can be modified by the
client as well.
The server instantiates models into running simu-
lations. It receives a graph from the client, translates
it into equivalent C code, compiles and links the sim-
ulation against the component repository and a con-
troller application. Due to the use of directed graphs
as a representation, the server can automatically ex-
ploit the structure of the graph to parallelize inde-
pendent branches. This makes it possible to utilize all cores
of multi-core processors. Once the simulation is started,
the client connects to the controller application via
TCP/IP. This communication channel allows the client to:
Start, stop and pause simulations
Set components’ parameters
Request output of components
Send input to components
A component within the JAMF framework is an al-
gorithm, which is wrapped into a C++ class that pro-
vides access to input, output and parameters. This
can conveniently be achieved by extending JAMF’s
base component class and obeying a naming conven-
tion (e.g. input, output, getter and setter methods start
with input_, output_, get_ and set_ respectively). In-
puts and outputs are OpenCV matrices. The existing
component base makes heavy use of this highly op-
timized computer vision library, which is encouraged
for new components as well.
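
As an illustration, a component following this convention might look like the sketch below; the base class, the execute() hook and the threshold example are assumptions for illustration, only the naming convention and the use of OpenCV matrices are taken from the text above:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

class ThresholdComponent /* : public jamf::Component -- assumed base class */ {
public:
    // Inputs and outputs are OpenCV matrices, named with the input_/output_ prefixes.
    cv::Mat input_image;
    cv::Mat output_mask;

    // Exposed parameter, adjustable from the JAMF client at run time.
    void   set_threshold(double t) { threshold_ = t; }
    double get_threshold() const   { return threshold_; }

    // Called once per graph iteration (hook name assumed).
    void execute() {
        cv::threshold(input_image, output_mask, threshold_, 255.0, cv::THRESH_BINARY);
    }

private:
    double threshold_ = 128.0;
};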
JAMF is well suited for the task of attentional
robot control for several reasons. By using directed
graphs as model representation, JAMF captures the
structure of most existing saliency models. They
usually consist of a set of discrete processing stages
(e.g. feature extraction, conspicuity map computa-
tion, conspicuity map combination, top-down inte-
gration), where each stage depends on the output of
its predecessors. Another important aspect for robot
control is computational efficiency. JAMF can reach
high performance due to several strategies. The use
of native C code and OpenCV for component devel-
opment generates fast simulations. The graph struc-
ture allows automatic parallelization. Development
of new attention models is eased by the graphical
client and the standard component repository which
contains many building blocks of common attention
models. JAMF separates model from implementation
such that technical details are "hidden" by the graph-
ical client. Therefore, model developers do not nec-
essarily need knowledge of specific implementation
details. A functional description of each component, as provided in the
built-in user documentation, suffices.
4.2 JAMF-POPEYE Interface
To use POPEYE within JAMF attention models we
use the following setup (see Figure 6). A new com-
ponent was developed that, given two corresponding
points, fixates the robot head on the corresponding 3D point (see Section 3.3).
For this, POPEYE has to be connected to the host that
runs the JAMF server, as this component uses low
level API routines and hardware. Existing attention
Figure 6: The JAMF / POPEYE setup. Camera output is
fed into the attention model. The model generates a new
target point which is transduced into motor commands by
the actuator component.
models can be extended to control POPEYE by in-
cluding and feeding fixation targets into this compo-
nent. Execution and control of the simulation can then
be carried out remotely from the machine that runs the
JAMF client using a TCP/IP connection.
5 USE CASE
Having described the software and hardware setup
in the previous sections, we now turn to some ex-
emplary implementations of attention models. The
general task is, given two input images, to generate
a new fixation target for the robot head. A simple
approach is to find interesting points in each image
separately, which corresponds to computation of two
saliency maps. The "ISR Head Control" component
can then triangulate a 3D point from the two points
and generate motor commands that move the robot
head to fixate this point. Obviously, the triangulation
is only meaningful when the two 2D coordinates point
to the same point in 3D space. To check this, we com-
pute the sum of squared differences in a patch around
the 2D coordinates. A fixation target is only accepted
when the difference is sufficiently small. In this sense, the
attention model can be treated as a black box as long
as it outputs two saliency maps (see Figure 6 for a
graphical depiction).
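
A sketch of this patch comparison is shown below (single-channel images assumed; the window size and threshold are illustrative choices, not values from the paper):

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Sum of squared differences between patches centered on the two candidate
// points in the left and right image (both assumed to be single-channel).
double patchSSD(const cv::Mat& left, const cv::Mat& right,
                cv::Point2f pl, cv::Point2f pr, int halfWin = 8) {
    const cv::Size win(2 * halfWin + 1, 2 * halfWin + 1);
    cv::Mat patchL, patchR, diff;
    cv::getRectSubPix(left,  win, pl, patchL, CV_32F);  // handles image borders
    cv::getRectSubPix(right, win, pr, patchR, CV_32F);
    cv::absdiff(patchL, patchR, diff);
    return cv::sum(diff.mul(diff))[0];
}

// A fixation target is accepted only if the two salient points look alike:
//   bool accept = patchSSD(leftImg, rightImg, pLeft, pRight) < ssdThreshold;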
For an easy example consider an attention model
that is tuned to red colors as the only interesting fea-
ture. Figure 5 shows a JAMF graph that models this.
Mathematically, simple saliency models can be ex-
pressed in the following way:
Given an input image I we compute feature matrices, named F_{featurename}. In the example mentioned above the "redness" feature F_{red} is obtained by subtracting the green and blue color channels G(I), B(I) from the red color channel R(I):

F_{red}(I) = R(I) - G(I) - B(I)   (4)
The resulting matrix F is normalized by transforming its values into z-values, subtracting the mean \bar{F}_t and dividing by the standard deviation \sigma_{F,t}. Note that the parameter t specifies the number of frames that are used to estimate the mean and variance.

Z(F,t) = (F - \bar{F}_t) / \sigma_{F,t}   (5)
Using Equations 4 and 5, we compute a weighted sum of z-scored feature values as the final saliency map S. The sum is formed over the set of different feature maps Γ, where each map is weighted by a scalar θ_γ. In the example above the set of feature matrices just contains the redness feature, Γ = {F_{red}}, and hence no weight is necessary.

S(I,t) = \sum_{\gamma \in \Gamma} \theta_\gamma Z(F_\gamma(I), t)   (6)
The most interesting point in an image is defined as the location of the maximum in the saliency map S(I,t):

p_d = argmax(S(I_d, t))   (7)

where argmax returns the position of the maximum in a matrix. Note that this has to be done on both input images (d ∈ {l, r}).
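
For a single camera, the redness model of Equations 4-7 can be sketched in OpenCV as follows; the per-frame estimation of mean and standard deviation is a simplification of the running statistics over t frames, and all names are illustrative:

#include <opencv2/core/core.hpp>
#include <vector>

// Equation 4: subtract the green and blue channels from the red channel.
cv::Mat rednessFeature(const cv::Mat& bgrFrame) {
    std::vector<cv::Mat> ch;
    cv::split(bgrFrame, ch);              // OpenCV stores color frames as B, G, R
    cv::Mat r, g, b;
    ch[2].convertTo(r, CV_32F);
    ch[1].convertTo(g, CV_32F);
    ch[0].convertTo(b, CV_32F);
    return r - g - b;
}

// Equation 5: z-score normalization (here estimated from the current frame only).
cv::Mat zScore(const cv::Mat& F) {
    cv::Scalar mean, stddev;
    cv::meanStdDev(F, mean, stddev);
    return (F - mean[0]) / (stddev[0] + 1e-9);
}

// Equations 6 and 7: with a single feature the saliency map is the normalized
// redness map, and the fixation candidate is the location of its maximum.
cv::Point mostSalientPoint(const cv::Mat& bgrFrame) {
    cv::Mat S = zScore(rednessFeature(bgrFrame));
    cv::Point maxLoc;
    cv::minMaxLoc(S, nullptr, nullptr, nullptr, &maxLoc);
    return maxLoc;
}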
The implementation of the "redness" model in
JAMF is straightforward. First, new camera input
from both cameras is obtained by the “CamCapture”
component. Each camera frame is processed inde-
pendently in its own processing stream. To speed
up performance, two "GaussianPyramid" components
downscale the camera images delivered by "Cam-
Capture" to a resolution of 320×240 pixels. To extract
the "red" feature, two "ChannelSplitter" components
split the input images into their RGB color channels.
Two "Sub" components for each camera can then sub-
tract the green and blue channel from the red channel.
The resulting feature maps are then normalized by the
“ZTransformation” component and summed with the
“Add” component for each camera individually. At
this point, the information of both cameras is fused
by the "StereoSaliency" component. It computes the
sum of squared differences in a patch surrounding the
most salient points in each input image. If the SSD
falls below a threshold, the “ISRHead Control” com-
ponent triangulates a new 3D fixation target and mo-
tor commands are sent to the robot to fixate the new
target. This simple demonstration serves as a proof of
concept. Naturally, considering only the color "red"
is not too interesting.
Figure 7: Exemplary results from all four implemented saliency models. The first row gives the original, unprocessed movie
frame. In the following rows fixation targets (maxima) are indicated using a star. Saliency maps are shown for models using
1) red as salient feature, 2) contrast measurements, 3) contrast measurements plus face detection, 4) contrasts, face detection
and optical flow as features. The influence of the time dependent z-score normalization can be seen in the last row. Sudden
onset of movement in the 4th frame boosts the influence of the optical flow feature, which returns to baseline after several
frames.
It is easy to extend this model with different fea-
tures. Adding new features such as optical flow
or face detection can be accomplished by simply
adding the respective components to the model ("Op-
ticalFlow", "FaceDetect") which can be done with a
few clicks. They are connected in the same way as the
"redness" feature. They receive input from the gaus-
sian pyramid and their output is normalized by the
"ZTransformation" component. The z-score normal-
ization ensures that different features can be summed
without risking that one feature alone drives the fi-
nal saliency map. All in all we have implemented
four different saliency models that differ in the fea-
tures they use:
1. Model #1 uses only the color red as feature as de-
scribed above
2. Model #2 uses red-green, blue-yellow and lumi-
nance contrasts
3. Model #3 extends model #2 with face detection
4. Model #4 extends model #3 with optical flow
This scheme demonstrates how easy it is to test
new attention models in this setup. Figure 7 shows the
resulting saliency map from the four different models
Table 2: Performance of models 2-4. The models were
evaluated on a 2x dual-core AMD Opteron processor (2.2
GHz) with 4 GB RAM. All values are in frames per second.
“Single” refers to graphs where only monocular input is
processed, the number of feature extraction components is
thus half compared to “double” graphs. “MP” refers to eval-
uation with automatic parallelization.
Model: #2 #3 #4
Single 17.2 9.6 7.8
MP single 29.3 13.3 11.4
Double 12.8 5.5 4.9
MP double 24.31 11.5 9.9
mentioned above. Notably, the generated fixation pat-
terns can be saved along with the visual input to eval-
uate the behavior of the POPEYE robot. This allows
to compare computational models with psychophysi-
cal experiments. We have refrained from carrying out
such experiments, as our main focus is to provide a
setup in which these comparisons are possible.
An important aspect for the comparison of embod-
ied attention models with human behavioral data is
the processing speed. Table 2 shows a summary of
how many frames per second can be processed with
this setup. (Salthouse and Ellis, 1980) report that the
minimum duration of a fixation is on the order of 200-
500 ms, which suggests that even the most complex
model is fast enough to simulate natural fixation be-
havior (with about 10fps). This also demonstrates the
automatic parallelization abilities of JAMF. Depend-
ing on the number of components, the complexity of each
component and the graph structure, a speed-up by a fac-
tor of two can be reached without manual adjustments.
6 CONCLUSIONS
In this work, we introduce a novel setup to study and
develop attention models on POPEYE, a human in-
spired stereo vision robot head. POPEYE makes it possible to
study attention models in a more realistic, embodied,
three-dimensional setting. The geometric properties
of the head make it easy to control. The robot can
be controlled by JAMF, a framework to develop and
test attention models in a graphical fashion. The com-
bination of both makes it easy to implement attention
models that drive the robot’s behavior. The capability
to evaluate different attention models in an embodied
setting makes the setup a prime candidate for comparing atten-
tion models to psychophysical data. Because design
and technical realization are hidden behind a graphi-
cal abstraction layer, it can be used by researchers who
do not have a specific computer science background.
Within the setup, attention models are represented as
directed graphs that can easily be shared with other
research groups.
The head is prepared to use stereo auditory inputs
as well. One straightforward way to incorporate audi-
tory information into the existing saliency models is
to use estimated locations of auditory input sources as
a 2D bias field that modifies saliency values most strongly
at points closest to the source (Quigley et al., 2008).
Such a feature can be treated in the same manner as
all other features in the shown saliency models.
However, there are some problems that still need
to be addressed. So far, a human operator is required
to calibrate the robot before a simulation ses-
sion. Furthermore, the speed of POPEYE has not been
optimized to match human saccade behavior, so the
speed of its movements might not match that of hu-
mans.
The models presented in this work are rather sim-
ple, but show the capabilities of the setup. There are
several open issues with these models. For instance,
practical experience has shown that using a threshold
for the sum of squared differences to compare the two
salient points is not optimal. It is very sensitive to
noise and does not work on rather uniform areas. The
choice of features in the current models is also very
limited and can be improved. One of the most ob-
vious issues that needs improvement is probably the
integration of overt and covert visual attention. Ein-
häuser et al. investigate eye-in-head movements and
head-in-world movements and suggest that both make
distinct contributions to gaze allocation (Einhäuser
et al., 2008).
Future work to improve our models will build on
some of the models presented in the literature. In
(Choi et al., 2006) a biologically motivated vergence
control method for an active stereo vision system that
mimics human-like stereo visual selective attention
is proposed. They compute a gist of the scene that
can later be used in localization. In our case the gist
could be used for online parameterization of the fea-
ture extraction stage. Thereby, the model can be tuned
to different environments (e.g. indoor and outdoor
scenes). Furthermore a depth feature could be inte-
grated into our model. The process of retrieving 3D
information from stereo saliency maps is described
in (Conradt et al., 2002). A vergence control stereo
system using retinal optical flow disparity and target
depth velocity is described in (Batista et al., 2000).
A saliency map model considering depth informa-
tion as a feature is described in (Ouerhani and Hugli,
2000), although the range data was retrieved using
a laser range finder. In this work we have pursued
a strategy where one saliency map is computed for
every camera. Henkel has proposed a depth estima-
tion algorithm that is able to compute a “cyclopean”
view of a stereo scene (Henkel, 1998). This resolves
several issues that are problematic when
using binocular saliency maps: a cyclopean view is
not concerned with occluded image areas (Bruce and
Tsotsos, 2005; Zitnick and Kanade, 1999) and it can
speed up the saliency map computation, as only one
map has to be computed. Furthermore, it can aid the
triangulation of a 3D fixation target by giving a depth
estimate. We have put forward an integrated hardware
and software system for the simulation of visual atten-
tion that should be seen as a step towards
studying models of attention in a more realistic and
embodied way.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support by IST-
027268-POP (Perception On Purpose).
REFERENCES
Andersen, C. S., Crowley, J. L., D, P. P., and Perram, J. (1996). A framework for control
of a camera head. Technical report.
Aryananda, L. and Weber, J. (2004). Mertz: a quest for a ro-
bust and scalable active vision humanoid head robot.
Humanoid Robots, 2004 4th IEEE/RAS International
Conference on, 2:513–532.
Batista, J., Dias, J., Araújo, H., and Almeida, A. (1995).
The ISR multi-degrees-of-freedom active vision robot
head: design and calibration. In M2VIP’95–Second
International Conference on Mechatronics and Ma-
chine Vision in Practice, Hong Kong.
Batista, J., Peixoto, P., and Araújo, H. (2000). A focusing-
by-vergence system controlled by retinal motion dis-
parity. In ICRA, pages 3209–3214.
Bruce, N. and Tsotsos, J. (2005). An attentional framework
for stereo vision. Computer and Robot Vision, 2005.
Proceedings. The 2nd Canadian Conference on, pages
88–95.
Carpenter, H. (1988). Movements of the Eyes. Pion Limited, London, second edition.
Choi, S.-B., Jung, B.-S., Ban, S.-W., Niitsuma, H., and Lee,
M. (2006). Biologically motivated vergence control
system using human-like selective attention model.
Neurocomputing, 69(4-6):537–558.
Conradt, J., Simon, P., Pescatore, M., and Verschure, P.
(2002). Saliency maps operating on stereo images
detect landmarks and their distance. In ICANN ’02:
Proceedings of the International Conference on Arti-
ficial Neural Networks, pages 795–800, London, UK.
Springer-Verlag.
Einhäuser, W., Schumann, F., Bardins, S., Bartl, K., Böning,
G., Schneider, E., and König, P. (2007). Human eye-
head co-ordination in natural exploration. Network:
Computation in Neural Systems, 18(3):267–297.
Einhäuser, W., Schumann, F., Vockeroth, J., Bartl, K., M.,
C., J., H., Schneider, E., and König, P. (2008). Dis-
tinct Roles for Eye and Head Movements in
Selecting Salient Image Parts During Natural Explo-
ration (in press). Ann. N.Y. Acad. Sci.
Fellenz, W. A. and Hartmann, G. (2002). A modular low-
cost active vision head.
Gasteratos, A. and Sandini, G. (2002). Factors affecting
the accuracy of an active vision head. In SETN ’02:
Proceedings of the Second Hellenic Conference on AI,
pages 413–422, London, UK. Springer-Verlag.
Grosso, E. and Tistarelli, M. (1995). Active/dynamic stereo
vision. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 17(9):868–879.
Hartley, R. and Zisserman, A. (2004). Multiple View Geom-
etry in Computer Vision. Cambridge University Press.
Helmholtz, H. (1925). Treatise on physiological optics.
Dover.
Henkel, R. (1998). A Simple and Fast Neural Network Ap-
proach to Stereovision. Advances in Neural Informa-
tion Processing Systems, pages 808–814.
Itti, L. and Koch, C. (2001). Computational modelling
of visual attention. Nature Reviews Neuroscience,
2(3):194–204.
Jansen, L., Onat, S., and König, P. (2008). Free viewing of
natural images: The influence of disparity. Journal of
Vision (in press).
Knight, J. and Reid, I. (2006). Automated alignment of
robotic pan-tilt camera units using vision. Interna-
tional Journal of Computer Vision, 68(3):219–237.
Knudsen, E. (2007). Fundamental Components of Atten-
tion. Annual Review of Neuroscience, 30:57.
Ouerhani, N. and Hugli, H. (2000). Computing visual atten-
tion from scene depth. In ICPR ’00: Proceedings of
the International Conference on Pattern Recognition,
page 1375, Washington, DC, USA. IEEE Computer
Society.
Quigley, C., Onat, S., Harding, S., Cooke, M., and König,
P. (2008). Audio-visual integration during overt visual
attention. Journal of Vision (in press).
Salthouse, T. and Ellis, C. (1980). Determinants of eye-
fixation duration. Am J Psychol, 93(2):207–34.
Steger, J., Wilming, N., Wolfsteller, F., Höning, N., and
König, P. (2008). The jamf attention modelling frame-
work. In WAPCV 2008, Santorini, Greece.
Truong, H., Abdallah, S., Rougeaux, S., and Zelinsky, E.
(2000). A novel mechanism for stereo active vision. In
Proc. Australian Conference on Robotics and Au-
tomation. ARAA.
Yamato, J. (1999). A layered control system for stereo vi-
sion head with vergence. Systems, Man, and Cyber-
netics, 1999. IEEE SMC ’99 Conference Proceedings.
1999 IEEE International Conference on, 2:836–841
vol.2.
Zhang, Z. (1999). Flexible Camera Calibration By View-
ing a Plane From Unknown Orientations. In Inter-
national Conference on Computer Vision (ICCV’99),
pages 666–673, Corfu, Greece.
Zitnick, C. and Kanade, T. (1999). Cooperative algorithm
for stereo matching and occlusion detection. Tech-
nical Report CMU-RI-TR-99-35, Robotics Institute,
Carnegie Mellon University, Pittsburgh, PA.