ACTIVE SENSING STRATEGIES FOR ROBOTIC PLATFORMS,
WITH AN APPLICATION IN VISION-BASED GRIPPING
Benjamin Deutsch, Frank Deinzer, Matthias Zobel
Chair for Pattern Recognition, University of Erlangen-Nuremberg
91058 Erlangen, Germany
Joachim Denzler
Chair for Computer Vision, University of Passau
94030 Passau, Germany
Keywords:
Service Robotics, Object Tracking, Zoom Planning, Object Recognition, Grip Planning
Abstract:
We present a vision-based robotic system which uses a combination of several active sensing strategies to grip
a free-standing small target object with an initially unknown position and orientation. The object position is
determined and maintained with a probabilistic visual tracking system. The cameras on the robot contain a
motorized zoom lens, allowing the focal lengths of the cameras to be adjusted during the approach. Our system
uses an entropy-based approach to find the optimal zoom levels for reducing the uncertainty in the position
estimation in real-time. The object can only be gripped efficiently from a few distinct directions, requiring
the robot to first determine the pose of the object in a classification step, and then decide on the correct
angle of approach in a grip planning step. The optimal angle is trained and selected using reinforcement
learning, requiring no user-supplied knowledge about the object. The system is evaluated by comparing the
experimental results to ground-truth information.
1 INTRODUCTION
This paper focuses on the task of visual tracking, clas-
sification and gripping of a free-standing object by a
robot. Typically, vision-based robotic gripping ap-
plications (Smith and Papanikolopoulos, 1996) use
a passive, non-adaptive vision system. We present a
system which combines several active sensing strate-
gies to improve the sensor input available to the robot,
and allow the robot to choose the best approach direc-
tion. A quantitative evaluation of the robot’s perfor-
mance is achieved by comparing ground-truth infor-
mation with experimental results.
Our robot (see figure 1) consists of a platform with
a holonomic movement system, a stereo camera sys-
tem mounted on a pan-tilt-unit on top of the platform,
and a lift-like gripper on the front of the platform.
The object tracking subsystem uses the stereo head to
estimate the 3D position of the object relative to the
robot; this information is used to continuously adapt
the robot’s movement and guide it to the target. Ad-
ditionally, the two cameras possess motorized zoom
lenses, allowing the cameras’ zoom levels to be ad-
justed during tracking.

This work was partly funded by the German Research Foundation (DFG) under grant SFB 603/TP B2. Only the authors are responsible for the content.

Figure 1: (left) Our robot, showing the stereo head and the gripper (holding a plastic bottle). (right) The stereo head.

Instead of a simple reactive
approach (Tordoff and Murray, 2001), or a trained
model-dependent behavior (Paletta and Pinz, 2000),
the tracking system automatically calculates the op-
timal zoom level with an entropy-based information
theoretical approach (Denzler and Brown, 2002).
Since the objects used may not be gripped from ev-
ery angle (see figure 2 for examples of object orienta-
tion), the robot needs to detect the relative orientation
of the object and adjust its approach accordingly. The
object classifier generates both class and pose infor-
mation, of which only the latter is used in this system.
The gripping planner uses the pose information to
calculate the optimal gripping direction and the move-
ment the robot must perform to approach the object
Figure 2: (a) An invalid gripping angle. (b) The worst still
acceptable gripping angle. (c) The optimal gripping angle.
from this direction. This is related to active viewpoint
selection for object recognition (Borotschnig et al.,
2000). The viable gripping positions were not input
directly into the system; rather, the robot underwent a
training phase, using reinforcement learning.
It should be noted that, unlike robot grip planning
work as described in (Mason, 2001) or (Bicchi and
Kumar, 2000), determining the optimal gripping po-
sition from the shape, outline or silhouette of the ob-
ject is not a focus of our work. Instead, the optimal
gripping position is learned during the training phase
through feedback. This feedback could come from
successful or unsuccessful gripping attempts, or even
(as in our case) from a human judge. The robot can
generalize from this training and evaluate new, un-
trained gripping positions.
The rest of this paper is organized as follows: Sec-
tion 2 details the methods used for object tracking,
object classification and grip planning. It also shows
the interdependencies between the different subsys-
tems. Section 3 contains some experimental results.
It explains the exact setup that was used, the mea-
surements taken, and evaluates the results. Section
4 concludes this paper, and outlines future enhance-
ments which aim to enable the system to grip moving
targets as well.
2 METHODS
This section details the methods used in our system.
Section 2.1 describes the object tracker, section 2.2
the object classifier and grip planner, and section 2.3
the combined execution of the two.
2.1 Object Tracking
The purpose of the object tracking system is to deter-
mine the target object’s location relative to the robot
at all times. This information is critical for the robot’s
approach and the classification system (see section
2.2). The location is the 3D position of the target (x,
y and z coordinates in mm) relative to the stereo head.
The vision system consists of two cameras on a
pan-tilt unit with a vergence axis per camera (see fig-
ure 1 (right)). The pan axis is not used, leaving both
cameras a shared tilt and individual vergence axes.

Figure 3: The hierarchical extension to the region tracker.
The object is first visually tracked in the cam-
era images with two 2D template-based object track-
ers, based on a system presented by (Hager and Bel-
humeur, 1998). A detailed description of the tracker
is beyond the scope of this paper; it is only necessary
to note that the tracker generates a time-dependent motion parameter estimate $\mu(t)$, describing the motion of the tracked object in the image. The estimate $\mu(t-1)$ from the previous time step is used as a starting point for determining $\mu(t)$. The tracking system on our robot uses motion parameters which describe only translations $u$ in the image plane and a relative scaling factor $s$; these are sufficient for tracking a stationary object, given the planar robot movement.
One problem of the original implementation is that
it can only handle small object motions. Motions that are too large will cause the trackers to lose the object. To
allow the object to move farther in the camera image
per time step, the tracker used in our system contains
a hierarchical extension (Zobel et al., 2002).
An image pyramid is created by scaling the camera image and the reference template downwards $k-1$ times, creating $k$ levels of hierarchy. Typically, each level has half the resolution of the one below it.
Each of these $k$ levels then runs its own region-based tracker, scaled appropriately. Whereas the non-hierarchical tracker uses the motion parameter estimate $\mu(t-1)$ as an initial value while tracking the object from time step $t-1$ to time step $t$, the hierarchical tracker propagates the previous estimate through its different levels. The initial estimate is still the motion parameter vector ${}^{0}\mu(t-1)$ from level 0 of the previous time step. However, it is now passed to the tracker operating on level $k-1$, the highest level, which results in a (rough) estimate ${}^{k-1}\mu(t)$. This is then propagated down to level $k-2$ and so on, the last level receiving the estimate ${}^{1}\mu(t)$ and calculating the estimate ${}^{0}\mu(t)$. Figure 3 shows an example with 3 hierarchy levels.
Using this scheme, the complete tracker can handle larger movements, typically $2^{k-1}$ times as large, while still producing results just as accurate as the non-hierarchical version. Of course, care needs to be taken to scale the motion parameters between different levels; in our case of doubling resolutions, the translation parameter $u$ needs to be divided by $2^{k-1}$ when going from ${}^{0}\mu(t-1)$ to ${}^{k-1}\mu(t)$, and multiplied by 2 for each descended level from there on. The scaling parameter $s$ is resolution independent and does not need to be adjusted.

In our system, we use $k = 3$ hierarchy levels, providing a good compromise between tracking capability and computation time requirements.
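As an illustration of this coarse-to-fine scheme, the following Python sketch shows how the level-0 estimate of the previous time step could be scaled and propagated through the pyramid. The per-level tracker track_on_level() and the parameter layout (u_x, u_y, s) are assumptions for illustration, not the authors' implementation.

import numpy as np

K_LEVELS = 3  # k = 3 hierarchy levels, as used in our system

def hierarchical_track(track_on_level, mu_prev):
    """Coarse-to-fine propagation of the motion estimate.

    mu_prev is the level-0 estimate (u_x, u_y, s) from time step t-1;
    track_on_level(level, mu_init) refines an estimate on one pyramid level.
    """
    mu = np.array(mu_prev, dtype=float)
    # Level 0 at t-1 -> level k-1 at t: divide the translation by 2^(k-1);
    # the scaling factor s is resolution independent and stays unchanged.
    mu[:2] /= 2 ** (K_LEVELS - 1)
    for level in range(K_LEVELS - 1, -1, -1):
        mu = np.asarray(track_on_level(level, mu), dtype=float)
        if level > 0:
            mu[:2] *= 2.0  # pass the estimate down one level
    return mu  # refined level-0 estimate for time step t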
In order to combine, and smooth, the noisy 2D in-
formation provided by the two trackers, an extended
Kalman filter (Bar-Shalom and Fortmann, 1988) is
employed. The (extended) Kalman filter is a standard
state estimation tool and will not be described here. A
brief overview of the notation used here follows.
The object's true state at time $t$ is denoted by $q_t$. This is a 9-dimensional vector comprised of the 3D position, velocity and acceleration of the object. For each time step, the filter receives an observation $o_t$ (comprised, in our case, of four scalar values: the $x$ and $y$ image coordinates of the target centers in the two camera images) and incorporates this into its current belief. The filter's belief about the true state at time $t$ is a probability density function $p(q_t \mid \langle o \rangle_t)$, where $\langle o \rangle_t$ denotes all observations received up to time $t$. In the case of the Kalman filter, this is assumed to be a normal distribution $\mathcal{N}(\hat{q}^{+}_t, P^{+}_t)$. The filter uses a state transition model to predict a new state estimate $p(q_{t+1} \mid q_t) = \mathcal{N}(\hat{q}^{-}_{t+1}, P^{-}_{t+1})$ from the previous estimate. Upon receiving the next observation $o_{t+1}$, the filter compares this to a predicted observation and updates its belief to $p(q_{t+1} \mid \langle o \rangle_{t+1})$.
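The following sketch illustrates this predict/update cycle under simplifying assumptions: a linear constant-acceleration transition model, an observation Jacobian H obtained elsewhere by linearizing the stereo projection h around the prediction, and illustrative noise levels. It is not the authors' implementation.

import numpy as np

def predict(q_plus, P_plus, dt, process_noise=1e-2):
    """A-priori estimate N(q_minus, P_minus) from the previous belief."""
    I3 = np.eye(3)
    Z3 = np.zeros((3, 3))
    # 9x9 constant-acceleration transition for (position, velocity, acceleration)
    F = np.block([[I3, dt * I3, 0.5 * dt**2 * I3],
                  [Z3, I3, dt * I3],
                  [Z3, Z3, I3]])
    Q = process_noise * np.eye(9)
    return F @ q_plus, F @ P_plus @ F.T + Q

def update(q_minus, P_minus, o, h, H, obs_noise=1.0):
    """A-posteriori estimate after observing o (4 image coordinates).

    h(q) is the stereo projection of the state, H its Jacobian at q_minus."""
    R = obs_noise * np.eye(4)
    S = H @ P_minus @ H.T + R              # innovation covariance
    K = P_minus @ H.T @ np.linalg.inv(S)   # Kalman gain
    q_plus = q_minus + K @ (o - h(q_minus))
    P_plus = (np.eye(9) - K @ H) @ P_minus
    return q_plus, P_plus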
During the tracking process, it is possible to ad-
just the focal lengths of the cameras. It is clear that
such an adjustment will have an effect on the observa-
tion function used in the Kalman filter. This adds an
action parameter $a$ to the object state belief, giving $p(q_t \mid \langle o \rangle_t, \langle a \rangle_t)$. The goal of the zoom planning subsystem is to find an action $a_t$ (in our case two zoom levels, that is, the field of view of each camera) that is optimal for the Kalman filter, i.e. one that minimizes the uncertainty of the positional belief generated in the next time step.
In the Kalman filter, this uncertainty is described by the a-posteriori covariance matrix $P^{+}$. The "larger" the covariance matrix is, the more likely the true state is deemed to be farther away from the estimated mean $\hat{q}^{+}$. Several different measures of the covariance matrix have been proposed in the context of estimation evaluation (Puckelsheim, 1993), such as the determinant of the matrix, or the inverse of the trace of its inverse (D- and A-optimality, respectively).
Following (Denzler et al., 2003), we employ the entropy of the posterior distribution as an uncertainty measure (Shannon, 1948). The entropy of the a-posteriori density $p(q_t \mid \langle o \rangle_t, \langle a \rangle_t)$ is defined as
$$H(q_t) = -\int p(q_t \mid \langle o \rangle_t, \langle a \rangle_t) \, \log p(q_t \mid \langle o \rangle_t, \langle a \rangle_t) \, dq_t \, . \qquad (1)$$
Since (for our system) $p(q_t \mid \langle o \rangle_t, \langle a \rangle_t)$ is an $n$-dimensional normal distribution $\mathcal{N}(\hat{q}^{+}, P^{+})$, the entropy can be calculated as
$$H(q_t) = \frac{n}{2} + \frac{1}{2} \log\!\left( (2\pi)^n \, |P^{+}| \right) \qquad (2)$$
Unfortunately, the correct entropy can only be calculated after the most recent observation $o_t$ has taken place, yet we wish to choose an action $a_t$ based on this entropy before the observation. What we need to calculate is the conditional entropy
$$H(q_t \mid o_t, a_t) = \int p(o_t \mid a_t) \, H(q_t) \, do_t \, . \qquad (3)$$
This is, in effect, the expected entropy of the belief over all observations $o_t$, given the action $a_t$. In the case of the Kalman filter, $P^{+}$ is independent of the actual observation $o_t$. Using this information and removing all terms which are irrelevant to the optimization, we obtain
$$a_t^{*} = \operatorname*{arg\,min}_{a_t} |P_t^{+}| \qquad (4)$$
There is one problem with this approach, however:
this formula assumes that there will always be an ob-
servation, no matter how large the focal length. This
is clearly not always the case. One of the main prob-
lems with a large focal length is the associated small
field of view. A larger focal length increases the
chance that the object’s projection will in fact be out-
side of a camera’s sensor. In this case, no (usable)
observation has occurred, and the best estimate for
the object’s current state is the a-priori state estimate
$p(q_t \mid \langle o \rangle_{t-1}, \langle a \rangle_{t-1}) = \mathcal{N}(\hat{q}_t^{-}, P_t^{-})$.
Splitting the conditional entropy into successful and unsuccessful observations, an unsuccessful observation being one which lies outside one of the cameras' sensors, one can define $w_1(a_t)$ to be the chance that the object will be visible and the observation will be successful, and $w_2(a_t)$ as the chance that the object will not be visible (Denzler et al., 2003). $w_1(a_t)$ and $w_2(a_t)$ can be estimated using Monte Carlo sampling, or by closed-form evaluation in the case of normal distributions.
Although any irrelevant terms for the optimization can still be eliminated, the logarithms can no longer be avoided. This leaves the optimization problem
$$a_t^{*} = \operatorname*{arg\,min}_{a_t} \Big( w_1(a_t) \log\!\big( |P_t^{+}(a_t)| \big) + w_2(a_t) \log\!\big( |P_t^{-}| \big) \Big) \qquad (5)$$
2.2 Object Classification — Grip Planning
Figure 4: Principles of Reinforcement Learning applied to grip planning.

One of the goals of this work is to provide a solution to the problem of selecting an optimal grippoint, or grip position, without making a priori assumptions
about the objects and the classifier used to recognize
the class and pose of the object. The problem is to
determine the next view of an object given the current
observations. The problem can also be seen as the de-
termination of a function which maps an observation
to a new grippoint. This function should be estimated
automatically during a training step and should fur-
ther improve over time. The estimation must be done
by defining a criterion, which measures how good it
is to grip an object from a specific position. Addition-
ally, the function should take uncertainty into account, both in the recognition process and in the grippoint selection. The latter is important since the robot must move around the object to reach the planned grippoint, a noisy operation, so the final position of the robot will always be error-prone. Last but not least, the function should be classifier-independent and able to handle continuous grippoints.
A straightforward and intuitive way of formalizing the problem is given by looking at figure 4: a closed loop between sensing the state $s_t$ and acting with $a_t$ can be seen. The chosen action
$$a_t = (\Delta\varphi) \quad \text{with } \Delta\varphi \in [0^\circ; 360^\circ[ \qquad (6)$$
corresponds to the movement of the mobile platform.
As it will only move on a circle around the object in
the application presented in this paper, the definition
of (6) is sufficient. The sensed state
$$s_t = (\Omega_\kappa, \varphi)^T \quad \text{with } \varphi \in [0^\circ; 360^\circ[ \qquad (7)$$
contains the class $\Omega_\kappa$ and pose $\varphi$ (the rotation) of the
object relative to the robot. This state is estimated
by the employed classifier. In this paper we use a
wavelet-based classifier as described in (Grzegorzek
et al., 2003) but other classification approaches can
be applied. Additionally, a so-called reward $r_t$, which measures the quality of the chosen grippoint, is required. The better the chosen direction for gripping an object, the higher the resulting reward has to be. In our case we decided on $r_t \in [0; \mathrm{MAX}]$, $\mathrm{MAX} = 10$, with $r_t = 0$ for the worst grip position (figure 2(a)) and $r_t = 10$ for the best grip position (figure 2(c)).
It is important to notice that the reward should also
include costs for the robot movement, so that large
movements of the robot are punished. These costs
$$\mathrm{cost}(a) = \begin{cases} m \cdot \mathrm{MAX} \cdot \frac{\Delta\varphi}{360} & \Delta\varphi \le 180 \\[2pt] m \cdot \mathrm{MAX} \cdot \frac{360 - \Delta\varphi}{360} & \Delta\varphi > 180 \end{cases} \qquad (8)$$
with $m \in [0; 1]$ are subtracted from each reward.
At time t during the decision process, the goal will
be to maximize the accumulated and weighted future
rewards, called the return
$$R_t = \sum_{n=0}^{\infty} \gamma^n \, r_{t+n+1} \qquad (9)$$
with $\gamma \in [0; 1]$. The weight $\gamma$ defines how much influence the future reward $r_{t+n+1}$ at time $t+n+1$ will have on the overall return $R_t$. For the application of selecting the optimal grip position, $\gamma = 0$ is sufficient, as only one step is necessary to reach the goal, the optimal grip position.
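To illustrate the reward shaping, the small sketch below combines the human rating with the movement cost (8); the value of m is an assumption, since only m in [0; 1] is required. With gamma = 0 the return (9) reduces to this single immediate reward.

def movement_cost(delta_phi, m=0.5, MAX=10.0):
    """Movement cost (8); m = 0.5 is an assumed value within [0, 1]."""
    delta_phi = delta_phi % 360.0
    frac = delta_phi / 360.0 if delta_phi <= 180.0 else (360.0 - delta_phi) / 360.0
    return m * MAX * frac

def reward(rating, delta_phi, m=0.5):
    """rating: 0 (worst grip position) .. 10 (best grip position)."""
    return rating - movement_cost(delta_phi, m)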
Of course, the future rewards cannot be observed at
time step t. Thus, the following function, called the
action-value function
$$Q(s, a) = E\{ R_t \mid s_t = s, \, a_t = a \} \qquad (10)$$
is defined, which describes the expected return when
starting at time step t in state s with action a. In
other words, the function Q(s, a) models the ex-
pected quality of the chosen movement a for the fu-
ture, if the classifier has returned s before.
Viewpoint selection can now be defined as a two-step approach: First, estimate the function Q(s, a)
during training. Second, if at any time the classifier
returns s as classification result, select that camera
movement which maximizes the expected accumu-
lated and weighted rewards. This function is called
the policy
$$\pi(s) = \operatorname*{argmax}_{a} Q(s, a) \, . \qquad (11)$$
The key issue of course is the estimation of the func-
tion Q(s, a), which is the basis for the decision pro-
cess in (11). One of the demands of this paper is
that the selection of the most promising grip position
should be learned without user interaction. Reinforce-
ment learning provides many different algorithms to
estimate the action-value function based on trial-and-error methods (Sutton and Barto, 1998). Trial and er-
ror means that the system itself is responsible for try-
ing certain actions in a certain state. The result of such
a trial is then used to update Q(·, ·) and to improve its
policy π.
In reinforcement learning a series of episodes is performed: Each episode $k$ consists of a sequence of state/action pairs $(s_t, a_t)$, $t \in \{0, 1, \ldots, T\}$, where the performed action $a_t$ in state $s_t$ results in a new state $s_{t+1}$. A final state $s_T$ is called the terminal state, where a predefined goal is reached and the episode
ends. In our case, the terminal state is the state where
gripping an object is possible with high confidence.

Figure 5: Illustration of the transformation function θ(s, a) and the distance function d(·, ·).
During the episode, new returns $R^{(k)}_t$ are collected for those state/action pairs $(s^k_t, a^k_t)$ which were visited at time $t$ during the episode $k$. At the end of the episode, the action-value function is updated. In our case, so-called Monte Carlo learning is applied, and the function $Q(\cdot, \cdot)$ is estimated by the mean of all collected returns $R^{(i)}_t$ for the state/action pair $(s, a)$ over all episodes. Please note that it is sufficient for the scope of this paper to restrict an episode to only one chosen action. Longer episodes have been discussed for more complicated problems of viewpoint selection in (Deinzer et al., 2003).
As a result, for the next episode one obtains a new decision rule $\pi_{k+1}$, which is now computed by maximizing the updated action-value function. This procedure is repeated until $\pi_{k+1}$ converges to the optimal policy. The reader is referred to a detailed introduction to reinforcement learning (Sutton and Barto, 1998) for a description of other ways of estimating the function $Q(\cdot, \cdot)$. Convergence proofs for several algorithms can be found in (Bertsekas, 1995).
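A compact sketch of this Monte Carlo update under the one-action-per-episode setting used here; the data structures are illustrative, not the authors' code.

from collections import defaultdict

returns = defaultdict(list)   # (state, action) -> list of collected returns
Q = {}                        # (state, action) -> current action-value estimate

def update_after_episode(state, action, episode_return):
    """Monte Carlo learning: Q(s, a) is the mean of all collected returns."""
    key = (state, action)
    returns[key].append(episode_return)
    Q[key] = sum(returns[key]) / len(returns[key])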
Most of the algorithms in reinforcement learning
treat the states and actions as discrete variables. Of
course, in grippoint selection parts of the state space
(the pose of the object) and the action space (the cam-
era movements) are continuous. A way to extend the
algorithms to continuous reinforcement learning is to
approximate the action-value function
$$\widehat{Q}(s, a) = \frac{\sum_{(s', a')} K\big( d(\theta(s, a), \theta(s', a')) \big) \cdot Q(s', a')}{\sum_{(s', a')} K\big( d(\theta(s, a), \theta(s', a')) \big)} \qquad (12)$$
which can be evaluated for any continuous state/action pair $(s, a)$. Basically, this is a weighted sum of the action-values $Q(s', a')$ of all previously collected state/action pairs $(s', a')$. The other components within (12) are:
The transformation function $\theta(s, a)$ (see figure 5) transforms a state $s$ with a known action $a$, with the intention of bringing a state to a "reference point" (required for the distance function in the next item). In the context of the current definition of the states from (7) it can be seen as a "shift" of the state:
$$\theta(s, a) = s + \begin{pmatrix} 0 \\ \Delta\varphi \end{pmatrix} = \underbrace{\begin{pmatrix} \Omega_\kappa \\ \varphi \end{pmatrix}}_{s} + \underbrace{\begin{pmatrix} 0 \\ \Delta\varphi \end{pmatrix}}_{\text{contains } a} = \begin{pmatrix} \Omega_\kappa \\ (\varphi + \Delta\varphi) \bmod 360 \end{pmatrix} \qquad (13)$$
A distance function $d(\cdot, \cdot)$ (see figure 5) to calculate the distance between two states. Generally speaking, similar states must result in low distances. The lower the distance, the more transferable the information from a learned action-value to the current situation is. As one has to compare two states, the following formula meets the requirements:
$$d(s, s') = d\!\left( \begin{pmatrix} \Omega_\kappa \\ \varphi \end{pmatrix}, \begin{pmatrix} \Omega_\lambda \\ \varphi' \end{pmatrix} \right) = \begin{cases} |\varphi - \varphi'| & \text{for } \Omega_\kappa = \Omega_\lambda \\ \infty & \text{otherwise} \end{cases} \qquad (14)$$
A kernel function $K(\cdot)$ that weights the calculated distances. A suitable kernel function is the Gaussian $K(x) = \exp(-x^2 / D^2)$, where $D$ denotes the width of the kernel.
Viewpoint selection, i.e. the computation of the
policy π, can now be written, according to (11), as
the optimization problem
$$\pi(s) = \operatorname*{argmax}_{a} \widehat{Q}(s, a) \, . \qquad (15)$$
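To make (12)–(15) concrete, the following sketch evaluates the kernel-based approximation for the state and action definitions (6) and (7). The kernel width D, the set of stored training samples and the discretization of candidate actions are assumptions for illustration only, not the authors' settings.

import numpy as np

D = 20.0  # assumed kernel width in degrees

def theta(state, action):                      # (13): "shift" the state
    cls, phi = state
    return cls, (phi + action) % 360.0

def distance(s, s_prime):                      # (14), taken literally
    (c1, p1), (c2, p2) = s, s_prime
    return abs(p1 - p2) if c1 == c2 else np.inf

def kernel(d):                                 # Gaussian kernel K(x)
    return np.exp(-d**2 / D**2)

def q_hat(state, action, q_samples):           # (12)
    """q_samples: list of (state, action, action_value) collected in training."""
    ref = theta(state, action)
    weights = [kernel(distance(ref, theta(s, a))) for s, a, _ in q_samples]
    values = [q for _, _, q in q_samples]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, values)) / total if total > 0 else 0.0

def policy(state, q_samples, candidates=np.arange(0.0, 360.0, 5.0)):  # (15)
    return max(candidates, key=lambda a: q_hat(state, a, q_samples))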
2.3 Combined Execution
There are several points that arise from the fact that
the object tracking and the object classification sys-
tems make use of the same cameras, at the same time.
The tracking system needs to keep continuous track of
the target, so the classifier must use the same camera
settings and images.
During the classification process, the object track-
ing still calculates the optimal zoom level for track-
ing purposes. However, the classification process per-
forms best when the camera is at the same zoom level
as during the classifier’s training. This conflict is re-
solved by augmenting the zoom level optimization
system (5) with a weighting function² $w_c(a_t)$:
$$a_t^{*} = \operatorname*{arg\,min}_{a_t} \Big( w_1(a_t) \log\!\big( |P_t^{+}(a_t)| \big) + w_2(a_t) \log\!\big( |P_t^{-}| \big) \Big) \cdot w_c(a_t) \qquad (16)$$
² Inherent bounds on their arguments prevent the logarithms from ever becoming negative.
Figure 6: Overview of the experimental setup. See the text
for a description of the individual steps.
$w_c(a_t) = 1$ unless an object classification needs to be performed, in which case it becomes very large (1000) for actions in which the left camera zoom level diverges from the classifier's preferred level by more than a fixed tolerance (the object classification uses only the image from the left camera). The same optimization system is then always used for the zoom levels, independent of the task the robot is currently performing.
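A sketch of how such a weighting function could look; the action layout and the fixed tolerance are assumptions, while the penalty value of 1000 follows the description above.

def w_c(action, classifying, preferred_zoom, tolerance, penalty=1000.0):
    """Weighting function from (16): penalize zoom actions that move the left
    camera away from the classifier's training zoom level during classification;
    otherwise the weight stays at 1."""
    left_zoom = action[0]  # assumed layout: action = (left zoom, right zoom)
    if classifying and abs(left_zoom - preferred_zoom) > tolerance:
        return penalty
    return 1.0

The score computed in the zoom selection sketch of (5) would then simply be multiplied by this weight; since the logarithms never become negative (see the footnote), a large weight reliably suppresses such actions.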
Another restriction of the pose estimation process
is that it was only trained for one degree of freedom,
namely the rotation. This means that the robot must
move to a fixed distance from the target in order to
eliminate any tilting rotation. The object tracking sys-
tem needs to provide a reliable depth estimate for the
target at all times; merely detecting that the object has
been reached is insufficient.
3 EXPERIMENTS
For experimental evaluation of the system, the grip-
ping task was repeatedly performed with different
robot starting locations and object orientations. The
final position of the robot when gripping the object is
used to evaluate the object tracker, while the orien-
tation of the robot relative to the object assesses the
classifier and grip planner.
3.1 Setup
The experiments were performed as shown in figure 6.
The robot is placed a random distance from the object.
The object is always at a fixed height. After the tar-
get object is selected, its 3D coordinates are tracked
(1). The robot then orients itself to the target (2) and
begins to move towards it (3). At a fixed distance,
the robot stops to perform its classification and pose
detection.
Once the pose is known, the robot moves around
in a circular path (4) by the angle determined by the grip planner. Then it moves towards the target un-
til it is close enough for the object to be gripped (5).
This final distance is still determined by visual track-
ing of the target only, no proximity or tactile sensors
are used. After the robot has gripped the object, it
lifts it from its stand, moves back a short distance and
places the object on the floor.
The object classification was trained by the use of
a turntable placed in front of the robot, and captur-
ing the objects through the robot’s left camera at dif-
ferent angles and lightings, as in (Grzegorzek et al.,
2003). The distance from the robot to the objects was
constant, resulting in a single degree of freedom and
the object tracking requirements mentioned in sec-
tion 2.3. The grip planning was trained by placing the
robot in front of the object. The robot classified the
object, then performed a random action, i.e. it moved
in a circle around the object by a random amount. A
human operator then rated the action between 0 (bad)
and 10 (good) for the reinforcement learning (see sec-
tion 2.2). This rating was subjective and therefore un-
stable; reinforcement learning can cope with such in-
put, however. The object tracking requires no training
of any kind.
3.2 Measurements and Evaluation
Twelve different series of experiments were per-
formed, each series consisting of at least 10 individual
experiments as described above (if the object tracker
lost the object during phase 3, an experiment was re-
peated). For each new series, a different robot starting
position and orientation was chosen. The robot was
returned to approximately this starting position at the
beginning of each experiment in a series. The first six
series were performed with zoom planning disabled,
and the second six series used zoom planning.
A total of 133 experiments were conducted. In 13
cases, the object tracking system lost the object at the
beginning of the experiment, where the target is very
small in the camera image. In 10 cases, the object was
lost near the end of the experiment, where the target
is viewed increasingly from above, causing its appear-
ance to diverge from the original template (these ex-
periments still yielded orientation measurements). In
105 cases, the robot’s final orientation towards the tar-
get was within the valid grip position limits (see figure
2(b)), and in 15 cases, outside of them.
Before each experiment, the ground truth position
and pose of the target was calculated using a calibra-
tion pattern placed at the bottom of the target’s stand.
The calibration pattern indicates the pose of the stand
itself, while a rotational scale affixed to the bottom
of the target shows its orientation relative to the stand
(see figure 7). The calibration pattern is removed be-
fore the experiment starts; the robot does not have ac-
cess to the calibration results.
For each experiment, the target position estimate is
acquired during a 10 second pause after the robot has
finished orienting itself towards the object, but before
it starts moving closer. All position estimates during
this time (about 55–60 measurements) are averaged to
obtain the position estimate for one experiment. Since
the self-orientation of the robot towards the target at
the beginning of each experiment is rather accurate, only the distance estimation from the robot to the target is evaluated and compared to the ground-truth distance obtained from the calibration pattern.

Figure 7: (left) The gripper with the affixed scale, and the rotational scale on the target. In this example, the grip position is 72 mm. (right) The stand with the calibration pattern.
The final gripping position is determined with the
use of a scale attached to the robot’s gripper (figure
7). The gripping position is defined as the length the
target intrudes into the gripper at the farthest point the
gripper still touches the target. This is compared to
the ideal gripping position of 80 mm.
Figure 8 shows the distance estimation error, plot-
ted against the ground-truth distance (positive values
mean the distance was overestimated; this was the
case in each experiment). Generally, the farther away
the target is, the larger the error in the distance es-
timation becomes. At the same time, the final grip-
ping position, as described above, is not affected by
the starting position. The continuous fusion of new
positional information by the Kalman filter allows the
robot to recover from the inaccurate and uncertain ini-
tial position estimate.
In this application, since the robot moves slowly
enough, the ability to change the zoom level has a
negligible effect on the position estimation, compared
to (for example) the influence of inaccuracies in the
camera vergences. The main benefit of a flexible
zoom level, for this task, occurs during the template
selection scheme. Allowing the template matching to
be performed at higher zoom levels allows one to in-
crease the size of the template in the camera image,
which increases the robustness of the tracking system.
If the robot starts sufficiently far away, using the fixed
minimal zoom level, the template covering the target’s
head is about 16 × 16 pixels. If the zoom level is in-
creased prior to template matching, a larger template
(such as 32 × 32 pixels) can be fitted over the same
target region.
The robot will typically remain at these high zoom
levels at the beginning, gradually reducing the zoom
as it approaches the bottle. This greatly reduces the
chance that the 2D trackers lose the object near the be-
ginning of an experiment; early object loss occurred
in 11 out of 71 experiments with fixed zoom, but only
in 2 out of 62 experiments with a variable zoom, a
reduction of 79%.
Figure 8: Evaluation of the distance estimation error and
final gripping position as a function of the ground-truth dis-
tance. As the distance increases, the estimation error in-
creases. However, the gripping position error remains con-
stant. All values are in mm.
To evaluate the object classification, the pose esti-
mate of the target was compared to the ground truth
pose as calculated above to obtain the pose classifi-
cation error. The pose classification error turned out
to be unbiased (near zero mean). The 90th, 75th and
50th percentiles of the absolute error are 18.7, 12.5 and 7 degrees, respectively. This compares very favorably
to the cutoff error for acceptance of 20 degrees, as
shown in figure 2(b).
For the evaluation of the grip planning, the pose
estimation from the classifier is added to the action
(movement in degrees) proposed by the grip planner,
for each experiment. This resulting grip angle is com-
pared to the ideal gripping angles (for this target) of
90 and 270 degrees. In our experiments, the plan-
ning error has a mean of -1.52 degrees, with a stan-
dard deviation of 1.36 degrees. This bias is the re-
sult of the cost function applied to the action selection
in section 2.2, as shown in figure 9. Since the target
is equally grippable from two locations, the gripping
system chooses the closer one (requiring less move-
ment) by weighting the action ratings as in equation
(8). A side effect of this weighting is that the modes of
the rating, too, get shifted slightly to favor less move-
ment. This shift, however, is negligible when com-
pared to the pose estimation error.
Finally, the actual gripping angle is evaluated. This
is the pose of the target relative to the robot at step
(5), and is measured externally by use of the scale af-
fixed to the target. This angle is compared to the ideal
gripping angles (90 and 270 degrees in our case) to
obtain the final grip position error. The deviation was
again unbiased (near zero mean), with the 90th, 75th and 50th percentiles at 23, 15 and 9 degrees respec-
tively. As demonstrated in figure 2(b), an absolute er-
ror below 20 degrees was deemed “acceptable”, while
anything above was “not acceptable”. Out of 120 ex-
periments which resulted in grippoint selections, only
15 were not acceptable, as the result of an incorrect pose estimation by the classifier.

Figure 9: Grippoint selection incorporating costs. The approximated action-value Q̂(s, a) is plotted over the action a, with and without movement costs; the target is estimated to be at 38 degrees. The influence of the cost function on the rating function is clearly visible.
4 CONCLUSION AND OUTLOOK
In this paper, we have presented the combination of
two systems, an object tracker and an object classifier,
which are able to grip a non-trivial object using only
visual feedback.
An important aspect is that neither of these systems requires any explicit modeling, either in the behavior of the focal length adjustment or in the selection of the gripping angle. Instead, the focal length adjustment comes automatically from the information theoretic approach, while the correct angle is trained, allowing the system to generate its own model.
Future work will focus on improving the individ-
ual components of this system, motivated by the goal
of tracking and gripping a moving target. This com-
prises prediction of the target’s position multiple steps
into the future, automatic adaptation of tracking fea-
tures to cope with visually changing objects, and eval-
uation of reinforcement learning techniques which al-
low learning an optimal sequence of actions.
REFERENCES
Bar-Shalom, Y. and Fortmann, T. (1988). Tracking and
Data Association. Academic Press, Boston, San
Diego, New York.
Bertsekas, D. P. (1995). Dynamic Programming and Op-
timal Control. Athena Scientific, Belmont, Mas-
sachusetts. Volumes 1 and 2.
Bicchi, A. and Kumar, V. (2000). Robotic grasping and con-
tact: A review. In Proceedings of the 2000 IEEE In-
ternational Conference on Robotics and Automation,
volume 1, pages 348–353, San Francisco.
Borotschnig, H., Paletta, L., Prantl, M., and Pinz, A. (2000).
Appearance-based active object recognition. Image
and Vision Computing, 18(9):715–727.
Deinzer, F., Denzler, J., and Niemann, H. (2003). Viewpoint
Selection – Planning Optimal Sequences of Views for
Object Recognition. In Computer Analysis of Images
and Patterns – CAIP 2003, LNCS 2756, pages 65–73,
Heidelberg. Springer.
Denzler, J. and Brown, C. (2002). Information Theoretic
Sensor Data Selection for Active Object Recognition
and State Estimation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 24(2):145–157.
Denzler, J., Zobel, M., and Niemann, H. (2003). Informa-
tion Theoretic Focal Length Selection for Real-Time
Active 3-D Object Tracking. In International Con-
ference on Computer Vision, pages 400–407, Nice,
France. IEEE Computer Society Press.
Grzegorzek, M., Deinzer, F., Reinhold, M., Denzler, J., and
Niemann, H. (2003). How Fusion of Multiple Views
Can Improve Object Recognition in Real-World En-
vironments. In Vision, Modeling, and Visualization
2003, pages 553–560, München. Aka GmbH, Berlin.
Hager, G. and Belhumeur, P. (1998). Efficient region track-
ing with parametric models of geometry and illumi-
nation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(10):1025–1039.
Mason, M. (2001). Mechanics of Robotic Manipulation.
MIT Press. Intelligent Robotics and Autonomous
Agents Series, ISBN 0-262-13396-2.
Paletta, L. and Pinz, A. (2000). Active Object Recogni-
tion by View Integration and Reinforcement Learning.
Robotics and Autonomous Systems, 31(1–2):71–86.
Puckelsheim, F. (1993). Optimal Design of Experiments.
Wiley Series in Probability and Mathematical Statis-
tics. John Wiley & Sons, New York.
Shannon, C. (1948). A mathematical theory of communi-
cation. The Bell System Technical Journal, 27:379–
423,623–656.
Smith, C. and Papanikolopoulos, N. (1996). Vision-guided
robotic grasping: Issues and experiments. In Proceed-
ings of the 1996 IEEE International Conference on
Robotics and Automation, pages 3203–3208.
Sutton, R. and Barto, A. (1998). Reinforcement Learning.
A Bradford Book, Cambridge, London.
Tordoff, B. and Murray, D. (2001). Reactive Zoom Control
while Tracking Using an Affine Camera. In Proceed-
ings of the 12th British Machine Vision Conference,
volume 1, pages 53–62.
Zobel, M., Denzler, J., and Niemann, H. (2002). Binocu-
lar 3-D Object Tracking with Varying Focal Lengths.
In Proceedings of the IASTED International Confer-
ence on Signal Processing, Pattern Recognition, and
Application, Crete, Greece, pages 325–330, Anaheim,
Calgary, Zurich. ACTA Press.