Anatomical Landmark Tracking by One-shot Learned Priors for

Augmented Active Appearance Models

Oliver Mothes and Joachim Denzler

Computer Vision Group, Friedrich Schiller University Jena, Jena, Germany

{oliver.mothes, joachim.denzler}@uni-jena.de

Keywords:

Landmark Tracking, Active Appearance Models, Whitened Histograms of Orientations.

Abstract:

For animal bipedal locomotion analysis, an immense amount of recorded image data has to be evaluated by

biological experts. During this time-consuming evaluation single anatomical landmarks have to be annotated

in each image. In this paper we reduce this effort by automating the annotation with a minimum level of

user interaction. Recent approaches, based on Active Appearance Models, are improved by priors based on

anatomical knowledge and an online tracking method, requiring only a single labeled frame. However, the

limited search space of the online tracker can lead to a template drift in case of severe self-occlusions. In

contrast, we propose a one-shot learned tracking-by-detection prior which overcomes the shortcomings of

template drifts without increasing the number of training data. We evaluate our approach based on a variety of

real-world X-ray locomotion datasets and show that our method outperforms recent state-of-the-art concepts

for the task at hand.

1 INTRODUCTION

The profound investigation of animal locomotion

plays an important role in many ﬁelds of research,

e.g., zoology, biomechanics, and robotics. For those

analyses an immense amount of data has to be

recorded to be able to derive a model or to reﬁne ex-

isting ones. In this context, it is necessary to evalu-

ate the collected data in detail, which requires con-

siderable expenses by biological experts in terms of

manually annotating every single measure (Nyakatura

et al., 2011; Andrada et al., 2013). Therefore, an

automation of this task is highly preferable. In or-

der to analyze the locomotor system in vivo, high-

speed X-ray acquisition is applied. In an usual ex-

perimental setup, animals are placed on a treadmill

which is enclosed by a C-arm X-ray acquisition sys-

tem with two perpendicular detectors providing a top

view (dorsoventral view) as well as a side view (lat-

eral view) image of the entire locomotor system. To

allow for a detailed biological evaluation, acquisition

is performed at a high spatial and temporal resolution

(1536 × 1024 pixels at 1000 FPS) on average for 1-2

seconds, resulting in up to 2000 frames. In order to

avoid the time-consuming task of manual annotation

of single images (Haase et al., 2013), an automation

of this task at a minimum level of user interaction is

of great interest.

bone

example

Anatomical

Knowledge

Training Data

LDA

Full AAM

AUGMENTED AAM

Subset AAM

Local

Tracking

extraction

learn

model

One-shot

Detector

train

Lower Leg

Contraints

Eipolar Constraints

Distance Constraints

Torso

Constraints

Figure 1: Based on few annotated biplanar recorded training

images an Augmented AAM (Haase and Denzler, 2013) is

trained, consisting of anatomical knowledge, a full multi-

view AAM model, an AAM model of the torso land-

mark subset, epipolar constraints and a local tracking-by-

detection prior introduced in this paper.

In (Haase and Denzler, 2013) Active Appearance

Models (AAM) (Cootes et al., 2001) have been ap-

plied to several bipedal bird locomotion datasets. One

crucial conclusion of this work is that AAMs need

substantial constraints from various sources. With the

support of additional anatomical knowledge, i.e. re-

gion segmentation, multi-view acquisition, and local

landmark tracking, for the animals lower limb sys-

tem, the resulting Augmented AAM (Haase and Den-

zler, 2013) provides robust results for the majority of

the processed datasets. However, the applied online

tracking approach (Amthor et al., 2012) suffers from

246

Mothes O. and Denzler J.

Anatomical Landmark Tracking by One-shot Learned Priors for Augmented Active Appearance Models.

DOI: 10.5220/0006133302460254

In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017), pages 246-254

ISBN: 978-989-758-227-1

a potential template drift caused by severe and even

full occlusion of the tracked objects.

Motivated by this shortcoming, we propose a one-

shot learned tracking-by-detection approach which

can handle these limitations by a global search. With

one representative example of an annotated landmark

subset a detector is learned. A two-staged graph-

based tracking approach then provides motion trajec-

tories through the whole sequence. Those trajectories

are utilized as a prior in the Augmented Active Ap-

pearance Model framework together with priors from

other sources as illustrated in Figure 1. In our experi-

ments we show that this extension is able to improve

previous results by up to 120 pixels in precision.

The remainder of the paper is structured as follows.

In Section 2 we will give a brief overview of re-

lated work, followed by a short introduction to Ac-

tive Appearance Models and its augmentation in Sec-

tion 3. Afterwards, our one-shot learned tracking-by-

detection approach will be introduced in Section 4.

An evaluation of the detection and tracking results is

presented in Section 5. Finally, Section 6 concludes

the paper with a short discussion.

2 RELATED WORK AND

MOTIVATION

For anatomical landmark tracking Haase et al. (Haase

and Denzler, 2011) applied Active Appearance Mod-

els (Cootes et al., 2001) to X-ray locomotion scenar-

ios. They showed that this generative model is well

suited for the task at hand since training requires only

a small amount of low contrast images. However,

this approach has its weaknesses for a certain sub-

set of landmarks–primarily landmarks of the lower

limb system–undergo severe occlusions. They ex-

tended their approach in (Haase et al., 2011) to multi-

view AAMs (Lelieveldt et al., 2003), which is more

robust and accurate for torso landmark subsets com-

pared to the single view approach. By concatenat-

ing corresponding landmarks of the second view the

model became more general. The usage of additional

constraints, especially for the distal limb landmarks,

supporting the multi-view AAM, leads to a holistic

model, referred to as Augmented AAM (Haase and

Denzler, 2013). Anatomical knowledge, the multi-

view information formulated as epipolar geometry,

and a local tracking approach were used as priors for

augmenting the standard AAM. Subtemplate Match-

ing (STM) (Amthor et al., 2012) as a data-driven on-

line tracking approach localizes landmarks of the dis-

tal limb segments. Based on the small number of

available training images, STM only needs one ini-

tial labeled frame for robust tracking, which renders

the method highly preferable for the underlying task.

However, online tracking fails in the case of severe

occlusions and temporal disappearance of tracked ob-

jects. For example, subsequences with long-term oc-

clusions of similar crossing objects considerably af-

fects the tracking performance. As a consequence the

template drift occurs, which results in a total loss of

the structure to be tracked. Sequences with tempo-

ral disappearance of the object of interest produces

a similar effect. An extension of STM is a picto-

rial structure approach (Amthor et al., 2014), where

the distal limb system is formulated by a kinematic

chain of single bones. Unfortunately, the extended

method has the same weaknesses as STM. In con-

trast, our one-shot learned tracking-by-detection ap-

proach will tackle the template drift problem and han-

dles strong texture shifts using a robust ofﬂine graph-

based tracker even when the patch detection is missed

in single frames.

Ofﬂine tracking algorithms are used to track objects in

sequences (Andriluka et al., 2010; Li et al., 2015) and

they are often formulated as a graph theoretical prob-

lem (Zhang et al., 2008; Berclaz et al., 2011; Jiang

et al., 2013; Dehghan et al., 2015). First and fore-

most, reliable object detections serve as basis for all

tracking approaches. Detection approaches are based

on local image features like HOG (Dalal and Triggs,

2005; Felzenszwalb et al., 2010) or SIFT (Lowe,

2004) to detect objects in every single frame. In or-

der to localize an object of interest a Support Vector

Machine (SVM) is used for classifying positive and

negative image patches in a sliding window manner

(Felzenszwalb et al., 2010). However, SVM training

is computationally expensive–especially when apply-

ing hard negative mining–and need a huge amount of

training data.

Based on the fact that the amount of training data in

our application scenario is limited, we use Whitened

HOG features and an LDA model (Hariharan et al.,

2012) for detecting landmark subsets, which only

needs one single positive example for robust detector

training.

More recently, Coarse-to-ﬁne Convolutional Network

Cascades (Sun et al., 2013; Zhou et al., 2013) are

designed in a multi-level architecture for facial land-

mark detection. By fusing the outputs of each level of

the multiple networks a robust and accurate estima-

tion is possible. However, the Convolutional Neural

Network frameworks have a complex structure and

need a lot of data for training which is contrary to

our pre-condition of a very limited number of train-

ing data.

Anatomical Landmark Tracking by One-shot Learned Priors for Augmented Active Appearance Models

247

The main contribution of this paper is a one-shot

learned tracking-by-detection approach using a linear

detector utilizing Histogram of Oriented Gradients

(HOG) features and a classiﬁer based on the Linear

Discriminant Analysis in a sliding-window manner to

detect the landmark subset of the lower limb system.

The detection method provides two important advan-

tages. On the one hand, many detector models for a

sequence can be trained in a very short time and on the

other hand the model training requires only one repre-

sentative positive example which is important for the

desired small annotation effort. Additionally, we use

smart convolutions to speed up sliding window man-

ner detections. Subsequently, a two-staged graph-

based tracking algorithm is used to determine land-

mark subset trajectories through the whole sequence.

In contrast to STM tracking, template drifts are re-

duced or even eliminated since landmark trajectories

are optimized globally. The single landmark tracks of

the lower limb system serve as important prior knowl-

edge for the ﬁtting task of an probabilistic Augmented

AAM model, trained with only 10 annotated exam-

ples.

3 AUGMENTED ACTIVE

APPEARANCE MODELS

Augmented AAMs (AAAMs) proposed in (Haase and

Denzler, 2013) extend the ﬁtting process of standard

AAMs by providing additional prior knowledge.

An AAM is a parametric statistical generative model

consisting of a shape component and a shape-free tex-

ture component. Training data consists of N images

with corresponding landmark annotations for relevant

anatomical structures. After aligning the N land-

mark shapes via Procrustes Analysis (Kendall, 1984),

shape variation is parameterized by applying Prin-

ciple Component Analysis PCA to the shape matrix

S = (s

− s

s, ..., s

− s

s) where s

s represents the mean

shape. The linear shape model s

s with shape parame-

ters b

and shape eigenvectors P

and the mean shape

s is given by:

s = s

s + P

. (1)

Afterwards, each image texture I

, ..., I

is warped

into the mean shape s

s. To obtain the linear texture

model g

g the very same PCA-based procedure is ap-

plied to the shape-normalized image vectors

g = g

g + P

, (2)

with the texture parameters b

, the texture eigenvec-

tors P

and the mean texture g

g. By concatenating the

variance-normalized shape and texture parameter vec-

tors and applying PCA again, we obtain the combined

parameters b

(Cootes et al., 1998).

After training, the AAM model can be used to ﬁnd

a best ﬁtting for the model parameter vector

a given input image by minimizing the squared dis-

tance δg

g = (g

image

− g

model

) of the given image and

the model appearance:

= argmin

δg

g , (3)

where we make use of the linear relationship

δb

= A

Aδg

g . (4)

However, the accuracy of standard AAMs is prone to

occlusions and unseen cases due to the linear shape

and texture modeling. To overcome these limitations

prior knowledge from different sources can be incor-

porated. In addition to multi-view information (Haase

et al., 2011), the authors of (Haase and Denzler, 2013)

use various types of constraints to augment the stan-

dard AAM. Constraints like subset AAM, anatomi-

cal knowledge, epipolar geometry and a local track-

ing method (Amthor et al., 2012) for the animals

lower limb system are used to increase the ﬁtting per-

formance and overcome the typical AAM weakness

based on linear shape and texture modeling. We im-

plement the idea of a combined approach by refor-

mulating the AAM ﬁtting as a maximum a-posteriori

(MAP) framework as in (Haase and Denzler, 2013)

with a conditional independent input image I

I and ev-

ery provided ﬁtting constraints π

π:

c,MAP

= argmax

p(b

I, π

π)

= argmax

p(I

I|b

) · p(π

π|b

) · p(b

) .

(5)

For input image data I

I it is sufﬁcient to use only

a cropped version g

image

, deﬁned by the AAM

shape conﬁguration. The likelihood can than be

modeled as a Gaussian distribution g

image

∼

N (g

model

, Σ

image

−g

model

) where Σ

image

−g

model

will be

estimated in AAM training. The prior term p(π

π|b

)

performs the integration of all the constraints π

π into

the AAM ﬁtting process, where π

π represents the dif-

ferences between the given target constraint values

and the values based on the AAM parameters b

Again a Gaussian distribution π

π|b

∼ N (0, Σ

) will

be assumed. The term p(b

) can be modeled as max-

imum likelihood estimation using a uniform distribu-

tion. For more information about the prior modeling

of Augmented AAMs please refer (Haase and Den-

zler, 2013). A serious weakness of standard AAM

is tracking landmarks of the lower limbs of the ani-

mal locomotor system. To overcome this drawback,

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

248

a local tracking constraint π

local

with the results of

an online tracking approach (Amthor et al., 2012), lo-

calizing those critical landmarks, is included in Aug-

mented AAM framework.

4 ONE-SHOT LEARNED

TRACKING APPROACH

For a reliable data-driven tracking of landmarks of

lower limb landmarks, initially, a sophisticated de-

tector is of great importance. As detection of single

landmarks is more complicated, a detection of land-

mark subset patches is of advantage. The landmarks

of single bones can be described as such a subset.

In the following sections we introduce a one-shot

learned tracking-by-detection approach. In Section

4.1 the bone detection method will be discussed,

while Section 4.2 focuses on bone tracking and land-

mark retrieval.

4.1 One-shot Learned Detector

To distinguish positive and negative examples, the

combination of HOG features and SVM classiﬁcation

was the most commonly used approach in the past

decade (Dalal and Triggs, 2005; Felzenszwalb et al.,

2010). Unfortunately, SVM training and testing is

computationally expensive, especially when applying

hard negative mining with a huge amount of training

data.

To overcome this limitation, Hariharan et al. intro-

duces in (Hariharan et al., 2012) an object detection

approach based on augmented HOG features (Felzen-

szwalb et al., 2010) and a classiﬁer based on linear

discriminant analysis (LDA).

Their model relies on the assumption that the distribu-

tions of object instances (positives) and background

examples (negatives) follow both a Gaussian distri-

bution. Thereby, the major computational effort is

caused by the estimation of the background statistics

(corresponding to the negative samples). Estimating

the covariance matrix Σ

and the mean vector µ

has

to be done only ones.

For every positive class only the respective mean vec-

tor µ

has to be computed to obtain a discriminative

linear separation of the two classes

LDA

= Σ

−1

(µ

− µ

) . (6)

A sliding window-based method and template match-

ing is used to compute similarity scores of a feature

vector x

x by a linear Whitened Histograms of Orienta-

tions (WHO) detector f (x

x) = hω

LDA

, x

xi. Dense sam-

pling of these features allows for matching the image

templates.

To speed up the evaluation the authors of (Freytag

et al., 2015) changed the order of computations and

reformulated the sampling task as efﬁcient convolu-

tions. The entire set of window patches with D

fea-

ture dimensions in the dense tiled grid of T × T cells

are evaluated by adding D convolutions of 1×1 ﬁlters

with corresponding feature planes. Consequently, de-

rived from D = T · T · D

, we obtain the feature ex-

traction as efﬁcient convolutions.

Unfortunately, the objects of interest in our applica-

tion are rotated within a certain range. Accordingly,

in the detection process, the input image needs to be

rotated. As a result, the detection result at a speciﬁc

location in the image contains a lot of multiple detec-

tions depending on the used angular resolution. Each

detection contains position information, a detection

angle and a detection score.

For the tracking algorithm the detection results are ﬁl-

tered to obtain object hypotheses. First, the normal-

ized detection maps I

of every frame t are smoothed

by accumulating Gaussian ﬁlter kernels G

G(x, y, σ)

weighted by the corresponding detection score result-

ing in a smoothed detection map O

with

(i, j ) =

∑

x=−

∑

y=−

(i + x, j + y)G

G(x, y, σ) , (7)

where m = n describe the ﬁlter size of G

G(x, y, σ). In

order to extract single detection hypotheses with new

detection positions, Non-Maximum Suppression is ap-

plied to O

Based on the assumption that the highest detection

score yields the highest similarity with the model, the

related detection score and rotation angle result from

the rotation angle with the highest detection score

within a local neighborhood deﬁned by half the ob-

ject size around the detection hypotheses is selected.

Finally, the detection scores of all frames have to be

normalized again.

4.2 Graph-based Tracking

To associate the detection results of Section 4.1, a reli-

able tracking algorithm is necessary. The graph-based

tracking approach based on (Jiang et al., 2013) uses

the detection results of Section 4.1 and is divided into

two steps. In the ﬁrst step, the algorithm extracts seg-

ments of robust object trajectories by searching simi-

lar detected objects of subsequent frames.

Here, a Directed Acyclic Graph (DAG) G is for-

mulated where every detection hypothesis repre-

sents a node. We deﬁne detection hypotheses H

H =

Anatomical Landmark Tracking by One-shot Learned Priors for Augmented Active Appearance Models

249

, ..., H

} with H

= {h

t,0

, ..., h

t,K

} where h

t,i

rep-

resents the i

detection hypothesis of frame t. Fur-

thermore, we add a source h

source

and a sink node

sink

to G = (H

H, E

E, d

d), E

E ⊆ H

H × H

H, which are fully

connected to all other nodes h

t,i

. The edge weights of

the DAG depend on the number of detection feature

weights d

, as for example spatial d

, temporal d

and

angular distances d

but also detection scores or other

detection results of adjacent detection hypotheses.

However, the non-negative edge cost function d : H

H ×

H with P detection features is calculated unlike (Jiang

et al., 2013) as follows:

d(h

t,i

, h

t+∆t, j

) =

∑

p=0

· d

t,i

, h

t+∆t, j

)

r.t. ∆t > 0 .

(8)

The inner weight parameters α

of the single track-

ing priors with

∑

p=0

= 1 regularizes the inﬂu-

ence of individual priors. Finding an optimal set of

weights automatically is subject of future research.

Before using a shortest path algorithm like Dijkstra

or Bellman-Ford, thresholds, θ

min

and θ

max

, based

on the used tracking priors have to be deﬁned, such

that: θ

min

≤ d

≤ θ

max

. The thresholding sets con-

straints for the ﬁrst stage (tracklet extraction) and set

edges which do not match the pre-condition to inﬁn-

ity. Therefore, this constraints guarantee reliable path

segments of related detection hypothesis and prevents

mistakenly created shortest paths through the whole

graph. In the DAG the edge weights of extracted paths

are subsequently set to inﬁnity to avoid multiple ex-

traction of the same tracklets. The tracklet extraction

process stops, if no further tracklet can be found, i.e.,

the tracklet length is smaller than 2.

Afterwards, the extracted tracklets are linked again to

whole paths within a second DAG G

= (H

, E

, d

⊆ H

× H

H, where H

= {τ

, ..., τ

} are the esti-

mated tracklet hypothesis and d

:⊆ H

× H

a non-

negative cost function similar to equation 8.

5 EXPERIMENTS

In this section we evaluate the performance of the

Augmented AAM framework extended by the intro-

duced landmark detection and tracking techniques.

We conduct our experiments on ﬁve avian locomo-

tion datasets of several bird species with focus on

sequences showing long-term object occlusion. The

datasets were recorded by a high-speed X-ray acqui-

sition system at 1000 Hz with a resolution of 1536

x 1024 pixels. Table 1 summarizes the analyzed

datasets.

Table 1: An Overview of analyzed datasets.

Name Species Frames Labeled Frames

Q1 Quail 706 22

Q2 Quail 701 15

T1 Tinamou 776 37

J1 Jackdaw 1201 46

J2 Jackdaw 1051 36

In Section 5.1 we compare different detection meth-

ods applied to two selected datasets with consider-

able self occlusions. Afterwards, results of the graph-

based tracking algorithm are shown in Section 5.2

which uses the detections retrieved as described in

Section 5.1. Finally, we use the tracking results in

Section 5.3 as powerful priors for the Augmented

AAM framework.

5.1 Comparison of the Detector Models

In general, detector models are learned from positive

object examples. The number and quality of these ex-

amples is a crucial factor for their accuracy. However,

in our application the number of annotated frames

should be as much as necessary, but as few as possi-

ble. Another challenge for learning a reliable detector

in our application is the visual appearance of the input

images. X-ray acquisition systems provide grayscale

images of low contrast. Accordingly, the detector has

to overcome issues with respect to appearance and

amount of positive training examples. Learned de-

tector models using patches around landmarks are not

representative enough concerning the high intra-class

variability. Instead of using such landmark patches,

the usage of subsets of landmarks is highly preferable

to obtain a representative robust detector model based

on examples with a low intra-class variability. Hence,

the usage of corresponding landmarks of the lower

limb bones (proximal and distal landmarks) are used

in our application to deﬁne subsets and new patches

for our detector. This is done by creating rotation nor-

malized bounding boxes around the landmark subset.

Figure 2 conﬁrms our assumption that using landmark

subset patches instead of patches around single land-

marks for learning a detector model provides much

better detection performance.

In our experiments we applied four different detector

methods to the mentioned landmark subsets: HOG

features trained with a linear SVM using all posi-

tive examples, a WHO model trained with all positive

examples and as well as a HOG-SVM and a WHO

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

250

0 0.2 0.4 0.6 0.8 1

Recall

0.2

0.4

0.6

0.8

Precision

landmarks (0.084)

bones (0.74)

Figure 2: Detector models were trained for landmark sub-

sets patches (bones) and the single landmark patches of the

lower limb landmarks of all datasets of Table 1.

model trained with only one-shot, in other words, one

single example (one-shot learning).

HOG. The feature extraction is based on (Dalal and

Triggs, 2005) with a gradient quantization to 9 ori-

entation bins, a cell size of 8x8 pixels, a block size of

4x4 cells, and a block spacing stride of 8 pixels. A lin-

ear SVM model is trained using all annotated frames

(positive bone examples) in the training data. Another

SVM model is trained with only one representative

example of the training data. To generate negative ex-

amples, window patches were clipped around the pos-

itive example patches. During detection the image is

rotated between −90

◦

and +90

◦

with respect to bio-

logical constraints of the bone landmark subsets. For

every rotation the detector obtains object hypotheses

with information regarding position, detection score,

and rotation angle using a sliding window technique.

The chosen rotational resolution depends on the patch

size. We used a degree step of 1.

WHO. For the WHO model ω

LDA

the background

statistics has to be computed ﬁrst. Therefore, N

ran-

domly unlabeled image patches were selected from

the sequence and the mean µ

is estimated by com-

puting the mean HOG feature µ

∑

i=0

E[x

]. The

covariance Σ

is estimated using the spatial autocor-

relation function (Hariharan et al., 2012). With the

assumption of independent and identically Gaussian

distributed positive and negative examples the model

training can be performed using only one positive ex-

ample. Since the LDA model ω

LDA

is not rotational

invariant, the sliding window technique has to be per-

formed for multiple image rotations, as well.

In a further experiment we compare the detection

methods for all datasets of Table 1. We compared

HOG-SVM models with WHO models. Both are

0 0.2 0.4 0.6 0.8 1

Recall

0.2

0.4

0.6

0.8

Precision

HOG - all (0.66)

HOG - one (0.053)

WHO - all (0.79)

WHO - one (0.74)

Figure 3: The Precision-Recall (PR) curve illustrates the

performance comparison of the applied detection methods

to all datasets of Table 1, respectively.

trained with all positive landmark and bone examples.

Afterwards, models with only one representative ex-

ample were trained. Since the detection method of

Section 4.1 applied to landmark patches yields count-

less numbers of false positives and missed detections,

we use for further experiments only those detection

results based on the bone patches.

Figure 3 illustrates the comparison of the presented

detection methods, which exhibit different detection

performances. Both models using WHO features

clearly outperform the results using HOG features,

because the estimated background statistics Σ

has a

large inﬂuence on the linear separation. Regarding

the whitening of the WHO features the performance

of trained WHO models are nearly equivalent, regard-

less of the number of training samples. The poor per-

formance of the model trained with one HOG fea-

ture example is caused by the weak representation of

the class, as a linear separation is nearly impossible.

Because of its very poor performance, the one-shot

learned HOG-SVM model is ignored for the follow-

ing experiments.

5.2 Online vs. Ofﬂine Tracking

In this section the graph-based tracking algorithm de-

scribed in Section 4.2 is applied to the detection hy-

potheses of Section 5.1. Every detection provides a

position information, a detection angle, a detection

frame number, and a detection score. Based on this

information the weights of the DAG G are calculated

as described in Section 4.2, where d

represents the

spatial distance, d

the angular distance, d

the tempo-

ral distance and d

the inverted summarized detection

scores between two nodes.

Anatomical Landmark Tracking by One-shot Learned Priors for Augmented Active Appearance Models

251

Image

Euclidean Landmark Error [px]

104

116

167

249

275

320

366

411

451

487

Method

STM (Amthor et al. 2012) HOG WHO (all examples) WHO (one example)

Figure 4: The detection result for the Q2-dataset from Sec-

tion 5.1 were applied to the two-staged graph-based tracker

and compared to the STM approach of (Amthor et al.,

2012). The landmark error observed over time visualizes

the drawback of the STM for cases of occlusion and tempo-

ral disappearance.

After normalizing all weights d

we choose uniform

inner weights α

for the edge cost function d. Ad-

ditionally, in the second stage (tracklet linking), the

mean velocity is calculated using position informa-

tion of all detections covered by the respective track-

lets. Based on the anatomical knowledge, especially

the length of the extracted bone examples described in

Section 5.1 and the detection angle information, it is

possible to recover the proximal and distal landmark

positions. Figure 4 illustrates the comparison of the

STM baseline with our tracking results applied to the

Q2-dataset. For all remaining datasets the Euclidean

tracking error plots show similar results. The graph

of the Euclidean tracking error clearly shows the tem-

plate drift of the STM algorithm at time step 190 after

the initial position at time step 167 where the pixel

error was close to 0 pixels. All other graphs show ro-

bust trajectories of the introduced detection methods

with an error of only 10 pixel on average. The tra-

jectories based on the WHO one-shot detector model

achieves the nearly same performance like the detec-

tors trained with all positive examples while only one

single training example was used.

5.3 A One-shot Learned Prior for

AAAMs

As an extension of the Augmented AAM framework

(Haase and Denzler, 2013), illustrated in Figure 1,

we replaced the utilized local tracking prior π

local

(Amthor et al., 2012) by our tracking-by-detection ap-

proach from Section 5.2 which is able to recover lost

templates based on global optimization in contrast to

Landmark category

Euclidean error [px]

1.2

1.4

1.6

1.8

2.2

2.4

Torso Upper leg Lower leg

Constraints

none (Standard AAMs) STM HOG WHO (all examples) WHO (one example)

Landmarks

Euclidean Landmark Error [px]

0.5

1.5

2.5

●

● ●

TMT dist

TMT prox

(b) Q1 dataset

Landmarks

Euclidean Landmark Error [px]

0.5

1.5

2.5

●

TMT dist

TMT prox

Figure 5: Prior for the Augmented AAM: four different

local tracking approaches were analyzed with respect to

Euclidean landmark errors of the proximal and distal Tar-

sometatatarsi landmarks of the lower limb system. The

Standard AAM and the Augmented AAM utilizing the STM

approach as local prior serve as baseline for the comparison

to our approach (see Section 5.2).

(Amthor et al., 2012). The comparison is conducted

for all datasets of Table 1.

First, a multi-view AAM model of all landmarks and

the torso landmark subset were trained with 10 anno-

tated frames and is used as torso constraint. Based

on both available views (lateral and dorsoventral)

the epipolar geometry with the help of the Funda-

mental Matrix is estimated and is used as epipolar

constraints. Anatomical knowledge, in terms of bi-

ological distance constraints were obtained via im-

age segmentation as proposed in (Haase and Denzler,

2013). Together with one of the mentioned tracking

approaches as lower leg constraints the AAAM is for-

mulated as in Section 3.

In our experiments we compared the inﬂuence of

the different tracking priors. Figure 5 illustrates the

Euclidean landmark error of the local tracking ap-

proaches of Section 5.2 applied to the Augmented

AAM framework of (Haase and Denzler, 2013). It

can be clearly seen that the template drift problem of

the Q2-dataset (see Figure 4) using the STM tracking

approach substantially affects the performance of the

entire AAAM framework.. In contrast, our proposed

tracking prior allows for highly accurate results of the

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

252

Table 2: Qualitative results of selected frames of the J2-dataset illustrate the tracked landmarks by the Augmented AAM using

the STM tracking prior (Amthor et al., 2012) and our one-shot learned tracking-by-detection prior.

STM

WHO

frame 300 350 400 450 500

Table 3: Augmented AAM using the STM tracking approach (Amthor et al., 2012) and our one-shot learned tracking-by-

detection method as local tracking prior shows in comparison to each other nearly the same accurate results applied to the

T1-dataset.

STM

WHO

frame 225 265 285 305 385

combined AAAM approach without the loss of indi-

vidual landmark subsets. In Figure 6 the average error

over all sequences of Table 1 is shown. It can clearly

be seen that our presented tracking-by-detection prior

outperforms the online tracking approach even in the

case of using only one single example for training.

6 CONCLUSIONS

In this paper we introduced a one-shot learned

tracking-by-detection prior supporting an Augmented

AAM framework for anatomical landmark retrieval in

animal locomotion analysis. In particular, a linear de-

tector was trained with only one representative pos-

itive example of a desired landmark subset in a very

fast manner. A two-staged graph-based tracking algo-

rithm generates whole trajectories of the detected hy-

potheses and recovers the single landmarks of the sub-

set. Finally, the landmark tracking results were used

as a prior for an AAM to support the model-driven

baseline algorithm and solve the model ﬁtting task

for occluded and temporally disappeared landmarks.

We compared our approach to another local tracking

Landmark category

Euclidean error [px]

1.2

1.4

1.6

1.8

2.2

2.4

Torso Upper leg Lower leg

Constraints

none (Standard AAMs) STM HOG WHO (all examples) WHO (one example)

Landmark category

Euclidean error [px]

1.2

1.4

1.6

1.8

2.2

2.4

Torso Upper leg Lower leg

(b) dorsoventral view

Landmark category

Euclidean error [px]

1.2

1.4

1.6

1.8

2.2

2.4

Torso Upper leg Lower leg

Figure 6: For the ﬁve datasets of Table 1 an averaged eval-

uation of the two different views illustrates the Euclidean

landmark error of three landmark groups (torso, upper leg,

lower leg).

method using a frame-by-frame template matching

strategy which is very accurate in sequences with par-

tial self occlusion, but fails in case of long-term full

Anatomical Landmark Tracking by One-shot Learned Priors for Augmented Active Appearance Models

253

occlusions. In our experiments we showed that this

extension is able to improve previous results by up to

120 pixels in precision. To further improve the track-

ing precision of our proposed algorithm, a higher an-

gular as well as spatial resolution can be used, which,

however, also increases detection runtime.

ACKNOWLEDGMENTS

The research was supported by grant DE 735/8-1 of

the German Research Foundation (DFG).

REFERENCES

Amthor, M., Haase, D., and Denzler, J. (2012). Fast and ro-

bust landmark tracking in x-ray locomotion sequences

containing severe occlusions. In International Work-

shop on Vision, Modelling, and Visualization (VMV).

Eurographics Association.

Amthor, M., Haase, D., and Denzler, J. (2014). Robust pic-

torial structures for x-ray animal skeleton tracking. In

International Conference on Computer Vision Theory

and Applications (VISAPP). SCITEPRESS.

Andrada, E., Nyakatura, J. A., Bergmann, F., and Blickhan,

R. (2013). Adjustments of global and local hindlimb

properties during terrestrial locomotion of the com-

mon quail (coturnix coturnix). Journal of Experimen-

tal Biology.

Andriluka, M., Roth, S., and Schiele, B. (2010). Monocular

3d pose estimation and tracking by detection. In Com-

puter Vision and Pattern Recognition (CVPR), 2010

IEEE Conference on. IEEE.

Berclaz, J., Fleuret, F., T

uretken, E., and Fua, P. (2011).

Multiple object tracking using k-shortest paths opti-

mization. Pattern Analysis and Machine Intelligence,

IEEE Transactions on.

Cootes, T., Edwards, G., and Taylor, C. (1998). Active

appearance models. In Computer Vision ECCV98.

Springer Berlin Heidelberg.

Cootes, T. F., Edwards, G. J., and Taylor, C. J. (2001). Ac-

tive appearance models. IEEE Trans. Pattern Anal.

Mach. Intell.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-

dients for human detection. In Computer Vision and

Pattern Recognition, 2005. CVPR 2005. IEEE Com-

puter Society Conference on. IEEE.

Dehghan, A., Modiri Assari, S., and Shah, M. (2015).

Gmmcp tracker: Globally optimal generalized max-

imum multi clique problem for multiple object track-

ing. In The IEEE Conference on Computer Vision and

Pattern Recognition (CVPR). IEEE.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and

Ramanan, D. (2010). Object detection with discrim-

inatively trained part-based models. IEEE Transac-

tions on Pattern Analysis and Machine Intelligence.

Freytag, A., Schadt, A., and Denzler, J. (2015). Interactive

image retrieval for biodiversity research. In German

Conference on Pattern Recognition (GCPR). Springer.

Haase, D., Andrada, E., Nyakatura, J. A., Kilbourne, B. M.,

and Denzler, J. (2013). Automated approximation of

center of mass position in x-ray sequences of animal

locomotion. Journal of Biomechanics.

Haase, D. and Denzler, J. (2011). Anatomical landmark

tracking for the analysis of animal locomotion in x-ray

videos using active appearance models. In Scandina-

vian Conference on Image Analysis (SCIA). Springer.

Haase, D. and Denzler, J. (2013). 2d and 3d analysis of ani-

mal locomotion from biplanar x-ray videos using aug-

mented active appearance models. EURASIP Journal

on Image and Video Processing.

Haase, D., Nyakatura, J. A., and Denzler, J. (2011). Multi-

view active appearance models for the x-ray based

analysis of avian bipedal locomotion. In Annual Sym-

posium of the German Association for Pattern Recog-

nition (DAGM). Springer.

Hariharan, B., Malik, J., and Ramanan, D. (2012). Discrim-

inative decorrelation for clustering and classiﬁcation.

In Computer Vision–ECCV 2012. Springer.

Jiang, X., Haase, D., K

orner, M., Bothe, W., and Denzler,

J. (2013). Accurate 3d multi-marker tracking in x-ray

cardiac sequences using a two-stage graph modeling

approach. In Computer Analysis of Images and Pat-

terns, pages 117–125. Springer.

Kendall, D. G. (1984). Shape manifolds, procrustean met-

rics, and complex projective spaces. Bulletin of the

London Mathematical Society.

Lelieveldt, B., zmc, M., van der Geest, R., Reiber, J.,

and Sonka, M. (2003). Multi-view active appearance

models for consistent segmentation of multiple stan-

dard views: application to long- and short-axis cardiac

{MR} images. International Congress Series.

Li, L., Nawaz, T., and Ferryman, J. (2015). Pets 2015:

Datasets and challenge. In Advanced Video and Signal

Based Surveillance (AVSS), 2015 12th IEEE Interna-

tional Conference on. IEEE.

Lowe, D. G. (2004). Distinctive image features from scale-

invariant keypoints. International Journal of Com-

puter Vision.

Nyakatura, J. A., Andrada, E., Blickhan, R., and Fischer,

M. S. (2011). Avian bipedal locomotion. In 5th Inter-

national Symposium on Adaptive Motion of Animals

and Machines (AMAM). Elsevier.

Sun, Y., Wang, X., and Tang, X. (2013). Deep convolutional

network cascade for facial point detection. In Com-

puter Vision and Pattern Recognition (CVPR), 2013

IEEE Conference on. IEEE.

Zhang, L., Li, Y., and Nevatia, R. (2008). Global data asso-

ciation for multi-object tracking using network ﬂows.

In Computer Vision and Pattern Recognition, 2008.

CVPR 2008. IEEE Conference on. IEEE.

Zhou, E., Fan, H., Cao, Z., Jiang, Y., and Yin, Q. (2013).

Extensive facial landmark localization with coarse-to-

ﬁne convolutional network cascade. In Computer Vi-

sion Workshops (ICCVW), 2013 IEEE International

Conference on. IEEE.

VISAPP 2017 - International Conference on Computer Vision Theory and Applications

254