Triple-stream Deep Metric Learning of Great Ape Behavioural Actions
Otto Brookes¹, Majid Mirmehdi¹, Hjalmar Kühl² and Tilo Burghardt¹
¹Department of Computer Science, University of Bristol, U.K.
²Evolutionary and Anthropocene Ecology, iDiv, Leipzig, Germany
tilo@cs.bris.ac.uk, hjalmar.kuehl@idiv.de
Keywords:
Animal Biometrics, Multi-Stream Deep Metric Learning, Animal Behaviour, Great Apes, PanAf-500 Dataset.
Abstract:
We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple-stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-C chimpanzee body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of 12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques, showing that average per class accuracy – critical in the domain – can be improved by 23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes. We provide all key source code and network weights alongside this publication.
1 INTRODUCTION
As the climate crisis gathers pace, the threat to many
endangered species grows ever more perilous (Al-
mond et al., 2022). All species of great apes are, for
instance, listed as endangered or critically endangered
according to the IUCN Red List (IUCN, 2022).
Consequently, there is urgent need for methods
that can help to monitor population status and assess
the effectiveness of conservation interventions (Kühl
and Burghardt, 2013; Congdon et al., 2022; Tuia
et al., 2022). This includes the recognition of behaviours
and variation therein, as an integral part of biolog-
ical diversity (Dominoni et al., 2020; Carvalho et al.,
2022).
Previous works have employed deep neural net-
works which leverage multiple modalities, such as
RGB, optical flow, and audio (Sakib and Burghardt,
2020; Bain et al., 2021), for the classification of great
ape behaviours and actions. However, higher level ab-
stractions such as pose or body part information have
remained unexplored for addressing this task. In re-
sponse, we propose utilising the latter together with
RGB and optical flow in a triple-stream metric learn-
ing system (see Fig. 1) for improved classification re-
sults and domain visualisations relevant to biologists.
Great Ape Activities. This paper will focus on
great ape activity recognition, where the coarse ac-
tivity classes used are illustrated in Fig. 2 for the
utilised PanAf-500 dataset (see Sec. 3). Note that
computer vision would traditionally categorise these
classes as actions whilst in the biological realm they
represent behaviour (or aspects thereof) often cap-
tured in ethograms (Nishida et al., 1999; Zamma and
Matsusaka, 2015). For clarity, in this paper we will
refer to these classes as behavioural actions, recognising historical traditions in both disciplines.
We will approach the classification task via a deep
metric learning system (Karaderi et al., 2022) that
embeds inputs into a latent space and uses geomet-
ric distances to form distributions that align with the
semantic similarity captured by the classes (Hermans
et al., 2017; Musgrave et al., 2020). A major advan-
tage over standard supervised systems is that sample
distances in visualisations of the latent space always
relate to learned similarity and, thus, are more natu-
rally interpretable by experts. We will also analyse
Figure 1: System Overview. Our proposed triple-stream metric learning approach utilises all RGB appearance, optical flow,
and DensePose-C segmentations of chimps in videos. Exploiting hybrid reciprocal triplet and cross entropy losses, the model
is then trained to map embeddings representing great ape behavioural actions onto a metric space, where semantically similar
representations are geometrically close forming natural clusters. This pipeline improves on state-of-the-art classification
performance and allows for visualisations of the underpinning space of behavioural actions (best viewed zoomed).
the role that additional DensePose-Chimp informa-
tion (Sanakoyeu et al., 2020) can play in improving
recognition performance compared to systems that
utilise RGB and optical flow only. Lastly, as shown
by Sakib and Burghardt (Sakib and Burghardt, 2020),
there are significant challenges in correctly classify-
ing behavioural actions which occur infrequently and
form the distribution tail (see Fig. 2). To address this,
we will employ three long-tailed recognition (LTR)
techniques to improve performance on tail classes: (i)
logit adjustment (Menon et al., 2020); (ii) class bal-
anced focal loss (Cui et al., 2019); and (iii) weight
balancing (Alshammari et al., 2022).
In summary, our contributions are as follows:
(i) we implement the first deep metric learning system
for recognising great ape behavioural actions; (ii) we
show that utilising explicit pose information has a sig-
nificant positive effect on recognition performance in
this domain; and (iii) we establish that existing LTR
techniques can be applied in a metric learning setting
to improve performance on tail classes for the prob-
lem. The proposed approaches improve the state-of-
the-art performance benchmarks with respect to top-1
(85.86%) and average per class (65.66%) accuracy on
the PanAf-500 dataset.
2 RELATED WORK
Action recognition aims to classify actions observed
in video (Kalfaoglu et al., 2020; Shaikh and Chai,
2021). Learning spatio-temporal features character-
istic for actions (Simonyan and Zisserman, 2014) via
various deep learning paradigms forms the approach
of choice in the domain of human action recogni-
tion (HAR). We will briefly review concepts from this
field, before discussing specifically relevant great ape be-
havioural action recognition and LTR methods.
Human Action Recognition. Although there are
numerous deep learning approaches to action recog-
nition (Zhou et al., 2018; Lin et al., 2019; Tran et al.,
2019; Kalfaoglu et al., 2020; Pan et al., 2019; Majd
and Safabakhsh, 2020; Sharir et al., 2021; Zhang
et al., 2021a) this work focuses on multi-stream ar-
chitectures, which address key aspects of the action
recognition problem (e.g., spatial and temporal) in-
dependently and explicitly. Feichtenhofer et al. (Fe-
ichtenhofer et al., 2019) introduced the SlowFast ar-
chitecture which employs two streams, each operat-
ing at different frame rates; a slow, low frame-rate
pathway captures spatial information while the fast,
high frame-rate pathway captures fine temporal detail.
Other types of multi-stream networks process differ-
ent visual modalities. Simonyan and Zisserman (Simonyan and Zis-
serman, 2014) introduced a two-stream network that
processes RGB and optical flow to exploit spatial and
temporal semantics, respectively. Since then, several
networks that utilise additional modalities, such as
motion saliency (Zong et al., 2021) and audio (Wang
et al., 2021), have been introduced. Recently, the in-
troduction of pose, which is critical for the perception
of actions (Le et al., 2022), has shown promising re-
sults in multi-stream architectures (Hong et al., 2019;
[Figure 2 chart: number of samples per behavioural action class (log scale); classes: hanging, walking, sitting on back, standing, sitting, climbing up, camera interaction, running, climbing down.]
Figure 2: Behavioural Actions in the PanAf-500 Data.
Examples of each one of the nine behavioural action classes
(top) and their distribution across the approx. 180k frames
in the dataset (bottom). Note the imbalance of two orders of
magnitude in the distribution (best viewed zoomed).
Hayakawa and Dariush, 2020; Duan et al., 2021; Li
et al., 2022). In particular, the DensePose format
provides an opportunity to exploit fine-grained, seg-
mentation map-based pose representations for action
recognition. Hayakawa and Dariush (Hayakawa and Dar-
iush, 2020) combine RGB and DensePose estimations
in a two-stream network and demonstrate strong per-
formance on egocentric footage of humans. Whilst
significant progress has been made in the domain
of HAR, research into great ape behavioural action
recognition is still in its infancy and few systems have
been tested on natural datasets.
Great Ape Domain. To date, two systems have
attempted automated great ape behavioural action
recognition, both of which are multi-stream architectures. The
first (Sakib and Burghardt, 2020) is based on the
two-stream convolutional architecture by Simonyan and Zisserman (Simonyan and Zisserman, 2014) and used 3D
ResNet-18s for feature extraction and LSTM-based
fusion of RGB and optical flow features. They report
top-1 accuracy of 73.52% across the nine behavioural
actions in the PanAf-500 dataset (see Sec. 3) and a
relatively low average per class accuracy (42.33%),
highlighting the issue of tail class performance. The
second, proposed by Bain et al. (Bain et al., 2021),
is a deep learning system that requires both audio
and video inputs and detects two specific behaviours:
buttress drumming and nut cracking. Their system
utilised a 3D ResNet-18 and a 2D ResNet-18 for ex-
traction of visual and assisting audio features, respec-
tively, in different streams. They achieved an aver-
age precision of 87% for buttress drumming and 85%
for nut cracking on their unpublished dataset. How-
ever, the multi-modal method is not applicable to all
camera trap settings since many older models do not
provide audio. It cannot be utilised on the PanAf-500
dataset since many clips there do not contain audio.
Long-Tailed Recognition. Most naturally recorded
data exhibits long-tailed class distributions (Liu et al.,
2019). This is true of great ape camera-trap footage
which is dominated by commonly occurring be-
haviours - even with only the nine classes of the
PanAf-500 data the distribution shows a clear tail (see
Fig. 2). Without addressing this issue, models trained
on such data often exhibit poor performance on rare
classes. Various counter-measures have been pro-
posed (Verma et al., 2018; Kang et al., 2019; Zhang
et al., 2021b). Class balanced losses assign addi-
tional weights, typically determined by inverse class
frequencies, to samples from rare classes and have
yielded strong results when coupled with techniques
to reduce per-class redundancy (Cui et al., 2019).
Similarly, logit adjustment uses class frequencies to
directly offset output logits in favour of minority
classes during training (Menon et al., 2020). An
orthogonal approach, based on the observation that
weight norms for rare classes are smaller in naively
trained classifiers, is to perform weight balancing (Al-
shammari et al., 2022). These techniques have
achieved strong results on several LTR benchmarks.
Before detailing how we use triple-stream metric
learning with explicit DensePose-Chimp processing
and LTR extensions for behavioural action recogni-
tion, we will briefly outline the utilised dataset.
3 DATASET
The Pan-African dataset, gathered by the Pan African Programme: ‘The Cultured Chimpanzee’, comprises 20,000 videos recorded at 39 study sites spanning 15 African countries. Here we utilise
a 500 video subset, PanAf-500, specifically ground-
truth labelled for use in computer vision under reproducible and comparable benchmarks.
Figure 3: Frame-by-frame Ground Truth Annotations. Four still frames from PanAf-500 videos with annotations of location (green boxes) and behavioural actions (visualised as text) of the apes in-frame (best viewed zoomed).
It includes
frame-by-frame annotations for full-body locations of
great apes and nine behavioural actions (Sakib and
Burghardt, 2020) across approximately 180k frames
(see Fig. 3). Fig. 2 displays the behavioural action classes in focus together with their distribution. We utilise the PanAf-500 dataset for all experiments and
employ the same training and test partitions described
in (Sakib and Burghardt, 2020).
4 METHOD
The proposed system utilises three visual modalities as input: RGB, optical flow, and DensePose-C estimations (Sanakoyeu et al., 2020), as illustrated in Fig. 1. All optical flow images are pre-computed
using OpenCV’s implementation of the Dual TV L1
algorithm (Zach et al., 2007). We employ the model
developed by Sanakoyeu et al. (Sanakoyeu et al.,
2020) to generate DensePose-C segmentations de-
scribing chimpanzee pose. The model predicts dense
correspondences between image pixels and a 3-D object mesh, where each mesh represents a chimpanzee body part specified by a selector I, with local surface coordinates within each mesh indexed by U and V.
Frame-by-frame application to each of the PanAf-
500 videos yields DensePose-C estimates expressed
in IUV coordinates.
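For illustration, the flow pre-computation step could look as sketched below; this is a minimal sketch assuming opencv-contrib-python exposes cv2.optflow.DualTVL1OpticalFlow_create, and the clipping and 8-bit rescaling of the two flow channels are illustrative storage choices rather than the exact encoding used in our pipeline.

```python
# Minimal sketch: per-frame Dual TV-L1 optical flow (Zach et al., 2007) via OpenCV.
# Assumes opencv-contrib-python; the [-20, 20] clipping and 8-bit rescaling are
# illustrative, not the pipeline's exact encoding.
import cv2
import numpy as np

def precompute_tvl1_flow(video_path):
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_images = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = tvl1.calc(prev_gray, gray, None)            # H x W x 2 (dx, dy)
        flow_img = (np.clip(flow, -20, 20) / 40.0 + 0.5) * 255.0   # map to [0, 255]
        flow_images.append(flow_img.astype(np.uint8))
        prev_gray = gray
    cap.release()
    return flow_images
```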
Each of the three input modalities is fed into its own 3D ResNet-50 (Du Tran et al., 2017) backbone; together these act as the feature extractor (see Fig. 1). The inputs to the backbones are spatio-temporal since they are processed in snippets, that is, each stream accepts a sequence of n consecutive RGB frames, optical flow images, or IUV coordinate frames, respectively. The final fully-connected layer outputs a 128-dimensional encoding for each stream. These are fused into a single
embedding using three popular approaches: (i) simple averaging across streams; (ii) convolutional fusion, whereby stream features are concatenated and passed to a 3D convolutional layer as a volume; and (iii) element-wise multiplication of all three embedding vectors followed by L2 normalisation. The latter two approaches are illustrated in detail in Fig. 4.
Figure 4: Fusion Head Schematics. A component breakdown of fusion by element-wise multiplication (left) and convolutional fusion (right) as applied for our work to explore their impact on performance.
A linear layer at the end of the fusion head finally outputs the unified embedding as logits. Whilst this system is trained via metric learning, visually sketched in Fig. 1 (right), a k-NN classifier is used to perform inference in the embedding space during evaluation.
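To make the fusion variants concrete, the following is a minimal PyTorch sketch of the three strategies; the module structure, layer sizes, and the 2048-channel per-stream feature maps are assumptions of this sketch, not the released implementation.

```python
# Sketch of the three fusion strategies; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuses per-stream 3D ResNet-50 features into a single 128-d embedding."""

    def __init__(self, mode="conv", in_channels=2048, embed_dim=128):
        super().__init__()
        self.mode = mode
        if mode == "conv":
            # Streams are concatenated channel-wise and blended by a 3D convolution.
            self.blend = nn.Conv3d(3 * in_channels, in_channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(in_channels, embed_dim)

    def project(self, feats):
        # Pool a (B, C, T, H, W) feature volume and map it to an embedding.
        return self.fc(self.pool(feats).flatten(1))

    def forward(self, rgb, flow, pose):
        if self.mode == "conv":
            fused = F.relu(self.blend(torch.cat([rgb, flow, pose], dim=1)))
            return self.project(fused)
        embs = [self.project(s) for s in (rgb, flow, pose)]
        if self.mode == "avg":
            return torch.stack(embs).mean(dim=0)
        # mode == "mult": element-wise multiplication followed by L2 normalisation.
        return F.normalize(embs[0] * embs[1] * embs[2], p=2, dim=1)
```

In this sketch, mode="mult" corresponds to the element-wise variant and mode="conv" to convolutional fusion as compared in Tab. 1.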
Let the parameters of this network $f_\theta(\cdot)$ be denoted by $\theta$. Furthermore, let $f_\theta(x) = \mathbf{x}$ be the shorthand for referring to embeddings. Our metric learning objective is, thus, to minimise the distance between anchor-positive embedding pairs $d(\mathbf{x}_a, \mathbf{x}_p)$ and maximise the distance between anchor-negative embedding pairs $d(\mathbf{x}_a, \mathbf{x}_n)$, where $d$ denotes the Euclidean distance. Instead of using the standard triplet loss (Hermans et al., 2017) $\mathcal{L}_{TL}$, we use an improved version (Andrew et al., 2021), where the model is optimised via a hybrid reciprocal triplet and softmax cross-entropy loss:

$$\mathcal{L}_{RC} = \mathcal{L}_{CE} + \lambda \, \mathcal{L}_{RT}. \quad (1)$$

It is assembled from two components balanced by $\lambda = 0.1$ as given in (Andrew et al., 2021). The two
components themselves are evaluated as:

$$\mathcal{L}_{RT} = d(\mathbf{x}_a, \mathbf{x}_p) + \frac{1}{d(\mathbf{x}_a, \mathbf{x}_n)}, \quad (2)$$

$$\mathcal{L}_{CE} = -\log\left(\frac{e^{x_y}}{\sum_{i=1}^{C} e^{x_i}}\right), \quad (3)$$

where $C$ denotes the total number of classes and $y$ are the class labels.
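A minimal PyTorch sketch of Eqs. (1)–(3) could look as follows; the function signatures, the separate logits tensor, and the small epsilon guarding the reciprocal term are assumptions of this sketch rather than the released code.

```python
# Sketch of the hybrid reciprocal triplet + cross-entropy loss (Eqs. 1-3).
import torch.nn.functional as F

def reciprocal_triplet_loss(anchor, positive, negative):
    # Eq. (2): d(x_a, x_p) + 1 / d(x_a, x_n), with d the Euclidean distance.
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return (d_ap + 1.0 / (d_an + 1e-8)).mean()   # epsilon added for numerical safety

def hybrid_loss(emb_a, emb_p, emb_n, logits, labels, lam=0.1):
    # Eq. (1): L_RC = L_CE + lambda * L_RT, with lambda = 0.1.
    l_ce = F.cross_entropy(logits, labels)       # Eq. (3)
    l_rt = reciprocal_triplet_loss(emb_a, emb_p, emb_n)
    return l_ce + lam * l_rt
```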
In order to extend this system into the LTR domain, we substitute the softmax cross-entropy term for losses calculated using: (i) softmax cross-entropy with logit adjustment (Menon et al., 2020) $\mathcal{L}_{LA}$; (ii) class-balanced focal loss (Cui et al., 2019) $\mathcal{L}_{CB}$; and (iii) class-balanced focal loss with weight balancing (Alshammari et al., 2022). The first two losses are evaluated as follows:

$$\mathcal{L}_{LA} = -\log\left(\frac{e^{x_y + \tau \cdot \log \pi_y}}{\sum_{i=1}^{C} e^{x_i + \tau \cdot \log \pi_i}}\right), \quad (4)$$
$$\mathcal{L}_{CB} = -\frac{1-\beta}{1-\beta^{n_y}} \sum_{i=1}^{C} (1-p_i)^{\gamma} \log(p_i), \quad (5)$$
where $\pi$ represents the class priors (i.e., class frequencies in the training set) with temperature factor $\tau = 1$, $\beta = 0.99$ is the re-weighting hyper-parameter, $n_y$ is the number of samples in class $y$, $\gamma = 1$ is the focal loss hyper-parameter, and $p_i = \sigma(x_i)$. Balancing the network weights $\theta$ is performed via a MaxNorm constraint $\|\theta_{l,i}\|_2^2 \leq \delta^2 \; \forall i$, as given in (Alshammari et al., 2022), imposed on each class filter $i$ in the last layer $l$ of the network, where $\delta$ is the L2-norm ball radius. We will reference an $\mathcal{L}_{CB}$-based optimisation where weight balancing is performed via $\mathcal{L}_{WB}$.
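The LTR substitutions can be sketched as below; the sigmoid-based focal formulation follows Cui et al. (2019), and the helper signatures, the epsilon clamps, and the post-step application of the MaxNorm projection are assumptions of this sketch.

```python
# Sketch of the LTR loss substitutes (Eqs. 4-5) and the MaxNorm weight constraint.
import torch
import torch.nn.functional as F

def logit_adjusted_ce(logits, labels, class_priors, tau=1.0):
    # Eq. (4): offset logits by tau * log(pi) before softmax cross-entropy.
    adjusted = logits + tau * torch.log(class_priors).unsqueeze(0)
    return F.cross_entropy(adjusted, labels)

def class_balanced_focal(logits, labels, class_counts, beta=0.99, gamma=1.0):
    # Eq. (5): weight a sigmoid focal term by (1 - beta) / (1 - beta^{n_y}),
    # following the effective-number formulation of Cui et al. (2019).
    n_y = class_counts[labels].float()
    weights = (1.0 - beta) / (1.0 - beta ** n_y)              # one weight per sample
    targets = F.one_hot(labels, num_classes=logits.size(1)).float()
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)           # p_i for target class, 1-p_i otherwise
    focal = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp_min(1e-8))
    return (weights.unsqueeze(1) * focal).sum(dim=1).mean()

def maxnorm_(classifier_weight, delta=1.0):
    # Weight balancing: project each class filter of the last layer onto the
    # L2 ball of radius delta after every optimiser step (MaxNorm constraint).
    with torch.no_grad():
        norms = classifier_weight.norm(p=2, dim=1, keepdim=True).clamp_min(1e-12)
        classifier_weight.mul_((delta / norms).clamp(max=1.0))
```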
Methodologically, the described architecture approaches the learning of great ape behavioural actions via five key capabilities: 1) utilisation of multiple rel-
evant input modalities across an entire video snippet;
2) effective streamed content encoding; 3) fusion into
a single embedding space; 4) metric space optimisa-
tion so that distances naturally reflect semantic sim-
ilarity; and 5) taking into account class imbalances
common to the domain content.
5 EXPERIMENTS
5.1 General Training Setup
We train our architecture via SGD optimisation using
batch size 32 and learning rate $10^{-4}$. Feature extrac-
tor backbones are initialised with Kinetics-400 (Kay
et al., 2017) pre-trained weights and training runs are
distributed over 8 Tesla V100 GPUs for 100 epochs.
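The evaluation protocol, a k-NN classifier over the learned embedding space (Sec. 4), can be sketched as follows; the data-loader interface, the single-tensor model signature, and k = 5 are assumptions of this sketch.

```python
# Sketch of k-NN inference in the learned metric space; k = 5 is an assumption.
import torch
from sklearn.neighbors import KNeighborsClassifier

def embed_dataset(model, loader, device="cuda"):
    # Collect fused embeddings and labels for every snippet in a loader.
    feats, labels = [], []
    model.eval()
    with torch.no_grad():
        for snippets, y in loader:                    # loader yields (inputs, labels)
            emb = model(snippets.to(device))          # assumes model returns the fused embedding
            feats.append(emb.cpu())
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def knn_top1(model, train_loader, test_loader, k=5):
    # Fit a k-NN classifier on training embeddings and score top-1 accuracy on test data.
    x_train, y_train = embed_dataset(model, train_loader)
    x_test, y_test = embed_dataset(model, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    return (knn.predict(x_test) == y_test).mean()

# Training itself uses SGD with batch size 32 and learning rate 1e-4 (Sec. 5.1), e.g.:
# optimiser = torch.optim.SGD(model.parameters(), lr=1e-4)
```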
Table 1: Behavioural Action Recognition Benchmarks. Top-1 and average per-class (C-Avg) accuracy performance on the PanAf-500 dataset for the current state-of-the-art (row 1), single- and dual-stream baselines (rows 2–5), and our triple-stream networks (rows 6–8) for different fusion methodologies and losses tested.

Models/Streams        Fusion   Loss    Top-1     C-Avg
Sakib et al. 2020
1  RGB+OF             LSTM     L_FL    73.52%    42.33%
Up to Dual-Stream
2  RGB only           None     L_TL    55.50%    32.67%
3  RGB only           None     L_RC    74.24%    55.76%
4  RGB+OF             Avg      L_TL    62.90%    39.10%
5  RGB+OF             Avg      L_RC    75.02%    61.97%
Triple-Stream (Ours)
6  RGB+OF+DP          Avg      L_RC    81.71%    46.61%
7  RGB+OF+DP          Conv     L_RC    82.04%    56.31%
8  RGB+OF+DP          Elem     L_RC    85.86%    50.50%
5.2 Baselines and Stream Ablations
As shown in Tab. 1, we first establish performance
benchmarks for one and two stream baseline archi-
tectures of our system (rows 2–5) against the cur-
rent state-of-the-art (row 1), which uses a ResNet-18
backbone with focal loss $\mathcal{L}_{FL}$, SGD, and LSTM-based frame fusion (Sakib and Burghardt, 2020). As expected, we confirmed that, using identical setups and losses, adding an optical flow stream is beneficial in the great ape domain, mirroring HAR results (see rows 2 vs 4, and 3 vs 5). Additionally, models trained using $\mathcal{L}_{RC}$ consistently outperformed standard triplet loss $\mathcal{L}_{TL}$ scenarios (see rows 2 vs 3, and 4 vs 5). Finally, a dual-stream version of our proposed architecture trained with $\mathcal{L}_{RC}$ outperforms the state-of-the-art by a small margin (see rows 1 vs 5).
5.3 Triple-Stream Recognition
As given in Tab. 1 rows 6–8, our proposed triple-
stream architecture significantly outperforms all base-
lines with regards to top-1 accuracy, achieving up to
85.86%. Thus, explicit DensePose-C information appears to be a useful source for boosting behavioural action recognition in great apes. However,
without LTR techniques all our triple-stream models
are significantly outperformed by a dual-stream set-
ting (row 5) with regards to average per-class accu-
racy. This reduction is caused by significantly poorer
performance on minority classes (see Sec. 5.4).
Since the learned behavioural action embeddings are constructed as metric from the outset, they can be visualised meaningfully; we note that such data-driven visualisations are novel in the primatology domain.
[Figure 5 panel labels: Kinetics pretrained (no training); Single Stream (RGB); Dual Stream (RGB+OF); Triple Stream (All Three); Training; Test; legend: the nine behavioural action classes.]
Figure 5: Visualisations of Great Ape Behavioural Action Spaces. A 2D t-SNE (Wattenberg et al., 2016) visualisation of
the 128-dimensional training (top-right) and test (bottom-right) embeddings produced by the single-, dual- and triple-stream
networks with convolutional fusion. We can see that training set embeddings from all classes are clustered cleanly. In contrast,
test set embeddings show significant overlap and only embeddings from majority classes form distinct clusters. This is
consistent with the high top-1 accuracy and relatively low average per-class accuracy reported in Tab. 1.
Fig. 5 depicts such learned spaces for our data and architecture where, independent of stream cardinality, embeddings cluster the training data cleanly. This is of course expected given the above-99% top-1 training accuracy in all settings. Yet, behavioural actions of great apes are highly intricate as well as variable and, even with approx. 144,000 training frames used, the model clearly shows signs of overfitting. As
a result, test set embeddings exhibit significant cluster
overlap. Sample groups representing sitting, standing,
and walking, for instance, blend into one another. In
addition to overfitting, this also highlights the transitional nature of these often temporally adjacent and smoothly changing actions. Thus, future temporally transitional ground truth labelling may be needed to represent great ape behavioural actions in the PanAf-500 dataset more authentically.
5.4 Fusing Streams
When looking at the impact of information fusion
methods on performance in more detail, we find that
benchmarks vary significantly (see Tab. 1 rows 6–8)
when we test averaging, element-wise multiplication,
and convolutional fusion, as described in Sec. 4. Re-
sults show that convolution and element-wise mul-
tiplication improve performance slightly across both
metrics when compared with averaging: top-1 accu-
racy improves by 0.33% and 4.1%, respectively (see
rows 6–8). However, the most significant gains are
observed with respect to average per class accuracy
which increases by 3.44% for element-wise multipli-
cation and 9.7% for convolutional fusion. Learnable
parameters in the convolution method clearly help to blend information even when fewer samples are available for training. Building on this improve-
ment, we will next investigate the impact of LTR
methods in order to benefit tail class performance.
5.5 Long-Tail Recognition
When grouping behavioural actions into head (cov-
ering sitting, standing, and walking) and remain-
ing tail classes based on frequency in the data (see
Fig. 2), a significant performance gap becomes appar-
ent even when using the best-performing C-Avg model so far (see Tab. 2, row 1). Employing LTR techniques can, however, reduce this gap and improve average per-class accuracy further, as quantified across rows 2–4 of Tab. 2. Fig. 6 shows t-SNE visualisations of
the three LTR triple-stream approaches when trained
with convolutional feature fusion. Particularly for the
class-balanced approaches and weight-balancing se-
tups (two rightmost), tail class clusters appear more
clearly separated and class overlap is generally re-
duced. Thus, for the great ape domain, underrepresented classes are indeed an effective source of information for improving action separability in general.
[Figure 6 panel labels (test embeddings): No LTR augmentation; Logit adjustment; CB (+focal loss); Weight balanced; legend: the nine behavioural action classes.]
Figure 6: Long-tail Test Embeddings. A 2D t-SNE visualisation of the 128-dimensional test embeddings produced by the
triple-stream network with convolutional fusion alone (leftmost) and augmented with each LTR technique: (i) logit adjustment,
(ii) CB (+focal loss) and (iii) weight balancing. All LTR-augmented methods improve clustering of embeddings belonging to
tail classes. They appear more clearly separated and exhibit less overlap when compared with the non-LTR method.
6 CONCLUSION
In this work we introduced the first deep metric learn-
ing system for great ape behavioural action recogni-
tion. We demonstrated that the proposed triple-stream
architecture can provide state-of-the-art per-
formance when tested on the PanAf-500 camera trap
dataset covering 180,000 annotated frames across 500
videos taken in the wild. We demonstrated that the ad-
dition of a DensePose-C chimpanzee pose estimation
stream into the embedding architecture is highly ef-
fective and leads to system performance of 85.86%
top-1 accuracy on the data. We also showed that
adding LTR techniques that address poor tail class
performance to the system can improve the average
per-class accuracy to 65.66% on the dataset. Despite these improvements, we note that both larger annotated datasets to counteract overfitting and more temporally blended forms of annotation (e.g. action transition annotations) would benefit the authenticity of data-driven great ape behavioural representations.
We hope that the research presented here sparks fur-
ther interest in this vital application area for the bene-
fit of endangered species such as great apes.
ACKNOWLEDGEMENTS
We thank the Pan African Programme: ‘The Cul-
tured Chimpanzee’ team and its collaborators for al-
lowing the use of their data for this paper. We thank
Amelie Pettrich, Antonio Buzharevski, Eva Martinez
Garcia, Ivana Kirchmair, Sebastian Sch
¨
utte, Linda
Gerlach and Fabina Haas. We also thank manage-
ment and support staff across all sites; specifically
Yasmin Moebius, Geoffrey Muhanguzi, Martha Rob-
bins, Henk Eshuis, Sergio Marrocoli and John Hart.
Thanks to the team at https://www.chimpandsee.org
particularly Briana Harder, Anja Landsmann, Laura
K. Lynn, Zuzana Macháčková, Heidi Pfund, Kris-
teena Sigler and Jane Widness. The work that al-
lowed for the collection of the dataset was funded
by the Max Planck Society, Max Planck Society In-
novation Fund, and Heinz L. Krekeler. In this re-
spect we would like to thank: Ministre des Eaux et
Forêts, Ministère de l'Enseignement supérieur et de la Recherche scientifique in Côte d'Ivoire; Institut Congolais pour la Conservation de la Nature, Ministère de
la Recherche Scientifique in Democratic Republic of
Congo; Forestry Development Authority in Liberia;
Direction Des Eaux Et Forêts, Chasses Et Conserva-
tion Des Sols in Senegal; Makerere University Bi-
ological Field Station, Uganda National Council for
Science and Technology, Uganda Wildlife Authority,
National Forestry Authority in Uganda; National In-
stitute for Forestry Development and Protected Area
Management, Ministry of Agriculture and Forests,
Ministry of Fisheries and Environment in Equatorial
Guinea. This work was supported by the UKRI CDT
in Interactive AI under grant EP/S022937/1.
Table 2: LTR-enabled Behavioural Action Recognition Benchmarks. Average per-class accuracy for our triple-stream network with convolutional fusion for the best-performing non-LTR method (row 1), and three LTR approaches (rows 2–4) targeting poor tail class performance.

Method/Loss        C-Avg    Head     Tail
Non-LTR Triple-Stream
1  L_RC            56.31    80.57    44.78
LTR Triple-Stream
2  L_LA            61.76    83.22    50.70
3  L_CB            63.56    77.60    55.95
4  L_WB            65.66    82.55    56.26
REFERENCES
Almond, R., Grooten, M., Juffe Bignoli, D., and Petersen,
T. (2022). WWF Living Planet Report 2022 - building a nature-positive society. 1
Alshammari, S., Wang, Y.-X., Ramanan, D., and Kong, S.
(2022). Long-tailed recognition via weight balancing.
In CVPR, pages 6897–6907. 2, 3, 5
Andrew, W., Gao, J., Mullan, S., Campbell, N., Dowsey,
A. W., and Burghardt, T. (2021). Visual identification
of individual holstein-friesian cattle via deep metric
learning. Computers and Electronics in Agriculture,
185:106133. 4
Bain, M., Nagrani, A., Schofield, D., Berdugo, S., Bessa, J.,
Owen, J., Hockings, K. J., Matsuzawa, T., Hayashi,
M., Biro, D., et al. (2021). Automated audiovisual
behavior recognition in wild primates. Science ad-
vances, 7(46):eabi4883. 1, 3
Carvalho, S., Wessling, E. G., Abwe, E. E., Almeida-
Warren, K., Arandjelovic, M., Boesch, C., Danquah,
E., Diallo, M. S., Hobaiter, C., Hockings, K., et al.
(2022). Using nonhuman culture in conservation re-
quires careful and concerted action. Conservation Let-
ters, 15(2):e12860. 1
Congdon, J., Hosseini, M., Gading, E., Masousi, M.,
Franke, M., and MacDonald, S. (2022). The future
of artificial intelligence in monitoring animal identifi-
cation, health, and behaviour. 1
Cui, Y., Jia, M., Lin, T.-Y., Song, Y., and Belongie, S.
(2019). Class-balanced loss based on effective num-
ber of samples. In CVPR, pages 9268–9277. 2, 3,
5
Dominoni, D. M., Halfwerk, W., Baird, E., Buxton, R. T.,
Fernández-Juricic, E., Fristrup, K. M., McKenna,
M. F., Mennitt, D. J., Perkin, E. K., Seymoure, B. M.,
et al. (2020). Why conservation biology can benefit
from sensory ecology. Nature Ecology & Evolution,
4(4):502–511. 1
Du Tran, Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2017). A closer look at spatiotemporal convolutions for action recognition. 4
Duan, M., Qiu, H., Zhang, Z., and Wu, Y. (2021). Ntu-
densepose: A new benchmark for dense pose action
recognition. In Big Data, pages 3170–3175. IEEE. 3
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
Slowfast networks for video recognition. In ICCV,
pages 6202–6211. 2
Hayakawa, J. and Dariush, B. (2020). Recognition and
3d localization of pedestrian actions from monocular
video. In ITSC, pages 1–7. IEEE. 3
Hermans, A., Beyer, L., and Leibe, B. (2017). In defense
of the triplet loss for person re-identification. arXiv
preprint arXiv:1703.07737. 1, 4
Hong, J., Cho, B., Hong, Y. W., and Byun, H. (2019).
Contextual action cues from camera sensor for multi-
stream action recognition. Sensors, 19(6):1382. 3
IUCN (2022). IUCN Red List of Threatened Species, version 2022.1. 1
Kalfaoglu, M. E., Kalkan, S., and Alatan, A. A. (2020).
Late temporal modeling in 3d cnn architectures with
bert for action recognition. In ECCV, pages 731–747.
Springer. 2
Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng,
J., and Kalantidis, Y. (2019). Decoupling representa-
tion and classifier for long-tailed recognition. arXiv
preprint arXiv:1910.09217. 3
Karaderi, T., Burghardt, T., Hsiang, A. Y., Ramaer, J., and
Schmidt, D. N. (2022). Visual microfossil identifica-
tion via deep metric learning. In ICPRAI, pages 34–
46. Springer. 1
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., et al. (2017). The kinetics human action
video dataset. arXiv preprint arXiv:1705.06950. 5
Kühl, H. S. and Burghardt, T. (2013). Animal biomet-
rics: quantifying and detecting phenotypic appear-
ance. TREE, 28(7):432–441. 1
Le, V.-T., Tran-Trung, K., and Hoang, V. T. (2022). A com-
prehensive review of recent deep learning techniques
for human activity recognition. Computational Intel-
ligence and Neuroscience, 2022. 2
Li, Y., Lu, Z., Xiong, X., and Huang, J. (2022). Perf-net:
Pose empowered rgb-flow net. In WACV, pages 513–
522. 3
Lin, J., Gan, C., and Han, S. (2019). Tsm: Temporal shift
module for efficient video understanding. In ICCV,
pages 7083–7093. 2
Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu,
S. X. (2019). Large-scale long-tailed recognition in
an open world. In CVPR, pages 2537–2546. 3
Majd, M. and Safabakhsh, R. (2020). Correlational con-
volutional lstm for human action recognition. Neuro-
computing, 396:224–229. 2
Menon, A. K., Jayasumana, S., Rawat, A. S., Jain, H., Veit,
A., and Kumar, S. (2020). Long-tail learning via logit
adjustment. arXiv preprint arXiv:2007.07314. 2, 3, 5
Musgrave, K., Belongie, S., and Lim, S.-N. (2020). Pytorch
metric learning. 1
Nishida, T., Kano, T., Goodall, J., McGrew, W. C., and
Nakamura, M. (1999). Ethogram and ethnography
of mahale chimpanzees. Anthropological Science,
107(2):141–188. 1
Pan, Y., Xu, J., Wang, M., Ye, J., Wang, F., Bai, K., and
Xu, Z. (2019). Compressing recurrent neural networks
with tensor ring for action recognition. In AAAI, vol-
ume 33, pages 4683–4690. 2
Sakib, F. and Burghardt, T. (2020). Visual recognition of
great ape behaviours in the wild. VAIB. 1, 2, 3, 4, 5
Sanakoyeu, A., Khalidov, V., McCarthy, M. S., Vedaldi, A.,
and Neverova, N. (2020). Transferring dense pose to
proximal animal classes. In CVPR, pages 5233–5242.
2, 4
Shaikh, M. B. and Chai, D. (2021). Rgb-d data-based action
recognition: A review. Sensors, 21(12):4246. 2
Sharir, G., Noy, A., and Zelnik-Manor, L. (2021). An image
is worth 16x16 words, what is a video worth? arXiv
preprint arXiv:2103.13915. 2
Simonyan, K. and Zisserman, A. (2014). Two-stream con-
volutional networks for action recognition in videos.
NeurIPS, 27. 2, 3
Tran, D., Wang, H., Torresani, L., and Feiszli, M. (2019).
Video classification with channel-separated convolu-
tional networks. In ICCV, pages 5552–5561. 2
Tuia, D., Kellenberger, B., Beery, S., Costelloe, B. R.,
Zuffi, S., Risse, B., Mathis, A., Mathis, M. W., van
Langevelde, F., Burghardt, T., et al. (2022). Perspec-
tives in machine learning for wildlife conservation.
Nature communications, 13(1):1–15. 1
Verma, V., Lamb, A., Beckham, C., Najafi, A., Courville,
A., Mitliagkas, I., and Bengio, Y. (2018). Manifold
mixup: learning better representations by interpolat-
ing hidden states. 3
Wang, L., Yuan, X., Zong, M., Ma, Y., Ji, W., Liu, M.,
and Wang, R. (2021). Multi-cue based four-stream 3d
resnets for video-based action recognition. Informa-
tion Sciences, 575:654–665. 2
Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to
use t-sne effectively. Distill. 6
Zach, C., Pock, T., and Bischof, H. (2007). A duality based
approach for realtime tv-l 1 optical flow. In DAGM,
pages 214–223. Springer. 4
Zamma, K. and Matsusaka, T. (2015). Ethograms and
the diversity of behaviors, page 510–518. Cambridge
University Press. 1
Zhang, Y., Li, X., Liu, C., Shuai, B., Zhu, Y., Brattoli, B.,
Chen, H., Marsic, I., and Tighe, J. (2021a). Vidtr:
Video transformer without convolutions. In ICCV,
pages 13577–13587. 2
Zhang, Y., Wei, X.-S., Zhou, B., and Wu, J. (2021b). Bag
of tricks for long-tailed visual recognition with deep
convolutional neural networks. In AAAI, volume 35,
pages 3447–3455. 3
Zhou, B., Andonian, A., Oliva, A., and Torralba, A. (2018).
Temporal relational reasoning in videos. In ECCV,
pages 803–818. 2
Zong, M., Wang, R., Chen, X., Chen, Z., and Gong, Y.
(2021). Motion saliency based multi-stream multiplier
resnets for action recognition. Image and Vision Com-
puting, 107:104108. 2