Human Object Interaction Detection Primed with Context
Maya Antoun (https://orcid.org/0000-0001-6017-5804) and Daniel Asmar (https://orcid.org/0000-0002-4932-9777)
Vision and Robotics Lab, American University of Beirut, Bliss Street, Beirut, Lebanon
Keywords: Human Object Interaction, Scene Understanding, Deep Learning.
Abstract:
Recognizing Human-Object Interaction (HOI) in images is a difficult yet fundamental requirement for scene understanding. Despite the significant advances deep learning has achieved so far in this field, the performance of state-of-the-art HOI detection systems is still very low. Contextual information about the scene has been shown to improve prediction. However, most works that use semantic features rely on general word embedding models to represent the objects or the actions rather than on contextual embeddings. Motivated by evidence from the field of human psychology, this paper suggests contextualizing actions by pairing their verbs with their relative objects at an early stage. The proposed system consists of two streams: a semantic memory stream on one hand, where verb-object pairs are represented via a graph network by their corresponding feature vectors; and an episodic memory stream on the other hand, in which human-object interactions are represented by their corresponding visual features. Experimental results indicate that our proposed model achieves comparable results on the HICO-DET dataset with a pretrained object detector and superior results with a fine-tuned detector.
1 INTRODUCTION
Perception stands as one of the fundamental building blocks of a completely autonomous system. Living beings rely on perception for their survival; we perceive the environment around us, the objects we interact with, as well as other humans. Despite its apparent simplicity, visual perception is difficult to realize in autonomous agents, primarily because of our shortcomings in understanding and replicating the human solutions that involve higher levels of cognition. In the past decade, and with the notable developments in deep learning, significant steps forward have been achieved in scene perception and understanding. More specifically, and of particular interest to this paper, is the problem of Human-Object Interaction (HOI) detection. Given an input image, the aim of HOI detection is to localize and estimate the interactions between humans and the objects around them by predicting the triplet <human, predicate, object>. Detecting these interactions requires knowledge of the humans and objects as well as of the possible interactions between them.
Researchers have solved the HOI problem using two types of methods: two-stage and single-stage. The first stage in a two-stage method is the detection of the humans and objects using an off-the-shelf detector; in the second stage, the interaction between them is predicted using the extracted features. In single-stage systems, the object detection and interaction prediction are done in parallel or in an end-to-end manner. Most of the existing two-stage systems (Gkioxari et al., 2018; Gao et al., 2018; Baldassarre et al., 2020; Hou et al., 2020; Li et al., 2020a) rely on interpreting the scene based on its appearance as well as the geometric layout of objects and people within the scene. In some of these works, contextual information is incorporated only through features from the union region of the human and object bounding boxes, and the relevant context is not always captured by these union-region features. Other systems (Li et al., 2019; Wan et al., 2019; Song et al., 2020) solve the HOI problem by estimating the pose of detected people in addition to the spatial and visual features. However, the human performing the action on the object is not always fully visible in the image and can be occluded by objects or other humans. Therefore, relying on human appearance and pose alone for action prediction is not sufficient.
Alternatively, other two-stage networks (Liang et al., 2020; Bansal et al., 2020a; Kim et al., 2020b; Zhou et al., 2020; Li et al., 2020b; Sun et al., 2020; Wang et al., 2020a; Xu et al., 2019; Gao et al., 2020; Liu et al., 2020; Hou et al., 2021) predict HOIs by integrating semantics into the network architecture.
Figure 1: The perception behind HOI prediction. The episodic memory recalls contextual details from previous experiences, and the semantic memory recalls facts, concepts, and ideas about the objects and the scene in question. The episodic memory and the semantic memory interplay and together help predict the correct interaction from the image.
Xu (Xu et al., 2019) construct a knowledge graph between objects and actions based on the semantic features of the ground-truth annotations of the training dataset and an external source. Bansal (Bansal et al., 2020a) integrate visual and spatial features with general word embeddings of humans and objects. Gao (Gao et al., 2020) propose a dual relation graph that uses a spatial-semantic representation to describe each human-object pair. Liu (Liu et al., 2020) build a consistency graph that encodes the relations among objects, actions, and interactions. Liang (Liang et al., 2020) build a dual-graph attention network that aggregates contextual visual, spatial, and semantic information. These works use general vector representation models to represent the actions and objects, which do not take into consideration the semantic context of the object in the image, except for the established edges between the nodes in the graphs. Thus, the same action is represented by the same feature vector when mentioned with different objects.
To improve HOI detection, recent works have developed one-stage pipelines to detect HOIs in a single shot. Single-stage methods (Liao et al., 2020; Wang et al., 2020b; Kim et al., 2020a) localize the interaction with an interaction point or find the anchor box of a human-object pair. Contextual features are extracted around the detected point or box. The interacting triplets are predicted by matching the detected objects with the localized interactions and manually searching for a threshold. Later, single-stage methods were improved with end-to-end transformer-based methods (Zou et al., 2021; Chen et al., 2021; Kim et al., 2021; Tamura et al., 2021). A transformer-based contextual self-attention mechanism is used to detect the interacting pairs and predict their interaction simultaneously. In these single-stage methods, contextual features are extracted visually from the image without any semantic representations. However, relying on visual context can be tricky in images where details are not clearly visible, such as in paintings and artwork. In our work, we rely on improving the semantic contextual representation of the objects and verbs. Our model does not rely only on visual features, because context from semantic features helps the model become more robust to any type of image, whether or not details such as edges are visible. Therefore, we compare our work only to two-stage methods.
To solve the problem of extracting context from the semantic representation instead of the visual representation, we rely on findings from human psychology on action perception. Nelissen (Nelissen et al., 2005) argue that action information without knowledge about the identity of the object acted upon is not sufficient to provide a full understanding of the observed action. Also, Gallese (Gallese et al., 1996) state that movement analysis in humans depends on the presence of objects: the cortical responses for goal-directed actions are different from the responses evoked when the same action is executed without the presence of the object. Moreover, Bub (Bub and Masson, 2006) show that observers build specific forms of gestural knowledge obtained from the conceptual representations of the objects. This suggests the importance of object priming in the representation of the action.
Two types of memory interplay to guide the visual search for targets in a scene: (1) the episodic memory, located in the hippocampus, answers questions about the position, colors, edges, and context in which the object was last seen in previous experiences; and (2) the semantic memory, located in the neocortex, includes facts, concepts, and ideas independent of personal experience. The semantic memory answers general questions, such as the affordance, the name and type, or the usual position of the object of interest. Semantic information is derived from the accumulation of episodic memory. Therefore, the two are interrelated and together form a complete picture of the scene (Figure 1).
This paper is motivated by human psychology, where the human brain, to infer an interaction, relies not only on the human performing the action but also on the object being interacted with. Similar to vision transformers, we use BERT, a transformer-based
word embedding model, for the contextual semantic representation of the action in the context of the detected object in the GCN. By using the visual features of the object and of the union box to measure the similarity between the interaction's visual and semantic features, we build the connection between both memories.
In this paper we apply the idea to HOI detection by priming context into the encoding of actions (i.e., verbs) at different levels of a deep network. We rely on the detected object's visual-spatial features as well as its semantic relationship to actions. To benefit from the influence of the object on the interaction prediction, we change the semantic representation of the actions based on their presence with the object.
Our network consists of two streams. The first is the episodic memory stream, in which features corresponding to the visual appearance, the spatial layout, and the physical configuration of the people and objects, as well as that of the action, are extracted. The second is the semantic memory stream, in which a graph convolutional network (GCN) is built between the objects and the actions. The objects and the actions are represented in the affordance-based graph by their personalized contextual vector representations extracted from a contextual word embedding model. The verb-object dependence is enforced by representing the action features by their word embeddings when presented with the detected object. The features from the episodic memory stream and the semantic memory stream are then used together to predict the human-object interaction.
The main contributions of this paper include:
• We build an object-related graph where the action nodes are represented by their contextual embeddings when mentioned with the object, which improves the model performance by contextualizing the graph.
• By feeding the visual features from the episodic memory to the semantic features from the semantic memory, we build the relationship between the two memories and enhance the graph output.
• Our approach outperforms two-stage state-of-the-art methods on the challenging HICO-DET dataset with a fine-tuned object detector and shows comparable results with an object detector pretrained on COCO.
2 SYSTEM OVERVIEW
Figure 2 presents the flowchart of our proposed HOI detection system. Given an input image, the objective is to predict the triplet <human, verb, object> for each candidate human-object pair.
The model consists of two streams: an episodic memory stream and a semantic memory stream. For the episodic memory stream, we use a pretrained object detection model to detect the humans and objects inside query images. Then, using a feature extraction backbone, visual features are extracted from each human F^v_H and object F^v_O individually on one hand, and from the union of the human and object boxes F^v_Int on the other hand. Moreover, a spatial attention feature map F^sp is created from the human and object bounding boxes. For the human pose features F^p_H, the 2D human body pose is extracted using a pretrained pose estimation model (RMPE (Fang et al., 2017)) (Section 2.1). These features are used together to predict the HOI.
For the semantic memory stream, semantic information is represented by a knowledge graph, which is built from the ground-truth annotations of the training data. To render the training dataset more comprehensive, we augment it with the ConceptNet database (Speer et al., 2016), which adds nodes to the verb-object graph based on the affordance knowledge of the objects. First, semantic features of the detected object's class F^s_O and of its related verbs F^s_V are extracted using BERT (Devlin et al., 2018), a contextual word embedding model. The object's semantic features F^s_O are concatenated with its visual features F^v_O to create the object node F^vs_O. The graph network is updated through convolutions, and the new interaction features F^s_Int are compared with the interaction's visual features F^v_Int for HOI prediction (Section 2.2). Finally, the loss functions calculated from each module are added together, using a weighted sum, to obtain the final verb prediction loss function.
2.1 Episodic Memory Stream
Given an input image, a pretrained object detection model (Faster R-CNN) detects the candidate humans and objects and estimates the coordinates of the bounding boxes for humans BB_h and objects BB_o. A feature extraction backbone, ResNet-101, is used to extract visual features from the cropped human boxes F^v_H and object boxes F^v_O respectively. Moreover, visual features from the union of the human and object boxes are extracted and represent the interaction visual features F^v_Int. A spatial attention feature map F^sp is generated from the human and object bounding boxes following (Bansal et al., 2020b) and (Chao et al., 2018). We use a two-channel binary image representation to model the spatial relationship between a human and an object. The union of the two bounding boxes is taken as a reference and re-scaled to a fixed size.
Figure 2: System flowchart. It consists of a visual-spatial module (Section 2.1), where visual, spatial, and pose features are extracted for the object, the human, and the interaction between them, and a knowledge graph module (Section 2.2), where semantic features are extracted for the detected object and all the related candidate actions. The input features of the detected object-related actions are replaced with their contextual ones. A cross-entropy loss is computed for each spatial-visual branch, and a cosine similarity is calculated between the candidate interaction semantic features and the interaction's visual features.
A two-channel binary image is then created: in the first channel, ones are filled in the human bounding box and zeros in the object bounding box; in the second channel, ones are filled in the object bounding box and zeros in the human bounding box. This two-channel binary image is fed into a two-layer convolutional network to extract the spatial attention feature map F^sp. To extract the human pose features F^p_H, we follow the work of (Li et al., 2019). We apply RMPE (Fang et al., 2017) to extract 17 keypoints from the union bounding box. Then, the keypoints are connected with lines of different gray values, ranging from 0.15 to 0.95, representing the different body parts. Finally, the union box is reshaped to 64x64 to construct the pose map F^p_H.
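As an illustration, the two-channel spatial map and the small convolutional network that produces F^sp could be sketched as follows (a minimal PyTorch sketch; the map resolution, channel widths, and helper names are our own assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

def spatial_map(h_box, o_box, union_box, size=64):
    """Two-channel binary map: channel 0 marks the human box, channel 1 the
    object box, both expressed in the coordinate frame of the union box."""
    ux, uy, uw, uh = union_box                      # (x, y, width, height)
    m = torch.zeros(2, size, size)
    for c, (x, y, w, h) in enumerate([h_box, o_box]):
        x0, y0 = int((x - ux) / uw * size), int((y - uy) / uh * size)
        x1, y1 = int((x + w - ux) / uw * size), int((y + h - uy) / uh * size)
        m[c, max(y0, 0):min(y1, size), max(x0, 0):min(x1, size)] = 1.0
    return m

# Two-layer CNN producing the spatial attention feature F_sp (sizes assumed).
spatial_net = nn.Sequential(
    nn.Conv2d(2, 64, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(64, 32, 5, stride=2, padding=2), nn.ReLU(),
    nn.Flatten(),
)
m = spatial_map(h_box=[10, 20, 50, 80], o_box=[40, 60, 30, 30],
                union_box=[10, 20, 60, 80])
F_sp = spatial_net(m.unsqueeze(0))                  # shape: (1, 8192)
```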
For each detected human, the visual, spatial, and pose features are concatenated and fed to a fully connected layer followed by a Sigmoid activation function to obtain the action prediction score s_H based on the human information. Similarly, an action prediction score s_O is obtained by concatenating the object and union appearance features and feeding them to a fully connected layer followed by a Sigmoid activation function. These scores are used to compute the individual cross-entropy losses, L^H_cross for the human and L^O_cross for the object.
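A rough sketch of these two scoring branches and their losses is given below (feature dimensions, layer names, and the use of multi-label binary cross-entropy are illustrative assumptions on our part):

```python
import torch
import torch.nn as nn

NUM_VERBS = 117                       # HICO-DET action classes
B = 4                                 # a batch of candidate human-object pairs

# Dummy features standing in for the extracted ones (dimensions assumed).
F_H_v, F_O_v, F_Int_v = (torch.randn(B, 2048) for _ in range(3))
F_sp = torch.randn(B, 8192)           # spatial attention features
F_H_p = torch.randn(B, 64 * 64)       # flattened 64x64 pose map
verb_targets = torch.zeros(B, NUM_VERBS)
verb_targets[:, 5] = 1.0              # multi-hot ground-truth verbs

class BranchScore(nn.Module):
    """Concatenates its input features and predicts per-verb scores."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, NUM_VERBS)
    def forward(self, *feats):
        return torch.sigmoid(self.fc(torch.cat(feats, dim=-1)))

human_branch = BranchScore(2048 + 8192 + 64 * 64)   # F_H^v, F_sp, F_H^p
object_branch = BranchScore(2048 + 2048)            # F_O^v, F_Int^v

bce = nn.BCELoss()
s_H, s_O = human_branch(F_H_v, F_sp, F_H_p), object_branch(F_O_v, F_Int_v)
L_H_cross, L_O_cross = bce(s_H, verb_targets), bce(s_O, verb_targets)
```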
2.2 Semantic Memory Stream
To model the semantic representation of each object and action, we follow the work of (Xu et al., 2019). A graph convolutional network (GCN) G (Kipf and Welling, 2016) is built, whose aim is to model the relations between nodes N connected by edges E. The purpose of building a GCN in our system is to learn the features of the object and candidate action nodes by inspecting the relationships between them. An adjacency matrix is used to encode the connections between the nodes. In our case, undirected edges are used, so the connection between two nodes is the same in both directions.
We first use the ground-truth annotations of the training dataset to extract the nodes and edges of the graph. To get a richer graph, we use an external dataset, ConceptNet, which includes affordance-based relationships for all the objects in the database. To ensure that the affordance of each object is well represented in the graph, we extract all entries with the usedfor relationship, which gives all possible triplets <object, usedfor, action>. Thus, the edges in the graph connect the objects with the actions that might occur with them, and no connections are established between objects and actions that are never mentioned together within the database. For example, if the detected object is motorcycle, the actions connected to it in the graph include sit on, ride, hold, wash, and clean, while actions such as eat, cook, and read have no connection to it. Adding these affordance-based nodes enriches the graph network with nodes that help produce better action predictions.
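As a concrete sketch of how such an affordance graph and its adjacency matrix could be assembled (the triplets, node ordering, and normalization shown here are illustrative; the authors' actual construction may differ):

```python
import numpy as np

# <object, usedfor, action> pairs gathered from the training annotations and
# from ConceptNet's UsedFor relation (example entries only).
triplets = [
    ("motorcycle", "ride"), ("motorcycle", "sit_on"), ("motorcycle", "wash"),
    ("cell_phone", "talk_on"), ("cell_phone", "text_on"), ("cell_phone", "hold"),
]

objects = sorted({o for o, _ in triplets})
verbs = sorted({v for _, v in triplets})
nodes = objects + verbs                       # object nodes first, then verbs
idx = {name: i for i, name in enumerate(nodes)}

# Binary, symmetric (undirected) adjacency with self-loops.
A = np.eye(len(nodes))
for obj, verb in triplets:
    A[idx[obj], idx[verb]] = A[idx[verb], idx[obj]] = 1.0

# Symmetric normalization D^(-1/2) A D^(-1/2), commonly used with GCNs.
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))
```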
As a first stage, the detected object's semantic features are represented by general word embedding features F^s_O. We rely on the pretrained BERT model (Devlin et al., 2018) to extract a vector representation of each object word.
Based on the detected object, the interaction phrase composed of the verb followed by the object, <verb, object>, is fed to the word embedding model, and the vector representation of the verb F^s_V (the first word) is extracted in the context of the object. These new representations are more specific to the detected object, and their features are tailored to that object.
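A minimal sketch of extracting such an object-primed verb vector with a pretrained BERT (using the HuggingFace transformers library; averaging the verb's sub-word tokens is our assumption, since the paper only states that the first word's features are taken):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def verb_in_context(verb, obj):
    """Embed '<verb> <object>' and return the verb's contextual 768-d vector."""
    enc = tokenizer(f"{verb} {obj}", return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]        # (num_tokens, 768)
    n = len(tokenizer.tokenize(verb))                    # verb sub-word count
    return hidden[1:1 + n].mean(dim=0)                   # skip [CLS], pool verb

# The same verb receives different vectors depending on the paired object.
f_ride_motorcycle = verb_in_context("ride", "motorcycle")
f_ride_horse = verb_in_context("ride", "horse")
```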
A context-based GCN is constructed in which the object node is represented by the concatenation of the object's semantic features F^s_O with its visual features F^v_O, and the verb nodes are represented by their semantic word embedding features. The object-related verbs are represented by their contextual semantic features, whereas the other, non-related verbs are represented by their general semantic features. The adjacency matrix A of the GCN is binary, indicating whether or not two nodes are connected.
Given the node features F^s_O and F^s_V and the adjacency matrix A, the semantic feature representations of the nodes at the (i+1)-th layer are extracted using the forward pass of the GCN, defined as:

F_{i+1} = σ(W_i F_i A')                (1)

where A' is the normalized adjacency matrix, W_i is the learned weight at the i-th layer, and F_i is the feature vector representation of the nodes at the i-th layer. σ is the non-linear activation function applied to the output of the convolution in order to represent the non-linear features in the latent dimension. The output of the GCN is a feature vector representing the objects F'^s_O and the verbs F'^s_V.
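A two-layer GCN following Eq. (1) could be sketched as below; the 768-to-512 layer widths and the LeakyReLU slope of 0.2 follow Section 3.1, while the rest (toy graph, standard propagation order) are our assumptions:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution using the standard propagation sigma(A' F W),
    which matches Eq. (1) up to the transposition convention."""
    def __init__(self, in_dim, out_dim, slope=0.2):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.LeakyReLU(slope)          # LeakyReLU(0.2), Section 3.1

    def forward(self, F, A_norm):
        # A_norm: (N, N) normalized adjacency; F: (N, in_dim) node features.
        return self.act(A_norm @ self.W(F))

N = 10                                          # toy graph with 10 nodes
A_norm = torch.eye(N)                           # placeholder normalized adjacency
F0 = torch.randn(N, 768)                        # BERT node features (768-d)
layer1, layer2 = GCNLayer(768, 512), GCNLayer(512, 512)
F_updated = layer2(layer1(F0, A_norm), A_norm)  # updated node features F'
```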
At the output of the GCN, we concatenate the updated object semantic features and its visual features with those of the candidate verbs. This concatenation represents the interaction semantic representation F^s_Int in the context of the object. We found that the addition of the object's visual features yields a better representation of the object in the scene context. The cosine similarity between the visual F^v_Int and semantic F^s_Int interaction features is calculated, and the interaction with the highest similarity score s_g is considered to be the graph prediction. Inspired by (Salvador et al., 2017), the feature representations F^v_Int and F^s_Int are mapped into a joint embedding space as φ_v = W_v F^v_Int + b_v and φ_g = W_g F^s_Int + b_g respectively, where W_v and W_g are the learned embedding weights. The cosine similarity loss is then defined as:

L_cos = 1 - cos(φ_v, φ_g)              if y = 1
L_cos = max(0, cos(φ_v, φ_g) - α)      if y = 0                (2)

where α is the margin and y is set to 1 if the candidate verb is the ground truth and to 0 otherwise.
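A sketch of the joint embedding and the loss of Eq. (2); the margin α = 0.1 is the value reported in Section 3.1, while the feature and embedding dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedCosineLoss(nn.Module):
    """Maps visual and graph features into a joint space and applies Eq. (2)."""
    def __init__(self, vis_dim, sem_dim, joint_dim=512, alpha=0.1):
        super().__init__()
        self.Wv = nn.Linear(vis_dim, joint_dim)    # phi_v = Wv F_Int^v + b_v
        self.Wg = nn.Linear(sem_dim, joint_dim)    # phi_g = Wg F_Int^s + b_g
        self.alpha = alpha

    def forward(self, F_int_v, F_int_s, y):
        cos = F.cosine_similarity(self.Wv(F_int_v), self.Wg(F_int_s), dim=-1)
        pos = 1.0 - cos                            # y = 1: ground-truth verb
        neg = torch.clamp(cos - self.alpha, min=0) # y = 0: other candidates
        return torch.where(y.bool(), pos, neg).mean()

loss_fn = JointEmbedCosineLoss(vis_dim=2048, sem_dim=1024)
L_cos = loss_fn(torch.randn(8, 2048), torch.randn(8, 1024),
                torch.tensor([1, 0, 0, 1, 0, 0, 0, 1]))
```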
The calculated losses from the episodic memory and the semantic memory modules are added together, using a weighted sum, to get the final loss function:

L_total = λ_1 L_cos + λ_2 L^H_cross + λ_3 L^O_cross                (3)

where λ_1, λ_2, and λ_3 are the weights assigned to each individual loss to control its contribution to the total. The final target is to minimize the total loss in (3).
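With the weights reported in Section 3.1 (λ_1 = λ_2 = 1, λ_3 = 2), the training objective of Eq. (3) reduces to a weighted sum; the individual loss values below are placeholders standing in for the outputs of the two streams:

```python
import torch

L_cos = torch.tensor(0.42, requires_grad=True)       # semantic memory stream
L_H_cross = torch.tensor(0.71, requires_grad=True)   # human branch
L_O_cross = torch.tensor(0.65, requires_grad=True)   # object branch

lam1, lam2, lam3 = 1.0, 1.0, 2.0                      # weights from Section 3.1
L_total = lam1 * L_cos + lam2 * L_H_cross + lam3 * L_O_cross   # Eq. (3)
L_total.backward()                                    # gradients reach both streams
```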
3 IMPLEMENTATION AND EXPERIMENTS
3.1 Experimental Setup
Experiments are performed on the HICO-DET dataset (Chao et al., 2018) for Human-Object Interaction prediction. HICO-DET is a large dataset for detecting HOIs. It includes 38,118 training and 9,658 testing images covering 80 objects and 117 action verbs, and annotates the images for 600 human-object interaction classes. Interactions that occur fewer than 10 times in the training set are categorized as Rare; there are 138 Rare and 462 Non-Rare interactions in HICO-DET. We also use the ConceptNet database (Speer et al., 2016) to extract all the affordance-based relationships and use their action verbs as additional nodes. ConceptNet includes data from crowd-sourced resources, expert-created resources, and games with a purpose, including Wiktionary, a free multilingual dictionary, and OpenCyc.
We follow the method of (Chao et al., 2018) to evaluate the performance of the proposed system, using the role mean average precision (role mAP) metric. A predicted human-object interaction is considered correct if the human and object bounding boxes both have an Intersection over Union (IoU) greater than 0.5 with the ground-truth boxes and if the verb class label of the pair's interaction is correct.
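For reference, this matching criterion can be written as a small predicate (a sketch of the standard check, not the official HICO-DET evaluation code):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def is_true_positive(pred, gt, thr=0.5):
    """pred and gt are dicts with 'h_box', 'o_box', and 'verb' entries."""
    return (pred["verb"] == gt["verb"]
            and iou(pred["h_box"], gt["h_box"]) > thr
            and iou(pred["o_box"], gt["o_box"]) > thr)
```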
We rely on the pretrained Faster R-CNN (Ren et al., 2015) for human and object detection during training. Thresholds of 0.8 for the human detection score and 0.4 for the object detection score are set; these values are chosen experimentally.
Table 1: State-of-the-art comparison (mAP) on the HICO-DET test set.

                                                                       Default                  Known Object
Method                               Detector   Backbone          Full   Rare   Non Rare   Full   Rare   Non Rare
Bansal (Bansal et al., 2020a)        HICO-DET   ResNet-101        21.96  16.43  23.62      -      -      -
VCL (Hou et al., 2020)               HICO-DET   ResNet-50         23.63  17.21  25.55      25.98  19.12  28.03
DRG (Gao et al., 2020)               HICO-DET   ResNet-50-FPN     24.53  19.47  26.04      27.98  23.11  29.43
IDN (Li et al., 2020a)               HICO-DET   ResNet-50         26.29  22.61  27.39      28.24  24.47  29.37
SCG (Zhang et al., 2020)             HICO-DET   ResNet-50-FPN     31.33  24.72  33.31      -      -      -
OURS                                 HICO-DET   ResNet-101        32.51  24.92  34.78      34.27  28.77  35.91
InteractNet (Gkioxari et al., 2018)  COCO       ResNet-50-FPN     9.94   7.16   10.77      -      -      -
GPNN (Qi et al., 2018)               COCO       ResNet-50         13.11  9.34   14.23      -      -      -
iCAN (Gao et al., 2018)              COCO       ResNet-50         14.84  10.45  16.15      16.26  11.33  17.73
Xu (Xu et al., 2019)                 COCO       ResNet-50         14.70  13.26  15.13      -      -      -
Bansal (Bansal et al., 2020a)        COCO       ResNet-101        16.96  11.73  18.52      -      -      -
DRG (Gao et al., 2020)               COCO       ResNet-50-FPN     19.26  17.74  19.71      23.40  21.75  23.89
VCL (Hou et al., 2020)               COCO       ResNet-50         19.43  16.55  20.29      22.00  19.09  22.87
VSGNet (Ulutan et al., 2020)         COCO       ResNet-50         19.80  16.05  20.91      -      -      -
ConsNet (Liu et al., 2020)           COCO       ResNet-50-FPN     22.15  17.12  23.65      -      -      -
IDN (Li et al., 2020a)               COCO       ResNet-50         23.36  22.47  23.63      26.43  25.01  26.85
SCG (Zhang et al., 2020)             COCO       ResNet-50-FPN     21.85  18.11  22.97      -      -      -
OURS                                 COCO       ResNet-101        22.73  21.37  23.14      25.86  24.57  26.24
Table 2: State-of-the-art comparison (mAP) on the V-COCO test set.

Method                               Backbone        Scenario 1   Scenario 2
InteractNet (Gkioxari et al., 2018)  ResNet-50-FPN   40.0         -
GPNN (Qi et al., 2018)               ResNet-101      44.0         -
iCAN (Gao et al., 2018)              ResNet-50       45.3         52.4
TIN (Li et al., 2019)                ResNet-50       47.8         54.2
DRG (Gao et al., 2020)               ResNet-50-FPN   51.0         -
VSGNet (Ulutan et al., 2020)         ResNet-152      51.8         57.0
IDN (Li et al., 2020a)               ResNet-50       53.3         60.3
SCG (Zhang et al., 2020)             ResNet-50-FPN   54.2         60.9
OURS                                 ResNet-101      54.8         61.6
ResNet-101 (He et al., 2016) is used as the feature extraction backbone. We finetune Faster R-CNN during testing only. The object nodes of the graph network are represented by their semantic features. We rely on the pretrained BERT model (Devlin et al., 2018) to extract a vector representation of size 1x768 for each word. To get the candidate verbs' semantic features, we feed the sentence composed of the verb and the object to BERT and extract the first word's features as the verb features in the context of the object. We perform two convolutions on the input graph to get the final semantic vector representations, of size 1x512, for the object words and their connected verbs. LeakyReLU with a negative slope of 0.2 (Wang et al., 2018) is used as the activation function after each layer of the graph. The total loss hyperparameters λ_1 and λ_2 are set to 1, and λ_3 is set to 2. The margin for the cosine loss is set to 0.1. We use Stochastic Gradient Descent (SGD) to train the model for 10 epochs with a learning rate of 0.001, a weight decay of 0.0005, and a momentum of 0.9.
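These optimizer settings translate directly into the following PyTorch call (the model here is only a placeholder standing in for the full HOI network):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 117)            # placeholder for the full HOI network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

for epoch in range(10):                # 10 epochs, as reported above
    optimizer.zero_grad()
    loss = model(torch.randn(4, 768)).square().mean()  # dummy stand-in for L_total
    loss.backward()
    optimizer.step()
```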
3.2 Experimental Testing and Results
We compare the mAP of our model with state-of-the-art methods on the HICO-DET dataset in Table 1. We report our results using an object detector pretrained on the MS-COCO dataset (Lin et al., 2014) and using a fine-tuned object detector, under the Default and Known Object settings. We observe that our system shows comparable results to other state-of-the-art two-stage systems when using the pretrained object detector. With the fine-tuned detector, we outperform state-of-the-art two-stage methods by 0.97% and 1.03% on the Default and Known Object settings of HICO-DET, respectively. The main reason our model outperforms IDN with the fine-tuned HICO-DET detector but not with the COCO detector is that the more accurate detections allow the model to obtain higher cosine similarity between the interaction semantic and visual features. To analyze the contribution of each component of our model, we perform an ablation study and report the results in Table 3.
Figure 3: Visualisation of the input action class representations using tSNE: (left) using general BERT word embeddings, and (right) after modifying the representations of the cellphone-related actions.
Figure 4: Visualisation of the action class representations using tSNE at the output of the GCN: (left) without object priming, and (right) with object 'cellphone' priming.
Table 3: Ablation study of the proposed system on HICO-DET in the Default setting.

                    Full    Rare    Non Rare
OURS                27.26   21.92   28.85
OURS w/o human      23.21   19.45   24.33
OURS w/o object     22.58   18.89   23.68
OURS w/o graph      21.13   17.34   22.26
In the ablation, we first test our model without the HOI prediction from the human stream, i.e., without the human visual, spatial, and pose features. Then, we test it without the object stream, i.e., without the object and union visual features. Finally, to show the importance of the semantic graph, we test the model without the semantic memory stream. We can see from the results that the graph has the largest influence on the HOI prediction, followed by the object and union features. These results support our hypothesis about the importance of the presence of the object in the interaction prediction and the value of the semantic features for better HOI prediction.
In Table 4, we test our model without object priming: the action verbs in the graph are represented by their general BERT word embedding vectors, without priming them with the detected object. The results support our main argument that object priming enhances the verb representation and thus improves the HOI predictions. Object priming improved the model mAP by 2.49%, since it adds context to the semantic representation of the related verbs and thus enhances the HOI prediction accordingly.
In Figure 3, we compare the different input representations of the candidate actions using tSNE (Maaten and Hinton, 2008). Figure 3 (left) shows the general BERT word embeddings of the candidate action verbs.
Table 4: Ablation study of the semantic memory module on HICO-DET in the Default setting.

                          Full    Rare    Non Rare
OURS                      27.26   21.92   28.85
OURS w/o object priming   24.77   18.61   26.61
Figure 5: HOI detections on HICO-DET test images. Our model detects various forms of HOIs for the same actions, 'hold' in (a) and 'ride' in (b), with different sets of objects. (a) From left to right: hold wine glass, hold clock, hold dog, hold book, hold baseball bat, hold cake. (b) From left to right: ride skateboard, ride boat, ride bicycle, ride snowboard.
We can see that BERT represents the actions in a general contextual manner. In Figure 3 (right), we modify the representations of the actions related to the detected object cell phone by feeding the sentence composed of the verb and the object to BERT and extracting the feature vector of the verb. We can see that the actions related to cellphone, including text on, talk on, and hold, are clustered together even without the use of the GCN. This shows how much the addition of the object context in the semantic module helps the GCN by clustering the object-related actions at the input.
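A figure of this kind can be reproduced from any set of verb embeddings with a few lines of scikit-learn and matplotlib (a generic sketch, not the authors' plotting code; the embeddings here are random stand-ins):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

verbs = ["text_on", "talk_on", "hold", "ride", "wash", "eat", "read"]
emb = np.random.randn(len(verbs), 768)   # stand-in for the 768-d BERT verb vectors

xy = TSNE(n_components=2, perplexity=3, init="random",
          random_state=0).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), v in zip(xy, verbs):
    plt.annotate(v, (x, y))
plt.show()
```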
Figure 4 shows the representations of the candidate actions at the output of the graph convolutional network, with and without the introduction of the detected object at the GCN input level. From Figure 4 (left), we see that the actions related to the detected object cell phone (in red) were not correctly clustered together. From Figure 4 (right), we can notice that Cluster A (in red), which refers to cellphone-related activities, encloses additional actions beyond the input ones, such as type on and pick up. Moreover, we can see that training the model helped in clustering the actions related to different activities. For example, Cluster B (in blue) refers to pet-related activities such as pet, walk, feed, and kiss; Cluster C (in green) refers to food-related activities including peel, stir, and pour; and Cluster D (in purple) refers to vehicle-related activities such as drive, board, load, and sail.
Figure 5 shows qualitative results of our method. We highlight the ability of our system to correctly predict interactions with objects that it was not trained on. This is due to the fact that the same verb is connected to similar objects in the training and external datasets.
4 CONCLUSIONS
We presented a novel model for Human-Object Interaction detection which uses visual, spatial, pose, and graph semantic features from the input image to obtain the best output prediction. We showed that the presence of the object in the input semantic features plays a fundamental role in enhancing the action prediction by contextualizing the semantic representations in the scene. Visual-spatial features are extracted from the human, the object, and the interaction. A similarity is calculated between the visual-spatial features of the interaction and the semantic features of the candidate interactions at the graph output. The external dataset provides all the affordance-based connections that can occur between an object and a verb, and the constructed semantic graph helped in predicting interactions that the network was not trained on. We rely on prior work for the episodic stream, and we contextualize the action semantic representation in the GCN. We connect both memories through the concatenation of the object's visual and semantic features and the comparison of the interaction's visual and semantic features. Our experiments demonstrate that our system improves HOI detection performance.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the American University of Beirut (AUB) and the National Council for Scientific Research of Lebanon (CNRS-L) for granting a doctoral fellowship to Maya Antoun.
REFERENCES
Baldassarre, F., Smith, K., Sullivan, J., and Azizpour, H.
(2020). Explanation-based weakly-supervised learn-
ing of visual relations with graph networks. arXiv
preprint arXiv:2006.09562.
Bansal, A., Rambhatla, S. S., Shrivastava, A., and Chel-
lappa, R. (2020a). Detecting human-object interac-
tions via functional generalization. In AAAI, pages
10460–10469.
Bansal, A., Rambhatla, S. S., Shrivastava, A., and
Chellappa, R. (2020b). Spatial priming for de-
tecting human-object interactions. arXiv preprint
arXiv:2004.04851.
Bub, D. and Masson, M. (2006). Gestural knowledge
evoked by objects as part of conceptual representa-
tions. Aphasiology, 20(9):1112–1124.
Chao, Y.-W., Liu, Y., Liu, X., Zeng, H., and Deng, J. (2018). Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 381–389. IEEE.
Chen, M., Liao, Y., Liu, S., Chen, Z., Wang, F., and Qian,
C. (2021). Reformulating hoi detection as adaptive set
prediction. arXiv preprint arXiv:2103.05983.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Fang, H.-S., Xie, S., Tai, Y.-W., and Lu, C. (2017). Rmpe:
Regional multi-person pose estimation. In Proceed-
ings of the IEEE international conference on com-
puter vision, pages 2334–2343.
Gallese, V., Fadiga, L., Fogassi, L., and Rizzolatti, G.
(1996). Action recognition in the premotor cortex.
Brain, 119(2):593–609.
Gao, C., Xu, J., Zou, Y., and Huang, J.-B. (2020). Drg: Dual
relation graph for human-object interaction detection.
In European Conference on Computer Vision, pages
696–712. Springer.
Gao, C., Zou, Y., and Huang, J.-B. (2018). ican: Instance-
centric attention network for human-object interaction
detection. arXiv preprint arXiv:1808.10437.
Gkioxari, G., Girshick, R., Dollár, P., and He, K. (2018).
Detecting and recognizing human-object interactions.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 8359–8367.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hou, Z., Peng, X., Qiao, Y., and Tao, D. (2020). Visual
compositional learning for human-object interaction
detection. arXiv preprint arXiv:2007.12407.
Hou, Z., Yu, B., Qiao, Y., Peng, X., and Tao, D. (2021).
Affordance transfer learning for human-object inter-
action detection. arXiv preprint arXiv:2104.02867.
Kim, B., Choi, T., Kang, J., and Kim, H. J. (2020a). Union-
det: Union-level detector towards real-time human-
object interaction detection. In European Conference
on Computer Vision, pages 498–514. Springer.
Kim, B., Lee, J., Kang, J., Kim, E.-S., and Kim, H. J.
(2021). Hotr: End-to-end human-object interac-
tion detection with transformers. arXiv preprint
arXiv:2104.13682.
Kim, D., Lee, G., Jeong, J., and Kwak, N. (2020b). Tell
me what they’re holding: Weakly-supervised object
detection with transferable knowledge from human-
object interaction. In AAAI, pages 11246–11253.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. arXiv
preprint arXiv:1609.02907.
Li, Y.-L., Liu, X., Wu, X., Li, Y., and Lu, C. (2020a). Hoi
analysis: Integrating and decomposing human-object
interaction. Advances in Neural Information Process-
ing Systems, 33.
Li, Y.-L., Xu, L., Liu, X., Huang, X., Xu, Y., Wang, S.,
Fang, H.-S., Ma, Z., Chen, M., and Lu, C. (2020b).
Pastanet: Toward human activity knowledge engine.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 382–391.
Li, Y.-L., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, H.-
S., Wang, Y., and Lu, C. (2019). Transferable in-
teractiveness knowledge for human-object interaction
detection. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
3585–3594.
Liang, Z., Guan, Y., and Rojas, J. (2020). Visual-semantic
graph attention network for human-object interaction
detection. arXiv preprint arXiv:2001.02302.
Liao, Y., Liu, S., Wang, F., Chen, Y., Qian, C., and Feng,
J. (2020). Ppdm: Parallel point detection and match-
ing for real-time human-object interaction detection.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 482–490.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Liu, Y., Yuan, J., and Chen, C. W. (2020). Consnet: Learn-
ing consistency graph for zero-shot human-object in-
teraction detection. In Proceedings of the 28th ACM
International Conference on Multimedia, pages 4235–
4243.
Maaten, L. v. d. and Hinton, G. (2008). Visualizing data
using t-sne. Journal of machine learning research,
9(Nov):2579–2605.
Nelissen, K., Luppino, G., Vanduffel, W., Rizzolatti, G.,
and Orban, G. A. (2005). Observing others: multi-
ple action representation in the frontal lobe. Science,
310(5746):332–336.
Qi, S., Wang, W., Jia, B., Shen, J., and Zhu, S.-C. (2018).
Learning human-object interactions by graph parsing
neural networks. In Proceedings of the European Con-
ference on Computer Vision (ECCV), pages 401–417.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., We-
ber, I., and Torralba, A. (2017). Learning cross-modal
embeddings for cooking recipes and food images. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 3020–3028.
Song, Y., Li, W., Zhang, L., Yang, J., Kiciman, E.,
Palangi, H., Gao, J., Kuo, C.-C. J., and Zhang, P.
(2020). Novel human-object interaction detection
via adversarial domain generalization. arXiv preprint
arXiv:2005.11406.
Speer, R., Chin, J., and Havasi, C. (2016). Conceptnet
5.5: An open multilingual graph of general knowl-
edge. arXiv preprint arXiv:1612.03975.
Sun, X., Hu, X., Ren, T., and Wu, G. (2020). Human object
interaction detection via multi-level conditioned net-
work. In Proceedings of the 2020 International Con-
ference on Multimedia Retrieval, pages 26–34.
Tamura, M., Ohashi, H., and Yoshinaga, T. (2021). Qpic:
Query-based pairwise human-object interaction detec-
tion with image-wide contextual information. arXiv
preprint arXiv:2103.05399.
Ulutan, O., Iftekhar, A., and Manjunath, B. S. (2020). Vs-
gnet: Spatial attention network for detecting human
object interactions using graph convolutions. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 13617–13626.
Wan, B., Zhou, D., Liu, Y., Li, R., and He, X. (2019).
Pose-aware multi-level feature network for human ob-
ject interaction detection. In Proceedings of the IEEE
International Conference on Computer Vision, pages
9469–9478.
Wang, S., Yap, K.-H., Yuan, J., and Tan, Y.-P. (2020a).
Discovering human interactions with novel objects via
zero-shot learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 11652–11661.
Wang, T., Yang, T., Danelljan, M., Khan, F. S., Zhang, X.,
and Sun, J. (2020b). Learning human-object interac-
tion detection using interaction points. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 4116–4125.
Wang, X., Ye, Y., and Gupta, A. (2018). Zero-shot recogni-
tion via semantic embeddings and knowledge graphs.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 6857–6866.
Xu, B., Wong, Y., Li, J., Zhao, Q., and Kankanhalli, M. S.
(2019). Learning to detect human-object interactions
with knowledge. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition.
Zhang, F. Z., Campbell, D., and Gould, S. (2020). Spatially
conditioned graphs for detecting human-object inter-
actions. arXiv preprint arXiv:2012.06060.
Zhou, T., Wang, W., Qi, S., Ling, H., and Shen, J. (2020).
Cascaded human-object interaction recognition. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4263–
4272.
Zou, C., Wang, B., Hu, Y., Liu, J., Wu, Q., Zhao, Y., Li, B.,
Zhang, C., Zhang, C., Wei, Y., et al. (2021). End-to-
end human object interaction detection with hoi trans-
former. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
11825–11834.