A Fine-grained Perspective onto Object Interactions
from First-person Views
Dima Damen
Department of Computer Science, University of Bristol, Bristol, U.K.
https://orcid.org/0000-0001-8804-6238
Keywords:
Egocentric Vision, First-Person Vision, Action Recognition, Fine-Grained Recognition, Object Interaction
Recognition, Skill Determination, Action Completion, Action Anticipation, Wearable Cameras, First-Person
Datasets, EPIC-Kitchens.
Abstract:
This extended abstract summarises the works relevant to the keynote lecture at VISAPP 2019. The talk discusses understanding object interactions from wearable cameras, focusing on fine-grained understanding of interactions on realistic, unbalanced datasets recorded in-the-wild.
1 INTRODUCTION
Humans interact with tens of objects daily, at home (e.g. cooking/cleaning), at work (e.g. assembly/machinery) or during leisure (e.g. playing/sports), individually or collaboratively. The field of research, within computer vision and machine learning, that focuses on the perception of object interactions from a wearable camera is commonly referred to as ‘first-person vision’. In this extended abstract, we cover novel research questions, particularly related to the newly released largest dataset of object interactions recorded in people’s native environments: EPIC-Kitchens.
2 DEFINITIONS
Object interactions can be perceived from different ordinal-person viewpoints, where ‘ordinal’ is used to generalise between first-, second- and third-person views. A view is referred to as a first-person view if the interaction is captured by a wearable sensor worn by the actor performing the interaction. Conversely, a second-person view is when the interaction is captured by a camera of a co-actor, or a recipient of the action. Finally, a third-person view, common in remote static cameras, is when the interaction is captured by an observer not involved in the interaction or with the actor during that interaction.
3 DATASETS AND EPIC-Kitchens
For years, Computer Vision has focused on capturing
videos from a third-person view, with the majority
of action recognition datasets using a remote camera
observing the action or interaction (Marszalek et al.,
2009; Kuehne et al., 2011; Caba Heilbron et al., 2015;
Carreira and Zisserman, 2017).
Increasingly, first-person vision datasets have
been recorded, capturing full body motion such as
sports (Kitani et al., 2011), social interactions (Alletto
et al., 2015; Fathi et al., 2012a; Ryoo and Matthies,
2013) and object interactions (De La Torre et al.,
2008; Fathi et al., 2012b; Pirsiavash and Ramanan,
2012; Damen et al., 2014; Georgia Tech, 2018; Sigurdsson et al., 2018).
In 2018, the largest dataset on wearable cameras was released through a collaboration led by the University of Bristol alongside the University of Catania and the University of Toronto - http://epic-kitchens.github.io/. EPIC-Kitchens (Damen et al., 2018) offers more than 11.5M frames, captured using a head-mounted camera in 32 different kitchens, with over 55 hours of natural interactions from cooking to washing the dishes (Fig. 1).
4 FINE-GRAINED OBJECT
INTERACTIONS
Figure 1: Sample frames from EPIC-Kitchens.

Datasets, such as EPIC-Kitchens, can offer unique opportunities to study previously unexplored problems in fine-grained object interactions. A few of
these opportunities are highlighted here.
Overlapping Object Interactions: Defining the temporal extent of an action is fundamentally an ambiguous problem (Moltisanti et al., 2017; Sigurdsson et al., 2017). This is usually resolved through multi-labels, i.e. allowing a time-segment to belong to multiple classes of actions. However, actual understanding of overlapping interactions requires a space of action labels that captures dependencies (e.g. filling a kettle requires opening the tap). Models that capture and predict overlapping interactions are needed for a finer understanding of object interactions.
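To make the multi-label formulation above concrete, the following minimal PyTorch sketch (an illustration under assumed feature dimensions and class counts, not the model of any cited work) treats each time-segment as an independent multi-label problem, with per-class sigmoid outputs so that a single segment can belong to several interaction classes at once.

import torch
import torch.nn as nn

NUM_CLASSES = 20     # assumed number of interaction classes
FEATURE_DIM = 512    # assumed dimensionality of a pooled segment feature

# Linear multi-label head over pre-computed segment features.
head = nn.Linear(FEATURE_DIM, NUM_CLASSES)

# Binary cross-entropy over independent per-class logits allows a
# time-segment to carry several labels (e.g. 'fill kettle' and 'open tap').
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(8, FEATURE_DIM)   # a batch of 8 segment features
targets = torch.zeros(8, NUM_CLASSES)
targets[0, [3, 7]] = 1.0                 # segment 0 carries two overlapping labels

loss = criterion(head(features), targets)
loss.backward()

# At inference, every class whose probability clears a threshold is predicted,
# so overlapping interactions are naturally allowed.
predicted = (torch.sigmoid(head(features)) > 0.5).int()
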
Object Interaction Completion/Incompletion: Beyond classification and localisation, action completion/incompletion is the problem of identifying whether the action’s goal has been successfully achieved, or merely attempted. This is a novel fine-grained object interaction research question proposed in (Heidarivincheh et al., 2016). This work has recently been extended to locating the moment of completion (Heidarivincheh et al., 2018) - that is, the moment in time beyond which a human observer believes the action’s goal to have been completed.
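Detecting the moment of completion can be illustrated with a simple sketch (an assumption for illustration only, not the temporal model of Heidarivincheh et al.): given per-frame completion probabilities from some frame-level classifier, the predicted moment is the first frame whose probability exceeds a threshold, and an action with no such frame is deemed incomplete.

import torch

def moment_of_completion(frame_probs: torch.Tensor, threshold: float = 0.5):
    """Return the index of the first frame whose completion probability
    exceeds `threshold`, or None if the goal is never judged achieved.
    `frame_probs` is a 1-D tensor of per-frame probabilities, assumed to
    come from an unspecified frame-level classifier."""
    above = (frame_probs > threshold).nonzero(as_tuple=True)[0]
    if above.numel() == 0:
        return None              # incomplete: the goal was merely attempted
    return int(above[0])         # moment beyond which the goal is judged achieved

# Example: completion probability rises as the interaction unfolds.
probs = torch.tensor([0.05, 0.10, 0.20, 0.40, 0.70, 0.90, 0.95])
print(moment_of_completion(probs))   # -> 4
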
Skill Determination from Video: Even when an interaction is successfully completed, further understanding of ‘how well’ the task was completed would offer knowledge beyond pure classification. In the leading work of (Doughty et al., 2018a), a collection of videos is ordered by the skill exhibited in each video, through deep pairwise ranking. This method has recently been extended to include rank-aware attention (Doughty et al., 2018b) - that is, a novel loss function capable of attending to parts of the video that exhibit higher skill as well as parts that demonstrate lower skill, including mistakes or hesitation.
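The pairwise ranking idea can be sketched as follows (a minimal sketch with assumed feature dimensions, not the full architecture of Doughty et al.): a shared scoring network maps each video to a scalar skill score, and a margin ranking loss pushes the score of the higher-skill video in each annotated pair above that of the lower-skill one; ranking a collection then reduces to sorting videos by their predicted scores.

import torch
import torch.nn as nn

FEATURE_DIM = 512    # assumed dimensionality of a pooled video representation

# Shared (Siamese) scoring function applied to both videos of a pair.
score_fn = nn.Sequential(nn.Linear(FEATURE_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

# With target +1, the first argument should outrank the second by the margin.
ranking_loss = nn.MarginRankingLoss(margin=1.0)

higher = torch.randn(4, FEATURE_DIM)   # features of the higher-skill videos
lower = torch.randn(4, FEATURE_DIM)    # features of the paired lower-skill videos

s_high = score_fn(higher).squeeze(1)
s_low = score_fn(lower).squeeze(1)

loss = ranking_loss(s_high, s_low, torch.ones_like(s_high))
loss.backward()
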
Anticipation and Forecasting: Predicting upcoming interactions has recently gathered additional attention, triggered by the availability of first-person datasets (Furnari et al., 2018; Rhinehart and Kitani, 2017). Novel research on uncertainty in anticipating actions (Furnari et al., 2018), and on relating forecasting to trajectory prediction (Rhinehart and Kitani, 2017), has recently been proposed.
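One way to account for the intrinsic uncertainty of the future, in the spirit of the top-k style evaluation argued for in (Furnari et al., 2018), is to count an anticipation as correct when the true next action appears among the model’s k highest-scoring hypotheses. The sketch below is a minimal illustration, assuming class scores are already produced by some anticipation model.

import torch

def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose ground-truth next action is among the k
    highest-scoring predictions; any matching hypothesis counts as correct,
    reflecting that several future actions may be equally plausible."""
    topk = logits.topk(k, dim=1).indices               # (batch, k) candidate actions
    hits = (topk == labels.unsqueeze(1)).any(dim=1)    # did any candidate match?
    return hits.float().mean().item()

# Example: 3 observed clips, scores over 10 possible next actions.
scores = torch.randn(3, 10)
labels = torch.tensor([2, 5, 7])
print(top_k_accuracy(scores, labels, k=5))
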
Paired Interactions: One leading work has attempted to capture both the action and its counter-action (or reaction), both from wearable cameras (Yonetani et al., 2016). This is a very exciting, yet still under-explored, area of research.
5 CONCLUSION
Recent deep-learning research has only scratched the surface of the potential for finer-grained understanding of object interactions. As new hardware platforms for first-person vision emerge (Microsoft’s HoloLens, Magic Leap, Samsung Gear, etc.), the applications of fine-grained recognition will be endless.
REFERENCES
Alletto, S., Serra, G., Calderara, S., and Cucchiara, R. (2015). Understanding social relationships in egocentric vision. In Pattern Recognition.
Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. (2015). ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., and Wray, M. (2018). Scaling egocentric vision: The EPIC-KITCHENS Dataset. In ECCV.
Damen, D., Leelasawassuk, T., Haines, O., Calway, A., and Mayol-Cuevas, W. (2014). You-do, I-learn: Discovering task relevant objects and their modes of interaction from multi-user egocentric video. In BMVC.
De La Torre, F., Hodgins, J., Bargteil, A., Martin, X., Macey, J., Collado, A., and Beltran, P. (2008). Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. In Robotics Institute.
Doughty, H., Damen, D., and Mayol-Cuevas, W. (2018a).
Who’s Better? Who’s Best? Pairwise Deep Ranking
for Skill Determination. In CVPR.
Doughty, H., Mayol-Cuevas, W., and Damen, D. (2018b). The Pros and Cons: Rank-aware temporal attention for skill determination in long videos. In arXiv.
Fathi, A., Hodgins, J., and Rehg, J. (2012a). Social interactions: A first-person perspective. In CVPR.
Fathi, A., Li, Y., and Rehg, J. (2012b). Learning to recognize daily actions using gaze. In ECCV.
Furnari, A., Battiato, S., and Farinella, G. (2018). Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In ECCVW.
Georgia Tech (2018). Extended GTEA Gaze+. http://webshare.ipat.gatech.edu/coc-rim-wall-lab/web/yli440/egtea gp.
Heidarivincheh, F., Mirmehdi, M., and Damen, D. (2016).
Beyond action recognition: Action completion in
RGB-D data. In BMVC.
Heidarivincheh, F., Mirmehdi, M., and Damen, D. (2018). Action completion: A temporal model for moment detection. In BMVC.
Kitani, K. M., Okabe, T., Sato, Y., and Sugimoto, A. (2011).
Fast unsupervised ego-action learning for first-person
sports videos. In CVPR.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre,
T. (2011). HMDB: A large video database for human
motion recognition. In ICCV.
Marszalek, M., Laptev, I., and Schmid, C. (2009). Actions
in context. In CVPR.
Moltisanti, D., Wray, M., Mayol-Cuevas, W., and Damen, D. (2017). Trespassing the boundaries: Labeling temporal bounds for object interactions in egocentric video. In ICCV.
Pirsiavash, H. and Ramanan, D. (2012). Detecting activities
of daily living in first-person camera views. In CVPR.
Rhinehart, N. and Kitani, K. M. (2017). First-person activity forecasting with online inverse reinforcement learning. In ICCV.
Ryoo, M. S. and Matthies, L. (2013). First-person activity
recognition: What are they doing to me? In CVPR.
Sigurdsson, G. A., Gupta, A., Schmid, C., Farhadi, A., and Alahari, K. (2018). Charades-Ego: A large-scale dataset of paired third and first person videos. In arXiv.
Sigurdsson, G. A., Russakovsky, O., and Gupta, A. (2017).
What actions are needed for understanding human
actions in videos? In ICCV.
Yonetani, R., Kitani, K. M., and Sato, Y. (2016). Recognizing micro-actions and reactions from paired egocentric videos. In CVPR.
BRIEF BIOGRAPHY
Dima Damen: Associate Professor in Computer Vision at the University of Bristol, United Kingdom. She received her PhD from the University of Leeds, UK (2009). Dima’s research interests are in the automatic understanding of object interactions, actions and activities using wearable and static visual (and depth) sensors. She has contributed works to novel research questions including fine-grained object interaction recognition, understanding the completion of actions, skill determination from video, semantic ambiguities of actions and the robustness of classifiers to actions’ temporal boundaries. Her work is published in leading venues: CVPR, ECCV, ICCV, PAMI, IJCV, CVIU and BMVC. In 2018, she led the release of the largest dataset in first-person vision to date (EPIC-KITCHENS): 11.5M frames of non-scripted recordings with full ground truth. Dima co-chaired BMVC 2013, has served as area chair for BMVC (2014-2018), and is an associate editor of Pattern Recognition (2017-). She was selected as a Nokia Research collaborator in 2016, and as an Outstanding Reviewer at ICCV17, CVPR13 and CVPR12.