ENIGMA: Egocentric Navigator for Industrial Guidance, Monitoring and Anticipation

Francesco Ragusa¹,², Antonino Furnari¹,², Antonino Lopes³, Marco Moltisanti³, Emanuele Ragusa³, Marina Samarotto³, Luciano Santo³, Nicola Picone⁴, Leo Scarso⁴ and Giovanni Maria Farinella¹,²

¹ FPV@IPLAB, DMI - University of Catania, Italy
² Next Vision s.r.l. - Spinoff of the University of Catania, Italy
³ Xenia Gestione Documentale s.r.l. - Xenia Progetti s.r.l., Acicastello, Catania, Italy
⁴ Morpheos s.r.l. - Catania, Italy
Keywords: Egocentric Vision, First Person Vision, Industrial Domain.
Abstract: We present ENIGMA (Egocentric Navigator for Industrial Guidance, Monitoring and Anticipation), an integrated system to support workers in an industrial laboratory. ENIGMA includes a wearable assistant which understands the worker's behavior through Computer Vision algorithms which 1) localize the operator, 2) recognize the objects present in the laboratory, 3) detect the human-object interactions which take place and 4) anticipate the next-active object with which the worker will interact. Furthermore, a back-end extracts high-level semantic information about the worker's behavior to provide useful services and to improve worker safety. Preliminary experiments show good performance on the tasks of localization, object detection and recognition, and egocentric human-object interaction detection in the challenging industrial scenario.
1 INTRODUCTION
Understanding human behavior from an egocentric point of view allows building intelligent systems able to support humans equipped with a camera (e.g., smartglasses or other head-mounted devices) in achieving daily goals in different scenarios such as home environments (Damen et al., 2014), cultural sites (Farinella et al., 2019; Cucchiara and Bimbo, 2014), and industrial buildings (Colombo et al., 2019; Ragusa et al., 2021; Ragusa et al., 2022).
In particular, in the industrial scenario, localizing users in an indoor workplace can be helpful for managing rescue situations such as fires or earthquakes, guiding them to the closest emergency exit, while detecting and recognizing objects in the surrounding environment allows the system to provide additional information on how to use them (i.e., automatic and continuous training). Moreover, recognizing human-object interactions can be useful to provide suggestions on how to execute a complex maintenance procedure as well as to implement energy saving strategies. Furthermore, anticipating which objects a worker will interact with allows the system to improve safety in a factory, for example by notifying the user with an alert in case of a dangerous object.
Nowadays, different systems have been developed to train workers for specific tasks using virtual (Osti et al., 2021) or augmented reality (Sorko and Brunnhofer, 2019), as well as to support them with remote assistance (Gurevich et al., 2012), guiding the local operator during the execution of procedural tasks (Rebol et al., 2021; Sun et al., 2021). Although these systems help workers in different ways, they suffer from several limitations due to their inability to understand human behavior and the surrounding environment.
In this paper, we present ENIGMA (Egocentric Navigator for Industrial Guidance, Monitoring and Anticipation), an AI wearable assistant capable of supporting the workers of an industrial laboratory during the execution of complex tasks, providing suggestions on how to perform different maintenance and repair procedures, improving their safety by anticipating potentially dangerous interactions, and implementing energy saving strategies to reduce electricity consumption. To achieve the aforementioned goals, ENIGMA implements algorithms to localize workers in the industrial laboratory, to recognize the objects present in the surrounding environment and the interactions with them, and to anticipate the next-active objects with which workers will interact from the egocentric point of view. Figure 1 shows the concept of the proposed AI assistant.
Figure 1: The concept of the proposed ENIGMA Assistant.
The proposed system has been tested in a laboratory which represents a realistic industrial scenario. The considered laboratory contains 23 different objects, both fixed (e.g., electric panels, a power supply and a welding station) and mobile (e.g., screwdrivers and electric boards). In addition, different IoT devices are installed in the sockets of the worktable and in the electric panel, which allow powering the tools connected to the sockets on and off. In this laboratory, we have considered 8 contexts, 23 objects and 22 different human-object interactions. Preliminary experiments show that the proposed ENIGMA system achieves good performance on the tasks of Localization, Object Detection and Recognition, and Egocentric Human-Object Interaction Detection, while future experiments will address the task of Next-Active Object Detection.
The remainder of the paper is organized as follows. Section 2 reports the related work. Section 3 presents the collected and labeled datasets. Section 4 describes the architecture of the ENIGMA system and explains the provided services. Section 5 reports the preliminary experimental results, whereas Section 6 concludes the paper.
2 RELATED WORK
Our work is related to different lines of research, including visual localization, object detection and recognition, and egocentric human-object interaction detection. The following sections discuss the relevant works belonging to the aforementioned research areas.
2.1 Visual Localization
Localization from egocentric images can be addressed with both classification and camera pose estimation methods. In particular, classification-based methods localize the input image by discretizing the space into cells and training a classifier which assigns the image to a cell. These cells can represent generic areas (Torralba et al., 2003), daily-life environments (Furnari et al., 2018) or specific rooms of a museum (Ragusa et al., 2020). Instead, camera pose estimation methods establish correspondences between 2D pixel positions in the input image and 3D scene coordinates. This step can be addressed using a matching algorithm or by regressing 3D coordinates from image patches (Brachmann and Rother, 2018; Taira et al., 2018). In this work, we focus on approaches based on both classification and camera pose estimation to give workers localization information at different levels of granularity.
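As an illustration of the camera pose estimation route mentioned above, the following sketch recovers the camera pose from a set of 2D-3D correspondences with a PnP solver; it is a minimal example under the stated assumptions, not the exact pipeline used by ENIGMA, and the camera intrinsics shown are placeholder values.

import numpy as np
import cv2

def estimate_pose(points_2d, points_3d, K):
    """Recover the camera pose from 2D-3D correspondences with PnP + RANSAC.

    points_2d: (N, 2) pixel coordinates; points_3d: (N, 3) scene coordinates;
    K: 3x3 camera intrinsic matrix. Returns a rotation matrix, a translation
    vector and the indices of the inlier correspondences.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        K, distCoeffs=None, reprojectionError=4.0)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent correspondences")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle vector -> 3x3 rotation matrix
    return R, tvec, inliers

# Placeholder intrinsics (focal length and principal point are assumptions).
K = np.array([[1400.0, 0.0, 1136.0],
              [0.0, 1400.0, 639.0],
              [0.0, 0.0, 1.0]])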
2.2 Object Detection and Recognition
The object detection and recognition task has been tackled with one-stage methods (Redmon and Farhadi, 2018), which prioritize detection speed over prediction accuracy, as well as with two-stage approaches (Girshick, 2015; Ren et al., 2015), which localize and classify objects more precisely at a higher computational cost. Several works addressed the task of detecting and recognizing objects in museums (Seidenari et al., 2017; Farinella et al., 2019), in which objects are represented by statues or artworks, as well as in industrial environments in which
tiny and small objects need to be recognized (Ragusa et al., 2021; Ragusa et al., 2022). The proposed ENIGMA system leverages state-of-the-art object detectors to recognize the objects present on the worktable of the considered industrial laboratory. Specifically, our system relies on the Faster R-CNN object detector (Ren et al., 2015).

Figure 2: The industrial-like laboratory where the ENIGMA system has been tested.
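As a concrete sketch of how such a two-stage detector can be instantiated, the snippet below builds a Faster R-CNN model with torchvision and adapts its classification head to the 23 object classes of the laboratory (plus background). It is a minimal example under the assumption of a standard torchvision backbone, not a description of the exact training setup of ENIGMA.

import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_OBJECT_CLASSES = 23  # laboratory objects; index 0 is reserved for background

# Start from a COCO-pretrained Faster R-CNN and replace its box predictor
# (recent torchvision uses the "weights" keyword; older releases use pretrained=True).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_OBJECT_CLASSES + 1)

model.eval()
with torch.no_grad():
    # A single egocentric frame as a normalized CHW tensor in [0, 1].
    frame = torch.rand(3, 1278, 2272)
    predictions = model([frame])[0]  # dict with 'boxes', 'labels' and 'scores'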
2.3 Egocentric Human-Object Interaction Detection
Previous works focused on the Human-Object Interaction (HOI) detection task considering the third-person view. The authors of (Gupta and Malik, 2015) were the first to explore the HOI task, annotating the COCO dataset (Lin et al., 2014) with verbs. (Gkioxari et al., 2018) proposed a method which detects and localizes humans and objects present in the scene and analyzes each human-object pair using a heat map to represent their relationship, as well as to estimate the verb which describes it. The aforementioned problem has also been studied from the first-person point of view. The authors of (Nagarajan et al., 2019) studied the problem of understanding how to interact with an object, learning human-object interaction “hotspots” from egocentric videos. (Nagarajan et al., 2020) introduced a model to capture the primary spatial zones of an environment and the possible activities which could happen there (i.e., environment affordances). The authors of (Shan et al., 2020) proposed a hand-centric method which classifies objects into active and passive classes depending on whether or not they are involved in an interaction. A few works addressed this task in an industrial domain. The authors of (Ragusa et al., 2021; Ragusa et al., 2022) studied human-object interactions by releasing the MECCANO dataset, acquired while people were building a toy model of a motorcycle. (Leonardi et al., 2022) studied egocentric human-object interaction detection exploiting both synthetic and real images in an industrial environment.
Figure 3: The 8 contexts of the industrial laboratory.

In the proposed system, we adopted a hand-centric method based on a standard object detector (Ren et al., 2015) to detect and recognize hands and objects in the scene, as well as to infer their relationship considering the overlap between the bounding boxes.
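A minimal sketch of this overlap-based association is shown below: given the hand and object boxes returned by the detector, each hand is paired with the object whose bounding box overlaps it the most. The helper names and the IoU threshold are illustrative assumptions, not the exact rules used in ENIGMA.

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_interactions(hand_boxes, object_boxes, min_iou=0.1):
    """Pair each detected hand with the most-overlapping object box (if any)."""
    pairs = []
    for hand in hand_boxes:
        overlaps = [iou(hand, obj) for obj in object_boxes]
        if overlaps and max(overlaps) >= min_iou:
            pairs.append((hand, object_boxes[overlaps.index(max(overlaps))]))
    return pairs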
3 EXPERIMENTAL
LABORATORY AND DATASETS
Our system has been tested in an industrial context. Specifically, we set up a laboratory (shown in Figure 2) containing 23 different objects, such as an electric screwdriver, a welding station and electrical boards, as well as an electrical panel which allows powering on and off the sockets placed on the worktable. We have collected and labeled two different datasets of egocentric videos useful to design the services which compose the ENIGMA system: 1) Localization, 2) Object Detection and Recognition, 3) Egocentric Human-Object Interaction and 4) Next-Active Object Detection.
3.1 Localization Dataset
We acquired 62 videos using a Microsoft HoloLens 2 device with a resolution of 2272x1278 pixels at 30 frames per second. We extracted 55824 frames, which have been divided into Training, Validation and Test sets of 39437, 4394 and 11993 frames respectively. We labeled the dataset exploiting a Structure from Motion (SfM) approach using the open source software COLMAP (https://github.com/colmap/colmap), obtaining for each frame its 2D/3D position and orientation. The collected dataset is useful to assess the performance of algorithms for both punctual and contextual localization. In particular, for the contextual localization we divided the laboratory into 8 cells which represent 8 different contexts: C1 Lab Door, C2 Panel A, C3 Panel B, C4 Fire Extinguisher, C5 Workbench, C6 Cabinet, C7 Lab (which covers the case in which the user is in the lab but not in any of the other cells) and C8 Out of the Lab. We assigned each frame to the corresponding cell considering its 2D position. Figure 3 shows the 8 contexts in the industrial laboratory.

Figure 4: Examples of images annotated with bounding boxes around the objects.
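As a sketch of the cell-assignment step described above, the snippet below maps the 2D position of a frame (obtained from the SfM reconstruction) to one of the 8 contexts. The rectangular cell boundaries are placeholder values, since the actual cell geometry depends on the laboratory layout.

# Placeholder cell boundaries in SfM map coordinates: (x_min, x_max, y_min, y_max).
CELLS = {
    "C1 Lab Door":          (0.0, 1.5, 0.0, 2.0),
    "C2 Panel A":           (1.5, 3.0, 0.0, 2.0),
    "C3 Panel B":           (3.0, 4.5, 0.0, 2.0),
    "C4 Fire Extinguisher": (0.0, 1.5, 2.0, 4.0),
    "C5 Workbench":         (1.5, 3.0, 2.0, 4.0),
    "C6 Cabinet":           (3.0, 4.5, 2.0, 4.0),
}
LAB_BOUNDS = (0.0, 4.5, 0.0, 4.0)  # inside the lab but outside the cells above

def assign_context(x, y):
    """Return the context label of a frame given its 2D map position."""
    for label, (x0, x1, y0, y1) in CELLS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return label
    x0, x1, y0, y1 = LAB_BOUNDS
    if x0 <= x <= x1 and y0 <= y <= y1:
        return "C7 Lab"
    return "C8 Out of the Lab"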
3.2 Procedural Dataset
This dataset consists of 8 egocentric videos acquired with a Microsoft HoloLens 2 device while 7 different subjects performed test and repair procedures on electrical boards in the industrial laboratory. The 8 videos have been acquired with a resolution of 2272x1278 pixels and a framerate of 30 fps. We manually annotated the human-object interactions by selecting the first frame in which the hand of the subject touches an object and the frame after the hand releases it, and by assigning a verb which describes the interaction: 1) Take, 2) Release, 3) Contact and 4) De-contact. Moreover, for each frame we annotated both the objects involved in the human-object interaction (active objects) and all the other objects. In particular, we annotated each object with a (x, y, w, h, c, a) tuple, where (x, y, w, h) represents the 2D bounding box, c indicates the object class (considering a total of 23 object classes) and a indicates whether the object is involved in the current interaction or not. Following this procedure, we labeled 20000 objects. Figure 4 shows some examples of the annotated frames.
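A possible in-memory representation of this annotation scheme is sketched below as a small dataclass; the field names and the example class index are illustrative and do not necessarily match the format of the released annotation files.

from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    x: float   # top-left x of the bounding box (pixels)
    y: float   # top-left y of the bounding box (pixels)
    w: float   # box width (pixels)
    h: float   # box height (pixels)
    c: int     # object class index, one of the 23 classes
    a: bool    # True if the object is active in the current interaction

# Example: an active electric screwdriver (hypothetical class index 6).
ann = ObjectAnnotation(x=812.0, y=455.0, w=140.0, h=96.0, c=6, a=True)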
4 ARCHITECTURE AND
SERVICES
In this Section, we first discuss the architecture of the
proposed system (Section 4.1), then we present the
services implemented by ENIGMA (Section 4.2).
4.1 Architecture
Figure 5 shows the whole architecture of the proposed
ENIGMA system which is composed of 4 main com-
ponents:
Wearable Devices: devices such as smartglasses (i.e., Microsoft HoloLens 2) are provided to the operator in the industrial laboratory. Repair and testing activities are shown on the display through Augmented Reality. Moreover, images and videos are acquired from the point of view of the subject and sent to the Artificial Intelligence Inference Engine via a dedicated A/V Message Broker;
AI Inference Engine: a high-performance multi-core processing unit specifically designed for AI tasks. This engine executes AI algorithms on a dedicated GPU in order to process egocentric videos and address operator localization, object detection and recognition, human-object interaction detection and next-active object detection;
BI Logic Engine: collects information from AI inferences and from the status of the IoT sensors to make decisions based on the behavior of the operator (e.g., turn off the electric power in case of electrical risk). Messages are exchanged through a message broker specifically designed for short-message protocols;
IoT Devices: the sensing and controlling periphery of the ENIGMA system. They continuously check the status of the electronic devices connected to the sockets (e.g., oscilloscope, power supply).

Figure 5: Overall architecture of the ENIGMA system, which is composed of 4 main components.

Furthermore, the system relies on different standard modules which provide communication between modules (message broker), fast-access storage (in-memory database), persistent storage (non-relational database), REST API services (web server with Python modules), and administration, remote control and analytics (web application).
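The concrete message format exchanged over the broker is not detailed in the paper; the following sketch shows one plausible JSON payload that the AI Inference Engine could publish for the BI Logic Engine, with all field names being assumptions made for illustration only.

import json
import time

# Hypothetical payload published by the AI Inference Engine after processing a frame.
inference_message = {
    "timestamp": time.time(),
    "operator_id": "worker-01",
    "context": "C5 Workbench",                  # contextual localization output
    "pose": {"x": 2.1, "y": 3.4, "yaw": 95.0},  # punctual localization output
    "detections": [
        {"label": "oscilloscope", "score": 0.94, "box": [412, 220, 780, 610]},
    ],
    "interaction": {"verb": "take", "object": "electric screwdriver"},
    "next_active_object": "high voltage board",  # used by the BI Logic Engine for safety rules
}

payload = json.dumps(inference_message)  # serialized and sent through the message broker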
4.2 Services
This Section presents the services implemented by
ENIGMA:
Augmented Reality. The ENIGMA system uses Augmented Reality in order to provide additional and meaningful information while keeping the operator's hands free to work safely. Alarms and warnings are shown on the holographic glass, and the whole interface can be controlled using voice commands.
Localization. This information is used to provide suggestions and alerts through Augmented Reality, as well as to show the position of the operator in the web application, in order to log and monitor his activities.
Object Detection. Recognizing the objects present in the surrounding environment allows the system to provide the worker with useful information about them, such as time of usage for maintenance purposes or suggestions on how to use a specific object.
Human-Object Interaction. This service allows the system to check the correctness of the procedures the operator is performing, and to implement energy saving and tool preservation strategies in an automatic way.
Next-Active Object. Predicting which object is going to be used in the near future can prevent dangerous interactions, such as touching powered devices or tools. This information is used to alert the operator and to improve his safety.
Context-based Services. A laboratory equipped with the ENIGMA system can be used as a training space for people who need to improve their skills. The capabilities of the system also allow remote training and assistance, ensuring the safety and health of both the operator and the instructor.
Authentication, Accounting and Administration. The ENIGMA system relies on an administrative backend that provides control and monitoring of the system itself as well as of the workers and their activities. A web application enables supervisors and administrators to assign activities to the operators and to monitor the status of the work. Moreover, the operator can consult his state of work using the same interface.
Visual Analytics. An overall view of the laboratory, with all the devices and the operator, is available through a web-based interface. The system reports useful information, such as energy consumption, peaks and detected alerts, and provides a historical log of the events that happened in the laboratory.
Energy Saving. The proposed system allows optimizing energy consumption within the laboratory thanks to the analysis of the energy consumption sensors present in the IoT devices installed in the laboratory. The system can warn operators or autonomously deactivate working tools that are not necessary for the current task, saving energy in case of worker oversights.
Safety. The system is able to detect and provide
solutions to the following safety risks:
High/Low Voltage: working with high and low voltage electrical boards exposes the operator to electrical risks, since the board could be under current or retain residual electricity stored in its capacitors. The system is able to alert workers with visual and acoustic alarms before the board is touched and to turn off the electrical current, preventing electric shocks;
Break from Work: work regulations prescribe taking breaks at regular time intervals. ENIGMA can monitor how long the operator has been working continuously and suggest when it is time to take a break;
Safety Procedures: specific tools and equipment require specific procedures and usage modalities. The system provides hints and suggestions and checks that the procedures are respected, improving the safety of workers.

Table 1: The results obtained by the proposed system in the task of contextual localization.

Contexts               Validation   Test
C1 Door                0.936        0.584
C2 Panel A             0.986        0.831
C3 Panel B             0.985        0.727
C4 Fire Extinguisher   0.980        0.870
C5 Workbench           0.937        0.689
C6 Cabinet             0.964        0.242
C7 Lab                 0.981        0.752
C8 Out of the Lab      0.960        0.294
Average                0.975        0.647
5 PRELIMINARY RESULTS
We tested our ENIGMA system to assess its performance on the localization (contextual and punctual), object detection and recognition, and egocentric human-object interaction detection tasks, which represent three of the core components of the whole system. Table 1 reports the results of the context-based localization system, which uses TripletNet (Hoffer and Ailon, 2015) for the feature extraction phase and a K-NN classifier with K = 1 to assign the correct context.
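A minimal sketch of this retrieval-style pipeline is given below: training frames are embedded with a metric-learning backbone and a 1-nearest-neighbour classifier assigns each test frame to the context of its closest training embedding. The feature extractor is represented here by a generic embed function, since the exact TripletNet architecture and training details are not reproduced.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def embed(frames):
    """Placeholder for the TripletNet feature extractor: maps frames to embeddings.

    In the real system this is a network trained with the triplet loss
    (Hoffer and Ailon, 2015); here it only stands in for that component.
    """
    return np.stack([np.asarray(f, dtype=np.float32).ravel() for f in frames])

def fit_context_classifier(train_frames, train_contexts):
    knn = KNeighborsClassifier(n_neighbors=1)  # K = 1, as in the paper
    knn.fit(embed(train_frames), train_contexts)
    return knn

def predict_context(knn, test_frames):
    return knn.predict(embed(test_frames))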
Table 2 reports the results of the punctual localization task, also based on TripletNet (Hoffer and Ailon, 2015), considering both 6 degrees (Table 2, top) and 3 degrees (Table 2, bottom) of freedom. We report the mean and median errors for the position (meters), the quaternion rotation (degrees) and the Euler angles (degrees).
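For reference, the rotation error in Table 2 can be computed as the angular distance between predicted and ground-truth quaternions; the sketch below shows the position error and this quaternion angle, under the assumption that quaternions are stored as unit-norm (w, x, y, z) vectors.

import numpy as np

def position_error(p_pred, p_gt):
    """Euclidean distance between predicted and ground-truth camera positions (meters)."""
    return float(np.linalg.norm(np.asarray(p_pred) - np.asarray(p_gt)))

def quaternion_error_deg(q_pred, q_gt):
    """Angular distance (degrees) between two unit quaternions."""
    q_pred = np.asarray(q_pred) / np.linalg.norm(q_pred)
    q_gt = np.asarray(q_gt) / np.linalg.norm(q_gt)
    dot = np.clip(abs(np.dot(q_pred, q_gt)), 0.0, 1.0)  # abs handles the q / -q ambiguity
    return float(np.degrees(2.0 * np.arccos(dot)))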
Table 2: The results obtained by the proposed system in the task of punctual localization considering both 6 (top) and 3 (bottom) degrees of freedom.

                     Validation           Test
Errors           Avg      Median     Avg      Median
Position         0.034    0.012      0.787    0.406
Quaternion       36.28    1.50       29.82    15.49
X angle          1.095    0.517      7.130    4.934
Y angle          0.756    0.315      5.116    3.732
Z angle          1.874    0.798      25.889   10.902

                     Validation           Test
Errors           Avg      Median     Avg      Median
Position         0.031    0.011      0.769    0.386
Angle            1.874    0.798      25.889   10.902

Table 3 shows the results for the object detection and recognition task. We report the Average Precision (AP) for each of the 23 object classes, together with the mean Average Precision (mAP) computed with an Intersection over Union (IoU) threshold of 0.5 (mAP@50). We obtained an mAP@50 of 73.41% on the Test set. The results suggest that the system recognizes large objects well, such as the oscilloscope and the socket (AP of 90.12% and 90.27% respectively), whereas it has trouble recognizing small objects such as the Power Supply Cables (AP of 41.34%) or the Battery Charger Connector (AP of 15.91%).

Table 3: The results obtained by the object detector in the industrial laboratory.

Object Category                 AP       Object Category                 AP
Power Supply                    80.18    Working Area                    90.18
Oscilloscope                    90.12    Welder Base                     88.82
Welder Station                  89.87    Socket                          90.27
Electric Screwdriver            81.45    Left Red Button                 100.00
Screwdriver                     58.73    Left Green Button               100.00
Pliers                          79.18    Right Red Button                81.82
Welder Probe Tip                50.63    Right Green Button              90.91
Oscilloscope Probe Tip          51.72    Power Supply Cables             41.34
Low Voltage Board               88.53    Ground Clip                     44.84
High Voltage Board              61.44    Battery Charger Connector       15.91
Register                        71.07    Electric Panel                  89.77
Electric Screwdriver Battery    51.72
Furthermore, we addressed the egocentric human-object interaction detection task by detecting all the active objects involved in the interactions. We trained a hand-centric method based on a standard object detector (Ren et al., 2015) on 6 videos of the procedural dataset and tested it on the 2 remaining videos. We select the active objects by filtering all the detected objects according to the minimum distance between the centers of the hand and object bounding boxes. We evaluated the performance by computing the mAP@50 and the mean Average Recall (mAR), obtaining values of 36.80% and 28.31% respectively.
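A sketch of this distance-based filtering is given below: among all detected objects, the one whose bounding-box center is closest to a detected hand center is kept as the active object. Function names and the optional distance threshold are illustrative assumptions.

import math

def box_center(box):
    """Center of an (x1, y1, x2, y2) bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def select_active_object(hand_boxes, object_boxes, max_dist=None):
    """Return the index of the object closest to any detected hand, or None."""
    best_idx, best_dist = None, float("inf")
    for hand in hand_boxes:
        hx, hy = box_center(hand)
        for idx, obj in enumerate(object_boxes):
            ox, oy = box_center(obj)
            dist = math.hypot(hx - ox, hy - oy)
            if dist < best_dist and (max_dist is None or dist <= max_dist):
                best_idx, best_dist = idx, dist
    return best_idx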
6 CONCLUSION
We have presented ENIGMA, a wearable system able to assist workers in an industrial laboratory by providing information about the surrounding environment, improving the safety of workers and implementing energy saving strategies. Preliminary experiments show good performance on the localization, object detection and recognition, and egocentric human-object interaction tasks. Future work will address the improvement of these services as well as the integration of the next-active object detection service.
ACKNOWLEDGEMENTS
This research has been supported by Next Vision s.r.l., by the project MISE - PON I&C 2014-2020 - Progetto ENIGMA - Prog n. F/190050/02/X44 CUP: B61B19000520008 and by MEGABIT - PIAno di inCEntivi per la RIcerca di Ateneo 2020/2022 (PIACERI) linea di intervento 2, DMI - University of Catania.
REFERENCES
Brachmann, E. and Rother, C. (2018). Learning less is more
- 6d camera localization via 3d surface regression. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Colombo, S., Lim, Y., and Casalegno, F. (2019). Deep vi-
sion shield: Assessing the use of hmd and wearable
sensors in a smart safety device. In ACM PETRA.
Cucchiara, R. and Bimbo, A. D. (2014). Visions for aug-
mented cultural heritage experience. IEEE MultiMe-
dia, 21(1):74–82.
Damen, D., Leelasawassuk, T., Haines, O., Calway, A., and
Mayol-Cuevas, W. (2014). You-do, i-learn: Discover-
ing task relevant objects and their modes of interaction
from multi-user egocentric video. In BMVC.
Farinella, G. M., Signorello, G., Battiato, S., Furnari, A.,
Ragusa, F., Leonardi, R., Ragusa, E., Scuderi, E.,
Lopes, A., Santo, L., and Samarotto, M. (2019). Vedi:
Vision exploitation for data interpretation. In ICIAP.
Furnari, A., Battiato, S., and Farinella, G. M. (2018).
Personal-location-based temporal segmentation of
egocentric video for lifelogging applications. Journal
of Visual Communication and Image Representation,
52:1–12.
Girshick, R. (2015). Fast R-CNN. In ICCV.
Gkioxari, G., Girshick, R. B., Dollár, P., and He, K. (2018). Detecting and recognizing human-object interactions. CVPR, pages 8359–8367.
Gupta, S. and Malik, J. (2015). Visual semantic role label-
ing. ArXiv, abs/1505.04474.
Gurevich, P., Lanir, J., Cohen, B., and Stone, R. (2012).
Teleadvisor: a versatile augmented reality tool for re-
mote assistance. Proceedings of the SIGCHI Confer-
ence on Human Factors in Computing Systems.
Hoffer, E. and Ailon, N. (2015). Deep metric learning us-
ing triplet network. In Feragen, A., Pelillo, M., and
Loog, M., editors, Similarity-Based Pattern Recogni-
tion, pages 84–92. Springer International Publishing.
Leonardi, R., Ragusa, F., Furnari, A., and Farinella, G. M.
(2022). Egocentric human-object interaction detection
exploiting synthetic data.
Lin, T. Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P. (2014). Microsoft COCO: Common objects in context.
Nagarajan, T., Feichtenhofer, C., and Grauman, K. (2019).
Grounded human-object interaction hotspots from
video. In ICCV, pages 8687–8696.
Nagarajan, T., Li, Y., Feichtenhofer, C., and Grauman, K.
(2020). Ego-topo: Environment affordances from
egocentric video. ArXiv, abs/2001.04583.
Osti, F., de Amicis, R., Sanchez, C. A., Tilt, A. B., Prather,
E., and Liverani, A. (2021). A vr training system
for learning and skills development for construction
workers. Virtual Reality, 25:523–538.
Ragusa, F., Furnari, A., Battiato, S., Signorello, G., and
Farinella, G. M. (2020). EGO-CH: Dataset and fun-
damental tasks for visitors behavioral understanding
using egocentric vision. Pattern Recognition Letters.
Ragusa, F., Furnari, A., and Farinella, G. M. (2022). Mec-
cano: A multimodal egocentric dataset for humans be-
havior understanding in the industrial-like domain.
Ragusa, F., Furnari, A., Livatino, S., and Farinella, G. M.
(2021). The meccano dataset: Understanding human-
object interactions from egocentric videos in an
industrial-like domain. In IEEE Winter Conference
on Application of Computer Vision (WACV).
Rebol, M., Hood, C., Ranniger, C., Rutenberg, A., Sikka, N., Horan, E. M., Gütl, C., and Pietroszek, K. (2021). Remote assistance with mixed reality for procedural tasks. 2021 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 653–654.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. CoRR, abs/1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-
CNN: Towards real-time object detection with region
proposal networks. In NeurIPS, pages 91–99.
Seidenari, L., Baecchi, C., Uricchio, T., Ferracani, A.,
Bertini, M., and Bimbo, A. D. (2017). Deep art-
work detection and retrieval for automatic context-
aware audio guides. ACM Transactions on Multime-
dia Computing, Communications, and Applications,
13(3s):35.
Shan, D., Geng, J., Shu, M., and Fouhey, D. (2020). Under-
standing human hands in contact at internet scale. In
CVPR.
Sorko, S. R. and Brunnhofer, M. (2019). Potentials of aug-
mented reality in training. Procedia Manufacturing.
Sun, L., Osman, H. A., and Lang, J. (2021). An augmented reality online assistance platform for repair tasks. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17:1–23.
Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys,
M., Sivic, J., Pajdla, T., and Torii, A. (2018). Inloc: In-
door visual localization with dense matching and view
synthesis. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Torralba, A., Murphy, K. P., Freeman, W. T., and Rubin,
M. A. (2003). Context-based vision system for place
and object recognition. In Proceedings of the Ninth
IEEE International Conference on Computer Vision -
Volume 2, ICCV ’03, page 273. IEEE Computer Soci-
ety.