A Robust Real-time Component for Personal Protective Equipment Detection in an Industrial Setting

Pedro Torres¹, André Davys¹, Thuener Silva¹, Luiz Schirmer¹, André Kuramoto², Bruno Itagyba², Cristiane Salgado², Sidney Comandulli², Patricia Ventura², Leonardo Fialho², Marinho Fischer², Marcos Kalinowski¹, Simone Barbosa¹ and Hélio Lopes¹

¹Department of Informatics, PUC-Rio, Rio de Janeiro, Brazil
²PETROBRAS, Rio de Janeiro, Brazil
{kuramoto, itagyba, cristiane.salgado, comandulli, patriciarventura, leofmello, marinhof}@petrobras.com.br
Keywords: Artificial Intelligence, Industrial Application, Computer Vision, Real-time System.

Abstract: In large industries, such as construction, metallurgy, and oil, workers are continually exposed to various hazards in their workplace. According to the International Labour Organization (ILO), there are 340 million occupational accidents annually. Personal Protective Equipment (PPE) is used to ensure the essential protection of workers' health and safety, and there is a great effort to ensure that such equipment is used properly. In these environments, it is common to have closed-circuit television (CCTV) cameras to monitor workers, and these cameras can be used to verify proper PPE usage. Some works address this problem using CCTV images; however, they frequently cannot handle the detection of multiple types of safety equipment, and others even skip the verification phase, performing only the detection. In this paper, we propose a novel cognitive safety analysis component for a monitoring system. This component detects the proper usage of PPEs in real time, using the data stream from regular CCTV cameras. We built the system component on top of state-of-the-art deep learning techniques for object detection. The methodology is robust, with consistent and promising results in terms of Mean Average Precision (80.19% mAP), and can act in real time (80 FPS).
1 INTRODUCTION

Workers, especially in industrial settings, are continually exposed to various hazards in their workplace. In this context, unfortunately, there are several fatal cases. The Brazilian Protection Statistical Yearbook¹ reports an average of six hundred thousand occupational accidents and 2,600 deaths per year, registered between 2010 and 2017.

A company, such as an oil and gas refinery, could avoid injuries by monitoring its workers to prompt corrective measures when personal protective equipment (PPE) is not used appropriately. Nevertheless, this activity is often performed by a human through constant visual inspection, either on-site or via closed-circuit television (CCTV). In this scenario, an industry could benefit from a system powered by Machine Learning and Computer Vision techniques that automates this task in order to prevent accidents and minimize costs and injuries. Figure 1 illustrates a possible industrial system to monitor the use of PPE automatically and emit alarms when equipment is missing or not used appropriately. The system is fed with RGB images from CCTV cameras; each image initially passes through the detection and verification component. This component is the fundamental basis of the system and is responsible for producing evidence of inappropriate use of the equipment by workers. Finally, the ID association component can match this evidence with the worker's identity in the company database and issue an alert containing the worker's identification and the type of deviation.

¹https://bc.pressmatrix.com/pt-BR/profiles/1227998e328d/editions/0e55e8eba33a3ed62b2e/pages/page/40
Regarding the detection and verification component, there are two main approaches to address the problem of PPE detection with Computer Vision techniques. One approach uses one-stage classifiers (Bo et al., 2019), which handle the detection and verification phases through a single model. Another approach employs multi-stage classifiers (Li et al., 2017), which use one or more models to handle each
phase. Both approaches show good results but are limited to identifying just one PPE (a hardhat in the majority of cases), suggesting that detecting multiple PPEs is more challenging than detecting a single one (Zheng et al., 2019; Nath et al., 2020).
Although important in the scenario described in Figure 1, we will not conduct an in-depth exploration of the challenges related to the ID association component, as they lie outside the scope of this work. We shall focus on exploring the two main approaches for implementing a detection and verification component (highlighted as the blue box in Figure 1) that is both robust and capable of acting in real time for monitoring systems in industrial environments, especially oil and gas refineries.
Figure 1: Example of an industrial monitoring system for
PPE compliance.
Our main contributions are:
- Evaluating two approaches to solve the PPE detection problem: a one-stage classifier (our first approach) and a multi-stage classifier (our second approach).
- A dataset for PPE detection that addresses multiple types of equipment.
- Exploring how ensemble classifiers perform in the verification stage of a multi-stage implementation.
The paper is structured as follows: Section 2 presents a literature review of Deep Neural Networks and the use of object detection models to solve the PPE detection problem; Section 3 describes the construction and exploration process adopted for the proposed approaches, including details regarding dataset generation; Section 4 presents the performance comparison of models and approaches; and Section 5 presents the conclusion.
2 LITERATURE REVIEW

The use of DNNs (Deep Neural Networks) has achieved state-of-the-art results in different Computer Vision tasks in recent years. Convolutional Neural Networks (CNNs) have emerged as an important approach to perform a broad range of visual tasks (Krizhevsky et al., 2012; Ren et al., 2015). CNNs are composed of layers of filters that represent neighbourhood spatial connectivity patterns. Their use of convolutions, non-linear activation functions, and downsampling results in a hierarchical understanding of those features. A crucial aspect of this interleaving of operations is that it usually fuses spatial and channel-wise information.
Recent advances in image classification focus on training feedforward convolutional neural networks with "very deep" structures (Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016). The feedforward convolutional network mimics the bottom-up paths of the human cortex. Several approaches have been proposed to further improve the discriminative ability of deep convolutional neural networks. VGG (Simonyan and Zisserman, 2014), Inception (Szegedy et al., 2015), and residual learning (He et al., 2016) are frameworks that are able to train very deep neural networks. VGGNets and Inception models investigated very deep architectures in detail. A complementary approach is ResNets, which apply skip connections to further improve the training of deep networks (He et al., 2016).
Considering PPE detection, some studies use different neural network architectures to improve the results on this task. One of the first works in this direction was shown in (Fang et al., 2018), in which the authors use the Faster R-CNN neural network to detect workers' non-hardhat use in a construction scenario. Their results demonstrate that the use of Faster R-CNN can facilitate improved safety inspection and supervision in a real construction environment.
Bo et al. (2019) use YOLO-v3 for hardhat detection. You Only Look Once (YOLO) (Redmon et al., 2016) is a state-of-the-art, real-time object detection system that uses a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The authors tested this detection model on images collected from an electric power construction scenario, aiming to detect whether a worker wears a hardhat or not.
The works presented above aim to construct a model for hardhat-wearing detection. However, in some cases, we are interested in detecting multiple PPEs, such as hardhats, gloves, and worker vests. In this sense, Zheng et al. (2019) and Nath et al. (2020) developed models to handle a larger number of PPE types. Nath et al. (2020) propose three different approaches to detect multiple PPEs in a construction scenario. In the first approach, they developed a multi-stage
model, where the YOLO-v3 model detects workers, hats, and vests, and then a simple machine learning classifier is applied to verify whether a worker is wearing a hardhat and vest. In the second approach, they use a one-stage model, based on YOLO-v3, in which the model simultaneously detects individual workers and verifies the PPE usage. The third approach is a multi-stage model, in which YOLO-v3 detects only workers in a scene; these detections are then cropped to feed a CNN classifier that is responsible for verifying PPE compliance.
Deep neural networks have proven quite useful for solving object identification tasks (Nath et al., 2020). Despite the meaningful results described in the literature, in the industrial domain these approaches have limitations regarding inference time and precision. In this work, we propose robust approaches that can be used as a fundamental component of a monitoring system for PPE detection, using the YOLO-v4 model as a basis for implementing the proposed approaches. We also experiment with new adaptations, in terms of models and parameters, for the multi-stage approach, where we create a solution that uses ensemble classifiers. As one will see in the next sections, our approaches present promising results (in terms of mAP) and are still capable of acting in real time.
3 METHODOLOGY

This section presents the construction and exploration process for the two proposed approaches, including the dataset generation used to carry out model training and evaluation. In this work, we evaluate whether these approaches can act robustly and in real time when implemented as the fundamental component of a monitoring system that seeks to detect which workers are making appropriate use of PPEs.
For this study, we chose to focus on two types of PPE: hardhats and protective clothing. These types of equipment are often used to ensure the safety of workers in the oil and gas industry and in civil construction. Our approaches are based on techniques for detecting and classifying objects in images. In this way, we have four possible classes for the classification: worker with no PPE (W), worker wearing a hardhat (WH), worker wearing protective clothing (vest) (WV), and worker wearing a hardhat and protective clothing (WHV). Notice that the approaches implemented in this work can be extended to other types of PPE, but not without an increase in complexity due to the number of different combinations of equipment.
The detection and verification component is based on deep learning models capable of being executed in real time. In the work of Nath et al. (2020), which also addresses the problem of detecting multiple PPEs, the authors define a real-time system as one that can process at least five frames per second (FPS). When this is not possible, rates of at least 1 FPS are considered "near real-time". These definitions emerge from previous work (Redmon et al., 2016) that also raises this concern. In this work, we follow the same definition, since the context is quite similar. We expect our implementation to act in real time, with a prompt response when a worker's life is exposed to risk.
In particular, both approaches employ the YOLO-v4 architecture and carry out some of the steps in a similar manner, such as data preparation and model training. In approach I, for each image, we annotate one bounding box per worker, where the class of the bounding box indicates which PPE the worker is wearing. This way, we create a single model based on the YOLO-v4 architecture for both the detection and verification phases. Approach II uses one model for the detection phase and another model for the verification phase. First, we detect workers' bounding boxes in images using a model based on YOLO-v4. Next, a convolutional neural network is applied for PPE compliance verification. The details of both approaches are clarified in Subsection 3.3 and Subsection 3.4.
3.1 Dataset Generation

Supervised machine learning applications require a large annotated dataset, through which the learning model can create and recognize patterns in the data. As we did not have access to a large image dataset available for PPE detection, it was necessary to create the annotated dataset to feed the model.
The dataset used in this work is composed of images from the following sources: Crowd-sourced (as used by Nath et al. (2020)), GDUT-HWD (Wu et al., 2019), Web-scrapped, and images captured by the authors. Of the datasets that already had annotations (Crowd-sourced and GDUT-HWD), the only one that deals with the identification of workers using multiple PPEs is Crowd-sourced, whose class annotations are the same as those used in this work. For GDUT-HWD, only the images were used, since the dataset annotations cover only the individual identification of safety helmets. To compose the Web-scrapped source, images were obtained from public databases using search engines that perform keyword searches, e.g., "workers in refinery", "workers in platforms". After collecting these images, a visual inspection removed those that were out of context or of low quality. The images captured by the authors were taken in a controlled environment that sought to reproduce the conditions of an industrial setting. Figure 2 shows the number of instances for each data source.
Figure 2: Distribution of instances per class for each data
source.
In the annotation procedure to mark bounding boxes of objects in images for training the models, we aimed to minimize annotation bias by adopting the following procedure. For each data source, except for the Crowd-sourced one (already annotated), the dataset was split into equal-sized batches. Each batch was initially annotated by a person using the YOLO mark tool², an open-source library for image annotation. At the end of the annotation of a batch, an annotator reviewed the annotations of another batch and separated those that presented divergences (concerning the annotated class or the bounding box region). Each divergence was discussed by the annotators until they reached a consensus, defining the ground truth for each image.

²github.com/AlexeyAB/Yolo_mark
3.2 Data Preprocessing

In the dataset preparation, we randomly divided the images into three subsets: training (70%), validation (20%), and testing (10%). This division was carefully considered to ensure similar class distributions across the sets. To achieve this, we split each data source while keeping the proportions fixed, i.e., the training, validation, and testing sets contain the same percentage of each data source, making the sets more homogeneous.
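A minimal sketch of this per-source split is shown below; `images_by_source`, the function name, and the fixed seed are illustrative assumptions rather than the authors' actual tooling.

```python
# Sketch of a 70/20/10 split applied independently to each data source so
# every subset keeps the same source proportions.
import random

def split_per_source(images_by_source, seed: int = 0):
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for source, images in images_by_source.items():
        images = images[:]                     # copy before shuffling
        rng.shuffle(images)
        n = len(images)
        n_train, n_val = int(0.7 * n), int(0.2 * n)
        splits["train"] += images[:n_train]
        splits["val"] += images[n_train:n_train + n_val]
        splits["test"] += images[n_train + n_val:]
    return splits
```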
Note that the dataset was split based on the number of images. Thus, to verify whether the instance proportions in each subset are similar, we analyzed the distribution of the number of instances, as shown in Figure 3. According to the figure, the same proportion between classes is maintained across all subsets. Moreover, it is possible to observe that the dataset has many examples of the WH class and few examples of the WV class, which can hinder the learning model from generalizing these classes.
Figure 3: Number of instances per class for each subset.
To improve the ability to detect objects accurately, YOLO uses nine anchor boxes, which must be defined before the training and inference phases. With the anchor boxes defined, the model can specialize in objects of certain sizes and with particular aspect ratios (height × width).

In practice, during the training phase, each cell of the feature maps of the network's output layers has an associated anchor box. Thus, the model learns how to shift and scale an anchor box so that the coordinates of the predicted bounding box fit the object of interest. To define the anchor boxes, the K-Means clustering algorithm was used, with k = 9, executed with the training set bounding boxes as input.
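A minimal sketch of this anchor estimation is shown below, assuming boxes are given as normalized (width, height) pairs; note that we use scikit-learn's Euclidean K-Means for illustration, whereas YOLO's reference implementation clusters with an IoU-based distance.

```python
# Sketch of anchor-box estimation with K-Means (k = 9) over the training
# set's bounding-box dimensions.
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(wh_pairs: np.ndarray, k: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs into k anchor boxes."""
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh_pairs)
    anchors = kmeans.cluster_centers_
    # Sort anchors by area so small objects map to the early output layers.
    return anchors[np.argsort(anchors.prod(axis=1))]

# Example with random box dimensions standing in for the training set.
boxes = np.random.rand(1000, 2)
print(estimate_anchors(boxes))
```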
3.3 Approach I

The first approach uses a YOLO-v4 model, which we will call YOLO-v4-AP1, to perform the identification and verification steps in one stage. During the annotation phase, the regions where the workers are located are labeled with one of the classes (W, WV, WH, WHV). Once the model is trained, the inference provides a worker's location, and the detected class determines which PPE he/she is using.

Figure 4 demonstrates how approach I is carried out. One of its main advantages is that it leverages the ability of YOLO models to make predictions (with the locations of objects and their respective classes) using a single network, which makes it a simple yet effective method.
Figure 4: Approach I.

In the inference phase, the YOLO-v4-AP1 model performs predictions, and only detections with a confidence score above 50% are considered final predictions. It is worth mentioning that the model can make duplicate detections and assign different classes to the same worker but, following the problem definition, a worker can only belong to one class. To avoid duplicate detections, the model uses non-maximum suppression (NMS). Usually, NMS adds 2-3% in mAP (Redmon et al., 2016).
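For illustration, this post-processing can be sketched as below; the 0.5 confidence threshold matches the text, while the 0.45 IoU threshold is an assumption, and the actual suppression happens inside the darknet framework.

```python
# Sketch of confidence filtering followed by class-agnostic non-maximum
# suppression; detections are rows of (x1, y1, x2, y2, score).
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(dets: np.ndarray, conf_thresh: float = 0.5, iou_thresh: float = 0.45):
    dets = dets[dets[:, 4] >= conf_thresh]      # drop low-confidence boxes
    dets = dets[np.argsort(-dets[:, 4])]        # highest score first
    keep = []
    while len(dets):
        best, dets = dets[0], dets[1:]
        keep.append(best)
        # suppress remaining boxes that overlap the kept one too much
        dets = dets[iou(best[:4], dets[:, :4]) < iou_thresh]
    return np.array(keep)
```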
3.4 Approach II

In this approach, we use a multi-stage method to perform the identification and verification of the proper usage of PPEs by workers. Initially, a YOLO-v4 model (which we will call YOLO-v4-AP2) is used only to locate the worker (detection stage); thus, in this stage there is only the worker class (W). To carry out the equipment verification stage, we use a CNN that receives a cropped image for each region in which a worker is detected. This CNN classifies each image according to the four possible classes (W, WH, WV, WHV). Figure 5 illustrates how this multi-stage approach is performed.
Figure 5: Approach II.
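To make the two stages concrete, a minimal sketch of the pipeline is given below; `detect_workers` (wrapping YOLO-v4-AP2) and `classifier` (the Keras verification CNN) are hypothetical stand-ins for the trained models.

```python
# Sketch of the multi-stage pipeline: detect workers, crop each region,
# and classify the PPE usage of each crop.
import cv2
import numpy as np

CLASSES = ["W", "WH", "WV", "WHV"]

def verify_ppe(frame, detect_workers, classifier):
    """Stage 1: worker detection; stage 2: PPE verification per crop."""
    results = []
    for (x1, y1, x2, y2) in detect_workers(frame):          # worker boxes
        crop = cv2.resize(frame[y1:y2, x1:x2], (150, 150))  # network input size
        probs = classifier.predict(crop[None].astype("float32") / 255.0,
                                   verbose=0)[0]
        results.append(((x1, y1, x2, y2), CLASSES[int(np.argmax(probs))]))
    return results
```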
For this approach, we adopted the following classifiers: VGG-16, ResNet50, ResNet101, and Inception. For each classifier, a dropout layer with probability p = 0.3 and a fully connected layer with 256 nodes and L1 and L2 regularization with weight r_w = 10^-3 were added after the convolutional blocks (base model). The fully connected layer is followed by the ReLU (Rectified Linear Unit) activation function. Additionally, the input dimensions of the networks were adapted to receive images in the format 150 × 150, and the output layer dimensions were modified so that the network has four nodes, one for each class. Finally, this layer is followed by the SoftMax activation function.
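A minimal tf.keras sketch of this classifier head is shown below, using ResNet50 as one example base model; the exact layer wiring is our reading of the description above, not the authors' released code.

```python
# Sketch of the verification classifier: frozen ImageNet base, dropout
# p = 0.3, 256-node dense layer with L1/L2 regularization, SoftMax output.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_verifier(num_classes: int = 4):
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=(150, 150, 3), pooling="avg")
    base.trainable = False                       # frozen convolutional blocks
    return tf.keras.Sequential([
        base,
        layers.Dropout(0.3),                     # dropout with p = 0.3
        layers.Dense(256, activation="relu",     # 256 nodes, L1/L2 with 1e-3
                     kernel_regularizer=regularizers.l1_l2(l1=1e-3, l2=1e-3)),
        layers.Dense(num_classes, activation="softmax"),
    ])
```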
In the training step, we use a transfer learning strategy, initializing the convolutional layers with weights pre-trained on the ImageNet dataset. We freeze these layers and retrain the models for 60 epochs using the Adam optimizer with a learning rate η = 10^-5.

As the classifiers receive only the image cropped to the worker's location, the dataset used was built from each annotated instance belonging to the training set's ground truth. The images are scaled to 150 × 150 to match the networks' input size. For data augmentation, the following transformations were applied to the images: random zoom and shear of up to 20%; a horizontal flip was also applied to up to half of the images.
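The training step and augmentation can be sketched as follows, reusing the build_verifier head from the previous sketch; the directory layout of the cropped instances is a hypothetical assumption.

```python
# Sketch of transfer-learning training with on-the-fly augmentation:
# Adam at 1e-5, 60 epochs, zoom/shear up to 20%, and horizontal flips.
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = build_verifier()                         # head from the previous sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])

train_gen = ImageDataGenerator(rescale=1.0 / 255, zoom_range=0.2,
                               shear_range=0.2, horizontal_flip=True)
train_flow = train_gen.flow_from_directory(
    "crops/train", target_size=(150, 150), class_mode="categorical")

model.fit(train_flow, epochs=60)                 # 60 epochs, frozen base layers
```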
We also improve approach II by using an ensemble of classifiers. Thus, when performing the verification stage, we do not rely on a single classifier, but on a set of classifiers (VGG16, Inception, ResNet50, and ResNet101). We believe that by combining the predictions of multiple classifiers, we can reduce the variance and make the classification less dependent on any single model. Further, the classification bias can also be reduced, since the classifiers together can make the class distinction criteria more expressive. The ensemble predictions were combined using the majority voting method: each classifier assigns its vote to the class with the highest probability, and the final prediction is the class that receives the highest number of votes.
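A minimal sketch of this majority vote is shown below, assuming each model outputs SoftMax probabilities over the four classes; the tie-breaking rule (lowest class index wins) is our assumption.

```python
# Sketch of majority voting over the verification classifiers.
import numpy as np

CLASSES = ["W", "WH", "WV", "WHV"]

def majority_vote(prob_list):
    """Each classifier votes for its top class; ties fall to the lowest index."""
    votes = [int(np.argmax(p)) for p in prob_list]
    counts = np.bincount(votes, minlength=len(CLASSES))
    return CLASSES[int(np.argmax(counts))]

# Example: three classifier outputs for one cropped worker.
probs = [np.array([0.1, 0.6, 0.1, 0.2]),   # votes WH
         np.array([0.2, 0.5, 0.2, 0.1]),   # votes WH
         np.array([0.1, 0.2, 0.3, 0.4])]   # votes WHV
print(majority_vote(probs))                # -> "WH"
```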
3.5 Model Training

For the models implemented in approaches I and II, all layers except the last three output layers have their weights initialized from training the YOLO-v4 model on the COCO dataset. After this first training, a transfer learning method is applied to take advantage of the model's ability to detect up to 80 classes (person, car, motorcycle, etc.) from the COCO dataset and apply that knowledge in our domain. Since this model has already been trained on a more significant number of images, it can generalize its ability to distinguish features for our task, for which there is a much smaller number of images. It may seem that the two tasks have no evident intersection beyond the fact that a worker is also a person, but in classification or detection problems, some low-level characteristics, such as edges, shapes, and contours, can be shared between tasks, thus allowing knowledge to spread between them.
The models of approaches I and II were re-trained following the same hyperparameter definitions. The re-training ran for 30 epochs with a learning rate η = 0.0013, using the Adam optimizer. To accelerate the convergence of learning and also mitigate overfitting, values of momentum = 0.949 and decay = 0.0005 were adopted.
Additionally, to improve model performance in the real scenario, we use data augmentation, performed in real time during the model's re-training phase. To create more diversity in the training images, the Mosaic technique, introduced together with YOLO-v4, was used. This method mixes four training images; hence, four different contexts are combined, which could allow the detection of objects outside their normal context. In addition, we also make random changes to the HSV color space of the images, where the saturation and exposure values are modified by a factor of up to 1.5.
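These settings map onto the darknet configuration file used to train YOLO-v4. The excerpt below sketches the relevant [net] keys under that assumption; the batch size, subdivisions, and input resolution are illustrative values, not reported in this paper.

```ini
# Sketch of the [net] block of a YOLO-v4 darknet .cfg with the settings
# described in this section; batch/subdivisions/width/height are assumptions.
[net]
batch=64
subdivisions=16
width=608
height=608
learning_rate=0.0013
momentum=0.949
decay=0.0005
mosaic=1          # Mosaic augmentation: mixes four training images
saturation=1.5    # random HSV saturation jitter up to a factor of 1.5
exposure=1.5      # random HSV exposure jitter up to a factor of 1.5
```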
4 EXPERIMENTAL RESULTS

4.1 Performance of the YOLO-v4 Models

The performance of the YOLO models is evaluated using the mAP (Mean Average Precision) metric, which is used for several object detection models (e.g., Faster R-CNN, R-CNN, SSD). One of the advantages of mAP is that we can quantify how well an object detection model performs on a dataset using a single numerical representation. Figure 6 presents the performance results of the detection models used in approach I (YOLO-v4-AP1) and approach II (YOLO-v4-AP2) on the test set. The highest mAP value (88.18%) is obtained by the approach II model, which was expected, since it has only one class to detect. A factor that may have contributed considerably to this model's performance was the use of transfer learning; in this case, there is a notable similarity between the COCO person class and the worker class (W). For approach I, we have a mAP of 80.19%, which is a good result, since this approach uses a single model to perform both the detection and verification stages. That is, the result of approach I will not change, while the result of approach II will change when we add the verification step, being directly affected by the performance of the classifiers.
Figure 6: Performance of the YOLO-v4 models for each approach.
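For reference, a minimal sketch of the average precision computation for one class is given below, using all-point interpolation and assuming detections have already been matched to ground truth at an IoU threshold; mAP is then the mean of the per-class APs. The exact protocol used by the darknet evaluation tooling may differ in details.

```python
# Sketch of average precision for one class: rank detections by score,
# accumulate precision/recall, and integrate the interpolated curve.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically decreasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    return float(np.sum(np.diff(recall) * precision[1:]))

# Example: five detections for one class, three ground-truth objects.
print(average_precision([0.9, 0.8, 0.7, 0.6, 0.5],
                        [True, True, False, True, False], num_gt=3))
```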
4.2 Performance of Classifiers

To evaluate the classifiers proposed for the verification step of approach II, instances of the test set were used. The accuracies of the VGG-16, Inception, ResNet50, and ResNet101 models for classifying images into the classes W, WH, WV, and WHV are 82.85%, 78.35%, 84.02%, and 82.75%, respectively. Figure 7 shows the confusion matrix for each model. Note that the WHV class has the lowest accuracy for all classifiers. One reason is confusion with the WH class, which accounts for an average of 20.05% of false negatives. That is, once a worker is detected wearing a helmet, it is difficult to distinguish between wearing or not wearing safety clothing. The individual accuracy for the remaining classes (W, WH, WV) is higher than 80%, with the exception of class W when evaluated with Inception, which had a value 6.6% below the average (82.83%).
Figure 7: Confusion matrices for VGG-16, Inception,
ResNet50 and ResNet101 classifiers.
4.3 Performance of Approaches

In this subsection, we evaluate the final performance of approaches I and II. Note that this result depends directly on the performance of the implemented models. Figure 8 shows that approach I obtained the best mAP (80.19%), even when compared with the different model combinations used in approach II. Although the YOLO-v4-AP2 model presented a mAP of 88.18% in the detection step, the errors of the classifiers in the verification step end up reducing the mAP, since some of the instances will be classified incorrectly, generating false positives. The best result for approach II, 72.87% mAP, is obtained using the YOLO-v4-AP2 model together with the ensemble of VGG16, ResNet50, and ResNet101, which are the classifiers that reported the highest accuracy. In the scenario where only one classifier is used in the verification step, the best result is obtained using ResNet101 (70.42% mAP).
Figure 8: Performance comparison of the approaches' implementations.

We also compared the processing time spent by each approach. We ran all models on the same machine, which has the following configuration: Intel Core i9-7900X, 128 GB RAM, and a TITAN RTX GPU with 24 GB of memory. Figure 9 presents the average processing time for a test image in each approach. Approach I has the best processing time (12.55 ms). For approach II, the time of the YOLO-v4-AP2 model alone is slightly lower (12.38 ms) than that of the model implemented in approach I, but when the classifiers' time is added, the total increases significantly, making approach II the slowest, with times in the interval of 53.4 ms to 224.67 ms. As the proposed models run at a rate of at least 5 FPS (frames per second), following the definition adopted in this work (the same as Redmon et al. (2016) and Nath et al. (2020)), we can say that these approaches are capable of processing videos in real-time applications and can be implemented as the base component of a monitoring system for PPE compliance.
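For illustration, per-frame latency and FPS can be measured with a sketch like the following; `run_pipeline` and the warm-up count are hypothetical stand-ins, not the authors' benchmarking harness.

```python
# Sketch of a per-frame latency/FPS benchmark over a set of test frames.
import time

def benchmark(run_pipeline, frames, warmup: int = 5):
    """Average per-frame latency (ms) and throughput (FPS)."""
    for frame in frames[:warmup]:            # warm-up to exclude startup costs
        run_pipeline(frame)
    start = time.perf_counter()
    for frame in frames:
        run_pipeline(frame)
    elapsed = time.perf_counter() - start
    ms_per_frame = 1000.0 * elapsed / len(frames)
    return ms_per_frame, 1000.0 / ms_per_frame
```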
4.4 Benchmark of Results

Since neither the dataset nor the models used in the work of Nath et al. (2020) were made available by the authors, we had to find other alternatives. To perform a fair comparison of the methodology adopted in this work and verify whether the approaches were effective, we established a comparison with the results obtained with the YOLO-v3 model. The model was trained and tested in a way similar to the YOLO-v4-AP2 model. Table 1 shows the comparison between the models in terms of mAP and FPS. Concerning mAP, we have an increase of 21.52% and 13.96% when comparing the YOLO-v4-AP1 and YOLO-v4-AP2 (with ensemble) models with the baseline (YOLO-v3), respectively. On the other hand, the baseline achieves a higher FPS; however, since the models of approaches I and II are already able to run in real time, this difference is not very significant.
Table 1: Comparison of mAP and FPS with a baseline model.

Criteria | Baseline (YOLO-v3) | Approach I, one-stage (YOLO-v4-AP1) | Approach II, multi-stage (Ensemble: V + R50 + R101)
mAP      | 58.67%             | 80.19%                              | 72.87%
FPS      | 104                | 80                                  | 7
Figure 9: Processing time comparison of the approaches' implementations.
5 CONCLUSIONS

This work explores the implementation of two different deep-learning-based approaches to perform the task of detecting the usage of PPEs by workers. We aimed to develop and evaluate approaches that are robust and capable of acting in real time, so that they can be implemented as the fundamental (detection and verification) component of a monitoring system.

Although in this paper we focus on two types of PPE (hardhat and protective clothing), our results show that we can apply our methodology to any number of equipment types (e.g., goggles, gloves, and masks) by changing the networks' output layers. The effect of this is an increase in complexity due to the number of different combinations of equipment.

To carry out the models' training, we built a dataset from four different sources to supply our models with a more significant number of images from different devices, angles, lighting conditions, and environments, enabling our models to generalize better.
For Approach I, we built a single model based on YOLO-v4, implemented in one stage; that is, the same model is responsible for identifying and verifying the use of PPE. Hence, when receiving an image as input, the model classifies each detected region that contains a worker with one of the following classes: W, WV, WH, and WHV. In contrast, Approach II is multi-stage, with at least two distinct models for the identification and verification stages. Initially, a YOLO-v4-based model detects the regions in which workers are located. Then, a convolutional neural network receives the crop of each region and performs the verification step, which consists of classifying the image into the W, WV, WH, and WHV classes.
Both proposed approaches outperform the baseline results in terms of mAP. Our results show Approach I presenting the best mAP for detecting PPEs (80.19%). Although the YOLO-v4-AP2 model displays a superior detection mAP (88.18%), the classifiers' errors in the verification step decrease the final mAP. This effect is evident even for our best implementation, which employs an ensemble of the classifiers VGG16, ResNet50, and ResNet101, producing a final mAP of 72.87%. These results may indicate that superior results could be obtained by individually improving the classifiers or methods proposed in this work. The ensemble method achieved an increase of up to 2.45% compared to the mAP of the best single classifier (70.42%) in Approach II. Regarding processing time, Approach I proved to be more effective because of its one-stage implementation, which avoids bottlenecks between processing phases. Although slower, our results demonstrate that Approach II is still feasible for real-time use, even with an ensemble of classifiers.

Figure 10: Example of detections obtained from the YOLO-v4-AP1 model. The first two images are from the Crowd-sourced dataset. The third image is from the Web-scrapped dataset and shows an industrial setting.
From the implementation carried out for approach I, it is possible to build a monitoring system with a robust detection and verification component, since this approach proved to be more efficient not only in terms of mAP (80.19%) but also in processing time, running up to 11x faster (80 FPS) than approach II. Considering that, we believe the one-stage approach has high potential for the construction of an effective monitoring system that can contribute to the safety of workers, minimizing the number of accidents and the loss of lives.

Regarding the ID association component mentioned in Figure 1, we believe that tracking algorithms such as DeepSORT (Wojke et al., 2017) may present good results when employed along with the component explored in this work, because such algorithms work well on top of robust detection models to track custom objects in real time and assign unique identities to each object.
REFERENCES
Bo, Y., Huan, Q., Huan, X., Rong, Z., Hongbin, L., Kebin,
M., Weizhong, Z., and Lei, Z. (2019). Helmet de-
tection under the power construction scene based on
image analysis. In 2019 IEEE 7th International Conf.
on Computer Science and Network Technology (ICC-
SNT), pages 67–71. IEEE.
Fang, Q., Li, H., Luo, X., Ding, L., Luo, H., Rose, T. M.,
and An, W. (2018). Detecting non-hardhat-use by
a deep learning method from far-field surveillance
videos. Automation in Construction, 85:1–9.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Li, J., Liu, H., Wang, T., Jiang, M., Wang, S., Li, K., and
Zhao, X. (2017). Safety helmet wearing detection
based on image processing and machine learning. In
2017 9th International Conf. on Advanced Computa-
tional Intelligence (ICACI), pages 201–205. IEEE.
Nath, N. D., Behzadan, A. H., and Paal, S. G. (2020). Deep
learning for site safety: Real-time detection of per-
sonal protective equipment. Automation in Construc-
tion, 112:103085.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Wojke, N., Bewley, A., and Paulus, D. (2017). Simple on-
line and realtime tracking with a deep association met-
ric. In 2017 IEEE international conference on image
processing (ICIP), pages 3645–3649. IEEE.
Wu, J., Cai, N., Chen, W., Wang, H., and Wang, G. (2019).
Automatic detection of hardhats worn by construction
personnel: A deep learning approach and benchmark
dataset. Automation in Construction, 106:102894.
Zheng, X., Yao, J., and Xu, X. (2019). Violation monitoring
system for power construction site. In IOP Conf. Se-
ries: Earth and Environmental Science, volume 234,
page 012062. IOP Publishing.