An Efficient GPGPU based Implementation of Face Detection Algorithm
using Skin Color Pre-treatment
Imene Guerfi, Lobna Kriaa and Leila Azouz Saidane
CRISTAL Laboratory, RAMSIS Pole, National School for Computer Sciences (ENSI), University of Manouba, Tunisia
ORCID: Imene Guerfi, https://orcid.org/0000-0002-2886-1713; Lobna Kriaa, https://orcid.org/0000-0002-2112-7807
Keywords:
GPU Computing, Parallel Programming, Face Detection, AdaBoost, Haar Cascade, Skin Color Segmentation.
Abstract:
Modern and future security and daily-life applications incorporate face detection systems. These systems have strict time constraints that require high-performance computing. This can be achieved using General-Purpose Computing on Graphics Processing Units (GPGPU); however, some existing solutions meet the time requirements at the cost of detection quality. In this paper, we aim to reduce the detection time and increase the detection rate using the GPGPU. We developed a robust, optimized algorithm based on an efficient parallelization of the face detection algorithm, combined with a reduction of the search area using a combination of two color spaces for skin detection. Central Processing Unit (CPU) serial and parallel versions of the algorithm were developed for comparison. A database organized by image size and number of faces was built to evaluate our approach across all scenarios. The experimental results show that our proposed method achieves a 27.1x acceleration compared to the CPU implementation with a detection rate of 97.05%.
1 INTRODUCTION
In the last decade, the need to identify individuals has grown rapidly, especially in security and daily-life applications. Security experts have suggested biometrics as an efficient method to ensure security. The trend started with fingerprints and is now shifting towards facial recognition (Khan et al., 2017). A report published by Market Research Future predicts that the global facial recognition market will increase considerably during the forecast period (MRF, 2019) due to the growing need for surveillance and security as a result of increased criminal activity. Mobile facial identification and the rising popularity of smart homes are also fueling the growth of the facial identification market.
A face processing system comprises face detection, recognition, tracking, and rendering. The primary and most substantial step of any face processing system is face detection. It is used to detect the presence and the precise location of one or more faces in a digital image or video sequence. Recently, face detection has received significant attention in academia and industry, mainly due to its wide range of applications,
such as public security, video conferencing, entertainment, and human-machine interfaces. Several of these
applications are interactive and require reliable and
fast face processing. Paul Viola and Michael Jones introduced the first real-time face detection algorithm (Viola and Jones, 2001). To this day, the Viola-Jones algorithm is widely applied in digital cameras and photo organization software.
The research on face detection, for the most part,
has been focused on designing new algorithms or im-
proving the detection rate and decreasing the false
positive rate of the existing methods. Therefore, the
majority of the available works are software solutions
designed for general-purpose computational proces-
sors (GPP) (Bilaniuk et al., 2014). However, detecting faces in images is a computationally expensive task; hence, we need high-performance solutions that provide fast face detection at a reasonable cost. Recently, processor performance has evolved by increasing the number of computing units. In particular, the graphics processing unit (GPU) is used in collaboration with the CPU to accelerate computationally demanding general-purpose applications by offloading the compute-intensive portions of the application to the GPU, while the rest of the code still runs on the CPU. Heterogeneous computing that couples traditional processors with GPUs has become a promising solution in many systems for achieving higher performance. GPUs are generally used to accelerate
applications, as they integrate hundreds of cores designed to handle many simultaneous tasks.
Various works have studied how to optimally redesign the face detection algorithm for parallel GPU execution. Many promising results have been presented, and many works claim to offer the fastest GPU implementation. However, the efficiency of a face detection algorithm does not depend only on the detection speed, since the latter can easily be affected by changing parameters such as the number of classifiers, the scanning step, the scaling factor, the size of the images, and the number of faces in the image (Wei and Ming, 2011). To provide an efficient face detection algorithm, neither the detection rate nor the speed should be decreased, and false positives should not be increased.
In the present work, we present an efficient GPGPU-based implementation of the Viola-Jones face detection algorithm using the Compute Unified Device Architecture (CUDA) programming model. To accelerate detection and improve the detection rate, we propose a new optimized parallel method combined with a search area reduction using skin color pre-treatment. In addition, small images are upscaled to improve the detection rate. For comparison, single-threaded and multithreaded CPU versions of the code were developed. Standard, fixed parameters and a wide range of image sizes with varying numbers of faces were used for all implementations to make the evaluation more meaningful.
We evaluate our efficient face detection algorithm
using a created database composed of 700 color im-
ages taken from the web. The experimental results
indicate that our parallel face detector achieves 27.1x
and 18x speedup compared to CPU single-threaded
and multithreaded versions, respectively, while increasing the accuracy and reducing the false-positive rate.
The rest of this paper is organized as follows: The
next section gives an overview of the related work.
Section 3 describes the Viola and Jones algorithm.
Section 4 gives details of the proposed optimized al-
gorithm for face detection. Section 5 elaborates on the experimental results and discussion. Finally, Section 6 presents the conclusion.
2 RELATED WORKS
Face detection was one of the first computer vision applications; research in this field began in the mid-1960s. However, the majority of the early works did not propose efficient methods
(Zafeiriou et al., 2015). In 2001, a revolution in this
field was made when Viola and Jones introduced a reliable face detection technique with promising accuracy and high efficiency (Viola and Jones, 2001). The algorithm was based on AdaBoost training and Haar cascades. It achieved an average of about 15 frames per second (fps) on (320x288) images. It
made face detection practically feasible in real-world
applications. Until today, this algorithm has been
widely applied in digital cameras and photo organi-
zation software.
Since then, many relevant works have been pre-
sented for accelerating the Viola-Jones face detec-
tion algorithm. Earlier implementations of this algorithm were limited by the available computational resources. However, with the rise of low-cost, high-performance computing devices, many researchers have begun to exploit these capabilities.
(Sharma et al., 2009) introduced the first GPU re-
alization of a face detection algorithm using CUDA.
They reached a detection at 19 fps on a (1280 × 960)
video stream, which is a good improvement in detec-
tion time. However, the accuracy was only 81% with
16 false positives on the CMU test set. (Kong and Deng, 2010) proposed a GPU-accelerated OpenCV implementation that achieved between 49.08 ms and 196.73 ms (20.4-5.1 fps) on images from (340x240) to (1280x1024). (Hefenbrock et al., 2010) presented
a multi-GPU implementation. They used a desktop
server containing four Tesla GPUs for the implemen-
tation and achieved 15.2 fps. However, the integral
image computation was not parallelized. In their other
works, (Nguyen et al., 2013) they used 5 Fermi GPUs
and improved in efficiency by using a dynamic warp
scheduling approach to eliminate thread divergence.
They used the technique of thread pool mechanism to
significantly alleviate the cost of creating, switching,
and terminating threads. They reported realized 95.6
fps on (640x480) images. The proposed approach
(Devrari and Kumar, 2011) includes enhanced Haar-
like features and uses SVM (Support Vector Machine)
for training and classification. They achieved 3.3
fps on (2592x1900) images. The implementation of (Wei and Ming, 2011) reached 12 fps on (640x480) images; however, the method used leads to inadequate use of resources. A Haar-based face detection system for (1920x1080) video on a GTX470 was proposed by (Oro et al., 2011) (Oro et al., 2012) and achieved a performance of 35 fps. (Tek and Gökmen, 2012) used 3 GPUs for their implementation and achieved 99 fps with a good detection rate, even though the classifier was small.
Other relevant implementations of the algorithm
are found in the work of (Jeong et al., 2012), (Li et al.,
2012), and (Sun et al., 2013), whose experiments show improved detection speed for high-resolution images. Other works (Meng et al., 2014) (Bhutekar and Manjaramkar, 2014) focused on improving detection for small images.
(Chouchene et al., 2015) proposed parallel imple-
mentation of face detection on (NVidia310M) GPU
that could achieve 24 fps for small images (32x32)
and only 11 fps for bigger images (1024x1024). (Wai et al., 2015) presented an OpenCV-accelerated implementation using the latest GPU at that time (Tesla K40) and achieved 37.91 fps on (640x480) images.
ages. (Fayez et al., 2016) proposed an image scan-
ning framework using GPGPU in which they have
implemented the Viola-Jones algorithm. They achieved 37 fps for (1920x1080) images. (Jain and Patel, 2016) introduced an implementation that speeds up the algorithm by 19.75x over a CPU implementation. (Mutneja and Singh, 2018) presented face detection using a combination of motion and skin color segmentation; the test samples were low-resolution (600x800) videos and they obtained 3.16 fps. In their later work, (Mutneja and Singh, 2019) analyzed the algorithm in much more depth and achieved 25 fps for (480x640) images. (Patidar et al., 2020) presented an optimized parallel face detection system using CUDA on a GPU and achieved 1.28 fps on the FDDB image set. Although interesting results were recorded, most of these works lack information about the classifier used, the accuracy, the false positives, and the test data set.
3 BACKGROUND
3.1 Viola-Jones Algorithm
The Viola-Jones face detection algorithm is one of the best face detection algorithms developed over time. The algorithm is an appearance-based model, and it can be divided into two phases. The first is a training phase, where a cascade classifier is generated from a set of positive and negative samples. This classifier is a concatenation of several weak classifiers divided into stages that become increasingly complex, with the AdaBoost algorithm used to select the most informative facial features. The second is the detection phase, where the algorithm detects faces in a given image using the pre-trained cascade classifier. A small window scans the image, and at each position the cascade classifier is applied. To reduce computation, a window that fails a stage of the cascade classifier is rejected immediately; otherwise, it passes to the next stage, and if it passes all stages, it contains a face. This algorithm consists of four parts:
3.1.1 Haar-Like Features
Human faces share some similar properties. These
properties are mapped mathematically to the Haar
features (Figure 1). A Haar feature of a rectangle h(r)
is a scalar calculated by summing up the pixels in the
white region p(w) and subtracting those in the dark
region p(b) (equation 1).
h(r) = p(w) − p(b)    (1)
These features are used in the training step to form the classifiers and in the detection step. When an image is scanned to detect a face, at each step the features in the current window are compared to the trained ones.
Figure 1: Four types of Haar features with examples.
3.1.2 Integral Image
The integral image was proposed to overcome the
huge amount of calculation caused by identifying
Haar features. The integral value I at pixel (x, y) is
the sum of all the pixels above it and to its left (equa-
tion 2).
I(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')    (2)
This allows very fast feature evaluation, since calculating the sum of the pixels inside a rectangle requires only the integral image values at its four corners (equation 3), so features can be computed efficiently in constant time.

h(x, y, x', y') = I(x', y') − I(x, y') − I(x', y) + I(x, y)    (3)
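To make the constant-time evaluation concrete, the following minimal CUDA C++ sketch (not taken from the paper's code) shows how a rectangle sum is read from a row-major integral image using only the four corner values of equation 3; the function name and memory layout are illustrative assumptions.

__host__ __device__ inline long long rectSum(const long long *integral, int width,
                                             int x, int y, int x2, int y2)
{
    // Sum of the pixels in the rectangle spanning (x, y) exclusive to
    // (x2, y2) inclusive, following equation 3:
    // I(x2, y2) - I(x, y2) - I(x2, y) + I(x, y)
    return integral[y2 * width + x2] - integral[y2 * width + x]
         - integral[y * width + x2] + integral[y * width + x];
}
// A Haar feature value (equation 1) is then a weighted difference of two or
// three such rectangle sums, e.g. h = rectSum(white) - rectSum(black).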
3.1.3 AdaBoost
Adaptive Boosting (AdaBoost) is a machine learning algorithm. It is applied in the training phase to select the features that best describe a face. The algorithm combines these features into weak classifiers and then groups them to create the strong classifiers that form the cascade classifier.
3.1.4 Cascade Classifier
The cascade classifier is generated during the training phase and used in the detection phase. During detection, a sub-window scans the image and, at each step, tries to determine whether the current window could contain a face. To do so, each window passes through the pre-trained cascade classifier. An efficient cascade classifier is constructed to reject as many negative windows as possible at the early stages, reducing the computation time (Figure 2).
Figure 2: A cascade classifier.
Because the focus of this paper is the parallel acceleration of the face detection process, we do not train the classifier ourselves; instead, we use the one provided by the OpenCV open-source software (OpenCV). The classifier has 2913 features divided into 25 stages (Table 1), with a minimum window size of 25x25.
Table 1: The cascade classifier used (number of features per stage).
stages 0 1 2 3 4 5 6 7 8 9 10 11 12
features 9 16 27 32 52 53 62 72 83 91 99 115 127
stages 13 14 15 16 17 18 19 20 21 22 23 24 total
features 135 136 137 159 155 169 196 197 181 199 211 200 2913
4 PROPOSED ALGORITHM
This section describes our optimized face detection algorithm, based on an efficient parallel implementation using CUDA and the integration of the RGB and YCbCr color spaces to reduce the search area. Before that, the single-threaded version is presented to clarify the algorithm and for comparison purposes.
4.1 Sequential Algorithm
The face detection implementation comprises four main steps: 1) resizing the original image into a pyramid of images at different scales, 2) calculating the integral images for fast feature evaluation, 3) computing the image coordinates for each Haar feature, and 4) detecting faces using a cascade of classifiers. Figure 3 shows the overall data flow. The process
starts with reading the input image and loading the
cascade classifier. After that, the image is converted
to grayscale and the new height and width are cal-
culated to resize the image. The algorithm resizes the image first with a scale factor of 1 and then repeatedly with a predefined scale factor (1.2 in our case), until the image is equal to or smaller than the 25x25 detection window (the size obtained from the classifier used). Each time the image is resized, we transform it into an integral image, and we also calculate the integral image of the squared pixel intensities to accelerate the subsequent calculations. The next step is to compute the image coordinates for each Haar feature; these are the relative positions of each Haar rectangle boundary in a
25x25 window. For a shifted position of the detection
window, the shifted offset is added to get the new co-
ordinates. After that, the window passes through the
stages of the cascade classifier. At each stage, if the
integral sum is less than the threshold, this window
is rejected. Otherwise, the window passes to the next
stage. If it completes all the stages then it is stored as
a face. Then the window is shifted by a step of one
pixel and passed through the cascade classifier. When
the window scans all the image, the image is resized
again and processed through the same steps until the
condition is satisfied. Finally, the stored faces are in-
dicated by a circle around them in the original image.
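As a rough illustration of this flow, the following host-side CUDA C++ sketch outlines the pyramid loop described above; the GrayImage, Cascade, and FaceRect types and the helper functions are hypothetical placeholders for the steps in the text, not the authors' code.

#include <vector>

// Hypothetical types and helpers standing in for the steps described above.
struct GrayImage { int width; int height; std::vector<unsigned char> pixels; };
struct FaceRect  { int x; int y; int size; };
struct Cascade;  // the pre-trained OpenCV cascade, loaded elsewhere

GrayImage resizeNearest(const GrayImage &src, float scale);
void computeIntegrals(const GrayImage &img, std::vector<long long> &sum,
                      std::vector<long long> &sqSum);
void scanAllWindows(const std::vector<long long> &sum,
                    const std::vector<long long> &sqSum, int width, int height,
                    const Cascade &cascade, float scale,
                    std::vector<FaceRect> &faces);

void detectFaces(const GrayImage &gray, const Cascade &cascade,
                 std::vector<FaceRect> &faces)
{
    const int win = 25;            // detection window size of the classifier used
    const float scaleStep = 1.2f;  // resizing factor used in this work
    for (float scale = 1.0f;
         gray.width / scale >= win && gray.height / scale >= win;
         scale *= scaleStep) {
        GrayImage level = resizeNearest(gray, scale);   // one pyramid level
        std::vector<long long> sum, sqSum;
        computeIntegrals(level, sum, sqSum);            // integral images
        scanAllWindows(sum, sqSum, level.width, level.height,
                       cascade, scale, faces);          // slide the 25x25 window
    }
}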
4.2 GPGPU Implementation
The proposed research work detects multiple human faces in images of different sizes using Haar features and a cascade classifier, employing skin color pre-treatment for search space reduction and GPU acceleration for faster processing.
In this section, we introduce some of the main
ideas for parallelizing face detection exploiting both
CPU and GPU. First of all, when the image is loaded, it is directly transformed into grayscale and saved to the GPU global memory. The result is not transferred back to the CPU, to avoid communication overhead. The cascade classifier is also saved to the GPU global memory as multiple vectors: for example, one for the number of features in each stage, one for all the rectangle coordinates, and another one for the thresholds. This part of the algorithm is handled by the CPU. The parts of the algorithm that can be parallelized are image resizing, integral image computation, the computation of the image coordinates for each Haar feature, and scanning the image and processing each window with the cascade classifier. In what follows, we explain each one separately.
Figure 3: The face detection algorithm.
4.2.1 Image Resizing
Since the size of the faces in the image can vary while the detection window is fixed (25x25), the image must be resized repeatedly until it reaches the detection window size. For that, we use the nearest-neighbor algorithm to create a pyramid of images at different scales (Figure 4). Nearest neighbor is the simplest and fastest image scaling technique (Jiang and Wang, 2015). It is very useful when speed is the primary concern.
Figure 4: Image pyramid.
In this step, the image width and height are downscaled by a factor of 1.2. To do so, we create a new image and compute the horizontal and vertical ratios between the original image and the new one; we then drop redundant pixels based on these ratios. Analyzing the algorithm shows that each pixel is computed independently, so we can map each output pixel position to a single thread. The total number of threads is therefore equal to the new image size (width x height).
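A minimal CUDA kernel for this one-thread-per-pixel mapping could look as follows; the kernel name, launch configuration, and ratio-based index computation are illustrative assumptions rather than the authors' exact implementation.

__global__ void resizeNearestKernel(const unsigned char *src, int srcW, int srcH,
                                    unsigned char *dst, int dstW, int dstH,
                                    float xRatio, float yRatio)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    if (x >= dstW || y >= dstH) return;

    int srcX = min((int)(x * xRatio), srcW - 1);     // nearest source pixel
    int srcY = min((int)(y * yRatio), srcH - 1);
    dst[y * dstW + x] = src[srcY * srcW + srcX];
}

// Example launch: one thread per pixel of the new (dstW x dstH) image.
//   dim3 block(16, 16);
//   dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
//   resizeNearestKernel<<<grid, block>>>(dSrc, srcW, srcH, dDst, dstW, dstH,
//                                        (float)srcW / dstW, (float)srcH / dstH);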
4.2.2 Integral Image
After image resizing, we have to compute the inte-
gral image, which is computationally expensive, es-
pecially for large images, because the value of each
pixel is the sum of all the pixels above and to the left
of it. To parallelize this step we have to remove the
dependency between data. The main idea is to split
the algorithm into 2 parts (Figure 5), the first is the
sum of each row independently of others and the sec-
ond is the sum of the columns. For that, we need as
much thread as the number of rows, and we have to
create a vector to hold the intermediate image. Here,
each thread processes each row separately, it adds the
previous pixel value, to the current pixel value along
that row.The vertical sum computation is done on the
output of the horizontal sum computation. It is similar
to the horizontal sum; however, the threads compute
the sum over the columns. The results are saved in a
new image.
Figure 5: Image integral computation.
Similarly, we compute the integral image of the squared pixel values, which is needed to calculate the variance of the pixels for the Haar rectangles in the cascade classifier stage. We compute the integral image and the integral image of the squared pixels at the same time, since they use the same data.
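The two-pass scheme described above can be sketched with the following CUDA kernels, assuming a row-major layout; an analogous kernel that squares each pixel before accumulating would produce the squared integral image. Names are illustrative, not the authors' code.

__global__ void rowPrefixSum(const unsigned char *img, long long *rowSum,
                             int width, int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (y >= height) return;
    long long acc = 0;
    for (int x = 0; x < width; ++x) {
        acc += img[y * width + x];                   // running sum along the row
        rowSum[y * width + x] = acc;
    }
}

__global__ void colPrefixSum(long long *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (x >= width) return;
    for (int y = 1; y < height; ++y)
        data[y * width + x] += data[(y - 1) * width + x];   // accumulate downwards
}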
4.2.3 Preparing the Image Coordinates
Next, we compute the image coordinates for each Haar feature; these are the relative positions of each Haar rectangle boundary in a (25x25) window. This calculation is a preparation for the next step, when we shift the detection window: we will then just need to add the shift offset to get the new coordinates. The cascade classifier contains 2913 features, each of which has 3 rectangles. We process each feature
separately, which means that each thread processes 3 rectangles.
4.2.4 Scanning the Image
In order to carry out the detection, in the last step we place a (25x25) detection window over the output of the previous step, process it through the cascade classifier, then slide the window to the next position, until the end of the image. The cascade classifier processing cannot be split because, at each stage, if the window is rejected there is no need to process the rest, so this part must be done as a whole. However, each window computation is independent of the others, so each scan window is processed independently. Hence, one thread is assigned to process each scan window.
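A simplified sketch of this one-thread-per-window mapping is shown below; evaluateStage() is a placeholder for the per-stage Haar sums and threshold test, and the result layout is an assumption of this sketch, not the authors' exact code.

__device__ bool evaluateStage(const long long *sum, const long long *sqSum,
                              int width, int x, int y, int stage)
{
    // Placeholder: the real detector sums the stage's Haar rectangles through
    // the integral images, normalizes by the window variance (using sqSum),
    // and compares the result against the stage threshold.
    return true;
}

__global__ void scanWindowsKernel(const long long *sum, const long long *sqSum,
                                  int width, int height, int numStages,
                                  int *results)   // 1 = face candidate, 0 = rejected
{
    const int win = 25;
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // window top-left column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // window top-left row
    if (x + win >= width || y + win >= height) return;

    bool passed = true;
    for (int s = 0; s < numStages && passed; ++s)
        passed = evaluateStage(sum, sqSum, width, x, y, s);   // reject early

    results[y * width + x] = passed ? 1 : 0;
}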
This is the simple parallel implementation of the
algorithm. In the following, we present the optimizations added to accelerate the process.
4.3 Optimization
The objective of parallel computing is to reduce the
calculation time of a process. The GPU architectures
are increasingly used for that purpose. To exploit the
GPU performance it is essential to know the proper-
ties of the hardware architecture. The efficiency of an
algorithm implemented on a GPU is related to how
the available resources are used. In this subsection,
we will introduce the optimizations and how to ex-
ploit the resources better. In what follows, the op-
timization in each part of the algorithm is explained
separately.
4.3.1 Image Resizing and Sum of Rows
We merge the computation of image resizing and the first part of the integral image (the row sum) into the same kernel, to avoid storing the resized image in global memory between kernels. In this case, each thread processes one row separately: when it computes a pixel value with the nearest-neighbor method, it adds it to the previous pixel values and saves the result to an intermediate vector. To facilitate the next calculations, we store the values of each computed row in the column with the same index, i.e., the intermediate image is transposed; in the next step, the column sum therefore becomes a sum over the row with the same index as the column, and we transpose the result again after the column sum. This optimization helps us avoid global memory accesses, which can cost 200-800 clock cycles on NVIDIA GPUs. Memory optimization is therefore essential in GPU-based parallel face detection, and it improves processing performance.
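A sketch of such a fused kernel is given below, assuming one thread per output row and a transposed store; the kernel name and data layout are illustrative, not the paper's exact code.

__global__ void resizeAndRowSumTransposed(const unsigned char *src,
                                          int srcW, int srcH,
                                          long long *outT,     // stored transposed: [x * dstH + y]
                                          int dstW, int dstH,
                                          float xRatio, float yRatio)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output row
    if (y >= dstH) return;

    int srcY = min((int)(y * yRatio), srcH - 1);
    long long acc = 0;
    for (int x = 0; x < dstW; ++x) {
        int srcX = min((int)(x * xRatio), srcW - 1);
        acc += src[srcY * srcW + srcX];              // resize + running row sum
        outT[x * dstH + y] = acc;                    // transposed store
    }
}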
4.3.2 Sum of Columns
After the end of the first kernel, the result stays in the GPU global memory. We then invoke two kernels: the first computes the sum of the columns, and the second computes the sum of the columns for the squared pixel values. Since we do not need the second kernel until the image scanning, we execute it asynchronously on a different stream and leave the first kernel to the default stream. We let the second kernel execute concurrently with the kernel that prepares the image coordinates for each Haar feature. Finally, we synchronize all the kernels before the image scanning.
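The stream arrangement described above might be expressed as in the sketch below; the kernel names refer to the earlier sketches or are illustrative, and the host function is an assumption of this sketch.

__global__ void colPrefixSum(long long *data, int width, int height);     // as sketched earlier
__global__ void prepareFeatureCoords(const int *rects, int numFeatures,
                                     int *coords);                        // illustrative

void launchColumnSums(long long *dSum, long long *dSqSum, int width, int height,
                      const int *dRects, int numFeatures, int *dCoords)
{
    cudaStream_t sqStream;
    cudaStreamCreate(&sqStream);
    int threads = 256;
    int blocks = (width + threads - 1) / threads;

    // Default stream: column sums of the plain integral image, then the kernel
    // that prepares the Haar feature coordinates.
    colPrefixSum<<<blocks, threads>>>(dSum, width, height);
    prepareFeatureCoords<<<(numFeatures + threads - 1) / threads, threads>>>(
        dRects, numFeatures, dCoords);

    // Side stream: column sums of the squared values, which are not needed
    // until the scanning step, so they can overlap with the work above.
    colPrefixSum<<<blocks, threads, 0, sqStream>>>(dSqSum, width, height);

    // Note: with the legacy default stream the launches above serialize with
    // other streams; compiling with --default-stream per-thread (or using a
    // second non-default stream) allows the intended overlap.
    cudaDeviceSynchronize();      // join everything before the scanning step
    cudaStreamDestroy(sqStream);
}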
4.3.3 Scanning the Image
As mentioned before (Section 4.2.4), in this step we process each window with one thread through all the features of the cascade classifier. These features are common to all the scan windows; hence, we load them into shared memory so that they are visible to all the threads of the same block, and in this way we avoid multiple global memory accesses. Shared memory has lower latency and higher bandwidth than global memory, which is why we use it to improve performance. The features require nearly 69 KB of memory; however, the GPU used in our case has only 48 KB of shared memory. For this reason, we decided to load only the most frequently used features, so as to reduce memory transactions as much as possible. We note that newer NVIDIA architectures have more shared memory, which would make it possible to keep all the features in shared memory.
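The cooperative load into shared memory could be sketched as follows; the flat float layout of the classifier data and the amount loaded are assumptions made for illustration only.

#define SHARED_FEATURE_FLOATS (11 * 1024)   // ~44 KB, under the 48 KB limit of the GPU used

__global__ void scanWithSharedFeatures(const float *gFeatures, int numFloats,
                                       const long long *sum, const long long *sqSum,
                                       int width, int height, int *results)
{
    __shared__ float sFeatures[SHARED_FEATURE_FLOATS];

    // Cooperative copy: every thread of the block loads a strided subset of
    // the most frequently used classifier data (the early stages).
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int blockThreads = blockDim.x * blockDim.y;
    for (int i = tid; i < SHARED_FEATURE_FLOATS && i < numFloats; i += blockThreads)
        sFeatures[i] = gFeatures[i];
    __syncthreads();   // features now visible to every thread of the block

    // ... window evaluation as in the plain kernel, reading early-stage
    // features from sFeatures[] and the remaining ones from gFeatures[] ...
}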
4.3.4 Overlap Data Transfers
The last optimization is to overlap data transfers when we copy the detected-face vector to CPU memory. For that, we divide the data into multiple chunks, execute each chunk on a different stream, and copy the data of each stream separately.
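A possible sketch of this chunked, stream-based copy is shown below; it assumes the host buffer was allocated with cudaMallocHost (pinned memory) so that the asynchronous copies can overlap with the remaining GPU work, and the chunk count and function name are illustrative.

#include <vector>
#include <cuda_runtime.h>

void copyResultsOverlapped(int *hostResults,          // pinned via cudaMallocHost
                           const int *devResults, size_t totalInts, int numChunks)
{
    size_t chunk = (totalInts + numChunks - 1) / numChunks;
    std::vector<cudaStream_t> streams(numChunks);
    for (int i = 0; i < numChunks; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < numChunks; ++i) {
        size_t offset = (size_t)i * chunk;
        if (offset >= totalInts) break;
        size_t count = (offset + chunk > totalInts) ? (totalInts - offset) : chunk;
        cudaMemcpyAsync(hostResults + offset, devResults + offset,
                        count * sizeof(int), cudaMemcpyDeviceToHost, streams[i]);
    }
    for (int i = 0; i < numChunks; ++i) {
        cudaStreamSynchronize(streams[i]);   // wait for this chunk to arrive
        cudaStreamDestroy(streams[i]);
    }
}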
The latter approach was further improved by ap-
plying skin color filtering to reduce the search space.
This skin color filtering will be discussed in the next
section.
4.4 Skin Color Segmentation
Skin color segmentation can be accomplished by explicitly modeling the skin distribution in certain color spaces using parametric decision rules (bin Abdul Rahman et al., 2007). A literature survey shows
that different color spaces are applied for skin color
analysis. In most cases, the default color space is the
well-known RGB, which is composed of three pri-
mary colors, red, green and blue. The variations in
skin color pixels due to illumination levels are mini-
mized when utilizing the RGB color-space. However,
due to intermixed chrominance (color information)
and luminance (brightness measurement), it is least
recommended for color tone analysis (Shifa et al.,
2020). Another widely used color space is YCbCr. In
this space, the intensity of light is represented by lu-
minance (Y) and chrominance is found by calculat-
ing the blue (Cb) and red (Cr) differences relative to
luminance. The experimental result of (Shaik et al.,
2015) shows that YCbCr color space can be applied
for complex color images with uneven illumination.
In real scenarios, illumination variation remains a challenging problem. For this reason, we utilize the additional luminance and chrominance information of the image on top of the standard RGB properties to improve skin pixel segmentation. We use the RGB boundary rules introduced by (bin Abdul Rahman et al., 2007) (equation 4) and the suitable YCbCr ranges introduced by (Saikia et al., 2012) (equation 5). Equation 6 presents the combined rule used to detect skin color; the other pixels are colored black. The results are shown in Figure 6, where (a) is the original image, (b) is the skin color segmented image, and (c) is the grayscale of (b).
The skin colour at uniform daylight illumination rule is defined in RGB as:

(R > 95) AND (G > 40) AND (B > 20) AND
(max{R, G, B} − min{R, G, B} > 15) AND
(|R − G| > 15) AND (R > G) AND (R > B)    (4a)

while the skin colour under flashlight or daylight lateral illumination rule in RGB is given by:

(R > 220) AND (G > 210) AND (B > 170) AND
(|R − G| ≤ 15) AND (R > B) AND (G > B)    (4b)

(4a) OR (4b)    (4)

77 ≤ Cb ≤ 127 AND 133 ≤ Cr ≤ 173    (5)

(4) OR (5)    (6)
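For illustration, the combined rule of equations 4-6 can be written as a small device helper like the one below; the helper name and the BT.601 full-range RGB-to-YCbCr conversion are assumptions of this sketch, since the paper does not specify the exact conversion used.

__device__ bool isSkinPixel(unsigned char r, unsigned char g, unsigned char b)
{
    int mx = max((int)r, max((int)g, (int)b));
    int mn = min((int)r, min((int)g, (int)b));

    // Equation 4a: uniform daylight illumination.
    bool rule4a = (r > 95) && (g > 40) && (b > 20) && (mx - mn > 15) &&
                  (abs(r - g) > 15) && (r > g) && (r > b);

    // Equation 4b: flashlight or lateral daylight illumination.
    bool rule4b = (r > 220) && (g > 210) && (b > 170) &&
                  (abs(r - g) <= 15) && (r > b) && (g > b);

    // Equation 5: Cb/Cr ranges (BT.601 full-range conversion assumed here).
    float cb = 128.0f - 0.168736f * r - 0.331264f * g + 0.5f * b;
    float cr = 128.0f + 0.5f * r - 0.418688f * g - 0.081312f * b;
    bool rule5 = (cb >= 77.0f) && (cb <= 127.0f) && (cr >= 133.0f) && (cr <= 173.0f);

    // Equation 6: the pixel is kept as skin if rule (4) or rule (5) holds;
    // all other pixels are colored black before detection.
    return (rule4a || rule4b) || rule5;
}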
In (Vansh et al., 2020), an improved face detec-
tion approach was proposed using YCbCr color space
and the Viola-Jones algorithm. They state that the detection time was slightly longer than that of the plain Viola-Jones algorithm; however, the detection rate was increased. In this paper, we focus on the parallelization of the algorithm to improve the detection speed, and we add the skin color pre-treatment to increase the detection rate and to minimize the treated area, further improving the speed.

Figure 6: Example of skin color segmentation: (a) the original image, (b) the skin color segmented image, and (c) the grayscale of (b).

In the proposed method, we add a
treatment, after the integral image calculation, that selects only the pixels with skin color to be processed by the cascade classifier.
For smaller images, since the face size may be less
than (25x25), we used the nearest neighbor algorithm
to scale up the image size before starting the previ-
ously described algorithm. We found that adding the
scaling up can improve the detection rate by a factor of 1.27x on small images.
An overview of our optimized CUDA implemen-
tation is presented in Figure 7.
5 EXPERIMENTATION AND
DISCUSSION
5.1 Experimental Setup
The proposed method was developed and tested on an Intel Core i5 CPU (2.30 GHz) running Windows 10 (64-bit) with an NVIDIA GeForce 920MX graphics processing unit. The development and testing were done in Microsoft Visual Studio 14.0.25431.01. The CUDA files were compiled with the CUDA compiler of Release 10.1 (10.1.168), with architecture support corresponding to compute capability 5.0.
The detection time is related to five factors: the
size of the image, the number of features in the clas-
sifiers, the step of scanning, the resizing scale and the
hardware platform (Wei and Ming, 2011). In addi-
tion, the number of faces in the image affects the de-
tection time. In this work, we keep the classifier, the scanning step, the scaling factor, and the hardware platform unchanged, and compute the acceleration of our system. However, we were not able to find any publicly available database containing color images of different sizes with varying numbers of faces. So, in order to evaluate the performance of the final face detection algorithm, the considered database is instead a collection of 700 images taken from the web. The collected database contains frontal color face images of 10 different sizes (100x100, 320x240, 480x240, 512x512, 640x480, 720x480, 600x800, 1280x720, 1024x1024, 1024x1280), grouped by the number of faces into 7 categories (1, 2, 3, 4, 5, 6-9, and 10 faces), with 10 images for each size and each number of faces.

Figure 7: Our optimized CUDA-based face detection algorithm.
5.2 Results and Discussion
Our experiments are based on a comparison of our proposed solution with several implementations of Viola-Jones: the sequential one, the multithreaded one, and the CUDA GPU-based one. The results show that our solution decreases the detection time, increases the detection rate, and eliminates false positives. Figure 8 shows the graphical results of our face detection (all result details are given in Table 2). The previously presented database (Section 5.1) was used for testing the detector. Each subfigure presents the execution time according to the number of faces for the different algorithms (sequential, multithreaded, GPU, and our optimized algorithm). At the top left of the subfigures, the execution times for the small image sizes are zoomed in to make them clearer.
According to the figures, we distinguish two different cases: the first for small images and the second for large images.
Figure 9 presents the detection time for the smaller images. The multithreaded version gives better results than the CPU single-threaded and GPU versions.
Figure 8: The detection time for all image sizes.
We notice that the GPU detection time increases for small images. However, our optimized algorithm offers nearly the same execution time as the CPU versions.
Figure 9: Detection time for small images (100x100).
For larger images, we notice that the GPU detection time is better than that of the CPU, and the gap between them grows as the image size increases. From Figure 8, it is obvious that our algorithm outperforms the CPU single-threaded and multithreaded versions and even the plain GPU version.
The performance of the proposed CUDA-optimized algorithm, with and without skin color filtering, has been compared with that of the sequential and multithreaded implementations. Table 2 summarizes the detection time in ms for all the previously described implementations. From these results, it can be observed that the optimized GPU version is better than the CPU versions in most cases.
In the first experiment, we measured the processing time of the sequential and multithreaded versions. We found that the multithreaded version can accelerate
Table 2: The detection time (in ms) for the different implementations on the created dataset.
size 100 x 100 320 x 240 480 x 240 512 x 512 640 x 480 720 x 480 600 x 800 1280 x 720 1024 x 1024 1024 x 1280
1 face
CPU 16.20 140.93 203.55 523.77 581.76 620.68 899.05 1734.41 2119.42 2802.51
Multithread 9.47 109.08 188.83 383.07 431.76 435.53 644.81 1236.44 1434.50 1860.77
GPU 84.93 119.24 128.43 222.1 203.77 191.95 235.33 319.97 374.88 480.83
GPU optimized 14.41 48.54 49.21 111.83 79.74 98.73 140.89 167.44 143.61 103.41
2 faces
CPU 16.70 147.69 219.18 542.80 623.4 663.11 998.03 1828.85 2140.54 2845.46
Multithread 10.4 112.58 191.63 393.67 444.73 478.29 757.68 1324.35 1454.53 1940.97
GPU 81.84 149.27 163.65 221.3 239.08 217.21 304.6 414.76 456.39 528.12
GPU optimized 15.4 77.56 71.76 87.84 102.53 132.64 117.92 183.97 178.96 193.94
3 faces
CPU 16.65 148.94 224.54 586.58 625.12 744.6 1010 1754.41 2206.9 2863.15
Multithread 10.41 114.32 210.80 428.55 446.43 495.34 710.22 1417.58 1508.11 2176.85
GPU 61.54 167.57 212.71 251.96 259.69 283.91 301.24 400.17 457.7 610.65
GPU optimized 11.65 53.19 125.94 143.14 118.09 141.26 227.39 238.65 126.52 171.61
4 faces
CPU 17.35 158.79 239.24 594.14 641.03 747.02 1022.28 1851.67 2255.01 2930.95
Multithread 10.77 132.98 222.96 435.97 483.91 512.15 759.30 1440.53 1606.86 2242.37
GPU 69.87 162.35 223.72 276.56 293.9 301.17 314.22 438.56 492.47 621.82
GPU optimized 17.93 46.18 78.85 124.47 221.31 140.52 142.46 165.28 259 249.18
5 faces
CPU 17.99 178.03 254.61 595.34 646.07 762.86 1026.78 1901.63 2396.42 2987.78
Multithread 11.01 137.97 239 438.07 510.40 562.95 789.87 1462.99 1657.34 2280.72
GPU 67.27 185.1 196.11 267.15 295.64 299.49 333.74 491 581.57 547.61
GPU optimized 31.33 59.06 89.98 197.4 55.99 219.45 184.51 233.79 316.75 354.17
6-9 faces
CPU 17.24 179.41 284.42 601.1 671.13 778.42 1092.92 2048.3 2426.02 3092.29
Multithread 10.45 142.27 240.91 455.25 511.16 563.37 792.50 1494.26 1719.13 2304.84
GPU 60.47 198.03 221.98 268.34 281.88 325.62 363.74 489.09 486.2 589.29
GPU optimized 28.41 53.79 80.65 125.9 194.58 171.82 114.02 256.24 237.87 316.64
10 faces
CPU 19.97 199.6 295.38 661.58 793.43 913.5 1195.36 2306.87 2637.04 3259.85
Multithread 11.04 136.12 244.78 468.06 530.52 609.14 796.13 1577.59 1697.14 2223.2
GPU 51.48 188.17 235.01 296.11 351.4 366.53 361.85 583.5 574.23 650.47
GPU optimized 22 112.13 163.53 235.32 214.39 227.85 262.03 320.03 329.92 293.54
the execution by a factor of up to 1.81x. However, this improvement is not enough for realistic scenarios. We notice that the detection time increases with the image size and the number of faces. After that, a CUDA-based GPU version was implemented. For smaller images, the detection time increased; despite this, for bigger sizes the simple GPU implementation is faster than the CPU implementations by a factor of up to 5.83x. Since the GPU version is still not fast enough and suffers on small images, an optimized version was implemented. With it, the detection time for smaller images becomes very close to that of the CPU versions, and the overall improvement reaches 27.1x compared to the CPU version.
Although it is important for a face detection algorithm to be fast, speed is not the only factor for measuring its efficiency; the detection rate and false positives are also important. For the GPU version, we achieved a detection rate of 91.35% with 15 false positives. In the optimized GPU version with skin color segmentation, since only skin pixels are treated, the false positives are reduced to zero and the detection rate is 97.05%. With the use of stable parameters, our proposed implementation clearly outperforms those in the literature. However, there is still room for improvement, with a better cascade classifier and the larger amount of shared memory available on newer NVIDIA architectures.
6 CONCLUSIONS
Face detection is a classical computationally-
intensive problem. In this paper, we address this problem on an NVIDIA GPGPU using the CUDA parallel computing platform. We propose a real-time, optimized, and robust GPGPU implementation of the face detection algorithm that uses the RGB and YCbCr color spaces to select the skin color pixels on which detection is applied, in order to reduce the search area. To evaluate our method, we created a database of 700 color images with different sizes and numbers of faces. We achieved an average acceleration of 27.1x over the CPU implementation. In order to increase the detection rate for smaller images, we upscaled them before processing. This scaling up increases the detection rate by a factor of 1.27x. With the use of skin color segmentation and the upscaling of small images, we increased
the detection rate to 97.05% and eliminated the false positives. We believe that with a more robust classifier and further optimizations, we can still achieve a better speedup. In future work, we will focus on the creation of a robust classifier and on optimizing the implementation to overcome the execution overhead caused by the number of faces. In addition, our challenge is to demonstrate the efficiency of our approach regardless of the hardware platform used.
REFERENCES
Bhutekar, S. J. and Manjaramkar, A. K. (2014). Parallel
face detection and recognition on gpu. International
Journal of Computer Science and Information Tech-
nologies, 5(2):2013–2018.
Bilaniuk, O., Fazl-Ersi, E., Laganiere, R., Xu, C., Laroche,
D., and Moulder, C. (2014). Fast lbp face detection
on low-power simd architectures. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 616–622.
bin Abdul Rahman, N. A., Wei, K. C., and See, J. (2007).
Rgb-h-cbcr skin colour model for human face detec-
tion. Faculty of Information Technology, Multimedia
University, 4.
Chouchene, M., Sayadi, F. E., Bahri, H., Dubois, J., Mit-
eran, J., and Atri, M. (2015). Optimized parallel im-
plementation of face detection based on gpu compo-
nent. Microprocessors and Microsystems, 39(6):393–
404.
Devrari, K. and Kumar, K. V. (2011). Fast face detec-
tion using graphics processor. International Journal
of Computer Science and Information Technologies,
2(3):1082–1086.
Fayez, M., Faheem, H., Katib, I., and Aljohani, N. R.
(2016). Real-time image scanning framework using
gpgpu-face detection case study. In Proceedings of the
International Conference on Image Processing, Com-
puter Vision, and Pattern Recognition (IPCV), page
147. The Steering Committee of The World Congress
in Computer Science, Computer . . . .
Hefenbrock, D., Oberg, J., Thanh, N. T. N., Kastner, R.,
and Baden, S. B. (2010). Accelerating viola-jones
face detection to fpga-level using gpus. In 2010
18th IEEE Annual International Symposium on Field-
Programmable Custom Computing Machines, pages
11–18. IEEE.
Jain, V. and Patel, D. (2016). A gpu based implementation
of robust face detection system. Procedia Computer
Science, 87:156–163.
Jeong, J.-c., Shin, H.-c., and Cho, J.-i. (2012). Gpu-based
real-time face detector. In 2012 9th International
Conference on Ubiquitous Robots and Ambient Intel-
ligence (URAI), pages 173–175. IEEE.
Jiang, N. and Wang, L. (2015). Quantum image scaling
using nearest neighbor interpolation. Quantum Infor-
mation Processing, 14(5):1559–1571.
Khan, M. A., Shaikh, M. K., bin Mazhar, S. A., Mehboob,
K., et al. (2017). Comparative analysis for a real time
face recognition system using raspberry pi. In 2017
IEEE 4th International Conference on Smart Instru-
mentation, Measurement and Application (ICSIMA),
pages 1–4. IEEE.
Kong, J. and Deng, Y. (2010). Gpu accelerated face detec-
tion. In 2010 International Conference on Intelligent
Control and Information Processing, pages 584–588.
IEEE.
Li, E., Wang, B., Yang, L., Peng, Y.-t., Du, Y., Zhang, Y.,
and Chiu, Y.-J. (2012). Gpu and cpu cooperative ac-
celaration for face detection on modern processors. In
2012 IEEE International Conference on Multimedia
and Expo, pages 769–775. IEEE.
Meng, R., Shengbing, Z., Yi, L., and Meng, Z. (2014).
Cuda-based real-time face recognition system. In
2014 Fourth International Conference on Digital In-
formation and Communication Technology and its Ap-
plications (DICTAP), pages 237–241. IEEE.
MRF (2019). Biometric in government market research report - global forecast till 2025. Market Research Future, page 195. https://www.marketresearchfuture.com/reports/biometrics-government-market-8035.
Mutneja, V. and Singh, S. (2018). Gpu accelerated face
detection from low resolution surveillance videos us-
ing motion and skin color segmentation. Optik,
157:1155–1165.
Mutneja, V. and Singh, S. (2019). Modified viola–jones al-
gorithm with gpu accelerated training and parallelized
skin color filtering-based face detection. Journal of
Real-Time Image Processing, 16(5):1573–1593.
Nguyen, T., Hefenbrock, D., Oberg, J., Kastner, R., and
Baden, S. (2013). A software-based dynamic-warp
scheduling approach for load-balancing the viola–
jones face detection algorithm on gpus. Journal of
Parallel and Distributed Computing, 73(5):677–685.
OpenCV. [Online]. https://sourceforge.net/projects/opencvlibrary/.
Oro, D., Fernández, C., Saeta, J. R., Martorell, X., and
Hernando, J. (2011). Real-time gpu-based face de-
tection in hd video sequences. In 2011 IEEE Inter-
national Conference on Computer Vision Workshops
(ICCV Workshops), pages 530–537. IEEE.
Oro, D., Fern’ndez, C., Segura, C., Martorell, X., and Her-
nando, J. (2012). Accelerating boosting-based face
detection on gpus. In 2012 41st International Confer-
ence on Parallel Processing, pages 309–318. IEEE.
Patidar, S., Singh, U., Patidar, A., Munsoori, R. A., and
Patidar, J. (2020). Comparative study on face de-
tection by gpu, cpu and opencv. Lecture Notes on
Data Engineering and Communications Technologies,
44:686–696.
Saikia, P., Janam, G., and Kathing, M. (2012). Face de-
tection using skin colour model and distance between
eyes. International Journal of Computing, Communi-
cations and Networking, 1(3).
Shaik, K. B., Ganesan, P., Kalist, V., Sathish, B., and
Jenitha, J. M. M. (2015). Comparative study of skin
color detection and segmentation in hsv and ycbcr
color space. Procedia Computer Science, 57(12):41–
48.
Sharma, B., Thota, R., Vydyanathan, N., and Kale, A.
(2009). Towards a robust, real-time face processing
system using cuda-enabled gpus. In 2009 Interna-
tional Conference on High Performance Computing
(HiPC), pages 368–377. IEEE.
Shifa, A., Imtiaz, M. B., Asghar, M. N., and Fleury, M.
(2020). Skin detection and lightweight encryption for
privacy protection in real-time surveillance applica-
tions. Image and Vision Computing, 94:103859.
Sun, L.-c., Zhang, S.-b., Cheng, X.-t., and Zhang, M.
(2013). Acceleration algorithm for cuda-based face
detection. In 2013 IEEE International Conference
on Signal Processing, Communication and Comput-
ing (ICSPCC 2013), pages 1–5. IEEE.
Tek, S. C. and Gökmen, M. (2012). Gpu accelerated real-
time object detection on high resolution videos using
modified census transform. In VISAPP International
Conference on Computer Vision Theory and Applica-
tions, pages 685–688.
Vansh, V., Chandrasekhar, K., Anil, C., and Sahu, S. S.
(2020). Improved face detection using ycbcr and ad-
aboost. In Computational Intelligence in Data Min-
ing, pages 689–699. Springer.
Viola, P. and Jones, M. (2001). Robust real-time face detection. page 747. IEEE.
Wai, A. W. Y., Tahir, S. M., and Chang, Y. C. (2015). Gpu
acceleration of real time viola-jones face detection. In
2015 IEEE International Conference on Control Sys-
tem, Computing and Engineering (ICCSCE), pages
183–188. IEEE.
Wei, G. and Ming, C. (2011). The face detection system
based on gpu+ cpu desktop cluster. In Intl. Conf. on
Multimedia Technol.(ICMT’11), pages 3735–3738.
Zafeiriou, S., Zhang, C., and Zhang, Z. (2015). A survey
on face detection in the wild: past, present and future.
Computer Vision and Image Understanding, 138:1–
24.