2000). In this study, we use the CUDA platform with a GPU, since CUDA is the most popular platform used to increase GPU utilization. The main difference between a CPU and a GPU, as shown in Figure 1 (Reddy et al, 2017), is the number of processing units: a CPU has fewer processing units, each with cache and control units, while a GPU has many more processing units, each with its own cache and control units. GPUs contain hundreds of cores, which enables higher parallelism compared to CPUs.
2.2  NVIDIA, CUDA Architecture and Threads
The GPU follows the SIMD programming model. The GPU contains hundreds of processing cores, called Scalar Processors (SPs). A Streaming Multiprocessor (SM) is a group of eight SPs, and a group of SMs forms the graphics card. The SPs in the same SM execute the same instruction at the same time, hence they execute in Single Instruction Multiple Thread (SIMT) fashion (Lad et al, 2012). Compute Unified Device Architecture (CUDA), developed by NVIDIA, is a parallel processing architecture which, with the help of the GPU, produces a significant performance improvement. CUDA-enabled GPUs are widely used in many applications, such as image and video processing in chemistry and biology, fluid dynamics simulations, computerized tomography (CT), etc. (Bozkurt et al, 2015). CUDA is an extension of the C language for executing code on the GPU that creates parallelism automatically, with no need to change the program architecture to make it multithreaded. It also supports memory scatter, bringing more flexibility to the GPU (Reddy et al, 2017). The CUDA API allows the execution of code using a large number of threads, where threads are grouped into blocks and blocks make up a grid. Blocks are serially assigned for execution on each SM (Lad et al, 2012).
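To make the thread hierarchy concrete, the following minimal CUDA sketch (not from the paper; the kernel and variable names are illustrative) shows how threads are grouped into blocks and blocks into a grid, and how each thread derives a unique global index:

#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scale(float *data, int n)
{
    // Global index derived from block and thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // guard: the grid may contain spare threads
        data[i] *= 2.0f;
}

// Host side: choose a block size and round the grid size up
// so that every element is covered by some thread.
void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
}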
2.3  Gaussian Blur Filter 
The Gaussian blur is a convolution technique used as a pre-processing stage of many computer vision algorithms for smoothing, blurring and eliminating noise in an image (Chauhan, 2018). Gaussian blur is a linear low-pass filter, where the pixel value is calculated using the Gaussian function (Novák et al, 2012). The 2-Dimensional (2D) Gaussian function is the product of two 1-Dimensional (1D) Gaussian functions, defined as shown in equation (1) (Novák et al, 2012):
                    G(x, y) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}                    (1)
where (x, y) are the pixel coordinates and σ is the standard deviation of the Gaussian distribution.
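As a minimal sketch of how equation (1) yields filter weights (the function and parameter names below are illustrative, not from the paper), a host-side C routine can fill a (2r+1)x(2r+1) mask and normalize it so the coefficients sum to 1, keeping the overall image brightness unchanged:

#include <math.h>

// Fill a (2r+1)x(2r+1) Gaussian mask using equation (1),
// then normalize so the coefficients sum to 1.
void build_gaussian_mask(float *mask, int r, float sigma)
{
    int width = 2 * r + 1;
    float sum = 0.0f;
    for (int y = -r; y <= r; y++) {
        for (int x = -r; x <= r; x++) {
            float w = expf(-(x * x + y * y) / (2.0f * sigma * sigma))
                      / (2.0f * 3.14159265f * sigma * sigma);
            mask[(y + r) * width + (x + r)] = w;
            sum += w;
        }
    }
    for (int i = 0; i < width * width; i++)
        mask[i] /= sum;   // normalization
}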
The linear spatial filter mechanism moves the center of a filter mask from one point to another, and the value of each pixel (x, y) in the result is the sum of the products of the filter coefficients and the corresponding neighboring pixels within the filter mask range (Putra et al, 2017). The outcome of the Gaussian blur function is a bell-shaped curve, as shown in Figure 2, since a pixel's weight depends on its distance from the center of the mask (Chauhan, 2018).
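A minimal CUDA sketch of this filtering step (assuming a grayscale image in global memory and a precomputed, normalized mask; all names are illustrative) maps one thread to each output pixel:

// Each thread computes one output pixel: the sum of the products of
// the mask coefficients and the neighboring pixels under the mask.
__global__ void gaussian_blur(const unsigned char *in, unsigned char *out,
                              const float *mask, int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -r; dy <= r; dy++) {
        for (int dx = -r; dx <= r; dx++) {
            // Clamp coordinates at the image border.
            int ix = min(max(x + dx, 0), w - 1);
            int iy = min(max(y + dy, 0), h - 1);
            sum += mask[(dy + r) * (2 * r + 1) + (dx + r)] * in[iy * w + ix];
        }
    }
    out[y * w + x] = (unsigned char)(sum + 0.5f);
}

Such a kernel would be launched with a 2D grid, e.g. a 16x16 thread block and a grid rounded up to cover the whole image.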
The filter kernel size is a factor that affects the performance and processing time of the convolution process. In our study, we used odd kernel widths: 7x7, 13x13, 15x15 and 17x17.
 
Figure 2: The 2D Gaussian function.
2.4  Related Work 
Optimizing image convolution is one of the important topics in image processing that is being widely explored and developed. The effect of optimizing the Gaussian blur by running the filter on multicore CPU systems, and its improvement over a single CPU, was explored by Novák et al. (2012). The effect of running the Gaussian blur filter using CUDA was explored by Chauhan (2018). In these previous studies, the focus was on getting the best performance from multicore CPUs or from the GPU compared to the sequential code. Samet et al. (2015) presented a comparison between the speed-up of real-time applications on the CPU and the GPU using the C++ language and Open Multi-Processing (OpenMP). Reddy et al. (2017) compared the performance of the CPU and the GPU for an image edge detection algorithm using the C language and CUDA on an NVIDIA GeForce GTX 970. In our study, we are exploring the performance improvement