Hardware-oriented Algorithm for Human Detection using GMM-MRCoHOG Features

Ryogo Takemoto\textsuperscript{1\textsuperscript{a}}, Yuya Nagamine\textsuperscript{1}, Kazuki Yoshihiro\textsuperscript{1}, Masatoshi Shibata\textsuperscript{2}, Hideo Yamada\textsuperscript{2}, Yuichiro Tanaka\textsuperscript{2\textsuperscript{b}}, Shuichi Enokida\textsuperscript{4\textsuperscript{c}} and Hakaru Tamukoh\textsuperscript{1,3\textsuperscript{d}}

\textsuperscript{1}Graduate School of Life Science and Systems Engineering, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0196, Japan
\textsuperscript{2}AISIN CORPORATION, 2-1 Asahi-machi, Kariya, Aichi, 448-8650, Japan
\textsuperscript{3}Research Center for Neuromorphic AI Hardware, Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu-ku, Kitakyushu, Fukuoka, 808-0196, Japan
\textsuperscript{4}Department of Artificial Intelligence, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka, 820-8502, Japan

Keywords: Image Processing, Human Detection, HOG, MRCoHOG, GMM-MRCoHOG, FPGA.

Abstract: In this research, we focus on Gaussian mixture model-multiresolution co-occurrence histograms of oriented gradients (GMM-MRCoHOG) features using luminance gradients in images and propose a hardware-oriented algorithm of GMM-MRCoHOG to implement it on a field programmable gate array (FPGA). The proposed method simplifies the calculation of luminance gradients, which is a high-cost operation in the conventional algorithm, by using lookup tables to reduce the circuit size. We also designed a human-detection digital architecture of the proposed algorithm for FPGA implementation using high-level synthesis. The verification results showed that the processing speed of the proposed architecture was approximately 123 times faster than that of the FPGA implementation of VGG-16.

1 INTRODUCTION

The demand for home service robots and self-driving cars has been increasing in response to the recent acceleration in the aging of the population and the decline in the birthrate. Because these robots and cars with artificial intelligence are expected to operate near humans, high-precision and high-speed human detection functions are required from the viewpoint of safety. However, the more accurate the human detection, the more complex the computation and the longer the computation time. Parallelization is an effective way to accelerate the computation.

A typical device for parallel processing is a graphics processing unit (GPU). However, GPUs are not suitable for embedded systems such as home service robots and self-driving cars in terms of power consumption and heat dissipation. Instead of a software implementation on GPUs, a hardware implementation, in which a dedicated circuit with a parallel architecture is designed for a specific computation, can achieve a low-power system with high-speed processing because the dedicated circuit can operate more efficiently than a GPU. Therefore, we aim to design a dedicated circuit for human detection and implement it on a field-programmable gate array (FPGA). Because FPGAs have limited physical circuit resources, we need a hardware-oriented algorithm that reduces the number of complex operations in the original algorithm to utilize the limited resources efficiently.

For high-accuracy human detection, histograms of oriented gradients (HOG) features have been proposed (Dalal and Triggs, 2005) and used in multiple applications. This method extracts features of object shapes from luminance gradients in images and represents the features as histograms of the gradients. For higher-accuracy and smaller-memory implementations of human detection compared with HOG features, the Gaussian mixture model-multiresolution co-occurrence histograms of oriented gradients (GMM-MRCoHOG) features, which approximate the conventional histogram-based state space with a mixed Gaussian distribution and optimize the feature space, have been proposed (Higashi et al., 2018; Nagamine et al., 2019). However, the algorithm still requires a large number of complex operations that are not suitable for FPGA implementation.

In this study, we propose a hardware-oriented algorithm of GMM-MRCoHOG that simplifies the complex operations in the original algorithm, such as the calculation of luminance gradients, by using a lookup table (LUT). We then design a dedicated circuit for human recognition that integrates the hardware-oriented GMM-MRCoHOG with a binarized neural network (BNN) (Hubara et al., 2016), and implement it on an FPGA to achieve a high-accuracy, high-speed, and low-power system.

2 RELATED WORKS

MRCoHOG (Iwata and Enokida, 2014), a derivative work of HOG, extracts features by down-sampling an image in two steps; it represents the gradient co-occurrence between images of three resolutions as a two-dimensional co-occurrence histogram. Feature extraction methods using gradient histograms, such as HOG and MRCoHOG, require a manual determination of the optimal class width of the histogram to discretize the luminance gradients. This is difficult because the discretization error of the gradient information and the generalization ability of the features vary depending on the class width. Moreover, these methods require many memory resources to represent gradient histograms.

In contrast, GMM-MRCoHOG constructs an optimal state space by approximating the co-occurrence histogram with a mixed Gaussian distribution, as shown in Fig. 1, and performs feature extraction based on the state space. The approximation reduces the memory resources that the original algorithm requires for gradient histograms because only a small amount of memory is needed to represent the mixed Gaussian distribution.

Figures 2 and 3 show the processing flow of GMM-MRCoHOG. First, the co-occurrences of the luminance gradient pairs (36 gradient directions for each axis in Fig. 2) of the positive and negative data of the training images are mapped to the feature space as continuous values, and each feature is approximated by a mixed Gaussian distribution. Then, using the Jensen–Shannon (JS) information content (Michishita et al., 2018), only features that can effectively separate the positive and negative data are extracted from the respective mixed Gaussian distributions and approximated by a mixed Gaussian distribution using the EM algorithm (Dempster et al., 1977). The resulting mixed Gaussian distribution is then used as the feature space, and the responsibility (described as “resp” in Fig. 3) of each Gaussian distribution is calculated and used as the feature value. In GMM-MRCoHOG, the final number of feature dimensions is determined by the number of Gaussian distributions in 2D space, and not by the number of gradient quantization levels.
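The responsibility used as the feature value can be illustrated with the textbook posterior of a Gaussian mixture. The following is a minimal sketch, not the authors' implementation: it assumes axis-aligned Gaussians described by a center and a width per axis, ignores the circular nature of gradient directions, and uses hypothetical names (`responsibilities`, `weights`, `centers`, `widths`) and toy parameter values.

```python
import math


def responsibilities(point, weights, centers, widths):
    """Posterior probability of each Gaussian component for one 2D point.

    Assumption: axis-aligned (diagonal-covariance) Gaussians. `weights`,
    `centers`, and `widths` are hypothetical names for the mixing
    coefficients, means, and standard deviations of a trained mixture.
    """
    x, y = point
    joint = []
    for w, (cx, cy), (sx, sy) in zip(weights, centers, widths):
        z = ((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2
        joint.append(w * math.exp(-0.5 * z) / (2.0 * math.pi * sx * sy))
    total = sum(joint)
    if total == 0.0:
        # The point is far from every component; fall back to the weights.
        return list(weights)
    return [j / total for j in joint]


# Toy mixture with two components in the (direction, direction) plane.
weights = [0.5, 0.5]
centers = [(5.0, 5.0), (25.0, 20.0)]
widths = [(2.0, 2.0), (4.0, 4.0)]
resp = responsibilities((6.0, 5.0), weights, centers, widths)
print(resp)  # one value per Gaussian; the values sum to 1
```

With this formulation the feature vector has one entry per Gaussian, which matches the statement that the feature dimension is set by the number of Gaussian distributions.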

GMM-MRCoHOG is difficult to implement in hardware because it includes an arctangent function to determine the luminance gradient angle and a responsibility calculation to determine the feature value, both of which are complex operations that require considerable circuit resources. Nagamine et al. proposed a hardware-oriented algorithm that approximates these calculations to reduce the circuit resources (Nagamine et al., 2019). The algorithm determines the luminance gradient angles by using conditional branches on the horizontal and vertical luminance gradients $f_x$ and $f_y$. Figure 4 shows the first quadrant of the luminance gradient space of $f_x$ and $f_y$, which is divided into several areas at intervals of 16 in Manhattan distance. The conditional branches determine an angle by subtracting $f_x$ and $f_y$ according to the divided area; therefore, the angle decision does not require complex operations. For the feature value decision, the algorithm infers a responsibility from the distance between the input vector and each Gaussian distribution. The algorithm also approximates the width of each Gaussian distribution as a power of two and replaces the Gaussian shape with a rectangle so that the computation can be represented by bit-shift operations and fuzzy inferences. Although this hardware-oriented algorithm eliminates most of the circuit resources required by the original algorithm, the conditional branches for the angle calculation still require many LUTs, and the imprecise angle approximation degrades the performance of the algorithm.

Figure 1: Sample of Gaussian Mixture Model.

Figure 2: Training Process of State Space in GMM-MRCoHOG.

Figure 3: Feature Extraction Process in GMM-MRCoHOG.

3 PROPOSED METHODS

To improve the method proposed by Nagamine et al., we propose a novel coarse angle calculation method using a fixed-point tan $\theta$ table. We then construct a hardware-oriented GMM-MRCoHOG-based human recognition circuit using the method for a high-speed and low-power human detection system.

3.1 Coarse Angle Calculation Method using Fixed-point Tangent Table

In the GMM-MRCoHOG algorithm, the luminance gradient angle $\theta$ is calculated as $\theta = \tan^{-1}(f_y/f_x)$ and discretized in 36 directions, that is, at intervals of $10^\circ$. Here, assuming that the angle $\theta$ lies in the first quadrant of the luminance gradient space, we calculate $\tan \theta$ at $10^\circ$ intervals from $\tan 0^\circ$ to $\tan 80^\circ$ in advance and discretize the direction as given by Eq. (1) and shown in Fig. 5.

\[
\begin{aligned}
\text{if } & \tan 0^\circ \leq \frac{f_y}{f_x} < \tan 10^\circ \\
& \quad \text{direction} = 1 \ (\theta : 0^\circ \sim 10^\circ) \\
\text{elif } & \tan 10^\circ \leq \frac{f_y}{f_x} < \tan 20^\circ \\
& \quad \text{direction} = 2 \ (\theta : 10^\circ \sim 20^\circ) \\
& \quad \vdots \\
\text{elif } & \tan 80^\circ \leq \frac{f_y}{f_x} \\
& \quad \text{direction} = 9 \ (\theta : 80^\circ \sim 90^\circ)
\end{aligned}
\tag{1}
\]

Then, we create a $\tan \theta$ table representing the relationship between the discretized $\tan \theta$, $f_x$, and $f_y$, which enables us to obtain coarse angles of the luminance gradients. By utilizing the symmetry of the trigonometric functions, the $\tan \theta$ table can also be applied to the second through fourth quadrants.

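Additionally, we eliminate the divisions in the conditional branches of the $\tan \theta$ table because they require the most circuit resources. Because $f_x \geq 0$ and $f_y \geq 0$ in the first quadrant, multiplying each inequality in Eq. (1) by $f_x$ preserves its direction; therefore, we can replace Eq. (1) with Eq. (2), where no division is required.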

\[
\begin{aligned}
\text{if } & f_x \times \tan 0^\circ \leq f_y < f_x \times \tan 10^\circ \\
& \quad \text{direction} = 1 \ (\theta : 0^\circ \sim 10^\circ) \\
\text{elif } & f_x \times \tan 10^\circ \leq f_y < f_x \times \tan 20^\circ \\
& \quad \text{direction} = 2 \ (\theta : 10^\circ \sim 20^\circ) \\
& \quad \vdots \\
\text{elif } & f_x \times \tan 80^\circ \leq f_y \\
& \quad \text{direction} = 9 \ (\theta : 80^\circ \sim 90^\circ)
\end{aligned}
\tag{2}
\]
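
The division-free comparison of Eq. (2) can be sketched in software with an integer table. This is an illustrative model rather than the synthesized circuit: the six fractional bits follow one of the settings evaluated later, rounding the table entries to the nearest integer is an assumption, and the names `FRAC_BITS`, `TAN_TABLE`, and `direction_q1` are hypothetical.

```python
import math

FRAC_BITS = 6  # fractional bits of the fixed-point table entries

# tan(10 deg) ... tan(80 deg) scaled by 2**FRAC_BITS and rounded to integers.
# Rounding (rather than truncation) is an assumption of this sketch.
TAN_TABLE = [round(math.tan(math.radians(10 * k)) * (1 << FRAC_BITS))
             for k in range(1, 9)]


def direction_q1(fx, fy):
    """Direction 1..9 for integer gradients with fx >= 0 and fy >= 0."""
    fy_scaled = fy << FRAC_BITS          # bring f_y to the scale of the table
    for direction, tan_fixed in enumerate(TAN_TABLE, start=1):
        if fy_scaled < fx * tan_fixed:   # f_y < f_x * tan(10 * direction deg)
            return direction
    return 9                             # f_y >= f_x * tan(80 deg)


print(TAN_TABLE)
print(direction_q1(100, 20), direction_q1(10, 10), direction_q1(0, 10))
```

Because the largest entry, $\tan 80^\circ \approx 5.67$, is below 8, three integer bits are enough for the table. When $f_x = 0$, every product is zero and the comparison chain falls through to direction 9, so no special case for a zero divisor is needed.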
3.2 Human Recognition Circuit Integrating Hardware-oriented GMM-MRCoHOG and BNN

We designed a dedicated human recognition circuit using the proposed coarse angle calculation algorithm and the responsibility inference method proposed by Nagamine et al. (Nagamine et al., 2019), as shown in Fig. 6.

This circuit receives a 32 × 64-pixel image as input and continuously transfers one pixel per clock cycle, from the top-left to the bottom-right pixel of the image, to the image buffers. Here, we configured GMM-MRCoHOG to extract features from images at three resolutions: the original image, a 1/2-resized image, and a 1/4-resized image; therefore, we implemented three image buffers, one for each resolution. Each buffer is a three-line buffer so that the luminance gradient can be calculated from 3 × 3 pixels in the image. The derivative filter blocks receive three lines of pixels and calculate the horizontal and vertical luminance differences. The angle calculation blocks calculate the angles of the luminance gradients, and the results are stored in the two-line buffers of the second stage. Then, the gradient co-occurrence is calculated, and the GMM-MRCoHOG feature is extracted. The obtained feature is fed into the BNN, which classifies the input image as human or not human. The synaptic weights and activations of the BNN are binarized so that the circuit requires only a small amount of memory.
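
The streaming behavior of a three-line buffer followed by a derivative filter can be modeled in a few lines. The sketch below is a software analogy under stated assumptions: the filter is taken to be a central difference $[-1, 0, 1]$ in each direction, the sign of $f_y$ follows the image row order, and `stream_gradients` is a hypothetical name. The paper does not specify these details.

```python
from collections import deque


def stream_gradients(pixels, width):
    """Yield (x, y, fx, fy) for interior pixels of a row-major pixel stream.

    The deque holds the last two full rows plus three pixels, which is the
    amount of history needed to cover a complete 3 x 3 window.
    """
    buf = deque(maxlen=2 * width + 3)
    for i, p in enumerate(pixels):
        buf.append(p)                      # one pixel arrives per "cycle"
        x, y = i % width, i // width
        if x >= 2 and y >= 2:              # a full 3 x 3 window is available
            # Window center is (x - 1, y - 1).
            fx = buf[-1 - width] - buf[-3 - width]   # right minus left
            fy = buf[-2] - buf[-2 - 2 * width]       # below minus above
            yield (x - 1, y - 1, fx, fy)


# Toy 5 x 4 image with a linear ramp: intensity = 2 * x + 3 * y.
width, height = 5, 4
image = [2 * x + 3 * y for y in range(height) for x in range(width)]
for x, y, fx, fy in stream_gradients(image, width):
    print(x, y, fx, fy)
```

Each output depends only on pixels that have already arrived, which is why a gradient can be emitted for every incoming pixel once the first two rows have been buffered.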

Here, the number of mixtures of the Gaussian distribution used in the GMM-MRCoHOG is 6. The BNN has three layers: input, hidden, and output layers, and the number of neurons in the hidden layer is 1.

4 EXPERIMENT

We verified the proposed coarse angle calculation method, implemented a human recognition circuit integrating the hardware-oriented GMM-MRCoHOG and the BNN using high-level synthesis, and estimated the processing speed and circuit size. The experimental environment is presented in Table 1.

4.1 Coarse Angle Calculation Method using the Fixed-point Tangent Table

In this experiment, we verified the proposed coarse angle calculation method with respect to the circuit size, the matching rate between the estimated and true angles, the processing speed of the circuit, and the effect of the approximation on the accuracy of human recognition tasks.

First, we verified the circuit size of the $\tan \theta$ table when the integer part of the fixed-point numbers in the table was fixed to three bits and the fraction part was varied from zero to seven bits. The target device was a Xilinx Zynq XC7Z020 FPGA on a ZedBoard with a clock frequency of 200 MHz.

Next, we verified the matching rate between the angles estimated using the proposed method and the true angle values. The bit-width settings of the fixed-point numbers in the table were the same as in the circuit size verification. The true angle values were calculated by feeding $f_x$ and $f_y$ into the atan2 function of the cmath library in C language and discretizing the result in 36 directions. In addition, we compared the proposed method with the angle approximation method from a previous study (Nagamine et al., 2019).
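
A matching-rate measurement of this kind can be sketched as follows. This is not the authors' test bench: the table entries are rounded to the nearest integer, quadrants two to four are handled by rotating the gradient into the first quadrant, and both gradients are swept over -255 to 255 (511 × 511 = 261,121 pairs), all of which are assumptions of this sketch. All function names are hypothetical.

```python
import math


def make_table(frac_bits):
    """Fixed-point tan(10 deg) ... tan(80 deg) with `frac_bits` fraction bits."""
    scale = 1 << frac_bits
    return [round(math.tan(math.radians(10 * k)) * scale) for k in range(1, 9)]


def bin_q1(fx, fy, table, frac_bits):
    """Bin 0..8 for fx >= 0, fy >= 0 using division-free comparisons."""
    fy_scaled = fy << frac_bits
    for k, tan_fixed in enumerate(table):
        if fy_scaled < fx * tan_fixed:
            return k
    return 8


def approx_bin(fx, fy, table, frac_bits):
    """Bin 0..35; other quadrants are rotated into the first (assumption)."""
    if fx > 0 and fy >= 0:
        return bin_q1(fx, fy, table, frac_bits)
    if fx <= 0 and fy > 0:
        return 9 + bin_q1(fy, -fx, table, frac_bits)
    if fx < 0 and fy <= 0:
        return 18 + bin_q1(-fx, -fy, table, frac_bits)
    if fx >= 0 and fy < 0:
        return 27 + bin_q1(-fy, fx, table, frac_bits)
    return 0  # fx == fy == 0: no gradient, bin chosen to match atan2(0, 0)


def true_bin(fx, fy):
    """Reference bin from atan2, discretized in 36 directions."""
    theta = math.degrees(math.atan2(fy, fx)) % 360.0
    return int(theta // 10) % 36


def matching_rate(frac_bits, max_abs=255):
    """Fraction of gradient pairs whose approximate and true bins agree."""
    table = make_table(frac_bits)
    values = range(-max_abs, max_abs + 1)
    hits = sum(approx_bin(fx, fy, table, frac_bits) == true_bin(fx, fy)
               for fx in values for fy in values)
    return hits / len(values) ** 2


for bits in (0, 2, 4, 6):
    print(bits, round(matching_rate(bits, max_abs=63), 4))
```

The printed rates depend on the rounding and quadrant conventions chosen here, so they need not coincide exactly with the values reported in the paper.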

Next, we compared processing speeds of angle calculations of the following three methods:

1. Software implementation of angle calculation by atan2 function
2. Software implementation of angle calculation by the proposed method
3. Hardware implementation of angle calculation by the proposed method

In the software implementation, the average calculation time over all 261,121 input luminance gradients, executed on an Intel Core i7-8700K central processing unit (CPU), was used as the angle calculation time. In the hardware implementation, the number of clock cycles required by the circuit to calculate an angle multiplied by the clock cycle time was used as the angle calculation time. Here, the fraction part of the fixed-point numbers in the table was set to six bits, and the target board and its clock frequency were the same as in the circuit size verification. Thus, the clock cycle time was 5 ns.

Next, we verified the effect of the approximation in the proposed method on the accuracy of human recognition tasks. To exclude the effect of binarization in the BNN discriminator, we used a support vector machine (SVM) (Cristianini and Shawe-Taylor, 2000), which is a floating-point model, as the discriminator. Here, we compared the accuracy of three algorithms for GMM-MRCoHOG: the original algorithm, the hardware-oriented algorithm of the previous study (Nagamine et al., 2019), and the proposed algorithm. We set the number of mixtures of the Gaussian distribution to 16 and 32. The datasets used in this experiment were the Daimler Pedestrian Classification Benchmark Dataset (Gavrila and Enzweiler, 2008) and the INRIA Person Dataset (Dalal and Triggs, 2007).

Table 1: Experimental Environment.

<table>
<thead>
<tr>
<th>Item</th>
<th>Specification</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>Intel Core i7-8700K 3.70GHz</td>
</tr>
<tr>
<td>Memory</td>
<td>16GB</td>
</tr>
<tr>
<td>OS</td>
<td>Windows 10</td>
</tr>
<tr>
<td>Circuit Synthesis Environment</td>
<td>Vivado HLS 2018.2, GUINNESS</td>
</tr>
<tr>
<td>FPGA Board</td>
<td>ZedBoard, XC7Z020CLG484-1 (200 MHz)</td>
</tr>
<tr>
<td></td>
<td>ZCU102, XCZU9EG-2FFVB1156 (100 MHz)</td>
</tr>
</tbody>
</table>

Table 2: Dataset.

<table>
<thead>
<tr>
<th>Dataset</th>
<th>Images</th>
<th>Resolution</th>
</tr>
</thead>
<tbody>
<tr>
<td>Train</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Daimler</td>
<td>human: 10,000</td>
<td>32 × 64 pixels</td>
</tr>
<tr>
<td></td>
<td>not human: 10,000</td>
<td></td>
</tr>
<tr>
<td>INRIA</td>
<td>human: 1,126</td>
<td>32 × 64 pixels</td>
</tr>
<tr>
<td></td>
<td>not human: 4,840</td>
<td></td>
</tr>
<tr>
<td>Test</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

4.2 Human Recognition Circuit Integrating Hardware-oriented GMM-MRCoHOG and BNN

The designed human recognition circuit was synthesized using Vivado HLS 2018.2, to estimate the processing speed and circuit size. The target device was a Xilinx Zynq UltraScale+ MPSoC XCZU9EG FPGA on a ZCU102 board with a clock frequency of 100 MHz. For comparison, we also implemented the binarized VGG-16 in the XCZU9EG FPGA using GUINNESS (Nakahara et al., 2019).

For the speed comparison between the software and hardware implementations of the human recognition system, the average software execution time to process 5,955 images of 32 × 64 pixels on an Intel Core i7-8700K CPU was used as the image processing time for the software implementation. For the hardware implementation, the number of clock cycles required to process one image of 32 × 64 pixels, estimated by C Synthesis of Vivado HLS 2018.2, multiplied by the clock cycle time of 10 ns, was used as the image processing time. For the binarized VGG-16, the number of clock cycles required to process one image of 48 × 48 pixels, estimated by GUINNESS, multiplied by the clock cycle time of 10 ns, was used as the image processing time.

We also estimated the circuit size of the human recognition system using the Export RTL of Vivado HLS 2018.2, and the circuit size of the binarized VGG-16 using GUINNESS. Moreover, we estimated the power consumption of the circuit using Vivado 2018.2.

5 RESULTS

5.1 Coarse Angle Calculation Method by using Fixed-point Tangent Table

Figures 9 and 10 show the circuit resource utilization of the \( \tan \theta \) table. As shown in Fig. 9, both the LUT and flip-flop (FF) utilization increased almost linearly as the bit width of the fraction part of the fixed-point numbers increased from zero to six bits. However, the seven-bit model used fewer LUTs and FFs than the six-bit model. As shown in Fig. 10, a digital signal processor (DSP) was required only in the seven-bit model, whereas no DSP was required in the range of zero to six bits.

Figure 9: Circuit Resource Utilization of LUTs and FFs.

Figure 10: Circuit Resource Utilization of DSPs.

Figure 11 shows the matching rate between the angles approximated by the proposed method and the true angles obtained by the atan2 function. The matching rate reported in the previous study (Nagamine et al., 2019) was 91%. The matching rate of the proposed method exceeded this value when the bit width of the fraction part of the fixed-point numbers was four or more, and it was approximately 99% when the bit width was six or more. The maximum error of the angle in the figure represents the maximum absolute difference between the directions approximated by the proposed method and the true directions. For example, if an angle is classified into the third direction by the atan2 function but into the fourth direction by the proposed method, the error is 1. As shown in the figure, the maximum error of the angle was 1 when the fraction part of the fixed-point numbers had more than two bits.

Table 3 shows the processing time of the angle calculation. As shown in the table, the proposed hardware-oriented algorithm on the CPU took approximately 14 times longer than the atan2 function. The proposed hardware-oriented algorithm on the FPGA was approximately twice as fast as the atan2 function and approximately 28 times faster than the proposed algorithm on the CPU.

Table 3: Processing Time of Angle Calculation.

<table>
<thead>
<tr>
<th>Methods</th>
<th>Time [ns]</th>
</tr>
</thead>
<tbody>
<tr>
<td>atan2 (software)</td>
<td>59.6</td>
</tr>
<tr>
<td>Proposed (software)</td>
<td>837.7</td>
</tr>
<tr>
<td>Proposed (hardware)</td>
<td>30</td>
</tr>
</tbody>
</table>

Figures 12 and 13 show the accuracy of the human recognition system with 16 and 32 Gaussian mixtures with the SVM implemented by MATLAB. As shown in these figures, the proposed method improved the accuracy of the human recognition task from the previous study in both mixture cases.

Table 4 presents the human recognition accuracy of the proposed method with the BNN where the number of mixtures was set as six, and Table 5 shows the human recognition accuracy of the binarized VGG-16. The proposed human recognition system with a BNN having one neuron in the hidden layer was able to classify humans with high accuracy and outperform the binarized VGG-16.

Table 4: Human Recognition Accuracy by Hardware-Oriented GMM-MRCoHOG with BNN.

<table>
<thead>
<tr>
<th></th>
<th>Accuracy rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>train</td>
<td>99.4 [%]</td>
</tr>
<tr>
<td>test</td>
<td>97.1 [%]</td>
</tr>
</tbody>
</table>
Table 5: Human Recognition Accuracy of Binarized VGG-16.

<table>
<thead>
<tr>
<th></th>
<th>Accuracy rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>train</td>
<td>77.4 [%]</td>
</tr>
<tr>
<td>test</td>
<td>44.3 [%]</td>
</tr>
</tbody>
</table>

5.2 Human Recognition Circuit Integrating Hardware-oriented GMM-MRCoHOG and BNN

Table 6 presents the estimated processing time of human recognition. The proposed hardware was approximately 118 times faster than the software implementation and approximately 123 times faster than the hardware implementation of the binarized VGG-16.

Table 6: Estimated Processing Time of Human Recognition.

<table>
<thead>
<tr>
<th>Methods</th>
<th>Time [ms]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proposed (software)</td>
<td>5.2</td>
</tr>
<tr>
<td>Proposed (hardware)</td>
<td>0.044</td>
</tr>
<tr>
<td>Binarized VGG-16 (hardware)</td>
<td>5.4</td>
</tr>
</tbody>
</table>
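
As a quick check, the speed-up ratios quoted in the text follow directly from the rounded values in Table 6. The cycle count on the right is derived here from the 10 ns clock cycle time and is approximate because 0.044 ms is itself a rounded value.

\[
\frac{5.2\ \text{ms}}{0.044\ \text{ms}} \approx 118, \qquad
\frac{5.4\ \text{ms}}{0.044\ \text{ms}} \approx 123, \qquad
\frac{0.044\ \text{ms}}{10\ \text{ns}} = 4400\ \text{clock cycles}
\]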

Table 7 presents the estimated circuit resource utilization of the proposed human recognition circuit, and Table 8 shows the estimated circuit resource utilization of the binarized VGG-16. As presented in Table 7, the proposed circuit can be implemented in the XCZU9EG FPGA, whereas it could not be implemented in the XC7Z020 FPGA owing to a lack of resources. The dominant resource in the circuit was the block random access memory (BRAM), the usage of which was determined by the number of center coordinates and widths of the mixed Gaussian distribution and by the synaptic weights of the BNN. Compared with the binarized VGG-16, the proposed human recognition circuit consumed fewer FFs and LUTRAMs, but more BRAMs and LUTs.

Table 7: Circuit Resource Utilization of the Proposed Human Recognition Circuit.

<table>
<thead>
<tr>
<th>Resource</th>
<th>Used</th>
<th>Available</th>
<th>Utilization [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>BRAM</td>
<td>154</td>
<td>912</td>
<td>16.9</td>
</tr>
<tr>
<td>DSP48E</td>
<td>0</td>
<td>2,520</td>
<td>0</td>
</tr>
<tr>
<td>FF</td>
<td>11,529</td>
<td>548,160</td>
<td>2.1</td>
</tr>
<tr>
<td>LUT</td>
<td>27,331</td>
<td>274,080</td>
<td>10.0</td>
</tr>
<tr>
<td>LUTRAM</td>
<td>111</td>
<td>144,000</td>
<td>0.1</td>
</tr>
</tbody>
</table>

Table 8: Circuit Resource Utilization of the Binarized VGG-16.

<table>
<thead>
<tr>
<th>Resource</th>
<th>Used</th>
<th>Available</th>
<th>Utilization [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td>BRAM</td>
<td>148</td>
<td>912</td>
<td>16.2</td>
</tr>
<tr>
<td>DSP48E</td>
<td>0</td>
<td>2,520</td>
<td>0</td>
</tr>
<tr>
<td>FF</td>
<td>21,751</td>
<td>548,160</td>
<td>3.9</td>
</tr>
<tr>
<td>LUT</td>
<td>21,765</td>
<td>274,080</td>
<td>7.9</td>
</tr>
<tr>
<td>LUTRAM</td>
<td>1,934</td>
<td>144,000</td>
<td>1.3</td>
</tr>
</tbody>
</table>

Table 9 lists the estimated power consumption of the circuits. As shown in the table, the power consumption of the proposed circuit is 0.923 W. Note that this value covers only the programmable logic on the XCZU9EG chip, not the entire FPGA board; it excludes the processing system on the chip and the dynamic RAMs on the board.

Table 9: Estimated Power Consumption of the Circuit.

<table>
<thead>
<tr>
<th>Circuit</th>
<th>Power [W]</th>
</tr>
</thead>
<tbody>
<tr>
<td>Proposed circuit</td>
<td>0.923</td>
</tr>
<tr>
<td>Binarized VGG-16</td>
<td>0.949</td>
</tr>
</tbody>
</table>

6 DISCUSSION

6.1 Coarse Angle Calculation Method by using Fixed-point Tangent Table

As shown in the experimental results (Figs. 9 and 10), the number of LUTs and FFs increased almost linearly as the fraction part of the fixed-point numbers increased from zero to six bits. In the seven-bit model, the number of LUTs and FFs decreased and the number of DSPs increased because the high-level synthesis compiler estimated that using a DSP was more efficient than using LUTs and FFs to implement the multiplications.

Table 10 summarizes the FF and LUT utilization for the $\tan^{-1}$ calculation in three cases: the high-level synthesis of the atan2 function, the method of the previous study (Nagamine et al., 2019), and the proposed method. As presented in the table, the proposed method, even with six bits for the fraction part, which was the most resource-intensive configuration of the proposed method listed, required approximately $1/30$ of the FFs and LUTs used by the high-level synthesis of the atan2 function. Moreover, the number of LUTs in the proposed circuit was significantly smaller than that in the previous study. Therefore, the proposed method succeeded in reducing the size of the circuit.

Table 10: Circuit Resource Utilization of the Original Algorithm, Previous Study, and Proposed Method.

<table>
<thead>
<tr>
<th></th>
<th>FF</th>
<th>LUT</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\tan^{-1}$</td>
<td>6,000</td>
<td>10,000</td>
</tr>
<tr>
<td>Previous</td>
<td>76</td>
<td>3,087</td>
</tr>
<tr>
<td>Proposed</td>
<td></td>
<td></td>
</tr>
<tr>
<td>(0 bit)</td>
<td>52</td>
<td>97</td>
</tr>
<tr>
<td>(1 bit)</td>
<td>75</td>
<td>112</td>
</tr>
<tr>
<td>(2 bit)</td>
<td>100</td>
<td>167</td>
</tr>
<tr>
<td>(3 bit)</td>
<td>119</td>
<td>197</td>
</tr>
<tr>
<td>(4 bit)</td>
<td>130</td>
<td>236</td>
</tr>
<tr>
<td>(5 bit)</td>
<td>137</td>
<td>266</td>
</tr>
<tr>
<td>(6 bit)</td>
<td>183</td>
<td>297</td>
</tr>
</tbody>
</table>
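
For reference, the ratios behind these statements can be worked out from Table 10 for the six-bit configuration:

\[
\frac{6000}{183} \approx 32.8\ (\text{FF}), \qquad
\frac{10000}{297} \approx 33.7\ (\text{LUT}), \qquad
\frac{3087}{297} \approx 10.4\ (\text{LUT, previous vs. proposed})
\]

The first two ratios correspond to the approximate $1/30$ figure relative to the atan2 function. The third shows that the six-bit configuration uses roughly one tenth of the LUTs of the previous study, although its FF count (183) is higher than that of the previous study (76).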

The accuracy of the proposed method for the human recognition task was better than that of the binarized VGG-16 and that of the previous study (Nagamine et al., 2019). According to the previous study, the accuracy for the same task was 92.4%, whereas the accuracy of the proposed method was 97.1%. Additionally, the discrepancy in the angle calculation of the previous method was 9%. Therefore, the proposed method extracted more precise features, resulting in better performance in the human recognition task.

6.2 Human Recognition Circuit Integrating Hardware-oriented GMM-MRCoHOG and BNN

Although there was no significant difference between the proposed circuit and the binarized VGG-16 in terms of circuit size and power consumption, the proposed circuit outperformed the binarized VGG-16 in the human recognition task, and its processing time was significantly shorter because it computed the algorithm in parallel using an effective pipeline architecture with line buffers. Therefore, we concluded that the proposed circuit is more suitable for a human detection system than the binarized VGG-16.

7 CONCLUSIONS

For robots and self-driving cars operating near humans, a high-accuracy, high-speed, and low-power human detection function is required. In this study, we designed a dedicated circuit of GMM-MRCoHOG with high human recognition performance and implemented it in an FPGA to realize a high-speed and low-power human recognition system. Using the tanθ table, the proposed hardware-oriented algorithm simplifies the calculation of luminance gradients, which is a high-cost operation in the original algorithm. The experimental results show that the proposed method improves the accuracy and processing speed of the human recognition task while reducing the circuit resources.

In future work, we plan to implement a human detection system on an FPGA by feeding multiple regions of interest from an image to the proposed circuit for human recognition. Because the processing speed of the circuit is high, the realization of a real-time human detection system can be expected.

REFERENCES


