validation and train (374 photos) is used for training.
The label directory likewise contains two subdirectories, train and val,
which hold the text files with the corresponding image labels.
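The layout and label files described above presumably follow the standard YOLO convention: one text file per image, one object per line, with a class index followed by normalized box coordinates (the file names below are illustrative, not taken from the dataset):

```
dataset/
├── images/
│   ├── train/    # 374 training photos
│   └── val/
└── labels/
    ├── train/    # e.g. img_001.txt matching img_001.jpg
    └── val/

# img_001.txt — one object per line:
# <class_id> <x_center> <y_center> <width> <height>  (all normalized to [0, 1])
0 0.512 0.487 0.230 0.610
```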
2.2 YOLO Architecture
YOLO is mainly composed of a Backbone, a Neck, and a
Head. The Backbone, typically a Convolutional Neural
Network (CNN) that captures hierarchical features at
various scales, is responsible for extracting meaningful
features from the input image. The Neck, which sits
between the Backbone and the Head, gathers and refines
the features extracted by the Backbone. The Head
performs the final detection, predicting the bounding
boxes, categories, and confidence scores.
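The Backbone-Neck-Head flow can be sketched as a minimal PyTorch module (a toy illustration assuming torch is available; real YOLO Backbones, Necks, and Heads are far more elaborate):

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Toy Backbone -> Neck -> Head pipeline in the spirit of YOLO."""

    def __init__(self, num_classes=2, num_anchors=3):
        super().__init__()
        # Backbone: extracts features from the input image, downsampling 4x
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Neck: refines/aggregates the backbone features
        self.neck = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.SiLU())
        # Head: per anchor, predicts 4 box values + 1 confidence + class scores
        self.head = nn.Conv2d(32, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

model = TinyDetector()
out = model(torch.zeros(1, 3, 64, 64))
print(tuple(out.shape))  # (1, 21, 16, 16): 3 anchors * (5 + 2) channels
```

The head output is one prediction vector per grid cell and anchor, which is exactly the grid-based regression described in the next paragraphs.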
The core idea of the YOLO algorithm is to cast object
detection as a regression problem: the entire image is
fed to a neural network that predicts bounding-box
positions and classes directly. First, YOLO divides the
input image into a fixed-size grid. For each grid cell,
YOLO predicts a fixed number of bounding boxes; each
prediction contains the box position (center coordinates,
width, and height), a confidence score, and the category
of the target. A CNN performs a single forward pass,
predicting the positions and categories of all bounding
boxes simultaneously. YOLO trains the network with a
multi-task loss function combining position loss,
confidence loss, and category loss. In addition, because
the predicted bounding boxes may overlap, YOLO uses
the Non-Maximum Suppression (NMS) algorithm to
filter out redundant boxes and retain the best one.
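The NMS step can be sketched in a few lines of plain Python (a minimal greedy IoU-based variant; production implementations are vectorized):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop heavy overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```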
Three YOLO versions were chosen for comparison:
YOLOv5, YOLOv6, and YOLOv8, each introducing
different improvements.
YOLOv5 (Chen, Ding, and Li, 2022) improves
upon YOLOv4 with enhancements such as Mosaic
augmentation, AutoAnchor box calculation, and the
Cross Stage Partial Network (CSPNet) for better
efficiency and reduced computation, while maintaining
the Feature Pyramid Network (FPN) + Path
Aggregation Network (PAN) Neck structure.
YOLOv6 (Li, Li, Geng, Jiang, Cheng, Zhang, Ke, Xu,
and Chu, 2023) introduces scalable Backbone and
Neck designs, with EfficientRep for small models
and CSPStackRep for larger ones, and adopts an
anchor-free paradigm and a hybrid channel strategy
to reduce computational cost and improve accuracy.
YOLOv8 (Jocher, Chaurasia, and Qiu, 2023) integrates
CSPNet with Darknet53, employs advanced activation
functions such as SiLU, and optimizes feature
extraction and the loss-function computation for better
performance and more efficient edge deployment.
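The SiLU activation mentioned above is simply the input scaled by its own sigmoid, which keeps the function smooth and non-monotonic (a one-line definition for reference):

```python
import math

def silu(x):
    # SiLU (also called Swish): silu(x) = x * sigmoid(x)
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(0.0))              # 0.0
print(round(silu(1.0), 4))    # 0.7311, i.e. 1 * sigmoid(1)
```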
2.3 Pose Estimation
The YOLO training dataset alone cannot distinguish
falling from lying down, making it difficult to detect
falls in the elderly accurately. To address this, pose
estimation is added to detect abnormal motion patterns,
since falls typically occur within 1.5 seconds, as
suggested by Lu and Chu (2018). YOLOv8-Pose is
used for pose estimation, identifying the key points of
the human skeleton. The study calculates motion speed
and torso angle to detect falls: if the hip or shoulder
speed exceeds a threshold, or if the torso angle drops
below a certain level, a potential fall is flagged. The
elapsed time between the start and end of the fall is
then checked; if it is within the threshold, the fall is
confirmed.
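The two-stage rule above can be sketched as follows. The numeric thresholds here are illustrative assumptions, not the values used in the study; only the 1.5 s fall duration follows Lu and Chu (2018):

```python
import math

SPEED_THRESH = 0.8    # keypoint speed, in image heights per second (assumed)
ANGLE_THRESH = 45.0   # torso angle above the horizontal, in degrees (assumed)
MAX_FALL_TIME = 1.5   # seconds between fall start and end (Lu and Chu, 2018)

def is_potential_fall(prev_hip, hip, shoulder, dt):
    """Flag a frame if the hip moves fast or the torso is near horizontal.

    Keypoints are (x, y) in normalized image coordinates; dt is the time
    between the two frames in seconds.
    """
    speed = math.hypot(hip[0] - prev_hip[0], hip[1] - prev_hip[1]) / dt
    # angle of the shoulder-hip line relative to the horizontal
    dx, dy = hip[0] - shoulder[0], hip[1] - shoulder[1]
    angle = math.degrees(math.atan2(abs(dy), abs(dx)))
    return speed > SPEED_THRESH or angle < ANGLE_THRESH

def confirm_fall(t_start, t_end):
    """Confirm a flagged fall only if it completed within the time window."""
    return (t_end - t_start) <= MAX_FALL_TIME

# An upright, slow-moving person is not flagged; a horizontal torso is.
print(is_potential_fall((0.50, 0.49), (0.50, 0.50), (0.50, 0.20), dt=1/30))  # False
print(is_potential_fall((0.60, 0.80), (0.60, 0.80), (0.30, 0.80), dt=1/30))  # True
```

In practice the hip and shoulder coordinates would come from the YOLOv8-Pose keypoint output for each frame.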
3 RESULTS
3.1 Dataset Analysis Results
Because the same dataset is used, the labels
correlograms produced by the different YOLO versions
are similar, so a unified analysis is given here.
The diagonal histograms in Figure 1 show that the
distributions of the X and Y coordinates are
concentrated, meaning that most bounding boxes lie
near the middle of the image. The distributions of
width and height are relatively uniform, indicating
that objects of different scales fall within the
detection range.
The scatter plots of X versus width and Y versus width
in Figure 1 show a certain negative correlation: objects
in the central area of the image are usually larger
(greater width), while objects near the edges tend to be
smaller. The scatter plots of X versus height and Y
versus height in Figure 1 show a similar trend, with a
certain negative correlation between height and position.
The relationship between the center coordinates X and
Y in Figure 1 or Figure 2 illustrates the distribution
of bounding-box centers across the image. The points
are concentrated mainly in the middle, reflecting that
the detected objects mostly appear in the central area
of the image.
The scatter plot of width and height from Figure 1
or Figure 2 shows that there is a certain positive