Environmental Information Extraction Based on YOLOv5-Object Detection in Videos Collected by Camera-Collars Installed on Migratory Caribou and Black Bears in Northern Quebec

Jalila Filali¹,², Denis Laurendeau¹,² and Steeve D. Côté³,⁴

¹ Department of Electrical and Computer Engineering, Faculty of Sciences and Engineering, Laval University, Quebec, Canada
² Computer Vision and Systems Laboratory (CVSL), Laval University, Quebec, Canada
³ Department of Biology, Faculty of Sciences and Engineering, Laval University, Quebec, Canada
⁴ Caribou Ungava, Centre for Northern Studies (CEN), Laval University, Quebec, Canada
Keywords: YOLOv5 Model, Video Object Detection, Video Stabilization, Environmental Information Extraction, Data Visualization.
Abstract: With the rapid increase in the number of recorded videos, developing intelligent systems to analyze video content has become increasingly important. As part of Sentinel North's research program (https://sentinelnorth.ulaval.ca/en/research), our project focuses on analyzing videos collected with camera collars installed on caribou (Rangifer tarandus) and black bears (Ursus americanus) living in northern Quebec. Our objective was to extract valuable environmental information such as the weather, resources, and habitat where the animals live. In this paper, we propose an environmental information extraction approach based on YOLOv5 object detection in videos collected by camera collars installed on caribou and black bears in northern Quebec. Our proposal consists, first, in filtering raw data and stabilizing videos to build a wildlife video dataset for deep learning training and for evaluating object detection. Second, it addresses the existing difficulties in detecting objects by adopting the YOLOv5 model to incorporate enriched features and detect objects of different sizes, and it further allows us to exploit and analyze the object detection results to extract relevant information about the weather, resources, and habitat of the animals. Finally, it consists in visualizing the object detection and statistical results through a GUI. The experimental results show that the YOLOv5m model was significantly better than the YOLOv5s model and can detect objects of different sizes. In addition, the obtained results show that our method can extract weather, habitat, and resource classes from stabilized videos and then determine their percentage of appearance. Moreover, our proposed method can automatically provide statistics about environmental information for each stabilized video.
1 INTRODUCTION
Biodiversity has been declining at an alarming rate in recent years, mainly due to various human-driven habitat changes and climate change. The Living Planet Report 2022 reveals an average decline of 69% in populations of different species since 1970 (Adam et al., 2021). There is an urgent need to understand the principal causes and mechanisms of biodiversity loss. It is therefore fundamental to obtain timely and accurate information on species richness distribution, animal behavior, wildlife resources, and animal habitats. In recent years, camera collars installed on wild animals have been widely used in wildlife surveys and for sampling environmental parameters. The videos and images they collect provide valuable information for biologists and wildlife conservation scientists, and developing intelligent systems to analyze this video content has therefore become more and more important.
In recent years, deep learning models have attracted increasing attention owing to their ability to help researchers analyze data more efficiently. Several deep learning-based methods have been proposed for object detection tasks such as animal detection (Jintasuttisak et al., 2022), wild animal facial recognition (Clapham et al., 2020), and plant disease detection (Lakshmi and Savarimuthu, 2022; Sunil et al., 2021).
In (Norouzzadeh et al., 2018), a deep convolu-
tional neural network is used to automatically iden-
tify, count, and describe wildlife images with high
classification accuracy. Recently, the You Only Look Once (YOLO) model (Redmon et al., 2016), the first single-stage deep learning detector, has been widely applied for object detection.
Nowadays, several object detection models have been proposed in ecology and biology. However, object detection for extracting valuable information from videos is still a challenging task, and most of the proposed object detection methods do not take the whole image content into consideration to extract information about the weather, resources, and habitat where animals live. Moreover, there are few wildlife video datasets for deep learning training. Indeed, it is difficult to extract good-quality images from videos collected by cameras carried by animals, because the movement of the animals degrades video quality (unstabilized videos, blurred images, visible distortion, etc.). In
this paper, we report our attempts to address the above
issues, as follows: First, we constructed a wildlife
video dataset for deep learning training and evaluat-
ing object detection. Second, we focused on solving
the existing difficulties in detecting objects by adopt-
ing the YOLOv5 model to incorporate enriched fea-
tures and detect objects of different sizes, and we fur-
ther exploited and analyzed the object detection re-
sults to extract relevant information about weather,
resources, and habitats that are found in the environ-
ment in which caribou and black bears live. Finally,
we visualized the object detection and statistical results by developing a GUI interface.
The main contributions of our work are summarized as follows. First, we propose four video stabiliza-
tion methods based on motion compensation with dif-
ferent parameter combinations and we compare their
performances to determine which method provides a
better stabilization quality. Then, we study the rele-
vance of stabilized videos for object detection. To this
end, we propose a relevance score for object detection
that can predict if a given stabilized video contains
interesting objects that can be used for extracting in-
formation. Second, we adopt the YOLOv5 model to
detect objects in stabilized videos. Both YOLOv5s
and YOLOv5m models are tested with different parameters using a large dataset, and the training results and object detection performance are evaluated. Finally, our
proposed method can automatically extract weather,
habitat, and resource information from a given stabi-
lized video and identify the percentage of appearance
of each class in the video.
This paper is organized in the following way. Sec-
tion 2 presents an overview of the related research.
Our proposed method is detailed in Section 3. Exper-
imental results are presented and discussed in Section
4. Finally, concluding remarks and directions for future work are given in Section 5.
2 RELATED WORKS
2.1 Video Stabilization
Video stabilization (VS) consists in transforming a
video corrupted by undesired camera motions into a
stabilized video to improve its quality by removing
unwanted camera shakes and jitters. In the litera-
ture, several methods of VS have been developed to
produce a better video quality and a coherent video
stream. The existing VS approaches can be generally
categorized into classical VS methods and learning-
based methods (Shi et al., 2022). Classical video sta-
bilization algorithms focus on estimating camera mo-
tion, correcting camera path and stabilizing the video
(Guilluy et al., 2021). Learning-based video stabiliza-
tion methods consist in stabilizing videos using learn-
ing and deep-learning models (Shi et al., 2022).
Generally, using classical VS methods, the video
stabilization process includes two main phases,
namely 1) Motion analysis and modeling and 2) Mo-
tion correction and video stabilization. Motion analysis and modeling goes through three steps: 1) motion estimation; 2) motion outlier removal; and 3) motion modeling. The aim of the first step is to estimate the camera motion from the original video. The estimated movements can be divided into two types: camera motion and the motion of objects that appear in the scene. Hence, the second step, motion outlier removal, is carried out to discard outliers and keep only the movements resulting from the motion of the camera. These movements are then used for camera motion modeling, which models the camera motion (Guilluy et al., 2021) and computes the global transformation (Kulkarni et al., 2016).
The motion correction and video stabilization
phase includes two steps: motion smoothing and
video synthesis. In this phase, parameters of the pre-
vious phase are sent to the motion smoothing mod-
ule, the purpose of which is to perform camera motion
correction. This step applies a low-pass filter to remove high-frequency distortion and smooth the camera movements. Once the camera motion has been corrected, the new camera movements are applied back to the original video, and a new video using
the smoothed camera movements is reconstructed as
a stabilized video within the video synthesis step.
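To make this pipeline concrete, the following sketch implements the four steps above with OpenCV: feature tracking for motion estimation, a RANSAC similarity fit for outlier removal, a moving-average low-pass filter for smoothing, and re-warping for synthesis. It is only an illustrative baseline under these assumptions, not the stabilization code used in this work.

```python
# Classical stabilization sketch: estimate -> filter outliers -> smooth -> re-warp.
import cv2
import numpy as np

def stabilize(in_path, out_path, radius=15, fps=30):
    cap = cv2.VideoCapture(in_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    transforms, frames = [], [prev]          # per-frame (dx, dy, dangle)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # 1) Motion estimation: track corner features between consecutive frames.
        p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                     qualityLevel=0.01, minDistance=30)
        p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
        p0, p1 = p0[st == 1], p1[st == 1]
        # 2) Outlier removal + motion modeling: RANSAC keeps the dominant
        #    (camera) motion and discards independently moving objects.
        m, _ = cv2.estimateAffinePartial2D(p0, p1)
        dx, dy = m[0, 2], m[1, 2]
        da = np.arctan2(m[1, 0], m[0, 0])
        transforms.append((dx, dy, da))
        frames.append(frame)
        prev_gray = gray
    cap.release()
    # 3) Motion smoothing: moving-average (low-pass) filter on the trajectory.
    traj = np.cumsum(np.array(transforms), axis=0)
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    smooth = np.stack([np.convolve(traj[:, i], kernel, mode='same')
                       for i in range(3)], axis=1)
    diff = smooth - traj
    # 4) Video synthesis: re-warp each frame with the corrected motion.
    h, w = frames[0].shape[:2]
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
    out.write(frames[0])
    for (dx, dy, da), (sx, sy, sa), frame in zip(transforms, diff, frames[1:]):
        dx, dy, da = dx + sx, dy + sy, da + sa
        M = np.array([[np.cos(da), -np.sin(da), dx],
                      [np.sin(da),  np.cos(da), dy]])
        out.write(cv2.warpAffine(frame, M, (w, h)))
    out.release()
```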
2.2 Video Object Detection
Advances in artificial intelligence and computer vi-
sion have enabled the creation of efficient systems
to analyze video content and extract semantic infor-
mation. In this context, object detection is one of
the most important and challenging problems in com-
puter vision. It entails identifying and localizing all
instances of an object in input images or videos. In the
literature, numerous approaches have been proposed
to solve the object detection problem. In general,
these approaches can be divided into two categories:
object detection methods based on handcrafted fea-
tures and deep learning-based object detection meth-
ods.
Object detection methods based on handcrafted
features, also called traditional methods, revolve
around extracting features from images that are used
for object detection. With these methods, object detection models were built as sets of hand-crafted feature extractors, such as the Viola-Jones detector, which incorporated feature selection and cascade techniques to increase detection speed (Viola and Jones, 2004).
In addition, several local feature descriptors were ap-
plied to address the problem of scale variations and
rotations. In this context, various handcrafted feature
descriptors have been proposed to extract relevant in-
formation from images. Among these descriptors, we
distinguish scale-invariant feature transform (SIFT)
(Lowe, 2004), speeded-up robust features (SURF)
(Bay et al., 2006) and Histogram of Oriented Gradi-
ents (HOG) (Dalal and Triggs, 2005). These methods build a feature pyramid and slide a fixed-size window over it to detect objects. Although traditional object detection methods cannot meet the requirements of video data analytics, they have provided a strong foundation for modern video object detection systems. Deep learning-based object de-
tection methods can be divided into two categories:
two-stage object detection methods and single-stage
object detection methods.
The most representative two-stage object detection approach is the Region-based Convolutional Neural Network (R-CNN) (Girshick et al., 2014).
The goal of R-CNN is to extract a set of object pro-
posals using Selective Search (Uijlings et al., 2013),
and then each proposal is warped and propagated
through the convolutional layers. Finally, a feature
vector is extracted from each proposal and used as an
input of linear SVM (Support Vector Machine) classi-
fiers to compute confidence scores. Once the class has
been recognized, the algorithm predicts its bounding
box using a trained bounding-box regressor.
Two-stage object detection approaches treat object detection as a classification problem in which the network classifies image content as either object or background. In contrast, single-stage approaches treat it as a regression problem solved without pre-generated region proposals: the main goal is to predict objects and their bounding-box attributes directly from the image pixels.
In the literature, several single-stage object de-
tection methods have been proposed. In 2015, You
Only Look Once (YOLO) was proposed by (Red-
mon et al., 2016) as the first single-stage detector in
deep learning. YOLO was inspired by the GoogLeNet
architecture for image classification (Szegedy et al.,
2015). The input image is divided into an S × S grid, where each grid cell is responsible for object detection. The network predicts bounding boxes and their confidence scores for all grid cells simultaneously. In recent years, the YOLO network has been improved repeatedly and has become a milestone in object detection thanks to its real-time efficiency and accuracy. The second version, YOLOv2 (Redmon and Farhadi, 2017), incorporated many techniques to improve speed and precision. In the original YOLO, the attributes of the predicted boxes were generated by fully connected layers. In YOLOv2, the fully connected layers were removed and anchor boxes were used to generate offsets for the predicted boxes; classes and objectness are predicted for every anchor box. YOLOv3 (Redmon and Farhadi, 2018) introduced a larger and more robust feature extractor network called Darknet-53. YOLO has been further improved in subsequent versions, including YOLOv5 (Jocher et al., 2022), proposed by Glenn Jocher in 2020, and YOLOv7 (Wang et al., 2022).
3 PROPOSED APPROACH
In this section, we describe the architecture of our
proposed approach and detail the different phases and
their components. As depicted in Figure 1, the pro-
posed approach is composed of four main phases: (1)
Data preprocessing, (2) Video object detection, (3)
Environmental information extraction and (4) Results
Visualization. The different components are detailed
below.
3.1 Data Preprocessing
In our project, camera collars are used as tools for collecting videos.
Figure 1: Architecture of the proposed approach.
These videos are gathered from cameras mounted on collars carried by caribou and black bears. Due to the movement of the animals, the videos are often of low quality (unstabilized videos, blurred images, visible distortion, useless data, etc.). Therefore, data filtering and video stabilization are required before object detection can be performed. The data preprocessing phase includes three components: a data filtering component, a video stabilization component, and a video relevance prediction component.
3.1.1 Data Filtering
Several videos contain blurry and noisy frames due to the undesired motion of the cameras carried by the animals. In addition, other videos contain many frames that hold no information related to the animal's environment. Data filtering is therefore an important step to clean the dataset before it is provided as input to the next process. The data filtering process consists in rejecting videos that contain only dark frames, as well as videos whose blurred images do not convey any information about the environment of the animals.
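As an illustration of this kind of rule-based filtering, the sketch below rejects a video when almost all sampled frames are dark (low mean intensity) or blurred (low variance of the Laplacian). The cues and thresholds are assumptions made for the example, not the criteria used in this work.

```python
# Hedged data-filtering sketch: reject videos dominated by dark or blurred frames.
import cv2

def is_dark(gray, thresh=25.0):
    return gray.mean() < thresh          # very low average brightness

def is_blurry(gray, thresh=60.0):
    # Low variance of the Laplacian is a common (approximate) blur indicator.
    return cv2.Laplacian(gray, cv2.CV_64F).var() < thresh

def keep_video(path, max_bad_ratio=0.9, step=10):
    """Keep a video unless (almost) every sampled frame is dark or blurred."""
    cap = cv2.VideoCapture(path)
    bad, total, idx = 0, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:              # sample every `step`-th frame
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            total += 1
            if is_dark(gray) or is_blurry(gray):
                bad += 1
        idx += 1
    cap.release()
    return total > 0 and bad / total < max_bad_ratio
```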
3.1.2 Video Stabilization
In our work, we focus on classical video stabilization methods. To stabilize the videos, we propose four stabilization methods: 1) video stabilization based on motion compensation (translation and rotation) without limits; 2) video stabilization based only on camera translation, with rotation limited; 3) video stabilization based only on rotation, with translation limited; and 4) video stabilization based on a sequential combination of methods 2 and 3. The main idea is to perform video stabilization by incorporating motion compensation with different parameter combinations and then to compare their performances to determine which method provides the best stabilization quality. The first method compensates translation and rotation without limits; the second compensates only camera translation, with rotation limited; in contrast, the third compensates only rotation to obtain the stabilized output video. The last method performs stabilization in two stages: the second method is applied first, and the resulting stabilized videos then undergo a second stabilization based on the third method. The proposed stabilization methods have several parameters that influence stabilization performance; the parameter combinations are presented in Section 4.2.1.
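The parameter names in Table 1 (shakiness, accuracy, smoothing, optalgo, interpol, maxshift, maxangle) match those of the vid.stab library exposed through FFmpeg's vidstabdetect and vidstabtransform filters. Assuming that toolchain (this is an assumption, not something stated in the paper), Method 2, i.e. translation compensation with rotation limited to zero, could be reproduced roughly as follows; Method 3 would swap the maxshift and maxangle limits, and Method 4 would run the two passes sequentially.

```python
# Hedged sketch: two-pass vid.stab stabilization driven from Python.
# Assumes an FFmpeg build with libvidstab is available on the PATH.
import subprocess

def stabilize_method2(src, dst, trf="transforms.trf"):
    # Pass 1: estimate camera motion (shakiness/accuracy as in Table 1).
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-vf", f"vidstabdetect=shakiness=10:accuracy=15:result={trf}",
                    "-f", "null", "-"], check=True)
    # Pass 2: compensate translation only; maxangle=0 forbids rotation,
    # maxshift=-1 leaves the translation unlimited ("no limit" in Table 1).
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-vf", (f"vidstabtransform=input={trf}:smoothing=15:"
                            "optalgo=avg:interpol=bilinear:maxshift=-1:maxangle=0"),
                    dst], check=True)

# Method 4 would feed the output of this function into a second pass that uses
# maxshift=0:maxangle=-1 (rotation-only compensation), mirroring Table 1.
```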
Table 1: Video stabilization results.
Method             | Shakiness | Accuracy | Smoothing | Optalgo | Interpol | Maxshift | Maxangle | PSNR (dB)
Method 1           | 5         | 8        | 30        | gauss   | linear   | no limit | no limit | 35.40
Method 2           | 10        | 15       | 15        | avg     | bilinear | no limit | 0        | 40.95
Method 3           | 10        | 15       | 15        | avg     | bilinear | 0        | no limit | 42.07
Method 4 (stage 1) | 10        | 15       | 15        | avg     | bilinear | no limit | 0        | 47.55
Method 4 (stage 2) | 10        | 15       | 15        | avg     | bilinear | 0        | no limit |
3.1.3 Video Relevance Prediction for Object Detection

Once the raw videos have been filtered and stabilized, a study should be carried out to predict whether the stabilized videos are relevant for object detection. The aim is to predict whether a given stabilized video contains interesting objects that can be used for extracting information about the animals' environment. To achieve this goal, we propose a relevance score for object detection.
Let us present the following definitions to explain our proposed relevance score:
- $V$ is a given stabilized video;
- $F = \{F_1, F_2, \ldots, F_N\}$ is the set of frames extracted from a given video $V$;
- $P^{F_j} = \{P_i\}_{i=1,\ldots,M}$ is the set of patches extracted from the frame $F_j$ (a patch is a group of pixels in the frame, and each frame is divided into small patches);
- $P_t$ is the target patch of $V$;
- $STP_{F_j}$ is the number of patches of the frame $F_j$ that are similar to the target patch $P_t$;
- $NP_{F_j}$ is the number of patches processed for the frame $F_j$.
Given the stabilized video $V$ and the target patch $P_t$, the aim is to determine a relevance score $Rs_{det}(V)$ for each video $V$ to predict if it is relevant for object detection. Our main idea is, first, to compute a weight $W(F_j)$ for each frame, depending on the number of patches similar to the target patch, $STP_{F_j}$; then to weigh the similarity between every two adjacent frames of the video by the weight $W(F_j)$; and finally to compute the weighted average of these similarities over all video frames. We define the following functions to compute the proposed relevance score for object detection:
$$W(F_j) = 1 - \frac{STP_{F_j}}{NP_{F_j}}$$
where $W(F_j)$ is the weight of the frame $F_j$.
$$Sim(F_j, F_{j+1}) = \frac{1}{1 + d(F_j, F_{j+1})}$$
where $Sim(F_j, F_{j+1})$ is the similarity between the frames $F_j$ and $F_{j+1}$, and $d(F_j, F_{j+1})$ is the Euclidean distance between the frames.
These two functions are used for computing the proposed relevance score $Rs_{det}(V)$:
$$Rs_{det}(V) = \frac{\sum_{j=1}^{N} W(F_j)\, Sim(F_j, F_{j+1})}{\sum_{j=1}^{N} W(F_j)} \quad (1)$$
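A minimal sketch of Eq. (1) is given below, following the reconstruction of $W(F_j)$ above. How the target-like patch counts $STP_{F_j}$ are obtained (the patch similarity test against $P_t$) is not detailed here, so they are passed in as precomputed inputs; the sum is taken over the $N-1$ adjacent frame pairs, since the last frame has no successor.

```python
# Sketch of the relevance score Rs_det(V); stp_counts and np_counts are
# assumed to be precomputed per-frame patch statistics.
import numpy as np

def frame_weight(stp, np_total):
    """W(F_j) = 1 - STP_Fj / NP_Fj, as reconstructed above."""
    return 1.0 - stp / np_total

def frame_similarity(f_a, f_b):
    """Sim(F_j, F_j+1) = 1 / (1 + d), d = Euclidean distance between frames."""
    d = np.linalg.norm(f_a.astype(np.float64) - f_b.astype(np.float64))
    return 1.0 / (1.0 + d)

def relevance_score(frames, stp_counts, np_counts):
    w = np.array([frame_weight(s, n) for s, n in zip(stp_counts, np_counts)])
    sim = np.array([frame_similarity(frames[j], frames[j + 1])
                    for j in range(len(frames) - 1)])
    # Weighted average of adjacent-frame similarities (N-1 pairs).
    return float(np.sum(w[:-1] * sim) / np.sum(w[:-1]))

# A video would then be predicted as relevant when relevance_score(...) exceeds
# the 0.5 threshold used in Section 4.2.2.
```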
3.2 Video Object Detection
To detect objects in the stabilized videos, we adopted a feature-enriched object detector based on the YOLOv5 model. Our choice of the YOLOv5 network is motivated by its detection speed and accuracy and by the improved parameter structure of its backbone; its detection speed and accuracy on the COCO dataset are better than those of the previous YOLOv3 and YOLOv4 models. YOLOv5 also incorporates various optimization techniques such as auto-learning bounding box anchors, data augmentation, and the cross-stage partial network (CSPNet). In addition, the head of YOLOv5 generates feature maps of three different sizes (18 × 18, 36 × 36, 72 × 72) to perform multi-scale prediction, which allows it to handle small, medium, and large objects. In our project, many resources and habitats of caribou and black bears are characterized by their small size, such as "lichens" and "rocks"; these objects can appear as medium or large objects depending on the camera position. Multi-scale detection therefore ensures that the model can follow size changes in the process of object detection.
The YOLOv5 model consists of an input layer (640 × 640 images), a backbone, a neck, and a head. The backbone consists of Focus, Conv, C3 (a CSPNet bottleneck with three convolutions), and Spatial Pyramid Pooling (SPP) modules. The Focus module slices the input image horizontally and vertically and then stitches the slices together; its main purpose is to reduce the number of layers, parameters, and FLOPs (floating-point operations). Conv is the convolution unit of YOLOv5; it performs convolution, normalization, and activation operations. C3 contains three Conv units and several bottleneck blocks; these CSP bottlenecks reduce the amount of computation and increase inference speed. The spatial pyramid pooling (SPP) layer performs max pooling with three different kernel sizes on its input, and the outputs are concatenated
in a Concat module. Upsample modules are used in layers 11 and 15 of the neck network to expand the feature maps. The head of YOLOv5 has three feature detection scales, which allows it to detect features of different sizes.
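For reference, detections can be obtained from a trained model through the Ultralytics YOLOv5 torch.hub interface, as sketched below; the weights path and image name are hypothetical placeholders, not the files used in this work.

```python
# Minimal YOLOv5 inference sketch (Ultralytics torch.hub API).
import torch

# Load custom weights; 'runs/train/exp/weights/best.pt' is a placeholder path.
model = torch.hub.load('ultralytics/yolov5', 'custom',
                       path='runs/train/exp/weights/best.pt')
model.conf = 0.25                      # confidence threshold (illustrative)

results = model('frame_000123.jpg', size=640)  # 640x640 input, as in Section 3.2
results.print()                        # per-class counts and timing
df = results.pandas().xyxy[0]          # one row per detection: box, confidence, class name
print(df[['name', 'confidence']])
```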
3.3 Environmental Information Extraction and Results Visualization

Once objects are detected, a video object detection (VOD) result file is generated for each stabilized video; it contains details about which camera was used, the number of detected objects, and the detected classes of each category (weather, resources, and habitats). The VOD result files are then used to determine the appearance percentage of each object detected in the video (cf. Figure 1). Finally, the generated result files are exploited to visualize the object detection and environmental information extraction results in a GUI.
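A hedged sketch of the appearance-percentage computation follows. It assumes the per-frame label files produced by YOLOv5's --save-txt option (one text file per frame, one "class x y w h" line per detection); the exact format of the VOD result files used in this work may differ.

```python
# Appearance percentage per class = share of frames in which the class appears.
from pathlib import Path
from collections import Counter

def appearance_percentages(labels_dir, class_names):
    label_files = sorted(Path(labels_dir).glob('*.txt'))   # one file per frame
    frames_with_class = Counter()
    for f in label_files:
        # Collect the distinct class ids detected in this frame.
        classes_in_frame = {int(line.split()[0])
                            for line in f.read_text().splitlines() if line.strip()}
        for c in classes_in_frame:
            frames_with_class[c] += 1
    n = max(len(label_files), 1)
    return {class_names[c]: 100.0 * k / n for c, k in frames_with_class.items()}

# Example (hypothetical paths and names):
# percentages = appearance_percentages('runs/detect/exp/labels',
#                                      ['sunny', 'cloudy', 'rain', ...])
```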
4 EXPERIMENTATION AND RESULTS ANALYSIS
4.1 Experimental Setup
4.1.1 Dataset and Image Annotation
In our work, we used about 22,597 videos collected by 4 cameras. To prepare the data for training the models, we annotated about 1,177 images using the Roboflow framework (https://roboflow.com/). A semantic vocabulary is used to describe the content of the videos. It is divided into 3 categories, weather, habitats, and resources, whose classes are used as annotation labels and are defined as follows: Weather: sunny, cloudy, rain, fog, snow; Habitats: snow on the ground, boreal forest, tundra meadow, shrubby areas, wet meadow, lakes and streams, rocks and boulders; Resources: lichens, birches, willows, grasses and sedges, broad-leaved herbaceous, mushrooms, berries. The images were divided into a training set, a validation set, and a test set containing 937 (80%), 120 (10%), and 120 (10%) images, respectively.
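For illustration, the snippet below writes a dataset configuration for the 19 classes listed above in the form expected by Ultralytics YOLOv5; it assumes PyYAML and a typical Roboflow-style export layout, and the paths are placeholders rather than the project's actual directories.

```python
# Hypothetical data.yaml generator for the 19 annotation classes.
import yaml  # PyYAML is assumed to be installed

CLASSES = [
    # Weather
    'sunny', 'cloudy', 'rain', 'fog', 'snow',
    # Habitats
    'snow on the ground', 'boreal forest', 'tundra meadow', 'shrubby areas',
    'wet meadow', 'lakes and streams', 'rocks and boulders',
    # Resources
    'lichens', 'birches', 'willows', 'grasses and sedges',
    'broad-leaved herbaceous', 'mushrooms', 'berries',
]

data = {
    'train': 'dataset/train/images',  # 937 images (80%)
    'val': 'dataset/valid/images',    # 120 images (10%)
    'test': 'dataset/test/images',    # 120 images (10%)
    'nc': len(CLASSES),
    'names': CLASSES,
}

with open('data.yaml', 'w') as f:
    yaml.safe_dump(data, f, sort_keys=False)
```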
4.1.2 Evaluation Metrics
Peak Signal-to-Noise Ratio (PSNR) is used to evalu-
ate the quality of stabilized videos. PSNR computed
between the original video and the stabilized video is
defined as:
$$PSNR = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right) \quad (2)$$
where $MSE$ is the mean squared error between the original and stabilized video frames and $MAX_I$ is the maximum pixel value of an image. Greater PSNR values indicate better video stabilization quality.
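Eq. (2) translates directly into code; the sketch below computes the PSNR of one original/stabilized frame pair. Per-video values such as those in Table 1 would be obtained by aggregating over corresponding frames (the aggregation strategy here is an assumption).

```python
# PSNR between one original frame and the corresponding stabilized frame.
import numpy as np

def psnr(original, stabilized, max_i=255.0):
    err = original.astype(np.float64) - stabilized.astype(np.float64)
    mse = np.mean(err ** 2)                       # mean squared error
    if mse == 0:
        return float('inf')                       # identical frames
    return 10.0 * np.log10(max_i ** 2 / mse)
```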
To evaluate object detection performance, we used
precision and recall, which are the most popular met-
rics for object detection evaluation.
4.2 Evaluation Results and Analysis
4.2.1 Video Stabilization
Table 1 shows the stabilization results of each method in terms of PSNR, depending on the stabilization parameters, using 1,500 videos. A detailed analysis of the above-mentioned methods shows that Method 4 has a higher PSNR value (47.55 dB) than the other methods. The higher the PSNR, the better the video stabilization. Thus, the output videos stabilized by the fourth method are better than those stabilized by the other methods. This indicates that performing rotation and translation compensation in two stages improves the video stabilization quality.
4.2.2 Video Relevance Prediction
To predict whether the stabilized videos contain relevant objects that can be used for object detection and for extracting information about weather, resources, and habitats, a patch set is first extracted from each video, and the proposed relevance score of each video is then computed using the functions defined above.
Patch Extraction. Table 2 shows the patch ex-
traction results for 4615 black bear videos and
4590 caribou videos.
Relevance Score Analysis and Prediction.
Once the patches are extracted, the relevance score $Rs_{det}(V)$ is computed for each video. To predict whether videos are relevant for object detection, we used a prediction threshold of 0.5: if the relevance score is larger than the threshold, the video is predicted as relevant for object detection; otherwise, it is predicted as irrelevant.
Table 2: Patch extraction results.
Videos             | Avg (frames/video) | Patches/frame | Avg (patches/video) | Rejected patches
4615 (black bears) | 420                | 40            | 16800               | 2100
4590 (caribou)     | 445                | 32            | 14240               | 5785
Table 3: Prediction analysis.
Camera            | Videos | R.PV (Rs_det > 0.5) | I.PV (Rs_det <= 0.5) | R.PV_STP | I.PV_STP
Cam1424_ID37585   | 4450   | 3627                | 823                  | 14       | 27
Cam1435_ID37584   | 4128   | 2710                | 1418                 | 11       | 31
Collar21226_cam06 | 3678   | 1178                | 2500                 | 12       | 29
Collar21227_cam07 | 4023   | 2245                | 1778                 | 13       | 22
Figure 2: Examples of two interfaces for object detection and information extraction results visualization: the first one
shows the visualization of object detection results of a selected caribou video, and the second one shows the visualization of
environmental information extraction results of a selected black bear video.
Table 3 shows the prediction analysis for each camera, where R.PV and I.PV are the numbers of videos predicted as relevant and irrelevant for object detection, respectively, and R.PV_STP and I.PV_STP are the average numbers of patches similar to the target patch per frame for the R.PV and I.PV videos, respectively. The obtained results show that the black bear videos are more pertinent for object detection than the caribou videos.
4.2.3 Object Detection and Information Extraction Results
The experimental models were trained using Ultralytics YOLOv5. We performed 4 training runs using YOLOv5s and YOLOv5m with 300 epochs, each implemented with batch sizes of 16 and 32. The comparison of the object detection results in terms of precision and recall is shown in Table 4. The YOLOv5m-B32E300 model performs best, with a precision of 0.81.
Table 4: Comparisons of different runs.
Run             | Images | P    | R
YOLOv5s-B16E300 | 100    | 0.61 | 0.41
YOLOv5s-B32E300 | 100    | 0.69 | 0.42
YOLOv5m-B16E300 | 100    | 0.78 | 0.44
YOLOv5m-B32E300 | 100    | 0.81 | 0.46
Therefore, the YOLOv5m model with a batch size of 32 is highly capable of detecting objects of different sizes, so it is used for object detection and environmental information extraction. Two example interfaces for visualizing the object detection and information extraction results are shown in Figure 2.
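The four runs could be launched with the train.py script of the Ultralytics YOLOv5 repository, roughly as sketched below; the dataset path and run names are placeholders rather than the exact configuration used in this work.

```python
# Hedged sketch of the four training runs (YOLOv5s/YOLOv5m, batch 16/32, 300 epochs).
import subprocess

RUNS = [
    ('yolov5s', 16, 'YOLOv5s-B16E300'),
    ('yolov5s', 32, 'YOLOv5s-B32E300'),
    ('yolov5m', 16, 'YOLOv5m-B16E300'),
    ('yolov5m', 32, 'YOLOv5m-B32E300'),
]

for weights, batch, name in RUNS:
    subprocess.run(['python', 'train.py',
                    '--img', '640', '--epochs', '300',
                    '--batch-size', str(batch),
                    '--data', 'data.yaml',       # placeholder dataset config
                    '--weights', f'{weights}.pt',
                    '--name', name], check=True)
```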
5 CONCLUSIONS
In this paper, an environmental information extraction
method based on YOLOv5-object detection in videos
was proposed. It relies on analyzing videos collected
by camera collars fitted on caribou and black bears
in northern Quebec. First, the videos are filtered and stabilized; a relevance score for object detection is then proposed and computed for each stabilized video to predict whether it is relevant for object detection. Second, the YOLOv5 model is adopted to incorporate
enriched features and detect small, medium, and large
objects. Object detection results are then exploited to
extract relevant information about weather, resources,
and habitats found in the environment in which cari-
bou and black bears live. Finally, the environmental
information is analyzed and statistical results are vi-
sualized for each stabilized video. In this work, we
have conducted an experimental study where we fo-
cused on evaluating each phase of our proposed ap-
proach. It is worth noting that the proposed stabilization method, based on motion compensation with different parameter combinations, can improve the quality of the videos. Also, the YOLOv5m model was
significantly better than the YOLOv5s model and can
detect small, medium, and large objects. Moreover,
the obtained results show that our method can extract
weather, habitat, and resource classes and then de-
termine their percentage of appearance in videos. In
future research, the network model structure will be
improved to analyze animal behavior using a wildlife
dataset.
REFERENCES
Adam, M., Tomášek, P., Lehejček, J., Trojan, J., and Jůnek, T. (2021). The role of citizen science and deep learning in camera trapping. Sustainability, 13(18):10287.
Bay, H., Tuytelaars, T., and Gool, L. V. (2006). Surf:
Speeded up robust features. In European conference
on computer vision, pages 404–417. Springer.
Clapham, M., Miller, E., Nguyen, M., and Darimont, C. T.
(2020). Automated facial recognition for wildlife that
lack unique markings: A deep learning approach for
brown bears. Ecology and evolution, 10(23):12883–
12892.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 886–893. IEEE.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 580–587.
Guilluy, W., Oudre, L., and Beghdadi, A. (2021). Video
stabilization: Overview, challenges and perspec-
tives. Signal Processing: Image Communication,
90:116015.
Jintasuttisak, T., Leonce, A., Sher Shah, M., Khafaga,
T., Simkins, G., and Edirisinghe, E. (2022). Deep
learning based animal detection and tracking in drone
video footage. In Proceedings of the 8th International
Conference on Computing and Artificial Intelligence,
pages 425–431.
Jocher, G. et al. (2022). ultralytics/yolov5: v6.2 - YOLOv5 classification models, Apple M1, reproducibility, ClearML and Deci.ai integrations. Zenodo.
Kulkarni, S., Bormane, D., and Nalbalwar, S. (2016). Stabi-
lization of jittery videos using feature point matching
technique. In International Conference on Commu-
nication and Signal Processing 2016 (ICCASP 2016),
pages 708–717. Atlantis Press.
Lakshmi, R. K. and Savarimuthu, N. (2022). Pldd—a
deep learning-based plant leaf disease detection. IEEE
Consumer Electronics Magazine, 11(3):44–49.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International journal of computer
vision, 60(2):91–110.
Norouzzadeh, M. S., Nguyen, A., Kosmala, M., Swanson,
A., Palmer, M. S., Packer, C., and Clune, J. (2018).
Automatically identifying, counting, and describing
wild animals in camera-trap images with deep learn-
ing. Proceedings of the National Academy of Sciences,
115(25):E5716–E5725.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster,
stronger. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7263–
7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Shi, Z., Shi, F., Lai, W.-S., Liang, C.-K., and Liang, Y.
(2022). Deep online fused video stabilization. In Pro-
ceedings of the IEEE/CVF Winter Conference on Ap-
plications of Computer Vision, pages 1250–1258.
Sunil, C., Jaidhar, C., and Patil, N. (2021). Cardamom plant
disease detection approach using efficientnetv2. IEEE
Access, 10:789–804.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., and
Smeulders, A. W. (2013). Selective search for object
recognition. International journal of computer vision,
104(2):154–171.
Viola, P. and Jones, M. J. (2004). Robust real-time face
detection. International journal of computer vision,
57(2):137–154.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022).
Yolov7: Trainable bag-of-freebies sets new state-of-
the-art for real-time object detectors. arXiv preprint
arXiv:2207.02696.