Detection, Estimation & Tracking Road Objects for Assisting Driving
Afnan Alshkeili, Wenliang Qiu and Bidisha Ghosh
Dept. of Civil, Structural & Environmental Engineering, Trinity College Dublin, Ireland
Keywords:
Multi-object Tracking, Object Detection, Traffic Flow Estimation, Distance Estimation.
Abstract:
The new era of mobility is moving towards automation. Detecting, estimating, and tracking objects from
moving vehicles using dash-cam images in real-time can provide substantial advantages in supporting drivers’
decision making in advance. In this paper, an advanced deep learning-based object detection, distance es-
timation, and tracking framework has been proposed for this purpose. The RetinaNet algorithm with a ResNeXt backbone network has been used to detect five traffic object classes, including cars, cyclists, pedestrians, buses, and motorcycles, with improved accuracy. Additionally, a distance estimation algorithm was introduced to increase both the reliability and precision of detection. Moreover, an improved Simple Online and Real-time Tracking (SORT) algorithm was sequentially used to estimate traffic parameters such as the volume and
approach speed of each of these traffic object classes. The algorithm was trained and tested on stock imagery
(COCO2017, MOT16, and TDD) of real-world videos taken from urban arterials with multimodal, signalized
traffic operations.
1 INTRODUCTION
Autonomous vehicles (AV), self-driving cars, or
driverless cars are widely used phrases to describe ve-
hicles capable of sensing the environment and safely
driving with no or little human inputs. These cars
were introduced to reduce driving efforts, especially
on urban roads (Aneesh et al., 2019), and due to their
safety implications. It is estimated that the penetration rate of AVs in traffic fleets can reduce traffic conflicts proportionally, with an over 90% reduction if all vehicles on the road are AVs (Papadoulis et al., 2019).
Concerning AVs, moving object detection, clas-
sification, and tracking algorithms have received ex-
tensive research attention. The object detection algorithms studied in the literature consider multiple objects, focusing primarily on pedestrians and road signs. R-CNN (Bunel et al., 2016) is a common deep learning algorithm used for object detection; however, it cannot be implemented for real-time on-road traffic object detection due to its high computational complexity. Faster R-CNN (Zhao et al., 2016) was developed to overcome this. The latest algorithms, such as the Single Shot Multi-Box Detector (SSD) (Lin et al., 2017a) and You Only Look Once (YOLO), achieve high efficiency as they overcome some of the disadvantages present in CNN and R-CNN. Gavrila (Gavrila, 2000) proposed a prototype system for pedestrian detection from a moving vehicle using a two-step approach. Lee et al. (Lee et al., 2009) developed an object detection algorithm using 3D cues for detecting pedestrians and vehicles. In 2013, (Felix Albu, 2013) introduced an object detection method based on image profiles that enables the enhancement of digital images.
In this paper, a RetinaNet-based multi-class object detection and tracking framework is proposed to detect moving objects such as neighboring vehicles, pedestrians, and other traffic modes, by conducting vision-based analysis using video footage from a moving vehicle in an urban signalized road network. Detection based on RetinaNet has been investigated in the literature concerning autonomous driving. (Pei et al., 2020), (Aneesh et al., 2019), and (Hoang et al., 2019) used a single-stage detector in which RetinaNet was applied to form a traffic sign detection network and a CNN-based classifier for road signs, traffic light detection, and sign recognition. RetinaNet showed an improvement in detection accuracy and real-time classification operations. These algorithms
have been used for multispectral pedestrian detection
(Rajendran et al., 2019). Moreover, pedestrian detection has been investigated: (He and Zeng, 2017) developed a warning system using Faster R-CNN that aims to reduce traffic accidents.
Our proposed framework detected five different
traffic-related objects (cars, buses, motorcycles, cy-
clists, and pedestrians) simultaneously using Reti-
naNet (Lin et al., 2017b), with ResNeXt (Xie et al., 2017) as the backbone.
2017) as a backbone. The framework utilized a se-
quential tracking algorithm employing Simple On-
line and Real-time Tracking (SORT) to track all object classes. Furthermore, the framework estimated traffic parameters such as traffic volume, speed, and distance. Estimating these parameters through vision-based analysis of video images from a single camera on a moving vehicle is a significant improvement in assisting driving. The model was first trained on different datasets and then tested on a single dataset.
The paper is organized as follows: Section 1 provides background on the main algorithms associated with the topic. Section 2 covers the methodology used and the framework developed. Section 3 describes the data used in this paper. Section 4 presents the analysis and results of the application. Finally, Section 5 discusses the proposed method's applications, and a conclusion of the study is provided.
2 METHODOLOGY
The complete proposed framework for detecting mov-
ing objects from a moving vehicle in an urban trans-
port network is described in this section. The theoreti-
cal background of the different elements of the frame-
work is discussed.
The study focuses on developing methodologies
for detecting, estimating, and tracking the flow, speed,
and distance of pedestrians, cyclists, and other types
of vehicles from an autonomous vehicle utilizing only
visual information. The study utilizes videos captured
from a single dash-cam of a human-driven car in the
absence of access to an autonomous vehicle or ap-
propriate video footage acquired from an autonomous
vehicle.
2.1 Framework
The framework consists of three main modules: detection, estimation, and tracking. The detection module consists of a RetinaNet detector, which contains the ResNeXt backbone network (Xie et al., 2017), a Feature Pyramid Network (FPN) (Lin et al., 2017a), and class/bbox subnets (Lin et al., 2017b), and outputs both visual and numerical results. The visual results are bounding boxes that show the relative categories, and the numerical result is the confidence
score, which is then fed to the next module. The estimation module identifies the distance, volume, flow rate per second, and the speed of detected pedestrians. In this module, only detected pedestrians are processed, selected from the detection results by an object-class filter; the module outputs both visual and numerical results, namely a visual overlay with the estimated distance indicated, together with the estimated flow and volume. The third module is the improved SORT tracking algorithm, which consists of Kalman prediction (Welch et al., 1995), object association, a buffer module for missed detections, and a tracking-information update. The outputs from the detection and estimation modules are then tracked further. Figure 1 provides a schematic of the whole process.

Figure 1: Framework of detection, tracking and estimation.
2.2 Traffic RetinaNet
This study uses a simple dense RetinaNet detector formed by improving the existing single-stage object detection models presented in Feature Pyramid Networks for object detection and Focal Loss for dense object detection. To initialize our network, we started with transfer learning using a pre-trained model, so that the weights and architecture obtained can be applied to our problem statement. The ImageNet dataset is widely used to build various architectures since it is large enough to create a generalized model; it consists of 14,197,122 images, 21,841 indexed synsets, and about 500 images per node (Russakovsky et al., 2015a). The problem statement is to train a model to classify images into the five categories mentioned earlier. Through transfer learning, the pre-trained model shows a strong ability to generalize to images outside the ImageNet dataset. Fine-tuning is the process in which model parameters are precisely adjusted to fit certain observations (Gunawan et al., 2011); it was performed on the COCO dataset so that the necessary modifications could be made to suit our model.
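The transfer-learning and fine-tuning step described above can be sketched as follows. This is a minimal, hedged illustration assuming a torchvision environment; the layer names, the choice of which stages to freeze, and the 5-class head are assumptions for illustration rather than the authors' exact code.

import torch.nn as nn
import torchvision

NUM_CLASSES = 5  # cars, cyclists, pedestrians, buses, motorcycles

# Transfer learning: start from ImageNet-pretrained ResNeXt weights.
backbone = torchvision.models.resnext50_32x4d(pretrained=True)

# Fine-tuning: freeze the early stages, train the later stages and a new head.
for name, param in backbone.named_parameters():
    if not name.startswith(("layer3", "layer4", "fc")):
        param.requires_grad = False

# Replace the 1000-way ImageNet classifier with a 5-way traffic-class head.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)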
Figure 2: Framework of RetinaNet with ResNeXt backbone (Traffic RetinaNet).
2.2.1 ResNeXt
In order to achieve more efficient detection, we utilize ResNeXt (Xie et al., 2017) as the backbone of the detector, as shown in Figure 2. The initialization and fine-tuning outputs obtained from the previous steps are now fed into ResNeXt. The network is made of repeated building blocks that aggregate a set of transformations with the same topology. ResNeXt is a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. As shown in Figure 2, the input image passes through a set of lower-dimensional embeddings (by 1×1 convolutions), followed by specialized filters (3×3, 5×5, etc.). A ResNeXt block with cardinality 32 and similar complexity performs the aggregation of residual transformations (Xie et al., 2017). The output of each convolution is then passed to the feature pyramid net.
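For illustration, a single ResNeXt building block with cardinality 32 can be written compactly with a grouped 3×3 convolution, the equivalent form given by Xie et al. (2017). This is a sketch under assumed channel widths, not the authors' implementation.

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),  # 1x1 reduce
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),                   # 32 grouped paths
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),  # 1x1 expand
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Aggregated residual transformation plus identity shortcut.
        return self.relu(self.branch(x) + x)

# A 256-channel feature map keeps its shape through the block.
y = ResNeXtBlock()(torch.randn(1, 256, 64, 64))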
2.2.2 Feature Pyramid Net
The Feature Pyramid Network (FPN) is an accurate in-network feature pyramid that can replace featurized image pyramids without sacrificing representational power, speed, or memory (Lin et al., 2017a). This network passes high-level semantic features to the shallow layers using lateral connections, so that each level of the pyramid detects objects at a different scale. Figure 2 (top middle) illustrates the pathway, and Figure 2 (bottom middle) shows the lateral connection details. The outputs are then passed to the last subnet.
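A minimal sketch of one FPN lateral connection follows: a 1×1 convolution on the backbone feature map is added to the 2× upsampled coarser pyramid level. The channel counts and the 3×3 smoothing convolution are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LateralConnection(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels, out_channels, kernel_size=1)           # 1x1 conv
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c_i, p_coarser):
        # 2x upsample the coarser pyramid level and add the lateral projection.
        p_i = self.lateral(c_i) + F.interpolate(p_coarser, scale_factor=2, mode="nearest")
        return self.smooth(p_i)

# Merge a 512-channel backbone map with the next (coarser) 256-channel pyramid level.
p4 = LateralConnection(512)(torch.randn(1, 512, 32, 32), torch.randn(1, 256, 16, 16))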
2.2.3 Class/Box Subnet
A subnetwork is attached to each pyramid level. As shown in Figure 2 (right), the classification subnet predicts the probability that an object is present at each position. The box regression subnet in Figure 2 (bottom right) regresses each anchor's offset and matches it to the nearest ground truth. Both subnets (object classification and box regression) share a common structure but use separate parameters (Lin et al., 2017b).
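The class and box subnets of Figure 2 (right) can be sketched as below: four 3×3 convolutions with 256 channels followed by a final convolution producing K·A classification outputs or 4·A box offsets per location. The anchor count and class count are assumed values, not necessarily the authors' settings.

import torch.nn as nn

def make_subnet(outputs_per_anchor, num_anchors=9, channels=256, num_convs=4):
    layers = []
    for _ in range(num_convs):  # W x H x 256, repeated four times
        layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
    # Final layer: K*A outputs for classification, 4*A for box regression.
    layers.append(nn.Conv2d(channels, outputs_per_anchor * num_anchors, 3, padding=1))
    return nn.Sequential(*layers)

num_classes, num_anchors = 5, 9                       # assumed K and A
class_subnet = make_subnet(num_classes, num_anchors)  # object-presence probabilities
box_subnet = make_subnet(4, num_anchors)              # anchor offset regression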
2.2.4 Focal Loss
Focal Loss is an improvement on cross-entropy loss
that reduces the relative loss for well-classified exam-
ples and focuses on challenging, misclassified exam-
ples (Lin et al., 2017b). Equations to calculate the
cross-entropy losses were adopted from (Lin et al.,
2017b) where:
FL(p_t) = −α_t (1 − p_t)^γ log(p_t)   (1)

where α_t is the weighting factor and (1 − p_t)^γ is the modulating factor applied to the cross-entropy loss, with tunable focusing parameter γ ≥ 0.
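A standalone focal-loss function matching Eq. (1) is sketched below; the α = 0.25 and γ = 2 defaults follow common practice and are assumptions rather than the paper's reported settings.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss of Eq. (1); targets are 0/1 labels, logits are raw scores."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()        # summed over anchors

# Example: eight anchor scores against binary labels.
loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())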
The proposed RetinaNet with ResNeXt backbone network is termed Traffic-RetinaNet (TRN) from this section onwards. The output obtained from this algorithm (a bounding box that indicates the location of the object, a confidence score, and the object class) is used as input for both estimation and tracking.
2.3 Distance Estimation
Figure 3: Distance estimation in 2D image plane.
After detection, a pedestrian object filter is applied, and a similar-triangle-based distance estimation algorithm is used to estimate the distance of road objects more accurately. The center point of the bottom line of the bounding box is taken as the input for this stage.
We performed an experiment in which a dash-cam was mounted near the car's rear-view mirror, looking forward at a height H above the road surface with a vertical angle of view α; the camera was tilted at an angle θ_c in the camera coordinate frame (X_c, Y_c, Z_c). Suppose the detected object in the road scene is located at an unknown position (X_w, Y_w, Z_w). θ_v is the angle of the ray projected from the camera to the intersection of the detected object's plane with the road surface plane, as shown in Figure 3 (top). The distance D between the vehicle and the object, equal to d2 − d1, can be calculated using the following equation (Rezaei et al., 2015):

D = H tan(θ_v) − H tan(γ) = H [tan(θ_c + β) − tan(θ_c − α/2)]   (2)

To compute D we need β, as both θ_c and α are known.
On the other hand, we have (Rezaei et al., 2015):

tan(β) = (h_i/2 − d_p) / f   (3)

where h_i is the height of the captured image plane (in pixels), d_p is the distance from the bottom side of the detected vehicle to the bottom of the image plane (in pixels), and f is the focal length of the camera, given by (Rezaei et al., 2015):

f = h_i / (2 tan(α/2))   (4)

Substituting all the parameters back to evaluate D, we obtain the following equation (Rezaei et al., 2015):

D = H [tan(θ_c + tan⁻¹((h_i/2 − d_p) / (h_i / (2 tan(α/2))))) − tan(θ_c − α/2)]   (5)
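Equation (5) can be implemented directly; the sketch below follows the symbols in the text (H, θ_c, α, h_i, d_p), and the example values reuse the camera setup reported in Section 3 with a hypothetical pixel offset d_p.

import math

def estimate_distance(H, theta_c_deg, alpha_deg, h_i, d_p):
    """Distance D (same units as H) from camera height H, tilt theta_c, vertical
    angle of view alpha, image height h_i (pixels), and the pixel offset d_p of
    the object's bottom edge from the bottom of the image."""
    theta_c = math.radians(theta_c_deg)
    alpha = math.radians(alpha_deg)
    f = h_i / (2.0 * math.tan(alpha / 2.0))           # Eq. (4): focal length in pixels
    beta = math.atan((h_i / 2.0 - d_p) / f)           # Eq. (3)
    return H * (math.tan(theta_c + beta) - math.tan(theta_c - alpha / 2.0))  # Eq. (5)

# Example with the Section 3 setup (H = 1.55 m, theta_c = 88.5 deg, 70 deg vertical
# field of view, 1080-pixel frame height); d_p is a hypothetical value.
d = estimate_distance(H=1.55, theta_c_deg=88.5, alpha_deg=70.0, h_i=1080, d_p=640.0)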
2.4 Improved Simple Online and
Realtime Tracking (SORT)
Simple Online and Realtime Tracking (SORT) is a method for online, real-time tracking of multiple objects in a simple, efficient manner. The SORT algorithm combines a Kalman filter (Welch et al., 1995) with the Hungarian method (Kuhn, 1955) to handle the motion prediction and data association components, respectively (Bewley et al., 2016).
Following both detection and estimation, tracking
by detection using visuals only is introduced. The
state of each detected object is modeled as:

x = [u, v, s, r, u̇, v̇, ṡ]ᵀ   (6)

where u and v represent the horizontal and vertical pixel location of the target's centre, and s and r correspond to the scale (area) and the aspect ratio of the target's bounding box, respectively (Bewley et al., 2016).
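For reference, the conversion between a detection box and the [u, v, s, r] observation used by the Kalman filter in Eq. (6) can be sketched as follows, following the conventions of the original SORT implementation rather than the authors' exact code.

import numpy as np

def bbox_to_observation(bbox):
    """[x1, y1, x2, y2] detection box -> observation z = [u, v, s, r]."""
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    u, v = bbox[0] + w / 2.0, bbox[1] + h / 2.0   # centre of the target
    return np.array([u, v, w * h, w / float(h)])  # scale (area) and aspect ratio

def observation_to_bbox(z):
    """Inverse mapping used when reporting the tracker's predicted box."""
    u, v, s, r = z[:4]
    w = np.sqrt(s * r)
    h = s / w
    return np.array([u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0])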
The overlapping of objects in traffic scenes can result in some of those objects being missed by the tracker, and the original SORT algorithm fails to track them. To solve this issue, a simple yet powerful buffer module is introduced after the unmatched tracklets, as shown in Figure 4; a sketch of this buffering logic follows the figure.
Figure 4: Framework of improved SORT.
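A hedged sketch of the buffer-module idea is given below: rather than deleting a tracker as soon as it is unmatched (for example during occlusion), it is propagated by the Kalman filter and kept for up to k iterations before removal. The Track structure, the make_kf helper, and the value of k are assumptions for illustration; they are not the authors' published implementation.

from dataclasses import dataclass

@dataclass
class Track:
    kf: object       # Kalman box tracker for this target
    misses: int = 0  # consecutive frames without an associated detection

K_BUFFER = 5  # assumed number of frames an unmatched track is kept alive

def update_tracks(tracks, matches, unmatched_dets, unmatched_trks, detections, make_kf):
    """matches: (det_idx, trk_idx) pairs from the Hungarian association step."""
    for det_idx, trk_idx in matches:      # matched: correct the Kalman state
        tracks[trk_idx].kf.update(detections[det_idx])
        tracks[trk_idx].misses = 0
    for trk_idx in unmatched_trks:        # unmatched: buffer instead of deleting
        tracks[trk_idx].misses += 1
    for det_idx in unmatched_dets:        # unmatched detections: spawn new tracks
        tracks.append(Track(kf=make_kf(detections[det_idx])))
    # Drop only tracks that stayed unmatched for more than k consecutive iterations.
    return [t for t in tracks if t.misses <= K_BUFFER]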
2.5 Number and Speed Estimation
The number of objects in key classes and their approach speed are estimated based on the average difference in location over 5 consecutive frames. The flow was estimated using the detection results (bounding box and class), where only detected objects of the cyclist and pedestrian classes were considered. Speed estimation was carried out using the tracking results: object speed was estimated by computing the average difference between consecutive frames, as shown in the following equation (Kumar and Kushwaha, 2016):
S = α · (1/(m n)) Σ_n d f   (7)

where d is the difference in distance between consecutive frames (in meters), f is the frame rate (in frames/second), n is the number of tracked objects (pedestrians/vehicles) per frame, m is the number of frame pairs, and α is a parameter introduced to convert the units to meters/second.
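One possible reading of Eq. (7) is sketched below: per-object displacements between consecutive frame pairs are scaled by the frame rate and averaged over m frame pairs and n tracked objects. The function signature and the example numbers are hypothetical.

def estimate_speed(displacements, frame_rate, alpha=1.0):
    """displacements: one list of per-object displacements (metres) per frame pair."""
    m = len(displacements)                        # number of frame pairs
    n = max(len(pair) for pair in displacements)  # tracked objects per frame
    total = sum(d * frame_rate for pair in displacements for d in pair)
    return alpha * total / (m * n)                # metres per second

# Two frame pairs, two tracked pedestrians, 30 fps (hypothetical numbers).
v = estimate_speed([[0.05, 0.06], [0.04, 0.05]], frame_rate=30)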
3 DATASET
In this section, we highlight the different datasets used for training and testing purposes. TRN was deployed on Ubuntu 16.04 with a PyTorch 1.2 environment. The backbone net was initialized according to ResNeXt (Hoang et al., 2019) pre-trained on Im-
ageNet (Russakovsky et al., 2015b). The remaining conv layers, except the class/bbox subnet, are initialized with bias b = 0 and Gaussian weights with standard deviation σ = 0.01. For the class/bbox subnet, the bias is initialized as b = −log((1 − τ)/τ), where τ = 0.01. We
trained the model with synchronized Stochastic Gradient Descent (SGD) on a single GTX 2080Ti GPU with a total of 2 images per minibatch. An initial learning rate of 0.0025, weight decay of 0.0001, and momentum of 0.9 were used. The dataset was split as follows: 80,000 images for training, 40,000 for validation, and 20,000 for testing.
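The reported training configuration can be sketched as follows; the model and cls_head objects, and the exact layer that receives the prior bias, are assumptions for illustration.

import math
import torch
import torch.nn as nn

def init_detector(model, cls_head, tau=0.01):
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, std=0.01)   # Gaussian weights, sigma = 0.01
            if m.bias is not None:
                nn.init.constant_(m.bias, 0.0)    # bias b = 0
    # Final classification bias set to the focal-loss prior b = -log((1 - tau)/tau).
    nn.init.constant_(cls_head.bias, -math.log((1.0 - tau) / tau))

def make_optimizer(model):
    return torch.optim.SGD(model.parameters(), lr=0.0025,
                           momentum=0.9, weight_decay=0.0001)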
For distance estimation, we established a Traffic Distance Dataset (TDD), collected using a dash camera with a 30 fps recording rate, a frame dimension of 1920 × 1080 pixels, and a 70° vertical field of view; the camera was mounted at a height H = 155 cm above the road surface and tilted at an angle θ_c = 88.5°. These parameters were used to estimate distances in Matlab and to compare the obtained values to the ground-truth values. Additionally, for testing purposes, different datasets (COCO data and the collected data) are used to test pedestrian and cyclist flow and volume.
We introduced an unsupervised learning method for tracking, using the MOT (Multi-Object Tracking) dataset. We chose this particular dataset because it contains the classes we are interested in (pedestrians, vehicles, occluded targets, and other categories). The improved SORT algorithm is then tested on the real-time data captured. All the algorithms introduced and developed are analyzed, evaluated, and discussed in the next section.
4 ANALYSIS & RESULTS
4.1 Evaluation of TRN
4.1.1 Objective Evaluation of Detection using
TRN
The RetinaNet detector can work with different backbone encoders such as ResNet (He et al., 2016), ResNeXt (Hoang et al., 2019), and DenseNet (Lin et al., 2017b). Using ResNeXt as the backbone, we managed to increase the detection's average precision compared to the original RetinaNet, which includes a ResNet backbone (A.Alshkeili et al., 2019). Table 1 shows the results obtained by applying the proposed algorithm to the COCO2017 dataset.
In this application, 'small' stands for objects with a pixel area of less than 32² pixels, 'medium' for objects with an area between 32² and 96² pixels, and 'large' for objects with an area greater than 96² pixels. The object detection results showed that both the AP and AR values improved for TRN except for medium objects. AP and AR were evaluated under IoU (Intersection over Union of bounding boxes) thresholds of 0.5:0.05:0.95, with AP at MaxDets = 100 (up to 100 detections per image) and AR at MaxDets = 1 (1 detection per image).
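The AP/AR protocol above corresponds to the standard COCO evaluation, which can be reproduced with pycocotools as sketched below; the file paths are placeholders.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")  # ground-truth annotations
coco_dt = coco_gt.loadRes("trn_detections.json")       # detection results to evaluate
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # AP/AR at IoU 0.5:0.05:0.95, incl. small/medium/large breakdown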
Table 1 shows the precision and recall for each category of objects detected. TRN achieves good results on pedestrians and buses, reasonable results for cars and motorcycles, but limited results for cyclists. The main reason for this is the imbalance of training data in the COCO2017 dataset, as it is not designed for this specific purpose of object detection from moving vehicles. However, due to the lack of an appropriately labeled dataset for evaluating these algorithms, it was prudent to use a well-known stock dataset such as COCO2017.
Table 1: Average precision and average recall of the different categories.

Detected objects   Average Precision (All / Small / Medium / Large)
Pedestrian         0.513 / 0.335 / 0.593 / 0.689
Cyclist            0.263 / 0.155 / 0.322 / 0.506
Car                0.393 / 0.301 / 0.549 / 0.563
Motorcycle         0.384 / 0.215 / 0.34 / 0.55
Bus                0.595 / 0.17 / 0.4 / 0.763

Detected objects   Average Recall (All / Small / Medium / Large)
Pedestrian         0.185 / 0.475 / 0.685 / 0.776
Cyclist            0.223 / 0.268 / 0.5 / 0.0718
Car                0.172 / 0.459 / 0.687 / 0.756
Motorcycle         0.251 / 0.338 / 0.509 / 0.675
Bus                0.481 / 0.331 / 0.649 / 0.843
4.1.2 Subjective Evaluation of Detection using
TRN
Figure 5 shows a chosen set of scenes detecting all object types, for illustrative purposes. As shown in the images, the algorithm successfully detected different traffic object classes in the same scene. TRN successfully classifies the different objects from different angles and distances, which is crucial for analyzing dash-cam footage from a moving vehicle, where the angles and distances are uncontrollable. The objects were detected both in shadows and in illuminated areas of the same scene. Additionally, occlusion effects were minimized, as a large number of bounding boxes (bbox) were identified in images containing multiple objects.
Figure 5: Detection of multiple classes of objects in MOT16 dataset.

4.1.3 Computational Cost of Traffic RetinaNet

It is essential to estimate the computational cost of any traffic object-detection algorithm to establish whether real-time detection is plausible. Floating-point operations per second (FLOPs) measure how fast the microprocessor operates; performance is expressed in multiplier-accumulator (MAC) units. Table 2 shows the performance parameters and FLOPs of the framework.
Table 2: Computational performance of the framework.

FLOPs: 286.83 GigaMAC   Parameters: 54.86 Million
                       Training        Testing
Detection using TRN    3 days 18 h     9 Frames/Sec
Tracking               -               263 Frames/Sec
The detection is at the rate of 9 fps, which is
slightly slower than a video rate of 30 fps. However,
this rate can be considered real-time for vehicles and
pedestrians in urban signalized roads from a relative
movement point of view. The tracking rate is much
higher than the video rate and is compatible with AV
advanced collision avoidance alarm requirements.
4.2 Evaluation of Tracking
4.2.1 Objective Evaluation of Tracking
In traffic scenes, the object’s scale is changing, so
scale-insensitivity is crucial for the tracker. We chose
SORT as it tracks objects only depending on the IoU
region of objects, robust to object size. Table 3 shows
the results of the improved SORT described earlier.
The specific evaluation indicators used are (Welch
et al., 1995):
IDF1: The ratio of correctly identified detections over the average number of ground-truth and computed detections.
IDP: Identification precision.
IDR: Identification recall.
GT: Total number of ground-truth tracks.
MT: Number of objects tracked for at least 80 percent of their lifespan.
PT: Number of objects tracked for between 20 and 80 percent of their lifespan.
ML: Number of objects tracked for less than 20 percent of their lifespan.
MOTA: Multiple object tracking accuracy.
MOTP: Multiple object tracking precision.
Table 3: Comparing results of the original SORT to the improved SORT.

           IDF1    IDP     IDR     GT    MT    PT    ML    MOTA   MOTP
SORT       44.1%   58.2%   35.5%   500   112   224   164   32.9   73.7
Improved   47.1%   56.9%   40.2%   500   118   236   146   39.8   72.8
Additionally, statistical estimates for the videos used are presented in Table 4.
Table 4: Statistical Estimation Results of Videos.

Scene                (a)          (b)         (c)
Length (frames)      750          525         837
FPS                  25           30          14
Av. number (Ped.)    14.7         14          8.7
Av. number (Veh.)    9.8          0           0.7
Av. speed (Ped.)     3.48 m/s     1.7 m/s     1.87 m/s
Av. speed (Veh.)     12.5 km/h    0 km/h      3 km/h
4.2.2 Subjective Evaluation of Tracking
Figure 6: Tracking results in multiple scenes over three con-
secutive frames.
Figure 6 shows the results of the tracking algorithm; the distance and robustness of tracking vary between (a) and (c), illustrating the accuracy of the model.
Figure 7 illustrates the tracking of pedestrians for illustrative purposes; however, the framework was also capable of tracking vehicles. The variables in the figure, Q_p, Q_v, V_p and V_v (Q_p is the number of pedestrians, Q_v is the number of vehicles, V_p is the estimated pedestrian speed, and V_v is the relative estimated vehicle speed), show the number of objects tracked and their average approach speed.
Figure 7: Flow and Speed estimation.
4.3 Distance, Speed and Flow
Estimation
4.3.1 Objective Evaluation
Figure 8: Percentage error of distance estimation.
The distance estimation error histogram is shown in Figure 8, where the distance-to-vehicle errors are defined by comparison with the ground truth, which is represented by the red line. Objects are at a distance of 2 to 25 m from the dash camera. We considered a confidence interval of ±20 cm for the ground-truth measurement. The error level lies within ±20%, which is considered a reasonable percentage in our experiment.
4.3.2 Flow, Speed and Distance Evaluation
Figure 9 shows the flow of pedestrians and vehicles per frame; the curves' tendency illustrates the object speed. Note that there is a gap around frame 480 in the blue curve, as no vehicles are present in that period.

Figure 9: Number and Speed of Pedestrian and Vehicle in Scene (a).

Figure 10 presents an evaluation of the entire algorithm applied to our data: (a) illustrates the detection and distance estimation results, and (b) shows both the estimated vehicle and pedestrian flow.

Figure 10: Detection, Tracking and estimation results.
5 DISCUSSION & CONCLUSION
This paper proposed a driver assistance framework
based on visual information, including object detec-
tion, tracking, and traffic-related information estima-
tion. Visual information is easier to obtain and to apply to existing vehicles on a large scale than other sensor information. Additionally, visual information is more in line with human perception of traffic. Based on the
above considerations, this paper utilized ResNeXt as the backbone network of the original RetinaNet, forming Traffic RetinaNet, thus enhancing object detection performance on five different traffic targets. It also introduces the improved SORT algorithm with a buffer module to enhance the robustness of multi-object tracking. Finally, the object's category, trajectory, and location are used to infer the traffic flow, relative speed, and distance. The framework performed successfully under different lighting conditions, changing scenes due to the moving frame of reference, varying angles and relative distances, and crowded environments (occlusion). Comprehensive experiments and detailed analysis via visualization demonstrate the effectiveness of the proposed driver assistance framework.
REFERENCES
Alshkeili, A., Ghosh, B., and Qiu, W. (2019). Cyclist and
pedestrian in autonomous vehicles view. In Irish
Transport Research Network (ITRN).
Aneesh, A. N., Shine, L., Pradeep, R., Moore, V. S., and
Lopes, J. (2019). Real-time traffic light detection and
recognition based on deep retinanet for self driving
cars. In 2019 2nd International Conference on Intel-
ligent Computing, Instrumentation and Control Tech-
nologies (ICICICT). ICICICT.
Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B.
(2016). Simple online and realtime tracking. In 2016
IEEE International Conference on Image Processing
(ICIP), pages 3464–3468.
Bunel, R., Davoine, F., and Xu, P. (2016). Detec-
tion of pedestrians at far distance. In 2016 IEEE In-
ternational Conference on Robotics and Automation
(ICRA), pages 2326–2331.
Felix Albu, Larry Murray, P. S. I. R. (2013). Object detection from image profiles within sequences of acquired digital images.
Gavrila, D. M. (2000). Pedestrian detection from a moving
vehicle. In European conference on computer vision,
pages 37–49. Springer.
Gunawan, A., Lau, H. C., and Lindawati (2011). Fine-
tuning algorithm parameters using the design of ex-
periments approach. In Coello, C. A. C., editor,
Learning and Intelligent Optimization, pages 278–
292, Berlin, Heidelberg. Springer Berlin Heidelberg.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
He, X. and Zeng, D. (2017). Real-time pedestrian warn-
ing system on highway using deep learning meth-
ods. In 2017 International Symposium on Intelligent
Signal Processing and Communication Systems (IS-
PACS), pages 701–706. IEEE.
Hoang, T. M., Nguyen, P. H., Truong, N. Q., Lee, Y. W., and
Park, K. R. (2019). Deep retinanet-based detection
and classification of road markings by visible light
camera sensors. Sensors, 19(2):281.
Kuhn, H. W. (1955). The hungarian method for the as-
signment problem. Naval research logistics quarterly,
2(1-2):83–97.
Kumar, T. and Kushwaha, D. S. (2016). An efficient ap-
proach for detection and speed estimation of moving
vehicles. Procedia Computer Science, 89(2016):726–
731.
Lee, P., Chiu, T., Lin, Y., and Hung, Y. (2009). Real-
time pedestrian and vehicle detection in video using
3d cues. In 2009 IEEE International Conference on
Multimedia and Expo, pages 614–617.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017a). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988.
Papadoulis, A., Quddus, M., and Imprialou, M. (2019).
Evaluating the safety impact of connected and au-
tonomous vehicles on motorways. Accident Analysis
& Prevention, 124:12–22.
Pei, D., Jing, M., Liu, H., Sun, F., and Jiang, L. (2020).
A fast retinanet fusion framework for multi-spectral
pedestrian detection. Infrared Physics & Technology,
105:103178.
Rajendran, S. P., Shine, L., Pradeep, R., and Vijayaragha-
van, S. (2019). Fast and accurate traffic sign recogni-
tion for self driving cars using retinanet based detec-
tor. In 2019 International Conference on Communi-
cation and Electronics Systems (ICCES), pages 784–
790.
Rezaei, M., Terauchi, M., and Klette, R. (2015). Robust
vehicle detection and distance estimation under chal-
lenging lighting conditions. IEEE Transactions on In-
telligent Transportation Systems, 16(5):2723–2743.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., and et al. (2015a). Imagenet large scale vi-
sual recognition challenge. International Journal of
Computer Vision, 115(3):211–252.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015b). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Welch, G., Bishop, G., et al. (1995). An introduction to the Kalman filter.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995.
Zhao, X., Li, W., Zhang, Y., Gulliver, T. A., Chang, S., and
Feng, Z. (2016). A faster rcnn-based pedestrian detec-
tion system. In 2016 IEEE 84th Vehicular Technology
Conference (VTC-Fall), pages 1–5.