Crash Prediction using Ensemble Methods

Mohammed Ameksa

, Hajar Mousannif

, Hassan Al Moatassime

and Zouhair Elamrani Abou

Elassad

LISI Laboratory, Computer Science Department, FSSM, Cadi Ayyad University, Marrakech, Morocco

OSER Research Team, Computer Science Department, FSTG, Cadi Ayyad University, Marrakech, Morocco

Keywords: Crash Prediction, Random Forest, Gradient Boosting Machine, Extreme Gradient Boosting, Road Safety

Modelling, Machine Learning

Abstract: The prediction of traffic accidents is a major concern worldwide due to its negative impact on all sectors. The

human and financial losses caused by road accidents have become increasingly important. This study aims to

investigate the prediction of an accident using several ensemble-based methods, including the GBM gradient

boosting machine, the XGB extreme gradient boosting, and RF random forest. To achieve this, we used driver’

inputs data extracted from the outputs of a driving simulator located at Cadi Ayyad University UCA. And the

evaluation of the developed models was carried out for both configurations, before and after tuning of

hyperparameters. Results show that the RF outperforms GBM and XGB for the default parameters with an

accuracy of 93%. However, after hyperparameters optimization, it had been noticed that even if the algorithms

are not the same, their performances are almost equal. The highest performance was performed after tuning

of hyperparameters.

1 INTRODUCTION

Road traffic crashes and related fatalities are a rapidly

increasing trend worldwide due to rapid growth in

demographics. According to (Lagarde E, 2020),

every year, more than 1 million people die in road

traffic crashes worldwide, and more than 50 million

are injured. Indeed, the human and financial losses

caused by road accidents are becoming more and

more important; consequently, a good prediction of

the occurrence of an accident before it happens is of

great benefit to humanity.

A significant effort has been made to reduce the

frequency and the occurrence of accidents on the

roadway. Based on the literature search, many studies

have been investigated factors that contribute to the

occurrence of accidents and analyzed crash

predictions (Silva et al., 2020). as well as the

potential of data mining techniques to predict the

occurrence of a crash has been evaluated in various

scientific studies using several algorithms. Indeed,

several studies found that Machine learning models

are more robust and powerful than statistical models

in predicting future events and have been used

successfully in many transportation systems.

(Abdelwahab and Abdel-Aty, 2001) built an

artificial neural network model for the prediction of

driver injury severity in traffic crashes using 1997

traffic accident data from the Central Florida region,

the MLP neural network was found to achieve the best

performance for training (65.6%) and testing (60.4%).

Other machine learning algorithms including, K

nearest neighbor (KNN), random forest (RF), support

vector machine (SVM), decision trees (DT), Gaussian

naive Bayes (Gaussian NB), and adaptive boost

(AdaBoost), are used by (Osman et al., 2019) to

predict near-crash based on observed in vehicle

kinematics data. This mentioned article proves that

the AdaBoost model achieved an excellent and

consistent performance while predicting near-crashes

(by accuracy of 100%) or everyday driving (by

accuracy of 98%). (Zouhair et al., 2020) used

Artificial Neural Networks (ANN) and Decision

Trees (DT) machine learning models to analyze crash

events and to outline the effect of both light and heavy

rain on road safety. Using driver inputs data as throttle

pedal position, brake pedal position, and wheel angle,

the author found that the ANN model offered a higher

precision with a value of 92.14%. In comparison, the

DT model showed an accuracy of 89.90%. Among the

limitations highlighted in the latter study was to take

advantage of the decision tree benefits by using the

ensemble methods.

Ameksa, M., Mousannif, H., Al Moatassime, H. and Elamrani Abou Elassad, Z.

Crash Prediction using Ensemble Methods.

DOI: 10.5220/0010731200003101

In Proceedings of the 2nd International Conference on Big Data, Modelling and Machine Learning (BML 2021), pages 211-215

ISBN: 978-989-758-559-3

211

From the literature mentioned above and other

studies, it is undoubtedly clear that the developed

models’ performances depend on the variety of data

mining algorithms and the dataset used and differ

from one study to another.

On the other hand, road accidents are a very

complex issue involving different dimensions of

driving, such as the driver, the vehicle, and

environmental dimension. (Aljanahi et al., 1999), In

an attempt to understand this human-machine

system, our team has performed an interpretation

framework integrating multiple dimensions

influencing the driver. Although driver inputs are

often thought to be a significant cause. (Treat et al.,

1979) indicated that approximately 90% of road

accidents were due to driving errors, demonstrating

the importance of the driver's inputs in the accident

occurrence process. Indeed, our previous study

(Ameksa et al., 2020) found that the vehicle data was

the primary focus of the researchers in terms of data

collection, including data provided by drivers such as

acceleration, braking, steering wheel.

Therefore, the purpose of this study is to

investigate and provide some insights into the power

of the supervised learning model, especially the

ensemble machine learning (ML) techniques, to

predict crashes using driver response features. It

should be noted that those features were used as

inputs so the model can predict the occurrence of a

collision based on the initial information provided

from a driving simulator database.

The rest of the paper is structured as follows:

Section 2 present the dataset and the methodology

used in model development. Section 3 highlights

significant findings and discussion. The final section

outlines the conclusions and presents the limitations

and recommendations for future studies.

2 DATA AND METHOD

2.1 Data Description

The fact that it is very dangerous to conduct trials in

a real road environment and knowing that a driving

experience on a simulator is an excellent tool for

collecting data in a safe environment. The database

used in this work is extracted from the outputs of a

driving simulator located at Cadi Ayyad University

UCA. Indeed, a group of volunteers participated in

an experimental study in which their driving data was

continuously recorded while they drove and followed

traffic rules as they usually do in real-life situations.

is worth noting that participants viewed the

Select random combinations to train the model

simulation on a 27-inch LCD screen with a resolution

of 1920x1080 pixels and heard through a surround

speaker system. The computer was equipped with a

Logitech® G27 Racing system.

For this study, we are interested in the driver’s

inputs data such as braking, steering wheel, and speed

being independent variables and the crash state as the

target variable. And regarding the size of the

database, it contains about 10500 rows.

2.2 Modelling

In this study, we used ensemble-based methods,

including the GBM gradient boosting machine

(Friedman, 2001), the XGB extreme gradient

boosting (Chen and Guestrin, 2016), and RF random

forest (BREIMAN, 2001) to address the aim of this

paper. The following should be highlighted, the three

algorithms used in this study are principally based on

two techniques, bagging, and boosting methods. The

RF algorithm is based on the first technique

(bagging), while the GBM and XGB algorithms are

built based on the second one (boosting). Both

bagging and boosting are ensemble methods, where

a collection of individual weak learners are combined

to produce a strong learner who performs better than

a single one. The difference between them is in the

training phase. For the bagging technique, each

model is built in parallel and independently, while

boosting improves the performance of the model by

adding new models that work well where previous

models have failed; it creates new learners

sequentially. The final decision is made for both

methods by averaging the N learners. However, it is

a Weighted Average for Boosting; and an Equal

Weighted Average for Bagging, with higher weights

for the better performing learners on the training data.

(Bari et al., 2020).

Concerning the methodology adopted in this

work includes the two significant steps:

1) The development of ensemble-based algorithms

was carried out using the default

hyperparameters. This first step was achieved

through the process described below for each

algorithm (example of RF algorithm, see the

Appendix ):

 Import the libraries and the data file.

 Defined Independent variables and the target.

 Split the dataset into 70% for the training set

and 30% for the test.

 Creating an instance of the model.

 Evaluating the model using the test data.

2) The tuning of hyperparameters was performed

using the random search

(see Appendix) and grid

BML 2021 - INTERNATIONAL CONFERENCE ON BIG DATA, MODELLING AND MACHINE LEARNING (BML’21)

212

(see Appendix) technique for each

algorithm, and by setting up a grid of

hyperparameter values for the most powerful (the

Appendix illustrated the most powerful

hyperparameters for the RF algorithm). This

process helped us to obtain the optimum

hyperparameters, and subsequently, the best

model for each algorithm. This is the most

challenging part of the machine learning model-

building process.

To evaluate the performance of the crash

prediction model based on the test dataset, the

precision, recall, and F1 score (scikit-learn) are used

to measure the accuracy of the algorithms.

3 RESULTS AND DISCUSSION

Analysing the results of the three following

algorithms RF, GBM, and XGB (Figure 1) indicates

that the random forest provides better performance

than the other ensemble methods GBM and XGB for

the default configuration. To be more precise, 93%

was the accuracy of RF and 82% for both GBM and

XGB. Then, after hyper-parameters tuning, the

accuracy score of the three algorithms was found to

be similar and equals 95%.

The random search optimization method is used

to randomly select a combination of hyperparameters

among all ranges given at the beginning. In our case,

XGB was found to be the best model to predict

crashes using this method with an accuracy of 93%.

Figure 1: The accuracy of the three developed models

Since we are forecasting the occurrence of an

accident or not, our problem is a binary classification

case. And comparing the performance indicator has a

great benefit on the selection of algorithms.

Therefore, after the hyperparameters tuning process,

Train every combination of hyperparameter values on a

model, and select only the one with the highest score.

Figure 2: The performance indicators of RF, GBM and

XGB after hyperparameters tuning

the other performance benchmarks presented in

section 2.2 have been calculated, and the results are

illustrated in Figure 2.

The precision score of the developed models had

a slight difference between them in terms of accident

class (i.e., 0.90, 0.91, and 0.92 for XGB, GBM, and

RF, respectively). And for the no accident class, the

algorithms have the same precision. For the recall

score, we noticed that for both developed models

based on GBM and XGB had a similar score of 0.97

for the no-crash class, 0.87 and 0.86 were recorded

for GBM and XGB respectively regarding the crash

class, while the RF model proved to be the best with

a score of 0.98 for the no crash class and 0.87 for the

class with crashes (see Figure 3).

Figure 3: Entire results of 3 algorithms for both

configurations

75% 80% 85% 90% 95%

Default

RandomSearch

GridSearch

xGboost Gradienboostingmachine Randomforest

0,96

0,98

0,97

0,96

0,97 0,97

0,96

0,97

0,96

0,92

0,87

0,90

0,91

0,87

0,89

0,90

0,86

0,88

0,80

0,82

0,84

0,86

0,88

0,90

0,92

0,94

0,96

0,98

1,00

precision

recall

f1‐score

precision

recall

f1‐score

precision

recall

f1‐score

RF GBM xGboost

NoCrash Crash

Crash Prediction using Ensemble Methods

213

4 CONCLUSIONS

Crash prediction offers a proactive solution to

increase road safety and save lives. For this reason, it

has been a long-standing interest of researchers,

industry, and policymakers. However, crash

prediction remains complex and requires high

resolution and large data sets to develop powerful

models that effectively predict accidents. According

to the scientific literature, different approaches have

been used to address this topic. Although, as far as

the authors know, few works have been focused on

investigating crash prediction based on driver inputs

and ensemble methods.

In the present work, the performance of ensemble

methods machine learning algorithms has been

assessed in a classification case. in fact, the developed

models in this research consist of predicting the

occurrence of a crash from the simulator-based

dataset of some driver inputs as features.

Results show that the Random Forest outperforms

the gradient boosting machine and extreme gradient

boosting for the default parameters. But after the

optimization of hyperparameters, it was noticed that

even if the algorithms are not the same, their

performances are almost equal. Therefore, tuning

hyperparameters impacts the machine learning

developed model and offers the highest improvement

and benefit in accident prediction accuracy.

One of the limitations of this paper is that the

comparison is made only between ensemble

methods-based algorithms. Thus, future work should

include more algorithms, as well as integrating other

variables such as the vehicle kinematics and the

surrounding environment. In addition, the necessity

for having a large, and comprehensive training data

sets presents a clear challenge that should be

mitigated to ensure that machine learning

applications remain accurate.

ACKNOWLEDGEMENTS

The Moroccan Ministry of Equipment, Transport,

and Logistics; and the Moroccan National Centre for

Scientific and Technical Research (CNRST) are

gratefully acknowledged for their valuable support of

this research.

REFERENCES

Abdelwahab HT, Abdel-Aty MA. Development of

artificial neural network models to predict driver injury

severity in traffic accidents at signalized intersections.

Transp Res Rec 2001:6–13.

https://doi.org/10.3141/1746-02.

Aljanahi AAM, Rhodes AH, Metcalfe A V. Speed, speed

limits and road traffic accidents under free flow

conditions 1999;31:161–8.

Ameksa M, Mousannif H, Al Moatassime H, Elamrani

Abou Elassad Z. Toward Flexible Data Collection of

Driving Behaviour. ISPRS - Int Arch Photogramm

Remote Sens Spat Inf Sci 2020; XLIV-4/W3-2020:33–

43. https://doi.org/10.5194/isprs-archives-xliv-4-w3-

2020-33-2020.

Bari D, Ameksa M, Ouagabi A. A comparison of

datamining tools for geo-spatial estimation of visibility

from AROME-Morocco model outputs in regression

framework. Proc - 2020 IEEE Int Conf Moroccan

Geomatics, MORGEO 2020 2020.

https://doi.org/10.1109/Morgeo49228.2020.9121909.

BREIMAN L. Random forests. Random For 2001;45:5–

32. https://doi.org/10.1201/9780429469275-8.

Chen T, Guestrin C. XGBoost: A scalable tree boosting

system. Proc. ACM SIGKDD Int. Conf. Knowl.

Discov. Data Min., vol. 13-17- Augu, 2016, p. 785–94.

https://doi.org/10.1145/2939672.2939785.

Friedman JH. Greedy function approximation: A gradient

boosting machine. Ann Stat 2001;29:1189–232.

https://doi.org/10.1214/aos/1013203451.

Lagarde E. Road traffic injuries. Encycl Environ Heal

2020:572–80. https://doi.org/10.1016/B978-0-444-

63951-6.00623-9.

Osman OA, Hajij M, Bakhit PR, Ishak S. Prediction of

Near-Crashes from Observed Vehicle Kinematics

using Machine Learning. Transp Res Rec 2019.

https://doi.org/10.1177/0361198119862629.

scikit-learn. classification report n.d. https://scikit-

learn.org/stable/modules/generated/sklearn.metrics.cla

ssification_report.html.

Silva PB, Andrade M, Ferreira S. Machine learning applied

to road safety modeling: A systematic literature review.

J Traffic Transp Eng (English Ed 2020;7:775–90.

https://doi.org/10.1016/j.jtte.2020.07.004.

Treat JR, Tumba NS, McDonald ST, Shinar D, Hume RD,

Mayer RE, et al. Tri-Level Study of the Causes of

Traffic Accidents: An overview of final results. Proc

Am Assoc Automot Med Annu Conf 1979;21:391–

403.

Zouhair EAE, Mousannif H, Al Moatassime H. Towards

analyzing crash events for novice drivers under

reduced-visibility settings: A simulator study. ACM

Int. Conf. Proceeding Ser., 2020.

https://doi.org/10.1145/3386723.3387849.

Abdelwahab HT, Abdel-Aty MA. Development of

artificial neural network models to predict driver injury

severity in traffic accidents at signalized intersections.

Transp Res Rec 2001:6–13.

https://doi.org/10.3141/1746-02.

Aljanahi AAM, Rhodes AH, Metcalfe A V. Speed, speed

limits and road traffic accidents under free flow

conditions 1999;31:161–8.

Ameksa M, Mousannif H, Al Moatassime H, Elamrani

Abou Elassad Z. Toward Flexible Data Collection of

BML 2021 - INTERNATIONAL CONFERENCE ON BIG DATA, MODELLING AND MACHINE LEARNING (BML’21)

214

Driving Behaviour. ISPRS - Int Arch Photogramm

Remote Sens Spat Inf Sci 2020;XLIV-4/W3-2020:33–

43. https://doi.org/10.5194/isprs-archives-xliv-4-w3-

2020-33-2020.

Bari D, Ameksa M, Ouagabi A. A comparison of

datamining tools for geo-spatial estimation of visibility

from AROME-Morocco model outputs in regression

framework. Proc - 2020 IEEE Int Conf Moroccan

Geomatics, MORGEO 2020 2020.

https://doi.org/10.1109/Morgeo49228.2020.9121909.

BREIMAN L. Random forests. Random For 2001;45:5–

32. https://doi.org/10.1201/9780429469275-8.

Chen T, Guestrin C. XGBoost: A scalable tree boosting

system. Proc. ACM SIGKDD Int. Conf. Knowl.

Discov. Data Min., vol. 13-17- Augu, 2016, p. 785–94.

https://doi.org/10.1145/2939672.2939785.

Friedman JH. Greedy function approximation: A gradient

boosting machine. Ann Stat 2001;29:1189–232.

https://doi.org/10.1214/aos/1013203451.

Lagarde E. Road traffic injuries. Encycl Environ Heal

2020:572–80. https://doi.org/10.1016/B978-0-444-

63951-6.00623-9.

Osman OA, Hajij M, Bakhit PR, Ishak S. Prediction of

Near-Crashes from Observed Vehicle Kinematics

using Machine Learning. Transp Res Rec 2019.

https://doi.org/10.1177/0361198119862629.

scikit-learn. classification report n.d. https://scikit-

learn.org/stable/modules/generated/sklearn.metrics.cla

ssification_report.html.

Silva PB, Andrade M, Ferreira S. Machine learning applied

to road safety modeling: A systematic literature review.

J Traffic Transp Eng (English Ed 2020;7:775–90.

https://doi.org/10.1016/j.jtte.2020.07.004.

Treat JR, Tumba NS, McDonald ST, Shinar D, Hume RD,

Mayer RE, et al. Tri-Level Study of the Causes of

Traffic Accidents: An overview of final results. Proc

Am Assoc Automot Med Annu Conf 1979;21:391–

403.

Zouhair EAE, Mousannif H, Al Moatassime H. Towards

analyzing crash events for novice drivers under

reduced-visibility settings: A simulator study. ACM

Int. Conf. Proceeding Ser., 2020.

https://doi.org/10.1145/3386723.3387849.

APPENDIX

Crash Prediction using Ensemble Methods

215