Seeing Risk of Accident from In-Vehicle Cameras
Takuya Goto, Fumihiko Sakaue and Jun Sato
Nagoya Institute of Technology, Nagoya 466-8555, Japan
{goto@cv., sakaue@, junsato@}nitech.ac.jp
Keywords:
Traffic Accident Prediction, Accident Risk, Risk Visualization, Instance Segmentation, Lane Detection.
Abstract:
In this paper, we propose a method for visualizing the risk of car accidents in in-vehicle camera images by
using deep learning. Our network predicts the future risk of car accidents and generates a risk map image that
represents the degree of accident risk at each point in the image. For training our network, we need pairs of
in-vehicle images and risk map images, but such datasets do not exist and are very difficult to create. In this
research, we derive a method for computing the degree of the future risk of car accidents at each point in the
image and use it for constructing the training dataset. By using the dataset, our network learns to generate risk
map images from in-vehicle images. The effectiveness of our method is tested by using real car accident images.
1 INTRODUCTION
In recent years, automated driving and driver assis-
tance systems for automobiles have advanced rapidly,
and advanced safety systems that use cameras for ob-
ject recognition and collision avoidance have been de-
veloped actively. Furthermore, methods for predict-
ing traffic accidents have also been proposed in recent
years (Suzuki et al., 2018; Corcoran and Clark, 2019;
Yao et al., 2019; Bao et al., 2020). However, these
methods can only estimate the risk of the entire scene
in the image, and they cannot map the risk for each
pixel. That is, these methods cannot visualize the risk
in the image.
Thus, in this paper, we propose a method for esti-
mating accident risk per pixel and generating risk map
images by using deep learning. Our method takes an
RGB video as input and uses time-series adversarial learning to obtain an image that represents where in the image an accident poses a danger to
one’s own vehicle. By using our network, the acci-
dent risk map shown in Figure 1 (b) is generated from
the RGB image shown in Figure 1 (a).
However, there is no dataset in which camera im-
ages correspond to accident risk maps. In this re-
search, we first propose a method for computing the
degree of the future risk of car accidents at each point
in the image and use it for constructing a training
dataset that consists of pairs of in-vehicle images and
risk map images. We next propose a network that gen-
erates risk map images from in-vehicle images and
train it with the constructed dataset.

Figure 1: Accident risk map (b) estimated from in-vehicle image (a) using the proposed method. (a) input image; (b) risk map image (our result).

In order to generate sequential risk map images from sequential in-vehicle camera images, we develop a network based
on vid2vid (Wang et al., 2018a), a model that extends
cGAN to video sequences. We combine semantic seg-
mentation obtained from PCAN (Ke et al., 2021) and
optical flow obtained from flownet2 (Ilg et al., 2017)
with vid2vid for generating accurate sequential risk
map images from sequential in-vehicle camera im-
ages.
In general, we can say that an object approaching
our vehicle is dangerous, as in the case of car-to-car
accidents and car-to-pedestrian accidents. However,
vehicles approaching our vehicle in other lanes, such
as an oncoming vehicle in another lane, pose little
danger. Therefore, in this research, we define a new
measure of future accident risk and compute it for
each point in the image.
There are many different types of in-vehicle cam-
eras and they have different camera parameters. Thus,
the degree of future accident risk must be computable from images taken by cameras with differ-
ent camera parameters. Therefore, we define a new
measure of future accident risk so that it is invariant
to the camera parameters. Then, we generate risk map
images that correspond to the input images by using the new risk measure, and construct a training dataset using different types of cameras. By training the risk map generator with the dataset constructed in this way, we obtain a risk map generator that works for various camera parameters.

Figure 2: Dataset generation.
2 RELATED WORK
Traffic Accident Anticipation (TAA). In order to
predict traffic accidents, the existing methods esti-
mate the likelihood of accidents occurring at each
time step. DSA (Chan et al., 2016) defines candi-
date accident objects in each frame and uses a spatial
attention mechanism for predicting traffic accidents.
Based on the DSA framework, Zeng et al. (Zeng et al.,
2017) realized the localization of accidents by using a
soft-attention RNN. Suzuki et al. (Suzuki et al., 2018)
proposed a method for early accident prediction us-
ing a quasi-recurrent neural network. Corcoran et al.
(Corcoran and Clark, 2019) estimated the risk of traf-
fic accidents using the features of candidate objects as
spatial streams and optical flows as temporal streams.
The two streams of information were multiplied to
achieve accurate estimation. On the other hand, Yao
et al. (Yao et al., 2019) proposed a method to detect
anomalies on the road by predicting the future loca-
tion of each object on the road. UString-Net (Bao
et al., 2020) also proposes a model that takes the uncertainty into account when estimating whether an accident will occur. These methods can
estimate the future risk of the entire scene from the
image, but cannot map the risk for each pixel. Al-
though (Zeng et al., 2017) realized the localization
of accidents, the extracted accident has nothing to do
with the observer (own vehicle), and the observer’s
degree of danger cannot be visualized. Differing from
these existing methods, we propose a new definition
of danger to observers, and propose a method to com-
pute and map the danger for each pixel. Our method
makes it possible to predict and visualize the danger
to the observer, i.e., the own vehicle.
Video Instance Segmentation (VIS). In our method,
we use instance segmentation to compute the danger
of each object in the image. The generation of an instance segmentation image from a single image consists of two steps, as in Mask-RCNN (He et al., 2017): regions of interest (RoIs) are first detected, and then an object mask is predicted for each RoI to divide the image into instances. Furthermore, for instance segmentation in videos, each instance is associated over time, as in Track-RCNN (Voigtlaender et al., 2019) and PCAN (Ke et al., 2021), so that the same instance can be tracked across frames. In this research, we use
PCAN to perform object recognition in moving im-
ages and compute the danger of each object.
Lane Detection (LD). In our method, we perform
lane detection to compute the accident risk invariant
to camera parameters. The lane detection algorithm
started with the Hough transform (Hassanein et al., 2015), which detects straight lines in a classical, model-based manner, and LaneNet (Wang et al., 2018b) used deep learning to detect individual lanes as instances. ENet-SAD (Hou et al., 2019) greatly improves the accuracy by using self-attention distillation, which allows the model to learn from its own attention maps.
as in Hybrid-Nets (Vu et al., 2022), we can further
improve the accuracy of lane detection by segment-
ing the range of the roadway and multitask learning
with object detection. In this research, lane detection
Seeing Risk of Accident from In-Vehicle Cameras
673
Figure 3: Projection from the image plane onto the road surface
Figure 4: Distance between camera and line of object mo-
tion.
is performed using Hybrid-Nets, and the collision risk
with the own vehicle is computed based on the de-
tected lane information.
3 ACCIDENT RISK MAP
DATASET
In this research, risk visualization is realized by net-
work learning. For training our network, we need
pairs of in-vehicle images and risk map images, but
such datasets do not exist. Thus, in this section,
we describe a method for constructing the training
dataset.
3.1 Overview of Dataset Generation
Figure 2 shows the overview of our method for cre-
ating pairs of RGB images and accident risk map im-
ages. The accident risk map image is generated by
multiplying the accident probability of the scene with
the risk value for each segmented region in the image.
By using UString-Net (Bao et al., 2020), we can
obtain the accident probability $a_t$ in the scene. How-
ever, this is just a single value that represents the prob-
ability of an accident across the image. Thus, we mul-
tiply it with a risk map image computed from the rela-
tive position and motion between the own vehicle and
each object in the image. The objects in the image are
segmented and extracted by using PCAN (Ke et al.,
2021). For computing the risk map image invariant to
camera parameters, we also use the lane information
extracted by using HybridNets (Vu et al., 2022).
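To make this pipeline concrete, the following is a minimal sketch (our own illustration, not the authors' implementation) of how the final risk map image can be assembled once the scene-level accident probability, the instance masks, and the per-object risk values are available; the function and argument names are hypothetical.

```python
import numpy as np

def build_risk_map(accident_prob, instance_masks, object_risks, image_shape):
    """Paint a per-pixel accident risk map.

    accident_prob  : float              -- scene-level accident probability a_t (UString-Net)
    instance_masks : list of bool maps  -- one H x W mask per segmented object (PCAN)
    object_risks   : list of float      -- per-object risk R_i (Section 3.3)
    """
    risk_map = np.zeros(image_shape, dtype=np.float32)
    for mask, risk in zip(instance_masks, object_risks):
        # Q_i = a_t * R_i, assigned to every pixel of the object region;
        # overlapping regions simply keep the larger value (our choice)
        risk_map[mask] = np.maximum(risk_map[mask], accident_prob * risk)
    return risk_map
```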
We explain the detail of the risk map computation
in the following sections.
3.2 Invariant Road Representation
We first extract lane markers in the image by using
HybridNets (Vu et al., 2022), and obtain their line parameters $\mathbf{l}_i$ $(i = 1, 2)$, i.e. the homogeneous coordinates of these lines. These image lines correspond to the lane markers $\mathbf{L}_i$ $(i = 1, 2)$ on the road surface. Since the lane markers are parallel to each other, their homogeneous coordinates can be represented as $\mathbf{L}_1 = [1, 0, 1]^\top$ and $\mathbf{L}_2 = [1, 0, -1]^\top$ in the 2D projective space. We also have the line correspondence between the bottom and top image lines, $\mathbf{l}_3 = [0, 1, y_{min}]^\top$ and $\mathbf{l}_4 = [0, 1, y_{max}]^\top$, and their corresponding lines on the road surface, $\mathbf{L}_3 = [0, 1, 1]^\top$ and $\mathbf{L}_4 = [0, 1, -1]^\top$, where $y_{min}$ and $y_{max}$ represent the coordinates of the bottom edge and the top edge of the image. These 4 pairs of corresponding lines have the following projective relationship:

$$\mathbf{l}_i = \mathbf{H}^{-\top} \mathbf{L}_i \quad (i = 1, \cdots, 4), \qquad (1)$$

and we can compute the projective transformation $\mathbf{H}$ by using these 4 pairs of lines. Once the projective transformation is obtained, we can transfer the image points $\mathbf{x}_i$ to the road surface points $\mathbf{X}_i$ as follows:

$$\mathbf{X}_i = \mathbf{H}^{-1} \mathbf{x}_i \qquad (2)$$
Since the lowest point of the object region obtained by PCAN in the image can be considered a point on the road surface in the 3D space, we transfer the lowest image point of each object region to the road surface point by using the projective transformation. The derived road surface representation is invariant to the camera parameters, so we can use it for computing an accident risk that is invariant to the camera parameters.
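The projective transformation $\mathbf{H}$ in Eqs. (1) and (2) can be estimated from the four line correspondences with a standard direct linear transform (DLT). The sketch below is our own illustration under the notation above, not the authors' code: it estimates $A = \mathbf{H}^{-\top}$ from the line pairs and then maps image points to the road plane via $\mathbf{X} = \mathbf{H}^{-1}\mathbf{x} = A^\top \mathbf{x}$.

```python
import numpy as np

def fit_line_homography(road_lines, image_lines):
    """Estimate A = H^{-T} such that l_i ~ A L_i (Eq. (1)), via DLT on line pairs."""
    rows = []
    for L, l in zip(road_lines, image_lines):
        L = np.asarray(L, float)
        l1, l2, l3 = np.asarray(l, float)
        zero = np.zeros(3)
        # two independent rows of the constraint  l x (A L) = 0
        rows.append(np.concatenate([zero, -l3 * L, l2 * L]))
        rows.append(np.concatenate([l3 * L, zero, -l1 * L]))
    M = np.vstack(rows)                      # 8 x 9 for four line pairs
    _, _, vt = np.linalg.svd(M)
    return vt[-1].reshape(3, 3)              # A, defined up to scale

def image_to_road(A, x):
    """Transfer an image point to the road plane: X = H^{-1} x = A^T x (Eq. (2))."""
    X = A.T @ np.asarray(x, float)
    return X / X[2]                          # dehomogenise
```

Given the four road/image line pairs, `fit_line_homography` returns $A$, and the lowest point of each object mask can then be transferred to the road plane with `image_to_road`.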
Figure 5: Proposed network for accident risk map estimation.
3.3 Computation of Accident Risk
For estimating the accident risks, we need to consider two components. The first one is the distance risk, and the second one is the directional risk.
The distance risk represents how fast the object is
approaching relative to the distance to the object on
the road, and it can be modeled by using Time-to-
Contact (TTC) (Tresilian, 1991) as follows:
$$TTC_i = \frac{d_i^t}{d_i^{t-1} - d_i^t}, \qquad (3)$$

where $d_i^t$ is the Euclidean distance between the object $i$ and the camera center on the road surface at time $t$. Then, the distance risk $P_i$ of object $i$ is obtained by taking the inverse of the TTC as follows:

$$P_i = \frac{1}{TTC_i} \qquad (4)$$
The directional risk on the other hand represents
how far the direction of motion of an object on the
road is pointing in the direction of the observer. As
shown in Figure 4, this can be represented by the distance $L_i$ between the camera center and the line that represents the motion of the object $O_i^t$ on the road. The inverse of $L_i$ is then taken to be the directional risk $D_i$ as follows:

$$D_i = \frac{1}{L_i} \qquad (5)$$
Then, by using these two components of risk, the risk $R_i$ of object $i$ can be computed as follows:

$$R_i = \sigma(P_i \cdot D_i), \qquad (6)$$

where $\sigma$ denotes the sigmoid function.
Then, the accident risk $Q_i$ of region $i$ is computed by multiplying $R_i$ with the accident probability $a_t$ of the whole image obtained from UString-Net:

$$Q_i = a_t \cdot R_i \qquad (7)$$
In this way, the accident risk at each image pixel can
be computed, and we can generate an accident risk
map image.
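Putting Eqs. (3)-(7) together, the risk of one object can be computed from its road-plane positions at two consecutive frames and the scene-level accident probability. The following is a minimal sketch under our reading of these equations (the camera center is taken as the origin of the road-plane coordinates; function names are hypothetical).

```python
import numpy as np

def object_risk(pos_prev, pos_curr, accident_prob, eps=1e-6):
    """Per-object accident risk Q_i from Eqs. (3)-(7).

    pos_prev, pos_curr : 2D road-plane positions of the object at t-1 and t
                         (camera center at the origin)
    accident_prob      : scene-level accident probability a_t from UString-Net
    """
    pos_prev = np.asarray(pos_prev, float)
    pos_curr = np.asarray(pos_curr, float)

    # Eq. (3)-(4): distance risk = inverse Time-to-Contact
    # (positive when the object is approaching the camera)
    d_prev = np.linalg.norm(pos_prev)
    d_curr = np.linalg.norm(pos_curr)
    ttc = d_curr / (d_prev - d_curr + eps)
    P = 1.0 / ttc

    # Eq. (5): directional risk = inverse distance from the camera center
    # to the line of motion through pos_prev and pos_curr
    motion = pos_curr - pos_prev
    cross = motion[0] * (-pos_prev[1]) - motion[1] * (-pos_prev[0])
    L = abs(cross) / (np.linalg.norm(motion) + eps)
    D = 1.0 / (L + eps)

    # Eq. (6)-(7): combine both risks and weight by the scene probability
    R = 1.0 / (1.0 + np.exp(-P * D))               # sigmoid
    return accident_prob * R
```

An object in another lane whose line of motion passes far from the camera center yields a small $D_i$, so even a fast oncoming vehicle receives little risk, which is the behavior illustrated in Figure 6.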
4 ESTIMATION OF ACCIDENT
RISK MAPS
In Section 3, we considered a method for generating
risk map images from in-vehicle images. However,
this method does not work when we fail to extract
lane markers in images. This situation frequently oc-
curs around road intersections, so the method is not
so practical. Thus, in this research, we use good im-
ages generated by the method described in Section 3
as training images, and realize stable risk map generation by training the network on these training images. In our network, we also use time-series in-
formation for generating better risk map images.
Our network is based on vid2vid (Wang et al.,
2018a) and uses a Markov process to capture tempo-
ral changes in images. Using $T$ in-vehicle images $x_1^T$ as input, $T$ accident risk map images $\tilde{a}_1^T$ are generated by the network. The risk map image $\tilde{a}_t$ at the current time $t$ is generated by using the past input images $x_{t-L}^t$ and the risk map images $\tilde{a}_{t-L}^{t-1}$ generated in the past as follows:

$$p\left(\tilde{a}_1^T \mid x_1^T\right) = \prod_{t=1}^{T} p\left(\tilde{a}_t \mid \tilde{a}_{t-L}^{t-1}, x_{t-L}^t\right) \qquad (8)$$
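In practice, this factorization means the risk maps are generated autoregressively: each map is conditioned on the last L input frames and on the maps generated so far. The loop below is only a schematic of that recursion; `generator` stands for the trained generator G and is not a real API.

```python
def generate_risk_maps(generator, frames, L):
    """Autoregressive risk-map generation following Eq. (8)."""
    risk_maps = []
    for t in range(len(frames)):
        past_frames = frames[max(0, t - L):t + 1]   # x_{t-L}^{t}
        past_maps = risk_maps[max(0, t - L):t]      # previously generated maps
        risk_maps.append(generator(past_frames, past_maps))
    return risk_maps
```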
The network consists of one Generator and two Discriminators, as shown in Figure 5. The Generator $G$ takes the RGB images $x_{t-L}^t$ up to time $t$ and the risk map images $a_{t-L}^{t-1}$ generated in the past as input, and generates the accident risk map image at time $t$. It also takes segmentation images $s_{t-L}^t$ and flow images $w_{t-L}^t$ to further improve the accuracy. Changes of the object regions are extracted by video instance segmentation, and the pixel-level temporal changes in the image are represented by the optical flow derived from flownet2. Our network has two discriminators for adversarial learning. One is the image discriminator $D_I$, which judges whether the generated image is real or fake, and the other is the video discriminator $D_V$, which judges whether the generated video sequence is natural or not. The evaluation function is as follows:
$$\min_G \left( \max_{D_I} \mathcal{L}_I(G, D_I) + \max_{D_V} \mathcal{L}_V(G, D_V) + \lambda_W \mathcal{L}_W(G) + \lambda_A \mathcal{L}_A(G) \right), \qquad (9)$$

where $\mathcal{L}_I$ is the adversarial loss for images as follows:

$$\mathcal{L}_I = E_{\phi_I(a_1^T, x_1^T)}\left[\log D_I(a_i, x_i)\right] + E_{\phi_I(\tilde{a}_1^T, x_1^T)}\left[\log\left(1 - D_I(\tilde{a}_i, x_i)\right)\right], \qquad (10)$$
and $\mathcal{L}_V$ is the adversarial loss for videos as follows:

$$\mathcal{L}_V = E_{\phi_V(w_1^{T-1}, a_1^T, x_1^T)}\left[\log D_V\left(a_{i-K}^{i-1}, w_{i-K}^{i-2}\right)\right] + E_{\phi_V(w_1^{T-1}, \tilde{a}_1^T, x_1^T)}\left[\log\left(1 - D_V\left(\tilde{a}_{i-K}^{i-1}, w_{i-K}^{i-2}\right)\right)\right] \qquad (11)$$
Also, $\mathcal{L}_W$ represents the L1 loss for the generated optical flow images as follows:

$$\mathcal{L}_W = \frac{1}{T-1} \sum_{t=1}^{T-1} \left\| \tilde{w}_t - w_t \right\|_1, \qquad (12)$$

and $\mathcal{L}_A$ is the L1 loss for the generated accident risk map images:

$$\mathcal{L}_A = \frac{1}{T} \sum_{t=1}^{T} \left\| \tilde{a}_t - a_t \right\|_1 \qquad (13)$$
By learning to minimize these losses, the network is
trained to generate accident risk maps from RGB im-
ages.
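For reference, the generator-side part of Eq. (9) can be written as a single weighted sum of the two adversarial terms and the L1 terms of Eqs. (12) and (13). The PyTorch-style sketch below uses the weights reported in Section 5 ($\lambda_W = 1.0$, $\lambda_A = 100.0$); the discriminator scores and tensor shapes are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_maps, real_maps, fake_flows, real_flows,
                   d_image_scores, d_video_score,
                   lambda_w=1.0, lambda_a=100.0, eps=1e-8):
    """Generator-side objective of Eq. (9) (adversarial + L1 terms).

    fake_maps, real_maps   : (T, C, H, W) generated / ground-truth risk maps
    fake_flows, real_flows : (T-1, 2, H, W) estimated / reference optical flows
    d_image_scores         : D_I outputs on the generated (map, image) pairs, in (0, 1)
    d_video_score          : D_V output on the generated clip, in (0, 1)
    """
    # adversarial terms: the generator tries to make D_I and D_V output 1
    loss_gan_image = -torch.log(d_image_scores + eps).mean()
    loss_gan_video = -torch.log(d_video_score + eps).mean()

    # Eq. (12): L1 loss on the predicted optical flow
    loss_flow = F.l1_loss(fake_flows, real_flows)

    # Eq. (13): L1 loss on the generated accident risk maps
    loss_map = F.l1_loss(fake_maps, real_maps)

    return (loss_gan_image + loss_gan_video
            + lambda_w * loss_flow + lambda_a * loss_map)
```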
5 EXPERIMENT
We next show the results of the experiments. The
Dashcam Accident Dataset (DAD) (Chan et al., 2016), which consists of videos of real accidents, is used to create the accident risk map dataset described in Section 3. From this dataset, we use 100 videos in which accidents occur and 100 videos in which no accidents occur. For video seg-
mentation, we used PCAN (Ke et al., 2021) trained
on BDD100K dataset (Yu et al., 2018), and for lane
segmentation, we used pre-trained HybridNets (Vu
et al., 2022). As the ground truth of the optical flow, we use the flow maps estimated by flownet2 pre-trained on the Cityscapes dataset (Cordts et al., 2016). Our network is trained using 312 videos and tested using 88 videos. We set $\lambda_W = 1.0$ and $\lambda_A = 100.0$. The generator and the discriminator are
trained for 200 epochs.
5.1 Dataset Creation
We first show the dataset created by using the pro-
posed method described in Section 3.
Figure 6 shows some examples of the created
dataset. We compare the data generated by the pro-
posed method with data derived from simple Time-to-
Contact (TTC) and data derived by combining TTC
and directional risk without multiplying the accident
probability across the image.
As shown in Figure 6, in the risk images gener-
ated from TTC only, the risk value is computed for
all approaching objects, but by combining the direc-
tional risk, higher risks are assigned to the objects ap-
proaching our vehicle. However, some risks are dis-
played even for objects in other lanes that have no risk
at all. In contrast, the risk maps generated by the
proposed method represent the risks that truly lead to
accidents, as shown in the right-most column. In ad-
dition, although the angle of view and the elevation
angle of these data are significantly different, the pro-
posed method derives risk map images appropriately,
and we find that the proposed method is invariant to
the camera parameters.
We next show the temporal changes in the risk
map images generated by the proposed method in Fig-
ure 7. For objects approaching the observer, we can
see that the risk value increases as they get closer.
5.2 Estimation of Accident Risk Maps
We next show the results of our risk estimation
method proposed in Section 4. The accident risk map
estimated by using the network trained on the dataset
generated in Section 5.1 is shown in Figure 8. In Fig-
ure 8, we show the results for two consecutive time in-
stants for each scene. For comparison, the results de-
rived by the existing method vid2vid are also shown.
Figure 6: Dataset generated by the proposed method (columns: input image, segmented image, TTC, TTC with direction, accident risk map). First row: a scene approaching a stationary vehicle; second row: a scene without danger; third row: a scene approaching a stopped bike.
Figure 7: Sequential images and their risk map images in the dataset generated by the proposed method. Scene 1: a vehicle is approaching from the right. Scene 2: a bike is coming towards the observer from behind a passing bus.
Table 1: Accuracy of generated accident risk maps. To com-
pute accuracy, precision, and recall, we binarised non-zeros
to 1 and zeros to 0 in the risk value for each pixel in the
generated image and computed them by taking the average
of all pixels.
Method               RMSE    Acc    Prec   Rec
Vid2Vid              10.58   91.18  31.95  33.25
Ours (L1+GAN)         9.42   94.88  56.60  58.95
Ours (L1+Flow+GAN)    9.39   95.10  58.74  65.10
To see the effect of flow information, we also compare with the case where the flow information is not used in the proposed method.
As shown in scene 1 in Figure 8, the risk re-
gions derived from vid2vid are vague and not accu-
rate, whereas the proposed method can derive the risk
regions accurately. Furthermore, as shown in scene
2 and scene 3, the proposed method can estimate the
risk accurately even for distant objects. We also find
that combining flow information improves the accu-
racy of the proposed method. In scene 4, the bike
was extracted accurately by adding flow information
to the proposed method. The results in scene 5 also
show that the risk regions derived from the existing
method are vague, whereas our method can extract
risk regions more accurately. However, the proposed
method sometimes over-detects risk regions, so we
need further improvement in its accuracy.

Figure 8: Accident risk map estimated by using the proposed network trained on the generated dataset (columns: input image, ground truth, Vid2Vid, Ours (L1+GAN), Ours (L1+Flow+GAN)). Scene 1: just before the motorbike collides with the car and falls over. Scene 2: a motorbike brakes for a white car approaching from the right. Scene 3: a truck crashes into a wall and creates dust. Scene 4: just before a car and a motorbike collide. Scene 5: a motorbike approaches from the left just before colliding with another motorbike.
We next show the results of the quantitative eval-
uation in Table 1. RMSE, Accuracy, Precision, and
Recall were measured as evaluation metrics. To com-
pute Accuracy, Precision, and Recall, we binarised
non-zeros to 1 and zeros to 0 in the risk value for each
pixel in the generated image and computed them by
taking the average of all pixels.
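The evaluation protocol above amounts to a pixel-wise binary comparison of the risk maps plus an RMSE on the raw risk values. A small sketch of this computation, under our reading of the binarization rule, is:

```python
import numpy as np

def evaluate_risk_map(pred, gt):
    """Pixel-wise RMSE, accuracy, precision and recall of a generated risk map."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)

    rmse = np.sqrt(np.mean((pred - gt) ** 2))

    # binarise: any non-zero risk value counts as "risky"
    p = pred != 0
    g = gt != 0
    tp = np.sum(p & g)
    accuracy = np.mean(p == g)
    precision = tp / max(np.sum(p), 1)
    recall = tp / max(np.sum(g), 1)
    return rmse, accuracy, precision, recall
```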
While the existing method failed to correctly cap-
ture the risk regions, the proposed method was able to
identify the risk regions accurately, and thus achieved better RMSE and other metrics. From the values of Precision and Recall, we
also find that the risks were estimated more accurately
by the proposed method for regions where objects are
present.
6 CONCLUSION
In this paper, we proposed a method for estimating
accident risk maps, which represent the accident risk
to the own vehicle, based on in-vehicle images.
The dataset required for training the GAN was
created using a model independent of the camera pa-
rameters. Unlike the conventional Time-to-Contact,
the dataset created by the proposed method can rep-
resent with high accuracy the greater risk only for ob-
jects approaching in the direction of the own vehicle.
Moreover, by combining the trained UString-Net, it is
possible to create a dataset of accident risk maps that
represent only hazards in situations where accidents
are likely to occur.
We also proposed a network for generating the
risk map images from in-vehicle images. The pro-
posed network trained on the proposed dataset can
estimate the accident risk map more accurately than
the conventional network by dealing with scenes with
different camera parameters.
Finally, we confirmed through real-world experi-
ments that the proposed method can visualize the risk
to the own vehicle using any type of in-vehicle cam-
era.
REFERENCES
Bao, W., Yu, Q., and Kong, Y. (2020). Uncertainty-based
traffic accident anticipation with spatio-temporal rela-
tional learning. In Proc. ACM International Confer-
ence on Multimedia.
Chan, F.-H., Chen, Y.-T., Xiang, Y., and Sun, M. (2016).
Anticipating accidents in dashcam videos. In Proc.
Asian Conference on Computer Vision.
Corcoran, G.-P. and Clark, J. (2019). Traffic risk as-
sessment: A two-stream approach using dynamic-
attention. In Proc. Conference on Computer and
Robot Vision (CRV), pages 166–173.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proc. Conference on Computer Vision and Pattern Recognition (CVPR).
Hassanein, A. S., Mohammad, S., Sameer, M., and Ragab,
M. E. (2015). A survey on hough transform, theory,
techniques and applications. arXiv.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask R-CNN. In Proc. International Conference on
Computer Vision (ICCV).
Hou, Y., Ma, Z., Liu, C., and Loy, C. C. (2019). Learning
lightweight lane detection cnns by self attention dis-
tillation. In Proc. International Conference on Com-
puter Vision.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A.,
and Brox, T. (2017). Flownet 2.0: Evolution of opti-
cal flow estimation with deep networks. In Proc. Con-
ference on Computer Vision and Pattern Recognition
(CVPR).
Ke, L., Li, X., Danelljan, M., Tai, Y.-W., Tang, C.-K.,
and Yu, F. (2021). Prototypical cross-attention net-
works for multiple object tracking and segmentation.
In Proc. Advances in Neural Information Processing
Systems.
Suzuki, T., Kataoka, H., Aoki, Y., and Satoh, Y. (2018).
Anticipating traffic accidents with adaptive loss and
large-scale incident db. In Proc. Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Tresilian, J. (1991). Empirical and theoretical issues in the
perception of time to contact. Journal of Experimen-
tal Psychology: Human Perception and Performance,
17(3):865–876.
Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.
B. G., Geiger, A., and Leibe, B. (2019). MOTS: Multi-
object tracking and segmentation. In Proc. Conference
on Computer Vision and Pattern Recognition (CVPR).
Vu, D., Ngo, B., and Phan, H. (2022). Hybridnets: End-to-
end perception network. arXiv.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz,
J., and Catanzaro, B. (2018a). Video-to-video synthe-
sis. In Proc. Conference on Neural Information Pro-
cessing Systems (NeurIPS).
Wang, Z., Ren, W., and Qiu, Q. (2018b). LaneNet: Real-
time lane detection networks for autonomous driving.
arXiv.
Yao, Y., Xu, M., Wang, Y., Crandall, D. J., and Atkins,
E. M. (2019). Unsupervised traffic accident detection
in first-person videos. In Proc. International Confer-
ence on Intelligent Robots and Systems (IROS).
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F.,
Madhavan, V., and Darrell, T. (2018). Bdd100k: A
diverse driving dataset for heterogeneous multitask
learning. In Proc. Conference on Computer Vision and
Pattern Recognition.
Zeng, K.-H., Chou, S.-H., Chan, F.-H., Niebles, J. C., and
Sun, M. (2017). Agent-centric risk assessment: Ac-
cident anticipation and risky region localization. In
Proc. Conference on Computer Vision and Pattern
Recognition (CVPR).