Proposal of a Signal Control Method Using Deep Reinforcement
Learning with Pedestrian Traffic Flow
Akimasa Murata, Yuichi Sei a, Yasuyuki Tahara b and Akihiko Ohsuga c
The University of Electro-Communications, Tokyo, Japan
a https://orcid.org/0000-0002-2552-6717
b https://orcid.org/0000-0002-1939-4455
c https://orcid.org/0000-0001-6717-7028
Keywords: Agents, Deep Reinforcement Learning, Traffic Control.
Abstract:
In dealing with traffic control problems, there have been studies that learn signal change patterns and timing
by applying reinforcement learning to traffic signals. Most of them focus on improving the delay time of
vehicles, and few of them assume a traffic situation that includes pedestrians. Therefore, the objective of this
study is to provide traffic control that reduces traffic delays for both vehicles and pedestrians in an environment
where pedestrian traffic volume varies greatly. We then verify the accuracy with traffic signals that consider
the temporal changes of the environment. As a result of the verification, although vehicle wait times increased,
a significant reduction in pedestrian wait times was observed.
1 INTRODUCTION
In the modern traffic environment, vehicles and
pedestrians are mixed, and traffic control is achieved
by using traffic signals appropriate to the environ-
ment. In response to this, research is being conducted
to create traffic signals that can respond to real-time
traffic changes by learning signal control policies us-
ing DQN (Deep Q-Network)(Mnih et al., 2013) , one
of the reinforcement learning methods. Most of those
studies focus on vehicles, but in the real environment,
we need to pay attention to exceptional factors such
as pedestrians. Information on the size of pedestrian
groups and their destinations is an important factor
in understanding the traffic environment and under-
standing of human flow should lead to appropriate
traffic control. Therefore, this study focuses on pedes-
trian control and aims to develop a traffic control sys-
tem that can respond to changes in the environment.
We then create traffic signals with a network to which
LSTM (Long Short-Term Memory) (Hochreiter and
Schmidhuber, 1997), (Gers et al., 2000) is added. This
is an attempt to account for the large differences in speed
between vehicles and pedestrians, as well as the dif-
ferences in where they move on the road. In
this way, we verify the feasibility of a control system
based on temporal changes in traffic volume.
Traffic signals that are controlled using the current
traffic environment are referred to as DTC (DQN
Traffic Control signal), while traffic signals that are
controlled using past traffic conditions are referred to
as LTC (LSTM Traffic Control signal). Both were
created using deep reinforcement learning, and their
accuracy was evaluated. As a result of a comparison
with FTC (Fixed Traffic Control signal), which switches
signals in a fixed sequence, DTC reduced waiting time
per vehicle by about 74% and increased waiting time
per pedestrian by about 196%, while reducing the
overall vehicle and pedestrian waiting time by approximately
12%; it can therefore be concluded that DTC is able to
control traffic according to the conditions of the environment.
LTC did not learn to control well, as it increased both
vehicle and pedestrian wait times. However, even in
environments with large numbers of pedestrians, its
waiting time per person did not increase significantly.
This paper is organized as follows. Section 2 intro-
duces related research. Section 3 describes the pro-
posed method. Section 4 describes the experiment,
the evaluation, and its discussion. Finally, Section 5
summarizes the conclusions of this paper.
2 RELATED WORKS
2.1 DQN
Q-learning is one of the reinforcement learning meth-
ods. It is a learning method that uses a table called
a Q-table to determine the value of an action in each
situation. Based on the Q-value on the Q-table, the
value of an action in each state is determined, and the
next action is selected accordingly. The Q value is up-
dated for each action based on the value of the action
in each state and the value of the subsequent actions.
However, the creation of a Q-table for estimating Q-
values becomes more complicated as the number of
states and actions increases. DQN incorporates a neural
network to estimate this Q-value and is capable
of stable and accurate estimation.
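As a minimal illustration of the idea (not code from this paper), the following PyTorch sketch computes the DQN bootstrap target and loss for a batch of transitions; the networks and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN update step: estimate Q(s, a) with the online network
    and bootstrap the target from a separate target network."""
    states, actions, rewards, next_states = batch  # actions: LongTensor

    # Q-values predicted for the actions that were actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next

    return nn.functional.mse_loss(q_pred, q_target)
```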
2.2 R2D2
DRQN (Deep Recurrent Q-Learning) (Hausknecht
and Stone, 2017) is a learning method that combines
LSTM, a model that enables learning of long-term de-
pendencies on time-series data, and DQN. However,
this learning method does not fit well with DQN's
experience replay, because the hidden state of the LSTM
is re-initialized when stored experiences are replayed.
R2D2 (Recurrent Experience Replay in Distributed
Reinforcement Learning) (Kapturowski et al., 2019) is
a method that solves this problem. During the simulation,
the hidden state of the LSTM is saved together with each
experience, alongside the state of the environment and
the agent's rewards, and the network is then updated with
it during training. By passing only the time-series data
through the network for a certain period without training
on it, this method brings the hidden state close to the
state at the time the experience was saved.
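As a rough sketch of this storage-and-burn-in idea (not the authors' implementation; the network interface is a placeholder), each stored experience carries the LSTM hidden state from save time, and a short prefix of the sequence is replayed without gradients before training on the rest.

```python
import torch

def replay_lstm_sequence(net, sequence, stored_hidden, burn_in=5):
    """R2D2-style replay for one stored sequence: re-warm the LSTM hidden
    state on a burn-in prefix (no gradients), then compute Q-values on the
    remaining steps. `net(state, hidden)` is assumed to return
    (q_values, new_hidden); `stored_hidden` was saved with the experience."""
    hidden = stored_hidden

    # Burn-in: advance the hidden state without training on these steps,
    # so it approximates the hidden state at the time the data was saved.
    with torch.no_grad():
        for state in sequence[:burn_in]:
            _, hidden = net(state, hidden)

    # Training portion: gradients flow through these steps only.
    q_values = []
    for state in sequence[burn_in:]:
        q, hidden = net(state, hidden)
        q_values.append(q)
    return torch.stack(q_values)
```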
2.3 Traffic Control Studies
Most of the studies on traffic control using traffic sig-
nals trained by reinforcement learning have been con-
ducted using traffic simulators. The Simulation of
Urban Mobility (SUMO) (Behrisch et al., 2018) is
the main simulator used. A common approach to traffic
control is to treat traffic signals as agents that learn from
tensors obtained from the presence or absence of vehicles
in the traffic network (Vidali, 2017), (Liang et al., 2018).
An action is selected by passing the obtained tensor
representing the traffic condition to a neural network.
The value of the action is determined according to how
the state changes as a result of the action, and the network
is updated using it. In addition, a cooperative multi-agent
system that transfers Q-values between adjacent traffic
lights (Ge et al., 2019) and a study introducing LSTM
(Choe et al., 2018) have also been conducted.
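As a minimal illustration of this common setup (not code from any of the cited studies; the grid size, network, and number of phases below are invented), a vehicle-presence tensor is flattened, passed to a Q-network, and the phase with the highest predicted value is chosen, with occasional random exploration.

```python
import torch
import torch.nn as nn

# Placeholder Q-network: flattened presence tensor in, one Q-value per signal phase
q_net = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 4))

def select_action(presence_tensor, epsilon=0.1):
    """Pick a signal phase from a vehicle-presence tensor (epsilon-greedy)."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(0, 4, (1,)).item()   # explore
    with torch.no_grad():
        q_values = q_net(presence_tensor.flatten())
    return int(q_values.argmax())                  # exploit

# Example: an 8x10 grid marking cells that currently contain a vehicle
presence = torch.zeros(8, 10)
presence[3, 5] = 1.0
action = select_action(presence)
```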
3 METHOD
3.1 Overview
In this study, we consider signal control in an envi-
ronment where vehicles and pedestrians are present
on the road and their traffic volume varies greatly.
SUMO is used to create an environment in which the
traffic volume changes, from which the traffic condi-
tion of the environment is obtained. Deep reinforce-
ment learning is performed based on the results. In
the traffic simulation, the durations of green and yel-
low lights are fixed, and the total time is set to k. In
this study, the value of k is 20 seconds on the simula-
tor. The traffic light acquires traffic conditions (location
information of vehicles and pedestrians), selects actions,
and calculates reinforcement learning rewards every
time k elapses. In the following description, the counter
that increases every time k is defined as step t, and the
list of obtained values is shown in Table 1.
Table 1: Symbol list.
Symbol   Expression
s_t      State of the environment at step t
w_t      Waiting time between step t-1 and step t
r_t      Reward between step t-1 and step t
a_t      Selected action at step t
M        Experience memory
3.2 State Representation
The traffic light acquires information on the location
of vehicles and pedestrians in the environment every
time k. Then, the road in the vicinity of the signal
is divided into segments of a predetermined distance.
The state of the environment is represented by cre-
ating a list according to the number of vehicles and
pedestrians moving or stopped within the segmented
area. In addition, the number of pedestrians present
in the pedestrian crossing at the intersection is added
to the list. This is to treat pedestrians in crosswalks as
information to determine whether a vehicle can turn
right or left. The environment state list s_t at step
t is represented as s_t = {N^1_t, N^2_t, ..., N^104_t}. This is a
list with 104 elements, where N^j_t is the number of
vehicles or pedestrians in road division range j. The
vehicle division range is set to the same size as that of
the vehicle in the three-lane portion near the signal. In
the two-lane portion, the range is gradually expanded
to twice, three times, and four times the size of the
vehicle. The pedestrian division range is about half
that of a vehicle, and is more segmented than that of
a vehicle (Figure 1). This is because the size of each
pedestrian in the environment is smaller than that of
a vehicle, and the purpose is to clarify the density of
pedestrians within the segmented area.
Figure 1: Segmentation range.
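The exact 104-cell layout is given by Figure 1; as a hedged sketch of how such a state list could be assembled with SUMO's TraCI API (the cell rectangles and crosswalk edge IDs below are placeholders, not values from the paper):

```python
import traci

def get_state(vehicle_cells, pedestrian_cells, crosswalk_edges):
    """Build the state list s_t: one count per road cell, plus the number
    of pedestrians on each crosswalk. The cell rectangles and crosswalk
    edge IDs are assumed to be precomputed from the network geometry."""
    veh_pos = [traci.vehicle.getPosition(v) for v in traci.vehicle.getIDList()]
    ped_ids = traci.person.getIDList()
    ped_pos = [traci.person.getPosition(p) for p in ped_ids]

    state = []
    # Vehicles per vehicle cell (x_min, y_min, x_max, y_max)
    for (x0, y0, x1, y1) in vehicle_cells:
        state.append(sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in veh_pos))

    # Pedestrians per (smaller) pedestrian cell
    for (x0, y0, x1, y1) in pedestrian_cells:
        state.append(sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in ped_pos))

    # Pedestrians currently on each crosswalk at the intersection
    for edge in crosswalk_edges:
        state.append(sum(traci.person.getRoadID(p) == edge for p in ped_ids))
    return state
```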
3.3 Action
The traffic light chooses its action using a state list ob-
tained from the environment at each time k. The ac-
tion is to maintain the current signal pattern or change
to one of the signal patterns shown in Figure 2. In
low traffic environments, simple control is possible
with only P_0 and P_1. However, in an environment with
heavy pedestrian traffic, right and left turns by vehicles
are restricted, so P_2 and P_3 are set to allow for this.
In addition, P_4, P_5, and P_6 are provided for the
case of extremely heavy vehicle or pedestrian traffic.
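The paper fixes the combined green and yellow duration at k = 20 s; a rough sketch of how a selected pattern could be applied through TraCI is shown below (the 4 s / 16 s split and the phase mappings are assumptions, and the mapping from P_0-P_6 to SUMO phase indices depends on the traffic-light program).

```python
import traci

YELLOW_STEPS = 4    # assumed yellow duration (s) within the fixed k = 20 s
GREEN_STEPS = 16    # assumed green duration (s)

def apply_action(tls_id, current_pattern, new_pattern, yellow_phase_of, phase_of):
    """Switch traffic light tls_id to the chosen signal pattern.
    If the pattern changes, the corresponding yellow phase is shown first.
    yellow_phase_of / phase_of map pattern indices to SUMO phase indices."""
    if new_pattern != current_pattern:
        traci.trafficlight.setPhase(tls_id, yellow_phase_of[current_pattern])
        for _ in range(YELLOW_STEPS):
            traci.simulationStep()

    traci.trafficlight.setPhase(tls_id, phase_of[new_pattern])
    for _ in range(GREEN_STEPS):
        traci.simulationStep()
```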
3.4 Reward
The reward values in reinforcement learning are
mainly determined by the waiting time of vehicles and
pedestrians near intersections. The waiting time of
stopped vehicles and pedestrians is obtained for each
step t. Assuming that the vehicle waiting time at step
t is CW_t and the pedestrian waiting time is PW_t, the
reward function R_t is expressed by the following equation 1:

R_t = (CW_{t-1} - CW_t) + α(PW_{t-1} - PW_t) + βEM_t    (1)
α is an arbitrary constant, and it is set to adjust
the importance of pedestrians in the environment. In
this study, α is set to 3 because the environment used
in the study contains many pedestrians.
Figure 2: Signal pattern.
Also, EM_t in the equation is the
number of emergency brakes the vehicle has caused
from the previous step to the current step. Emergency
braking may occur at the timing of a signal change or
when pedestrians are present in the crosswalk. The
emergency braking is considered to be the basis for
causing accidents and was established with the aim of
reducing the number of such accidents. β is a con-
stant that is negative, and the reward decreases as the
number of emergency brakes increases. In this study,
β is set to -5000. This value was determined based
on experiments with traffic signals that switch signals
in sequence. The results of this experiment showed a
very low vehicle reward, so we set β to a value of large
magnitude by comparison. This was expected to promote
the suppression of emergency braking during learning.
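A sketch of equation 1 in code with α = 3 and β = -5000 as in the paper; how CW_t, PW_t and EM_t are measured is not specified in detail, so the TraCI waiting-time queries below are assumptions (the emergency-brake count is simply passed in).

```python
import traci

ALPHA = 3        # weight of pedestrian waiting time (alpha in equation 1)
BETA = -5000     # negative penalty weight per emergency brake (beta)

def waiting_times():
    """Accumulated waiting time of vehicles (CW_t) and pedestrians (PW_t).
    Assumes the person waiting-time query is available in the SUMO version used."""
    cw = sum(traci.vehicle.getWaitingTime(v) for v in traci.vehicle.getIDList())
    pw = sum(traci.person.getWaitingTime(p) for p in traci.person.getIDList())
    return cw, pw

def reward(cw_prev, pw_prev, cw_now, pw_now, emergency_brakes):
    """Equation 1: positive when waiting times shrink, penalised through
    the negative BETA by the number of emergency brakes EM_t."""
    return (cw_prev - cw_now) + ALPHA * (pw_prev - pw_now) + BETA * emergency_brakes
```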
3.5 Network
The network used in this study consists of five fully
connected layers. In addition, in order to have traf-
fic control using temporal information, we prepared a
network including LSTM (Figure 3). This network is
obtained by replacing the first layer of the aforementioned
fully connected network with an LSTM layer. In this
network, the input
is a tensor consisting of states over several steps, and
depending on the state of the environment, one of the
signal patterns shown in section 3.3 is obtained as an
output.
Figure 3: Overview of the network.
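The paper specifies a five-layer fully connected network and a variant whose first layer is replaced by an LSTM; the PyTorch sketch below follows that description, but the hidden sizes and the assumption of one Q-value per pattern P_0-P_6 are ours, not values from the paper.

```python
import torch
import torch.nn as nn

class DTCNet(nn.Module):
    """Fully connected Q-network: state list in, one Q-value per signal pattern
    P_0-P_6 out (re-selecting the current pattern keeps it)."""
    def __init__(self, state_size=104, n_actions=7, hidden=400):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.layers(x)

class LTCNet(nn.Module):
    """Variant whose first layer is an LSTM over a sequence of states."""
    def __init__(self, state_size=104, n_actions=7, hidden=400):
        super().__init__()
        self.lstm = nn.LSTM(state_size, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, seq, hidden_state=None):
        out, hidden_state = self.lstm(seq, hidden_state)
        return self.head(out[:, -1]), hidden_state  # Q-values from the last step
```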
3.6 Algorithm
The learning procedures are shown in Algorithm 1 and
Algorithm 2. Algorithm 1 shows the flow of learning
by DQN and is based on the algorithm published by
Andrea Vidali (Vidali, 2017). First, it runs the simula-
tion with SUMO. Second, it obtains a state from the number
of vehicles and pedestrians in the environment. Third,
using the obtained state list, the signal selects an action.
Then, the algorithm calculates rewards based on
vehicle and pedestrian waiting time. Fourth, it saves the
acquired state and reward values to the experience memory.
This process is repeated for each step t. After a cer-
tain amount of time, the network is updated using the
saved experience. The experience memory stores the
state of the environment and the reward values, and its
maximum size is defined. Therefore, if the maximum
size is exceeded, the oldest data is deleted. The reason
for limiting the storage size is that ε-greedy makes the
older data highly random. Algorithm 2 shows learning
including LSTM. In learning with this network, the R2D2
(Kapturowski et al., 2019) method was incorporated.
The hidden state of the LSTM layer at each step t is
stored with the experience, and the network is updated
with it during training. Also, the first few steps of the
time-series tensor are not used for training but are only
passed through the network. This ensures that the hidden
state of the LSTM is close to the state at the time the
time-series tensor was saved.
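A minimal sketch of the experience memory M described above, in which the oldest entry is deleted once the maximum size is exceeded (the maximum size and field layout are placeholders; the LSTM variant additionally stores the hidden state h_t):

```python
import random

class ExperienceMemory:
    """Bounded experience memory M: once full, the oldest entry is dropped."""
    def __init__(self, max_size=50000):   # max_size is a placeholder value
        self.max_size = max_size
        self.data = []

    def append(self, experience):
        # experience is e.g. (s_t, a_{t-1}, r_t, s_{t-1}), or
        # (ls_{t-1}, a_{t-1}, r_t, ls_t, h_t) for the LSTM variant
        if len(self.data) > self.max_size - 1:
            del self.data[0]               # delete the oldest experience
        self.data.append(experience)

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))
```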
4 EXPERIMENT
4.1 Test Environment
In this study, we performed simulations using SUMO
and evaluated the accuracy of traffic signals. Simu-
lation was performed up to 100 episodes, with 4000
steps on SUMO as one episode. The environment
Algorithm 1: Learning Algorithm.
1: for episode = 1 to N do
2:   while step < max step do
3:     get traffic states and waiting time: s_t, w_t
4:     estimate reward: r_t
5:     update old adjacent signal action
6:     if len(M) > max size - 1 then
7:       delete M[0]
8:     end if
9:     append experience = (s_t, a_{t-1}, r_t, s_{t-1})
10:    select action with ε-greedy: a_t
11:    yellow phase and green phase
12:    t = t + 1
13:    update s_{t-1} = s_t, a_{t-1} = a_t
14:  end while
15:  update target network
16:  repeat training epochs do
17:    get batch size experiences
18:    update network parameters
19:  end repeat
20: end for
Algorithm 2: Learning Algorithm using time series data.
1: for episode = 1 to N do
2:   while step < max step do
3:     get traffic states and waiting time: s_t, w_t
4:     estimate reward: r_t
5:     get hidden state: h_{t-2}
6:     queue and pop time step states: ls_t
7:     if ls_t > (max ls size) - 1 then
8:       if len(M) > max size - 1 then
9:         delete M[0]
10:      end if
11:      append
12:      experience = (ls_{t-1}, a_{t-1}, r_t, ls_t, h_t)
13:      select action with ε-greedy: a_t
14:    else
15:      select action randomly
16:    end if
17:    yellow phase and green phase
18:    t = t + 1
19:    update ls_{t-1} = ls_t, a_{t-1} = a_t
20:  end while
21:  repeat training epochs do
22:    get batch size experiences
23:    burn-in process
24:    update network parameters
25:  end repeat
26: end for
shown in Figure 4 was prepared for simulation and
learning. In this environment, vehicles and pedestri-
ans travel on separate roads and do not collide at any
point on the intersection except at crosswalks. During
the simulation, the number of vehicles and pedestrians
in an episode is determined when the episode is executed.
At each step, vehicles and pedestrians are generated
according to a certain probability and start moving to
the destination determined
at the same time. Vehicles and pedestrians have their
generation probabilities changed at certain steps. The
purpose of this is to verify whether traffic signals are
capable of responding to changes in traffic volume by
establishing time periods during which traffic volume
varies greatly.
Figure 4: Simulation Environment.
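A hedged sketch of the time-varying traffic generation described above (the probabilities, the switching step, and the route and edge names are invented for illustration; the TraCI insertion calls are one possible way to spawn traffic):

```python
import random
import traci

def spawn_traffic(step, switch_step=2000,
                  veh_prob=(0.10, 0.05), ped_prob=(0.05, 0.20)):
    """Spawn vehicles and pedestrians with generation probabilities that
    change at switch_step, so traffic volume varies within an episode.
    Route "r_ns" and edges "e_west_in"/"e_east_out" are invented names
    and must exist in the SUMO network."""
    phase = 0 if step < switch_step else 1

    if random.random() < veh_prob[phase]:
        traci.vehicle.add("veh_%d" % step, "r_ns")            # vehID, routeID

    if random.random() < ped_prob[phase]:
        pid = "ped_%d" % step
        traci.person.add(pid, "e_west_in", 0.0)                # personID, edgeID, pos
        traci.person.appendWalkingStage(pid, ["e_west_in", "e_east_out"], 10.0)
```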
4.2 Evaluation Experiment
4.2.1 Experimental Setup
To evaluate the performance of the trained traffic light
agents, simulations up to 100 episodes were con-
ducted. The evaluation was based on comparisons
of average vehicle and pedestrian waiting times. In
the experiment, traffic patterns with 1.1 times, 1.5 times,
and 2 times as many pedestrians as vehicles are prepared;
these traffic patterns are referred to as the low-density,
medium-density, and high-density traffic environments,
respectively, and are evaluated. The per-
formance was evaluated by comparing the learned
traffic lights, DTC (DQN Traffic Control signal) and
LTC (LSTM Traffic Control signal), with a traffic
light called FTC (Fixed Traffic Control signal). DTC
is a traffic light with a network of fully connected layers,
and performs control using the state of the intersection
over a single step. LTC is a traffic light with a network
including LSTM, and performs control using the state of
the intersection over several steps. FTC is
a traffic light that switches color in a predetermined
order at regular intervals and provides stable control
regardless of the traffic environment. In this experi-
ment, its signal is switched between the north-south
pattern of P_0, P_2 and the east-west pattern of P_1, P_3 at
50-second intervals, as shown in Section 3.4.
4.2.2 Experimental Result
Table 2: Vehicle waiting time.
Method   Low (×10^2 s)   Medium (×10^2 s)   High (×10^2 s)
DTC      25.7            29.4               29.5
LTC      248.9           348.2              520.3
FTC      106.5           107.0              111.2

Table 3: Pedestrian waiting time.
Method   Low (×10^2 s)   Medium (×10^2 s)   High (×10^2 s)
DTC      99.0            161.5              266.6
LTC      44.3            63.2               98.0
FTC      37.6            53.7               81.4

Table 4: Waiting time per vehicle.
Method   Low (s)   Medium (s)   High (s)
DTC      6.60      7.59         7.62
LTC      64.4      89.8         134.0
FTC      27.44     27.67        28.74

Table 5: Waiting time per pedestrian.
Method   Low (s)   Medium (s)   High (s)
DTC      20.80     25.04        27.68
LTC      9.35      9.49         9.96
FTC      8.05      8.26         8.46
Tables 2 through 5 show the experimental results.
Tables 2 and 3 summarize the averaged cumulative
values of waiting time per episode for vehicles and
pedestrians. Table 4 shows the total amount of time
vehicles wait in an episode divided by the number of
vehicles, that is, the waiting time per vehicle. Table 5
shows the corresponding waiting time per pedestrian.
The waiting time represents the time on the simulator,
and the smaller the value is, the more appropriate the
traffic control is for the environment.
4.2.3 Discussion
In Tables 2 to 5, which are the experimental results,
we evaluate and discuss the accuracy of the learned
traffic signals. Tables 2 and 3 show that, compared to the
stable control of FTC, the learned traffic light DTC
reduced vehicle waiting time by about 74% and increased
pedestrian waiting time by about 196%, on average
across the densities. Summing both sides, DTC's overall
total waiting time is higher. However, Tables 4 and 5
show that, compared to FTC, DTC reduced the combined
waiting time per vehicle and per pedestrian by about
12%. This can be thought of
as a reduction in the amount of waiting time incurred
by vehicles and pedestrians in the transportation net-
work as a whole. Table 3 shows that, of the two trained
traffic signals, LTC has the shorter pedestrian waiting
time, but Table 2 shows that the vehicle waiting time of
LTC is significantly higher than that of DTC. This is
because LTC placed more emphasis on reducing pedestrian
waiting time. As described in Section 3.4, the importance
of pedestrians in the environment is adjusted by the
constant α in the reward function. Because this value is
high, the agent learns that continuing to take actions that
reduce pedestrian waiting time is a simple way to increase
the reward.
wards that reduce waiting time for both pedestrians
and vehicles. It can also be inferred that the prepared
road environment itself was a factor that increased the
overall waiting time including vehicles. Since the prepared
environment is only a simple crossroad, it is easy to grasp
the scale of vehicles and pedestrians moving on the road.
In such an environment, DTC, which uses the current
state of the environment, can more easily perform control
that reflects that state. On
the other hand, LTC can lead to better control in com-
plex road environments where it is difficult to judge
traffic conditions on the spot. Tables 4 and 5 show that
DTC and FTC do not show a significant change in
waiting time per vehicle or per pedestrian with respect
to changes in traffic density. In contrast, LTC shows a
large change in vehicle waiting times, but no noticeable
difference in pedestrian waiting times. This indicates
that LTC works to reduce the waiting time for pedestrians
in response to changes in the traffic environment.
Therefore, it can be determined that the learned traffic
signals are superior in terms of control adapted to the
traffic environment.
5 CONCLUSION
In this research, traffic control was performed using
traffic signals learned by deep reinforcement learning in
an environment with a mixture of vehicles and pedestrians.
Signal control was verified using a network that uses the
state of the environment over a single step and a network
that uses the state of the environment over several steps.
As a result, DTC, a learned traffic light, led to a reduction
in the waiting time experienced by each individual in the
traffic network as a whole. Furthermore, LTC was not able
to reduce the waiting time of the entire traffic network,
but it was able to reduce the pedestrian waiting time by
adapting to changes in traffic volume. In the
future, we will expand the learning and experimental
environment to create a traffic signal system that can
control traffic in a large-scale traffic network. In addi-
tion, we will improve the control of traffic signals us-
ing time-series information. We will try to achieve
appropriate control by increasing or decreasing the number
of states obtained from the environment and by adjusting
the past states used to select actions. On top of that, we
will try to control not only simple structures such as
crossroads, but also complex road environments, by using
human flow information such as the size of pedestrian
groups and their direction of movement.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Numbers JP21H03496, JP22K12157.
REFERENCES
Behrisch, M., Lopez, P. A., Bieker-Walz, L., Erdmann, J.,
Flötteröd, Y., Hilbrich, R., Lücken, L., Rummel, J.,
Wagner, P., and Wießner, E. (2018). Microscopic
Traffic Simulation using SUMO. IEEE Intelligent
Transportation Systems Conference (ITSC).
Choe, C., Back, S., Woon, B., and Kong, S. (2018). Deep
Q Learning with LSTM for Traffic Light Control.
2018 24th Asia-Pacific Conference on Communications
(APCC).
Ge, H., Song, Y., Wu, C., Ren, J., and Tan, G. (2019).
Cooperative Deep Q-Learning With Q-Value Transfer for
Multi Intersection Signal Control. IEEE Access,
2907618.
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000).
Learning to forget: continual prediction with LSTM.
Neural Computation 12(10), 2451-2471.
Hausknecht, M. and Stone, P. (2017). Deep Recur-
rent Q-Learning for Partially Observable MDPs.
arXiv:1507.06527v4.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term
Memory. Neural Computation 9(8): 1735-1780.
Kapturowski, S., Ostrovski, G., Quan, J., Munos, R., and
Dabney, W. (2019). Recurrent Experience Replay in
Distributed Reinforcement Learning. ICLR 2019.
Liang, X., Du, X., and Han, Z. (2018). Deep reinforcement
learning for traffic light control in vehicular networks.
IEEE Trans. Veh. Technol.:
https://arxiv.org/abs/1803.11115 (2022/06/10 reference).
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing Atari with Deep Reinforcement
Learning. arXiv preprint arXiv:1312.5602.
Vidali, A. (2017). Simulation of a traffic light scenario
controlled by Deep Reinforcement Learning agent.
https://github.com/AndreaVidali/Deep-QLearning-Agent-for-TrafficSignal-Control
(2022/06/10 reference).