A Deep Q-learning based Path Planning and Navigation System for
Firefighting Environments
Manish Bhattarai¹,² and Manel Martínez-Ramón¹
¹University of New Mexico, Albuquerque, NM, 87106, U.S.A.
²Los Alamos National Laboratory, Los Alamos, NM, 87544, U.S.A.
Keywords:
Path Planning, Navigation, Firefighting, Decision Making, Reinforcement Learning, Deep Q-learning,
Situational Awareness.
Abstract:
Live fire creates a dynamic, rapidly changing environment that presents a worthy challenge for deep learning and artificial intelligence methodologies to assist firefighters with scene comprehension, helping them maintain situational awareness and track and relay the important features necessary for key decisions as they tackle these catastrophic events. We propose a deep Q-learning based agent that is immune to stress-induced disorientation and anxiety and is thus able to make clear decisions for firefighter navigation based on the observed and stored facts in live fire environments. As a proof of concept, we simulate a structural fire in the Unreal Engine gaming engine, which enables the interaction of the agent with the environment. The agent is trained with a deep Q-learning algorithm based on a set of rewards and penalties as per its actions on the environment. We exploit experience replay to accelerate the learning process and augment the learning of the agent with human-derived experiences. The agent trained under this deep Q-learning approach outperforms agents trained through alternative path planning systems and demonstrates this methodology as a promising foundation on which to build a path planning navigation assistant. This assistant is capable of safely guiding firefighters through live-fire environments in fireground navigation activities that range from exploration to personnel rescue.
1 INTRODUCTION
Near-zero visibility, unknown hallways, deadly heat
and flame, and people in dire need. These are the chal-
lenges firefighters face with every structure fire they
respond to. Firefighters endure both extreme external
conditions and the internal hazards of stress, panic,
and disorientation as part of their daily job. Their cen-
tral weapon against both internal and external hazards
is their training in maintaining situational awareness, or an understanding of the activities and circumstances occurring in one's immediate vicinity. Maintaining situational awareness is key to a firefighter's quick and apt response to an ever-changing environment and is critical to accurate decision-making. Situational awareness can be heavily impacted by both the external hazards related to fire and the corresponding internal stresses experienced by first responders. Loss of situational awareness is one of the main causes of the loss of life of firefighters on scene. Firefight-
ers must make prompt decisions in high-stress envi-
ronments, constantly assessing the situation, planning
their next set of actions, and coordinating with other
colleagues, often with an incomplete picture of the sit-
uation. Situational awareness is the foundation of fur-
ther decisions on how to coordinate both rescue oper-
ations and fire suppression. Firefighters on-scene pass their scene interpretations via portable radio devices to field commanders for further assistance in decision making, and the passing along of an inaccurate understanding of current conditions can prove disas-
trous. The limitation of this decision-making system is well reflected in the annual statistics by the US Fire Administration on the loss of human life¹. Existing
firefighting protocols present an excellent use case for the institution of state-of-the-art communication and
information technologies to improve search, rescue,
and fire suppression activities through improved uti-
lization of the data already being collected on-scene.
Firefighters often carry various sensors in their equipment, including a thermal camera, gas sensors, and a microphone, to assist in maintaining their situational awareness, but this data currently is used only
¹ Firefighter Fatalities in the United States in 2017.
in real-time by the firefighter holding the instruments.
Such data hold great potential for improving the capability of fire teams on the ground if the output of these devices could be processed, with relevant information extracted and returned to all on-scene first responders quickly, efficiently, and in real time as a form of augmented situational awareness. The loss of situational awareness is at the core of disorientation and poor decision making. Advances in computing technology, together with small, cheap, wearable sensors paired with wireless networks and advanced computing methodologies such as machine learning (ML) algorithms that can perform data processing and prediction on mobile computing devices, make it not only possible but quite feasible to create AI systems that can assist firefighters in understanding their surroundings to combat such disorientation and its consequences. This research presents a theoretical approach that can serve as the backbone upon which such an AI system can be built by demonstrating the power of deep Q-learning in building a path planning and navigation assistant capable of tracking scene changes and offering firefighters alternative routes in dynamically changing fire environments.
AI planning is a paradigm that specializes in designing algorithms to solve planning problems. This is accomplished by finding a sequence of actions, subject to the relevant needs and constraints, that drives an agent from a specified initial state to a final state satisfying several specified goals. We utilize these paradigms
to build a framework that teaches the agent about
fire avoidance and deploys a decision process reactive
enough to successfully guide the agent through simu-
lated spaces that are as dynamic as those encountered
in live fire events. Training in a simulated environment allows us to test a multitude of situations and train the agent through exposure to a vast number of scenarios that would otherwise be impossible in real life. As a result, we get a vastly experienced pilot capable of presenting quick recommendations for a wide variety of situations. The presentation of this
technology is meant to serve as the basis upon which
to build a navigation assistant in future work.
2 PRELIMINARIES
The work in this paper is based on two distinct fields: 1) path planning and navigation, and 2) deep reinforcement learning.
2.1 Path Planning and Navigation
A large amount of work has been done on path planning and navigation to aid firefighting, but few works address dynamic, continuously changing environments. (Su and Su, 2012) proposes a mobile robot with various sensors to detect fire sources and uses the A* search algorithm for rescue. An
algorithm based on fire simulation to plan safe trajec-
tories for an unmanned aerial system in a simulator
environment is presented in (Beachly et al., 2018).
(Jarvis and Marzouqi, 2005) shows the efficacy of a covert robotic algorithmic tool for robot navigation in high-risk fire environments. The use of an ant colony optimization tool to automatically find the safest escape routes in an emergency situation in a simulator environment is shown in (Goodwin et al., 2015), whereas (Zhang, 2020) formulates the navigation problem as a "Traveling Salesman" problem and proposes a greedy-algorithm-based route planner to find the safest route to aid firefighters in navigation.
(Ranaweera et al., 2018) proposes particle swarm optimization for shortest-path planning for firefighting robots, whereas (Zhang et al., 2018) proposes approximate dynamic programming to learn the terrain environment and generate the motion policy for optimal path planning for UAVs in forest fire scenarios. A methodology for path reconstruction based on the analysis of thermal image sequences is demonstrated in (Vadlamani et al., 2020), which is based on the estimation of camera movement through estimation of the relative orientation with SIFT and optical flow.
Despite the large quantity of work in the literature to aid firefighters in path planning and navigation, most approaches solve path planning for a static environment, where a one-time decision is made to guide the agent from source to destination. Such algorithms fail when the environment is dynamic. Furthermore, these algorithms do not allow the agent to make immediate decisions when it encounters sudden fire along the chosen navigation path. We propose a deep reinforcement learning-based agent that is capable of making instantaneous decisions based on learned experience when subjected to sudden environmental changes during navigation.
2.2 Deep Reinforcement Learning
Reinforcement learning (RL) is a technique that learns an optimal policy by choosing actions that maximize the sum of expected rewards. Even though several works exist for path planning in fire environments, no RL based path planning implemen-
tations were found for fire scenarios. Outside of the
fire scenario, several RL based path planning imple-
mentations do exist. (Romero-Mart
´
ı et al., 2016)
demonstrates an RL based navigation of a robot which
is provided with a topological map.(Li et al., 2006)
uses a Q-learning based path planning for an au-
tonomous mobile robot for dynamic obstacle avoid-
ance. An RL based complex motion planning for an
industrial robot is presented in (Meyes et al., 2017).
RL integrated with deep learning has demonstrated phenomenal breakthroughs that are able to surpass human-level intelligence in computer games such as Atari 2600 games (Mnih et al., 2013; Mnih et al., 2015) and AlphaGo Zero (Silver et al., 2017; Tang et al., 2017), along with various other games. In these frameworks, the AI agent was trained by receiving only snapshots of the game and the game score as inputs. Deep RL (DRL) has also been used for autonomous navigation based on inputs of visual information (Surmann et al., 2020; Kiran et al., 2020). (Bae et al., 2019) proposes a multi-robot path planning algorithm based on deep Q-learning, whereas (Lei et al., 2018) demonstrates the autonomous navigation of a robot in a complex environment via path planning based on deep Q-learning (DQL) with SLAM. Most of these deep reinforcement learning-based path planning and navigation tasks are based on visual input, i.e., raw image/depth data, which encodes information about the environment. Based on this information, the navigation agents can establish the relationship between actions and the environment. The agent in the DRL system embeds the action-policy map in the model parameters of the neural nets.
Despite the efficacy of DRL systems in navigation, they are based on a trial-and-error learning experience in which the agent goes through numerous failures before actual success. Considering the hazardous nature of fire environments, training and evaluation of such DRL systems is very dangerous and practically infeasible; in addition, it is very expensive as well as time-consuming. To address these challenges, we developed the training environment for the RL agent in the Unreal Engine virtual gaming environment. The virtual environment depicts an actual firefighting scenario and enables the user to collect a large number of visual observations of actions and reactions in various fire environments. The agent can interact with the environment through its actions and can also be trained with various user-defined rewards and goals. The framework also allows a plug-and-play option for the firefighting environment, where one can depict a variety of fire scenarios, from structural fires to wildfires, for training the DRL agent.
In this paper, we introduce a DRL approach to train a virtual agent in a simulated fire environment. Taking advantage of simulation, we are able to expose our agent to a vast number of scenarios and dynamics that would be cost- and safety-prohibitive in real life, while the results of the training can still be applied to real-life fire events. The resulting algorithm can be used in conjunction with other deep learning/machine learning approaches to produce a robust navigation assistant that can operate in real time, effectively guiding firefighters through a fire scene and aiding their decision making by supplementing information gaps and situational awareness lags through the correct interpretation of the scenes they have passed and/or are currently in.
3 RL PROBLEM DESCRIPTION
AND RL ARCHITECTURE
The virtual environment is achieved in the gaming platform Unreal Engine (Qiu and Yuille, 2016), depicting a fire scenario of burning objects and smoke. The gaming engine uses computational fluid dynamics (CFD) (Anderson and Wendt, 1995) based physics models to simulate a real-life dynamic situation where the parameters are a function of time. This gaming environment allows external interaction in which the agent can navigate the scene via external controls such as a keyboard or head movement in a virtual reality (VR) device. We take advantage of the interface software AirSim (Shah et al., 2018), which allows communication to and from the gaming environment to a deep learning framework (TensorFlow (Abadi et al., 2016)). AirSim can grab various parameters from the gaming environment, such as the RGB feed, infrared feed, depth, and semantic map information corresponding to the scene, and provides these feeds to the Python block.
The Python block then processes this information, selects an agent movement (move forward, move backward, turn right, turn left, or jump) based on the deep reinforcement learning (RL) algorithm, and passes it to AirSim. AirSim then provides these control commands to the Unreal Engine environment, which emulates the motions. We deploy a deep Q-learning agent that is trained on a policy-reward mechanism along with experience replay. For the experience replay, in addition to storing agent self-play, we also recorded user interactions with the environment, where users were asked to safely navigate the environment, avoiding the fire and reaching the target in the given scene. With each new start, the user is asked to take different routes with the virtual agent to reach the destination while avoiding fire, and
Figure 1: Deep Q-Network implementation.
the video frames and controls are recorded. During the training of the RL agent, the sequences of frames and controls from the experience replay memory are provided to accelerate the training process and make the non-differentiable optimization problem converge in a reasonable time with better accuracy. The knowl-
edge gained by the virtual agent on how to success-
fully navigate the virtual scene can then be transferred
to a cyber-human system that can use this knowledge
to interpret a real scene and provide step-by-step di-
rections to firefighters to assist them in avoiding fire
or other dangerous obstructions. The overview of the
proposed DQN is shown in Figure 1.
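As a concrete illustration of this interface, the following minimal sketch (assuming AirSim's Computer Vision mode with a Python client; the exact camera name, channel count, translation/rotation increments, and pose APIs are assumptions that may vary across AirSim versions) grabs one infrared frame and applies one of the discrete motions by updating the agent pose:

import math
import numpy as np
import airsim  # interface between Unreal Engine and Python (Shah et al., 2018)

STEP, TURN = 0.5, math.radians(15)      # assumed translation (m) and rotation increments

client = airsim.VehicleClient()          # Computer Vision mode client
client.confirmConnection()

def get_infrared_frame():
    # Request one uncompressed infrared image from camera "0".
    response = client.simGetImages([airsim.ImageRequest(
        "0", airsim.ImageType.Infrared, False, False)])[0]
    img = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    # 3 channels in recent AirSim releases; older versions may return 4.
    return img.reshape(response.height, response.width, 3)

def apply_action(action):
    # Map a discrete action to a pose update of the agent in the scene.
    pose = client.simGetVehiclePose()
    _, _, yaw = airsim.to_eularian_angles(pose.orientation)
    if action == "front":
        pose.position.x_val += STEP * math.cos(yaw)
        pose.position.y_val += STEP * math.sin(yaw)
    elif action == "back":
        pose.position.x_val -= STEP * math.cos(yaw)
        pose.position.y_val -= STEP * math.sin(yaw)
    elif action == "left":
        yaw -= TURN
    elif action == "right":
        yaw += TURN
    elif action == "jump":
        pose.position.z_val -= STEP      # NED frame: negative z is up (crude "jump")
    pose.orientation = airsim.to_quaternion(0, 0, yaw)
    client.simSetVehiclePose(pose, ignore_collision=False)

frame = get_infrared_frame()
apply_action("front")                    # in the full system, the DQN policy picks this action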
Now, we define our objectives and various param-
eters associated with the proposed DRL framework.
3.1 Objective
The goal of the proposed deep Q-learning based agent
is to reach the destination while safely navigating the
fire in a dynamic environment. Safe navigation is
defined as avoiding any contact with simulated fire.
During testing, the agent needs to be prepared to make instantaneous decisions in instances where fire appears unexpectedly in the chosen navigation path. To achieve the best decisions under such situations, the agent can be subjected to many worst-case situations during the training phase. The rewards need to be defined precisely to support such task-driven learning.
3.2 Observations
The observations for the Q-learning framework are collected through the agent's field of view (FOV) in the virtual simulation environment (i.e., Unreal Engine) using AirSim. The observations are received by the Python deep learning environment in the form of various feeds, which include RGB, infrared, depth, and semantic map frames. Of these, we are particularly interested in the infrared frames, as the CNN framework is developed to perform recognition on thermal imagery. This CNN framework is based on (Bhattarai and Martínez-Ramón, 2020). In real-life scenarios, infrared cameras are the only feed type that can withstand extreme fire and smoke situations and improve visibility in heavy smoke. The virtual environment is also able to provide information about the camera position and the agent position, which is helpful to locate the agent in the given 3D environment.
3.3 Actions
For ease of implementation and demonstration of the proof of concept, we have transformed the action space from continuous to discrete, comprising five primary agent motions. This discretization of the action space also helps to reduce the model complexity. The five actions are move front, move back, turn left, turn right, and jump. With these motions, the agent can navigate in a structural building containing
obstacles like ladders and furniture. The same set of motions also enables the agent to navigate in wildfire scenarios. The agent may take one or a combination of these actions to navigate through the fire scene to reach the destination.
3.4 Rewards
It is very important to define the direction of the goal while training an RL agent. To achieve a task-driven learning objective, it is vital to define the rewards for the navigating agent. The ultimate goal is to find a safe trajectory of minimal length to the navigation target. Unlike the trivial objective of finding the minimal trajectory length alone, the additional constraint of finding a safe path makes the optimization more complex. This results in a time-varying decision system whose instantaneous decisions are based on current information about the environment. We introduce a reward and a penalty to familiarize the agent with the fire environment: contact with fire carries a penalty of -10, and the reward for reaching the goal is 10.
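A minimal sketch of this reward scheme follows; the boolean inputs are hypothetical placeholders for whatever collision and goal-detection signals the simulator exposes, and the assumption that all other transitions yield zero reward is ours:

FIRE_PENALTY = -10.0   # penalty for any contact with simulated fire
GOAL_REWARD = 10.0     # reward for reaching the navigation target

def reward(touched_fire: bool, reached_goal: bool) -> float:
    # Only fire contact and goal arrival produce a non-zero signal,
    # as described above; every intermediate step returns zero.
    if touched_fire:
        return FIRE_PENALTY
    if reached_goal:
        return GOAL_REWARD
    return 0.0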
3.5 Problem Statement
The RL agent tends to pivot its actions in the direction that maximizes the rewards. The DRL system optimizes the parameters of the Q-network to encode the experience of the agent for navigation. The backbone deep network aids the navigation through detection of the objects of interest for navigation. This information is then fed to the Q-network, which then chooses the optimal actions to guide the agent. The idea of a DRL system is to provide an end-to-end learning framework for transforming pixel information into actions (Mnih et al., 2015). Most DRL systems aim to learn the parameters of the neural network that define a transformation from state representations s to a policy π(s). It is also desirable to have an agent that can learn navigation in a single environment and generalize that experience to various environments. To achieve this, the aim is to learn a stochastic policy function π, which processes a representation of the current state s_i and the target state s_t to produce a probability distribution over the action space, π(s_i, s_t). During testing, the agent samples an action from this distribution until it reaches the destination. To summarize, the objective function used to assess the model performance is given in the form
z = g(x; θ) = g(x; β(θ); θ) (1)
where g is the navigation problem, which can be defined as finding the optimal actions with a DQN whose parameters θ and β are the parameters of the navigation agent, and x is the observation. z describes the navigation, and a given fitness function J(z) is applied to it in order to measure the optimality of the estimated decisions for navigation. For a given fixed set of neural net parameters β, the optimizer seeks the optimal θ that determines the actions for the agent.
3.6 Model
The emphasis of this work is to find an optimal policy that can aid a firefighter to navigate in a fire setting via deep reinforcement learning. A deep neural network is trained as a non-linear approximation of the policy function π, where the action a at time t is sampled as:

a ∼ π(s_i, s_t | β)    (2)

where β corresponds to the NN model parameters, s_i is the current observation frame, and s_t is the target observation to which navigation is to be performed with the action sequence a. Here, s_t belongs to a discrete set and π is a distribution function. The target scene can contain fire victims who need rescue. So, once the deep learning (DL) model estimates the target to be rescued, the RL agent proposes navigation paths that successfully rescue the victim.
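For illustration, sampling an action from the learned distribution π(s_i, s_t | β) might look like the following sketch, where policy_net is a hypothetical network (standing in for the trained model) that outputs one logit per discrete action:

import numpy as np
import tensorflow as tf

ACTIONS = ["front", "back", "left", "right", "jump"]

def sample_action(policy_net, s_i, s_t):
    # Concatenate current and target observations, obtain logits for the
    # five discrete actions, and sample a ~ pi(s_i, s_t | beta).
    logits = policy_net(tf.concat([s_i, s_t], axis=-1))   # shape (1, 5)
    probs = tf.nn.softmax(logits)[0].numpy()
    return np.random.choice(ACTIONS, p=probs)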
3.7 Q-learning and Deep Q-learning
We employ a variant of Q-learning called Deep Q-learning (DQL) (Mnih et al., 2013) to train an agent to navigate the fire and reach the destination safely. In this section, we briefly give an overview of the Q-learning and deep Q-learning algorithms.
Q-learning learns the action-value function Q(s, a) to quantify the effectiveness of taking an action at a particular state; Q is called the action-value function (or Q-value function). In Q-learning, a lookup/memory table Q[s, a] is constructed during training to store the Q-values for all possible combinations of states s and actions a. An action a is sampled at the current state s, after which the reward R (if any) and the new state s′ are observed. From the memory table, the next action a′ is determined as the one that maximizes Q(s′, a′). Based on this one-step look-ahead, the target for Q(s, a) is set to

target = R(s, a, s′) + γ max_{a′} Q_k(s′, a′)    (3)

where k corresponds to an episode. The update equations are called Bellman equations (Bellman, 1966) and are performed iteratively with dynamic programming. As this update is performed iteratively until convergence, a running average for Q is maintained.
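In tabular form, the update just described can be sketched as follows; the learning rate alpha implements the running average mentioned above, and the env.step call is a hypothetical placeholder for the environment transition:

import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1
Q = defaultdict(float)                   # memory table Q[(state, action)]

def q_update(env, state, actions):
    # epsilon-greedy action selection over the memory table
    if random.random() < EPSILON:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = env.step(state, action)   # hypothetical environment call
    # one-step look-ahead target: R(s, a, s') + gamma * max_a' Q(s', a')
    target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
    # running-average (Bellman) update toward the target
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
    return next_state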
Algorithm 1: Deep Q-learning algorithm for the path planning agent.
1: Initialize replay memory R to capacity N.
2: Initialize the Q-function Q(s, a) for all s, a with random weights.
3: for k in 1, 2, ..., M do    ▷ each execution sequence, where k corresponds to an episode
4:     Initialize sequence s_1 = x_1
5:     for t in 1, 2, ..., T do    ▷ decision epoch
6:         With probability ε select a random action, otherwise select a_t = argmax_a Q(s, a; β)    ▷ exploration vs. exploitation step
7:         Action a_t is performed by the agent in the environment, and the corresponding reward r_t and scene x_{t+1} are observed.
8:         Set s_{t+1} = (s_t, a_t, x_{t+1})
9:         Store s_{t+1} in R.
10:        Sample a batch of transitions e_k = (x_k, a_k, r_k, x_{k+1}) from R.
11:        if x_{t+1} is terminal then
12:            y_k = r_k
13:        else
14:            y_k = r_k + γ max_{a′} Q(s_{k+1}, a′; β)
15:        end if
16:        Compute the loss (y_k − Q(s_k, a_k; β_k))² and update the neural net parameters β with gradient descent and back-propagation as per equations 4, 5 and 6.
17:    end for
18: end for
However, for solving a real-world problem such as path planning and navigation, where the number of combinations of states and actions is too large, the memory and computational requirements for Q become very expensive and, in some cases, intractable. To address that issue, a deep Q-network (DQN) framework was introduced to approximate Q(s, a) with the aid of neural network parameters. The associated learning algorithm is called deep Q-learning. With this approach, we can approximate the Q-value function with a neural network rather than constructing a memory table of the Q-function over states and actions.
An RL system needs to know the current state and actions to compute the Q-function. However, for our proposed simulation environment, the internal state information is not available. One option is to construct the state information from a recognition system that identifies the objects of interest in the scene, resulting in a discretization of the observation space by assigning pixels discrete values based on their identity. This objective is outside the scope of this paper and will be pursued in the future. For this
implementation, we only focus on observing a frame x_t from the emulator, which is a grayscale infrared image. Based on the action performed in the environment, the agent receives a reward r_t, along with a change in the internal state of the environment. Since we have defined a finite reward/penalty corresponding to specific states, the agent might need to go through a series of actions before observing any reward/penalty.
To estimate the Q-function, we consider the sequence of actions and observations for a game-play episode, given as s_t = x_1, a_1, x_2, a_2, ..., a_{t−1}, x_t.
Considering that t is a finite time at which the game terminates, either by reaching the target or by getting burnt in a fire, this sequence can be formulated as a Markov decision process (MDP). The goal of the agent is to choose the action that maximizes the sum of future rewards, where the return at time t is given as R_t = Σ_{t₁=t}^{T} γ^{t₁−t} r_{t₁}, with T being the episode time. We then use Q*(s, a) as the optimal action-value function for a given sequence s and action a, where Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], with π being the distribution over actions. To estimate Q*(s, a), we use a deep neural network (Q-network) with parameters β as a non-linear function approximator of the form Q(s, a; β), where it is expected that Q(s, a; β) ≈ Q*(s, a). This network is trained with the objective of minimizing a sequence of loss functions L_k(β_k), where

L_k(β_k) = E_{s,a∼ψ(·)} [(y_k − Q(s, a; β_k))²],    (4)

where y_k = E_{s′} [r + γ max_{a′} Q(s′, a′; β_{k−1}) | s, a] is the target for iteration k and ψ(·) is the probability distribution over sequences s and actions a.
The neural net then back-propagates the gradient, given as

∇_{β_k} L_k(β_k) = E_{s,a∼ψ(·)} [(r + γ max_{a′} Q(s′, a′; β_{k−1}) − Q(s, a; β_k)) ∇_{β_k} Q(s, a; β_k)]    (5)

The parameters of the neural network are updated as

β_{k+1} = β_k − α ∇_{β_k} L_k(β_k)    (6)

where α is the learning rate of the neural network.
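A minimal TensorFlow 2 sketch of the update in equations 4-6 follows; it assumes q_net is the Q-network with parameters β, target_net holds the previous iteration's parameters β_{k−1} (in practice a periodically synchronized copy of q_net), terminal is a 0/1 float tensor, and the batch tensors come from the replay memory described next. RMSProp and the learning rate follow the settings reported in Section 4:

import tensorflow as tf

GAMMA, NUM_ACTIONS = 0.99, 5
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)   # alpha in equation 6

def dqn_update(q_net, target_net, states, actions, rewards, next_states, terminal):
    # y_k = r                                         if the next state is terminal
    #     = r + gamma * max_a' Q(s', a'; beta_{k-1})  otherwise  (target in equation 4)
    next_q = tf.reduce_max(target_net(next_states), axis=1)
    y = rewards + GAMMA * next_q * (1.0 - terminal)
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                        # Q(s, .; beta_k)
        q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(tf.stop_gradient(y) - q_sa))  # equation 4
    grads = tape.gradient(loss, q_net.trainable_variables)            # equation 5
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))  # equation 6
    return loss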
Furthermore, a technique called experience replay
(Mnih et al., 2013) is used to improve convergence.
This occurs through exposing the model to human-
controlled navigation and decision making. To imple-
ment experience replay, the agent's experience e_t = (s_t, a_t, r_t, s_{t+1}) at each time step t is stored, where s_t is the current state, a_t is the action, r_t is the reward,
Figure 2: Architecture of Path Planning and Navigation system.
s_{t+1} is the next state after taking action a_t. The experience calculations presented result from the human-controlled navigation training and interaction with the environment. Based on the human interaction, various episodes {e_i}, i = 1, ..., N, are then stored in a memory buffer M. During the inner-loop Q-learning updates, a sample of experiences is drawn randomly from the memory buffer M. The agent then selects and executes an action based on an ε-greedy policy. The approach of sampling randomly from experience replay enables the agent to learn more rapidly via improved exposure to reactions to different environmental conditions during an episode of training and allows the model parameters to be updated based on diverse and less correlated state-action data. The algorithm corresponding to the deep Q-learning approach is presented in Algorithm 1, and the implementation methodology is presented in Figure 1.
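A simple sketch of the memory buffer M and the random sampling step is given below; the capacity and batch size follow the values reported in Section 4, and the exact tuple layout is an assumption:

import random
from collections import deque

class ReplayMemory:
    """Stores experiences e_t = (s_t, a_t, r_t, s_{t+1}) from both agent
    self-play and the recorded human-controlled navigation runs."""

    def __init__(self, capacity=20000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def store(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size=32):
        # Random sampling de-correlates the state-action data used for updates.
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))   # states, actions, rewards, next_states, terminals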
3.8 Network Architecture
The DQN framework is built on top of a VGG-net-like framework (Bhattarai and Martínez-Ramón, 2020) as a backbone and is shown in Figure 2. The backbone framework is used as a feature extractor that produces 4096-d features from a 224×224 infrared/thermal image. The VGG framework is frozen during the training. A stack of 4 history frames is used as the state input to account for the past sequence of actions of the agent. The concatenated feature set comprising 4 × 4096 values is then projected into a 512-d embedding space. This vector is passed through a fully-connected layer producing 5 policy outputs, which give the probability over actions and the value output.
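The following Keras sketch approximates this architecture. Note that the paper's backbone is its own VGG-like network trained on thermal imagery (Bhattarai and Martínez-Ramón, 2020); the stock ImageNet VGG16 'fc2' features are substituted here purely as a stand-in, and grayscale infrared frames are assumed to be replicated to three channels before being fed to it:

import tensorflow as tf

# Frozen VGG backbone producing 4096-d features from 224x224x3 inputs.
vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
backbone = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output, name="backbone")
backbone.trainable = False                    # backbone stays frozen during DQN training

# Stack of 4 history frames -> 4 x 4096 features -> 512-d embedding -> 5 outputs.
frames = tf.keras.Input(shape=(4, 224, 224, 3), name="history_frames")
feats = tf.keras.layers.TimeDistributed(backbone)(frames)   # (batch, 4, 4096)
feats = tf.keras.layers.Flatten()(feats)                      # (batch, 4*4096)
embed = tf.keras.layers.Dense(512, activation="relu")(feats)
outputs = tf.keras.layers.Dense(5)(embed)                     # one output per discrete action
dqn = tf.keras.Model(frames, outputs, name="dqn_head")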
4 EXPERIMENTAL RESULTS
The implementation of the DQN model was done in TensorFlow (Abadi et al., 2016) on dual NVIDIA GeForce 1080Ti GPUs. The DQN framework was trained with the RMSProp optimizer (Tieleman and Hinton, 2012) with a learning rate of 10⁻⁴ and a batch size of 32. The training was performed ε-greedily, with ε starting at 1 and decaying to 0.1 over 5,000 frames; training afterwards continued with ε fixed at 0.1. The whole training was performed with as many as 100,000 frames with a replay memory of 20,000 frames.
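The ε schedule described above can be sketched as a simple linear decay:

def epsilon(frame_idx, start=1.0, end=0.1, decay_frames=5000):
    # Linear decay from 1.0 to 0.1 over the first 5,000 frames,
    # then held constant at 0.1 for the remainder of training.
    if frame_idx >= decay_frames:
        return end
    return start + (end - start) * frame_idx / decay_frames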
While training, the gradients are back-propagated
from the Q-layer outputs back to lower-level layers
while the backbone model is frozen. The navigation
performance was measured by the agent’s ability to
reach 100 different targets set in a given fire envi-
ronment. The target was placed at various locations
in the virtual building on different floors where the
agent needed to navigate using a combination of all
actions. The actions of the agent in successful navi-
gation are shown in Figure 3. The images correspond
Figure 3: Demonstration of agent actions in the fire environment as dictated by the reinforcement learning algorithm. The four primary actions demonstrated are, respectively, turning left, turning right, jumping, and moving forward. This enables the agent to avoid fire and obstacles and safely reach a given destination.
to the agent's actions of moving forward, turning left, turning right, and jumping. It is complicated to report the average trajectory length due to the constant changes occurring along the virtual path. In this simulation, the agent needs to avoid fire, and the trade-off for that is time. Imposing a time constraint and weighting the reinforcement model to reward or penalize according to both criteria is a goal for future work. To prove the efficacy of the proposed method's learning, we worsened the situation by adding more fire occurrences. The fire volume per scene was increased from 10% to 80%, adding fire at random locations. The shortest-path planning strategy failed to reach the destination at a fire percentage of 30%, whereas our proposed algorithm was able to navigate to the destination for fire percentages as high as 76%. The agent was consistently able to navigate to the destination with a narrow escape. For extreme fire conditions, we carefully increased the rewards and defined additional penalties (distance to fire) to improve the agent's learning conditions. The deep Q-learning agent need only be trained under certain fire conditions, and it can exploit that knowledge to navigate under different fire conditions. For example, in this setup, the agent was trained at a fire percentage of 40%, and it was able to navigate at fire percentages as high as 76%.
The main goal of the proposed algorithm is to find the smallest number of combinations of actions that lets the agent navigate from its current position to the destination while avoiding fire. Due to the dynamic nature of the environment, when we attempted to solve this problem with other path planning techniques, including the shortest-path technique, breadth-first search (BFS) (Beamer et al., 2012), depth-first search (DFS) (Tarjan, 1972), A* (LaValle, 2006), and random walk (Spitzer, 2013), the probability of the agent reaching the destination was very low (less than 5%) in the simulation environment. Since these methods use a single-shot decision map to navigate the agent to the destination, the agent was unable to quickly adapt to the continuously varying surroundings. When the agent encountered fire that was not present at decision time, it failed to reach the target in most cases. In contrast, the agent governed by our proposed method was able to reach the destination with a probability greater than 80%.
During the evaluation of the agent's navigation performance with different path planning algorithms, including the proposed one, we observed that random initialization of the agent and target positions tended to place them far apart, which increased the chance that the agent would fail to reach the destination. To address this challenge, we selected a distance parameter L between the agent and the target, starting with a small value and increasing it over
the course of training. This allows the agent to accumulate better navigation experience in a continually changing environment and yields a more extensive distribution of training samples.
During navigation, when the DQL-trained agent is unable to find a path to proceed, we design penalties so that the agent is constrained either to stay in the same position until the path clears or to retrace its steps backwards to the previous viable path. We acknowledge that this part of the algorithm requires further attention, as remaining in one place in a real fire scenario is not realistic.
5 MOVEMENT PLANNING
THROUGH DEEP Q LEARNING
FOR FIREFIGHTING
APPLICATION
We have demonstrated the potential that deep Q-learning based algorithms hold as a base framework on which a successful navigation assistant can be built. This methodology can provide an efficient decision-making system for aiding firefighters whose decision-making abilities may be impaired due to disorientation, anxiety, and heightened stress levels. This work presents a novel approach to eliminating faulty decisions made under duress through the application of AI planning paradigms. The paths followed by the firefighters are useful for determining their positions, which is particularly important in search and rescue.
Existing path planning algorithms can process the information of all paths followed by the firefighters, but they fail under the constantly changing nature of the fire ground, which can render a previously defined rescue plan unavailable. Also, the presence of smoke and other visual impairments can make rapid identification of these incidents by a firefighter difficult. Incidents on the fire ground are hardly predictable by a machine learning system. Machine learning does, however, perform well in rapid assessment and production of a decision given the current set of circumstances. In other research outside the scope of this
stances. In other research outside of the scope of this
paper, (Bhattarai and Mart
´
ıNez-Ram
´
on, 2020; Bhat-
tarai et al., 2020) have developed a machine learn-
ing based methodology that detects and tracks objects
of interest such as doors, ladders, people and fire in
the thermal imagery generated by firefighter’s thermal
cameras. Such information may be valuable to fur-
ther improve the reinforcement learning algorithm’s
ability to understand aspects of the environment that
may be used in navigation or escape. Future work
looks to incorporate a similar object detection work
with the path planning work described here to make
a robust, navigation assistant that is capable of un-
derstanding the surrounding environment outside of
fire presence and then recommend best paths to fire
fighters. To deploy the agent in a real fire situation,
we also aim to first construct the 3D map based on
multimodal data(RGB, infrared and depth map) col-
lected from various sensors attached to firefighter’s
body sensors. Such a map can be imported to the em-
ulator to train the agent in more natural look-a-like
environment.
Assuming that a path has been previously deter-
mined by the system by using the information coming
from the camera of the rescuer, the rescuer has access
to an initial rescue path. The system tells the rescuer
to take a direction, which is the present action. The
states will be represented by the objects present in the
scene. The objects of interest can be represented in
a matrix that contains the extracted image features.
Each detected feature has a different reward. Fire and
obstacles have associated penalties, while a clear path
has a positive reward. If an obstacle is detected, then
the firefighter is told to take a different direction. The
new state will be computed for the action taken and
a new path will be traced. The recursion repeats un-
til the rescuer has reached the desired position. The
plan for this part of the research will include an initial
model constructed by simulation. This will be useful
to determine the right design for the neural network in
terms of stability and convergence speed in different
simulated situations. When incorporating information
from other paths followed by other fire fighters in a
real scene, it is not evident that all obstacles can be
determined by their past experiences due to the dy-
namic nature of a real fire scene. Nevertheless, the
parallax estimation obtained from sequences of cam-
eras in motion can be helpful. Parallax data can be
used to determine the depth of a given path because
it gives the distances between the camera and the key
points detected by the SIFT algorithm. This informa-
tion only needs to be stored and compared with fu-
ture sequences of the same path. We can consider that
an obstacle has been found in a previously clear path
if the estimated depth has dramatically changed. In
this case. the direction pointed by the camera will be
given a low reward instead of a high one.
6 CONCLUSION AND FUTURE
DIRECTIONS
We present a deep Q-learning based agent trained in
a virtual environment that is able to make decisions
for navigation in an adaptive way in a fire scene. The
Unreal Engine was used to emulate the fire environment, and AirSim was used to communicate data and controls between the virtual environment and the deep learning model. The agent was able to successfully navigate extreme fires based on its acquired knowledge and experience.
This work serves as the foundation on which to
build a deep learning framework that is capable of
identifying objects within the environment and incor-
porating those objects into its decision making pro-
cess in order to successfully deliver safe, navigable
routes to firefighters.
The learning process is currently slow and needs several hours of training. In the future, we aim to utilize A2C- and A3C-based reinforcement learning models to train a shared model used in parallel by multiple agents with multiple goals simultaneously. We also aim to use deep learning based results such as object detection, tracking, and segmentation to create a more informative situational awareness map of the reconstructed 3D scene.
The proposed system is intended to be integrated into a geographic and visual environment with floor plan data, which will also include scene information about fire locations, doors, windows, detected firefighters, the health condition of the firefighters, and other features collected from the sensors carried in the firefighters' gear. This information will be transmitted over a robust communication system to an incident commander to produce a fully fledged situational awareness system.
ACKNOWLEDGEMENTS
This work was supported by the National Science
Foundation (NSF) Smart & Connected Communities
(S&CC) Early-Concept Grants For Exploratory Re-
search (EAGER) under Grant 1637092. We would
like to thank the UNM Center for Advanced Research
Computing, supported in part by the National Sci-
ence Foundation, for providing the high-performance
computing, large-scale storage, and visualization re-
sources used in this work. We would also like to thank
Sophia Thompson for her valuable suggestions and
contributions to the edits of the final drafts.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A.,
Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
M., et al. (2016). Tensorflow: A system for large-
scale machine learning. In 12th {USENIX} sympo-
sium on operating systems design and implementation
({OSDI} 16), pages 265–283.
Anderson, J. D. and Wendt, J. (1995). Computational fluid
dynamics, volume 206. Springer.
Bae, H., Kim, G., Kim, J., Qian, D., and Lee, S. (2019).
Multi-robot path planning method using reinforce-
ment learning. Applied Sciences, 9(15):3057.
Beachly, E., Detweiler, C., Elbaum, S., Duncan, B., Hilde-
brandt, C., Twidwell, D., and Allen, C. (2018). Fire-
aware planning of aerial trajectories and ignitions.
In 2018 IEEE/RSJ International Conference on In-
telligent Robots and Systems (IROS), pages 685–692.
IEEE.
Beamer, S., Asanovic, K., and Patterson, D. (2012).
Direction-optimizing breadth-first search. In SC’12:
Proceedings of the International Conference on High
Performance Computing, Networking, Storage and
Analysis, pages 1–10. IEEE.
Bellman, R. (1966). Dynamic programming. Science,
153(3731):34–37.
Bhattarai, M., Jensen-Curtis, A. R., and Martínez-Ramón,
M. (2020). An embedded deep learning system for
augmented reality in firefighting applications. arXiv
preprint arXiv:2009.10679.
Bhattarai, M. and Martínez-Ramón, M. (2020). A deep
learning framework for detection of targets in ther-
mal images to improve firefighting. IEEE Access,
8:88308–88321.
Goodwin, M., Granmo, O.-C., and Radianti, J. (2015).
Escape planning in realistic fire scenarios with ant
colony optimisation. Applied Intelligence, 42(1):24–
35.
Jarvis, R. A. and Marzouqi, M. S. (2005). Robot path plan-
ning in high risk fire front environments. In TENCON
2005-2005 IEEE Region 10 Conference, pages 1–6.
IEEE.
Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Sallab, A.
A. A., Yogamani, S., and Pérez, P. (2020). Deep rein-
forcement learning for autonomous driving: A survey.
arXiv preprint arXiv:2002.00444.
LaValle, S. M. (2006). Planning algorithms. Cambridge
university press.
Lei, X., Zhang, Z., and Dong, P. (2018). Dynamic path
planning of unknown environment based on deep re-
inforcement learning. Journal of Robotics, 2018.
Li, Y., Li, C., and Zhang, Z. (2006). Q-learning based
method of adaptive path planning for mobile robot.
In 2006 IEEE international conference on information
acquisition, pages 983–987. IEEE.
Meyes, R., Tercan, H., Roggendorf, S., Thiele, T., Büscher,
C., Obdenbusch, M., Brecher, C., Jeschke, S., and
Meisen, T. (2017). Motion planning for industrial
robots using reinforcement learning. Procedia CIRP,
63:107–112.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing atari with deep reinforcement learn-
ing. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-
level control through deep reinforcement learning. na-
ture, 518(7540):529–533.
Qiu, W. and Yuille, A. (2016). Unrealcv: Connecting com-
puter vision to unreal engine. In European Conference
on Computer Vision, pages 909–916. Springer.
Ranaweera, D. M., Hemapala, K. U., Buddhika, A., and
Jayasekara, P. (2018). A shortest path planning al-
gorithm for pso base firefighting robots. In 2018
Fourth International Conference on Advances in Elec-
trical, Electronics, Information, Communication and
Bio-Informatics (AEEICB), pages 1–5. IEEE.
Romero-Martí, D. P., Núñez-Varela, J. I., Soubervielle-
Montalvo, C., and Orozco-de-la Paz, A. (2016). Navi-
gation and path planning using reinforcement learning
for a roomba robot. In 2016 XVIII Congreso Mexicano
de Robotica, pages 1–5. IEEE.
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Air-
sim: High-fidelity visual and physical simulation for
autonomous vehicles. In Field and service robotics,
pages 621–635. Springer.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I.,
Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., et al. (2017). Mastering the game of go
without human knowledge. nature, 550(7676):354–
359.
Spitzer, F. (2013). Principles of random walk, volume 34.
Springer Science & Business Media.
Su, H.-S. and Su, K.-L. (2012). Path planning of fire-
escaping system for intelligent building. Artificial Life
and Robotics, 17(2):216–220.
Surmann, H., Jestel, C., Marchel, R., Musberg, F., El-
hadj, H., and Ardani, M. (2020). Deep reinforce-
ment learning for real autonomous mobile robot
navigation in indoor environments. arXiv preprint
arXiv:2005.13857.
Tang, Z., Shao, K., Zhao, D., and Zhu, Y. (2017). Re-
cent progress of deep reinforcement learning: from
alphago to alphago zero. Control Theory and Appli-
cations, 34(12):1529–1546.
Tarjan, R. (1972). Depth-first search and linear graph algo-
rithms. SIAM journal on computing, 1(2):146–160.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop:
Divide the gradient by a running average of its recent
magnitude. COURSERA: Neural networks for ma-
chine learning, 4(2):26–31.
Vadlamani, V. K., Bhattarai, M., Ajith, M., and Martínez-Ramón, M. (2020). A novel indoor positioning system
for unprepared firefighting scenarios. arXiv preprint
arXiv:2008.01344.
Zhang, L., Liu, Z., Zhang, Y., and Ai, J. (2018). Intel-
ligent path planning and following for uavs in forest
surveillance and fire fighting missions. In 2018 IEEE
CSAA Guidance, Navigation and Control Conference
(CGNCC), pages 1–6. IEEE.
Zhang, Z. (2020). Path planning of a firefighting robot pro-
totype using gps navigation. In Proceedings of the
2020 3rd International Conference on Robot Systems
and Applications, pages 16–20.