Recurrent Neural Networks Analysis for Embedded Systems

Gonçalo Fontes Neves, Jean-Baptiste Chaudron and Arnaud Dion

ISAE-SUPAERO, Université de Toulouse, France

goncalo.fontes-neves@student.isae-supaero.fr, {jean-baptiste.chaudron, arnaud.dion}@isae-supaero.fr

Keywords:

Deep Learning, RNN, GRU, LSTM, Embedded Systems.

Abstract:

Artificial Neural Networks (ANNs) are biologically inspired algorithms that are especially efficient for pattern recognition and data classification. In particular, Recurrent Neural Networks (RNNs) are a specific type of ANN that models and processes sequences of data with temporal relationships. They therefore offer interesting behavior for embedded systems applications such as autopilot systems. However, RNNs (and ANNs in general) are computationally intensive algorithms, especially during training. This calls for careful integration and proper analysis on the embedded systems that host these functionalities. In this paper, we present an analysis of two types of Recurrent Neural Networks, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), and explain their architectures and characteristics. We propose our own implementation, which is tested and validated on embedded devices with a dedicated dataset.

1 INTRODUCTION

As industry and services change their business models to rely increasingly on automation, the search for ways to make machines more independent and capable of interpreting the world by themselves has rapidly intensified. One way of doing this has been through the many fields of so-called Artificial Intelligence (AI). The terms AI, Machine Learning (ML) and Deep Learning (DL) are often used interchangeably, but must be seen as separate concepts as they are, in fact, sub-fields of one another. Deep Learning groups various techniques and theories that allow a digital processing unit to learn from a set of data. Artificial Neural Networks (ANNs) are one of the most used paradigms of Deep Learning, designed to mimic the principles and behavior of the neurons in a human brain. ANNs are able to model a function by masking its complexity in a series of hidden layers which can be adjusted in size and number. ANN technology has grown rapidly over recent years and is today at the heart of scientific concerns for applications in multiple research areas such as medicine, with genome analysis and general electronic health record tracking for predictive diagnosis (Yazhini and Loganathan, 2019), Natural Language Processing (NLP) (Kamiş and Goularas, 2019) or autopilots for self-driving vehicles (Kulkarni et al., 2018; Grigorescu et al., 2019).

https://orcid.org/0000-0001-6939-5096

https://orcid.org/0000-0002-2142-1336

https://orcid.org/0000-0002-1264-0879

In the context of the PRISE project, we are investigating new concepts and techniques for embedded systems such as next-generation fault-tolerant flight control systems. As for car autopilots (Grigorescu et al., 2019), we are interested in the integration of AI concepts into future Unmanned Aerial Vehicle (UAV) and aircraft autopilots, such as learning the pilot's (or crew's) skills and profile (Baomar and Bentley, 2016), or helping the pilot in difficult situations such as landing in critical conditions (Baomar and Bentley, 2017). Recurrent Neural Networks (RNNs) are a specialized class of ANNs which can efficiently process data that contains temporal relationships by integrating a time-dependent feedback loop in their memory. This handling of temporal relations makes these types of ANNs very promising for autopilot and flight control applications (Salehinejad et al., 2018; Flores and Flores, 2020). However, as embedded systems generally have a limited amount of memory and processing power (Rezk et al., 2020), the implementation of RNNs must be carefully analysed.

PRISE: French acronym for Platform for Embedded Systems Research and Engineering.

Neves, G., Chaudron, J. and Dion, A. Recurrent Neural Networks Analysis for Embedded Systems. DOI: 10.5220/0010715700003063. In Proceedings of the 13th International Joint Conference on Computational Intelligence (IJCCI 2021), pages 374-383. ISBN: 978-989-758-534-0; ISSN: 2184-3236.

This paper focuses on the study of two types of RNNs: the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). We present a detailed analysis of their architectures and specificities. Then, we propose our own open-source implementation, which has been validated on a dedicated test case. This paper is organised as follows:

• Section 2 describes the state of the art for ANNs

and focuses on the speciﬁcities of RNNs.

• Section 3 explains how ANNs can learn with

back-propagation and what are the parameters to

consider for its implementation.

• Section 4 outlines the architectural aspects of

LSTM and GRU and their characteristics.

• Section 5 presents the experimentation details and shows our results.

• Section 6 concludes and offers some perspectives for future work.

2 STATE OF THE ART

2.1 The Neuron Concept

A Neural Network, as defined by Haykin (Haykin, 1999), is a system made up of interconnections between a large number of nonlinear processing units called neurons. The first algorithm considered to implement a neuron was the perceptron algorithm (Rosenblatt, 1957), which was trained as a binary classifier and quickly extended to perform the operations of logical gates such as AND and OR (the XOR gate is not linearly separable and requires more than one layer). From this baseline, more complex structures have been developed, but the basic building blocks of a neuron usually remain fundamental to all types of structures, with three essential elements (cf. Figure 1):

1. A set of connections, or synapses. Each connection is associated with a weight (noted W in Figure 1). This weight defines the influence of the signal passing through the connection.

2. An adder function (∑ in Figure 1) that sums the weighted inputs of the neuron and the bias (additional parameter, noted b in Figure 1).

3. An activation function (noted Φ(...) in Figure 1), which is essentially the processing unit of the neuron. The selection of this function is crucial for the implementation and behavior of ANNs (see Section 3.2).
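The three elements listed above can be sketched in a few lines. This is only an illustrative sketch in Python (the paper's implementation is in C++); the function names `neuron` and `step`, and the weights chosen for the AND gate, are assumptions made for the example.

```python
import math


def neuron(inputs, weights, bias, activation=math.tanh):
    """Adder (weighted sum of the inputs plus bias) followed by the activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)


def step(v):
    """Threshold activation (ON/OFF), as used by early perceptrons."""
    return 1.0 if v >= 0.0 else 0.0


# A single neuron with a threshold activation acting as a logical AND gate.
# The weights (1, 1) and the bias (-1.5) are hand-picked for the example.
and_gate = [neuron([a, b], weights=[1.0, 1.0], bias=-1.5, activation=step)
            for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Swapping `step` for a smooth function such as `math.tanh` gives the differentiable neuron needed for gradient-based training.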

2.2 Towards More Complex Structures

Moving on from the neuron as the basic building block of ANNs, the structure has been extended and updated to create architectures with more neuron entities and layers, starting with the three-layer perceptron (Irie and Miyake, 1988). This introduced scalability to the original model, with stacked neuron entities in an input layer, multiple hidden layers and an output layer, as depicted in Figure 2. These updated structures allow new applications, making it possible to model more complex systems and provide more sophisticated outputs.

Figure 1: Generic model of a neuron.

Figure 2: From neuron entity to MLP.

From this baseline, investigations have been performed to extend these concepts to different problems and applications. Nowadays, there are many different types of ANNs which can be combined and mixed. However, the three major types of ANNs remain:

• As explained previously, the Multi-Layer Perceptron (MLP) is an extended version of the original perceptron (Irie and Miyake, 1988) that contains three or more layers. Each layer has one or several nodes (or neurons). Each node of a layer can be fully or partially connected to the nodes of the following layer.

• The Convolutional Neural Network (CNN) is a variation of the MLP that uses convolution operations (matrix operations) between layers and shows outstanding results in image classification and speech recognition (Rawat and Wang, 2017).

• The Recurrent Neural Network (RNN) is analysed in this paper and described in Section 2.3.


2.3 Recurrent Neural Networks (RNNs)

An RNN has an internal memory state that creates a feedback loop. This principle allows a temporal relationship between the data: the output of a cell is influenced not only by the input but also by the previous computations. Figure 3 shows this temporal relationship, going from a folded RNN graph to an unfolded RNN graph with the different time-steps.

Figure 3: Unfolding feedback loop for time relation.

The first version of RNNs was proposed in (Elman, 1990) and used this approach with a feedback loop in the cells, but it showed a limitation. With a large number of time steps and cells, RNNs usually suffered from the vanishing gradient problem (Pascanu et al., 2013) during training with back-propagation (explained in Section 3). Training such RNNs with any gradient-based approach was therefore difficult or even impossible because of these vanishing or exploding gradient phenomena (Rosindell and Wong, 2018). To tackle this issue, new specific RNN architectures have been proposed. These architectures overcome the problem by implementing separate gates to add and also remove information about past states. Nowadays, the two most popular RNN architectures are the LSTM (Gers et al., 2000) and the GRU (Cho et al., 2014). As both outperform the vanilla RNN (Chung et al., 2014), we have chosen them for our work. They are described in Section 4.

3 BACK-PROPAGATION

3.1 Basic Principles

At the heart of all ANN algorithms is the ability to learn from experience. While the biological process is not yet completely understood, attempts to mimic this system have been studied for years. The first perceptron model proposed by (Rosenblatt, 1957) had some interesting concepts but limited learning capabilities. Also, at that time, processing machines were not powerful enough to compute large amounts of data and to enable learning (Minsky and Papert, 1969). The concept of back-propagation, and its extension through time, is described by Paul Werbos in (Werbos, 1990). The main purpose of these algorithms is to propagate the error signal generated at the output by the feed-forward pass back across the network. Thus, the contribution of each parameter to the error can be calculated and used to correct the weights of the connections and the biases. To sum up, the feed-forward process generates the outputs from the inputs, and the back-propagation process updates the weights and biases from the error in the opposite direction (cf. Figure 4).

Figure 4: Feed-forward and back-propagation illustration.

In large ANNs, this process happens layer by layer: the error signal is first generated at the output layer, then propagated to each neuron. For RNNs, the back-propagation, called Back-Propagation Through Time (BPTT), consists of unfolding the whole neural network (unrolling all the time-steps). Then, the error is calculated from cell to cell, calculating the variations caused by the error from the previous time-step and using the chain rule to determine the derivative of each operation of the feed-forward process. The goal is to determine the error produced by the output with respect to the temporal relationship (inherent to the dataflow) and to calculate the corresponding gradients. These gradients are used to update the weights and biases in order to decrease the error (see Section 3.3). To sum up, back-propagation is based on two concepts for all ANNs:

• A gradient calculation method which is based on the first-order derivatives of the activation functions Φ(...) with respect to their input parameters.

• An optimization algorithm to update the weights and biases from the gradient calculation.

NCTA 2021 - 13th International Conference on Neural Computation Theory and Applications


3.2 Activation Functions

The activation functions are functions that mimic the

behaviour of biological neurons. In the early days of

ANNs, the functions used were based on a threshold

logic output (basically ON or OFF). Then, the use of

nonlinear functions became essential to solve nontriv-

ial complex problems. For LSTM and GRU neural

networks, two activation functions are used, the sig-

moid and the hyperbolic tangent.

3.2.1 Sigmoid Activation Function

The sigmoid activation function, noted σ in this paper, is a popular logistic function that has been widely used since the early days of ANNs. Equation 1 represents the function in exponential form. An interesting characteristic of this function is that it maps all real numbers to the range from 0 to 1.

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

For back-propagation purposes, the gradient can be expressed in terms of the function itself, as shown in Equation 2. This is interesting for the back-propagation process because it requires fewer arithmetic operations.

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \tag{2}$$

3.2.2 Hyperbolic Tangent Activation Function

The hyperbolic tangent activation function, noted τ in this paper, is another popular S-shaped function. Similar to the sigmoid function, the hyperbolic tangent maps all real numbers to the range from -1 to 1. The function can be expressed in exponential terms as shown in Equation 3.

$$\tau(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{3}$$

As for the sigmoid function, for back-propagation purposes, the gradient can be expressed in terms of the function itself, as shown in Equation 4.

$$\tau'(x) = 1 - \tau(x)^2 \tag{4}$$
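The four activation formulas can be checked numerically. The sketch below is illustrative Python (not the paper's C++ code); the function names are assumptions, and the gradients take the already computed output as argument, which is what saves arithmetic during back-propagation.

```python
import math


def sigmoid(x):
    """Equation (1). Assumes moderate |x|: math.exp overflows for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad_from_output(s):
    """Equation (2), written in terms of the output s = sigmoid(x)."""
    return s * (1.0 - s)


def tanh_from_exp(x):
    """Equation (3): hyperbolic tangent in exponential form."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def tanh_grad_from_output(t):
    """Equation (4), written in terms of the output t = tanh(x)."""
    return 1.0 - t * t


def numeric_grad(f, x, h=1e-6):
    """Central finite difference, used only to cross-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Comparing `sigmoid_grad_from_output(sigmoid(x))` with `numeric_grad(sigmoid, x)` for a few values of `x` is a quick sanity check before using the formulas in a back-propagation routine.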

3.3 Optimization Algorithms

The ﬁrst choice made towards the back-propagation

algorithm is picking a loss function, meaning the

function that will provide the measurement for how

poorly the model is performing. For this work the

loss function used is the mean squared error (Brown-

lee, 2019) which is the average of the squared error

per output and per time-step. The expression (5) rep-

resents how this value is calculated where N is the

number of time-steps (in the time window under anal-

ysis) and M is the dimension of the output of the net-

work conﬁguration.

$$\Gamma = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{M} \sum_{j=1}^{M} (\lambda_{i,j} - y_{i,j})^2 \right) \tag{5}$$
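As a small worked example of this loss, the sketch below averages the squared error over the M outputs of each time-step and then over the N time-steps of the window. It is illustrative Python; the function name and the list-of-lists layout (one row per time-step) are assumptions.

```python
def mean_squared_error(labels, outputs):
    """Average squared error per output and per time-step.

    labels and outputs are lists of N rows (time-steps), each with M values.
    """
    n = len(labels)
    m = len(labels[0])
    total = 0.0
    for label_row, output_row in zip(labels, outputs):
        # Mean over the M outputs of this time-step.
        total += sum((l - y) ** 2 for l, y in zip(label_row, output_row)) / m
    # Mean over the N time-steps of the window.
    return total / n
```

For instance, with labels `[[1, 2], [3, 4]]` and outputs `[[1, 0], [1, 4]]`, each time-step contributes (0 + 4) / 2 = 2, so the loss is 2.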

In our work, we have considered three optimiza-

tion algorithms, which are described in the follow-

ing sections. A survey of optimization algorithms for

back-propagation can be found in (Ruder, 2016). For

the sake of brevity of the notation, the parameters are

all referred to with θ and their gradients calculated

with ∇θ.

3.3.1 Stochastic Gradient Descent (SGD)

The SGD method with momentum (Qian, 1999) relies

on two hyper-parameters: the learning rate (noted α)

and the momentum (noted β). The tuning of the hyper

parameters was done maintaining the proposed value

for the momentum from (Ruder, 2016): β = 0.9. The

learning rate was trialled decrementally from 0.1 until

convergence was assured at α = 0.0001.

$$m_t = (\beta - 1)\,\nabla\theta_{t-1} + \beta \cdot m_{t-1} \tag{6}$$

$$\theta_t = \theta_{t-1} - \alpha \cdot m_t \tag{7}$$
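A minimal sketch of a momentum update on a toy problem is given below. It is illustrative Python, not the paper's code. It uses the common convention in which `grad` is the gradient of the loss, so it enters the momentum with weight (1 − β); this sign convention is an assumption and may differ from the one used for ∇θ in expression (6). The quadratic loss (θ − 3)² and the function names are also assumptions made for the example.

```python
def sgd_momentum_step(theta, m, grad, alpha, beta):
    """One SGD-with-momentum update; grad is the gradient of the loss."""
    m = beta * m + (1.0 - beta) * grad   # exponentially averaged gradient
    theta = theta - alpha * m            # step against the averaged gradient
    return theta, m


def minimise_quadratic(alpha=0.1, beta=0.9, steps=500):
    """Minimise the toy loss (theta - 3)^2, whose gradient is 2 * (theta - 3)."""
    theta, m = 0.0, 0.0
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)
        theta, m = sgd_momentum_step(theta, m, grad, alpha, beta)
    return theta
```

The learning rate used here is much larger than the α = 0.0001 reported above; it is only suited to this one-dimensional toy loss.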

3.3.2 Adam

The Adam method (Kingma and Ba, 2015) also relies on a learning rate and a momentum (noted $\beta_1$). It introduces a second-order momentum term, governed by another hyper-parameter (noted $\beta_2$), to calculate the correction of the parameters, as well as a small constant ε to avoid division by zero and ensure numerical stability. The hyper-parameters used for training with the Adam optimizer also followed the proposed values from (Kingma and Ba, 2015): $\beta_1$ = 0.99 and $\beta_2$ = 0.999. The learning rate was again decrementally tuned from 0.1 until convergence was assured, this time at α = 0.001, and ε (additional parameter) was incrementally tuned from 10

−

until the oscillations in the testing learning curve were

achieved for ε = 10.0.

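$$m_t = \frac{\beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla\theta_{t-1}}{1 - \beta_1^t} \tag{8}$$

$$v_t = \frac{\beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot (\nabla\theta_{t-1})^2}{1 - \beta_2^t} \tag{9}$$

$$\theta_t = \theta_{t-1} - \frac{\alpha \cdot m_t}{\sqrt{v_t} + \varepsilon} \tag{10}$$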
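For reference, one Adam update can be sketched as follows. This is illustrative Python following the formulation published by Kingma and Ba (2015), in which the bias-corrected moments are kept separate from the running averages; it is not necessarily identical to the paper's C++ implementation. The default arguments are the values published by Kingma and Ba and are used here only for illustration, not as the tuned values reported above.

```python
import math


def adam_step(theta, m, v, grad, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the step index, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the bias correction makes `m_hat` equal to the gradient and `v_hat` equal to its square, so the parameter moves by approximately α regardless of the gradient magnitude.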


3.3.3 Adamax

Adamax is a variation of Adam, also proposed by (Kingma and Ba, 2015), which uses an exponentially weighted infinity norm instead of a second-order momentum term. The values of the hyper-parameters used here were the same as the ones used for Adam and produced the same desired convergence and stability.

$$u_t = \max(\beta_2 \cdot u_{t-1},\, |\nabla\theta_{t-1}|) \tag{11}$$

= θ

t−1

−

α ·u

(1 −β

) ·ε

(12)
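For comparison with Adam, one Adamax update can be sketched as follows. This is illustrative Python following the formulation published by Kingma and Ba (2015); it is an assumption that this matches the paper's implementation term by term. The function name, the default hyper-parameters and the guard against a zero norm are choices made for the example.

```python
def adamax_step(theta, m, u, grad, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One Adamax update for a scalar parameter; t is the step index, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad   # first moment, as in Adam
    u = max(beta2 * u, abs(grad))          # exponentially weighted infinity norm
    if u == 0.0:
        # No gradient seen so far: leave the parameter unchanged.
        return theta, m, u
    theta = theta - (alpha / (1.0 - beta1 ** t)) * m / u
    return theta, m, u
```

Because `u` is a running maximum rather than a running average of squares, no square root is needed in the update.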

4 GRU/LSTM ARCHITECTURES

4.1 Notations

All the derivatives mentioned in this article are derivatives of the output with respect to a certain vector or matrix (Φ): $\partial\Gamma(e)/\partial\Phi$. However, for the sake of simplicity, the derivatives will just be identified by what they are with respect to: ∂Φ. This interpretation also extends to the use of the ∇ symbol representing gradients. This means that ∇Φ does not represent the gradient of Φ but rather the gradient of the output with respect to Φ.

4.2 GRU Architecture

As can be seen in the schematic representation of the GRU cell in Figure 5, the GRU architecture is composed of three gates: the reset gate (R), the update gate (Z) and the candidate gate (G) (see expressions (13), (14) and (15) for mathematical descriptions).

Figure 5: Anatomy of a GRU cell.

$$R_t = \sigma(U_R \times Y_{t-1} + W_R \times X_t + b_R) \tag{13}$$

$$Z_t = \sigma(U_Z \times Y_{t-1} + W_Z \times X_t + b_Z) \tag{14}$$

$$G_t = \tau(W_G \times X_t + U_G \times (R_t \cdot Y_{t-1}) + b_G) \tag{15}$$

The output for the step is calculated using the re-

sults of expression (16) where 1 represents a vector of

ones.

$$Y_t = (1 - Z_t) \cdot Y_{t-1} + Z_t \cdot G_t \tag{16}$$
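The forward pass of one GRU time-step can be sketched as follows. This is illustrative Python with plain lists (the paper's implementation is in C++); the dictionary keys such as `"W_R"`, the helper names and the row-per-hidden-unit matrix layout are assumptions made for the example.

```python
import math


def sigmoid(v):
    """Logistic function; assumes moderate |v| (math.exp overflows otherwise)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Product of a matrix (list of rows) with a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def affine(U, hidden, W, x, b):
    """Compute U x hidden + W x x + b, element by element."""
    return [uh + wx + bias
            for uh, wx, bias in zip(matvec(U, hidden), matvec(W, x), b)]


def gru_step(x, y_prev, p):
    """One GRU time-step: reset gate, update gate, candidate gate, output."""
    r = [sigmoid(v) for v in affine(p["U_R"], y_prev, p["W_R"], x, p["b_R"])]
    z = [sigmoid(v) for v in affine(p["U_Z"], y_prev, p["W_Z"], x, p["b_Z"])]
    gated = [ri * yi for ri, yi in zip(r, y_prev)]   # R_t . Y_{t-1}
    g = [math.tanh(v) for v in affine(p["U_G"], gated, p["W_G"], x, p["b_G"])]
    return [(1.0 - zi) * yi + zi * gi for zi, yi, gi in zip(z, y_prev, g)]


def zero_params(n_inputs, n_hidden):
    """All-zero weights and biases, handy for checking shapes and wiring."""
    params = {}
    for gate in "RZG":
        params["W_" + gate] = [[0.0] * n_inputs for _ in range(n_hidden)]
        params["U_" + gate] = [[0.0] * n_hidden for _ in range(n_hidden)]
        params["b_" + gate] = [0.0] * n_hidden
    return params
```

With all parameters at zero, both sigmoid gates output 0.5 and the candidate is 0, so each step simply halves the previous output; this makes a convenient smoke test.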

These forward-propagation expressions and the

chain rule are used to determine the back-propagation

formulas. To recall, the goal is to establish the contri-

bution of the previous state vector to the error (as part

of BPTT) and the gradients of the weights and biases

of the current cell. Expressions (17), (18) and (19) are

the derivative of the output with respect to each gate.

This is then used to calculate each gate’s contribution

to the error of the model.

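$$\partial R_t = \sigma'(R_t) \cdot Y_{t-1} \cdot (U_G \cdot \partial G_t) \tag{17}$$

$$\partial Z_t = \sigma'(Z_t) \cdot \partial Y_t \cdot (G_t - Y_{t-1}) \tag{18}$$

$$\partial G_t = \tau'(G_t) \cdot \partial Y_t \cdot Z_t \tag{19}$$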

As stated, the expressions (20) and (21) are the

derivatives of the output with respect to the hidden

state and the input respectively, as part of unfolding

the network for BPTT.

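$$\partial Y_{t-1} = U_R \times \partial R_t + U_Z \times \partial Z_t + (U_G \cdot \partial G_t) \times R_t + \partial Y_t \times (1 - Z_t) \tag{20}$$

$$\partial X_t = \partial R_t \times W_R + \partial Z_t \times W_Z + \partial G_t \times W_G \tag{21}$$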

Expressions (22), (23) and (24) give the general form of the gradients used to update the weights and biases during back-propagation. They abbreviate the gradients for all weights and all biases, where ξ ∈ {R, Z, G}.

$$\nabla W_\xi = x \times \partial\xi \tag{22}$$

$$\nabla U_\xi = h \times \partial\xi \tag{23}$$

$$\nabla b_\xi = \partial\xi \tag{24}$$
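Back-propagation formulas of this kind are easy to get wrong, so a finite-difference check is a useful safeguard. The sketch below is illustrative Python for a scalar GRU cell (all vectors and matrices reduced to single numbers), derived with the chain rule from the forward expressions (13) to (16); the function names and the numerical values are assumptions made for the example. For the candidate gate, the hidden term multiplying the gate derivative is the reset-gated state r · y_prev.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, y_prev, p):
    """Scalar GRU step following the forward expressions (13) to (16)."""
    r = sigmoid(p["u_r"] * y_prev + p["w_r"] * x + p["b_r"])
    z = sigmoid(p["u_z"] * y_prev + p["w_z"] * x + p["b_z"])
    g = math.tanh(p["w_g"] * x + p["u_g"] * (r * y_prev) + p["b_g"])
    y = (1.0 - z) * y_prev + z * g
    return y, (r, z, g)


def backward(x, y_prev, p, cache, d_y=1.0):
    """Chain-rule derivatives of the step output, given d_y from the next step."""
    r, z, g = cache
    d_g = (1.0 - g * g) * d_y * z                     # candidate gate
    d_z = z * (1.0 - z) * d_y * (g - y_prev)          # update gate
    d_r = r * (1.0 - r) * (p["u_g"] * d_g) * y_prev   # reset gate
    d_y_prev = (p["u_r"] * d_r + p["u_z"] * d_z
                + (p["u_g"] * d_g) * r + d_y * (1.0 - z))
    d_x = p["w_r"] * d_r + p["w_z"] * d_z + p["w_g"] * d_g
    grads = {
        "w_r": x * d_r, "u_r": y_prev * d_r, "b_r": d_r,
        "w_z": x * d_z, "u_z": y_prev * d_z, "b_z": d_z,
        "w_g": x * d_g, "u_g": (r * y_prev) * d_g, "b_g": d_g,
    }
    return d_y_prev, d_x, grads


def max_gradient_error(x=0.7, y_prev=-0.4, h=1e-6):
    """Largest gap between analytic and central-difference derivatives."""
    p = {"w_r": 0.3, "u_r": -0.5, "b_r": 0.1,
         "w_z": -0.2, "u_z": 0.8, "b_z": -0.3,
         "w_g": 0.6, "u_g": 0.9, "b_g": 0.05}
    _, cache = forward(x, y_prev, p)
    d_y_prev, d_x, grads = backward(x, y_prev, p, cache)

    def central(f):
        return (f(h) - f(-h)) / (2.0 * h)

    errors = [
        abs(d_x - central(lambda e: forward(x + e, y_prev, p)[0])),
        abs(d_y_prev - central(lambda e: forward(x, y_prev + e, p)[0])),
    ]
    for name in p:
        def shifted(e, name=name):
            q = dict(p)
            q[name] += e
            return forward(x, y_prev, q)[0]
        errors.append(abs(grads[name] - central(shifted)))
    return max(errors)
```

A small value returned by `max_gradient_error()` indicates that the analytic derivatives agree with the forward pass; the same kind of check extends to vector-valued cells and to the LSTM.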

4.3 LSTM Architecture

LSTM cells store information about the stream of data in two state vectors: a cell state ($C_t$), which stores the long-term memory of the cell, and a hidden state ($H_t$), which handles the short-term memory and ultimately renders the output of the cell at each instant. In this paper, this hidden state ($H_t$) is noted with its timing representation depending on the output ($Y_{t-1}$). The activation of the state vectors is done by four gates: the output gate (O), the candidate gate (G), the input gate (I) and the forget gate (F). Expressions (25), (26), (27) and (28) give the mathematical description of these gates. The details of the LSTM cell structure are given in Figure 6. Note that some literature refers to the gates as layers because they perform a transition similar to the one between layers of an MLP network. However, in this paper, they will be referred to as gates.

Figure 6: Anatomy of a LSTM cell.

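$$O_t = \sigma(X_t \times W_O + Y_{t-1} \times U_O + b_O) \tag{25}$$

$$G_t = \tau(X_t \times W_G + Y_{t-1} \times U_G + b_G) \tag{26}$$

$$I_t = \sigma(X_t \times W_I + Y_{t-1} \times U_I + b_I) \tag{27}$$

$$F_t = \sigma(X_t \times W_F + Y_{t-1} \times U_F + b_F) \tag{28}$$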

The vectors that result from the computation of

the above expressions are directly used to determine

the following time-step of the hidden and cell states

according to (29) and (30).

$$C_t = F_t \cdot C_{t-1} + I_t \cdot G_t \tag{29}$$

$$Y_t = O_t \cdot \tau(C_t) \tag{30}$$
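The forward pass of one LSTM time-step can be sketched as follows. This is illustrative Python with plain lists (the paper's implementation is in C++). It uses the standard output form in which the output gate multiplies the hyperbolic tangent of the cell state, consistent with the τ(C) term appearing in expression (31); the dictionary keys, helper names and matrix layout are assumptions made for the example.

```python
import math


def sigmoid(v):
    """Logistic function; assumes moderate |v| (math.exp overflows otherwise)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Product of a matrix (list of rows) with a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lstm_step(x, y_prev, c_prev, p):
    """One LSTM time-step; returns the new output and the new cell state."""
    def gate(name, activation):
        pre = [wx + uy + b
               for wx, uy, b in zip(matvec(p["W_" + name], x),
                                    matvec(p["U_" + name], y_prev),
                                    p["b_" + name])]
        return [activation(v) for v in pre]

    o = gate("O", sigmoid)     # output gate
    g = gate("G", math.tanh)   # candidate gate
    i = gate("I", sigmoid)     # input gate
    f = gate("F", sigmoid)     # forget gate
    c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c_prev, i, g)]
    y = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return y, c


def zero_params(n_inputs, n_hidden):
    """All-zero weights and biases, handy for checking shapes and wiring."""
    params = {}
    for name in "OGIF":
        params["W_" + name] = [[0.0] * n_inputs for _ in range(n_hidden)]
        params["U_" + name] = [[0.0] * n_hidden for _ in range(n_hidden)]
        params["b_" + name] = [0.0] * n_hidden
    return params
```

With all parameters at zero, the forget gate halves the previous cell state and the candidate contributes nothing, which gives an easy value to verify by hand.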

Expressions (31), (32), (33) and (34) determine the derivative of the output with respect to each gate. This is then used to calculate each gate's contribution to the error of the model.

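$$\partial O_t = \sigma'(O_t) \cdot \partial Y_t \cdot \tau(C_t) \tag{31}$$

$$\partial G_t = \tau'(G_t) \cdot \partial C_t \cdot I_t \tag{32}$$

$$\partial I_t = \sigma'(I_t) \cdot \partial C_t \cdot G_t \tag{33}$$

$$\partial F_t = \sigma'(F_t) \cdot \partial C_t \cdot C_{t-1} \tag{34}$$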

Expressions (35), (36) and (37) determine one

of the mentioned goals of BPTT: the derivatives of

the output with respect to the previous cell state, the

previous hidden state and the input vector, respec-

tively. These calculations are used as input to the

back-propagation of the previous cell.

$$\partial C_{t-1} = \partial C_{t-1} \cdot F_t + \partial Y_t \cdot O_t \cdot \tau'(C_t) \tag{35}$$

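$$\partial Y_{t-1} = \partial G_t \times U_G + \partial I_t \times U_I + \partial F_t \times U_F + \partial O_t \times U_O \tag{36}$$

$$\partial X_t = \partial G_t \times W_G + \partial I_t \times W_I + \partial F_t \times W_F + \partial O_t \times W_O \tag{37}$$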

The gradients for the update of the weights and the biases can be calculated using the same expressions as for the GRU (cf. expressions (22), (23) and (24)), where ξ ∈ {F, I, G, O}.

5 EXPERIMENTS AND RESULTS

5.1 Test Case Overview

To test our implementation, an Inertial Measurement Unit (IMU) sensor fusion problem was chosen. The inputs are the measurements of the IMU, which contains accelerometers, gyroscopes and magnetometers, and the outputs are the roll, pitch and yaw attitude angles. The reference outputs (called labels) were computed and sampled by the PX4 autopilot using its dedicated (and calibrated) Kalman filter (García et al., 2020). The data set contains 11 log files, each relative to a different flight of a quad-copter UAV and generated with a sampling rate of 100 Hz. Overall, each time step uses 9 inputs (since each sensor produces measurements along the x, y and z axes) and up to 3 outputs for roll, pitch and yaw. We have selected two types of LSTM/GRU structure: one structure takes all the inputs to estimate all the outputs (i.e. one single 9x3 structure), and the other is composed of three LSTM/GRU structures that each take all the inputs to estimate one of the outputs separately (i.e. three combined 9x1 structures). This is illustrated in Figure 7. Note that the temporal window (number of time-steps) is the same for each structure and equals 320 milliseconds (i.e. 32 time-steps at 100 Hz).

Figure 7: Test case illustration.

All loaded log files contributed 70% of their data to the training data set and 30% to the testing data set. The batches included in each data set are not contiguous, meaning that they were shuffled before being divided into testing and training. The data set and our code are available as an open-source package at: https://github.com/ISAE-PRISE/rnn4ap.
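The windowing and the 70/30 split can be sketched as follows. This is illustrative Python and does not reproduce the code of the open-source package; in particular, cutting each log into non-overlapping windows of 32 time-steps, the fixed random seed and the function names are assumptions made for the example.

```python
import random


def make_windows(samples, window=32):
    """Cut a list of per-time-step samples into non-overlapping windows."""
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, window)]


def shuffle_split(batches, train_fraction=0.7, seed=0):
    """Shuffle the batches, then split them into training and testing sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(batches)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

For a log of 320 samples (3.2 s at 100 Hz), `make_windows` yields 10 batches, of which 7 go to training and 3 to testing.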

5.2 Data Normalization

The raw data from the IMU sensors are provided in different units and oscillate within different intervals, which introduces the problem of data scaling (Brownlee, 2019). As described in subsection 3.2, the LSTM and GRU architectures use the sigmoid (σ) and the hyperbolic tangent (τ) as activation functions, which provide outputs in the [0, 1] and [−1, 1] ranges respectively. Several options to scale inputs for proper use of the activation functions have been studied (Sola and Sevilla, 1997). In our implementation, we use a scaling method which combines a Z-score normalization with feature scaling (cf. Equation (38)), where µ is the average and σ is the standard deviation.

$$y = \frac{Z - \min}{\max - \min} \quad \text{with} \quad Z = \frac{X - \mu}{\sigma} \tag{38}$$
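This scaling can be sketched for a single input channel as follows. It is illustrative Python; taking min and max over the z-scored values of the same channel, using the population standard deviation, and the function names are assumptions made for the example.

```python
import math


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    if sigma == 0.0:
        raise ValueError("constant signal: standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so that they span the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant signal: max equals min")
    return [(v - lo) / (hi - lo) for v in values]


def normalise(values):
    """Z-score normalization followed by feature scaling."""
    return min_max(z_score(values))
```

In practice the statistics (mean, standard deviation, min and max) would be computed on the training data and reused unchanged on the testing data, so that both sets are scaled identically.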

5.3 Training Results

The learning curves (evolution of training and testing losses) are shown in Figures 8, 9, 10 and 11 for the different configurations described in this paper. The capability of a neural network to model a system increases with its numerical complexity and the output size. Thus, the 9x3 configurations took longer to train and converge (around 200 epochs) while the 9x1 configurations were only trained for 100 epochs to obtain satisfactory results.

Figure 8: Results of 200 epochs of training the LSTM 9x3.

Figure 9: Results of 100 epochs of training the LSTM 9x1.

Figure 10: Results of 200 epochs of training the GRU 9x3.

Figure 11: Results of 100 epochs of training the GRU 9x1.

The results of training the LSTM, in both configurations, are similar for all optimizers except Adamax. It converges to a higher loss value for the 9x1 configuration and presents a slightly lower testing loss than training loss (this could be a sign of under-fitting). For the GRU architectures, the 9x3 configuration behaves in an unpredictable way after 60 epochs for all optimizers. This means that, after a certain number of iterations, the training process itself introduces error in the model, making it unusable. The 9x1 configuration shows a better evolution, even if the Adamax optimizer also suffers from the issue seen for the LSTM.

Figure 12 shows the results (Targets versus Outputs) achieved with a 9x1 LSTM model trained with SGD. Along with the output and the target, a plot of the moving average of the output was added to get a smoother version of the output, showing that some post-processing of the output signal could improve the final result.

Figure 12: Targets vs Outputs for LSTM after training.

5.4 Performance Results

Our GRU and LSTM prototypes have been implemented in C++ using single-precision floating point (i.e. 32-bit encoding) and have been tested on 4 embedded devices with different software configurations:

1. Beaglebone AI with Linux Debian (kernel 4.14) and gcc 6.3.0 (info: https://beagleboard.org/)

2. Raspberry Pi 4 with Linux Raspy (kernel 5.10) and gcc 8.3.0 (info: https://www.raspberrypi.org/)

3. Jetson Nano with Linux Ubuntu (kernel 4.9) and gcc 7.5.0 (info: https://developer.nvidia.com/embedded/)

4. Pynq Z2 with PetaLinux (kernel 4.19) and gcc 7.3.0 (info: http://www.pynq.io/)

The performance measurements obtained are presented in Tables 1, 2, 3 and 4. The execution times are expressed in milliseconds (ms) and represent the duration of one execution step for a feed-forward process (noted ff only) and one execution step for a complete training process (feed-forward and back-propagation, noted bp sgd). Note that all the training algorithms implemented (see Section 3.3) have similar performance measurements; therefore, for the sake of clarity, we only present the results from SGD.
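Minimum, maximum and mean step durations of this kind can be collected with a simple timing loop. The sketch below is illustrative Python (the paper's measurements were taken on its C++ implementation, and its measurement code is not shown here); the function name `time_steps`, the number of repetitions and the dummy workload are assumptions made for the example.

```python
import time


def time_steps(step, repetitions=1000):
    """Run step() repeatedly and report min, max and mean duration in ms."""
    durations_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        step()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "min": min(durations_ms),
        "max": max(durations_ms),
        "mean": sum(durations_ms) / len(durations_ms),
    }


def dummy_step():
    """Stand-in workload; a real measurement would call the network step."""
    return sum(i * i for i in range(1000))
```

Reporting the maximum alongside the mean matters for real-time use, since a control loop has to meet its deadline on the slowest step, not on the average one.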

Table 1: LSTM 9x3 performances on devices (execution times in ms).

Step     Stat   bbai    rpi4    pynq    nano
ff only  min    0.485   0.275   1.306   0.255
ff only  max    0.565   0.355   1.405   0.369
ff only  mean   0.491   0.294   1.316   0.259
bp sgd   min    2.619   1.720   7.390   1.680
bp sgd   max    3.147   1.927   7.651   1.958
bp sgd   mean   2.651   1.790   7.441   1.707

Table 2: LSTM 9x1 performances on devices (execution times in ms).

Step     Stat   bbai    rpi4    pynq    nano
ff only  min    0.185   0.102   0.476   0.091
ff only  max    0.225   0.145   0.544   0.142
ff only  mean   0.187   0.108   0.480   0.094
bp sgd   min    0.826   0.501   2.255   0.489
bp sgd   max    0.937   0.660   2.334   0.582
bp sgd   mean   0.835   0.536   2.27    0.497

Table 3: GRU 9x3 performances on devices (execution times in ms).

Step     Stat   bbai    rpi4    pynq    nano
ff only  min    0.370   0.212   1.022   0.190
ff only  max    0.418   0.276   1.066   0.268
ff only  mean   0.374   0.227   1.029   0.193
bp sgd   min    1.990   1.290   5.602   1.222
bp sgd   max    2.482   1.452   5.950   1.414
bp sgd   mean   2.013   1.340   5.639   1.242

Table 4: GRU 9x1 performances on devices (execution times in ms).

Step     Stat   bbai    rpi4    pynq    nano
ff only  min    0.145   0.081   0.382   0.070
ff only  max    0.174   0.132   0.417   0.102
ff only  mean   0.148   0.085   0.386   0.072
bp sgd   min    0.633   0.384   1.741   0.368
bp sgd   max    1.132   0.480   1.845   0.440
bp sgd   mean   0.643   0.410   1.756   0.372

6 CONCLUSION

This article focuses on the evaluation of RNNs for real-time on-board applications. To do this, the theoretical background was analyzed in order to develop our own optimal implementation of LSTM and GRU. The analysis compared several configurations differing in algebraic complexity and optimization algorithms. Finally, our implementation was successfully tested on various embedded targets with limited computing capacity and latency, showing that a pre-trained LSTM or GRU can be embedded on such devices (for example to complement a Kalman filter for fault-tolerance purposes). However, the training phase is still too demanding in terms of computing resources to allow online training on embedded devices. One benefit of our work is the release of an open-source package including source code, data sets and logs. Therefore, our application with its implementation details (and the results) is accessible and can be used, reproduced and extended.

Two important future investigations would be to migrate our implementation to a GPU-based version (using the GPU of the Jetson Nano device for example) and also to create an FPGA-based architecture (using the FPGA of the Pynq Z2 device for example). This would make it possible (1) to run large LSTM/GRU neural networks and (2) to tackle online training, which can be key for embedded systems algorithms dealing with uncertainties in their environment. Our first efforts regarding GPU and FPGA based architectures are very encouraging. From the LSTM/GRU architecture point of view, we are currently exploring the possibility of modeling the autopilot data using a bi-directional configuration for GRU and LSTM, as has been done for machine translation applications (Schuster and Paliwal, 1997; Sutskever et al., 2014). Last but not least, even if the results obtained for the proposed test case were sufficient to make the desired analysis, the test case has to be extended. A real-life application using RNNs requires more effort for the training part, especially on the training dataset; we are therefore now working on building a denser dataset (for example to include many more flight conditions or UAV types).

ACKNOWLEDGEMENTS

This work has been partially supported by

the Defense Innovation Agency (AID) of the

French Ministry of Defense under Grant No.:

2018.60.0072.00.470.75.01.

REFERENCES

Baomar, H. and Bentley, P. (2017). Autonomous landing

and go-around of airliners under severe weather con-

ditions using artiﬁcial neural networks. 2017 Work-

shop on Research, Education and Development of Un-

manned Aerial Systems (RED-UAS), pages 162–167.

Baomar, H. and Bentley, P. J. (2016). An intelligent au-

topilot system that learns piloting skills from human

pilots by imitation. In 2016 International Conference

on Unmanned Aircraft Systems (ICUAS), pages 1023–

1031.

Brownlee, J. (2019). Loss and Loss Functions for Training

Deep Learning Neural Networks.

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 1724–1734.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014).

Empirical Evaluation of Gated Recurrent Neural Net-

works on Sequence Modeling. pages 1–9.

Elman, J. L. (1990). Finding structure in time. Cognitive

Science, 14(2):179–211.

Flores, A. and Flores, G. (2020). Transition control of a

tail-sitter uav using recurrent neural networks. In 2020

International Conference on Unmanned Aircraft Sys-

tems (ICUAS), pages 303–309.

García, J., Molina, J. M., and Trincado, J. (2020). Real evaluation for designing sensor fusion in uav platforms. Information Fusion, 63:136–152.

Gers, F. A., Schmidhuber, J. A., and Cummins, F. A. (2000).

Learning to forget: Continual prediction with lstm.

Neural Comput., 12(10):2451–2471.

Grigorescu, S. M., Trasnea, B., Cocias, T. T., and Mace-

sanu, G. (2019). A survey of deep learning techniques

for autonomous driving. CoRR, abs/1910.07738.

Haykin, S. (1999). Neural Networks: A Comprehensive

Foundation. Prentice Hall, Upper Saddle River, NJ.

2nd edition.

Irie and Miyake (1988). Capabilities of three-layered per-

ceptrons. In IEEE 1988 International Conference on

Neural Networks, pages 641–648 vol.1.

Kamiş, S. and Goularas, D. (2019). Evaluation of Deep Learning Techniques in Sentiment Analysis from Twitter Data. Proceedings - 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications, Deep-ML 2019, pages 12–17.

Kingma, D. P. and Ba, J. (2015). Adam: A method for

stochastic optimization. In Bengio, Y. and LeCun,

Y., editors, 3rd International Conference on Learn-

ing Representations, ICLR 2015, San Diego, CA, USA,

May 7-9, 2015, Conference Track Proceedings.

Kulkarni, R., Dhavalikar, S., and Bangar, S. (2018). Trafﬁc

Light Detection and Recognition for Self Driving Cars

Using Deep Learning. Proceedings - 2018 4th Inter-

national Conference on Computing, Communication

Control and Automation, ICCUBEA 2018, pages 2–5.

Minsky, M. and Papert, S. A. (1969). Perceptrons: An Intro-

duction to Computational Geometry. The MIT Press.

Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, pages III-1310–III-1318. JMLR.org.


Qian, N. (1999). On the momentum term in gradi-

ent descent learning algorithms. Neural Networks,

12(1):145–151.

Rawat, W. and Wang, Z. (2017). Deep Convolutional Neu-

ral Networks for Image Classiﬁcation: A Comprehen-

sive Review. Neural Computation, 29(9):2352–2449.

Rezk, N. M., Purnaprajna, M., Nordstrom, T., and Ul-

Abdin, Z. (2020). Recurrent neural networks: An

embedded computing perspective. IEEE Access,

8:57967–57996.

Rosenblatt, F. (1957). The Perceptron - A Perceiving and

Recognizing Automaton. Technical Report 85-460-1,

Cornell Aeronautical Laboratory.

Rosindell, J. and Wong, Y. (2018). Biodiversity, the tree

of life, and science communication. Phylogenetic Di-

versity: Applications and Challenges in Biodiversity

Science, (2):41–71.

Ruder, S. (2016). An overview of gradient descent opti-

mization algorithms. CoRR, abs/1609.04747.

Salehinejad, H., Baarbe, J., Sankar, S., Barfett, J., Colak, E.,

and Valaee, S. (2018). Recent advances in recurrent

neural networks. CoRR, abs/1801.01078.

Schuster, M. and Paliwal, K. (1997). Bidirectional recur-

rent neural networks. IEEE Transactions on Signal

Processing, 45:2673 – 2681.

Sola, J. and Sevilla, J. (1997). Importance of input data

normalization for the application of neural networks

to complex industrial problems. IEEE Transactions

on Nuclear Science, 44(3 PART 3):1464–1468.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence

to sequence learning with neural networks. CoRR,

abs/1409.3215.

Werbos, P. J. (1990). Backpropagation Through Time:

What It Does and How to Do It. Proceedings of the

IEEE, 78(10):1550–1560.

Yazhini, K. and Loganathan, D. (2019). A state of art ap-

proaches on deep learning models in healthcare: An

application perspective. Proceedings of the Interna-

tional Conference on Trends in Electronics and Infor-

matics, ICOEI 2019, (Icoei):195–200.
