The Comparison of Word Embedding Techniques in RNNs for Vulnerability Detection

Hai Ngoc Nguyen 1, Songpon Teerakanok 2, Atsuo Inomata 3 and Tetsutaro Uehara 1

1 Cyber Security Lab, College of Information Science and Engineering, Ritsumeikan University, Japan
2 Research Organization of Science and Technology, Ritsumeikan University, Japan
3 Graduate School of Information Science and Technology, Osaka University, Japan
Keywords: Deep Learning, Word Embeddings, Vulnerability Detection, RNNs.
Abstract: Many studies have combined deep learning and Natural Language Processing (NLP) techniques in security systems to perform tasks such as bug detection, vulnerability prediction, or classification. Most of these works rely on NLP embedding methods to generate input vectors for the deep learning models. However, many embedding methods exist for encoding software text files into vectors, and the space of possible neural network structures is vast and largely heuristic. This makes it challenging for researchers to choose an appropriate combination of embedding technique and model structure for training vulnerability detection classifiers. For this task, we propose a system to investigate the use of four popular word embedding techniques combined with four different recurrent neural networks (RNNs), including both bidirectional RNNs (BRNNs) and unidirectional RNNs. We trained and evaluated the models using two datasets of vulnerable functions written in C. Our results show that the FastText embedding technique combined with BRNNs produced the most effective detection rate, compared to other combinations, on a real-world dataset but not on an artificially produced one. Further experiments on other datasets are necessary to confirm this result.
1 INTRODUCTION
Software quality is a significant concern within the cybersecurity field, since vulnerabilities in software code can greatly damage an organization's day-to-day operations. As a matter of fact, securing software code through both dynamic and static analysis methods has been studied widely among security experts. Software source code exhibits many characteristics similar to those of natural language texts (Allamanis et al., 2018). For that reason, the use of NLP techniques for automatically detecting vulnerabilities in code has been investigated. With the recent breakthrough of deep learning in numerous fields including NLP applications, research has shown the great potential of deep learning in source code static analysis (Russell et al., 2018). In any machine learning or deep learning system, a specified embedding technique is required for generating model inputs as vector representations. Nevertheless, there are many existing embedding methods in the NLP field, such as Word2Vec
(Mikolov et al., 2013) and GloVe (Pennington et al.,
2014). This makes it difficult to select a suitable
method for the vector encoding tasks.
Among deep learning models, sequence models like RNNs are well known for dealing with text sequences. The simple RNN model suffers from vanishing or exploding gradients when the input sequences get too long. To deal with long input sequences, other RNN structures, namely Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Kostadinov, 2017), were introduced. These sequence models proved to be suitable learning models for encoding code files in Li et al. (2018) and Li et al. (2019). These studies used Word2Vec to produce code vector representations, but other embedding methods like GloVe or FastText (Bojanowski et al., 2017) have yet to be evaluated on these models. Different combinations of embedding method and deep learning model can capture different types of knowledge representations, such as the linguistic contexts of identifiers and their temporal sequences. Changing the embedding method for training the deep model could therefore impact the performance of classifiers.
Moreover, the deep models also require different amounts of time for training and testing on different types of representations. Selecting a suitable embedding method is therefore a critical task, since it affects both the performance of the models and the time complexity of training and detection. Thus, we aim to determine the most viable combinations of sequence models and available embedding methods for generating semantic vectors.
We present a system that trains code vulnerability detectors in order to evaluate four word embedding methods combined with four popular RNNs. The system was built on top of the open-source benchmark API (Lin et al., 2019a). As an extension of that API, we also used the two types of datasets originally set up as baselines for comparison in the benchmark. Both datasets contain files written in the C programming language, where each file represents either a vulnerable or non-vulnerable function. The first dataset is the Nine-projects dataset, constructed from nine open-source projects with vulnerability information extracted from the National Vulnerability Database (NVD, 2019) and the Common Vulnerabilities and Exposures (CVE, 2019) websites. The second dataset is obtained from the Software Assurance Reference Dataset (SARD, 2019) project and consists of artificially synthesized function files. Through our experiments, we explored combinations of word embedding techniques and RNNs for building vulnerability detectors at the function level. Our system trained and tested the vulnerability detectors in a supervised manner. Since the system processes program source code as plain text files in file-level classification, no dedicated source code analysis of the program is required.
The main contributions of this paper are summarized as follows:
- We extend a benchmark system by evaluating three additional word embedding techniques to encode C program functions as vector representations.
- We implement the LSTM, bidirectional LSTM (Bi-LSTM), GRU, and bidirectional GRU (Bi-GRU) models for training vulnerability detectors at the function file level on two different datasets.
- We conduct an overall performance evaluation of all trained classifiers on the two datasets. In particular, each classifier is examined on different input representations to discover the compatibility of the embedding algorithms and the models.
The rest of this paper is organized as follows: Section 2 presents related studies in which word embedding techniques and deep learning models were applied. Section 3 describes the detailed design of our system. In Section 4, we explain the experiments and performance metrics. Section 5 provides the results and their comparative analysis. We conclude our work and discuss future directions in Section 6.
2 RELATED WORK
Word embedding techniques are widely used in building NLP applications. Inspired by the success of NLP and neural language models, earlier studies observed strong resemblances in semantic and syntactic information between natural languages and programming languages. They leveraged the advantages of these methods to detect vulnerabilities and predict defects in software code analysis. One of the earliest applications implemented classical NLP algorithms, such as n-grams, combined with machine learning techniques for the non-NLP task of detecting and classifying vulnerable code practices in programming languages (Mokhov et al., 2014). This was done in much the same manner as a classic text classification task.
Afterwards, more studies tested increasingly complicated machine learning models while employing different word embedding techniques to generate vector representations as inputs for the training process. Pradel and Sen used Word2Vec to generate code vectors derived from custom Abstract Syntax Tree (AST)-based contexts (Pradel and Sen, 2017). These vectors were used to train deep learning models to detect vulnerabilities in JavaScript code. Likewise, the Word2Vec model was applied to build vector representations from C/C++ source code, and vulnerability detection models were trained with both the Word2Vec representations and control flow graph (CFG) data (Harer et al., 2018). Instead of using Word2Vec, Henkel et al. applied the GloVe model to produce vectors learned from abstracted symbolic traces of C programs (Henkel et al., 2018). Furthermore, FastText was used in FastEmbed for vulnerability prediction based on ensemble machine learning models (Fang et al., 2020). Although there are already several examples of using word embedding techniques in vulnerability detection, comparisons between these techniques have not been possible due to differences in baseline dataset types and machine learning model structures.
Deep learning has recently attracted more inter-
est in code analysis research since it has achieved
great success in numerous fields such as computer
vision, image processing, and natural language pro-
cessing.

Figure 1: Approach Overview.

By converting the self-constructed dataset
called code gadget into Word2Vec vector represen-
tations, VulDeePecker was developed from the Bi-
LSTM model to detect specific types of C/C++ vul-
nerabilities (Li et al., 2018). The same authors also
provided a comparison for several deep learning mod-
els on the same artificially constructed dataset (Li
et al., 2019). Another study used Word2Vec for
the embedding task, but their model architecture em-
ployed a convolutional layer on top of the standard Bi-
LSTM model (Niu et al., 2020). Although the mentioned systems achieved good vulnerability detection performance, their trained models were tested on self-constructed datasets built from syntactic artifacts such as ASTs and CFGs. The success of methods based on such artifacts raises the question of whether these customized datasets are more useful than basic inputs such as word vectors. This makes an overall comparison between the systems challenging and requires program analysis expertise.
There are studies that have started to explore the
effectiveness of using different representations for
deep learning models to deal with program classifica-
tion tasks. A comparative analysis was conducted to assess how different deep learning models learn over distinctive input representations of Java code (Ram et al., 2019). Additionally, Lin et al. (2019a) proposed a benchmark framework and compared three models: Text-CNN (Kim, 2014), a DNN, and an LSTM. However, further evaluation of different embedding algorithms and different neural networks remains to be explored. In this paper, we present an ap-
proach that allows users to observe the performance
of deep learning models on different types of vec-
tor representations. The detection granularity of this
project is at the function level based on the two types
of datasets.
3 APPROACH
Our goal is to design a system for investigating the effectiveness of word embedding techniques for training vulnerability detectors. The system first loads the source code files and preprocesses them into sequences of word tokens such as identifiers, data types, and variables. Each of these sequences stands for a semantic function representation. A corresponding list of labels is generated by processing the function names. Subsequently, the system applies the selected word embedding algorithm to map the sequences of tokens into vector representations. The specified neural network uses eighty percent of the vector representations to train the vulnerability detector, while the remaining representations are used to test the trained detector. After testing, a vulnerability probability is produced for each test sample.
3.1 Overview
In this work, we apply four popular word embedding
techniques to train four different RNNs. Figure 1
presents our workflow. In detail, given a corpus of
function code files, training a deep model classifier in-
cludes several steps. In the first stage, the source code
files are loaded and processed to generate sequence
data and labels. The data is then passed to the next
stage to be transformed into the code embedding vec-
tors. These vectors will then be partitioned and fed
to the constructed neural network for the training pro-
cess. When training is completed, the model is tested,
and the detailed logs are automatically collected in the
last phase.
Figure 2: Data Loader and Label Generator Module.
3.2 Data Loader and Label Generator
Figure 2 shows the data flow within the Data Loader
and Label Generator module. This module initially
loads the source code files to get a list of identifier
tokens and a list of function names. When preprocessing the raw data, each file in the corpus is split into a list of words and punctuation characters before tokenization with the Keras tokenizer (Chollet et al., 2015). By fitting the tokenizer on the whole corpus, each function file is turned into a sequence of integers. Finally, the list of these sequences and the vulnerability labels are passed to the subsequent encoding stage.
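As an illustration, a minimal sketch of this tokenization step is shown below. The corpus variable and its contents are placeholders of our own; only the Keras Tokenizer calls follow the actual API (Chollet et al., 2015).

```python
from keras.preprocessing.text import Tokenizer

# Example corpus: each element is one C function file, already split so that
# identifiers and punctuation are separated by spaces (placeholder data).
function_texts = [
    "int copy_input ( char * buf ) { strcpy ( buf , user_input ) ; }",
    "int safe_copy ( char * dst , const char * src , size_t n ) { strncpy ( dst , src , n ) ; }",
]

tokenizer = Tokenizer(lower=False, filters='')            # keep identifiers and punctuation as tokens
tokenizer.fit_on_texts(function_texts)                    # fit the tokenizer on the whole corpus
sequences = tokenizer.texts_to_sequences(function_texts)  # each function file -> sequence of integers
```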
To generate ground truth labels, we loaded two
datasets into our module and tasked the module to de-
tect certain keywords in the filenames. Files which contained the designated keywords were then labeled as either vulnerable (1) or non-vulnerable (0). These two
datasets are the Nine-projects and the synthetic SARD
datasets. For the Nine-projects dataset, we set the vul-
nerable keywords to match the strings such as “CVE”
or “cve”. Similarly, the SARD’s files contain such
keywords as “BAD”, “bad”, etc. These keywords
were incorporated directly into the module. One of the module settings selects the type of dataset before execution. This ensures the system's adaptability to other types of datasets.
Here, the label generator settings can be customized to suit the type of dataset. It is important to note that the function bodies of the files in the SARD dataset also contain keywords such as those we picked for labeling. These words could add bias to the training process of the deep learning models: since word embedding techniques are used for generating vectors, the models could simply look at the keyword vectors to decide the vulnerability result. Therefore, we scan each function body for those keywords and replace them with dummy strings of the same length, so that these words cannot affect the model performance.
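A simplified sketch of this labeling and masking logic is shown below; the keyword lists and the helper name are illustrative assumptions, not the module's actual code.

```python
# Assumed per-dataset keyword lists; the real module selects them from its settings.
VULN_KEYWORDS = {"nine-projects": ["CVE", "cve"], "sard": ["BAD", "bad"]}

def label_and_mask(filename, body, dataset="sard"):
    keywords = VULN_KEYWORDS[dataset]
    # Label 1 (vulnerable) if any designated keyword appears in the filename, else 0.
    label = int(any(k in filename for k in keywords))
    # Replace keyword occurrences inside the function body with dummy strings of
    # the same length, so the embeddings cannot leak the label.
    for k in keywords:
        body = body.replace(k, "X" * len(k))
    return body, label
```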
3.3 Encoder Module
This module converts each code function file into a vector that retains both semantic and syntactic information from the source code. To allow the models to learn effectively in the later stage, it is important to extract this information from the code tokens. In particular, to preserve the semantic knowledge expressed by the identifier names, we use an embedding layer to map the identifiers in code to semantic vector representations. By inspecting the content of the two datasets, we found that more than 90% of the sequences are shorter than 1000 tokens. To balance the sequence length and the sparsity of sequences (Lin et al., 2019b), we set the maximum code sequence length to 1000. Functions whose sequences are longer than 1000 tokens are truncated to length 1000. Conversely, zeros are appended at the end of sequences shorter than 1000. Next, the module picks one of the embedding methods to map the file sequences into meaningful code vectors. Word2Vec, GloVe, FastText, and a pre-trained GloVe model (GloVePre) (Pennington et al., 2014) were implemented in our work. We set the embedding layer to generate fixed-length vectors with dimension d = 100 by default.
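A minimal sketch of this length normalization step with Keras is given below; the choice of pad_sequences with 'post' padding and truncating is our reading of the description above.

```python
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 1000  # maximum code sequence length chosen above

# 'sequences' holds the integer sequences produced by the tokenizer in Section 3.2;
# a tiny stand-in is used here so the snippet runs on its own.
sequences = [[12, 7, 3, 3, 9], [4, 8]]
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')
print(padded.shape)  # (2, 1000): longer sequences are cut, shorter ones are zero-padded at the end
```

The embedding layer then maps each of the 1000 token positions to a 100-dimensional vector, giving model inputs of shape (1000, 100).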
The Word2Vec model was implemented to learn
semantic information from a large amount of raw
data. The model was provided by the GenSim package (Řehůřek and Sojka, 2010) with the Continuous Bag of Words (CBOW) and Skip-gram algorithms.
Figure 3: The example of the GRU model structure.
CBOW learns to predict a word from its context, while Skip-gram is built to predict the context from the word. Therefore, we chose CBOW over Skip-gram, since we needed to extract the syntactic information of the code sequences rather than their contexts. The other Word2Vec parameters were left at their defaults.
In a similar manner, the FastText model was constructed with the GenSim package. It serves as the main point of comparison to the Word2Vec model in training the neural networks, since FastText can construct a vector for a word from its character n-grams even when the word is out of its vocabulary. The number of threads and the window size were set to 4 and 5, respectively, as in the Word2Vec model, while the rest of the parameters were left at their defaults.
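The sketch below shows how the two GenSim models could be instantiated with the settings described above (CBOW, 100-dimensional vectors, window size 5, 4 worker threads). The toy corpus is a placeholder, and any parameter not mentioned in the text is left at its GenSim 3.x default.

```python
from gensim.models import Word2Vec, FastText

# Placeholder corpus: each element is one tokenized function file.
token_corpus = [
    ['int', 'copy_input', '(', 'char', '*', 'buf', ')', '{', 'strcpy', '(', 'buf', ',', 'user_input', ')', ';', '}'],
]

# CBOW (sg=0) Word2Vec with 100-dimensional vectors, window 5, 4 worker threads.
w2v = Word2Vec(token_corpus, size=100, sg=0, window=5, workers=4, min_count=1)

# FastText with the same settings; its subword n-grams also cover rare or unseen tokens.
ft = FastText(token_corpus, size=100, sg=0, window=5, workers=4, min_count=1)

# An out-of-vocabulary identifier can still be embedded as long as it shares
# character n-grams with the training corpus.
vec = ft.wv['copy_buf']
```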
Finally, the GloVe model was also trained from the given corpus of code. GloVe was constructed with the glove-python package (Kula, 2019). We tuned the GloVe model with a learning rate of 0.05, set the window parameter to 10, and trained it with four threads for 500 epochs. While this generates embeddings for the same task as Word2Vec, GloVe derives its embeddings by factorizing the logarithm of the corpus word co-occurrence matrix. The GloVePre model was additionally implemented to observe the baseline difference between converting natural-language words to vectors and converting code identifiers to vectors. The pre-trained layer uses the 100-dimensional model GloVe.6B.100d.txt.
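Under the hyper-parameters stated above, training the corpus-specific GloVe model with glove-python might look roughly as follows; the corpus variable is the same placeholder as before, and the exact call sequence is our reading of the package's interface rather than the authors' code.

```python
from glove import Corpus, Glove  # glove-python package (Kula, 2019)

# Placeholder corpus: list of tokenized function files, as in the previous sketch.
token_corpus = [
    ['int', 'copy_input', '(', 'char', '*', 'buf', ')', '{', 'strcpy', '(', 'buf', ',', 'user_input', ')', ';', '}'],
]

corpus = Corpus()
corpus.fit(token_corpus, window=10)                   # build the word co-occurrence matrix, window 10

glove = Glove(no_components=100, learning_rate=0.05)  # 100-dimensional vectors, learning rate 0.05
glove.fit(corpus.matrix, epochs=500, no_threads=4)    # 500 epochs, 4 threads
glove.add_dictionary(corpus.dictionary)               # attach the vocabulary so vectors can be looked up
```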
Given these points, the sequences collected from the previous stages can be translated into vector representations of shape (1000, 100) by one of the embedding models. The selected model is responsible for preserving the semantic information of the code in the embeddings.
3.4 Training Module
Taking the meaningful code representation vectors
from the previous stage as inputs, the training mod-
ule is responsible for training the neural networks
to distinguish between vulnerable and non-vulnerable
function samples. Our work focuses on investigating the effectiveness of RNNs, since models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) are well known for dealing with sequential data (Li et al., 2019). More importantly, their bidirectional forms, which combine a forward and a backward layer, can adapt efficiently to program code, where the order of statements plays a significant role in vulnerability detection. The module configuration can be customized to select one of the four RNN
models. Here, we explored LSTM, Bi-LSTM, GRU,
and Bi-GRU for learning the features extracted from
the embedded code representations.
The LSTM network in this work was designed with eight layers. The first layer is an LSTM recurrent layer with 128 neurons. This layer takes as input the embedding vectors extracted from the code sequences in the encoder module. The second layer is a dropout regularization layer with a dropout rate of 0.5. It helps the model prevent over-fitting by randomly removing hidden units and their connections during training. The third layer is another LSTM recurrent layer with 128 neurons. After this, the output of the LSTM layer is downsampled by a pooling layer. Another dropout layer with the same rate is added after the pooling layer. The last three layers are dense layers. The first dense layer has 64 neurons, and this number of neurons is halved in the second dense layer. The last layer has only one neuron and uses a sigmoid activation to convert the output into a single probability between 0 and 1.
Table 1: The distribution of the vulnerable functions on two datasets.
The GRU network was constructed in the same way as the LSTM network, the only difference being that GRU recurrent layers are used instead of LSTM layers. An example of the GRU model structure is shown in Figure 3.
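Under the description above, the unidirectional networks could be assembled in Keras roughly as follows. The dense-layer activations, the return_sequences settings, and the use of global max pooling are our assumptions; only the layer types, sizes, dropout rate, loss, and optimizer come from the text.

```python
from keras.models import Sequential
from keras.layers import LSTM, GRU, Dropout, GlobalMaxPooling1D, Dense

def build_rnn(cell=LSTM, input_shape=(1000, 100)):
    """Eight-layer unidirectional RNN sketch; pass cell=GRU for the GRU variant."""
    model = Sequential([
        cell(128, return_sequences=True, input_shape=input_shape),  # 1st recurrent layer, 128 units
        Dropout(0.5),                                                # dropout against over-fitting
        cell(128, return_sequences=True),                            # 2nd recurrent layer, 128 units
        GlobalMaxPooling1D(),                                        # pooling layer for downsampling
        Dropout(0.5),
        Dense(64, activation='relu'),                                # dense layer, 64 neurons (activation assumed)
        Dense(32, activation='relu'),                                # dense layer, 32 neurons (activation assumed)
        Dense(1, activation='sigmoid'),                              # single vulnerability probability
    ])
    model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```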
The BRNNs were constructed in a similar struc-
ture for both Bi-LSTM and Bi-GRU. The Bi-LSTM
model consists of eight layers. The first layer is a
bidirectional LSTM recurrent layer with 64 LSTM
cells. The bidirectional layer allows its output to acquire information from both the preceding and succeeding context concurrently. The second layer is a dropout regularization layer with the same dropout rate as in the LSTM model. The third layer is designed the same as the first layer. The output of the Bi-LSTM layers is then reduced by one dimension with a pooling layer. After pooling, a dropout layer with the same rate is used. Finally, the last three dense layers were built in the same way as in the unidirectional RNNs. Regarding the network structures and hyper-parameters, the Bi-LSTM model was constructed based on the work of Lin et al. (2019b). Following that reference, we designed and tuned the other network architectures using a similar methodology.
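Under the same assumptions as the previous sketch, the bidirectional variants simply wrap 64-unit recurrent layers in Keras' Bidirectional wrapper, which concatenates the forward and backward outputs:

```python
from keras.layers import Bidirectional, LSTM

# First and third layers of the Bi-LSTM variant; pooling, dropout, and the three
# dense layers are identical to the unidirectional sketch above.
bi_layer_1 = Bidirectional(LSTM(64, return_sequences=True), input_shape=(1000, 100))
bi_layer_3 = Bidirectional(LSTM(64, return_sequences=True))
```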
3.5 Test Module and Logs Collector
In the previous stage, the constructed model em-
ployed eighty percent of the dataset for training the
classifier. Here, the test module can select one of the
trained classifiers by specifying its name and use the
rest of the dataset for testing. When the test process
is completed, the raw CSV data will be presented as
a list of function names and its vulnerability proba-
bility. Next, the logs collector processes the CSV file
data and sorts the list of function names to another
table following their vulnerability probability in the
order from the highest to the lowest. Finally, it can
calculate performance metrics and produce results as
a .txt file.
4 EXPERIMENTS
4.1 Experimental Setup
Our experiments focused on answering the following research questions:
- Question 1: Can applying different embedding methods improve the effectiveness of the vulnerability detector?
- Question 2: How do the training speed and performance of each model vary when using different embedding techniques?
- Question 3: Would changing the neural network model affect detection performance?
To summarize, we used four embedding methods to train and test four types of RNNs on the two datasets, meaning that we trained and tested 16 classifiers for each dataset. We set the optimizer for all networks to Stochastic Gradient Descent (SGD) with the default Keras settings. Binary cross-entropy was selected as the loss function. The deep learning models were implemented in Python (version 3.6.9) using Keras (version 2.2.4) with a TensorFlow backend (version 1.14.0) (Abadi et al., 2016). The Word2Vec and FastText models were constructed with the GenSim pip library (version 3.4.0), while the GloVe model used the glove-python package (version 1.0.1) (Kula, 2019).
Our experiments were designed and carried out on an
Ubuntu server (18.04 LTS) having 64GB RAM with
an NVIDIA GeForce RTX 2080 SUPER 8GB GPU
and an Intel(R) Core (TM) i7-9700K 3.60GHz CPU.
4.2 Datasets
The Nine-projects dataset is the dataset proposed in the benchmark API (Lin et al., 2019a); the authors shared it on their GitHub repository (NSCLab, 2020). The second dataset is the synthetic dataset supplied by the Software Assurance Reference Dataset project (SARD, 2019).
Table 2: Performance Metrics.
The project is known as the Juliet Test Suite (Black, 2018). It includes test functions for C/C++ and Java; in this work, we only take the C source code for our experiments. Following the cases studied in the benchmark API, we randomly extracted 35000 vulnerable and 40000 non-vulnerable C function files from the SARD functions dataset provided by the same GitHub repository. For both datasets, after being encoded into labeled vectors, the data are split with a ratio of 0.8 for the training and validation sets and 0.2 for the test set. The content of the datasets is described in Table 1. We keep this partition setting for training and testing all the deep learning models.
4.3 Performance Metrics
In most cases, precision, recall, and F1-score are used to evaluate deep learning classification models. However, in many vulnerability detection settings, the imbalance between non-vulnerable and vulnerable samples means that these metrics undervalue the detection performance of the models (Lin et al., 2019a). Therefore, the metrics applied for evaluating our classifiers are the ranked-retrieval precision and recall (P@K% and R@K%). Moreover, since our approach targets the retrieval of vulnerable functions, these metrics are well suited to the task and more appropriate for evaluating the detection results (Manning et al., 2009).
Specifically, when a detector finishes testing, it produces a ranked list of functions sorted by vulnerability probability. Among the top k percent of the retrieved functions, TP@k% stands for the number of truly vulnerable samples, while FP@k% denotes the falsely flagged ones. FN@k% denotes the number of truly vulnerable functions that are not discovered when retrieving the top k% of functions with the highest vulnerability probability. For instance, suppose the total number of test functions is 60000 and the number of vulnerable functions is 1500. With k = 10, the top 10% accordingly retrieves the 6000 files with the highest vulnerability probability. Furthermore, suppose 1400 vulnerable files were found to be true positives among the 6000 files, 100 vulnerable files were missed, and 4600 non-vulnerable files were false positives; the reported values will then be as follows: TP@k% is 1400, FP@k% is 4600, and FN@k% is 100. Hence, P@K% and R@K% can be calculated with the formulas in Table 2.
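Written out under the standard ranked-retrieval definitions, which we assume match Table 2, the metrics are

P@k% = TP@k% / (TP@k% + FP@k%) and R@k% = TP@k% / (TP@k% + FN@k%).

For the example above (k = 10), this gives P@10% = 1400 / 6000 ≈ 23.3% and R@10% = 1400 / 1500 ≈ 93.3%.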
5 RESULTS
5.1 Model Training Results
The time consumed for training the neural models usually receives less attention than standard performance metrics like precision and recall. However, when the models reach their capacity and show nearly identical performance, understanding time complexity provides valuable insight for choosing appropriate training methods. Figure 4 summarizes the training time of the four RNNs on the SARD and Nine-projects datasets. Training the detectors on the SARD dataset took more time, since it is much larger than the other dataset. In general, FastText yields the shortest training time for the classifiers, while training with GloVe takes the longest. Training models with Word2Vec is faster than with GloVePre. Indeed, this was expected, since FastText has been shown to achieve superior speed compared to other word embedding models (Joulin et al., 2016). Among the neural networks, the GRU and Bi-GRU models require the largest amount of training time, whereas the Bi-LSTM and LSTM models take less time to train than the others.
5.2 Evaluation of Word Embedding
Methods on the Trained Models
Figure 5.A shows the detection performance of the LSTM models trained with the four embedding methods. The LSTM models trained on FastText's vector representations achieved the highest precision and recall for all categories of top k% retrieved functions.
Figure 4: Time consumption summary for training RNNs on the SARD and the Nine-projects datasets.
In detail, the precision at the top 1% reached 83%, and the recall rates at the top 20% and top 50% reached 93% and 99%, respectively. The next best performing detector was the model trained on GloVe, which achieved 80% precision at the top 1%, followed by the models using Word2Vec and GloVePre.

Figure 5.B presents the detection performance of the Bi-LSTM models trained with the four embedding methods. The Bi-LSTM models trained on Word2Vec and FastText showed nearly identical rates at the top 1% of retrieved functions, with precision rates of 87% and 86%, respectively. For the rest of the top k% items, FastText still achieved the highest recall rates. The lowest testing performance came from the model using GloVe, followed by the one using GloVePre. Noticeably, the performance of the Bi-LSTM models is much higher than that of the LSTM models. For example, at the top 1% of retrieved functions, the precision rate improved by 11% in the case of Word2Vec and by 4% in the case of FastText.
When retrieving fewer than 50% of the vulnerable functions, Figure 5.C shows a similar trend: detectors using FastText achieved the highest precision and recall rates.
and recall rates. On average, the GRU models have
lower performance than the Bi-LSTM models, but
higher than the LSTM models. For instance, in the
top 1% vulnerable samples for the Word2Vec cate-
gory, the LSTM detector reached only 76% in preci-
sion rate while the precision rates of the GRU and Bi-
LSTM models were higher by 8% and 11% respec-
tively. The performance of the GRU models that em-
ployed GloVe and Word2Vec was similar.
Similarly, Figure 5.D shows that the Bi-GRU model using FastText achieved the highest performance for the top 1% most vulnerable files. Compared with the other models, the Bi-GRU detectors perform much better than the LSTM and GRU detectors. Likewise, the Bi-GRU models achieved higher precision and recall for the top k items when compared to the Bi-LSTM models. Hence, there is a clear
performance gap between the bidirectional RNNs and
the unidirectional RNNs.
As Figures 5.A and 5.B show, the LSTM and Bi-LSTM models performed better with the FastText and Word2Vec techniques. For all models, the precision rates at top k decrease and the recall rates increase as k grows. This is because the proportion of vulnerable data becomes smaller as the number of retrieved files increases. When retrieving 50% of the total number of files, all detectors collected more than 97% of the relevant files. In particular, the Bi-LSTM models using FastText and the Bi-GRU model using GloVe retrieved all relevant vulnerable functions.
Overall, the detectors using FastText achieved the best performance on the Nine-projects dataset. The models using GloVe and Word2Vec did not clearly outperform one another. GloVePre generally had the lowest rates, because its embedding vectors were trained for words in a predefined natural-language dictionary. The better performance of FastText is likely due to its capability to produce a vector for a word from its character n-grams even if the word was not present in the training corpus (Bojanowski et al., 2017); GloVe and Word2Vec do not have this capability. Furthermore, as an extension of Word2Vec, FastText is able to generate better embeddings for infrequent words, since it treats each word by considering its character n-grams.
Figure 5: Distribution of precision and recall over top k% retrieved functions among the four models tested on the Nine-
projects dataset: (A) LSTM, (B) Bi-LSTM, (C) GRU, (D) Bi-GRU.
Table 3: Precision and Recall over top k% retrieved functions of the RNNs on the SARD dataset.
Finally, the use of hierarchical softmax, carefully implemented in FastText, helps to optimize the computation process (Joulin et al., 2016).
For the model test results on the SARD dataset, we observe that the precision and recall rates at top k% are nearly identical across all the models. As Table 3 shows, changing the embedding method did not greatly affect performance in the case of the synthetic dataset, since this dataset has a well-balanced ratio of vulnerable to non-vulnerable files and is large enough to effectively train the detectors. All the detectors reach their highest precision at the top 1% and top 10%. At the top 50%, all the vulnerable functions were retrieved successfully. Our experiments further confirm the conclusion in (Lin et al., 2019a) that no statistically significant difference in performance is found among the RNNs on the SARD dataset, regardless of the embedding technique they are combined with. The vulnerability patterns in the artificially synthesized samples are much simpler for the neural networks to capture than those in the real-world samples.
5.3 The Comparisons of the Four Deep
Neural Networks
As Table 3 indicates, the performance of all detectors trained on the SARD dataset is sufficient and nearly identical due to the synthetic character of the dataset. Therefore, the comparisons between the four neural networks are made based solely on the detectors trained on the Nine-projects dataset (Figure 5). In general, the BRNNs perform better than the unidirectional RNNs, likely due to the advantages of BRNNs discussed in Section 3.4. Among the BRNNs, the Bi-GRU detectors showed higher precision and recall rates on average in the top k% most vulnerable samples. Likewise, the GRU models achieved better performance than the LSTM ones in the group of
unidirectional RNNs. The GRU and Bi-GRU models detect more effectively than the LSTM and Bi-LSTM models when trained on the Nine-projects dataset. This is likely because the structure of the GRU networks is more compatible with smaller datasets: since the Nine-projects dataset is smaller than the SARD dataset, the GRU models benefited from their lower memory consumption.
6 CONCLUSION AND FUTURE
WORK
Automated detection of software vulnerabilities is an important direction in cybersecurity research. However, conventional techniques such as dynamic analysis or symbolic execution are inefficient when dealing with immense amounts of source code (Lin et al., 2019b). To enhance vulnerability discovery capability, applying deep learning techniques is necessary to speed up the code analysis process. Our work presented an approach to examine the effectiveness of word embeddings combined with four deep learning models for the vulnerability detection task. The system trained the models and tested them on two kinds of datasets. With the synthetic dataset, all models produced sufficient but nearly identical vulnerability retrieval results. In contrast, the models showed clear differences on the real-world dataset. This is worth noting, since real vulnerability datasets drawn from released software code can be limited in size and number in many scenarios. Thus, it is vital to select the right combination of embedding method and neural network structure to build an effective detection system that adapts well to the dataset.
Our approach investigated the use of embedding algorithms with supervised learning methods, and the system can generate vulnerability detectors at the function level. It can be used as an assisting tool for selecting good combinations of embedding methods and deep learning models to build effective vulnerability detection systems. There are several research directions for extending our work and improving system performance. First, we can collect more real vulnerable samples to build up the dataset volume and resolve the imbalance issue in the open-source dataset. Second, we can implement other embedding solutions, such as adapting an AST extractor (Kovalenko et al., 2019). This could extract different patterns of information from source code for the central machine learning models to learn in a later stage. Finally, building better neural network models should be investigated to reduce the gap between natural language text and program code. This would allow the vulnerability detection system to learn better and adapt to other programming languages.
REFERENCES
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A.,
Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
M., et al. (2016). Tensorflow: A system for large-
scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C.
(2018). A survey of machine learning for big code
and naturalness. ACM Computing Surveys (CSUR),
51(4):1–37.
Black, P. E. (2018). Juliet 1.3 Test Suite: Changes From
1.2. US Department of Commerce, National Institute
of Standards and Technology.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
(2017). Enriching word vectors with subword infor-
mation. Transactions of the Association for Computa-
tional Linguistics, 5:135–146.
Chollet, F. et al. (2015). Keras. https://github.com/fchollet/keras.
CVE (2019). Common vulnerabilities and exposures web-
site. https://cve.mitre.org/.
Fang, Y., Liu, Y., Huang, C., and Liu, L. (2020). Fastembed:
Predicting vulnerability exploitation possibility based
on ensemble machine learning algorithm. Plos one,
15(2):e0228439.
Harer, J. A., Kim, L. Y., Russell, R. L., Ozdemir, O., Kosta,
L. R., Rangamani, A., Hamilton, L. H., Centeno, G. I.,
Key, J. R., Ellingwood, P. M., et al. (2018). Auto-
mated software vulnerability detection with machine
learning. arXiv preprint arXiv:1803.04497.
Henkel, J., Lahiri, S. K., Liblit, B., and Reps, T. (2018).
Code vectors: understanding programs through em-
bedded abstracted symbolic traces. In Proceedings of
the 2018 26th ACM Joint Meeting on European Soft-
ware Engineering Conference and Symposium on the
Foundations of Software Engineering, pages 163–174.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
arXiv preprint arXiv:1607.01759.
Kim, Y. (2014). Convolutional neural networks for sentence
classification. arXiv preprint arXiv:1408.5882.
Kostadinov, S. (2017). Understanding gru networks. https://
www.towardsdatascience.com. Accessed 25 Jan 2020.
Kovalenko, V., Bogomolov, E., Bryksin, T., and Bacchelli,
A. (2019). Pathminer: a library for mining of path-
based representations of code. In Proceedings of the
16th International Conference on Mining Software
Repositories, pages 13–17. IEEE Press.
Kula, M. (2019). A python implementation of glove: glove-
python. https://github.com/maciejkula/glove-python.
Li, Z., Zou, D., Tang, J., Zhang, Z., Sun, M., and Jin,
H. (2019). A comparative study of deep learning-
based vulnerability detection system. IEEE Access,
7:103184–103197.
Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S.,
Deng, Z., and Zhong, Y. (2018). Vuldeepecker: A
deep learning-based system for vulnerability detec-
tion. arXiv preprint arXiv:1801.01681.
Lin, G., Xiao, W., Zhang, J., and Xiang, Y. (2019a).
Deep learning-based vulnerable function detection: A
benchmark. In International Conference on Informa-
tion and Communications Security, pages 219–232.
Springer.
Lin, G., Zhang, J., Luo, W., Pan, L., De Vel, O., Mon-
tague, P., and Xiang, Y. (2019b). Software vulnera-
bility discovery via learning multi-domain knowledge
bases. IEEE Transactions on Dependable and Secure
Computing.
Manning, C. D., Raghavan, P., and Schütze, H. (2009). Introduction to Information Retrieval. Cambridge University Press.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Mokhov, S. A., Paquet, J., and Debbabi, M. (2014). The
use of nlp techniques in static code analysis to de-
tect weaknesses and vulnerabilities. In Canadian
Conference on Artificial Intelligence, pages 326–332.
Springer.
Niu, W., Zhang, X., Du, X., Zhao, L., Cao, R., and Guizani,
M. (2020). A deep learning based static taint analysis
approach for iot software vulnerability location. Mea-
surement, 152:107139.
NSCLab (2020). Cyber code intelligence github website.
https://github.com/cybercodeintelligence/CyberCI.
NVD (2019). National vulnerability database website.
https://nvd.nist.gov/.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.
Pradel, M. and Sen, K. (2017). Deep learning to find bugs.
TU Darmstadt, Department of Computer Science.
Ram, A., Xin, J., Nagappan, M., Yu, Y., Lozoya, R. C.,
Sabetta, A., and Lin, J. (2019). Exploiting to-
ken and path-based representations of code for iden-
tifying security-relevant commits. arXiv preprint
arXiv:1911.07620.
Řehůřek, R. and Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.
Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer,
J., Ozdemir, O., Ellingwood, P., and McConley, M.
(2018). Automated vulnerability detection in source
code using deep representation learning. In 2018 17th
IEEE International Conference on Machine Learning
and Applications (ICMLA), pages 757–762. IEEE.
SARD (2019). Software assurance reference dataset
project. https://samate.nist.gov/SRD/.