Security Issue Classification for Vulnerability Management with Semi-supervised Learning

Emil Wåreus 1,2, Anton Duppils 1, Magnus Tullberg 1 and Martin Hell 2

1 Debricked AB, Malmö, Sweden
2 Dept. of Electrical and Information Technology, Lund University, Lund, Sweden
Keywords: Machine Learning, Open-Source Software, Vulnerabilities, Semi-supervised Learning, Classification.
Abstract: Open-Source Software (OSS) is increasingly common in industry software and enables developers to build better applications, at a higher pace, and with better security. These advantages also come at the cost of introducing vulnerabilities through third-party libraries. The largest publicly available database of easily machine-readable vulnerabilities is the National Vulnerability Database (NVD). However, reporting to this database is a human-dependent process, and it fails to provide acceptable coverage of all open-source vulnerabilities. We propose the use of semi-supervised machine learning to classify issues as security-related, providing additional vulnerabilities in an automated pipeline. Our models, based on a Hierarchical Attention Network (HAN), outperform previously proposed models on our manually labelled test dataset, with an F1 score of 71%. Based on these results and the vast number of GitHub issues, our model potentially identifies about 191 036 security-related issues with prediction power over 80%.
1 INTRODUCTION
In today's competitive environment, Open-Source
Software (OSS) enables organizations to leverage
technology to better meet customer needs. A report
from Synopsys found that 99% of codebases contain
open source, and 70% of each codebase was open
source (Synopsys, 2020). The increased exposure to open-source software is often regarded as an enabler of higher productivity, but it may come at the cost of higher susceptibility to attacks through vulnerabilities in OSS. To maintain control over the
security of the proprietary software, the maintainers
need to monitor vulnerabilities introduced through
external software. This is part of what is known as
Software Composition Analysis (SCA).
Vulnerabilities can be reported and given a Common Vulnerabilities and Exposures (CVE) identifier, and can then be stored in databases such as the
National Vulnerability Database (NVD). This allows
for a centralized repository and collection of vulner-
abilities. However, not all vulnerabilities are given
a CVE identifier. Even if such vulnerabilities are
patched in the OSS component, it can be difficult for
end users to identify the need to update the compo-
nent to its most recent version. This leaves the software at risk of being exposed to attacks.
In OSS, contributors, users, and community mem-
bers often use issues to organize their work, specify
requirements, and report bugs in the software. These
issues may contain security-related information about
the OSS, such as bugs with security implications, vul-
nerability reports, or information on a security update.
Vulnerabilities reported as issues may sometimes not
find their way to a CVE. Being able to classify issues
as security-related is the first step towards assessing if
they describe a vulnerability.
In this paper, we use Natural Language Pro-
cessing (NLP) to automate the process of finding
such security-related issues. Our model leverages
unlabeled issues through Semi-Supervised Learning
(SSL) to increase the performance during inference.
SSL enables the model to better generalize to the task
and to better learn the underlying software-related se-
mantics in the issues. Our contributions are summa-
rized as follows:
- We analyze, through Term-Frequency Inverse Document Frequency (TF-IDF) and truncated Singular Value Decomposition (SVD), the use of issue labels and CVE summaries as labeled training data, and show that such data should not be used to train an issue classifier.
- We describe how to model the problem with a Hierarchical Attention Network (HAN) with Virtual Adversarial Training (VAT) and show that this model provides better results than previously published models.

The result is a state-of-the-art classification of GitHub issues into security and non-security related issues. We also provide our manually labeled test dataset for future comparison.^1

^1 All data used for training, validation, and tests will be made publicly available if the paper is accepted.
The paper is outlined as follows. Section 2 pro-
vides some background on NLP, vulnerabilities, is-
sues, and our method of evaluation. In Section 3,
we describe and analyze the data that is used in our
model. Then, the model is detailed in Section 4, fol-
lowed by the results in Section 5. We compare our
approach and results to related works in Section 6,
before the paper is concluded in Section 7.
2 BACKGROUND
2.1 Natural Language Processing
NLP is the task of making computers understand natural language, usually with the support of machine learning.
Within NLP, tasks such as machine translation, doc-
ument classification, question answering systems, au-
tomatic summary generation, and speech recognition
are common (Khurana et al., 2017). One of the main
advantages of using machine learning for NLP is that
the algorithms may gain a contextual semantic under-
standing of text where classifications are not depen-
dent on a single word, but rather a complex sequence
of words that can completely alter the meaning of the
document. This is beneficial in the endeavor to find
vulnerabilities through a classification of issues, as
the context of single words in these documents may
matter.
Supervised classification methods require lots of
labeled training data. As in many real-world use cases
of machine learning, there is a limit to the amount
of available training data. In a semi-supervised ap-
proach, a more limited set of labeled training data can
be used together with a large set of unlabeled data.
There are multiple different techniques for leveraging
the unlabeled data during training. In our approach,
we use a technique called Virtual Adversarial Train-
ing (VAT), which helps the model generalize by al-
tering the input of labeled examples during training.
VAT was chosen as the SSL method because it can be applied to already implemented neural networks, which makes it well suited for benchmarking SSL against non-SSL methods.
2.2 Vulnerabilities and Security Related
Issues
Vulnerability data is to a large extent centralized
through CVE identifiers, maintained by Mitre. Each
vulnerability also comes with a very short summary,
describing e.g., the affected software, the nature of
the vulnerability, and the potential impact in plain
text. These vulnerabilities are collected in NVD,
which adds further information useful for better un-
derstanding and analyzing the vulnerabilities. This
includes a severity score and Common Platform Enu-
meration (CPE) identifiers that uniquely identify the
product and versions that are vulnerable. Some CVE
entries also relate the vulnerability to the underlying
weakness by providing a Common Weakness Enu-
meration (CWE) identifier. NVD currently adds around 15k-20k new entries annually, and the database contains around 150k vulnerabilities in total.
GitHub is the world’s largest host of source code
and provides services such as source code manage-
ment, version control, issue tracking, and continuous
integration. Issues are used to track ideas, enhance-
ments, bugs, and tasks related to a repository. Issues
can be assigned metadata for categorizing them, e.g., bug,
feature, question, etc. In particular, the security label
can be used to mark that an issue is security-related.
However, it is at the user’s discretion to choose labels,
so this tag may or may not be accurate, depending on
the project and developers. For purposes of accessing
GitHub data, an API is provided that allows anyone
to fetch GitHub information without having to parse
web pages.
2.3 Evaluation in Machine Learning
Data is typically divided into three independent sets.
First, the training data is used to train the ML model.
Then, a validation dataset is used to monitor model
performance during development. Finally, a test set is
used to evaluate the performance of the model. In the
evaluation, standard tools and notions include preci-
sion, recall, F1, and receiver operating characteristic
- area under curve (ROC-AUC). Denote the number
of true positives as TP, false positives as FP, true neg-
atives as TN, and false negatives as FN. Fixing the
decision threshold for a classifier, the first three are
then defined by

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN},$$

and their harmonic mean F1 by

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
Our algorithms will be optimized for F1 in order to
reduce the risk of building a trivial classifier.
ROC-AUC is the area under the curve obtained by varying the classification threshold and plotting the true-positive rate (tpr) against the false-positive rate (fpr). These are derived from

$$\mathrm{tpr} = \frac{\text{positives correctly classified}}{\text{total positives}} = \frac{TP}{TP + FN},$$

and

$$\mathrm{fpr} = \frac{\text{negatives incorrectly classified}}{\text{total negatives}} = \frac{FP}{FP + TN}.$$
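As a concrete illustration (a sketch with toy data, not part of the paper's pipeline), these metrics can be computed with scikit-learn:

```python
# Sketch: precision, recall, F1, and ROC-AUC with scikit-learn.
# y_true are gold labels; y_score are model probabilities for the
# security class; the numbers below are toy data.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [int(s >= 0.5) for s in y_score]  # fixed decision threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
# ROC-AUC varies the threshold internally, so it takes the raw scores.
print("ROC-AUC:", roc_auc_score(y_true, y_score))
```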
3 UNDERSTANDING AND
EXPLORING THE DATA
As a first step, we need to have a good understand-
ing of the underlying data. We divide this process
into data acquisition, data cleaning, and exploratory
analysis. The results will be used when developing
the classification model.
3.1 Data Acquisition
Three different data sources were used:

- Issues from GitHub, where over 7 000 000 issues were acquired from the GitHub API, covering over 30 000 of the most popular repositories according to the number of stars.
- CVEs from NVD, which can safely be considered as security-related in their text semantics.
- SecureReqNet (SRN) (Palacio et al., 2019), where the authors provided a labeled open-source dataset of security-related issues collected from GitHub, NVD, and GitLab. The labels for this dataset were automatically generated from issue labels, where 10% of the data was sampled for quality assurance.
GitHub issues carry several data points, e.g., title, body, creation date, labels, closing date,
and references to commits and releases in which the
issue was closed. A CVE, as provided by NVD, con-
sists of e.g., the CVE id, when it was updated, a brief
summary of the vulnerability, a CWE identifier, links
to other resources, and a list of CPEs.
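As an illustration of the acquisition step, the following is a minimal sketch of fetching issues through the GitHub REST API. The repository name, token handling, and filtering are illustrative assumptions, not the exact pipeline used for this paper.

```python
# Sketch: paginated fetch of all issues for one repository.
import os
import requests

def fetch_issues(owner: str, repo: str) -> list[dict]:
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    params = {"state": "all", "per_page": 100}
    issues = []
    while url:
        resp = requests.get(url, headers=headers, params=params)
        resp.raise_for_status()
        # The issues endpoint also returns pull requests; keep plain issues.
        issues += [i for i in resp.json() if "pull_request" not in i]
        url = resp.links.get("next", {}).get("url")  # follow pagination links
        params = {}  # the "next" URL already carries the query parameters
    return issues

# "some-org/some-repo" is a placeholder repository.
issues = fetch_issues("some-org", "some-repo")
print(len(issues), "issues fetched")
```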
The data from GitHub and SRN primarily consist
of community-generated data in a weakly controlled
environment with varying quality of each data point.
In contrast, CVE data stems from a more controlled
environment with higher data quality. While 150k
security-related texts (CVE summaries) appear to be
excellent candidates for training data, this fundamen-
tal difference requires careful analysis. Otherwise,
there is a risk that an NLP model would only learn
to differ between data sources. Issues can be linked to
CVEs by being listed as an external resource on NVD.
Such issues can safely be regarded as security-related.
This provided us with an additional 941 security-
related issues from GitHub, among which 847 are
used for training and 94 for validation.
Issues may be linked to labels in a zero-to-many
relationship. Labels can provide additional sig-
nal to the distinction between security-related and
non-security-related issues. Thus, another candidate for training data is issues with an explicit, user-
generated, security label. These issues will also be
explored in more detail in order to determine their
suitability for training the classifier.
3.2 Data Cleaning
To prepare the data for further analysis and classifi-
cation we must clean noisy parts of the data that will
not contribute signal to the model. To properly work
with text, input documents are tokenized, i.e., split
into tokens, such as words and punctuation. To give
the model as useful tokens as possible, the following cleaning measures were implemented (a minimal code sketch follows at the end of this subsection):

- Non-English documents were removed.
- All emojis were removed.
- Everything within the HTML tag code was removed.
- All text was converted to lowercase.
- Special characters were removed and ends of sentences were replaced with a special tag.
- Very long (over 60 000 characters) and very short (under 10 characters) documents were removed.
- Words were stemmed to their root word.
The reason to remove code from the documents
is that code and text are vastly different and would
require a separate model to accurately extract signals
from the code.
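The cleaning steps can be sketched as below; the exact rules and thresholds used by the authors are not published, so the regexes, the stemmer choice, and the omission of language detection are illustrative assumptions.

```python
# Sketch of the cleaning pipeline (language detection and emoji handling
# are folded into the special-character filter here for brevity).
import re
from nltk.stem import PorterStemmer  # any stemmer would do

stemmer = PorterStemmer()

def clean(document: str) -> str | None:
    # Drop everything inside <code>...</code> blocks.
    document = re.sub(r"<code>.*?</code>", " ", document, flags=re.DOTALL)
    document = document.lower()
    # Mark sentence ends with a special tag, then strip special characters.
    document = re.sub(r"[.!?]+\s", " <eos> ", document)
    document = re.sub(r"[^a-z0-9<>\s]", " ", document)
    # Stem every remaining token to its root form.
    cleaned = " ".join(stemmer.stem(tok) for tok in document.split())
    # Length filter: very short or very long documents are dropped.
    if not 10 <= len(cleaned) <= 60_000:
        return None
    return cleaned

print(clean("A <code>memcpy</code> overflow crashes the parser. See CVE!"))
```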
3.3 Exploratory Analysis
We start by exploring the labels. Since these are
user-generated, labels with different casing, wording
and format may in fact be the same underlying la-
bel. To handle different variations of the same label,
all labels were clustered using character-level term
frequency–inverse document frequency (TF-IDF) and
K-means clustering, and similar labels were aggre-
gated to the resulting clusters. Each cluster name was
determined by the most common label in that par-
ticular cluster. The security-labeled issues represent
about 0.2% of all labels, while help wanted makes up
23%, bug makes up 13%, and enhancement makes up
9%. It is clear that the share of security-labeled is-
sues was very low, which suggests that labels should
not be included in the model. Furthermore, variations
within label groups were significant and may provide
too much noise to the model. This conclusion will
also be supported by the data visualization below.
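A minimal sketch of this label-clustering step, assuming scikit-learn's character-level TF-IDF and K-means; the n-gram range and cluster count are illustrative, as the exact settings are not reproduced here.

```python
# Sketch: grouping near-duplicate labels by character n-gram similarity.
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

labels = ["security", "Security issue", "bug", "Bug", "help wanted",
          "help-wanted", "enhancement", "enhancements"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(labels)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Name each cluster after its most common member label.
for c in set(clusters):
    members = [l for l, k in zip(labels, clusters) if k == c]
    print(Counter(members).most_common(1)[0][0], "<-", members)
```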
To evaluate the potential use of CVE summaries as
training data for security-related text, we first extract
bigrams and trigrams from both GitHub issues and
CVEs. The ten most common bigrams/trigrams are
given in Table 1. Looking at these samples, there is no
apparent similarity between the two datasets in terms
of term frequency.
To verify this dissimilarity further, we perform a dimensionality reduction for visualization. The
cleaned issues and CVE-summaries were fed into
a term frequency–inverse document frequency (TF-
IDF) model, which calculates the term importance of
each term in a document in relation to the full corpus.
The resulting high dimensional sparse matrix is a nu-
merical representation of the input documents. For
visualization, the dimensionality of the data was re-
duced with a truncated singular value decomposition
model (Truncated SVD) (Halko et al., 2011). Fig. 1 can be viewed as a visualization of term-frequency similarity between the different data labels. It is clear that there is a large dif-
ference in term-frequency distribution between CVE-
summaries and issues. There is also no meaningful
difference between security-related issues and non-
security related issues, which makes this a non-trivial
problem. In conclusion, this tells us that CVEs and
issues use vastly different words in their text docu-
ments, which may be expected as CVEs are formally
written by a very limited number of people, while is-
sues can be written by anyone. Moreover, since there
is no significant difference between the different la-
bels, and all labels use a significantly different lan-
guage than CVEs, the label information is not consid-
ered in our model.
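The visualization pipeline can be sketched as follows; the two toy corpora stand in for the cleaned issues and CVE summaries.

```python
# Sketch: TF-IDF followed by truncated SVD for visualization.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

issues = ["segfault when parsing long input", "add dark mode to settings"]
cves = ["remote attackers can execute arbitrary code via a crafted packet"]
docs = issues + cves

X = TfidfVectorizer().fit_transform(docs)          # sparse term-importance matrix
Z = TruncatedSVD(n_components=1).fit_transform(X)  # first SVD component only
for doc, z in zip(docs, Z):
    print(f"{z[0]:+.3f}  {doc}")
```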
From this analysis, we decided to exclude CVE-
summaries from the training phase of the model. If
such data would be included, the model would simply
become a classifier that determines the source of the
input data, rather than a classifier that determines the
security-relevance of the document.
Figure 1: TF-IDF and Truncated SVD over cleaned issues
and cleaned CVEs, where the distribution of the first feature
is visualized. The labels enhancement (13%), bug (10%),
and security (0.2%) were picked out from the issue dataset,
and all other issues are marked as other. 1000 data-points
were sampled from each class to visualize the distributions.
Since security-related and non-security-related issues are not linearly separable, a non-linear model is required to build an effective classifier.
To further increase the performance we will explore
the option of using semi-supervised learning, which
uses unlabeled examples to help the model generalize
well.
A summary of the data used for training and validation is given in Table 2. It is a mix between data provided by SRN (Palacio et al., 2019) (their training and validation sets; unfortunately, they did not release their test set) and our own dataset collected from GitHub. We split the datasets into four different sets for clear validation. The training set consists of labeled examples from the SRN training set and issues from our GitHub dataset that are referenced by CVEs in NVD, as well as unlabeled data from our own GitHub dataset. The validation set consists of a hold-out subset from the SRN training set. We test our final model on the SRN Validation Dataset to enable a fair comparison of results. The User Labeled Test Dataset is a separate test set consisting of issues uniformly sampled from our own GitHub dataset that have been manually labeled by us following the Annotation Guidelines described in Appendix A. The User Labeled Test Dataset gives us a gold standard to compare the SRN datasets to, as those sets were automatically generated as described in (Palacio et al., 2019).
Table 1: The top 10 bigrams and trigrams from NVD CVE summaries and cleaned GitHub issue bodies, respectively. (There is no relationship between bigrams and trigrams on the same line other than having the same ranking within their own dataset.)

| NVD Summaries (Bigram) | GitHub Issues (Bigram) | NVD Summaries (Trigram) | GitHub Issues (Trigram) |
|---|---|---|---|
| remote attackers | step reproduc | allows remote attackers | unknown sourc java |
| allows remote | java org | cause denial service | java desktop java |
| execute arbitrary | expect behavior | attackers execute arbitrary | desktop java awt |
| denial service | node modul | remote attackers execute | sourc java desktop |
| cause denial | py line | execute arbitrary code | java awt eventdispatchthread |
| attackers execute | unknown sourc | cross site scripting | java org apach |
| arbitrary code | oper system | attackers cause denial | avail avail avail |
| via crafted | java lang | site scripting xss | python site packag |
| cross site | java android | remote attackers cause | java android view |
| attackers cause | java awt | arbitrary web script | java awt eventqueu |
Table 2: Summary of the data used in our training, evaluation, and testing. Note that only 3M of the available 7M unlabeled GitHub issues were used for training, due to time complexity and diminishing returns.

| Dataset | GitHub | GitLab | Source |
|---|---|---|---|
| Train Dataset: non-security related | 47095 | 460 | SRN |
| Train Dataset: security | 3691 (2844, 847) | 452 | SRN + Author |
| Train Dataset: unlabeled | 3M | 0 | Author |
| Validation Dataset: non-security related | 4683 | 66 | SRN |
| Validation Dataset: security | 453 (359, 94) | 55 | SRN + Author |
| SRN Validation Dataset (used as a test set): non-security related | 555 | 0 | SRN |
| SRN Validation Dataset (used as a test set): security | 514 | 0 | SRN |
| User Labeled Test Dataset: non-security related | 835 | 0 | Author |
| User Labeled Test Dataset: security | 112 | 0 | Author |
4 MODELING
In this section, we describe the model used for our
classifier. The amount of labeled training data, in
particular security-related, is very limited, requiring
a semi-supervised model. We describe a Hierarchical
Attention Network, and then show how we combine
this with Virtual Adversarial Training in order to sup-
port a semi-supervised approach.
4.1 Hierarchical Attention Network
Hierarchical Attention Network (HAN) was intro-
duced in (Yang et al., 2016) to better model
full documents with an attention-based neural net-
work (Vaswani et al., 2017). It tries to mirror the input
document by having one attention-mechanism for the
word level, and one for the sentence level in a hier-
archical structure. This helps the model learn sparse semantics in documents and creates a better contextual representation of important terminology, both attractive properties for our task.
The model takes the stemmed word sequences as input $w_{it}$, $t \in [1,T]$, for the $i$-th sentence and the $t$-th word in a sequence of length $T$. The text is converted to a numerical representation through pre-trained embeddings provided by SRN (Palacio et al., 2019), represented as a matrix $W_e$. From the numerical word vectors, seen as input in Fig. 2, a bi-directional contextual word encoding is derived from two LSTM (Hochreiter and Schmidhuber, 1997) layers running in different directions on the input sequence. The outputs from the LSTM cells are derived as

$$x_{it} = W_e w_{it}, \quad t \in [1,T], \qquad (1)$$

$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{LSTM}}(x_{it}), \quad t \in [1,T], \qquad (2)$$

$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{LSTM}}(x_{it}), \quad t \in [T,1], \qquad (3)$$
and the bi-directional results are concatenated as $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$. Then, $h_{it}$ is used as input to the word-level attention layer, which directs the model's attention to more important words, e.g., buffer-overflow, XSS, and injection. Such words contribute more signal to a positive security classification. The attention mechanism is derived from

$$u_{it} = \tanh(W_w h_{it} + b_w), \qquad (4)$$

$$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)}, \qquad (5)$$

$$s_i = \sum_t \alpha_{it} h_{it}, \qquad (6)$$

where $u_{it}$ is the output from the learned weights and biases that transform every single word. This output is the prediction power for the attention of that particular word. Then, Eq. (5) normalizes the attention for sentence $i$ and multiplies $u_{it}$ with a trainable context vector $u_w$ to find the relevant words to give attention to. Eq. (6) multiplies each word with its attention to either increase or decrease its relative importance.
After the word-level attention, sentences are encoded to further propagate through the model. The sentence encodings are derived from another bi-directional LSTM layer as

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{LSTM}}(s_i), \quad i \in [1,L], \qquad (7)$$

$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{LSTM}}(s_i), \quad i \in [L,1], \qquad (8)$$

and concatenated to $h_i = [\overrightarrow{h}_{i}, \overleftarrow{h}_{i}]$. From this transformation, $h_i$ becomes a contextual representation of sentence $i$ that considers the results from neighboring sentences. This is then passed on to a sentence-level attention layer, derived in a similar manner as the word-level attention, but with variables trained with sentence-level context. The output $v$ (see Fig. 2) is then used for further modeling. A visualization of word- and sentence-level attention is given in Fig. 3, where high-impact words and sentences are shown in red.
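A minimal PyTorch sketch of the word-level encoder and attention of Eqs. (1)-(6); the dimensions, LSTM flavor, and initialization are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: bi-directional LSTM word encoder with attention pooling.
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, embed_dim: int = 100, hidden: int = 50):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                            batch_first=True)                  # Eqs. (2)-(3)
        self.proj = nn.Linear(2 * hidden, 2 * hidden)          # W_w, b_w in Eq. (4)
        self.context = nn.Parameter(torch.randn(2 * hidden))   # u_w in Eq. (5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (sentences, words, embed_dim), i.e. already embedded per Eq. (1)
        h, _ = self.lstm(x)                      # (sentences, words, 2 * hidden)
        u = torch.tanh(self.proj(h))             # Eq. (4)
        alpha = torch.softmax(u @ self.context, dim=1)  # Eq. (5), per sentence
        return (alpha.unsqueeze(-1) * h).sum(dim=1)     # Eq. (6): s_i

s_i = WordAttention()(torch.randn(3, 20, 100))  # 3 sentences, 20 words each
print(s_i.shape)  # torch.Size([3, 100])
```

The sentence-level encoder and attention follow the same pattern, with the sentence vectors $s_i$ as input.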
4.2 Adversarial Training
Adversarial Training (Goodfellow et al., 2014) is a supervised learning method based upon creating adversarial examples. These are created by slightly modifying existing examples such that the model misclassifies them. The idea is to use observations that are very close in input space but very different in their output: for these, there exists a small variation to the input data, a perturbation, that will make the model misclassify the example once it is added to the input, creating the adversarial example. By training on these adversarial examples, the model can regularize and generalize better.
Adversarial Training modifies only the loss function and can thus be applied to already existing models. Denote $x$ as the input, $y$ as the label paired with $x$, $\theta$ as the parameters of the model, $\hat{\theta}$ as the parameters with a backpropagation stop, and $r$ as a small uniformly sampled perturbation with the same dimension as $x$. The adversarial loss $L_{\mathrm{adv}}$ is then given by

$$L_{\mathrm{adv}}(\theta) = -\log p(y \mid x + r_{\mathrm{adv}}; \theta), \qquad (9)$$

where

$$r_{\mathrm{adv}} = \arg\min_{r, \|r\| \le \varepsilon} \log p(y \mid x + r; \hat{\theta}). \qquad (10)$$

Here, $\varepsilon$ is a hyperparameter that restricts the norm of $r$. Stopping the backpropagation in $\hat{\theta}$ means that the backpropagation algorithm does not propagate gradients through $\hat{\theta}$.
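A sketch of computing $r_{\mathrm{adv}}$ with the common single-gradient-step linearization of Eq. (10) from (Goodfellow et al., 2014); the model, inputs, and $\varepsilon$ are placeholders.

```python
# Sketch: one-step approximation of the adversarial perturbation r_adv.
import torch
import torch.nn.functional as F

def adversarial_perturbation(model, x, y, eps: float = 0.02):
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)       # -log p(y | x; theta_hat)
    grad, = torch.autograd.grad(loss, x)      # gradient w.r.t. the input only
    # Moving along the loss gradient decreases log p(y | x + r), as Eq. (10) asks.
    return eps * grad / (grad.norm() + 1e-12)
```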
4.3 Virtual Adversarial Training
Virtual Adversarial Training (VAT) (Miyato et al.,
2015) is an extension of Adversarial Training, mak-
ing it accessible in a semi-supervised environment.
Instead of using the labels to determine the perturbations, the direction of the gradient is followed using an approximation. This is done by calculating the Kullback-Leibler divergence ($D_{\mathrm{KL}}$) between the model's output distribution for the input and its output distribution for the input plus a small random perturbation.
The $D_{\mathrm{KL}}$ between two discrete probability distributions $P$ and $Q$ over the same probability space $\chi$ is defined as

$$D_{\mathrm{KL}}[P \,\|\, Q] = \sum_{x \in \chi} P(x) \log \frac{P(x)}{Q(x)}. \qquad (11)$$

The VAT cost is given by

$$L_{\text{v-adv}}(\theta) = D_{\mathrm{KL}}\big[p(\cdot \mid x; \hat{\theta}) \,\|\, p(\cdot \mid x + r_{\text{v-adv}}; \theta)\big], \qquad (12)$$

where

$$r_{\text{v-adv}} = \arg\max_{r, \|r\| \le \varepsilon} D_{\mathrm{KL}}\big[p(\cdot \mid x; \hat{\theta}) \,\|\, p(\cdot \mid x + r; \hat{\theta})\big]. \qquad (13)$$

A classifier is trained to be smooth by minimizing Eq. (12), which can be seen as making the classifier resilient to worst-case perturbations (Miyato et al., 2015).
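A sketch of the VAT loss of Eqs. (12)-(13) in PyTorch, using the one-step, power-iteration-style approximation of $r_{\text{v-adv}}$ from (Miyato et al., 2015); the model, shapes, and hyperparameters are placeholders.

```python
# Sketch: virtual adversarial loss for one batch of (possibly unlabeled) inputs.
import torch
import torch.nn.functional as F

def vat_loss(model, x, eps: float = 0.02, xi: float = 1e-6):
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)          # p(. | x; theta_hat), fixed
    d = torch.randn_like(x)                      # random initial direction
    d = xi * d / d.norm()
    d.requires_grad_(True)
    q = F.log_softmax(model(x + d), dim=-1)
    kl = F.kl_div(q, p, reduction="batchmean")   # D_KL[p || p(. | x + d)]
    grad, = torch.autograd.grad(kl, d)
    r_vadv = eps * grad / (grad.norm() + 1e-12)  # approximates Eq. (13)
    q_adv = F.log_softmax(model(x + r_vadv), dim=-1)
    return F.kl_div(q_adv, p, reduction="batchmean")  # Eq. (12)
```

Since this loss never touches labels, it can be evaluated on unlabeled issues and added to the supervised loss on labeled ones.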
Figure 2: The structure of HAN.
Figure 3: Example of the attention mechanism for both word-level and sentence-level attention. The red word highlights indicate relevance to the sentence. The red highlights to the left of each sentence show the relevance of each sentence to the document. Grey or non-highlighted words are deemed irrelevant to the core message of the document.
4.3.1 VAT in Text Classification
The original formulation of VAT, as described in Sec-
tion 4.3, does not consider sequential data with arbi-
trary length. Therefore, the technique needs to be re-
purposed for this case, as proposed in (Miyato et al.,
2017). Let $s$ be a sequence of word embeddings, $s = [\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T]$, where $\hat{v}_i$ is a normalized word embedding derived as

$$\hat{v}_i = \frac{v_i - E(v)}{\sqrt{\mathrm{Var}(v)}}. \qquad (14)$$
By using a sequence of word embeddings as the in-
put instead of the sequence of the tokenized words,
applying the perturbations obtained from the VAT-
calculation directly on the embeddings will create
adversarial examples suitable for text, as shown in
Fig. 4.
In VAT for text classification, the approximated virtual adversarial perturbation is calculated during the training step as

$$L_{\text{v-adv}}(\theta) = \frac{1}{N'} \sum_{n'=1}^{N'} D_{\mathrm{KL}}\big[p(\cdot \mid s_{n'}; \hat{\theta}) \,\|\, p(\cdot \mid s_{n'} + r_{\text{v-adv},n'}; \theta)\big], \qquad (15)$$

where

$$r_{\text{v-adv}} = \varepsilon g / \|g\|_2, \qquad (16)$$

and

$$g = \nabla_{s+d}\, D_{\mathrm{KL}}\big[p(\cdot \mid s; \hat{\theta}) \,\|\, p(\cdot \mid s + d; \hat{\theta})\big], \qquad (17)$$

with $d$ a small random perturbation of the input. $N$ denotes the number of labeled examples and $N'$ the number of unlabeled examples. The symbol $\nabla_x$ denotes the gradient with respect to the observation $x$ during backpropagation.
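A short sketch of the normalization in Eq. (14); note that (Miyato et al., 2017) use frequency-weighted embedding statistics, while this sketch uses unweighted mean and variance for simplicity.

```python
# Sketch: normalize an embedding matrix to zero mean and unit variance.
import torch

def normalize_embeddings(W: torch.Tensor) -> torch.Tensor:
    return (W - W.mean(dim=0)) / W.std(dim=0).clamp_min(1e-12)

W_e = normalize_embeddings(torch.randn(5000, 100))  # vocabulary x embed_dim
print(W_e.mean(dim=0).abs().max())  # close to zero
```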
4.4 Hierarchical Attention Virtual
Adversarial Network
The HAN architecture is also expanded with a VAT implementation. The Hierarchical Attention Virtual Adversarial Network (HAVAN) retains the HAN layer structure, but with some extra SSL steps added to it. The embeddings are normalized using Eq. (14). The loss $L_{\text{v-adv}}$ from Eq. (15) is then added to the loss function, and the embeddings of the model may be perturbed during a training step. In HAVAN, both labeled and unlabeled data are used during training, making it an SSL-based approach. Labeled data is used for the standard loss function, while both unlabeled and labeled data are used for the VAT loss function.

Figure 4: An overview of the embeddings with HAN (left), and the perturbed embeddings with HAN (right). Dim is the output dimension of each layer and y is the output of the network.
Since we have a binary classification problem, the
Bernoulli distribution is used as the distributions in
Eq. (15). An overview of the model can be seen in
Fig. 5.
Figure 5: Overview of the layer structure of HAVAN (HAN
with VAT). Perturb is a perturbation that is added to the em-
beddings.
5 RESULTS AND DISCUSSION
5.1 Main Results
We evaluate four different models. A simple logis-
tic regression is used as a base model. The SRN from
(Palacio et al., 2019) was adapted for comparison with
the two models proposed in this paper, namely the
HAN and the HAVAN models as detailed in Section 4.
The SRN model was retrained on the same data with
configuration according to their GitHub Repository
(SEMERU-Lab, 2021), and optimized to the same extent as the other models. All models are evaluated against the
automatically generated SRN Validation Dataset and
our User Labeled Test Dataset. The results are pre-
sented in Table 3. The best performance in each se-
curity category is presented for each dataset with bold
numbers.
For the SRN Validation Dataset, it is clear that
the HAN-model outperforms the other models with a
macro average of 73% F1, and security classification
of 65% F1.
For this dataset, all models have very high preci-
sion scores and quite low recall, which may indicate
that the models are finding somewhat similar exam-
ples, but not other variants of security issues. This
is an indication of a skewed dataset, which is not un-
common when labels are automatically generated as
in the case of the SRN Validation Dataset. To further
analyze the skewness of the dataset, it would be in-
teresting to link the training and validation set-issues
to CWEs when applicable. CWE information could
indicate if we have higher or lower performance for
certain types of security weaknesses.
Looking at the User Labeled Test Dataset, one can observe a significant decrease in performance in terms of classifying issues as security-related, with the best performing model being HAVAN at a macro F1 of 71% and a security F1 of 48%. The most probable reason for the different levels of performance is that the two test sets draw data from different distributions, as one is automatically generated and one is manually labeled. It may be that the User Labeled Test Dataset is more inclusive in its definition of security-related issues, which can be seen in the Annotation Guidelines in Appendix A.
In Fig. 6, we can see the 95% confidence intervals of the models' errors on the y-axis, and it is clear that the variance is quite high for the User Labeled Test Dataset as it contains quite few examples. This makes a good argument for labeling more data with high confidence in its security relevance, to further increase the testing and training capabilities. A larger, less skewed training set would also further increase generalization to the underlying challenge and the real-world performance of the model.
Looking at the AUC in Table 5, we observe
that the models achieve a much higher score on the
SRN Validation Dataset than the User Labeled Test
Dataset. This is because the training set draws from
the same, possibly skewed, distribution as the SRN
Validation Dataset. The Logistic Regression outperforms the other models in AUC, which means that there is a more distinct separation between the score distributions of security and non-security issues. This may be an attractive property if the predictions are used as a "prioritized list" from which items are popped for further analysis, which may be a valid use case.
5.2 Findings from Predictions
Perhaps against intuition, the closing rate of security-
related issues is lower than the global closing rate. In
our results, we found that issues with security predic-
tion power over 80% had a closing rate of 80.54%,
and the global closing rate is 81.45%. This is also
visible in the longevity of issues, as it takes on av-
erage 11.6% longer to close a security issue in com-
parison to a non-security related issue. On average, it
takes 119 days to close an issue with security prediction power above 80%, in comparison to 107 days for issues with a security prediction power below 50%. In to-
tal, with predictions on 7M issues, we found 497 019
(7%) issues with a prediction power over 50%, and
191 036 (2.7%) with over 80%. These numbers can
be compared to the existing 158 000 vulnerabilities on
NVD, indicating that our approach could potentially
be used to additionally identify and enumerate a large
number of new vulnerabilities. Centralizing all vul-
nerabilities allows more efficient processes to iden-
tify, evaluate and prioritize vulnerabilities in software,
and subsequently to apply adequate remediation. As
of now, there are over 72M issues on GitHub, which
could result in about 2M security-related issues with
a prediction power of over 80%, assuming the distri-
bution is representative.
6 RELATED WORK
In (Palacio et al., 2019), Palacio et al. implemented a model to perform the same task, but with a different model architecture and a supervised approach. They released an open-source version of their model and data called SecureReqNet (SRN). They use a convolutional neural network (CNN) with a strong analogy to N-gram features in the documents and achieve a performance of 98.6% AUC, but evaluate on a different dataset than our experiments, as their test set is not publicly available. Their dataset was automatically derived from CVE references and validated by experts through random sampling. This process biases the model towards security issues with a strong link to NVD, and may not be a fitting representation of security-related issues in a broader sense, and thus not an accurate representation of reality. In this paper, we derived our final testing set from completely randomly sampled issues, with no prior bias, to get a better representation of the real underlying dataset. We call this the User Labeled Test Dataset, and a substantial difference in performance between our model and the model presented in (Palacio et al., 2019) is shown in Table 3.
In (Chen et al., 2020), the authors use SSL to find vulnerability candidates from commit messages, issues, pull requests, and patches, with a high F1 measure of 70.5% (72% recall, 69% precision). Their approach uses self-learning (Nigam and Ghani, 2000) as the SSL approach, which may treat predictions on unlabeled examples as labels in succeeding training iterations. This SSL approach drastically improves their performance, which supports the hypothesis that unlabeled data should be used to increase performance on this task. This work builds on work from some of the same authors, who conducted similar research without SSL (Zhou and Sharma, 2017).
The authors of (Zou et al., 2018) present a solution to distinguish security-related bug reports from non-security-related bug reports. The model was trained with a supervised approach using textual and meta-features extracted from 23 608 reports from Bugzilla with bugs in Firefox, Seamonkey, and Thunderbird. For this more narrow approach, they achieved a strong F1 of 88.6% (79.9% recall, 99.4% precision).
7 CONCLUSION
We propose the use of a Hierarchical Attention Net-
work (HAN) to classify GitHub issues as security re-
lated. To increase the amount of training data, we also
propose to use Virtual Adversarial Training (VAT).
The models are compared to the previously proposed SRN model using both the automatically labeled SRN validation set as a test set and a manually labelled test set provided in this paper. Comparing the models, the
Table 3: The best results for each model over the different test sets. Bold entries show the best result for security classification in each column; (SRN) columns refer to the SRN Validation Dataset and (UL) to the User Labeled Test Dataset.

| Model | Category | Precision (SRN) | Recall (SRN) | F1 (SRN) | Precision (UL) | Recall (UL) | F1 (UL) |
|---|---|---|---|---|---|---|---|
| Logistic Regression | non-security related | 65% | 99% | 79% | 92% | 92% | 92% |
| Logistic Regression | security | **99%** | 42% | 59% | 40% | **39%** | 40% |
| Logistic Regression | macro average | 81% | 72% | 69% | 66% | 66% | 66% |
| SRN | non-security related | 66% | 99% | 79% | 91% | 89% | 90% |
| SRN | security | 97% | 44% | 61% | 30% | 36% | 33% |
| SRN | macro average | 81% | 71% | 70% | 61% | 62% | 61% |
| HAN | non-security related | 68% | 99% | 80% | 91% | 97% | 94% |
| HAN | security | 97% | **49%** | **65%** | 57% | 33% | 42% |
| HAN | macro average | 82% | 74% | 73% | 74% | 65% | 68% |
| HAVAN (HAN w/ VAT) | non-security related | 66% | 99% | 79% | 92% | 98% | 95% |
| HAVAN (HAN w/ VAT) | security | 97% | 44% | 61% | **75%** | 35% | **48%** |
| HAVAN (HAN w/ VAT) | macro average | 82% | 72% | 70% | 83% | 67% | 71% |
Figure 6: The average error on security-related data with its 95% confidence interval for the SRN Validation Dataset (a) and the User Labeled Test Dataset (b). The y-axes represent the average error. Note that the scales in the two graphs are different.
HAN model outperforms the other models on the au-
tomatically generated test data, though the differences
are not dramatic for any of the macro average scores.
For the manually labeled dataset, the HAVAN model
gives the best results, with precision being particu-
larly strong. At the same time, SRN has the worst
performance for this dataset.
It seems clear that there are performance advantages to using a model tailored to full-document classification, the HAN model, for classifying issues. The attention mechanisms could also enable deeper analysis of important parts of each document, and even potentially UX capabilities with short summaries of each document, where the sentence with the most
Table 5: The AUC score for each model using the SRN Validation Dataset (SRN) and our User Labeled Test Dataset (UL).

| Model | SRN | UL |
|---|---|---|
| Logistic Regression | 0.962 | 0.729 |
| SRN | 0.955 | 0.634 |
| HAN | 0.932 | 0.666 |
| HAVAN (HAN w/ VAT) | 0.939 | 0.707 |
attention could be presented to a user. It also seems ad-
vantageous to leverage the vast number of unlabeled
examples with semi-supervised learning to classify is-
sues as security-related. Still, for our approach, it is important to note that the number of labeled relevant security examples is small in comparison to the full unlabeled dataset. To enable the use of more aggressive SSL methods, there is a need to acquire more labeled examples.
REFERENCES
Chen, Y., Santosa, A. E., Yi, A. M., Sharma, A., Sharma,
A., and Lo, D. (2020). A machine learning approach
for vulnerability curation. In Proceedings of the 17th
International Conference on Mining Software Repos-
itories, MSR ’20, page 32–42. Association for Com-
puting Machinery.
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014).
Explaining and harnessing adversarial examples.
arXiv:1412.6572.
Halko, N., Martinsson, P. G., and Tropp, J. A. (2011).
Finding structure with randomness: Probabilistic al-
gorithms for constructing approximate matrix decom-
positions. SIAM Review, 53(2):217–288.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9:1735–80.
Khurana, D., Koli, A., Khatter, K., and Singh, S. (2017).
Natural language processing: State of the art, current
trends and challenges. arXiv:1708.05148.
Miyato, T., Dai, A. M., and Goodfellow, I. (2017). Adver-
sarial training methods for semi-supervised text clas-
sification. arXiv:1605.07725.
Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., and Ishii,
S. (2015). Distributional smoothing with virtual ad-
versarial training. arXiv:1507.00677.
Nigam, K. and Ghani, R. (2000). Analyzing the effective-
ness and applicability of co-training. In Proceedings
of the Ninth International Conference on Information
and Knowledge Management, CIKM ’00, page 86–93.
Association for Computing Machinery.
Palacio, D. N., McCrystal, D., Moran, K., Bernal-Cárdenas, C., Poshyvanyk, D., and Shenefiel, C. (2019). Learn-
ing to identify security-related issues using convolu-
tional neural networks. In 2019 IEEE International
Conference on Software Maintenance and Evolution
(ICSME), pages 140–144.
SEMERU-Lab (2021). Securereqnet. https://github.com/
WM-SEMERU/SecureReqNet.
Synopsys (2020). 2020 open source security and
risk analysis. https://www.synopsys.com/
software-integrity/resources/analyst-reports/
open-source-security-risk-analysis.html.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. In Advances in
neural information processing systems, pages 5998–
6008.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy,
E. (2016). Hierarchical attention networks for docu-
ment classification. In Proceedings of the 2016 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, pages 1480–1489. Association
for Computational Linguistics.
Zhou, Y. and Sharma, A. (2017). Automated identification
of security issues from commit messages and bug re-
ports. In Proceedings of the 2017 11th Joint Meeting
on Foundations of Software Engineering, ESEC/FSE
2017, page 914–919. Association for Computing Ma-
chinery.
Zou, D., Deng, Z., Li, Z., and Jin, H. (2018). Automatically
identifying security bug reports via multitype features
analysis. In Susilo, W. and Yang, G., editors, Infor-
mation Security and Privacy, pages 619–633, Cham.
Springer.
APPENDIX
A Annotation Guidelines
An annotation policy was established in order to make
the annotation process more efficient and to favor re-
peatability and reproducibility. All data in the User
Labeled Test Dataset was annotated by one of the au-
thors with knowledge in the field of cybersecurity, a
condition that must be met in order to adequately label
data as relating to cybersecurity. Some data was an-
notated by multiple parties and compared in the cases
of mismatch to ensure the annotations were similar.
Many issues were ambiguous and unclear, making
it important to create a policy. The annotation guide-
line was used to establish a unified labeling method.
It was updated regularly during the annotation phase
whenever a new kind of case arose. The categories
do not discriminate between questions, warnings, or
other discussions about a certain topic. The text is
annotated as the most severe category that accurately
describes it. The priority goes from Vuln being high-
est to Safe being lowest.
Vuln: Presence of known exploits, user-reported vul-
nerabilities.
Risk: Commonly exploited methods such as
unrestricted user input, memory leaks, unex-
pected/unintended r/w/e os/database access, over-
flows, user-reported potential risk, segmentation fault,
access violation.
Caution: Breaking changes, breaking dependencies, breaking compilation, breaking updates, installation issues, authentication problems, port or socket malfunctioning, firewall issues, service unavailable, site down, failed tests, out of memory, crashes due to instabilities, unexpected/unintended r/w/e os/database deny, broken links, unknown CPU usage (mostly high usage with no obvious reason for it), incorrect mathematical calculations (with potential side effects), runtime errors, unknown memory issues, server configuration problems, error flags concerning security, or text that talks about computer security in some way.
Unsure: Unexpected behavior, minor breaking changes (e.g., new functionality that has not been used in production in a previous version), lack of confidence in its safety, UI bugs, development-mode-only issues.
Safe: Text does not cover topics concerning the cate-
gories above, such as issues asking for help with po-
tential programming mistakes.
During the evaluation, the issues labeled with
Vuln, Risk, and Caution were considered security-
related in our binary classification. Unsure and Safe
were considered not security-related.