Optimizing Leak Detection in Open-source Platforms with Machine
Learning Techniques
Sofiane Lounici¹, Marco Rosa¹, Carlo Maria Negri¹, Slim Trabelsi¹ and Melek Önen²

¹SAP Security Research, France
²EURECOM, France
Keywords: Data Mining, Security Tool, Machine Learning.
Abstract: Public code platforms like GitHub are exposed to several different attacks, and in particular to the detection and exploitation of sensitive information (such as passwords or API keys). While both developers and companies are aware of this issue, no efficient open-source tool performs leak detection with a significant precision rate. Indeed, a common problem in leak detection is the amount of false positive data (i.e., non-critical data wrongly detected as a leak), which leads to a substantial workload for the developers who review them manually. This paper presents an approach to detect data leaks in open-source projects with a low false positive rate. In addition to the regular expression scanners commonly used by current approaches, we propose several machine learning models targeting the false positives, showing that current approaches generate a false positive rate close to 80%. Furthermore, we demonstrate that our tool, while producing a negligible false negative rate, decreases the false positive rate to, at most, 6% of the output data.
1 INTRODUCTION
Data protection has become an important issue over the last few years. Despite the multiplication of awareness campaigns and the growth of good development practices, we observed a major rise of data leaks in 2019, with passwords representing 64% of all compromised data¹. It has become a major concern for companies to protect themselves and to efficiently detect these data leaks.
GitHub² is a hosting platform for software development version control. With more than 100 million repositories (of which at least 28 million are public), it is the largest host of source code in the world. Users can use GitHub to publish their code, to collaborate on open-source projects, or simply to use publicly available projects. In such an environment, one of the most critical threats is represented by hardcoded (or plaintext) credentials in open-source projects (MITRE, 2019). Indeed, when developers integrate an authentication process in their source code (e.g., a database access), a common practice is the use of passwords or authentication tokens (also known as API Keys). In this process, there is a risk that secrets may be unintentionally published in publicly available open-source projects, possibly leading to data breaches. For
¹ https://preview.tinyurl.com/y7bygg8d
² https://www.github.com
instance, Uber sustained a massive data leak in 2016³, affecting 57 million customers by revealing personal data such as names and phone numbers. This attack originated from a password found in a private GitHub repository.
Several tools are already available to detect leaks in open-source platforms, such as GitGuardian⁴ or TruffleHog⁵. Nevertheless, the diversity of credentials, which depends on multiple factors such as the programming language, code development conventions, or developers' personal habits, is a bottleneck for the effectiveness of these tools. Their lack of precision leads to a very high number of pieces of code detected as leaked secrets, even though they consist of perfectly legitimate code. Data wrongly detected as a leak is called false positive data, and it composes the vast majority of the data detected by currently available tools. Thus, various companies (including GitHub itself⁶) are starting to automate the detection of leaks while reducing false positive data.
In this paper, we present a novel approach to ana-
lyze GitHub open-source projects for data leaks, with
a significant decrease in false positives thanks to the
use of machine learning techniques.
³ https://tinyurl.com/yd3c37lc
⁴ https://www.gitguardian.com/
⁵ https://github.com/dxa4481/truffleHog
⁶ https://preview.tinyurl.com/ycnllvfd
First, a Regex Scanner searches through the source code for potential leaks, looking for any correspondence with a set of programming patterns. Then, machine learning models filter the potential leaks by detecting false positive data, before a human reviewer manually checks the classified data to correct possibly wrongly classified items. These machine learning models use various techniques such as data augmentation (Shorten and Khoshgoftaar, 2019), code stylometry (Long et al., 2017; Quiring et al., 2019) and reinforcement learning (Watkins and Dayan, 1992).
The main contributions of this paper can be summarized as follows.

- We present an automated leak detector for passwords and API Keys in open-source platforms, with a low false positive rate.
- We evaluate our solution by scanning 1000 public GitHub and 300 company-owned repositories, and we show that the classic regular expression approaches generate a high false positive rate, which we estimate to be close to 82%.
- We manually assess the results of this scan, showing that our solution reaches a negligible false negative rate.
- We investigate the false positives induced by the machine learning models, and we show that they lie between 5% and 32% of the filtered data (hence between 1% and 6% of the overall data).
Outline. We introduce an overview of the prob-
lem of leak detection in Section 2.1, alongside an ar-
chitecture of our framework in Section 2.2. We fur-
ther detail the different modules: We describe the Path
Model in Section 3, the Snippet models in Section 4,
and the Similarity model in Section 5. We present an
evaluation of our approach, focusing on the false pos-
itive rate induced by the machine learning models, in
Section 6. We discuss the related work in Section 7.
We finally address potential privacy concerns in Sec-
tion 8.
2 OVERVIEW
2.1 Problem Statement
A leak is a piece of information in a source code, published on open-source platforms such as GitHub, disclosing personal and sensitive data. Data leaks can be caused by any type of developer, from independent developers to large corporations. For instance, a password published on GitHub by an Uber employee led to the disclosure of personal information of 57 million customers⁷.
Several types of data leaks exist: API Keys (e.g., AWS credentials), email passwords, database credentials, etc. Although detection techniques exist, current approaches do not achieve a satisfying precision rate, leading to a high false positive rate, i.e., a non-negligible part of the data is wrongly classified as a leak. A high false positive rate implies a significant workload for reviewers who manually check the accuracy of the classification.
In this paper, we present an automated leak detec-
tor for open-source platforms with low false positive
rate, powered by machine learning.
We identify three main problems we intend to tackle. To begin with, we notice that open-source projects often provide the documentation of their code, together with tutorials, tests, and example files. These situations are easily recognizable from the actual path name (e.g., src/Example.py, connectionTutorial.java, etc.). A significant number of passwords or database credentials are located in these types of files and are never used in production, increasing the false positive rate.
Moreover, current solutions such as GitGuardian, TruffleHog, S3Scanner, GitHub Token Scanning or others (Sinha et al., 2015) consist of regular expression classifiers and exclusively focus on API Keys, ignoring passwords as a category of leak. Indeed, the detection of API Keys creates a negligible amount of false positive data (due to their particular patterns). Thus, it is easier to handle them with simple regular expression classifiers. Passwords, on the other hand, are difficult to identify with classic methods, even though they account for the majority of leaks, leading to a high false positive rate. Current solutions offer little to no automated false positive filtering (except with simple heuristics) because they discard the most important source of false positive data in their analysis.
Additionally, the detection of leaks with a low false positive rate is usually performed using supervised machine learning techniques, which by definition incur the need for labelled training data. The collection of leak data in this context remains a challenge for several reasons: (i) from a theoretical point of view, passwords and credentials are privacy-sensitive data; (ii) from a practical point of view, the training dataset needs to satisfy general properties such as balance or diversity, and current machine learning approaches cannot guarantee these properties while maintaining a reasonable manual workload to sanitize, anonymize and label data.
⁷ https://tinyurl.com/yd3c37lc
Figure 1: Architecture of our approach. A repository is analyzed by the Regex Scanner; the resulting discoveries pass through the Path Model (filtering Path false positives) and the Snippet Models, i.e., the Extractor and the Classifier (filtering Code snippet false positives); the Similarity Model then assists the manual review, turning the output before review into the output after review.
2.2 Our Approach
In order to detect leaks with high precision and low
false positive rate, we begin with the use of a regu-
lar expression scanner similar to classical approaches.
We further propose to make the distinction between
two sources of false positives: Path false positives
(e.g., data located in documentation or example files)
and Code snippet false positives (e.g., dummy creden-
tials or initialization variables). These two sources
of false positives can be tackled by two separate ma-
chine learning models: the Path model and the Snip-
pet model. Consequently, our solution regroups the
following components.
Regex Scanner. Given an open-source repository, the Regex Scanner searches through the source code history to detect any credential, API Key or plaintext password, and is considered as the default component in classic approaches. The Regex Scanner analyzes each source code modification made by a developer over time, retrieving the matches between these modifications and a set of regular expressions. The output of the Regex Scanner over a repository R is a set of m discoveries D = {d_1, ..., d_m}, each discovery containing a path f_i and a code snippet s_j.
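To make the interface of this component concrete, the following sketch (in Python) shows how such a scanner could produce discoveries. The two regular expressions and the Discovery structure are illustrative placeholders (the complete pattern list is given in the Appendix), and, for brevity, the sketch scans the current files of a repository rather than its full commit history.

```python
import re
from pathlib import Path
from typing import List, NamedTuple

# Illustrative patterns only; the full list used by the Regex Scanner
# is given in the Appendix (API Keys, RSA keys, password keywords, ...).
PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                           # AWS-style API Key
    re.compile(r"(password|pwd|passwd)\s*(=|:)\s*\S+", re.I),  # password assignment
]

class Discovery(NamedTuple):
    path: str      # f_i: the file in which the match was found
    snippet: str   # s_j: the matching line of code

def scan_repository(repo_root: str) -> List[Discovery]:
    """Return the set of discoveries D = {d_1, ..., d_m} for a repository."""
    discoveries = []
    for file in Path(repo_root).rglob("*"):
        if not file.is_file():
            continue
        for line in file.read_text(errors="ignore").splitlines():
            if any(p.search(line) for p in PATTERNS):
                discoveries.append(Discovery(str(file), line.strip()))
    return discoveries
```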
Path Model. The Path model analyzes each path f_i to reduce Path false positives, and outputs a list of filtered discoveries. We propose to make use of the Linear Continuous Bag-of-Words model to represent words and link them to their actual context. Thanks to this model, we already reduce false positives by 69%.
The Snippet models filter false positives related to code snippets. A code snippet is more complex to analyze than a file path (more diversity, more irregular patterns, etc.), and may contain a non-negligible amount of data that is irrelevant for leak classification (function names, type names, method names, symbols, etc.). Compared to the Path model, an additional pre-processing step is needed before the actual leak detection. Therefore, the Snippet models consist of two main components:
Extractor. The Extractor identifies relevant informa-
tion in the snippets, i.e., the variable name and the
value assigned. As mentioned before, it is difficult
to collect relevant data to train the Extractor. Thus,
we implement data augmentation techniques through
reinforcement learning.
Classifier. The Classifier takes the extracted relevant information as input to classify a code snippet as a leak or as a false positive. At this step, we consider again a LCBOW model; the combination of the Extractor and the Classifier leads to a further reduction of 13% of the discoveries.
As a final step, once the automated components output the leaks they have detected, a human reviewer manually checks the accuracy of the classification by flagging (i.e., manually re-classifying) a leak as a false positive.
Similarity Model. The Similarity model can assist
the human reviewer by flagging similar discoveries as
false positives to reduce her workload.
Figure 1 gives an overview of the architecture of
the proposed framework. In the following sections,
we describe the design choices for each of these com-
ponents while illustrating their use with three example
scenarios.
Scenario 1. Consider the code snippet String password = "Ub4!l", located in the file src/Example.py. The Regex Scanner identifies the keyword password, so the discovery is classified as a leak. Then, the Path model analyzes the file path and discards the leak as a Path false positive (due to the word Example).
Scenario 2. Consider the code snippet String password = "Ub4!l", located in the file src/run.py. The Regex Scanner still identifies the keyword password, while the Path model does not discard the leak due to its path. The Extractor outputs the combination (password, Ub4!l), and the Classifier classifies this code snippet as a leak.
Scenario 3. Consider the code snippet String password = "INSERT_CREDENTIAL_HERE", located in the file src/run.py. The Extractor outputs the combination (password, INSERT_CREDENTIAL_HERE), and the Classifier classifies the code snippet as a false positive.
3 PATH MODEL
The goal of the Path model is to reduce the Path false positives. This model analyzes where a leak is identified in an open-source repository (i.e., its file path), and gives a first classification on whether the leak is relevant or not. The Path model relies on a basic machine learning technique called Linear Continuous Bag-of-Words (LCBOW).
3.1 LCBOW Model
In the field of Natural Language Processing, there exist many possible choices for a text representation method, among which word embeddings, where words are mapped to vectors as in word2vec (Mikolov et al., 2013b), or Bag-of-Words (BoW) (Ma et al., 2019) representations. In this work, we consider the use of the Linear Continuous Bag-of-Words (LCBOW) model (Mikolov et al., 2013a; Joulin et al., 2016), especially for its efficiency. We briefly explain how the LCBOW model is built.
Let us denote a list of words as a document corpus of size N. A sentence in the document corpus is composed of N-gram features {w_1, w_2, ..., w_N}. We obtain the feature representations via a weight matrix U, i.e., x_i = U · w_i. Then, we define y as the linear Bag-of-Words of the document, by averaging all the feature representations x_i:

y = (1/N) Σ_{i=1}^{N} x_i

y is the input of a hidden layer associated with a weight matrix V, such that the output is z = V · y. We can compute the probability that a word vector belongs to the j-th class as p_j = σ(z_j), with σ(z_j)⁸ being the softmax function. Finally, the weight matrices U and V are computed by minimizing the negative log-likelihood of the probability distribution, using stochastic gradient descent, namely:

−(1/N) Σ_{k=1}^{N} y_k · log σ(V · U · w_k)

⁸ σ(z_j) = e^{z_j} / Σ_{k=1}^{m} e^{z_k}
In the remainder of the paper, we will use the notation LCBOW(w) to describe the vector representation of the word w.
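As an illustration, the following sketch implements the forward pass described above with NumPy. The vocabulary, the dimensions, and the random initialization are arbitrary, and the training of U and V by stochastic gradient descent is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"src": 0, "example": 1, "py": 2, "password": 3}
EMB_DIM, N_CLASSES = 16, 2                   # 2 classes: leak / false positive

U = rng.normal(size=(EMB_DIM, len(vocab)))   # word embedding matrix
V = rng.normal(size=(N_CLASSES, EMB_DIM))    # hidden-to-output matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lcbow_predict(words):
    """p = softmax(V . y), with y the average of the embeddings x_i = U . w_i."""
    xs = [U[:, vocab[w]] for w in words if w in vocab]   # column of U = U . one_hot(w)
    y = np.mean(xs, axis=0)                              # linear bag-of-words
    return softmax(V @ y)                                # class probabilities

print(lcbow_predict(["src", "example", "py"]))
```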
3.2 Data Pre-processing
The Regex Scanner outputs a list of discoveries, each discovery containing a path f_i (used as an input for the Path model) and a code snippet s_j (used as an input for the Snippet models). A pre-processing phase is needed for both of these inputs: First, we remove non-alphanumerical characters, before applying stemming and lemmatization, which are natural language processing techniques (Sun et al., 2014). We split the input data into words to obtain f_preproc = {f_i^1, ..., f_i^k}. In order to respect common coding conventions while standardizing the input data, we apply the Java coding convention to each word in f_i (the choice of coding convention is irrelevant as long as it is standardized for all inputs).
Example. If we consider Scenario 1, with f = src/Example.py and s = String password = "Ub4!l", the pre-processing phase outputs f_preproc = {src, Example, py} and s_preproc = {String, password, Ub4!l}.
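A minimal sketch of this tokenization step, assuming that quoted values are kept whole and leaving out stemming, lemmatization, and the coding-convention normalization:

```python
import re

def preprocess(text: str) -> list:
    """Tokenize a path or snippet: quoted values are kept whole,
    everything else is split on non-alphanumerical characters."""
    quoted = re.findall(r'"([^"]*)"', text)      # keep "Ub4!l" as a single token
    rest = re.sub(r'"[^"]*"', " ", text)         # drop the quoted parts
    words = re.sub(r"[^0-9a-zA-Z]+", " ", rest).split()
    return words + quoted

print(preprocess("src/Example.py"))              # ['src', 'Example', 'py']
print(preprocess('String password = "Ub4!l"'))   # ['String', 'password', 'Ub4!l']
```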
3.3 Training Phase
The workload to gather sufficient training data and to review labeled items can be handled by a human reviewer. Since the path name is not a sensitive piece of information, the data sanitization aspect can be reduced to a minimum. We collected 100k file names from 1000 GitHub repositories (analyzed in our evaluation in Section 6), which we labeled using regular expressions and manual checks. We applied the data pre-processing techniques and trained a LCBOW model, achieving 99% accuracy on this dataset.
4 SNIPPET MODELS
In this section, we detail the design choices for the Snippet models: the Extractor and the Classifier. To fully understand our approach, we first introduce several concepts intended to be used as building blocks for these models.
4.1 Building Blocks
4.1.1 Code Stylometry
Each developer has her own coding habits, depending on many factors such as the coding language or the occurrences of given keywords. We introduce a concept called code stylometry, aiming to encapsulate into a vector the main characteristics of these coding habits.
Example. Consider a Python developer, focused on software development. This developer will probably use keywords such as password or pass_word for password assignments (e.g., password = "Ub4!l"). A different developer, focused on database management, might prefer keywords such as root or db (like db.root = "Ub4!l"). These design choices will result in two different code stylometry vectors.

Supposing that we have extracts of code belonging to a developer (denoted E), we compute her code stylometry based on these extracts⁹.
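The following sketch illustrates how such a vector could be computed from the extracts E, using a subset of the features of Figure 6 (see the Appendix); the exact feature set and its scaling are illustrative assumptions:

```python
import numpy as np

SYMBOLS = "()[]{}=.;:\"'"

def stylometry(extracts: list) -> np.ndarray:
    """Compute a stylometry vector for a set of code extracts E, using a
    subset of the features of Figure 6 (symbols, lengths, spaces)."""
    features = [sum(e.count(s) for e in extracts) for s in SYMBOLS]
    lengths = [len(e) for e in extracts]
    spaces = sum(e.count(" ") for e in extracts)
    features += [
        float(np.mean(lengths)),         # average length in characters
        float(np.std(lengths)),          # standard deviation of the length
        spaces,                          # number of spaces
        spaces / max(sum(lengths), 1),   # ratio spaces / characters
    ]
    return np.asarray(features, dtype=float)

# Two developers with different habits yield two different vectors:
dev_a = stylometry(['password = "Ub4!l"'])
dev_b = stylometry(['db.root = "Ub4!l"'])
```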
4.1.2 Data Augmentation
As previously mentioned in Section 2.1, obtaining a dataset of leaks on GitHub is complicated. Indeed, since we are dealing with sensitive data, we have to follow and comply with privacy guidelines, e.g., performing data sanitization. The collected data also needs to be labelled, which may require a significant manual workload. In addition, the diversity of leaks in open-source repositories usually follows the Pareto rule, meaning that 80% of the data leaks originate from the same few programming patterns (for instance, password="1234" is extremely common). Therefore, collecting a dataset diverse enough to train a machine learning model (with good generalization properties and without overfitting (Shorten and Khoshgoftaar, 2019)) is difficult from a practical point of view. For these reasons, we propose to use data augmentation techniques in order to enhance the size and the diversity of the dataset with no extra cost in labelling or sanitization.
Data augmentation is a set of techniques to enhance the diversity of a dataset without new data. It is particularly used in image processing (Shorten and Khoshgoftaar, 2019), by applying filters to images in order to produce new training samples. The main benefit is to expand a dataset (fixing class imbalance or adding diversity in the training samples) with limited pre-processing cost. Data augmentation can also prevent overfitting (i.e., when a machine learning model is not able to generalize from the training data).
Example. Consider two leaks password="Ub4!" and mypass="1234". If we switch the variable names to obtain password="1234" and mypass="Ub4!", we have in fact created two new leaks. In general, given a pattern key="value", any pair (key, value) can be chosen to obtain a new leak. Every time another variable name is collected, data-augmented leaks can be obtained by re-arranging already existing data. More specifically, when a new programming pattern is collected for password assignment (e.g., DataBase.key="value"), additional leaks can be obtained, creating diversity from a limited dataset.
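The sketch below illustrates this re-arrangement: every combination of a pattern with a (key, value) pair yields a new sample. The pattern strings stand in for the programming patterns of Table 8:

```python
from itertools import product

# Collected material: (key, value) pairs and programming patterns (cf. Table 8)
keys = ["password", "mypass"]
values = ["Ub4!", "1234"]
patterns = ['{k} = "{v}"', 'DataBase.{k} = "{v}"']

# Any pattern combined with any (key, value) pair yields a new labeled leak:
augmented = [p.format(k=k, v=v)
             for p, (k, v) in product(patterns, product(keys, values))]
print(len(augmented))   # 2 patterns x 2 keys x 2 values = 8 samples from 2 leaks
```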
⁹ The complete list of the features we consider for code stylometry can be found in the Appendix.
Data: D, π, style_ref
Result: Training data for π: T_π

while condition is True do
    style ← choose_actions(π, D)
    reward_sim ← similarity(style, style_ref)
    update_choices(reward_sim)
end
T_π ← choose_actions(π, D)

Algorithm 1: Q-learning algorithm.
Data: Collected data D, patterns Π, extracts E
Result: model

style_ref ← stylometry(E)
for π in Π do
    T_π ← QLearning(π, D, E, style_ref)
    T_tot ← T_π ∪ T_tot
end
model ← train_LCBOW(T_tot)

Algorithm 2: Extractor model algorithm.
In the context of this work, we have a large number of alternatives to enhance our dataset, such as replacing variable names with synonyms, modifying function names (e.g., from set_password() to os.setPass()), or replacing '[]' with '()'. Since there is no clear algorithm to choose which actions (or combination of actions) will output the dataset best suited for the training phase, we consider the Q-learning algorithm (Watkins and Dayan, 1992).
4.1.3 Q-learning
The Q-learning algorithm is a reinforcement learning algorithm, where an agent learns, through interactions with its environment, which actions to take to maximize a reward. A classic example is a game of chess: the Q-learning algorithm will compute the list of moves the player needs to perform to win the game (to checkmate her opponent).
Similarly, in the data augmentation process, some actions can be applied to the collected data, such as modifying variable names (as in the example above), selecting different function names, or considering object-oriented programming patterns. Since different combinations of actions lead to different datasets, they also lead to different code stylometry vectors. The goal of data augmentation is to converge to a particular code stylometry of the transformed dataset, called the reference stylometry.

We define three primitives to build the Q-learning algorithm, which we show in Algorithm 1.
style ← choose_actions(π, D): The agent can choose a combination of actions she intends to perform on the data D (collected data from an empirical study) for a given pattern π. These actions produce a new dataset, from which we can compute the resulting stylometry style. In this paper, we consider a list of 28 programming patterns¹⁰.

similarity(style, style_ref): The similarity function computes the cosine distance between the current stylometry and the reference stylometry (computed from the extracts E). The output corresponds to the reward (which we want to maximize).

update_choices(reward_sim): Based on the reward, the Q-learning algorithm updates the available choices of actions. This update is ruled by the Bellman equation (Bellman, 1957).
After several iterations, the Q-learning algorithm applies the optimal combination of actions to compute the training dataset T_π for a given programming pattern π. The stopping condition can be time-based (e.g., a maximum number of iterations) or a threshold reward value.
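The sketch below compresses this loop into a single-state tabular formulation for brevity; apply_action and stylometry stand for the transformation and feature-extraction routines, and the hyper-parameters are illustrative:

```python
import random
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def q_learning(actions, apply_action, data, style_ref, stylometry,
               max_iter=200, alpha=0.1, gamma=0.9, eps=0.2, target=0.99):
    """Single-state tabular Q-learning over the augmentation actions of Table 6.
    The reward is the cosine similarity between the stylometry of the
    transformed dataset and the reference stylometry style_ref."""
    Q = {a: 0.0 for a in actions}
    for _ in range(max_iter):
        # choose_actions: epsilon-greedy pick of the next transformation
        a = random.choice(actions) if random.random() < eps else max(Q, key=Q.get)
        candidate = apply_action(data, a)
        # similarity: the reward we want to maximize
        reward = cosine(stylometry(candidate), style_ref)
        # update_choices: Bellman update (Bellman, 1957)
        Q[a] += alpha * (reward + gamma * max(Q.values()) - Q[a])
        data = candidate
        if reward >= target:        # threshold-based stopping condition
            break
    return data, Q                  # T_pi and the learned action values
```

A full formulation would track the stylometry of the current dataset as the state; the single-state version above only keeps the sketch short.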
4.2 Extractor
The main objective of the Extractor is to remove unnecessary elements from a code snippet. It takes as input a list of discoveries (corresponding to the output of the Path model) and outputs, for each code snippet, a tuple containing a variable name and a variable value. If no tuple can be found in a code snippet, the snippet is automatically discarded (because no variable assignment has been found).
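The sketch below only illustrates this input/output contract with a hand-written assignment pattern; the actual Extractor is a trained model (as described in the rest of this section), not a fixed regular expression:

```python
import re
from typing import Optional, Tuple

ASSIGNMENT = re.compile(r"([A-Za-z_][\w.]*)\s*=\s*[\"']([^\"']+)[\"']")

def extract(snippet: str) -> Optional[Tuple[str, str]]:
    """Return the (variable name, variable value) tuple of a snippet,
    or None when no assignment is found (the discovery is then discarded)."""
    m = ASSIGNMENT.search(snippet)
    return (m.group(1), m.group(2)) if m else None

print(extract('String password = "Ub4!l"'))   # ('password', 'Ub4!l')
print(extract("import os"))                   # None -> discarded
```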
The training data for the Extractor is obtained through the augmentation of data D collected from GitHub (variable names, function names, etc., used for variable assignments). Data augmentation is performed before the training phase. Simultaneously, the Extractor has access to a collection of code extracts E; these extracts are not discoveries, but simply randomly chosen pieces of code, from which we can compute a reference code stylometry. Hence, an Extractor model can be trained for every developer (because each of them has a different code stylometry) or for a group of developers (considering their global code stylometry).
The training phase of the Extractor is shown in Algorithm 2. For each collected programming pattern π, we apply the Q-learning algorithm, while considering the stylometry of the developer (style_ref) as the reference stylometry. We obtain the training data T_tot, on which a LCBOW model is trained to obtain the Extractor.

¹⁰ See the Appendix. The list of actions can also be found in the Appendix.

Table 1: FP by models (in millions of discoveries).

Repository type   Discoveries   File path FP   Code snippet FP   Total FP
public            13.6          9.35 (69%)     1.79 (13%)        11.11 (82%)
proprietary       0.259         0.091 (35%)    0.064 (25%)       0.155 (60%)
4.3 Classifier
The Classifier takes as input a list of tuples, each of them containing a variable name and a variable value (which corresponds to the output of the Extractor), and classifies the tuple as a leak or as a Code snippet false positive. The training data for the Classifier is different from the training data of the Extractor. We retrieved an open-source list of the most commonly used passwords¹¹ (used by multiple tools when attempting to guess credentials for a given targeted service), and collected (through an empirical study) a list of commonly used variable names (such as root, admin, pass, etc.). The design of the Classifier is similar to the design of the Path model, with a LCBOW model. The Classifier achieves 98% accuracy on this dataset of (variable name, variable value) tuples.

¹¹ https://github.com/danielmiessler/SecLists/tree/master/Passwords/Common-Credentials
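The following sketch shows how such a training set could be assembled; the concrete names and values are illustrative, with only the password list source taken from the text:

```python
# Illustrative construction of the Classifier's training data:
# (variable name, variable value) tuples labeled as leak (1) or false positive (0).
common_passwords = ["123456", "password", "qwerty"]        # e.g., from the SecLists list
variable_names = ["root", "admin", "pass", "db_password"]  # from the empirical study
dummy_values = ["INSERT_CREDENTIAL_HERE", "your_password", "xxxx"]

leaks = [(n, v, 1) for n in variable_names for v in common_passwords]
false_positives = [(n, v, 0) for n in variable_names for v in dummy_values]
training_set = leaks + false_positives   # then fed to a LCBOW model, as in Section 3
```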
5 SIMILARITY MODEL
In the manual review phase, a user can classify a potential leak containing a code snippet s_j as a false positive. We assume that we have the set of LCBOW representations of the code snippets of the discoveries, {LCBOW(s_1), ..., LCBOW(s_k)}. To reduce the workload of a human reviewer, we introduce a Similarity model, taking the representation LCBOW(s_j) as input and automatically classifying discoveries containing similar code snippets as false positives, denoted {LCBOW(s_i), ..., LCBOW(s_k')} with 0 ≤ k' ≤ k.

Definition. Let η be a similarity threshold. Two code snippet LCBOW representations LCBOW(s_i) and LCBOW(s_j) are similar if cosine(LCBOW(s_i), LCBOW(s_j)) ≥ η.

A similarity threshold η = 1 means that, for a flagged discovery {f_i, s_j}, the Similarity model flags all the duplicates of the code snippet. The impact of η is analyzed in Section 6.3.
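A minimal sketch of this flag propagation, assuming the LCBOW vectors of the snippets have already been computed:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def propagate_flag(flagged_vec, snippet_vecs, eta=0.9):
    """Given the LCBOW vector of a snippet the reviewer flagged as a false
    positive, return the indices of all discoveries whose snippet is similar
    (cosine similarity >= eta); eta = 1 keeps only exact duplicates."""
    return [i for i, v in enumerate(snippet_vecs) if cosine(flagged_vec, v) >= eta]
```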
6 EXPERIMENTS
In this section, we present an evaluation of our solu-
tion, divided into three major parts. Firstly (in Section 6.1), we evaluate the rate of false positive data
on the output of the Regex Scanner (as proposed by
the solutions in the literature). With this goal, we
scan a dataset of 1000 repositories from the pub-
lic GitHub (i.e., github.com), and 300 repositories
from a GitHub-like code versioning platform owned
by a private company. In the remainder of this sec-
tion, we refer to github.com as public github and to
the repositories publicly available on this platform as
public repositories, while we refer to the privately
owned GitHub platform as proprietary github and to
its repositories as proprietary repositories. Next, in
Section 6.2, we manually assess the false positive rate
as well as the false negative rate induced by the ma-
chine learning models, and we show that the false
negative rate is negligible (meaning that no leak on
the output of the Regex Scanner is discarded by the
models). Finally, in Section 6.3 we estimate the im-
pact of the data augmentation algorithm parameters
on the precision of our solution.
The tool that we have developed, and that we have used for the experimental evaluation of our proposal, is available open source, together with the machine learning models¹².
6.1 Regex Scanner False Positive Rate
For this experiment, we randomly selected and
scanned 1000 public repositories on GitHub. The list
of regular expressions used by the Regex Scanner can
be found in the Appendix. Over 14 million discov-
eries have been found, with 13.6 million in 579 out
of 1000 public GitHub repositories (58%) and 260k
discoveries in 268 out of 300 proprietary reposito-
ries (89%). Our discoveries cover more than 30 pro-
gramming languages, and represent more than 300
file types. Figure 2 shows the 10 most common file extensions containing leaks in our dataset. The number of contributors and the sizes of the repositories were chosen to be equally distributed.

Figure 2: Most common files containing secrets.
We notice that API keys are still widely published
¹² https://github.com/SAP/credential-digger
Table 2: Manual assessment of 2000 discoveries.

                         Machine learning models
Manual classification    potential leak           non critical data
leak                     20% (true positives)     1% (false negatives)
non critical data        80% (false positives)    99% (true negatives)
in open-source projects, as shown also in (Meli et al., 2019). Nevertheless, they do not represent the majority of the discoveries. Indeed, in our study, we notice a larger number of passwords giving access to local and remote databases, or to e-mail accounts.
We observe that the vast majority of these passwords are not critical (i.e., false positives), which seriously increases the load of a developer who has to review each of them manually. These passwords are mostly undetectable by traditional scanning tools, but they are still easy to find by searching commit messages with a simple search tool (using keywords such as remove credentials, delete password, etc.). We found many passwords that we suppose to be real (even if we cannot be certain of this, since we are not allowed to test these passwords). This is a very important concern, not only because passwords are still widely reused (Pearman et al., 2019), but also because two-factor authentication is still scarcely known (and thus activated) (Milka, 2018; Center, 2019), and scarcely supported by services (Bursztein).
To summarize, the vast majority of the discoveries detected with the Regex Scanner consists of false positive data. In order to reduce the false positive rate, as described in Sections 3 and 4, we apply the Path model and the Snippet models sequentially, and finally evaluate the newly obtained false positive rates. As shown in Table 1, the Path model classifies almost 70% of the discoveries as false positives in the public dataset. This score is halved on the proprietary dataset. Together with the Snippet models, we see that up to 82% of the discoveries are classified as false positives without human intervention.
6.2 Models False Negatives
In order to assess the behavior of our models, we decided to perform a manual review of a limited number of discoveries (we recall that the Regex Scanner found 14 million discoveries in the previous experiment). To do so, we consider a sampling method, randomly selecting 100 discoveries classified as potential leaks by the models and 100 discoveries classified as non critical data by the models, and we manually analyze each of them. We repeat this process 10 times (covering 0.01% of all the discoveries from the previous experiment). The results are shown in
Table 2.

Figure 3: Data augmentation on D to assess the performance of the Extractor with the train/test split technique (inputs: repositories {R1, R2, R3} and collected data D; parameters: patterns Π, poisoning rate r_fp, training patterns Π*, Similarity model threshold η).

Table 3: Description of the three repositories.

Repository                          Language     Contributors
rhiever/MarkovNetwork¹³             Python       3
bradtraversy/vanillawebprojects¹⁴   Javascript   8
AGWA/git-crypt¹⁵                    C++          15

Table 4: Impact of data augmentation with Π_0.80.

Situation                       Precision   Recall
Pre-trained Extractor           55.56       100
Extractor with Q-learning       71.66       100
Extractor + Similarity model    74.52       99.71

It is visible that 99% of the discoveries classified as non critical data by the models are real-life true negatives. The remaining percentage (cor-
responding to false negatives) corresponds to edge
cases, where developers inserted (seemingly) real cre-
dentials in dummy files. Thus, in the scope of our
study, we can state that the unclassified leak rate is
negligible. Given the discoveries classified as poten-
tial leaks, 80% of them are non critical (i.e., false pos-
itives not detected by the models), and 20% of them
are actual leaks (i.e., true positives). If we project the
results of this manual assessment to the complete list
of discoveries, we can assume that (i) our models do
not create false negatives and (ii) they provide an effi-
cient reduction of the false positive data on the output
of the Regex Scanner.
6.3 Models False Positives
In the previous section, we noted that it is difficult to assess the false positive rate of the Snippet Models (especially the Extractor) with precise metrics, since we do not have a ground truth for the majority of the leaks detected in open-source repositories. We therefore had to consider other evaluation techniques (e.g., sampling) to evaluate the false positive rate in real-life conditions, or to manually label the discoveries, which represents a significant workload. Furthermore, due to the limited size of the labeled data that we managed to collect, we cannot apply the train/test split technique (Bronshtein, 2017) in order to evaluate our models on it. The train/test split technique is a well-known process to assess the validity of machine learning models, splitting the data into two distinct subsets: training data (on which we fit our model) and testing data (on which we evaluate our model). As mentioned before, the size of the collected labeled data D is too small to accurately evaluate the Extractor using the train/test split technique.
Figure 4: Normalized FP rate by pattern for the pre-trained model and the Extractor trained with Q-learning.

Nevertheless, Section 4.2 shows that we can apply data augmentation techniques to expand the size
of our training dataset, as long as we have a ref-
erence stylometry. Hence, the goal of this section
is to evaluate the false positive rate induced by the
Extractor itself (independently from the false posi-
tives induced by the Regex Scanner) on several open-
source repositories, with a train/test split approach
commonly used in supervised learning on an data
augmented dataset. To achieve this goal, we consider
three different repositories {R
1
, R
2
, R
3
}, each of them
containing source code written in different program-
ming languages by different developers (and different
code stylometries) as shown in Table 3. The main
idea is to use the stylometries of these repositories to
obtain an augmented dataset where the train/test split
technique is possible, and to see the impact of the aug-
mentation process on accuracy metrics such as preci-
sion or recall.
6.3.1 Train/Test Split
We propose an experiment to evaluate the false pos-
itive rate on the Snippet Models with respect to R ∈ {R_1, R_2, R_3}, as illustrated in Figure 3.
To begin with, we obtain an augmented dataset D' from the collected data D, the patterns Π, and the extracts E of the repository R. We can select the leak percentage in D' with the parameter r_fp (r_fp = 0.5 corresponds to a balanced dataset).
Next, we split D' into a training and a testing dataset. We also perform a split on the patterns to obtain Π* ⊂ Π: this ensures that the patterns used to perform the training (Π*) are different from the patterns used to do data augmentation (Π).
Finally, after the training phase, we compute metrics such as precision, recall, and F1 score on the testing dataset. A reviewer manually flags the false positives, and she is assisted by the Similarity model (with threshold η). We consider that the manual reviewer flags 0.1% of the discoveries.
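The sketch below summarizes this protocol; the augmented samples are assumed to carry the pattern that generated them (a "pattern" key), and the split ratio is illustrative:

```python
import random

def build_experiment(d_prime, patterns, frac=0.8, seed=0):
    """Protocol of Figure 3: given the augmented dataset D' (built with leak
    rate r_fp), train only on samples generated from Π* ⊂ Π and test on the
    samples generated from the remaining, unseen patterns."""
    rng = random.Random(seed)
    pi_star = set(rng.sample(patterns, int(frac * len(patterns))))  # Π*
    train = [s for s in d_prime if s["pattern"] in pi_star]
    test = [s for s in d_prime if s["pattern"] not in pi_star]
    return train, test
```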
There are mainly three hyper-parameters that have an impact on the precision of the Extractor: r_fp (the percentage of leaks over the size of D'), the choice of the subset of patterns Π* used to train the Extractor in the train/test phase, and the similarity threshold η of the Similarity model. In the following subsections, we show the effects of these three hyper-parameters on the accuracy of our solution. We first study the impact of the Q-learning algorithm on the precision of the Extractor, and show that this technique significantly increases the precision (thus decreasing the false positive rate). We further evaluate the impact of the three hyper-parameters on the precision, recall and false positive rate of our approach.
6.3.2 Pre-trained Model
To begin with, we study the impact of the data augmentation process. On the one hand, we have an Extractor model pre-trained on the data we collected without any data augmentation (called the pre-trained Extractor). On the other hand, we have an Extractor model trained with the Q-learning algorithm for data augmentation, where Π* = Π_0.8 (corresponding to a set of patterns, randomly chosen, including 80% of the patterns in Π). In Table 4, we see the impact of the Q-learning algorithm, with a high precision score as opposed to the pre-trained model (the precision increases from 55.56% to 71.66%).
A recall close to 100% means that we detect al-
most all the leaks. However, when the user flags a
discovery as false positive, the similarity model (with
threshold parameter η) may classify an actual leak as
non relevant (i.e., it may cause a false negative). If we
select η = 1, we reach a recall of 100% but without
any significant improvement of the precision score.
To fix the recall drop, a possible remediation is to inform the user about which discoveries have been classified as non relevant by the Similarity model, so that she can check whether or not an actual leak has been wrongly classified (this will improve the recall score, but will also increase the manual workload).
We also compare the precision score per pattern.
Each pattern has a complexity value associated with
its index (i.e., the pattern with index 1 is the simplest,
and the pattern with index 28 is the most complex).
As shown in Figure 4, we can observe a linear rela-
tionship between the pattern complexity and the false
positive rate when we use the pre-trained Extractor
(which seems natural for a global model, since more
complex patterns are harder to detect, leading to more
false positives). With the Extractor trained with the Q-
learning algorithm, the false positive rate is indepen-
dent from the complexity of the pattern (which means
that no particular pattern will lead to a higher false positive rate).
6.3.3 Extractor with Q-learning
In this section, we solely consider the Extractor trained with the Q-learning algorithm (excluding the pre-trained model), by presenting the impact of r_fp and Π* on the false positive rate.

Impact of r_fp: First, we analyze the impact of r_fp on the false positive rate in three different situations, i.e., with r_fp = 0.5 (balanced situation between leaks and false positives), r_fp = 0.2, and r_fp = 0.05 (unbalanced situation where leaks are scarce), while fixing the parameter Π*. We present the results in Table 5a. We observe that:

- in a balanced situation, we achieve a false positive rate of 5.97%, considerably reducing the part of false positive data in the discoveries;
- in unbalanced situations, the results show that we maintain an acceptable rate of false positives, below 12%.
Impact of Π*: Next, we analyze the impact of the choice of Π* on the false positive rate in several situations, while fixing the poisoning rate r_fp = 0.5. As mentioned before, each pattern has a complexity value. Thus, we can define the complexity of a set of patterns Π as the average complexity of these patterns (therefore, in our experiments, the complexity of our set of 28 patterns Π is equal to 14.5). Let Π_0.5 be a set of patterns representing 50% of the set of patterns in Π, with an equivalent pattern complexity. Table 5b presents the results.

For Π* = Π_0.5, we obtain a false positive rate of 18.35% in this setting. Compared to Π_0.8, a 30% decrease in the number of patterns leads to a 15% increase of the false positive rate, showing that our approach is able to generalize to unseen patterns while preserving a low false positive rate. We also consider Π_0.25, corresponding to 25% of the patterns with an equivalent overall pattern complexity.
Furthermore, we decided to study sets of patterns that do not conserve the overall pattern complexity, splitting Π into two sets Π = Π_simple ∪ Π_complex, corresponding respectively to the first 14 patterns and to the last 14 patterns. The results of the experiments with Π_simple, Π_complex, and Π_0.25 are also presented in Table 5b.
Table 5: Poisoning experiments. Results in bold in (a) correspond to experiments with identical parameters in (b).

                 Π_0.80                       Π_0.5
Situation        Precision   Recall   F1      Precision   Recall   F1
Before review    89.33       100      94.36   71.66       100      84.89
After review     89.69       99.96    94.55   74.52       99.71    85.30

(a) Impact of the manual review on the metrics.

Π* = Π_0.80      r_fp = 0.5   r_fp = 0.20   r_fp = 0.05
FP Rate          5.97         12.09         11.03

r_fp = 0.5       Π_complex    Π_0.5    Π_simple   Π_0.25
FP Rate          9.36         18.35    31.99      27.86

(b) Impact of Π* and r_fp on the FP rate.

Although Π_0.5, Π_simple and Π_complex contain the same number of programming patterns, the pattern
complexity distribution greatly impacts the false pos-
itive rate. We reach an acceptable false positive rate
with only 25% of the patterns, but more equally dis-
tributed in complexity. It is worth noting that the best score is reached with the Π_complex pattern set, with results close to the full-pattern experiment. Indeed, as shown in Figure 4, the false positive rate per pattern is higher, on average, for complex patterns (i.e., with index above 14). Therefore, targeting only this class of patterns leads to a decrease of the global false positive rate.
With respect to Π* and r_fp, we estimate the false positive rate induced by the Extractor to be between 6% and 32%. In Section 6.1, we showed that more than 80% of the false positive data (induced by the Regex Scanner) had already been discarded. Overall, we showed that the false positive rate of the whole solution (including the Regex Scanner and the machine learning models) represents between 1% and 6% of the output.
7 RELATED WORK
7.1 Research Work
A large body of work targets GitHub open-source projects, from vulnerability detection (Russell et al., 2018) to sentiment analysis (Guzman et al., 2014). Empirical studies also provide a more global overview of the data on GitHub (Kalliamvakou et al., 2014) and of how to facilitate its access (Gousios et al., 2014).
With the advent of machine learning techniques
in the researchers’ toolkits, approaches for source
code representation have been developed, propos-
ing a language-agnostic representation of source
code (Alon et al., 2018; Gelman et al., 2018). Leak
detection can be also considered as a branch of data
mining or code search tasks. Works on evaluating the
state of the semantic code search (Husain et al., 2019),
as well as works on deep learning applications for
code search (Cambronero et al., 2019), emphasize the
need for developing machine learning techniques for
source code analysis. However, these previous works have different purposes from ours, especially regarding the criticality of the datasets, and they consider token-based representations (and thus language-dependent ones), as opposed to our purely semantic approach.
Leak detection is connected to malware detec-
tion (Dahl et al., 2013; Pendlebury et al., 2019) ad-
dressing similar issues to solve privacy concerns in
realistic settings, where the testing samples are not
representative of real world distributions. Contrary
to malware classification, we do not have a reference
dataset to benchmark language-specific approaches.
Code transformations based on stylometry have
been tackled by other works (Long et al., 2017;
Quiring et al., 2019). In particular, in (Quiring
et al., 2019), the authors, given a list of code extracts {e_1, ..., e_n} developed by a list of developers {D_1, ..., D_m} and an authorship attribution classifier, transform each e_i to fool the classifier concerning the authorship of e_i. To do so, they use a Monte-Carlo Tree Search algorithm to compute the optimal
Tree Search algorithm to compute the most optimal
code transformations to perform the authorship attri-
bution attack. In our work, we leverage the ideas
developed in (Quiring et al., 2019) to perform our
own code transformation to do data augmentation.
We choose Temporal Difference (TD) learning over
Monte-Carlo, due to its incremental aspect. Indeed,
in the description of the Q-learning algorithm, there is
a stopping condition in order to obtain the augmented
data, whereas Monte-Carlo algorithms have to be run
completely. We suppose that in our case the condi-
tions for the convergence of TD algorithms are satis-
fied (Van Hasselt et al., 2018).
Two different studies have considered the state
of data leakage in GitHub repositories. (Sinha et al.,
2015) focuses on API Keys detection but the scope
of their study is limited to Java files, and the remedi-
ation techniques are mainly composed of heuristics.
In a more recent work (Meli et al., 2019), Meli et al.
propose a study on the leak of API Keys, focusing on
possible correlations between multiple features in a
GitHub project to find root causes. Nevertheless, this
work is limited to API Keys: it is explicitly stated that
their analysis does not apply to passwords. Moreover,
the focus of their study was on the characteristics of
true secrets, with indications on contributors or per-
sistence of secrets. Our focus dwells instead on the
false positive data, since it represents the vast major-
Figure 5: Comparison of available tools. The tools (TruffleHog, Git-secrets, Gitrob, (Meli et al., 2019), GitGuardian, Nightfall AI, and our approach) are compared along three groups of criteria: scanning process (regex, entropy check, heuristics, Path FP detection, password detection, machine learning), user experience (free, user interface, open-source, repository management, scan of private repositories, authentication not required), and adoption (community, scalability, regular updates). Legend: ● = provides property; ◐ = partially provides property; - = does not provide property.
ity of discoveries of any open-source project. Finally,
they provide an extensive study of GitHub API Keys
leaks by scanning a large number of repositories, close to 700,000. In our work, we chose not
to conduct our GitHub leak status study with such a
high number of repositories, because it would have
led to a tremendous number of false positive discov-
eries, which would not have been possible to process.
7.2 Comparison with Other Tools
Since the problem of leak detection in public open-
source projects is not new, open-source tools such as GitHub Token Scanning¹⁶, GitLeaks¹⁷ or S3Scanner¹⁸ have been developed to tackle it, alongside commercial platforms, namely GitGuardian and Gamma.
However, to the best of our knowledge, there is no
open-source tool which scans GitHub repositories and
applies machine learning to decrease the false positive
rate. Therefore, since the existing tools do not work
in the same paradigm as our approach (not consid-
ering passwords, for instance), we do not provide a
comparison of metrics to avoid any bias. Still, we can
compare our approach with several tools we selected.
TruffleHog¹⁹ is a very popular (5k stars on
GitHub, at the time of writing) and open-source scan-
ning tool. The user has to provide her own set of reg-
ular expressions to the tool in order to detect possible
leaks. This tool does not use machine learning, and
it is mostly targeted to detect API Keys. Its main ad-
vantage is surely its simplicity for developers. Simi-
lar tools have emerged with the same characteristics,
such as Gitrob
20
and git-secrets
21
.
¹⁶ https://preview.tinyurl.com/ycnllvfd
¹⁷ https://github.com/zricethezav/gitleaks
¹⁸ https://github.com/sa7mon/S3Scanner
¹⁹ https://github.com/dxa4481/truffleHog
²⁰ https://github.com/michenriksen/gitrob
²¹ https://github.com/awslabs/git-secrets
GitGuardian²² is a tool provided by the name-
sake company founded in 2016 and specialized in de-
tection of leaks in open-source resources. Alongside
their commercial offer, they provide free services to
scan one’s own GitHub repositories. They claim their
tool is machine learning powered and that they can
identify more than 200 API Keys patterns, but they do
not mention passwords.
TruffleHog and its variants aim to be a strong
baseline for scanning tools. For example, in (Meli
et al., 2019) the authors offer improvements to its core
algorithm. Various heuristics can be implemented
to improve the accuracy of the tool, such as entropy
check: if a string has high entropy, which means it
consists of seemingly random characters, the proba-
bility that this string is an API Key is high. We per-
form several manual tests on the GitGuardian plat-
form on various API Keys patterns and on plaintext
passwords in order to understand the possibilities and
the limitations of such a tool. According to our tests,
the platform is not able to detect plaintext passwords,
and it only detects a reduced sample of API Keys,
excluding big API Keys providers such as Facebook
and Paypal. We only tested the free version of Git-
Guardian, so it might be possible that the full capa-
bilities of the platform are only enabled in the com-
mercial offer. Another commercial tool called Nightfall AI²³ (formerly known as Watchtower) offers the same services, but no free version is available to test the platform.

²² https://www.gitguardian.com/
²³ https://www.nightfall.ai/
We compared several tools, on different criteria,
and show our results in Figure 5. For each scan-
ning tool, we compare what techniques are used, and
if there is any false positive reduction. The open-
source tools do not perform false positive reduction
(since most of them do not detect passwords), favor-
ing the usage of heuristics, which need less computational power. However, most of the heuristics are
not adapted to all use cases, so the developer has to
manually configure the tool without efficiency guar-
antees. In our approach, we choose to adapt the scan-
ning process to each developer, thus the fine-tuning
is performed by the Leak Generator rather than the
user herself. The continuous training parameter is
the ability for the tool to re-train the machine learn-
ing models when the user flags a discovery, so as to improve future classifications. Open-source solutions
are more focused on single use cases, offering lim-
ited interactions with the developers. Our approach,
similar to the GitGuardian platform, is to improve the
accuracy while reviewing, decreasing the monitoring
time. The user experience is also a key point in order
to be used efficiently. The price could represent an
important barrier for small companies willing to pro-
tect themselves, encouraging bad development habits.
Commercial products provide a user interface, mak-
ing the tool more accessible to developers, and even
to non-technical people. Since the origin of a leak
does not depend on the level of expertise of the devel-
opers (Meli et al., 2019), tools with a user interface
could be easily used also by beginners to protect their
code.
8 PRIVACY CONCERNS
DISCLOSURE
In this paper, we deal with critical data, which could harm users' privacy if used for malicious purposes. Thus, we need to discuss privacy is-
sues in the scope of our research. First, with regard to
the experiment shown in Section 6.1, public reposito-
ries represent open-source data found in public web-
sites (in particular, github.com), while the access to
the proprietary platform has been granted by the com-
pany that owns all the rights on it. In both cases, no
intrusion or hacking techniques were used to obtain
data. We ensure that the collected data is only accessible to our working team, for analysis purposes only, and that sensitive information has not been used to train predictive models. The training of the models,
together with the evaluation of our approach shown in
Section 6.3, has been achieved using sanitized data.
Furthermore, we did not attempt to use any actual
leaks we discovered to verify their authenticity, and
we tried, when possible, to notify the developer re-
sponsible for publishing credentials. Finally, all the
real data we collected have been deleted after the ex-
perimental evaluation of our approach.
9 CONCLUSION
We proposed an approach to detect data leaks in open-
source projects with a low false positive rate. Our
solution improves classic regular expression scanning
methods by leveraging machine learning models, filtering out a significant number of false positives. Through our se-
ries of experiments, we show that our approach out-
performs classic scanning methods, produces a negli-
gible amount of undetected leaks and results in a false
positive rate of at most 6% of the output data.
ACKNOWLEDGMENTS
We would like to thank Sabrina Kall for her help during the writing of this paper. We would also like to thank the Institute for Artificial Intelligence 3IA and the Council of Industrial Research for Artificial Intelligence ICAIR for their support.
REFERENCES
Alon, U., Zilberstein, M., Levy, O., and Yahav, E. (2018).
code2vec: Learning distributed representations of
code.
Bellman, R. (1957). Dynamic Programming.
Bronshtein, A. (2017). Train/test split and cross validation
in python. Understanding Machine Learning.
Bursztein, E. The bleak picture of two-factor authentication
adoption in the wild. https://tinyurl.com/yctk4aja.
Cambronero, J., Li, H., Kim, S., Sen, K., and Chandra, S.
(2019). When deep learning met code search.
Center, P. R. (2019). Americans and digital knowledge.
https://tinyurl.com/y8ftudoh.
Dahl, G. E., Stokes, J. W., Deng, L., and Yu, D. (2013).
Large-scale malware classification using random pro-
jections and neural networks. In ICASSP.
Gelman, B., Hoyle, B., Moore, J., Saxe, J., and Slater,
D. (2018). A language-agnostic model for semantic
source code labeling. In MASES.
Gousios, G., Vasilescu, B., Serebrenik, A., and Zaidman, A.
(2014). Lean ghtorrent: Github data on demand. In
MSR, pages 384–387.
Guzman, E., Azócar, D., and Li, Y. (2014). Sentiment
analysis of commit comments in github: an empirical
study. In MSR, pages 352–355.
Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., and
Brockschmidt, M. (2019). Codesearchnet challenge:
Evaluating the state of semantic code search.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
Kalliamvakou, E., Gousios, G., Blincoe, K., Singer, L., Ger-
man, D. M., and Damian, D. (2014). The promises and
perils of mining github. In MSR.
Long, F., Amidon, P., and Rinard, M. (2017). Automatic
inference of code transforms for patch generation. In
FSE, pages 727–739.
Ma, S., Sun, X., Wang, Y., and Lin, J. (2019). Bag-of-Words
as target for neural machine translation.
Meli, M., McNiece, M. R., and Reaves, B. (2019). How
bad can it git? characterizing secret leakage in public
github repositories. In NDSS.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient estimation of word representations in vector
space.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Dean, J. (2013b). Distributed representations of words
and phrases and their compositionality. In NIPS.
Milka, G. (2018). Anatomy of account takeover. In Pro-
ceedings of Enigma.
MITRE (2019). 2019 cwe top 25 most dangerous software
errors. https://tinyurl.com/y73xa6qk.
Pearman, S., Zhang, S. A., Bauer, L., Christin, N., and Cra-
nor, L. F. (2019). Why people (don’t) use password
managers effectively. In USENIX SOUPS.
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., and
Cavallaro, L. (2019). TESSERACT: Eliminating ex-
perimental bias in malware classification across space
and time. In USENIX Security Symposium, pages 729–
746.
Quiring, E., Maier, A., and Rieck, K. (2019). Misleading
authorship attribution of source code using adversarial
learning.
Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer,
J., Ozdemir, O., Ellingwood, P., and McConley, M.
(2018). Automated vulnerability detection in source
code using deep representation learning. In ICMLA.
Shorten, C. and Khoshgoftaar, T. M. (2019). A survey on
image data augmentation for deep learning. Journal
of Big Data, 6:60.
Sinha, V. S., Saha, D., Dhoolia, P., Padhye, R., and Mani,
S. (2015). Detecting and mitigating secret-key leaks
in source code repositories. In MSR, pages 396–400.
Sun, X., Liu, X., Hu, J., and Zhu, J. (2014). Empirical
studies on the nlp techniques for source code data pre-
processing. In EAST, pages 32–39.
Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat,
N., and Modayil, J. (2018). Deep reinforcement learn-
ing and the deadly triad.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine
learning, 8:279–292.
APPENDIX
A list of 29 regular expressions used in the Regex
Scanner is presented in Table 7. We collected 15 API
Keys patterns, 3 RSA Key patterns and 1 access to-
ken pattern from (Meli et al., 2019). In addition to
these, we also used patterns from TruffleHog. We
augmented this dataset with 2 ssh-related patterns,
alongside 8 password (or keyword) patterns. We
did not optimize our regular expressions, since we im-
plemented the scanner with Hyperscan, i.e., a regular
expression library offering integrated optimization.
In Table 6 and Table 8, we present respectively the
list of possible transformations on source code and the
list of programming patterns used for the data aug-
mentation process. We group the actions by class
of actions: identity action (no modification on the
source code), actions expanding (or reducing) the in-
put length, actions changing the hypothetical type of
an input, and actions impacting the pattern complex-
ity.
Table 6: Actions which could be applied to a source code extract.

Actions
identity
longer_key
longer_function
longer_method
longer_object
smaller_key
smaller_function
smaller_method
smaller_object
change_type
more_complex_pattern
simpler_pattern
We present the list of features considered to com-
pute the stylometry of an extract in Figure 6.
Features
Word occurrences in the code snippet
List of keywords in the code snippet
Number of total symbols
Average length in characters
Standard Deviation length in characters
Number of spaces
Ratio between number of spaces and number of characters
Occurrences of specific symbols (parentheses, brackets, etc.)
Figure 6: Features used to compute the stylometry vector.
Table 7: Regular expression patterns.

Type              Pattern                                                        Source
RSA Private Key   -----BEGIN RSA PRIVATE KEY-----[\r\n]+(?:\w+:.+)*[\s]*
                  (?:[0-9a-zA-Z+=]{64,76}[\r\n]+)+[0-9a-zA-Z+=]+[\r\n]+
                  -----END RSA PRIVATE KEY-----                                  Meli et al.
RSA EC Key        -----BEGIN EC PRIVATE KEY-----[\r\n]+(?:\w+:.+)*[\s]*
                  (?:[0-9a-zA-Z+=]{64,76}[\r\n]+)+[0-9a-zA-Z+=]+[\r\n]+
                  -----END EC PRIVATE KEY-----                                   Meli et al.
RSA PGP Key       -----BEGIN PGP PRIVATE KEY BLOCK-----[\r\n]+(?:\w+:.+)*[\s]*
                  (?:[0-9a-zA-Z+=]{64,76}[\r\n]+)+[0-9a-zA-Z+=]+[\r\n]+
                  -----END PGP PRIVATE KEY BLOCK-----                            Meli et al.
Access token      ((?:\?|\&|\"|\')(?:access_token)(?:\"|\')?\s*(?:=|:))          Meli et al.
Token             EAACEdEose0cBA[0-9A-Za-z]+                                     Meli et al.
Token             AIza[0-9A-Za-z\-_]{35}                                         Meli et al.
Token             [0-9]+-[0-9A-Za-z_]{32}\.apps\.googleusercontent\.com          Meli et al.
Token             sk_live_[0-9a-z]{32}                                           Meli et al.
Token             sk_live_[0-9a-zA-Z]{24}                                        Meli et al.
Token             rk_live_[0-9a-zA-Z]{24}                                        Meli et al.
Token             sq0atp-[0-9A-Za-z\-_]{22}                                      Meli et al.
Token             sq0csp-[0-9A-Za-z\-_]{43}                                      Meli et al.
Token             access_token\$production\$[0-9a-z]{16}\$[0-9a-f]{32}           Meli et al.
Token             SK[0-9a-fA-F]{32}                                              Meli et al.
Token             key-[0-9a-zA-Z]{32}                                            Meli et al.
Token             AKIA[0-9A-Z]{16}                                               Meli et al.
Token             (xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})      TruffleHog
Token             https://hooks.slack.com/services/T[a-zA-Z0-9_]{8}
                  /B[a-zA-Z0-9_]{8}/[a-zA-Z0-9_]{24}                             TruffleHog
Key word          sshpass                                                        Our contribution
Key word          sshpass -p.*['|\"]                                             Our contribution
Password          (root|admin|private_key_id|client_email|client_id|token_uri)
                  \s*((?:=|:|->|<-|=>|<=|==|<<))                                 Our contribution
Password          (password|new_pasword|username)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (user|email|User|Pwd|UserName|user_name)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (access_token|access_token_secret|consumer_key|consumer_secret)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (FACEBOOK_APP_ID|ANDROID_GOOGLE_CLIENT_ID)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (authTokenToken|oauthToken|CODECOV_TOKEN)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (IOS_GOOGLE_CLIENT_ID)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Password          (sk_live|rk_live)
                  \s*(?:=|:|->|<-|=>|<=|==|<<)                                   Our contribution
Table 8: Programming patterns used for the data augmentation process.

Id  Pattern
1   key = "value"
2   key['value']
3   key << object.method("value")
4   key.method('value')
5   Object.key = 'value@gmail.com'
6   key = type_1 function Password('value')
7   public type_1 type_2 int key = 'value'
8   key => method('value')
9   type_1 key = 'value'
10  Object['key'] = 'value'
11  method.key : "value"
12  object: {email: user.email, key: 'value'}
13  key = setter('value')
14  key = os.env('value')
15  Object.method :key => 'value'
16  key = Object.function('value')
17  User.function(email: 'name@gmail.com', key: 'value')
18  User.when(key.method_1()).method_2('value')
19  key.function().method_1('value')
20  type_1 key = Object.function_1('value')
21  method('key'=>'value')
22  public type_1 key { method_1 { method_2 'value' } }
23  private type_1 function_1 (type_1 key, type_2 password='value')
24  protected type_1 key = method('value')
25  type_1 key = method_1() credentials: 'value'.function_1()
26  type_1 key = function_1(method_1(type_2 credentials = 'value'))
27  Object_1.method_1(type_1 Object_2.key = Object_1.method_2('value'))
28  type_1 Object_1 = Object_2.method(type_2 key_1='value_1', type_3 key_2='value_2')