Track Down Identity Leaks using Threat Intelligence
Timo Malderle 1, Sven Knauer 1,2, Martin Lang 3, Matthias Wübbeling 1,2 and Michael Meier 1,2
1 University of Bonn, Germany
2 Fraunhofer FKIE, Bonn, Germany
3 Independent Researcher, Leverkusen, Germany
Keywords:
Identity Leakage, Threat Intelligence, Automated Leak Detection, Identity Protection.
Abstract:
Leakage of identity data is a precursor of identity theft on the Internet. Prevention measures are neither established to counteract identity theft nor is there any effective way to inform affected subjects after identity leakage has been discovered. To build an identity theft early warning system, it is crucial to find evidence of identity leakage that happened in the past. News sites on the Internet regularly report about organizations suffering from data leakage. The leaked data mostly contains member, customer or employee databases including private information. This paper presents a framework that automatically crawls and classifies news articles with respect to identity data leakage. The framework is designed to monitor an arbitrary set of websites and to extract corresponding articles. The articles found are provided to analysts and security researchers together with extracted information about the covered leaks. This lowers the amount of work that is necessary to stay up to date regarding leaks of identity data. The developed framework is a proof of concept and a foundation for further projects aiming to proactively warn affected users.
1 INTRODUCTION
Millions of user credentials, such as passwords or credit card information, are breached from Internet services. Regularly, the leaked sensitive information of individuals is abused by criminals for identity theft. After the criminal act, many credentials are still valid, and leaks are sold to other criminals or published publicly. Huge identity leaks result from attacks on services like social media platforms or online shops. Criminal hackers usually copy the user database partially or completely after they have successfully broken into the target system. More and more web services implement secure techniques to store customer credentials. If implemented correctly, these techniques render resulting leaks more or less unusable. Unfortunately, many web services do not secure their customer information using state-of-the-art security means.
The criticality of each identity leak depends on multiple factors. One factor is the kind of information included in the leaked data. While some leaks only reveal the email address and a hashed password, others contain the full name, the postal address, the birth date and some payment information alongside the clear text password. Affected individuals can be harmed in multiple ways. The obvious harm is financial loss due to fraud based on identity theft. Additionally, there can be social or psychological harm. At the very least, even if a leak solely contains the email address and a hashed password, it implicitly discloses that the affected individual uses that particular web service.
Leaks are traded, distributed or sold on different kinds of underground platforms. Many leaks circulate through these platforms for years before an affected individual recognizes their own involvement. To protect individuals against possible consequences resulting from identity leakage, there are several web services aiming to inform users about their identity leak status. Internationally, Have i been pwned is the most common service in this area (Hunt, 2017). However, there are some other services sharing the same intent: the HPI Identity Leak Checker (Hasso-Plattner-Institut für Digital Engineering gGmbH, 2017) and the Uni Bonn Leak Checker (University of Bonn, 2019). One major problem in operating such a service is the extensive need for computational and storage resources. To get access to relevant information, analysts manually search underground platforms in the Internet to identify newly available identity leaks. This is a very time-consuming task that needs to be done steadily to ensure the recency of their own leak database.
It seems to be impossible to fully automate this task, as most underground platforms protect their infrastructure against automated processing. Common hurdles are security mechanisms like CAPTCHAs or different kinds of required social interaction. Besides underground platforms, new leaks are reported by different sources, whereby the article contents differ significantly. Analysts need to subscribe to virtually every possible news service or blog to stay up to date. Nevertheless, relevant information regularly gets lost in the huge amount of available data. To decrease the time spent scanning the platforms, it would be helpful for the analyst to have more specific information. Knowledge about a recently breached web service or company and a possible name of a leak might help to find the leak in the Internet. This allows searching systematically for this leak instead of crawling the whole Internet without a specific target. Therefore, an automatic filter is needed that retains only the relevant information.
The work presented in this paper tries to close this gap. The main contributions are the steady monitoring of suitable sources and the rapid extraction of relevant information from the articles crawled from these sources to support security analysts. The required filter is realized with machine learning techniques. This filter is integrated into a system that assists security analysts in monitoring relevant news services and notifies them when a new leak occurs. The developed framework aims to permanently monitor relevant sources, identifying news dealing with identity leaks by leveraging natural language processing and a trained support vector machine for the selection process. To further decrease the amount of data served to the security researcher, a novel mechanism for determining the criticality of a leak is introduced, structuring the selected data in a usable web interface.
This paper is structured as follows: Section 2 presents related work from the categories identity leakage and information retrieval. Section 3 describes how the news articles are gathered. Their classification with an SVM into the groups leak-related article and unrelated article is the topic of Section 4. Subsequently, a comprehensive analysis is performed on the leak-related articles to extract the needed features (Section 5). Section 6 shows the entire concept of the developed system, and in Section 7 all results are summarized.
2 RELATED WORK
In computer science, identity leakage, information retrieval and document classification are well-known research fields. For a better understanding, the different research areas and their interaction within the designed system are presented in this section. The following subsection focuses on identity leakage, and the subsection afterwards outlines the research fields of information retrieval and document classification.
2.1 Identity Leakage
Identity leakage is a research field which investigates issues around users and their digital personal data. There are a few subareas exploring identity leakage from different perspectives. Mainly, it can be divided into a technical and a human part. The technical part evaluates the leakage itself, the detection of identity leakage and the protection against it (Thomas et al., 2017; Thomas et al., 2019; Gruss et al., 2018). The human part is about the human factors that can also be a reason for identity leakage, like using weak or reused passwords (Mayer and Volkamer, 2018; Wash et al., 2016; Pearman et al., 2017; Stobert and Biddle, 2014; Stobert and Biddle, 2018). From the viewpoint of security researchers, the preventive and reactive countermeasures are examined which are supposed to secure the affected user and the compromised infrastructure.
A reactive countermeasure is described by DeBlasio et al., who modelled a detection service for external web services (DeBlasio et al., 2017). This service registers honeypot accounts at the observed web services using fake email accounts on a self-operated email server. Each mailbox has the same login credentials as used for the corresponding web service. As soon as someone authenticates with the correct password at such a mailbox, the service knows that the login credentials were stolen from the observed web service (DeBlasio et al., 2017). The second viewpoint is the perspective of human computer interaction. There are multiple studies on how users choose their passwords. Around 33 % of users reuse their passwords (Han et al., 2016; Subrayan et al., 2017), and if a password is not reused, it is common to add a prefix or suffix, which leads to multiple similar passwords for one user (Han et al., 2016). Additionally, it is examined how to warn affected users about their involvement in an identity leak (Malderle et al., 2019).
2.2 Information Retrieval & Document
Classification
While document classification is the umbrella term for operations classifying documents into different classes algorithmically, information retrieval
describes the extraction of specific information from a large and possibly unordered source. Document classification is a very broad term, summing up whole research topics like image recognition or natural language analysis, including the classification of textual data, which is relevant to our research goal. When categorizing documents, there are different ways to assign a category. It could be based on the documents' subject, metadata or other attributes. Hereby, the selection of suitable attributes is crucial for a sophisticated categorization as well as for further tasks like grouping. Automated information retrieval systems are used to reduce what has been called information overload. The most visible information retrieval systems are web search engines like Google and projects like Thesaurus (Dictionary.com, LLC, 2019; Aitchison and Dextre Clarke, 2004).
However, there are also more focused fields of information retrieval, and there is research even in the scope of threat intelligence systems. Nunes et al. developed a threat intelligence system which gathers information about new malware and cyber-attacks by analyzing hacker forums and special marketplaces (Nunes et al., 2016). With a machine learning model, their system gathers around 300 cyber threat warnings per week. Benjamin et al. additionally track and analyse the conversations in hacker IRC channels to detect potential future threats (Benjamin et al., 2015). Very similar to this research, Grisham et al. use neural networks to identify information about new mobile malware in hacker forums (Grisham et al., 2017). All these works download content from different hacker forums. The problem is that the hacker forums are secured against automatic crawling by different techniques like CAPTCHAs, invitation-only logins and so on. Williams et al. designed a new approach to bypass these anti-crawling countermeasures by developing a specialized forum crawler (Williams et al., 2018). It has also been examined how textual information from common threat intelligence sources can be analyzed for further automatic processing. Husari et al. developed an ontology which can describe the context and the actions of cyber attacks (Husari et al., 2017). In addition, they developed a new approach for text mining that uses natural language processing and techniques from information retrieval to generate information which can be represented in their ontology. Furthermore, their tool allows exporting this information to standard data formats for threat intelligence like STIX 2.1 (Husari et al., 2017).
3 ARTICLE GATHERING
For gathering, suitable sources need to be identified. While there is a great number of news-producing sources, including analog and digital media, the scope needed to be narrowed for effective sourcing. To limit the set of possible sources to those being easily accessible by our framework, the following constraints were imposed:
Media. To enable automated collection and evaluation of the articles, the sources were limited to digital media.
Language. Regarding the current preference of English in the field of natural language processing as well as in security research and computer science in general, we focused on this language, based on the assumption that a relevant article about a new leak is going to be published in English, at least as a translation.
Accessibility. The source must be freely accessible and not hindered by limitations like paywalls, CAPTCHA queries or censorship. While this form of technical hurdle could be overcome, it bears additional cost, not benefiting this proof of concept.
Significance. This quality can be acquired by having a significant scope or by being specialized in IT security, ensuring good coverage or high specificity.
Following these constraints, the chosen list contains sector-specific websites like Security Week, The Hacker News or Threat Post as well as popular international sites like The Guardian. The complete list consists of: Comodo (https://blog.comodo.com), GBHackers (https://gbhackers.com/category/data-breach/), HackRead (https://www.hackread.com/hacking-news), Help Net Security (https://www.helpnetsecurity.com/view/news/), Infosecurity Magazine (https://www.infosecurity-magazine.com/news/), Security Gladiators (https://securitygladiators.com/internet-security-news/), Security Week (https://www.securityweek.com), Techworm (https://www.techworm.net), The Hacker News (https://thehackernews.com), Threat Post (https://threatpost.com/blog/), The Guardian (https://www.theguardian.com/international/), Information Week (https://www.informationweek.com/), Naked Security (https://nakedsecurity.sophos.com/), Trendmicro (https://www.trendmicro.com/vinfo/us/security/news/) and Cyberdefense Magazine (http://www.cyberdefensemagazine.com/category/news/).
For each of these sources a crawler was designed. The crawlers are adjusted to each source and based on the framework scrapy (Scrapinghub, 2019). A manager module manages the spider swarm and crawls every source at a configurable frequency, preferably multiple times a day, downloading the newly published articles. After downloading, the articles are tokenized, tagged with source and timestamp, and stored in a database. The biggest challenge besides selecting useful sources was to write crawlers that populate the database with homogeneous and sanitized data, following hyperlinks where it is helpful and skipping them everywhere else. The rules modelling the crawlers' behavior were crafted for every source after a manual analysis of the source's structure.
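As an illustration of such a per-source crawler, the following minimal sketch shows how one of these spiders could be built with scrapy; the spider name, CSS selectors and field names are hypothetical placeholders, since the actual rules were hand-crafted for every source.

```python
# Minimal sketch of a per-source spider, assuming scrapy is installed.
# Selectors and field names are illustrative, not the framework's rules.
import datetime
import scrapy


class SecurityNewsSpider(scrapy.Spider):
    name = "security_news"                       # hypothetical spider name
    start_urls = ["https://www.securityweek.com"]

    def parse(self, response):
        # Follow links to individual articles (placeholder selector).
        for href in response.css("article h2 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Store a homogeneous record: title, body text, source and timestamp.
        yield {
            "title": " ".join(response.css("h1::text").getall()).strip(),
            "text": " ".join(response.css("div.article p::text").getall()),
            "source": self.start_urls[0],
            "crawled_at": datetime.datetime.utcnow().isoformat(),
        }
```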
The crawling of these news outlets via the aforementioned spiders yielded a total of 52382 articles, which were stored in a MongoDB database. The number of articles per source varied greatly, ranging from 1646 articles from cyberdefense.com to 15163 articles from securityweek.com. The oldest crawled articles dated back to May 2007, resulting in an average of 12 articles per day over a time span of 12 years.
4 ARTICLE CLASSIFICATION
A linear support vector machine (SVM) is a linear machine learning model which is widely used in text categorization. It has several advantages, such as good generalization on data with wide feature sets (Joachims, 1998). This characteristic pays off in the analysis of textual data composed of thousands of different words. Therefore, a linear SVM is well-suited for separating leak news articles from other news articles.
The classification process had two main challenges: on the one hand, finding a way to semantically distinguish articles that deal with leaks from those that do not; on the other hand, deciding whether an article about leaks is about digital identity leakage or about unrelated forms of leakage like oil leakage or broken water pipes, to name just a few.
The articles in question obviously consist of a title and the actual content. Skimming through different news sources led to the impression that, for a human reader, the classification as leak or non-leak is possible for most articles by looking only at the title. Since the length of the typical article title is only a fraction of the length of the typical article content, the
title-only analysis consumes fewer computational resources than the analysis of the entire article. Because of that, it seemed reasonable to test whether a classification of articles based only on the title can lead to good results. In addition, classification based on the whole article text is tested.
4.1 Building a Training Set
In order to build a machine learning classification algorithm, it is unavoidable to manually classify a sufficient number of articles in order to form a training and a test set. To facilitate this task, a GUI was built to easily flag an article as leak or non-leak with a single click. A combined filtering of the articles regarding date, source and keywords allows grouping similar articles and classifying them manually reasonably fast. After initially classifying a few hundred articles in the described way, a first version of the optimizer, which is described later, was tested on non-classified articles. The articles which were predicted as leak related were then again manually checked to broaden the training set. Since the overall share of leak-related articles is relatively small, this was a feasible task. The repeated application of both approaches led to a total of 15217 classified articles forming the labeled data set, of which 1996 were categorized as leak related. Due to the described approach, the percentage of leak articles in the training set is considerably higher than in the complete data set. Overall, the leak-related articles made up 3.8 % of all articles collected.
4.2 Text Pre-processing
Before being classified by the SVM, the titles, and in the second stage the text bodies, of the manually classified articles need to be pre-processed. This process consists of tokenization and part-of-speech (POS) tagging of the text with the help of the Natural Language Toolkit (NLTK). After this step, the tokenized and tagged representations of the articles were used as feature sets for the consecutive steps. This representation is characterized by a reduction of the raw data achieved by deleting stop-words, conjunctions and the like, and by attaching categories like "nouns", "numeric data", "verbs", etc. to the features.
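A minimal sketch of this pre-processing step, assuming NLTK is installed, could look as follows; which token categories are dropped is an illustrative assumption.

```python
# Hypothetical pre-processing sketch with NLTK: tokenization, POS tagging
# and removal of stop-words and conjunctions. The filtering choices are
# illustrative, not the framework's exact configuration.
import nltk
from nltk.corpus import stopwords

# One-time download of the required NLTK resources.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")


def preprocess(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)                  # e.g. ("passwords", "NNS")
    stop = set(stopwords.words("english"))
    # Drop stop-words and conjunctions (tag "CC"), keep the POS category.
    return [(tok, tag) for tok, tag in tagged
            if tok.lower() not in stop and tag != "CC"]


print(preprocess("Hackers leaked 50 million customer passwords last week."))
```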
4.3 Learning
Afterwards, the data is vectorized by scikit-learn's CountVectorizer, split into a training and a test set, and finally processed by the SVM via the Python library scikit-learn (scikit-learn, 2019). To find the best parameters for the optimizer, a grid search was
performed. The best parameters found lead to the following results for the title-only approach: classification of the articles as leak related was done with a precision of 0.80 and a recall of 0.77, classification as non-leak related with a precision of 0.96 and a recall of 0.97. These numbers, although not completely terrible, do warrant a further investment of time and resources into the analysis of the complete content of the articles to improve the recognition of articles dealing with leaks.
Concerning the full-text-body approach, the lexical analysis and classification pipeline is improved, consisting of a TF-IDF step and a classification step with a trained SVM model. To improve the performance of the already existing pipeline, it is retrained with more data points. For further information on the technical details, consult the previous work on automated leak classification. This larger feature set showed significant improvements over the approach using the titles only. While consuming more computational power in the training and the service stage, the precision of retrieving leak-related articles came in at 0.876 and the recall showed rates above 0.95. The final recall value depends on how sharply the term "leak related" is defined and how borderline cases are tagged.
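A minimal sketch of such a TF-IDF/SVM pipeline with grid search, assuming scikit-learn and using illustrative toy data instead of the roughly 15000 manually labeled articles, could look as follows; the parameter grid is an assumption.

```python
# Hypothetical sketch of the full-text classification pipeline: TF-IDF
# features and a linear SVM, tuned via grid search. The toy data and the
# parameter grid are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in for the manually labeled articles (1 = leak related, 0 = unrelated).
texts = [
    "Hackers leaked 50 million customer passwords from an online shop",
    "Data breach at a social network exposes email addresses",
    "New graphics card announced at the trade fair",
    "Review of the latest smartphone camera",
]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])
param_grid = {"svm__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2)   # cv kept tiny for the toy data
search.fit(texts, labels)
print(search.best_params_)
print(search.predict(["Millions of user credentials published on a paste site"]))
```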
5 FEATURE EXTRACTION OF
LEAK NEWS ARTICLES
When dealing with articles about leaks, it is necessary to extract vital information about the data breach: How many users were affected? Which kind of identity data was stolen? When did the breach occur?
5.1 Extract Breached Service
For extracting the breached web service from a leak news article, it is necessary to specify how that information can be presented in an article. The obvious part of this specification is the position within the article. The collected articles consist of the three parts title, text and tags; they referenced the desired information partly in their title, partly in their text, sometimes in both, and sometimes additionally in the tags section. Since there were no cases which had the information in the tags but in neither of the other fields, the search for the breached service was narrowed to the text and title fields. On these fields, which were tokenized and saved by the crawler, part-of-speech tagging is performed. This process includes selecting all tokens representing nouns and reducing the selection by comparing the tokens to the Alexa 1 Million list (Amazon, 2019), which contains the million biggest web services. This results in an entity recognition process which produces a list of possible breached sources mentioned in the article. To extract the desired service, a series of filtering and weighting steps is taken. First, repeatedly named entities are weighted higher, representing their frequency in the article. Second, known correlating names, for example typical email providers like Yahoo, Gmail or GMX, or often-mentioned security researchers like Troy Hunt (and his site have i been pwned (Hunt, 2017)), are weighted lower. These entities represent a major share of the false positives returned by the entity recognition process. It was still a design goal not to filter them out strictly, because even if they regularly are not the source of an identity leak, they could be. In the end, the breached service recognition yields a sorted list representing the most probable sources for the leak in question.
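A simplified sketch of this extraction and weighting, with illustrative stand-ins for the Alexa list and the down-weighted entities, could look as follows (it reuses the NLTK resources from the pre-processing sketch).

```python
# Hypothetical sketch of the breached-service extraction: noun tokens are
# matched against a web-service list, weighted by frequency, and known
# false positives (email providers, well-known researcher sites) are
# weighted lower. All word lists and weights are illustrative.
from collections import Counter
import nltk

ALEXA_DOMAINS = {"dropbox", "linkedin", "yahoo", "gmail", "gmx"}   # stand-in for the Alexa 1M list
DOWNWEIGHTED = {"yahoo", "gmail", "gmx", "haveibeenpwned"}         # frequent false positives


def candidate_services(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    nouns = [tok.lower() for tok, tag in tagged if tag.startswith("NN")]
    counts = Counter(tok for tok in nouns if tok in ALEXA_DOMAINS)
    scored = {name: count * (0.3 if name in DOWNWEIGHTED else 1.0)
              for name, count in counts.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


text = "Dropbox confirmed the breach. Dropbox users with Gmail addresses are affected."
print(candidate_services(text))   # e.g. [('dropbox', 2.0), ('gmail', 0.3)]
```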
5.2 Extract Affected Identities
Another critical piece of information for evaluating the potential harm of an identity leak, as well as for estimating the best way to warn victims, is how many identities are affected. This information is, if present at all, included in the text (and possibly also the title) of the article in numeric or spelled-out form. Leveraging the previously performed POS tagging, the tokens representing these number-associated features are extracted and appended to a list. Numbers extracted from the title are weighted three times as heavily as numbers originating from the text, based on the idea that another number, perhaps referencing an event the article author wanted to draw similarities to, would need to be repeated more often than a usual reference (one or two times) to override the impression from the title. At the end, the most frequent (and therefore most heavily weighted) numbers are shown to the analyst.
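The following sketch illustrates this weighting under simplified assumptions; spelled-out numbers are not handled here, and the weights mirror the description above.

```python
# Hypothetical sketch of the affected-user-count extraction: numeric tokens
# (POS tag "CD") are collected, numbers from the title counting three times
# as much as numbers from the text. Spelled-out numbers are omitted here.
from collections import Counter
import nltk


def extract_numbers(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    numbers = []
    for tok, tag in tagged:
        cleaned = tok.replace(",", "")
        if tag == "CD" and cleaned.isdigit():
            numbers.append(int(cleaned))
    return numbers


def candidate_user_counts(title, body):
    weights = Counter()
    for number in extract_numbers(title):
        weights[number] += 3            # title numbers are weighted three times as heavily
    for number in extract_numbers(body):
        weights[number] += 1
    return weights.most_common()


title = "143 million customers hit by data breach"
body = "The company confirmed that 143 million records were stolen, more than in the 2013 incident."
print(candidate_user_counts(title, body))   # e.g. [(143, 4), (2013, 1)]
```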
5.3 Extract Leaked Data Type
Furthermore, it is important to know what type of identity data has been leaked. While a leakage of any data type is undesirable, especially when taking into consideration recent incidents like the 'politician doxing' in Germany in early 2019 (Dobuch, 2019), critical information like financial data or clear text passwords, and breaches at sensitive services like adult sites, can cause and have caused more harm than others (Thomson, 2015). To filter for this information, the natural language processing extracts all words known to be affiliated with these groups and stores them as metadata in the database. These extracted lists are
used in the threat assessment step to determine the
potential harm.
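A simplified sketch of this keyword matching, with illustrative word lists, could look as follows.

```python
# Hypothetical sketch of the leaked-data-type detection: plain keyword
# matching against small lists of critical data types. The word lists are
# illustrative, not the ones used by the framework.
CRITICAL_TERMS = {
    "financial": {"credit card", "iban", "bank account"},
    "credentials": {"clear text password", "plaintext password"},
    "personal": {"address", "phone number", "date of birth"},
}


def detect_data_types(article_text):
    lowered = article_text.lower()
    return {group: sorted(term for term in terms if term in lowered)
            for group, terms in CRITICAL_TERMS.items()
            if any(term in lowered for term in terms)}


print(detect_data_types("The dump contains credit card numbers and plaintext passwords."))
```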
5.4 Threat Assessment
Mandatory tasks for providing meaningful threat assessment and warning services are determining the actuality and relevance of a leak. Therefore, it is desirable to qualify an article as referring either to a current, yet unknown breach or to an old, already known one. Known in this context means known to this framework's database, which is called internal matching.
5.4.1 Internal Matching - Clustering
Internal matching describes the comparison of new leak articles with the ones stored in the database. Because of the inhomogeneous nature of journalistic work, this comparison cannot take place at document level, for example by comparing hash sums. In general, two articles vastly differing in style, written by different people for different news sites, can very well be about the same leak, whereas two very similar articles could deal with different ones. To overcome this challenge, the internal matching is realized by creating a network of leak clusters. They are built by grouping articles with similar breached services, filtered by time. To obtain useful clusters, the process was repeated with several shrinking time windows, observing the resulting clustering of known leak articles. This showed a typical time span of less than a week after the disclosure of a leak during which the reporting takes place. An additional effect of using time windows as a feature is the separation of survey articles, which happen to deal with the same services and affected user counts as the actual leak articles, but are regularly published at a temporal distance and often contain more detected services than would match a single leak.
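A minimal sketch of this clustering by service and time window, with illustrative data structures and a one-week window derived from the observation in the next subsection, could look as follows.

```python
# Hypothetical sketch of the internal matching: articles are grouped into a
# leak cluster when they name the same breached service and fall into the
# same time window (about one week). Data structures are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LeakArticle:
    service: str            # most probable breached service
    published: datetime


def cluster_articles(articles, window=timedelta(days=7)):
    clusters = []
    for article in sorted(articles, key=lambda a: a.published):
        for cluster in clusters:
            same_service = cluster[0].service == article.service
            in_window = article.published - cluster[0].published <= window
            if same_service and in_window:
                cluster.append(article)
                break
        else:
            clusters.append([article])      # start a new leak cluster
    return clusters


articles = [
    LeakArticle("dropbox", datetime(2019, 7, 1)),
    LeakArticle("dropbox", datetime(2019, 7, 3)),
    LeakArticle("dropbox", datetime(2019, 9, 20)),   # later survey article, separate cluster
]
print([len(cluster) for cluster in cluster_articles(articles)])   # [2, 1]
```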
5.4.2 Topicality
Another important piece of information about an article handling a leak is how old the leak is. This can reference the time when the source service was breached, the point when the leaked data was published, or the point when the media began reporting. These points can fall close together but do not have to. Typically, data gets stolen and is sold an indefinite time later, a time span which can be as long as several years. Concerning media coverage, it regularly starts hours to a few days after the leak was published on sites like pastebin.com.
Because of the uncertainty associated with information about when the breach took place and when the data first occurred on paste sites, we focus on the timestamps of the articles dealing with the respective leaks.
During this process it became evident that the reporting duration is significantly shorter than the span between the breach and the leak becoming common currency. While about 80 % of the articles concerning the same leak are published within the first 48 hours, it was observed that 99.4 % of all articles about the same leak are published within 6 days, staying just under a week of reporting time. This observation influenced the design of the clustering process.
5.4.3 Threat Level Rating
To reduce the amount of data the analyst has to review regularly, it is important to assess the relevance of the leak news. In this regard, relevance means the level of potential harm that could be associated with the new leak. Because no perfect model for this application exists yet, a constructive approach was taken. The model was designed with a security researcher as operator in mind and is based on the collectible information accessible without privilege or financial expense. Figure 1 shows the process for evaluating the threat level of a detected leak. To ensure a reliable metric without the need for future expansion in scale, we decided in favor of a usable approach, with a scale ranging from 1 (smallest potential threat) to 9 (biggest potential threat).
When an article has been classified as a leak and the supposed service and affected user count have been extracted, the leak article is integrated into the cluster model to decide whether the reported incident is already known or not. At this point, additional external matching services can be applied. As Figure 1 shows, the threat level is set to one if the incident is known and set to five otherwise. This step was designed with frequent usage of the tool in mind. It ensures that an incident already recognized by the admin will not be seen as an imminent threat and thus not reported by the notification system. Additionally, this ensures that an already known leak, irrespective of the following evaluation steps like affected user count or stolen data type, is always valued below 5, keeping the web interface uncluttered. This allows a selective view on the more relevant data for the operator in the web interface. To arrive at a more meaningful metric, the threat model weighs in several aspects, represented by the questions in Figure 1. If a question is answered positively, the threat level is incremented.
Figure 1: Threat Level Rating Process.
Service Reputation. Is the breached service in the Alexa Top 1k list, representing a big, possibly worldwide operating business?
Outstanding Affected User Count. Are more users affected than in the average leak?
Valuable Information Leaked. Is more data leaked than email addresses and hashed passwords, for example clear text passwords, credit card data, or vast amounts of private data like addresses and telephone numbers?
Popular by Newspapers. If the article handles a new, yet unknown leak: did more than three newspaper articles come up within 48 hours? In that case there seems to be a special interest in this particular leak, so the analyst should be informed.
The model does not differentiate between differently answered threat level questions resulting in the same score, providing an easy-to-understand metric for the operator while keeping all additional information in its linked data.
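A simplified sketch of this rating logic could look as follows; the thresholds, the field names, and the assumption that a cap is what keeps already known leaks below 5 are illustrative, not the exact model.

```python
# Hypothetical sketch of the threat level rating: an unknown leak starts at
# 5, a known one at 1, and each positively answered question adds one point.
# The cap for known leaks and the average user count are assumptions.
from dataclasses import dataclass


@dataclass
class LeakReport:
    already_known: bool
    service_in_alexa_top_1k: bool
    affected_users: int
    valuable_information: bool         # clear text passwords, credit card data, ...
    articles_within_48h: int


def threat_level(report, average_affected_users=1_000_000):
    level = 1 if report.already_known else 5
    level += report.service_in_alexa_top_1k
    level += report.affected_users > average_affected_users
    level += report.valuable_information
    level += report.articles_within_48h > 3
    cap = 4 if report.already_known else 9     # known leaks stay below 5 (assumption)
    return min(level, cap)


report = LeakReport(already_known=False, service_in_alexa_top_1k=True,
                    affected_users=5_000_000, valuable_information=True,
                    articles_within_48h=5)
print(threat_level(report))   # 9
```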
6 CONCEPTUALIZATION OF
THE ENTIRE SYSTEM
This section sketches the concept of the leak-news detection framework. Concentrating on the implementation, the framework consists of three stages forming a processing pipeline. Stage one handles the download and categorization of articles, using scrapy crawlers (Scrapinghub, 2019) and a trained linear SVM. If stage one categorizes an article as dealing with identity leaks, it starts the parsing process. The parser, divided into several modules, constitutes stage two. Once stage two has been used, the program flow always triggers stage three, consisting of the threat level assessment and the corresponding reactions, ranging from displaying the information in the web interface to immediately writing an email to the supervisor. This process, with additional steps like the evaluation of potential article sources
or the retraining option for the SVM, is shown in Figure 2.
Figure 2: Framework Pipeline.
Figure 3 shows the GUI which is used to facilitate
the manual classification of articles. Articles can be filtered by date, source or keywords. The background color indicates whether an article was manually classified as leak related (green) or non-leak related (red). A true or false flag in the 'Pred' column signifies that an article was part of the test set and was predicted to be leak related or not.
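Combining the sketches from the previous sections, the three stages could be glued together roughly as follows; the notification threshold and the print-based reactions are illustrative assumptions rather than the actual implementation.

```python
# Hypothetical glue code for the three-stage pipeline, reusing the earlier
# sketches (classifier pipeline, candidate_services, candidate_user_counts,
# detect_data_types, LeakReport, threat_level). Reactions are simplified.
EMAIL_THRESHOLD = 8    # assumed threat level at which the operator is mailed


def process_article(article, classifier):
    # Stage 1: classification of the downloaded article.
    if classifier.predict([article["text"]])[0] != 1:
        return                                          # not leak related

    # Stage 2: feature extraction by the parser modules.
    services = candidate_services(article["text"])
    counts = candidate_user_counts(article["title"], article["text"])
    data_types = detect_data_types(article["text"])

    # Stage 3: threat assessment and reaction.
    report = LeakReport(already_known=False,            # cluster lookup omitted here
                        service_in_alexa_top_1k=bool(services),
                        affected_users=counts[0][0] if counts else 0,
                        valuable_information=bool(data_types),
                        articles_within_48h=1)
    level = threat_level(report)
    print("web interface entry:", services[:3], counts[:3], data_types, level)
    if level >= EMAIL_THRESHOLD:
        print("notifying the operator by email")
```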
7 EVALUATION
To evaluate the usefulness of the framework, the classification process, the extraction of leak information and the threat assessment are looked at separately.
7.1 Information Extraction
Because of the information-retrieval nature of this process, the metrics precision and recall are suitable here as well. For the extraction process, recall describes the share of cases in which the correct affected user count and breached service could be detected, compared to all cases. Precision describes how much false or unwanted information was retrieved in addition. The output of the extraction process has a much lower volume than a complete leak article, which justifies a lower weighting of the precision metric. Presumably, a security researcher would be able and interested to get a list of three to four services, of which one is the service that was breached, linking to a single article explaining the case. In contrast, they would probably unsubscribe from a leak news detection service which sends four articles of which only one contains actual news. Trusting the filtering skills of the security researcher, we regard recall as the main quality measure for this module.
7.1.1 Breached Service Extraction
Looking at the extraction of the service or company that suffered a data breach, there are two recall metrics to consider. On the one hand, the articles containing the name of the service in their text could be successfully parsed with a recall of 0.95. The remaining margin of 0.05 consists of companies with names not recognized by the pipeline. The main reasons for this are names too unusual to be identified by the NLP step in the first place, or company names that are too close to normal English words and therefore also skip the NLP process. Additionally, services too small to appear on the Alexa 1 million list (Amazon, 2019) are excluded as well.
Figure 3: GUI.
On the other hand, when just looking at the extracted services for all classified articles, we find a recall of 0.89, because not all articles reporting about a hack or about credentials appearing on marketplaces name any affected company, which lowers the overall recall. Given that this decrease is not rooted in the framework itself but in the underlying articles, the recall of 0.95 should be used to assess the performance of the extraction feature. It should be noted that the process to determine this metric did not differentiate whether the extracted entity was a service or a company, as long as a security analyst could use the information. As an example, many news articles reported about hacks of Steam and the PlayStation Network, whilst others reported about leaks at Valve and Sony; both pieces of information are correct and are viewed as equal.
Looking at the weighting process, accuracy is the best metric for this case. The likelihood of having the breached service in one of the first three positions of the potential service array is about 70 %.
7.1.2 Affected User Count
The affected user count can be parsed in 95 % of the cases, resulting in a high recall of 0.95. The remaining cases do not give affected user counts at all, or are survey articles summarizing too many high user counts for the framework to keep track of. The accuracy is determined as for the breached service detection above, closely bound to the position in the returned feature list. The likelihood of having the correct affected user count in one of the first three positions of the potential user count array is about 85 %. Because there is no fixed position in the text for the correct count as opposed to additional affected user counts referring to different leaks, such information could not easily be ignored in the natural language processing; handling it is considered future work, leaving room for further improvement.
7.1.3 Information Type Detection
As we are interested in the possible impact of the leaked data, we care about whether the data type is a member of a group of especially dangerous material, containing financial data like credit card numbers, clear text passwords, or socially sensitive information like breaches at adult websites, for example. This process is implemented as a list matching and retrieves the pre-selected terms reliably. Information about the leaked data types, disregarding email addresses as a type, was given in less than 20 % of the cases. In the scope of this project it was not possible to assess whether this was due to less sensitive data being compromised or to only a few newspapers reporting on it.
7.2 Threat Assessment
Assessing the performance of the computed threat level is quite difficult, as it contains a subjective
component: its performance would be measured by its usability for the security researcher using our framework. The best metric would be a usability study, for example with a derived SUS questionnaire and a group of security researchers as participants. Sadly, this was not feasible in the scope of this project, so we concentrated on the constructive approach taken in designing the assessment module, regarding a final evaluation also as future work. By design, the framework concentrates on a few crucial features to determine the harmfulness of a leak. Most users should agree that a leak is more dangerous if more people than average are affected. Also, it is common sense that a leakage of clear text passwords or credit card information is more critical than leaked password hashes. The consideration whether some data types in our critical category are more harmful than others is not part of this project, but could be researched in future projects. It was the priority of the proposed framework to clearly identify a leak that is definitely harmful, to give a warning via the web interface, and optionally to warn or inform its operator by email. Being assembled as a pipeline, each stage relies on the work of its predecessors. If no affected user count could be detected, it cannot be compared to the average affected user count. For that reason we implemented a medium threat level area in which each user has to decide on their own whether it is worth looking further into an incident or not, configurable by an email-at-threat-level threshold variable. This shortcoming is not rooted in the process itself, but determined by the limited information provided by the news articles in the first place.
8 CONCLUSION
Login credentials are crucial information about the digital identities of individuals. Attacks on service providers regularly bring these identities into the unauthorized hands of criminals. The information is then used for identity theft or it is aggregated into larger collections and sold accordingly. This paper contributes a new approach to identify digital identity leakage as a foundation for a larger framework to proactively inform affected individuals. To this end, a wide variety of newspapers and blogs is crawled on a regular basis. After the identification of relevant articles, the substantial information is extracted and forwarded to a security analyst. It is shown that relevant articles can be reliably classified by in-depth text analysis. Furthermore, it is possible to extract the information required to identify the source of the leak and the number of identities affected. Running as a service, it steadily monitors the Internet for related information with low operational cost. Even though the described software cannot substitute security analysts, it decisively supports them in tracking down identity leaks, be they small or huge collections of stolen identities.
REFERENCES
Aitchison, J. and Dextre Clarke, S. (2004). The thesaurus: a
historical viewpoint, with a look to the future. In Cat-
aloging & Classification Quarterly, volume 37, pages
5–21.
Amazon (2019). Alexa top million. https://aws.amazon.
com/de/alexa-top-sites/. Last accessed 2019/12/17.
Benjamin, V., Li, W., Holt, T., and Chen, H. (2015). Explor-
ing threats and vulnerabilities in hacker web: Forums,
irc and carding shops. In 2015 IEEE International
Conference on Intelligence and Security Informatics
(ISI), pages 85–90.
DeBlasio, J., Savage, S., Voelker, G. M., and Snoeren,
A. C. (2017). Tripwire: Inferring internet site com-
promise. In Proceedings of the 2017 Internet Mea-
surement Conference, IMC ’17, pages 341–354, New
York, NY, USA. ACM.
Dictionary.com, LLC (2019). Thesaurus.com - synonyms
and antonyms of words. https://www.thesaurus.com.
Last accessed 2019/12/17.
Dobuch, G. (2019). 20-year-old german hacker con-
fesses in doxxing case. https://www.handelsblatt.com/
23841212.html. Last accessed 2019/12/17.
Grisham, J., Samtani, S., Patton, M., and Chen, H. (2017).
Identifying mobile malware and key threat actors in
online hacker forums for proactive cyber threat intel-
ligence. In 2017 IEEE International Conference on
Intelligence and Security Informatics (ISI), pages 13–
18.
Gruss, D., Schwarz, M., Wübbeling, M., Guggi, S.,
Malderle, T., More, S., and Lipp, M. (2018). Use-
after-FreeMail: Generalizing the use-after-free prob-
lem and applying it to email services. ASIACCS 2018
- Proceedings of the 2018 ACM Asia Conference on
Computer and Communications Security, pages 297–
311.
Han, W., Li, Z., Ni, M., Gu, G., and Xu, W. (2016). Shadow
Attacks based on Password Reuses: A Quantitative
Empirical View. IEEE Transactions on Dependable
and Secure Computing, X(X):1–1.
Hasso-Plattner-Institut für Digital Engineering gGmbH
(2017). HPI Leak Checker. https://sec.hpi.de/
leak-checker. Last accessed 2019/12/17.
Hunt, T. (2017). have i been pwned? https://
haveibeenpwned.com. Last accessed 2019/12/17.
Husari, G., Al-Shaer, E., Ahmed, M., Chu, B., and Niu, X.
(2017). Ttpdrill: Automatic and accurate extraction
of threat actions from unstructured text of cti sources.
In Proceedings of the 33rd Annual Computer Security
Applications Conference, ACSAC 2017, pages 103–
115, New York, NY, USA. ACM.
Joachims, T. (1998). Text categorization with support vec-
tor machines: Learning with many relevant features.
In ECML.
Malderle, T., Wübbeling, M., Knauer, S., and Meier, M.
(2019). Warning of affected users about an identity
leak. In Madureira, A. M., Abraham, A., Gandhi, N.,
Silva, C., and Antunes, M., editors, Proceedings of
the Tenth International Conference on Soft Computing
and Pattern Recognition (SoCPaR 2018), pages 278–
287. Springer International Publishing.
Mayer, P. and Volkamer, M. (2018). Addressing misconcep-
tions about password security effectively. ACM Inter-
national Conference Proceeding Series.
Nunes, E., Diab, A., Gunn, A., Marin, E., Mishra, V.,
Paliath, V., Robertson, J., Shakarian, J., Thart, A., and
Shakarian, P. (2016). Darknet and deepnet mining for
proactive cybersecurity threat intelligence. In 2016
IEEE Conference on Intelligence and Security Infor-
matics (ISI), pages 7–12.
Pearman, S., Thomas, J., Naeini, P. E., Habib, H., Bauer, L.,
Christin, N., Cranor, L. F., Egelmany, S., and Forgetz,
A. (2017). Let’s Go in for a closer look: Observing
passwords in their natural habitat. Proceedings of the
ACM Conference on Computer and Communications
Security, pages 295–310.
scikit-learn (2019). scikit-learn - Machine Learning in
Python. https://scikit-learn.org/stable/. Last accessed
2019/12/17.
Scrapinghub (2019). Scrapy. https://scrapy.org/. Last ac-
cessed 2019/12/17.
Stobert, E. and Biddle, R. (2014). The Password Life Cy-
cle: User Behaviour in Managing Passwords. In 10th
Symposium On Usable Privacy and Security (SOUPS
2014), volume 10, Menlo Park, CA. USENIX Associ-
ation.
Stobert, E. and Biddle, R. (2018). The password life cycle.
ACM Transactions on Privacy and Security, 21(3).
Subrayan, S., Mugilan, S., Sivanesan, B., and Kalaivani,
S. (2017). Multi-factor Authentication Scheme for
Shadow Attacks in Social Network. 2017 Inter-
national Conference on Technical Advancements in
Computers and Communications (ICTACC), pages
36–40.
Thomas, K., Li, F., Zand, A., Barrett, J., Ranieri, J., In-
vernizzi, L., Markov, Y., Comanescu, O., Eranti, V.,
Moscicki, A., Margolis, D., Paxson, V., and Bursztein,
E. (2017). Data Breaches, phishing, or malware? un-
derstanding the risks of stolen credentials. Proceed-
ings of the ACM Conference on Computer and Com-
munications Security, pages 1421–1434.
Thomas, K., Pullman, J., Yeo, K., Raghunathan, A., Kel-
ley, P. G., Invernizzi, L., Benko, B., Pietraszek, T., Pa-
tel, S., Boneh, D., and Bursztein, E. (2019). Protect-
ing accounts from credential stuffing with password
breach alerting. In 28th USENIX Security Symposium (USENIX Security 2019), Santa Clara, CA, USA, August 14-16, 2019, pages 1556–1571.
Thomson, I. (2015). More deaths linked to Ashley Madison hack as scammers move in. Last accessed 2019/12/17.
University of Bonn (2019). Uni bonn identity leak
checker. https://leakchecker.uni-bonn.de. Last ac-
cessed 2019/12/17.
Wash, R., Rader, E., Berman, R., and Wellmer, Z. (2016).
Understanding Password Choices: How Frequently
Entered Passwords Are Re-used across Websites.
Twelfth Symposium on Usable Privacy and Security,
(Soups):175.
Williams, R., Samtani, S., Patton, M., and Chen, H. (2018).
Incremental hacker forum exploit collection and clas-
sification for proactive cyber threat intelligence: An
exploratory study. In 2018 IEEE International Con-
ference on Intelligence and Security Informatics (ISI),
pages 94–99.