Systematically Searching for Identity-Related Information in the Internet with OSINT Tools

Marcus Walkow and Daniela Pöhn

Universität der Bundeswehr München, Neubiberg, Germany

Keywords: OSINT, Open Source Intelligence, Taxonomy, Identity, Attack.

Abstract: The increase of Internet services has not only created numerous digital identities but also made more information available about the persons behind them. The data can be collected and used for attacks on digital identities as well as on identity management systems, which manage digital identities. In order to identify possible attack vectors and take countermeasures at an early stage, it is important for individuals and organizations to systematically search for and analyze the data. This paper proposes a classification of data and open-source intelligence (OSINT) tools related to identities. This classification helps to systematically search for data. In the next step, the data can be analyzed and countermeasures can be taken. Last but not least, an OSINT framework approach applying this classification for searching and analyzing data is presented and discussed.

1 INTRODUCTION

The software company LastPass examined the password behavior of individuals (LastPass, 2021). According to the report, 92 percent know that it is risky to use passwords more than once. Nevertheless, 65 percent always or mostly still use the same password or variations. While financial accounts primarily receive stronger passwords (68 percent), work-related accounts and medical records do not (32 and 31 percent, respectively). Only 8 percent of the participants think that a strong password should not be tied to personal information. According to (Zhang et al., 2010), it is possible to predict changes to the password. Consequently, searching for personal information on the Internet may lead to a valid new password. This is even more serious as attacks are increasing, leading to further credentials and personal data being compromised (Verizon, 2022). In organizations, not only one but several digital identities are managed in the identity management system. Typically, users have further accounts, such as web services, where information or credentials can be leaked. Hence, one compromised account in the organization can result in a wider attack.

Open-source intelligence (OSINT) can tackle the problem of the personal factor in passwords and fallback mechanisms. The more knowledge is found about the individual user, the greater the probability that the authentication factor can be derived from it. Hence, the results of a systematic search can warn the user before an incident happens. In order to systematically search for data, a classification is required. In addition, a modular open-source framework helps to apply this classification. The contribution of the paper is two-fold: 1) a classification of data related to identities and identity management systems and 2) an open-source OSINT framework approach based on the classification. This can be utilized to identify possible problematic information.

https://orcid.org/0000-0002-6373-3637

The paper is structured as follows: Section 2 provides an overview of the related work. Section 3 introduces and structures OSINT search. The classification is applied by an OSINT framework approach in Section 4. The approach is then discussed based on a real-world example in Section 5. Section 6 concludes the paper and provides an outlook on future work.

2 RELATED WORK

Several authors describe OSINT in general. For example, (Pastor-Galindo et al., 2020) provide an overview of OSINT with the basic workflows (collection, analysis, knowledge extraction). Additionally, the authors categorize analysis (lexical, semantic, geospatial, social media) and information (personal, organizational, network). (Martins and Medeiros, 2022) propose a taxonomy for threat intelligence sharing, which is of limited use for our purpose. The Malware Information Sharing Platform (MISP Project, 2022) uses, among others, the categories of blog posts, reports, presentations, news, forums, mailing lists, repositories, and other sources. (Azevedo et al., 2019) further detail the steps of clustering and correlating data, while (Hickey and Arcuri, 2020) explain OSINT including web applications, passwords, and emails. Like many other approaches, the authors focus on generic threat intelligence.

Walkow, M. and Pöhn, D.
Systematically Searching for Identity-Related Information in the Internet with OSINT Tools.
DOI: 10.5220/0011644200003405
In Proceedings of the 9th International Conference on Information Systems Security and Privacy (ICISSP 2023), pages 402-409
ISBN: 978-989-758-624-8; ISSN: 2184-4356
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Only a few approaches target OSINT for identities. (Butler et al., 2016) present REAPER, a tool for automated mass credential harvesting. Related to that, (Fang et al., 2019; Peng et al., 2019; Bermudez Villalva et al., 2018) describe the effects of a password leak. The site Have I Been Pwned (Hunt, 2022) applies a similar approach to warn of password leaks. Social media platforms have changed the way people communicate with each other. At the same time, they are an interesting source for further actions. (Kanta et al., 2020; Kanta et al., 2022) propose a concept to generate individual password lists based on data gathered by OSINT search. (V. et al., 2020) go a step further by searching the Internet for information about names, mobile numbers, and email addresses. The authors apply different people search engines focusing on social media. Similarly, (Sharad Sonawane et al., 2022) perform account matching, extracting user metadata to generate a single report. (Akhgar et al., 2017) use geographic, statistical, and other public sources, while (Gibson, 2016) speaks of unstructured and structured data as well as the type of procurement and the origin of the data.

Especially on GitHub, different OSINT tool lists can be found, with (Cyber Detective, 2022) being the most comprehensive. The author lists, e. g., the categories maps, geolocation, and transport; social media; text/sound/video analysis; image search and identification; cryptocurrencies; messengers; search engines; datasets; passwords; emails; nicknames; phone numbers; contact and leak search. OSINT Framework by (Nordine, 2022) visualizes different OSINT tools by grouping them into 32 categories, e. g., username; email address; images/videos/docs; social networks; instant messaging; people search engines; dating; phone numbers; public records; forums/blogs/IRC; archives; digital currency. Similarly, (Bielska et al., 2020) list 7,600 tools and services. Two well-known OSINT tools with open-source and commercial variants are SpiderFoot and Maltego. Although both offer modules related to identities, their main targets are domains and networks. Maltego Community Edition (CE) only has one person-related machine, searching for email addresses, whereas SpiderFoot Open Source offers more search options. In addition, full functionality is only available in the paid versions. This shows that further work is required to better protect individual users and organizations at relatively low cost.

3 OSINT SEARCH

In order to search for compromised identities and further information that could lead to that state, relevant data first needs to be explored. In the next step, vulnerabilities can be fixed and data can be removed to reduce the number of successful fraud attempts. Hence, the goals are the stages of identity research and cyber reconnaissance. We classify possible sources in a systematic way. First, we detail identity search in Section 3.1. As identity management systems require additional inspections, these are explained in Section 3.2. Next, we describe all-in-one search tools (see Section 3.3). Last but not least, we show helpers, which aid in the search and visualization (see Section 3.4). These sources and helpers can be applied for an extensive OSINT search, using all the different information.

3.1 Identity Search

The data requested during registration (e. g., usernames, email addresses, phone numbers, and personal information) can be leaked. Other data, such as relationship status and hobbies, can be used for social engineering and are, therefore, particularly interesting for security issues. Even though multi-factor authentication is increasingly applied, it can be circumvented. Therefore, it is important to reduce published data, described next, which can be found in the following sources (Cyber Detective, 2022; MISP Project, 2022; Hickey and Arcuri, 2020):

Social Media: Social media intelligence (SOCMINT) is a sub-branch of OSINT and refers to the information collected from social media websites. The data available can be open to the public or private (cannot be accessed without proper permissions). The content comprises posts/comments, replies, multimedia, social interaction, and metadata.

Search Engines: Main search engines used by users can be repurposed for OSINT. In addition, meta and specialty search engines are available.

Public Media: News from newspapers, radio stations, etc. is published online. News digest and discovery tools try to combine specific news.

Public Records: Reliable and legitimate source of information, e. g., registration of a person or financial data of a company.

Repositories: Code, snippets, documentation, and other information are published in public repositories, such as GitHub.

Archives: Website history and capture sites take snapshots of websites that will remain online even if the original page changes or disappears.

Leak Pages: Pastebin and alternative Pastebin-type sites contain leaks. These leaks are then checked by specific leak pages.

Dark and Deep Web: Another source for leaks is dark and deep web pages, either information in forums or specific web services.

Further Internet Pages: This includes forums, blogs, academic resources such as publications, cryptocurrencies, and all other Internet pages.

3.1.1 Registration Data

Email. Email addresses are often used as a substitute for self-chosen usernames or phone numbers. They immediately offer the advantage of an address for the confirmation link. Different tools search for email addresses or check whether an email address exists (Hickey and Arcuri, 2020; V. et al., 2020; Cyber Detective, 2022). This includes Snov.io, Hunter.io, Skrapp.io, Prospect.io, breachchecker.com, spycloud.com, and haveibeensold.app. In addition, haveibeenpwned.com searches by email address for leaks.

Username. Especially at the beginning of Web 2.0, self-chosen usernames were common for logging in (V. et al., 2020; Cyber Detective, 2022). Often, users choose a name, or a variation thereof, with which they have a personal connection. Hence, they often reuse it in the same or in a modified form for other registrations. There are two types of online tools: 1) tools that check whether a profile page exists on various social networks, such as whatsmyname.app, and 2) tools that create possible usernames based on entered names, for example, namecombiner.com.

Password. As users tend to reuse passwords, haveibeenpwned.com lists leaks based on email address. The leaks themselves, though, can be found on paste sites and in the dark and deep web (Hickey and Arcuri, 2020; Butler et al., 2016; Fang et al., 2019; Peng et al., 2019; Bermudez Villalva et al., 2018; Hunt, 2022). In addition, default passwords and password crackers are available online.

Phone Number. Eliminating the ownership factor simply by knowing the phone number requires additional technologies. However, there are hardly any online services that link a mobile phone number with a name or email address. A classic method is the telephone book, which mainly publishes landline numbers. The latter can be used, for example, via an SMS for multi-factor authentication if no mobile phone number is available. For practical attacks, landline number cloning is more complicated than mobile number cloning. Some online services provide metadata, such as the provider for a specified phone number (V. et al., 2020; Cyber Detective, 2022). Cell phone numbers that have been found can also be queried via Google Dorks. The phone number can also be read from social media using suitable crawlers or online services. A possible tool for Instagram, for example, is istaunch.com.

Address. Personal information such as postal (shipping) addresses can often be found in identity management systems of organizations (Cyber Detective, 2022). This information can be collected online after a leak. In addition, telephone books and public administrations provide further sources. For example, addresses are included in criminal and traffic registers and property searches in the US. Hitta.se is a Swedish search engine that offers telephone directories, addresses, and maps. Last but not least, search engines collect information.
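
The classification of registration data described in this section can be captured in a small lookup structure. The following Python sketch is only an illustration: the dictionary layout and the helper names tools_for and data_types_for are assumptions and not part of the described framework, while the tool names are those listed above.

```python
# Illustrative mapping of registration data types to tools named in Section 3.1.1.
# The structure and helper names are assumptions made for this sketch.
REGISTRATION_DATA_TOOLS = {
    "email": ["Snov.io", "Hunter.io", "Skrapp.io", "Prospect.io",
              "breachchecker.com", "spycloud.com", "haveibeensold.app",
              "haveibeenpwned.com"],
    "username": ["whatsmyname.app", "namecombiner.com"],
    "password": ["haveibeenpwned.com"],
    "phone_number": ["istaunch.com"],
    "address": ["Hitta.se"],
}


def tools_for(data_type: str) -> list[str]:
    """Return the tools listed for a data type, or an empty list if none is known."""
    return list(REGISTRATION_DATA_TOOLS.get(data_type.lower(), []))


def data_types_for(tool: str) -> list[str]:
    """Return all data types (sorted) for which a given tool is listed."""
    return sorted(k for k, v in REGISTRATION_DATA_TOOLS.items() if tool in v)
```

A lookup of this kind would let a search start from whatever piece of registration data is already known and select only the matching tools.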

3.1.2 Further Data

Texts and Relationships. Information about social relationships (personal and organizational (Pastor-Galindo et al., 2020)) can provide a plausible background story for social engineering attacks or answers to security questions. Depending on the country, different networks dominate the market (e. g., VKontakte in Russia). In consequence, several tools are specialized (Cyber Detective, 2022). As an example, instahunt.co looks for usernames in Instagram, while crawlers such as Osintgram automate the quest. In contrast, meta-search engines explore different social networks, search engines, archives, and other websites in their forwarded search queries. yasni.de, for example, focuses on German-speaking countries, spokeo.com addresses the US, and social-searcher.com can be used internationally. In order to provide additional background information, further searches, such as about cryptocurrencies (e. g., with blockcypher.com), can be applied.


Image, Video, and Sound. Users upload several pictures and other material to social media and specific pages (Cyber Detective, 2022). Faces, objects, and logos can be recognized in a photo using Google Vision (Google, 2022b) or Microsoft’s face recognition API (Microsoft, 2022a). Tools such as reversearch.com can reverse search the image, whereas others, e. g., Sherloq, analyze it. Huntel.io and further tools analyze the geolocation if published by exchangeable image file format (EXIF) data. Political information, maps, etc. help to locate the material.
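
To illustrate the geolocation step, EXIF stores GPS coordinates as degrees, minutes, and seconds together with a hemisphere reference. The following Python sketch converts such values to decimal degrees. It assumes that the GPS fields have already been read into a dictionary keyed by the standard EXIF tag names; the dictionary layout and the function names are assumptions for this example, and reading the tags from an image file is left out.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus hemisphere reference to decimal degrees."""
    ref = ref.upper()
    if ref not in {"N", "S", "E", "W"}:
        raise ValueError(f"unexpected hemisphere reference: {ref!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are expressed as negative values.
    return -value if ref in {"S", "W"} else value


def gps_to_lat_lon(gps: dict) -> tuple[float, float]:
    """Turn a dict of EXIF GPS fields (assumed layout) into (latitude, longitude)."""
    lat = dms_to_decimal(*gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = dms_to_decimal(*gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return lat, lon


# Made-up example values, not taken from a real image.
example = {
    "GPSLatitude": (48, 4, 48.0), "GPSLatitudeRef": "N",
    "GPSLongitude": (11, 38, 24.0), "GPSLongitudeRef": "E",
}
latitude, longitude = gps_to_lat_lon(example)
```

The resulting pair can be passed directly to a map service to locate where a picture was taken.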

3.2 Technical Search

Cyber reconnaissance is a technical investigation aiming to provide attackers with as much information about the target as possible. This includes which (identity) software is being used. In addition, publicly available data can be browsed. Hence, the following sources can be utilized (Cyber Detective, 2022; MISP Project, 2022; Hickey and Arcuri, 2020):

Social Media: SOCMINT refers to the information (text, multimedia, interaction, metadata) collected from social media websites.

Search Engines: Main, meta, and specialty search engines search for selected topics.

Public Media: Online news from original sources.

Public Records: Reliable and legitimate source of information, e. g., financial data of a company.

Repositories: Data published at public repositories, such as GitHub.

Archives: Snapshots of sites taken by archives.

Leak Pages: Information about leaks.

Dark/Deep Web: Information below the surface.

Further Internet Pages: All other Internet pages.

Organization Website: Organizations provide information online via their websites, such as email addresses, roles, and persons.

Network: The organization’s network and servers offer information about insecure systems, used operating system and software versions, and Internet Protocol (IP) addresses.

3.2.1 Unsecured Data

Unintentionally leaked information, such as application programming interface (API) keys, credentials, and internal information, can be used during the attack lifecycle. With self-built scrapers, crawlers, or uniform resource locator (URL) fuzzers such as (Wilkening et al., 2022), and Google Dorks, publicly accessible folders, files, and data can be discovered.
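
As a rough illustration of how Google Dorks narrow a search to exposed material of one organization, the following Python sketch assembles query strings from the search operators site:, filetype:, and intitle:. The function name, its parameters, and the example domain are assumptions for this sketch; the queries are only built as strings and not sent to any search engine.

```python
def build_dorks(domain: str, filetypes: list[str], keywords: list[str]) -> list[str]:
    """Build search queries restricted to one domain (strings only, no network access)."""
    # Open directory listings typically carry "index of" in the page title.
    queries = [f'site:{domain} intitle:"index of"']
    # One query per document type that may contain internal information.
    queries += [f"site:{domain} filetype:{ft}" for ft in filetypes]
    # Exact-phrase queries for terms that hint at leaked credentials or keys.
    queries += [f'site:{domain} "{kw}"' for kw in keywords]
    return queries


# example.org is a placeholder domain.
for query in build_dorks("example.org", ["pdf", "xlsx"], ["api_key"]):
    print(query)
```

In a defensive setting, an organization would run such queries against its own domains to find unintentionally published files before others do.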

3.2.2 Network Data

Network Scanner. In order for the identity management systems to be screened using OSINT, the associated hardware must be found on the Internet. Shodan.io systematically probes relevant ports and publishes the results in a queryable format (Daskevics and Nikiforova, 2021). Censys.io provides a similar service. Network scanners such as the Network Mapper (Nmap) can be used to find out more about the IT infrastructure.

Application Testing Software. Tools such as Burp Suite examine the website of the identity management system. The Burp Suite extension SAML Raider focuses on the federated identity management protocol Security Assertion Markup Language (SAML). The Open Authorization (OAuth) scanner extension detects misconfigurations in the protocol implementations of OpenID Connect and OAuth.

Security Scanner. If the identity management system software is known, databases such as exploit-db.com display known vulnerabilities and exploits (Cyber Detective, 2022). So-called security scanners such as the Open Vulnerability Assessment Scanner (OpenVAS) thoroughly test the server behind it for possible vulnerabilities. pentest-tools.com provides a collection of such security scanners.

3.3 All-in-One Search Tools

All-in-one search tools reuse the tools listed above and combine the results across group boundaries (Cyber Detective, 2022). Recon-ng, SpiderFoot, TiDOS, The Harvester, and Maltego are comprehensive representatives. In the case of Maltego and SpiderFoot, the range of tools differs depending on the version. APIs for paid services such as Social Links CE can be integrated into Maltego CE and SpiderFoot HX. In the full version, external services such as pipl.com or People Data Labs are purchasable. Although these tools provide all-purpose searches, such as social media, search engines, dark web, and leak pages, their main focus is on organization networks.

3.4 Helpers

Due to the huge amount of data that can be found on the Internet, advanced techniques are needed to analyze the data and make a pre-selection.


3.4.1 Machine Learning

Machine learning is suitable for this task. Thereby, the photos collected from various social media can be evaluated and provide new insights into identities that were not yet obvious through research. Valuable information is also found in (short) messages and other texts on the Internet, where machine learning algorithms help to extract keywords and analyze the context. Microsoft’s Text Analytics (Microsoft, 2022b), IBM’s Watson API (IBM Developer, 2022), and Google’s Natural Language API (Google, 2022a) provide such analysis services.

3.4.2 Natural Language Processing

The aim of natural language processing (NLP) (Noubours et al., 2013) is to process natural language and thereby be able to grasp the meaning of texts and language. Just like people, a computer should have eyes and ears to pick up speech and analyze it with the brain, convert it into code or text, and then process it. In NLP, the problem is best addressed through deep learning models, where sufficient learning material is available due to large data collections. For the purpose of the paper, named-entity recognition (NER) (Yang and Lee, 2012; Al-Moslmi et al., 2020), sentiment analysis (Notz et al., 2019), and text generation (Lee et al., 2022) are of particular interest. A current text generator is Generative Pre-trained Transformer 3 (GPT-3) by OpenAI.

4 EXAMPLE AND CASE STUDY OF AN OSINT FRAMEWORK

This section describes our open-source OSINT framework to search for identity-related information (see Section 3.1). In addition, the technical search detailed in Section 3.2 can be used if the target is an organization. The framework follows the workflow described by (Pastor-Galindo et al., 2020): data collection (see Section 4.2), data analysis (see Section 4.3), and knowledge extraction (see Section 4.4).

4.1 Overview

Our OSINT framework has a graphical user interface (GUI) for interaction with the user. Thereby, the user can select different modules for their search. The modules are implemented or attached in the backend, which interacts with the storage (using a pre-defined folder structure) and database. The search results are then displayed in graphs. In order to realize the interactions, the framework Dash was chosen.

Figure 1 provides a brief overview of the GUI. In the top line, new values (e. g., names, identities, email addresses) are added. This can be combined with modules, which come next. In the big frame below, the results are displayed. In the example within the figure, a node with an image was selected and passed to a module. This resulted in new nodes with further information. In Section 5, Figure 2 details the overview using an example. Thereby, the workflow can be iterated.

Figure 1: OSINT framework.

4.2 Data Collection

For the data collection, we use our own and external tools based on Google Dorks, scrapers, and crawlers for various sources including social media sites. For example, if a name is entered, possible email addresses are generated and then checked for existence. Next, the valid email addresses are used to search for social media accounts with Sherlock (Sherlock Project Team, 2022). Based on the results, different crawlers download texts, images, and videos as well as further data. The raw data is either written directly to the project database or stored in the respective folders. In the next step, images and texts are passed on to be analyzed with suitable tools. Thereby, the data collection phase especially focuses on usernames, texts, and relationships, as well as media. If addresses and phone numbers are part of the social media profiles, then these are also collected.
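
The first collection step, generating possible email addresses from an entered name, could look roughly like the following Python sketch. The naming patterns, the function names, and the simple syntax check are assumptions made for illustration; the existence check mentioned above requires contacting external services and is deliberately left out.

```python
import re
import unicodedata

# Deliberately simple syntactic check; this is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}$")


def normalize(name: str) -> str:
    """Lowercase a name and drop accents and any non-letter characters."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z]", "", ascii_only.lower())


def email_candidates(first: str, last: str, domains: list[str]) -> list[str]:
    """Generate candidate addresses from common first/last name patterns (assumed)."""
    f, l = normalize(first), normalize(last)
    if not f or not l:
        return []
    local_parts = [f"{f}.{l}", f"{f}{l}", f"{f[0]}.{l}", f"{f[0]}{l}", f"{l}.{f}", f"{l}{f[0]}"]
    candidates = [f"{lp}@{d.lower()}" for d in domains for lp in local_parts]
    # dict.fromkeys removes duplicates while keeping the generation order.
    return [c for c in dict.fromkeys(candidates) if EMAIL_RE.match(c)]


# Placeholder name and domain.
print(email_candidates("Erika", "Mustermann", ["example.org"]))
```

Only candidates that pass a subsequent existence check would be handed to the account search.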

4.3 Data Analysis

The data analysis again uses external and implemented tools. For example, images are analyzed for geospatial data in EXIF format. Using an API, Google Vision is intended to recognize texts or faces in images. Images can be further analyzed for location information, such as buildings. The text analysis utilizes an API to Microsoft Text Analytics and the NER extraction tool for the German language. A list of found tokens is returned via the API. In order to receive full words, the words associated with the token are searched in the original text. Results from the analysis are transferred to the database for knowledge extraction. This phase focuses on texts and media, although the selected words are used as input to generate possible passwords. An attacker could apply these passwords for brute-force attacks. In a defensive scenario, the results may help to raise awareness and improve current passwords.
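
The step of mapping returned tokens back to full words can be sketched as follows. The exact format of the tokens returned by the NER tool is not specified here, so this Python example assumes that tokens are plain word fragments; the function name and the sample sentence are made up for illustration.

```python
import re


def expand_tokens(tokens: list[str], text: str) -> list[str]:
    """Return the full words of text that contain one of the tokens (case-insensitive)."""
    # Words may contain hyphens, e.g. compound place names.
    words = re.findall(r"\w+(?:-\w+)*", text)
    found = []
    for token in tokens:
        for word in words:
            if token.lower() in word.lower() and word not in found:
                found.append(word)
    return found


# Made-up sample text and token fragments.
sample = "Der Bundeskanzler sprach in Berlin-Mitte."
print(expand_tokens(["Bundes", "Berlin"], sample))
```

Substring matching of this kind can over-match short tokens, so a real implementation would likely use the character offsets of the tokens if the tool provides them.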

4.4 Knowledge Extraction

For the extraction of knowledge from the images and texts, the APIs of machine learning algorithms provided by Microsoft and Google are applied. Thereby, emotions, among others, can be discovered. In order to obtain age and sex/gender, two Convolutional Architecture for Fast Feature Embedding (Caffe) models are used with the OpenCV library. Here, the pictures are transformed into binary large objects (BLOBs) and transferred to the deep neural network (DNN) of the Caffe model. All reliable results serve as input for, e. g., GPT-NeoX to generate text messages, which could be used by attackers. In addition, possible passwords are created by a custom wordlist generator. This shows that knowledge extraction requires the helpers described in Section 3.4.
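
For the defensive scenario, a minimal wordlist generator can show users how easily a password built from personal keywords is covered by such a list. The following Python sketch is not the custom generator of the framework; the combination rules (three capitalizations, optional suffixes) and the function names are assumptions chosen for brevity.

```python
def wordlist(keywords: list[str], suffixes: list[str]) -> list[str]:
    """Combine each keyword in three capitalizations with each suffix, without duplicates."""
    variants = []
    for kw in keywords:
        for form in (kw.lower(), kw.capitalize(), kw.upper()):
            # The empty suffix keeps the bare keyword in the list.
            for suffix in ("", *suffixes):
                variants.append(form + suffix)
    return list(dict.fromkeys(variants))


def is_guessable(password: str, keywords: list[str], suffixes: list[str]) -> bool:
    """Check whether a password would appear in the list derived from the keywords."""
    return password in set(wordlist(keywords, suffixes))


# Made-up keyword and suffix.
print(is_guessable("Hamburg2022", ["hamburg"], ["2022"]))
```

A user could run such a check against their own public keywords to see whether a password should be replaced.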

5 DISCUSSION

We discuss our OSINT framework based on an exemplary search, the intended usage, and a brief comparison with other OSINT tools.

5.1 Applying the OSINT Framework

To explain how the OSINT framework works, an exemplary search was conducted on German Chancellor Olaf Scholz. This was limited to gender and age detection (GAD) of images from his Instagram page and NER of his tweets. After defining the target person of the search, the new node "olafscholz" was entered as the username for Instagram. This is possible as a self-built Instagram crawler was added via the corresponding menu. The framework was informed that the crawler needs a string as input, saves data as its result, and inserts the file names as nodes. With this self-built crawler, all images were downloaded.
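
How a module declares its input type and how its results become new nodes can be sketched with two small data classes. This Python example is a simplified assumption about such a registration mechanism, not the actual implementation of the framework; the class names, fields, and the stand-in crawler are hypothetical, and nothing is downloaded.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Module:
    """Describes a search module: the node type it accepts and how it produces results."""
    name: str
    input_type: type
    run: Callable[[Any], list]


@dataclass
class Graph:
    """Minimal node/edge store; edges point from the input node to each result node."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def apply(self, module: Module, node: Any) -> list:
        if not isinstance(node, module.input_type):
            raise TypeError(f"{module.name} expects {module.input_type.__name__}")
        if node not in self.nodes:
            self.nodes.append(node)
        results = module.run(node)
        for result in results:
            if result not in self.nodes:
                self.nodes.append(result)
            self.edges.append((node, result))
        return results


def fake_crawler(username: str) -> list:
    """Stand-in for a crawler: returns made-up file names instead of downloading images."""
    return [f"{username}{i}.jpg" for i in range(1, 4)]


graph = Graph()
crawler = Module("instagram_crawler", str, fake_crawler)
graph.apply(crawler, "olafscholz")
```

Declaring the input type up front lets the GUI offer only those modules that fit the currently selected node.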

In Figure 2, the 19 images found are displayed. Depending on the results, this overview can get too crowded. In the future, the framework will provide better placement and formatting of the nodes and edges; this may include a reduction of results by grouping.

5.2 Data Analysis

For the data analysis, a photo from the results (olafscholz10.jpg) was selected via its representative node and examined with the ML application GAD. Unlike the crawler, GAD requires images as input and returns information. The results for olafscholz10.jpg are inserted as two new nodes on the graph. For Twitter, the tool vicinitas.io is used. The tweets are stored as text files in the "Files" folder and a node is added to the graph.

By selecting the node and the German NER ML analysis tool and pressing the 'crawl text file for possible passwords' button, all tweets are examined for named entities. These appear below the graph and are stored in a text file. In later versions, the results will be sorted by frequency and, using sentiment analysis, by emotional significance. The text file can be used by programs such as Hydra to reduce the time for a brute-force attack. The assumption behind this is that users choose their passwords with a personal reference to remember them better. However, it should be noted that further cleaning of the words must take place to remove the # and @ characters that are typical for Twitter. Furthermore, the German NER also recognizes words as named entities that are not. In the future, information from nodes will be added to the password list, in order to include, for example, Olaf Scholz’s wife Britta Ernst as well as other information found about her.
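
The cleaning and frequency sorting mentioned above could be approached as in the following Python sketch. Treating leading # and @ as the Twitter-specific markers to strip is an assumption, as are the function names and the sample entities; sorting by sentiment is not covered.

```python
import re
from collections import Counter


def clean_entities(entities: list[str]) -> list[str]:
    """Strip leading '#'/'@' markers and trailing punctuation; drop empty results."""
    cleaned = [re.sub(r"^[#@]+|[^\w-]+$", "", e.strip()) for e in entities]
    return [c for c in cleaned if c]


def rank_by_frequency(entities: list[str]) -> list[str]:
    """Return the cleaned entities, most frequent first, each listed once."""
    counts = Counter(clean_entities(entities))
    return [word for word, _ in counts.most_common()]


# Made-up entities as they might be extracted from tweets.
print(rank_by_frequency(["#Hamburg", "Hamburg", "@Bundeskanzler", "SPD,"]))
```

Ranking by frequency puts the terms a person mentions most often at the top of the list, which is where a personal reference in a password would be expected first.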

The automated analysis of photos that supports research has been demonstrated with GAD. This capability becomes more helpful when extended with other services such as Google Vision. In the next step, we plan to test the framework on more private individuals as they are typically not aware of the external effects of their posts.

5.3 Comparison and Limitations

The basic functioning of collecting and gaining information about identities has been explained. On the technical side, however, there are still some limits and obstacles in comparison to the established tools. The framework can only search at locations for which modules with APIs are already written. Further APIs still have to be integrated. If no API is available, the search becomes cumbersome. This is also one limitation of existing tools. In comparison, Maltego CE found four unrelated email addresses, whereas SpiderFoot Open Source only reported that the person exists. The latter result did not change when including a Google API.

Figure 2: Research on Olaf Scholz.

As programs return different data types, a workaround for Python-based programs was created. With the Python function subprocess.run(), each output of the executed program is captured as a string. For outputs in list format, an interpreter was written that turns the string back into a list. Further interpreters for other typical Python data formats are planned. However, here lies a possible weakness that affects user-friendliness. If developers were to use proprietary data types for the output, users would have to write their own interpreter or information extraction process. A first idea would be to create a menu, as envisaged for the integration of APIs, in which users insert a tool output and mark which information is relevant. The framework should then recognize this, save it, and build an interpreter.
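
The workaround can be illustrated with the following Python sketch, which runs an external program with subprocess.run() and turns a list printed on standard output back into a Python list. Using ast.literal_eval for the conversion is an assumption about how such an interpreter might be written, not necessarily the approach taken in the framework; the function names are hypothetical.

```python
import ast
import subprocess
import sys


def run_tool(args: list[str]) -> str:
    """Run an external program and return its standard output as a stripped string."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def parse_list(output: str) -> list:
    """Interpret tool output as a list; fall back to a one-element list otherwise."""
    try:
        # literal_eval only accepts Python literals and does not execute code.
        value = ast.literal_eval(output)
    except (ValueError, SyntaxError):
        return [output]
    return list(value) if isinstance(value, (list, tuple)) else [value]


# A child Python process stands in for an external OSINT tool that prints a list.
raw = run_tool([sys.executable, "-c", "print(['a.jpg', 'b.jpg'])"])
files = parse_list(raw)
```

Output that is not a valid Python literal is passed through unchanged as a single value, which mirrors the limitation described above for proprietary output formats.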

6 CONCLUSION AND OUTLOOK

OSINT unearths information just waiting to be discovered, either by an individual/organization or an attacker. In order to master the flood of information, classification is necessary for a systematic search. This paper provides a systematic classification for identity and technical data, which is based on a literature review and available tools. In addition, all-in-one search tools and helpers are described. The classification is applied by the open-source OSINT framework approach, which covers the phases of data collection, data analysis, and knowledge extraction. The OSINT framework approach is then discussed based on a targeted search on Olaf Scholz. It shows that open-source tools are possible, though they require additional work to produce similar or better results than established tools with a focus on networks.

In order to provide a comprehensive tool for identity protection, further sources will be added in future work. In addition, we plan a user study on the usability and success rate, comparing the results with other open-source and commercial tools. At the same time, countermeasures to hide one’s information will be outlined. This OSINT framework will then be extended for organizational purposes.

REFERENCES

Akhgar, B., Bayerl, P. S., and Sampson, F. (2017). Open source intelligence investigation: From strategy to implementation. Springer.

Al-Moslmi, T., Gallofré Ocaña, M., Opdahl, A. L., and Veres, C. (2020). Named Entity Extraction for Knowledge Graphs: A Literature Overview. IEEE Access, 8:32862–32881.

Azevedo, R., Medeiros, I., and Bessani, A. (2019). PURE: Generating Quality Threat Intelligence by Clustering and Correlating OSINT. In 2019 18th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/13th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pages 483–490.

Bermudez Villalva, D. A., Onaolapo, J., Stringhini, G., and Musolesi, M. (2018). Under and over the surface: a comparison of the use of leaked account credentials in the Dark and Surface Web. Crime Science, 7(1):17.

Bielska, A., Kurz, N. R., Baumgartner, Y., and Benetis, V. (2020). Open Source Intelligence Tools and Resources Handbook 2020. https://i-intelligence.eu/uploads/public-documents/OSINT_Handbook_2020.pdf. Accessed 10-10-2022.

Butler, B., Wardman, B., and Pratt, N. (2016). REAPER: an automated, scalable solution for mass credential harvesting and OSINT. In 2016 IEEE APWG Symposium on Electronic Crime Research (eCrime), pages 1–10.

Cyber Detective (2022). OSINT tools collection. https://github.com/cipher387/osint_stuff_tool_collection. Accessed 10-10-2022.

ICISSP 2023 - 9th International Conference on Information Systems Security and Privacy


Daskevics, A. and Nikiforova, A. (2021). ShoBeVODSDT:

Shodan and Binary Edge based vulnerable open data

sources detection tool or what Internet of Things

Search Engines know about you. In Proceedings of

the 2nd IEEE International Conference on Intelligent

Data Science Technologies and Applications (IDSTA),

pages 38–45.

Fang, Y., Guo, Y., Huang, C., and Liu, L. (2019). Ana-

lyzing and Identifying Data Breaches in Underground

Forums. IEEE Access, 7:48770–48777.

Gibson, H. (2016). Acquisition and Preparation of Data for

OSINT Investigations. In Akhgar, B., Bayerl, P. S.,

and Sampson, F., editors, Open Source Intelligence In-

vestigation: From Strategy to Implementation, pages

69–93. Springer International Publishing, Cham.

Google (2022a). Natural Language API. https://cloud.

google.com/natural-language. Accessed 10-10-2022.

Google (2022b). Vision AI. https://cloud.google.com/

vision. Accessed 10-10-2022.

Hickey, M. and Arcuri, J. (2020). Open Source Intelligence

Gathering, pages 55–86. Wiley Data and Cybersecu-

rity.

Hunt, T. (2022). Have I Been Pwned: Check if your

email has been compromised in a data breach. https:

//haveibeenpwned.com. Accessed 10-10-2022.

IBM Developer (2022). Watson APIs - Resources and

Tools. Accessed 10-10-2022.

Kanta, A., Coisel, I., and Scanlon, M. (2020). Smarter Pass-

word Guessing Techniques Leveraging Contextual In-

formation and OSINT. In 2020 International IEEE

Conference on Cyber Security and Protection of Dig-

ital Services (Cyber Security), pages 1–2.

Kanta, A., Coisel, I., and Scanlon, M. (2022). A Novel

Dictionary Generation Methodology for Contextual-

Based Password Cracking. IEEE Access, 10:59178–

59188.

LastPass (2021). The 2021 Password Security Re-

port. https://www.lastpass.com/de/resources/ebook/

psychology-of-passwords-2021. Accessed 10-10-

2022.

Lee, M., Liang, P., and Yang, Q. (2022). CoAuthor: De-

signing a Human-AI Collaborative Writing Dataset

for Exploring Language Model Capabilities. In Pro-

ceedings of the ACM Conference on Human Factors

in Computing Systems (CHI).

Martins, C. and Medeiros, I. (2022). Generating Quality

Threat Intelligence Leveraging OSINT and a Cyber

Threat Unified Taxonomy. ACM Trans. Priv. Secur.,

25(3).

Microsoft (2022a). Face API. https://azure.microsoft.com/

en-us/products/cognitive-services/face/. Accessed

10-10-2022.

Microsoft (2022b). Text analytics. https://azure.microsoft.

com/en-us/products/cognitive-services/text-analytics.

Accessed 10-10-2022.

MISP Project (2022). MISP taxonomies and classification as machine tags. https://www.misp-project.org/taxonomies.html#_osint. Accessed 10-10-2022.

Nordine, J. (2022). OSINT Framework. https://

osintframework.com. Accessed 10-10-2022.

Notz, M., Grambau, J., and Hitzges, A. (2019). Evaluation

of Sentiment Databases: A Comparison of Sentiment

Databases through Social Listening Statements and

Azure Machine Learning Studio. In Proceedings of

the 3rd ACM International Conference on E-Business

and Internet (ICEBI), pages 8–12.

Noubours, S., Pritzkau, A., and Schade, U. (2013). NLP as

an essential ingredient of effective OSINT frameworks.

In Proceedings of the IEEE Military Communications

and Information Systems Conference (MILCIS), pages

1–7.

Pastor-Galindo, J., Nespoli, P., Gómez Mármol, F., and

Martínez Pérez, G. (2020). The Not Yet Exploited

Goldmine of OSINT: Opportunities, Open Challenges

and Future Trends. IEEE Access, 8:10282–10304.

Peng, P., Xu, C., Quinn, L., Hu, H., Viswanath, B., and

Wang, G. (2019). What Happens After You Leak

Your Password: Understanding Credential Sharing on

Phishing Sites. In Proceedings of the 2019 ACM Asia

Conference on Computer and Communications Security (Asia CCS), pages 181–192.

Sharad Sonawane, H., Deshmukh, S., Joy, V., and Hadsul,

D. (2022). Torsion: Web Reconnaissance using Open

Source Intelligence. In Proceedings of the 2nd IEEE

International Conference on Intelligent Technologies

(CONIT), pages 1–4.

Sherlock Project Team (2022). Sherlock Project. https://

sherlock-project.github.io. Accessed 10-10-2022.

V., A. A., A. K., B., R., M., Subbaraj, K., and Kumar Mo-

han, A. (2020). PeopleXploit : A hybrid tool to collect

public data. In Proceedings of the 4th IEEE Interna-

tional Conference on Computer, Communication and

Signal Processing (ICCCSP), pages 1–6.

Verizon (2022). Data Breach Investigations Report 2022.

https://www.verizon.com/business/resources/reports/

2022/dbir/2022-data-breach-investigations-report-

dbir.pdf. Accessed 10-10-2022.

Wilkening, F., Stiemert, L., Schopp, M., Pöhn, D., and

Hommel, W. (2022). Investigating Leaked Sensi-

tive Information in Version Control Systems with the

Kraulhorizon Framework. In Sicherheit in vernetzten

Systemen: 29. DFN-Konferenz, pages C1–C21. Books

on Demand.

Yang, H.-C. and Lee, C.-H. (2012). Mining open source

text documents for intelligence gathering. In Pro-

ceedings of the International IEEE Symposium on In-

formation Technologies in Medicine and Education

(ITiME), volume 2, pages 969–973.

Zhang, Y., Monrose, F., and Reiter, M. K. (2010). The Secu-

rity of Modern Password Expiration: An Algorithmic

Framework and Empirical Analysis. In Proceedings

of the 17th ACM Conference on Computer and Com-

munications Security (CCS), pages 176–186.
