Automatic Detection of Cyber Security Events from Turkish Twitter Stream and Newspaper Data

Özgür Ural 1,a and Cengiz Acartürk 1,2,b

1 Informatics Institute, Cyber Security Graduate Program, Middle East Technical University, Ankara, Turkey

2 Informatics Institute, Cognitive Science Graduate Program, Middle East Technical University, Ankara, Turkey

Keywords:

Cyber Security, Event Detection, Turkish, Twitter, Hürriyet Newspaper.

Abstract:

Cybersecurity experts scan the internet and face security events that influence users and institutions. An information security analyst regularly examines sources to stay up to date on security events in their domain of expertise. This may lead to a heavy workload for information analysts if they do not have proper tools for security event investigation. For example, an information analyst may want to stay aware of cybersecurity events, such as a DDoS (Distributed Denial of Service) attack on a government agency website. The earlier they detect and understand the threats, the more time remains to mitigate the problem and investigate the event. Therefore, information security analysts need to establish and maintain situational awareness of security events and their likely effects. However, due to the large volume of information flow, it may be difficult for security analysts and researchers to detect and analyze security events in a timely manner. This study aims at developing tools that are able to provide timely reports of security incidents. A recent challenge is that the internet community uses different languages to share information. For instance, information about security events in Turkey is mostly shared on the internet in Turkish. The present study investigates automatic detection of security incidents in Turkish by processing data from Twitter and news media. It proposes a prototype of an automatic, Turkish-specific software system that can detect cybersecurity events in real time.

1 INTRODUCTION

1.1 Motivation and Objectives

Security awareness tools help security analysts to pro-

tect an institution’s sensitive and mission-critical data

from being stolen, damaged, or compromised by at-

tackers. The duration between the disclosure of a new

vulnerability and the moment when the security ana-

lyst becomes aware of it is crucial for taking appro-

priate countermeasures in a timely manner.

Twitter is a major source of up-to-date information. Twitter has 330 million monthly active users worldwide (Phan et al., 2020). Turkey ranks fifth among the leading countries, with nearly 9 million active users as of January 2019 (Okay et al., 2020). Twitter users can tweet in any language they choose. Although there are no statistics about the use of Turkish by Twitter users, it is very likely that most Turkish Twitter users share their tweets in their native language.

a https://orcid.org/0000-0003-1329-4303

b https://orcid.org/0000-0002-5443-6868

A review of the literature and the current state of technology reveals that most research on security event detection has targeted text in English or other widely used languages such as Portuguese (Duarte et al., 2018), often using Big Data technologies (Seth et al., 2017). To the best of our knowledge, research is lacking on real-time security event detection in Turkish-language streams. Given the significant share of Turkish on the Internet, it is necessary to develop security event detection tools that process Turkish data. Internet penetration in Turkey is 72%, with 59.36 million internet users, and active social media penetration is 63%, with 52 million people (Alan, 2020). With growing internet adoption in Turkey, a large amount of timely information is shared in Turkish. Recent event detection systems that were developed for English texts are not useful for mining Turkish texts. Therefore, in order to use Turkish texts for the detection of cybersecurity events, we should develop Turkish language-specific methods and algorithms.

Ural, Ö. and Acartürk, C.

Automatic Detection of Cyber Security Events from Turkish Twitter Stream and Newspaper Data.

DOI: 10.5220/0010201600660076

In Proceedings of the 7th International Conference on Information Systems Security and Privacy (ICISSP 2021), pages 66-76

ISBN: 978-989-758-491-6

Social media is not the only source from which such information can be extracted. A security analyst has a wide range of sources available for gathering cyber threat information, such as the specialized press, blogs, forums, news agencies, and newspapers. However, their initial source of information for detecting such security events is usually social networks, with newspapers serving as an alternative source. After the emergence of a trending event, users increasingly share posts about it on social media. For instance, a DDoS attack on a service or a website is usually recognized and reported first by social media users, who share the information on online platforms by posting tweets such as “X website is unreachable”.

An autonomous system that can use various data sources for security event detection has the potential to be beneficial for a security analyst. We designed and developed a software system capable of detecting and monitoring cybersecurity-related events over the Twitter stream in Turkish. In its current version, it can process several million documents per day and detect security events. To obtain more accurate results, we added the Hürriyet newspaper stream alongside Twitter for analyzing and detecting security events. The software solution's infrastructure supports adding new data sources, thus providing flexibility. For example, it is possible to expand the system by adding LinkedIn or Facebook streams to obtain more complete and accurate results.

We designed the system as a framework to make it usable for further research. Turkish datasets are used in various research areas such as text classification, author detection, and automatic question answering. However, finding datasets in Turkish is difficult since few are accessible online. By means of this research software framework, researchers will be able to access security event datasets in Turkish. Moreover, they will be able to select and modify their queries by changing keyword vectors, thus changing the information to be extracted from online sources. We validated the proposed approach using several detected events already shared in Turkish on online platforms. By means of automatic event detection systems, a security analyst can establish situational awareness in cyberspace and take countermeasures against new threats. For example, a security analyst who works for a Turkish institution may use the APIs of local websites, such as the Eksisozluk API or the e-Devlet API, or libraries and frameworks developed specifically for Turkish users. If these APIs, libraries, or frameworks have vulnerabilities and someone discovers them, they are probably discussed and announced in Turkish on social media such as Twitter. It is likely that Turkish newspapers publish them as breaking news too. To detect such events automatically, the software system must listen to Turkish data sources and process text in Turkish. Our research aims at meeting these requirements by proposing a software system and framework for security event detection.

1.2 Routine Tasks of an Information Security Analyst

An information security analyst's primary responsibility is to take countermeasures to protect organizational-level, mission-critical, and sensitive information, as well as to be prepared for cyber-attacks (Sohime et al., 2020). To be prepared for a cyber-attack, they use various tools and systems. One of their responsibilities is to analyze data and to recommend changes to managers. However, security analysts are not authorized to implement changes. Their main job is to keep cyber-attacks out.

In practice, a security analyst spends approximately one hour per working day catching up on the latest security news through bulletins, forums, news, social networks, and so on to identify new threats. They spend a further two to three hours repeatedly investigating potential security incidents using online resources. They spend the rest of their day manually copying and pasting information from disparate and siloed tools to correlate data. They generally face ten to twenty challenges daily, such as monitoring security access, analyzing security breaches to identify the root cause, and verifying the security of third-party vendors and collaborating with them to meet security requirements (Sohime et al., 2020). The longer an investigation takes, the greater the advantage for cyber attackers, and it is challenging for a security analyst to keep up with threats. Manual investigation of security events is not sustainable without automation. To make it sustainable, automated Natural Language Processing tools and text mining methods need to be used.

1.3 Relevant Work

The identification of victims affected by cyber-attacks is a major subdomain of research in cybersecurity. One line of research focuses on detecting cybersecurity events using English text on Twitter. One example is “Automatic Detection of Cyber Security Related Accounts on Online Social Networks: Twitter as an Example” (Aslan et al., 2018). In that paper, the authors use machine learning techniques to determine whether social media accounts are related to cybersecurity. To prepare their dataset, they developed a crawler with the Twitter API using the Python programming language. Another notable paper in this domain is “Processing tweets for cybersecurity threat awareness” (Alves et al., 2021). Its quantitative evaluation considers all tweets from 80 accounts over 8 months (more than 195,000 tweets). It shows that the approach finds most of the security-related tweets relevant to an example IT infrastructure in a timely manner (true positive rate greater than 90%) and incorrectly selects only a small number of tweets as relevant (false positive rate less than 10%).

Another subdomain of research is event forecasting. Researchers try to predict DDoS attacks that have not yet taken place by processing Twitter data. They tried to obtain this information using six popular supervised classification models. To illustrate, one of the models they used is the “negative term count” (Neg-Term-count), the baseline sentiment-based model. It counts the negative words in tweets each day and forecasts an attack if the number of negative words is greater than a threshold, which is the average number of negative words in the training data.
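The negative term count baseline can be illustrated with a minimal Python sketch. The lexicon, the whitespace tokenization, and the function names below are assumptions made for illustration only; they are not taken from the cited study.

```python
from statistics import mean

# Hypothetical negative-word lexicon; the lexicon of the cited study is not given here.
NEGATIVE_WORDS = {"down", "broken", "attack", "outage", "fail"}


def count_negative_terms(tweets):
    """Count lexicon words across all tweets of a single day."""
    return sum(
        1
        for tweet in tweets
        for token in tweet.lower().split()
        if token.strip(".,!?") in NEGATIVE_WORDS
    )


def fit_threshold(training_days):
    """Threshold = average daily negative-word count on the training data."""
    return mean(count_negative_terms(day) for day in training_days)


def forecast_attack(day_tweets, threshold):
    """Forecast an attack when the day's count is greater than the threshold."""
    return count_negative_terms(day_tweets) > threshold


# Toy data: each inner list holds the tweets of one day.
training_days = [
    ["site is fine today", "nice weather"],
    ["service is down", "login broken again"],
]
threshold = fit_threshold(training_days)
quiet_day = ["all good here"]
noisy_day = ["site down", "attack ongoing", "outage everywhere", "login fail"]
```

With this toy data, the noisy day exceeds the learned threshold while the quiet day does not.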

Another subdomain of research is drive-by download attack prediction. Cyber attackers may use URL shortening to make a malicious website appear harmless and share it on Twitter as a shortened URL. Twitter users may fall for this deception and click on such shortened links, which can harm them. “Prediction of Drive-by Download Attacks on Twitter” (Javed et al., 2019) is an example of research in this field. The authors explored how to prevent such malicious websites from being clicked as if they were safe because of this kind of shortening. They tried various methods, such as using a honeypot to detect malware infection from an increase in CPU or RAM usage.

Another subdomain of research is cyberattack detection using social media. A sample study in this field is “SONAR: Automatic Detection of Cyber Security Events Over the Twitter Stream” (Petersen, ). The authors developed a self-learning framework called Sonar, which can automatically capture events related to cybersecurity by processing Twitter data. Developers give the system some keywords to follow, and the system can find other cybersecurity-related keywords to follow with the help of the previously given keywords. The authors also benefited from big data technologies. We drew on this research for the architectural design of our system. Another example is “Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media” (Khandpur et al., 2017), another study on detecting cybersecurity attacks by processing Twitter data.

2 SYSTEM ARCHITECTURE, DESIGN AND METHODOLOGY

In this section, we explain the software system's architecture, design, and methodology. First, we explain the general approach. Then we present data collection using the Standard Twitter API, the Twitter Premium API, the Hürriyet API, and Selenium. After that, we describe how we preprocess and process the data. Finally, we present how we detect a cybersecurity event using anomaly detection, which is a machine learning technique.

2.1 The Approach

Figure 1 presents a general overview of the architecture and design. First, we need real-time streaming data to process. To establish a Twitter stream connection, the software uses values statically defined in the configuration file. To gather the data in real time, we use the Standard Twitter API. We create a cybersecurity-related Turkish keyword vector using Term Frequency - Inverse Document Frequency analysis of past security incidents. We use this keyword vector to gather relevant data from the Twitter stream and the Hürriyet newspaper stream. We use the language filter feature of the Twitter API to fetch only Turkish tweets. Hürriyet is a Turkish newspaper, so we did not need a language filter for it. To establish the Hürriyet newspaper stream connection, the software also uses the configuration file. The architecture of the software system is implemented with the expectation that new data sources may be added. Before the fetched data is written to the database, the data from both Hürriyet and Twitter is converted into a form suitable for storage.

After the normalization step, we move on to the Named Entity Recognition step of our pipeline. At this stage, we use a predefined string vector, which currently includes institution names, government organization names, and country names. These strings represent the potential victims of security events. After that step, the software counts the number of mentions of the potential victims by searching for the elements of the predefined string vector in the normalized texts stored in the database. We add daily threshold values, which are calculated dynamically. If the number of mentions exceeds the threshold value, we share the detected cybersecurity event in the user interface. The software repeatedly checks the database and analyzes new texts to detect new cybersecurity events. If the number of mentions of a possible victim in the cybersecurity-related database texts exceeds the daily threshold, the software system adds that victim to the table too. We show the detected events in a dynamically created HTML file. Security analysts can view the detected security events in a browser.

Figure 1: The General Overview of The System.
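The mention-counting step described in this subsection can be sketched in Python as follows. This is a minimal illustration rather than the system's actual code: the victim list, the threshold values, and the function names are hypothetical, and substring matching on lowercased text is an assumed simplification.

```python
from collections import Counter

# Hypothetical victim vector; in the described system it is a predefined string vector.
VICTIMS = ["nic.tr", "e-devlet"]


def count_mentions(texts, victims):
    """Count how many normalized texts mention each potential victim."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for victim in victims:
            if victim in lowered:
                counts[victim] += 1
    return counts


def detect_events(texts, victims, thresholds):
    """Return the victims whose daily mention count exceeds their threshold."""
    counts = count_mentions(texts, victims)
    return {victim: n for victim, n in counts.items() if n > thresholds[victim]}


# Toy data for one day; the threshold values are illustrative only.
today = ["nic.tr erişilemiyor"] * 5 + ["e-devlet çalışıyor"]
thresholds = {"nic.tr": 3, "e-devlet": 3}
events = detect_events(today, VICTIMS, thresholds)
```

In this toy run only "nic.tr" is reported, because it is mentioned five times against a threshold of three.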

2.2 Selection of Cybersecurity Related Keywords Vector and Data Collection

To create an optimal cybersecurity-related keyword vector, we used the term frequency-inverse document frequency (TF-IDF) technique, keyword-based analysis, statistical techniques, and A/B testing. Even when tweets or news items are in Turkish, English cybersecurity terms are widely used in Turkish texts. Therefore, we create the vector using both English and Turkish keywords. TF-IDF is a numerical statistic that is intended to reflect how important a keyword or phrase is within a document or web page in a corpus or collection (Rajaraman et al., 2014). To identify our cybersecurity-related keyword vector, we first choose an important past cybersecurity event related to Turkey. We select the “nic.tr DDoS attack” as the cybersecurity event. Then we create three different training databases related to this attack using the Twitter Premium API. We select “nic.tr” as the keyword and filter only Turkish tweets. Using the TF-IDF technique, we identify the most important words in these databases. Then we select the cybersecurity-related ones from the TF-IDF results and add them to the cybersecurity-related keyword vector. The first query includes tweets containing the nic.tr keyword between 10.12.2014 and 13.12.2015, the one-year period before the nic.tr attack. Then we created a second training database, for which we select only the tweets from the day of the nic.tr attack, 14 December 2015. We analyze the tweets in the database with TF-IDF analysis and run an A/B test to select words from them for our cybersecurity-related keyword vector. Lastly, we created a third training database, which includes the tweets between 14 December 2015 and 28 December 2015. Within this two-week period, nearly 1000 tweets related to “nic.tr” were posted.
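As an illustration of the TF-IDF ranking step, the following Python sketch scores the terms of a small set of tweets. It assumes whitespace tokenization, a plain tf = count / length and idf = log(N / df) weighting, and made-up sample tweets; the paper does not specify the exact variant or tooling that was used.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf = term count / document length; idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(documents, k=3):
    """Rank terms by their highest TF-IDF score in any document."""
    best = {}
    for doc_scores in tf_idf(documents):
        for term, score in doc_scores.items():
            best[term] = max(best.get(term, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:k]


# Made-up sample tweets that all contain the query keyword "nic.tr".
tweets = [
    "nic.tr ddos saldırısı altında",
    "nic.tr erişilemiyor",
    "nic.tr alan adı yenileme",
]
ranked = top_terms(tweets, k=3)
```

Because the query keyword occurs in every tweet, its idf is zero and it drops out of the ranking, which leaves candidate terms for manual selection into the keyword vector.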

We analyzed the results and created a cybersecurity-related keyword list for each database. Then we used these keyword lists for A/B testing. The A/B test is a randomized experiment with two variants, A and B. It applies statistical hypothesis testing, specifically the “two-sample hypothesis test” used in the field of statistics. The A/B test is a method of comparing two versions of the same variable and determining which of the two variants is more effective (Fabritius, 2017). We compare the results of the A/B test and update the elements of the keyword vector according to their success rate. For the A/B test, we used the number of false-positive cybersecurity event detections and the total number of cybersecurity event detections. If a keyword significantly increases the number of false-positive detections, we do not add it to our cybersecurity-related keyword vector. On the other hand, if a keyword has little effect on false-positive detections but increases the number of detections, we add it to our cybersecurity-related keyword vector.
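The keyword acceptance rule described above can be sketched as a simple comparison between two runs. This Python sketch is a simplified decision rule, not a statistical hypothesis test; the `RunResult` type, the `max_fp_increase` tolerance, and the numbers are assumptions for illustration, since the paper does not state what counts as a significant increase.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    detections: int        # total events reported by the run
    false_positives: int   # reported events judged not to be real


def accept_keyword(baseline, variant, max_fp_increase=1):
    """Compare variant A (vector without the keyword) and B (with it).

    `max_fp_increase` is an assumed tolerance; the paper does not give one.
    """
    fp_increase = variant.false_positives - baseline.false_positives
    gain = variant.detections - baseline.detections
    return fp_increase <= max_fp_increase and gain > 0


# Illustrative numbers only.
baseline = RunResult(detections=4, false_positives=1)
with_specific_term = RunResult(detections=7, false_positives=1)
with_generic_term = RunResult(detections=12, false_positives=8)
```

Here the specific term would be added to the vector, while the generic term would be rejected because it mainly adds false positives.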

To collect data, we use Twitter and the Hürriyet newspaper. Both the Hürriyet API and the Twitter API need seed keywords for queries. To collect Turkish stream data, we need Turkish cybersecurity terms. However, we could not find a dictionary of Turkish cybersecurity terms. Therefore, we researched Turkish cybersecurity terms and gathered them into a list to use in the queries. Then we implemented Python code that parses the Twitter website using the Selenium automation tool and the Chrome browser web driver, and created our desired training datasets. To our knowledge, using Selenium to fetch Twitter data has not been reported before.

Twitter is an online social networking service, which was created in 2006 by Jack Dorsey, Evan Williams, and Biz Stone. People use Twitter for various purposes (Huberman et al., 2009). First, it serves as a social messaging service: users can interact with other users, communicate with their friends and family, and share details of their lives. Second, users can use it as a microblogging service for sharing details of a person's life. Third, users can use Twitter as a marketing tool for public relations. Many celebrities and politicians use Twitter to interact with their audience. Lastly, Twitter is an information platform on which users can get news quickly and efficiently via the accounts of broadcasting agents or journalists. Moreover, there are Twitter bots created by developers for a specific function, such as a Bitcoin ticker bot that tweets the price of Bitcoin in Turkish Lira every hour. According to the first quantitative study on Twitter, “What is Twitter, a Social Network or a News Media?” (Kwak et al., 2010), Twitter is more an information-sharing network than a social network. The authors reached this conclusion while working on the Twitter follower graph, based on the low rate of reciprocated ties. People tend to use Twitter as a news feed by following multiple online news media, but other Twitter users will only follow “real” users.

Twitter users can post a short message called a tweet, which is limited to 280 characters, or retweet another user's tweet. Photos, videos, or URLs can be added to tweets. Users can follow other accounts and create their networks. They can mention each other or reply to each other within their tweets. To identify what a tweet is about, users use a word preceded by a hash sign (#). Twitter uses these hashtags to define trending topics, both locally and globally. Users use the trending topic lists to identify popular subjects on Twitter at that time. By default, all Twitter accounts are public. Users can interact with each other by replying to other users' tweets, sending private direct messages, and so on. The Twitter API is a set of URLs. The URLs can take parameters and let users access Twitter features such as finding tweets that contain a set of specific words. Twitter provides several APIs to get tweets. Twitter's Standard API allows users to get tweets that match specific parameters. Moreover, the resulting stream can be filtered according to tweet language, geolocation, and so on.
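To illustrate how such a URL carries keyword and language parameters, the following Python sketch builds a query URL from a keyword list. The endpoint `https://api.example.com/search` and the parameter names `q` and `lang` are placeholders chosen for illustration; they are not the actual Twitter API endpoint or parameter names used by the system.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint and parameter names, used only for illustration.
BASE_URL = "https://api.example.com/search"


def build_query_url(keywords, language="tr"):
    """Combine the keywords with OR and attach a language filter."""
    params = {"q": " OR ".join(keywords), "lang": language}
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url(["hacklendi", "erişilemiyor"])
# Decode the query string again to check what the server would receive.
decoded = parse_qs(urlsplit(url).query)
```

`urlencode` percent-encodes the non-ASCII Turkish characters, so the keyword list can be sent safely in an HTTP request.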

Our second data source is the Hürriyet newspaper. Hürriyet is one of the major Turkish newspapers, founded in 1948. As of January 2018, it had the highest circulation of any newspaper in Turkey at around 319,000. We can make 12,000 requests per day to the Hürriyet Newspaper API. Therefore, the keyword list is essential to get relevant data in the result streams. The Hürriyet API is an interface that enables the use of Hürriyet data programmatically in web, mobile, or desktop applications. Developers can access Hürriyet newspaper data via standard HTTP requests. The results are returned in JSON format.

2.3 Data Processing

Before writing the streaming data to our database, we need to format the collected texts. First, we select the needed keys from the JSON streams of the Twitter API and the Hürriyet API. For example, Hürriyet API requests return related news in a JSON object that has a “Title of the News” key. This key can be useful for representing the detected event. On the other hand, the JSON also contains unrelated or unhelpful data, so we filter it out and do not write it to our database. We filter the JSON keys of the Twitter API stream in the same way and select the useful and relevant keys. In our database, we have a ‘Status’ column. When we first write a text to our database, we set its status to ‘0’, which means that the text is raw data that has not been processed yet. We send the raw data to the ITU NLP API to normalize it. After the normalization step, we replace the text with the normalized text and set the Status column of its row to “1”. After the row is processed to detect cybersecurity events, the Status column is set to “2”, which means that the data has already been processed and nothing remains to be done with that row of the table.
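The Status column lifecycle (0 for raw, 1 for normalized, 2 for processed) can be sketched with an in-memory SQLite database. The table layout, the function names, and the local `normalize` stand-in for the external normalization service are assumptions for illustration; the paper does not name the database technology or schema.

```python
import sqlite3

RAW, NORMALIZED, PROCESSED = 0, 1, 2  # Status values described in the text


def normalize(text):
    """Local stand-in for the external normalization service (assumption)."""
    return " ".join(text.lower().split())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (id INTEGER PRIMARY KEY, body TEXT, status INTEGER)"
)


def store_raw(body):
    """Write a newly fetched text with Status 0."""
    conn.execute("INSERT INTO texts (body, status) VALUES (?, ?)", (body, RAW))


def normalize_pending():
    """Normalize every raw row and set its Status to 1."""
    rows = conn.execute(
        "SELECT id, body FROM texts WHERE status = ?", (RAW,)
    ).fetchall()
    for row_id, body in rows:
        conn.execute(
            "UPDATE texts SET body = ?, status = ? WHERE id = ?",
            (normalize(body), NORMALIZED, row_id),
        )


def mark_processed():
    """After event detection, set Status 2 on the normalized rows."""
    conn.execute(
        "UPDATE texts SET status = ? WHERE status = ?", (PROCESSED, NORMALIZED)
    )


store_raw("  Nic.tr   DDoS  saldırısı ")
normalize_pending()
after_normalize = conn.execute("SELECT body, status FROM texts").fetchall()
mark_processed()
final_status = conn.execute("SELECT status FROM texts").fetchone()[0]
```

Keeping the state in a single column lets each stage select only the rows it is responsible for.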

In the present research, we used a few Natural Language Processing techniques and Istanbul Technical University's Natural Language Processing API (Eryiğit, 2014) for normalization of the texts. Natural Language Processing is one of the concepts actively used in text mining to develop automated systems. Text mining uses Natural Language Processing to deliver input to the system in the information extraction phase (Tan et al., 1999).

Turkish Natural Language Processing tools and APIs developed by the Natural Language Processing group at Istanbul Technical University are available at the “tools.nlp.itu.edu.tr” website. To use the API, we obtained an account and an access token upon permission. The platform operates as Software as a Service and provides researchers and students with state-of-the-art NLP tools in many layers: preprocessing, morphology, syntax, and entity recognition. It is a web API that developers can access with an HTTP request using the GET or POST method.

Text mining consists of a broad variety of methods and technologies (Gaikwad et al., 2014). In this research, we used keyword-based technologies and statistics technologies. Keyword-based technologies take as input a selection of keywords in the text that are filtered as a series of character strings, not as words or concepts (Wu et al., 2006). Statistics technologies leverage a training set of documents used as a model to manage and categorize text. We use two keyword vectors for keyword-based analysis. One of the keyword vectors stores the possible victims that are tracked by our software solution. The other keyword vector stores potentially useful cybersecurity-related Turkish terms such as “hacklendi” and “erişilemiyor”. We analyze the results by comparing past frequency statistics with current results, as described in the Approach section. The text required for text mining for cybersecurity event detection is gathered from online platforms.

From the previous steps of the software system, we get possible cybersecurity-related texts from different sources. We then preprocess and process them and store them in our database. To detect the events and find their possible victims, we prepared a named entity vector. This vector includes the possible victims that we want to track. Currently, this list includes institution names, government organization names, and country names. The tracked entities can be changed by updating the vector in the configuration file. Then, using the term frequency-inverse document frequency (TF-IDF) technique, keyword-based analysis, statistical techniques, and A/B testing, we analyze past cybersecurity events and create the cybersecurity-related keyword vector.

As we explained in the Approach section, we analyze real-time Turkish text data to detect cybersecurity events. To do this, we send requests to Twitter and the Hürriyet newspaper with our cybersecurity-related keyword vector, and we add a Turkish language filter to our request. The solution periodically checks the database for the number of occurrences of each entry in the possible victim vector. If the number of occurrences of a victim is anomalous with respect to its historical values, our solution flags it as a potential cybersecurity event and shows the event in the user interface portal.
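One simple way to decide whether a count is anomalous with respect to its historical values is a mean plus k standard deviations rule, sketched below in Python. This rule, the parameters `k` and `min_count`, and the sample history are assumptions made for illustration; the paper does not state the exact anomaly criterion used by the system.

```python
from statistics import mean, pstdev


def is_anomalous(history, today_count, k=3.0, min_count=3):
    """Assumed rule: flag counts above mean + k * std of the history.

    `min_count` avoids flagging very small counts when the history is flat.
    """
    if len(history) < 2:
        return today_count >= min_count
    threshold = mean(history) + k * pstdev(history)
    return today_count >= min_count and today_count > threshold


# Illustrative daily mention counts of one tracked entity.
history = [1, 0, 2, 1, 1, 0, 2]
```

With this history, a day with 28 mentions is flagged, while a day with 2 mentions is not.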

3 IMPLEMENTATION

3.1 Multi-process Architecture

We use a multi-process system architecture in the implementation of the project. There are four processes, as described below: the Twitter API Stream to Database, Hürriyet API Stream to Database, ITU NLP API Normalization, and Security Events Web Portal processes. The Twitter API Stream to Database process continually gathers the Twitter API stream, preprocesses the data, and writes it to the database. The Hürriyet API Stream to Database process continually gathers the Hürriyet API stream, preprocesses the gathered data, and writes it to the database. The ITU NLP API Normalization process continually checks the database. If it finds rows with status “0”, it sends their texts to the ITU NLP API servers for normalization. After the normalization, the process writes the texts back to the database and sets their Status column to “1”. The Security Events Web Portal process continually checks the database for rows whose Status column is set to “1”. If it finds any, it processes them and adds the results to the HTML page from which security analysts can monitor the events.

3.2 Microservice Architecture

Microservices are small, independent services that each focus on one task at a time and are able to work together. Because the project has the potential to grow, we designed it following the microservice architecture. With this design, our software became resilient: a failure in one service does not impact the other services of our project. For example, assume that the ITU NLP API service stops working for a while and does not respond to our project's requests. Thanks to the microservice architecture, the other services can continue to work even if one service suffers serious errors. The Hürriyet API service can still gather the streaming data, preprocess it, and write it to the database; the Twitter API service can do the same, and so on. Moreover, the design is scalable. For example, if our database technology becomes insufficient for our software, we can replace it with a more suitable one. Furthermore, our software has fewer dependencies, and its code is easy to modify and test. Our software can be understood by other developers since each process represents a small piece of functionality. This is vital because our software solution will be an open-source project and will be used by other developers and researchers. Lastly, this architecture gives us the freedom to choose technology: we can choose the best-suited technology for each functionality.

3.3 User Interface of the System

The user interface is a simple, dynamically generated HTML page that security analysts use as the portal page of the system. A process checks the database every minute to detect new data and uses it to show new cybersecurity events in this user interface.
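A dynamically generated page of this kind can be sketched with the Python standard library. The tuple layout (date, entity, representative text, count), the HTML structure, and the sample rows are assumptions for illustration and do not reproduce the actual portal.

```python
from html import escape
from itertools import groupby


def render_events(events):
    """Render events as HTML, grouped by date with the newest date first.

    `events` is a list of (date, entity, sample_text, count) tuples,
    where `date` is an ISO-formatted string such as "2015-12-14".
    """
    parts = ["<html><body><h1>Detected Security Events</h1>"]
    ordered = sorted(events, key=lambda e: e[0], reverse=True)
    for date, group in groupby(ordered, key=lambda e: e[0]):
        parts.append(f"<h2>{escape(date)}</h2><table>")
        for _, entity, sample, count in group:
            parts.append(
                f"<tr><td>{escape(entity)}</td>"
                f"<td>{escape(sample)}</td><td>{count}</td></tr>"
            )
        parts.append("</table>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Illustrative rows only; the texts and counts are made up.
events = [
    ("2015-12-14", "nic.tr", "nic.tr <erişilemiyor>", 28),
    ("2015-12-15", "nic.tr", "nic.tr DDoS", 12),
]
page = render_events(events)
```

Escaping the tweet and news text before inserting it into the page matters here, because the content comes from untrusted public sources.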

4 RESULTS

In this section, we discuss the cybersecurity events discovered by our software solution. We focus on what our software system achieved and what it did not. We share both successful and unsuccessful cybersecurity event detection samples. As described in the previous subsection, the user interface is a dynamically created HTML page. We group the events by date. For each cybersecurity event, we present an entity, a representative news title or tweet, and a count that shows how many times the entity is seen in the data on the same day.

4.1 Historical Cybersecurity Event Detection Test with an Independent Dataset: Nic.tr DDOS Attack

To reach the best version of our software solution, we train it with training data. To do so, we select an important cybersecurity event and test whether our solution can detect it. A massive DDoS attack hit the Turkish Internet starting on 14.12.2015 and continued for about two weeks. Turkey's official domain name servers (Nic.tr) were under a Distributed Denial of Service (DDoS) attack. We created 3 separate databases using existing keywords. When we pulled the tweets from the 1-year period before the attack, 2310 tweets were found. We then analyzed these data, and our solution successfully found the cybersecurity events that took place during that year.

Pulling the tweets for the start day of the nic.tr DDoS attack yielded 28 tweets. The results for this day were important to us because we wanted to see whether our solution could detect the event just after the attack happened. When we analyzed these data, our solution successfully detected the nic.tr attack, as can be seen in Figure 2.

Figure 2: Nic.tr Attack Start Day Detected Security Events

Samples.

The nic.tr attack lasted for about two weeks. Therefore, we analyzed that two-week period (14.12.2015 – 28.12.2015), expecting to detect the nic.tr attack. Pulling the tweets for this period yielded about 400 tweets. After running our software solution with that database, the results were satisfactory: our solution successfully detected the nic.tr attack, as can be seen in Figure 3.

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

Figure 3: Detected Security Events Samples between 14 and 28 December 2015.

As explained before, we used one of the past cybersecurity incidents. We applied Term Frequency - Inverse Document Frequency (TF-IDF) analysis to the news and tweets from just before the cybersecurity event (premise) and immediately after the event. For the immediately-after phase, we used two different time intervals for testing: the attack day, and the two-week period after the attack. We used the attack day as the sensitivity test: our solution is considered successful in terms of sensitivity if it can detect the cybersecurity event on the attack day. We used the two-week period after the attack as the certainty test: our solution is considered successful in terms of certainty if there are not too many (more than 30%) false-positive cybersecurity event detections within the two-week period after a cybersecurity event. According to these success criteria, we trained our software solution with the datasets and cybersecurity-related keyword lists, and then updated our keyword lists according to the results. Using these lists, we tested the method and its accuracy on an independent dataset, the nic.tr DDoS attack dataset, in the present section. As can be seen in Figure 2 and Figure 3, our software solution successfully detected the nic.tr DDoS attack in terms of both sensitivity and certainty, and passed our test.
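For readers unfamiliar with TF-IDF, the following Python sketch computes the textbook variant (term frequency normalized by document length, inverse document frequency as the logarithm of the number of documents over the document frequency). The exact TF-IDF variant and tokenization used in the study are not stated here, so this is an assumption, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: tf = count / doc length, idf = log(N / doc frequency)."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

Terms that occur in every document receive a score of zero, so a term that appears only in the texts published after an event stands out against the texts from before it.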

4.2 Successful Cybersecurity Event

Detection Samples

In the following subsections, we share successful cybersecurity event detection samples and briefly explain how a security analyst can use this information.

4.2.1 WhatsApp Spyware Attack

As can be seen in Figure 4, our software system detected this event on 5 May 2019. However, there are two different entities for the same event.

Figure 4: WhatsApp Spyware Attack Detection.

Assume that a security analyst wants to track security events related to countries. When the security analyst sees the “WhatsApp Spyware Attack” event on the user interface page with a country name entity, he or she should check the news or tweets to verify whether it is a true-positive or a false-positive event detection. If it is a true-positive and useful cybersecurity event detection, the security analyst takes the required actions. There are two entities: “meksika”, which is the Turkish name for Mexico, and “israil”, which is the Turkish name for Israel. When we check the related news and tweets, we see that an Israeli firm named NSO Group performed the cyber-attack; therefore, “israil” occurs six times in the detected news and tweets. A Mexican journalist was affected by the cyber-attack, which is why we capture the “meksika” entity. By following the user interface of our software solution, the security analyst can notice such an attack and can learn from the related news and tweets what the new WhatsApp cyber-attack is, how to protect against such attacks, and so on.

4.2.2 Vulnerabilities in Remote Patient Tracking

System Applications

STM is a Turkish software company that conducts research in the cybersecurity domain. They found a vulnerability in Remote Patient Tracking System Applications and shared this information on Twitter and through newspapers. Our software successfully detected the security incident “STM Warns about Remote Patient Tracking System Applications”, which happened on 26.04.2019. If our software solution had used only English texts as a data source, we could not have detected such a cybersecurity event published in Turkish. Because our software solution can analyze Turkish texts, we can detect such a cybersecurity event. This is an excellent example of what our solution can do and the other solutions in the literature cannot.

4.3 Unsuccessful Cybersecurity Event

Detection Samples

Sometimes our software solution detects false-positive events, or, even if the detection is a real cybersecurity event, it may not be useful for security analysts. The following subsections examine such scenarios.


4.3.1 Sample False Positive Cybersecurity Event

Detection

A sample false-positive detection by our software is the tweet “Ömer bey inanamıyorum, gerçekten bunları siz mi söylüyorsunuz yoksa hesabınız mı hacklendi?” (“Mr. Ömer, I cannot believe it; are you really the one saying these things, or has your account been hacked?”). Although the tweet contains the word “hacklendi” (“was hacked”), which is one of the keywords in our keyword vector, it does not describe a real cybersecurity event. Recognizing that such a tweet is not a real security event is hard for an automated system.
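The reason such a tweet is flagged can be seen in a minimal keyword-matching sketch in Python. The two-word `KEYWORDS` set and the function `matched_keywords` are hypothetical; the keyword vector of the study is larger, and its matching logic may differ.

```python
from typing import Iterable, Set

# Hypothetical two-word keyword vector; the study's own list is larger.
KEYWORDS: Set[str] = {"hacklendi", "ddos"}


def matched_keywords(text: str, keywords: Iterable[str] = KEYWORDS) -> Set[str]:
    """Return keywords that occur as whole tokens in the lower-cased text."""
    tokens = {tok.strip(".,;:!?\"'()") for tok in text.lower().split()}
    return tokens & set(keywords)
```

A purely lexical match of this kind cannot tell a rhetorical question about a hacked account from a report of a real incident, which is why the final judgment is left to the analyst.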

4.3.2 Sample Not Useful Cybersecurity Event

Detection

Sometimes, even if the detected event is a cybersecurity event, it may be a personal status update, particularly if it is published on Twitter. Security analysts should read the detected event in the user interface and decide whether it is useful for them. Even if the detected event is not a personal one, it may still not be useful for security analysts. For example, an event may have occurred months ago, but a Twitter user or a Twitter bot may share it in a tweet as if it had occurred recently. The time frame is configurable in our software system, and security analysts should configure the detection time frame according to their needs. For example, a security analyst who works for a big cybersecurity technology company and wants to see more detected security events can set a longer time frame, whereas an analyst who wants to see only the latest security events should set a shorter time frame.
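A configurable time frame can be sketched as a simple filter over publication times. The `(publication time, text)` layout and the function name `within_timeframe` are assumptions for illustration; how the time frame is actually configured in the software is not shown here.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Item = Tuple[datetime, str]  # (publication time, text); layout is assumed


def within_timeframe(items: List[Item], now: datetime,
                     timeframe: timedelta) -> List[Item]:
    """Keep only items published in the window (now - timeframe, now]."""
    start = now - timeframe
    return [item for item in items if start < item[0] <= now]
```

Note that such a filter acts on the publication time of the tweet or news item, so an old incident re-shared in a new tweet would still pass; that case still needs the analyst's judgment.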

4.4 Evaluation of the Results

When we run our software with too many cybersecurity-related seed keywords, our software system might receive more tweets than it can handle. Only about 20% of Twitter users post informative messages (Král and Rajtmajer, 2017). Moreover, the number of false-positive cybersecurity event detections may increase significantly, which decreases the certainty of our software solution. On the other hand, if we run our software with too few cybersecurity-related seed keywords, our software system might not detect some cybersecurity events as fast as we expect, which decreases the sensitivity. We expect to detect an attack on the day of the attack.

Although we can verify with other sources that the detected events are indeed occurring or have occurred, being sure that we have not missed any events is very difficult. During our tests, we realized that we could miss small events. However, as far as we know, our solution did not miss any serious attack. Sometimes our solution detects an already detected event as if it were a new cybersecurity event, because our software uses one day as the period for its frequency calculation: for each day, all calculations start from zero again.
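The effect of restarting the frequency calculation every day can be illustrated with a short Python sketch. The `threshold` parameter and the function name `daily_detections` are hypothetical; the sketch only assumes that an entity is reported once its count within a single day is high enough.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def daily_detections(mentions: Iterable[Tuple[str, str]],
                     threshold: int = 2) -> Dict[str, List[str]]:
    """Count entity mentions per day and report those reaching `threshold`."""
    per_day: Dict[str, Counter] = defaultdict(Counter)
    for day, entity in mentions:
        per_day[day][entity] += 1  # each day has its own counter
    return {
        day: sorted(e for e, n in counts.items() if n >= threshold)
        for day, counts in per_day.items()
    }
```

Because no state is carried over between days, an ongoing incident that is mentioned often on two consecutive days is reported on both days.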

We ran our software for a limited time for testing purposes. In a sample test run, the database of our software included 437 entries: 186 were tweets from Twitter and 251 were from Hürriyet Newspaper. After analyzing the entries in the database, our software solution detected 29 cybersecurity events, of which 22 were true-positive detections and 7 were false-positive detections. The success rate of our software solution is therefore approximately 76% (22 of 29 detections). These statistics show that this methodology works for detecting cybersecurity events from Turkish texts with an acceptable success rate in terms of certainty and sensitivity. Cybersecurity analysts can use our software by preparing a cybersecurity-related keyword vector and a named entity vector and by selecting a suitable time frame. Moreover, they can modify the keyword vector or the named entity vector as they wish. If we add new data sources in the future, our software can work with bigger datasets, which may lead to more accurate detection and increase the success rate of our software solution in terms of certainty and sensitivity.
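The arithmetic behind the reported counts (22 correct and 7 false-positive detections out of 29) and the 30% false-positive limit from Section 4.1 can be checked with a few lines of Python. The function name `certainty_report` and the returned keys are hypothetical.

```python
def certainty_report(true_positives: int, false_positives: int,
                     max_fp_share: float = 0.30) -> dict:
    """Share of correct detections and check against the false-positive limit."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no detections to evaluate")
    fp_share = false_positives / total
    return {
        "success_rate": true_positives / total,
        "false_positive_share": fp_share,
        "passes_certainty": fp_share <= max_fp_share,
    }
```

With the counts above, 22/29 is about 75.9% and 7/29 is about 24.1%, which is below the 30% limit.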

5 CONCLUSION AND FUTURE

WORK

5.1 Conclusion

In the last few decades, automation has been increasingly used in various fields of people's lives due to benefits such as cost reduction, productivity, availability, reliability, and performance. Cybersecurity is one of the fields in which automation is often used. However, every automation software system has unique requirements to achieve its purposes, which leads to many research areas and unique automation systems. Automatic event detection is one of these research fields. Social media is one of the fastest ways to detect cybersecurity events because people and bots share such events there. Newspapers also share such cybersecurity events, and processing newspaper data is relatively more straightforward because false-positive cybersecurity events are rarely shared on newspaper websites.

In this research, we investigated automatic detection of cybersecurity events from the Turkish Twitter stream and Turkish newspaper data. We worked on real-time data so that our research can be used by security analysts. Existing publications on real-time cybersecurity event detection systems generally use English texts to analyze and detect events. We could not find any research that uses Turkish data sources to detect cybersecurity events; using Turkish data sources for cybersecurity event detection is a new topic in the literature. We believe that this research contributes to the literature by addressing an uninvestigated field. We proposed an automated software system that works with different data sources, named entities, text mining methods, and state-of-the-art software techniques. We then analyzed the results of our software system. Although our software system detected a few false-positive cybersecurity events, it was often able to detect useful cybersecurity events. For example, our software system detected cybersecurity events such as the WhatsApp Spyware attack, the MuddyWater attack, the Remote Patient Tracking System Applications vulnerability, the Pirate Matryoshka virus, and the Zombie Cookies threat. We concluded that event detection using Turkish texts is feasible, and security analysts can use a system like ours as a helper tool.

5.2 Limitations and Future Work

Currently, our software system works on a local computer. When we move the software to a server (e.g., AWS), it can run 24/7, which will be useful for detection success. If our software can work with bigger data, it will detect more events and detect them more accurately. To increase the streaming data, we are planning to add new Turkish data sources from other websites such as Eksisozluk, LinkedIn, Facebook, and so on. This improvement will make our datasets an excellent resource for future work. After these improvements, our datasets can be useful not only for us but also for other researchers working in the cybersecurity, cognitive science, or computer science fields. We shared our software solution as an open-source project on GitHub under the Apache-2.0 license; it is available at https://github.com/ozzgural/MSThesis. We are also planning to share our future work there and to refine our software tool according to user feedback. The developed approach may be applied to other languages with the necessary modifications, and this work is also in our future plans. Moreover, we do not handle named entity recognition ambiguities yet; we are planning to handle them in the future.

REFERENCES

Alan, G. A. E. (2020). The importance of marketing public

relations for “new” consumers. New Communication

Approaches in the Digitalized World, page 157.

Alves, F., Bettini, A., Ferreira, P. M., and Bessani, A.

(2021). Processing tweets for cybersecurity threat

awareness. Information Systems, 95:101586.

Aslan, Ç. B., Sağlam, R. B., and Li, S. (2018). Automatic

detection of cyber security related accounts on online

social networks: Twitter as an example. In Proceed-

ings of the 9th International Conference on Social Me-

dia and Society, SMSociety ’18, page 236–240, New

York, NY, USA. Association for Computing Machin-

ery.

Duarte, F., Pereira, O., and Aguiar, R. (2018). Discovery of

newsworthy events in twitter. pages 244–252.

Eryiğit, G. (2014). ITU Turkish NLP web service. In

Proceedings of the Demonstrations at the 14th Con-

ference of the European Chapter of the Association

for Computational Linguistics (EACL), Gothenburg,

Sweden. Association for Computational Linguistics.

Fabritius, M. (2017). How to motivate colouring app users.

Gaikwad, S. V., Chaugule, A., and Patil, P. (2014). Text

mining methods and techniques. International Jour-

nal of Computer Applications, 85(17).

Huberman, B., Romero, D., and Wu, F. (2009). Social net-

works that matter: Twitter under the microscope. First

Monday, 14.

Javed, A., Burnap, P., and Rana, O. (2019). Prediction

of drive-by download attacks on twitter. Information

Processing & Management, 56(3):1133 – 1145.

Khandpur, R. P., Ji, T., Jan, S., Wang, G., Lu, C.-T., and

Ramakrishnan, N. (2017). Crowdsourcing cybersecu-

rity: Cyber attack detection using social media. In

Proceedings of the 2017 ACM on Conference on In-

formation and Knowledge Management, CIKM ’17,

page 1049–1057, New York, NY, USA. Association

for Computing Machinery.

Král, P. and Rajtmajer, V. (2017). Real-time data harvesting

method for Czech Twitter. pages 259–265.

Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is

twitter, a social network or a news media? In Proceed-

ings of the 19th International Conference on World

Wide Web, WWW ’10, page 591–600, New York, NY,

USA. Association for Computing Machinery.

Okay, A., Gole, P. A., and Okay, A. (2020). Turkish and

slovenian health ministries’ use of twitter: a compar-

ative analysis. Corporate Communications: An Inter-

national Journal.

Petersen, J. K. Handbook of surveillance technologies.

CRC Press, Boca Raton, Fla., 3rd edition.

Phan, H. T., Tran, V. C., Nguyen, N. T., and Hwang,

D. (2020). Improving the performance of sentiment

analysis of tweets containing fuzzy sentiment using

the feature ensemble model. IEEE Access, 8:14630–

14641.

Rajaraman, A., Leskovec, J., and Ullman, J. (2014). Mining

of Massive Datasets.


Seth, A., Nayak, S., Mothe, J., and Jadhay, S. (2017). News

dissemination on twitter and conventional news chan-

nels. pages 43–52.

Sohime, F. H., Ramli, R., Rahim, F. A., and Bakar, A. A.

(2020). Exploration study of skillsets needed in cyber

security ﬁeld. In 2020 8th International Conference

on Information Technology and Multimedia (ICIMU),

pages 68–72.

Tan, A.-H. et al. (1999). Text mining: The state of the art

and the challenges. In Proceedings of the PAKDD 1999

Workshop on Knowledge Discovery from Advanced

databases, volume 8, pages 65–70. Citeseer.

Wu, S.-T., Li, Y., and Xu, Y. (2006). Deploying approaches

for pattern reﬁnement in text mining. In Sixth Interna-

tional Conference on Data Mining (ICDM’06), pages

1157–1161. IEEE.
