RESEARCH ON HETEROGENEOUS DATA FOR RECOGNIZING THREAT

Deris Stiawan, Abdul Hanan Abdullah and Mohd Yazid Idris

Faculty of Computer Science & Information System, Universiti Teknologi Malaysia, Johor Bahru, Malaysia

Keywords: Heterogeneous data, Intrusion prevention and prediction, Data mining.

Abstract: The volume of large, multidimensional datasets has grown rapidly in recent years. Interrelated and regularly updated information from security communities and network security vendors describes vulnerabilities and bug patches that respond to new attack methods (patterns). Given a collection of such datasets, we examine a sample of the data and look for patterns that may exist among attack methods over time. This raises several challenges, including handling dynamic, sparse, incomplete, uncertain, and semistructured or unstructured data. In this paper, we address these challenges and use a data mining approach to collect the scattered information that providers and security communities update regularly.

1 INTRODUCTION

Much information is now stored in textual form. This information can be correlated and applied to solving a particular problem. It may come from the web, library data, logs, and past information stored as archives, and together these data can form a pattern of specific information.

In this case, the volume of large, multidimensional datasets has grown rapidly in recent years. In related work, (Martin, 2001) describes the benefits of CVE compatibility, which integrates vulnerability services and tools to provide more complete security coverage and alert advisory services; (Tsang et al., 2009) blacklist a user and notify the user of the blacklist status; and (Zhou et al., 2010) use URL filtering systems to provide a simple and effective way to protect web security.

However, it is also possible to collect the scattered information that providers and security communities update regularly. These data become useful when they are associated with one another. The dataset includes signature identifications, rules, policies, patterns, attack methods, URL blacklists, update patches, system logs, lists of virus variants, and regular expressions. All of these are collected and labeled in order to identify attack patterns and predict attacks that may occur. Furthermore, if the future is similar to the past, we may have an opportunity to make predictions and to prepare for or prevent attacks.

The main contribution of this paper is the enhancement of a learning phase. It is part of ongoing research that aims to increase alarm accuracy in detection and prevention systems. The remainder of the paper is structured as follows. Section 2 presents and briefly discusses background and related work. Section 3 analyzes the problem. Section 4 discusses our approach. Section 5 summarizes our conclusions and presents additional issues on which research can be continued.

2 BACKGROUND & RELATED WORK

Data Mining (DM) is an integration of multiple technologies, including database management, data warehouse (DW), statistics, Machine Learning (ML), decision support, visualization, and parallel computing. For finding decision, classification, and regression functions, a DM approach with supervised learning is adequate. DM is the process of posing queries and extracting previously unknown information from large quantities of data.

In some cases, the data sources have to be integrated into a DW. DM then helps users extract meaningful information from the numerous and heterogeneous data sources. Because the data libraries may have different semantics and syntax, extracting useful information is difficult, and sophisticated DM tools are needed for this purpose.

Stiawan D., Hanan Abdullah A. and Yazid Idris M..

RESEARCH ON HETEROGENEOUS DATA FOR RECOGNIZING THREAT.

DOI: 10.5220/0003596502220225

In Proceedings of the 6th International Conference on Software and Data Technologies (ICSOFT-2011), pages 222-225

ISBN: 978-989-8425-76-8

© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

On the other hand, (Adeva et al., 2007) introduce an intrusion detection software component based on text mining techniques, specifically text categorization. This approach is capable of learning the characteristics of both normal and malicious user behavior from the log entries generated by the web application server. Text mining refers to the discovery of non-trivial, previously unknown, and potentially useful knowledge from a collection of text.

Text Mining (TM) has become an indispensable part of information retrieval, since around 80% of the information stored in computers consists of text and digital files. (Lopes et al., 2007) propose a framework for visual text mining that supports exploration of both the general structure and the relevant topics within a textual document collection. In this effort, they examine sets of documents to understand their structure and to locate relevant information. This is reinforced by subsequent research by (Zhang et al., 2008), who describe text classification, also called text categorization, as assigning predefined categories to text documents. The documents can be news stories, technical reports, or web pages, and the categories are most often subjects or topics but may also be based on style (genre), pertinence, and so on.
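
To make the idea of assigning predefined categories to text documents concrete, the following is a minimal sketch of a bag-of-words naive Bayes categorizer. It is not the method of any of the cited works; the training texts, the category names "exploit" and "patch", and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a document and split it into alphanumeric word tokens."""
    return "".join(c if c.isalnum() else " " for c in text.lower()).split()


def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def categorize(text, model):
    """Return the category with the highest log-probability (Laplace smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training documents; texts and labels are illustrative only.
TRAINING = [
    ("buffer overflow in ftp server allows remote code execution", "exploit"),
    ("sql injection in login form exposes database", "exploit"),
    ("vendor releases patch for browser update", "patch"),
    ("security update fixes kernel bug", "patch"),
]

model = train(TRAINING)
print(categorize("remote overflow in mail server", model))
```

A real system would train on a much larger labeled corpus; the sketch only shows how predefined categories are learned from labeled examples and then assigned to an unseen document.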

3 ANALYSIS PROBLEM

While several works have been proposed, several challenges remain in solving these problems, including handling dynamic, sparse, incomplete, uncertain, and semistructured or unstructured data. We address these challenges based on the following problems identified in previous work:

1) The problems are not fully defined in advance. Grammars will have to be modified to take account of new data. This is not easy: the addition of just one new example can completely alter a grammar and render worthless all the work that has been expended in building it, as stated by (Witten et al., 1999).

2) (Singhal, 2007) and (Junqi & Zhengbing, 2008) introduce a hybrid approach that uses heterogeneous data to detect both normal usage and malicious activity effectively. What, then, makes this solution different from others?

3) How can information with different structures, data formats, labels, metadata, and variables be collected and integrated, given that these datasets are bulky and keep growing with input from the community and security services?

4) How can these data be converted and integrated into information, and subsequently into knowledge?

5) How can relationships be extracted and data sources correlated to run in a new environment when the data sources may have complex structures and many relationships?

6) Is it appropriate to integrate data by standardizing data definitions and data structures using a common conceptual schema across a collection of data sources?
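
As an illustration of the common conceptual schema raised in question 6, the sketch below maps two differently structured inputs onto one record type. The `Event` fields, the two input formats, and the sample values are all hypothetical and are not the sixteen event parameters used in this research.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common schema shared by all sources (field names are assumptions)."""
    source: str
    timestamp: datetime
    identifier: str
    description: str


def from_blacklist_line(line):
    """Parse a hypothetical 'YYYY-MM-DD,url' blacklist entry."""
    date_text, url = line.strip().split(",", 1)
    when = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return Event("url_blacklist", when, url, "blacklisted URL")


def from_advisory(record):
    """Map a hypothetical advisory dictionary onto the common schema."""
    when = datetime.fromtimestamp(record["published_epoch"], tz=timezone.utc)
    return Event("advisory", when, record["id"], record["summary"])


def integrate(blacklist_lines, advisories):
    """Merge both sources into one list ordered by time."""
    events = [from_blacklist_line(line) for line in blacklist_lines]
    events += [from_advisory(record) for record in advisories]
    return sorted(events, key=lambda event: event.timestamp)


# Illustrative inputs only; neither format comes from a real feed.
events = integrate(
    ["2011-03-02,http://malicious.example/payload"],
    [{"id": "ADV-0001", "published_epoch": 1298937600,
      "summary": "IRC botnet activity"}],
)
for event in events:
    print(event.timestamp.date(), event.source, event.identifier)
```

Once every source is expressed in the same record type, later steps such as correlation and time-based partitioning can work on a single ordered stream.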

(Singhal, 2007) presents four data sources with multiple audit streams from diverse cyber sensors: (i) raw network traffic, (ii) netflow data, (iii) system calls, and (iv) output alerts from an IDS. We assume that this method alone may not be effective against new intrusion threats. We therefore improve and expand on this idea: our approach uses sixteen event parameters from heterogeneous data input, and we store these sixteen interrelated kinds of information in a database for the knowledge process. Accordingly, obtaining a general pattern from data with diverse structures, labels, and variables, and turning it into potentially useful knowledge, is another part of this research.

In this study, DM is used to collect data and to use history, patterns, and relationships for the classification and estimation of attacks in network streams. A hybrid system receives data from many different sources and is therefore expected to have the potential to detect sophisticated attacks that involve multiple networks, using information from multiple sources. As mentioned above, learning techniques from DM can serve the following research objectives: (i) prediction of attack patterns, (ii) identification of anomalies in habitual activity, (iii) estimation of normal activity based on habitual activity, (iv) classification of attack or suspicious packets, (v) mapping of habitual activity, and (vi) early prevention of security violations.

We use a DW to collect the scattered information that providers and security communities update regularly, as illustrated in Figure 1. From our observation, these data become useful when they are associated with one another. The volume of large, multidimensional datasets has grown rapidly in recent years. The dataset includes signature identifications, rules, policies, patterns, attack methods, URL blacklists, update patches, system logs, lists of virus variants, and regular expressions. All of these are collected and labeled in order to identify attack patterns and predict attacks that may occur. These datasets are bulky and keep growing with input from the community and security services. Therefore, there is a critical need for a data analysis system that can automatically analyze the data, classify it, and predict future trends in attack patterns. This information is scattered across the internet in the form of text. Text has complex characteristics and very rich semantics that have defeated many representation attempts; however, this richness is also the strength of text.

Figure 1: Interrelated web from provider and security

community.

TM is a multidisciplinary field that includes many tasks such as text analysis, clustering, categorisation, and summarisation. (Al Fawareh et al., 2008) and (Sanchez et al., 2008) address the issue of ambiguity in natural language texts and present a technique for resolving ambiguity when extracting an entity from texts. This text data mining approach has proved to be very useful in many applications.

4 OUR APPROACH

Text mining and DM are inherently hard problems in terms of computational complexity. Previous work on text mining offers techniques that can help solve problems in security attack analysis. (Abe & Tsumoto, 2009) use a method for detecting trends of technical terms based on importance indices, with three subprocesses: (i) technical term extraction in a corpus, (ii) importance indices calculation, and (iii) trend detection.
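
The three subprocesses can be illustrated with a deliberately simplified sketch. It is not the method of (Abe & Tsumoto, 2009): here single words stand in for technical terms, relative frequency per period stands in for the importance index, and a least-squares slope stands in for trend detection. The sample texts are hypothetical.

```python
from collections import Counter


def term_frequencies(documents):
    """Relative frequency of each word within one period's documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def slope(values):
    """Least-squares slope of values against their index 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def term_trends(periods):
    """Map each term to the slope of its importance index over time.

    Requires at least two periods, otherwise the slope is undefined.
    """
    indices = [term_frequencies(docs) for docs in periods]
    terms = set().union(*indices)
    return {t: slope([idx.get(t, 0.0) for idx in indices]) for t in terms}


# Hypothetical advisories grouped into three consecutive periods.
PERIODS = [
    ["worm spreads by email", "email worm variant"],
    ["botnet uses irc", "email worm and botnet"],
    ["botnet command channel", "irc botnet grows", "botnet takedown"],
]

trends = term_trends(PERIODS)
rising = sorted(trends, key=trends.get, reverse=True)[:1]
print(rising)
```

A positive slope marks a term whose share of the text is growing across periods, and a negative slope marks one that is fading.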

We may have an opportunity to predict future threats from past experience. This scenario is called text categorization. Making a prediction requires more than a lookup of past experience: a pattern must be found in past experience that will hold in the future, leading to accurate results on new, unseen examples.

This approach builds on (Sanchez et al., 2008) and (Romero & Ventura, 2007). Text mining is concerned with obtaining new, non-trivial, and potentially useful knowledge from text repositories stored in computers, and almost all text mining approaches in the literature that have been shown to be very useful in practice are based on induction.

In the context of TM, information retrieval is one of the main problems. In the general case, a complete document contains many words, and it is unlikely to match a stored document completely. Instead of an exact match, we try to find the closest matches among the stored documents. The proposed system has the following handling steps:

1) Take an unstructured document and automatically fill in the values of a spreadsheet. For example, attack pattern information from CVE is available as data in XML format, whereas the security community (http://www.us-cert.gov/cas/techalerts/) publishes information on infiltrating a botnet via Internet Relay Chat (IRC). When the information is unstructured, such as that found in a collection of documents, a separate process is needed to extract data from the unstructured format.

2) Create pseudonymized identifiers to describe and declare a log of event parameters.

3) Partition the documents by time, not randomly. We assume this mechanism closely simulates the prediction of future events before they enter the system.

4) Standardize the documents once they are collected. Documents come in several formats, depending on when they were generated: some use ASCII text, some CSV, and some are stored as images.
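
Steps 1 to 3 above can be sketched as follows, under stated assumptions. The XML element and attribute names are invented for illustration and do not follow the real CVE schema, the keyed hash is one possible way to pseudonymize a field, and the key and cutoff date are hypothetical.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET

# Hypothetical XML feed; element and attribute names are assumptions,
# not the actual CVE schema.
FEED = """
<entries>
  <entry id="ITEM-3" date="2011-03-10" reporter="alice">IRC botnet infiltration</entry>
  <entry id="ITEM-1" date="2011-01-05" reporter="bob">FTP buffer overflow</entry>
  <entry id="ITEM-2" date="2011-02-20" reporter="alice">SQL injection in web form</entry>
</entries>
"""


def extract_rows(xml_text):
    """Step 1: turn each XML entry into a flat, spreadsheet-like row."""
    root = ET.fromstring(xml_text)
    return [
        {"id": e.get("id"), "date": e.get("date"),
         "reporter": e.get("reporter"), "text": (e.text or "").strip()}
        for e in root.iter("entry")
    ]


def pseudonymize(rows, secret, field="reporter"):
    """Step 2: replace an identifying field with a keyed hash (HMAC-SHA256)."""
    out = []
    for row in rows:
        digest = hmac.new(secret, row[field].encode("utf-8"),
                          hashlib.sha256).hexdigest()
        out.append({**row, field: digest[:12]})
    return out


def split_by_time(rows, cutoff_date):
    """Step 3: rows before the cutoff train the model, later rows test it."""
    ordered = sorted(rows, key=lambda r: r["date"])  # ISO dates sort as text
    train = [r for r in ordered if r["date"] < cutoff_date]
    test = [r for r in ordered if r["date"] >= cutoff_date]
    return train, test


rows = pseudonymize(extract_rows(FEED), secret=b"demo-key")
train_rows, test_rows = split_by_time(rows, "2011-03-01")
print([r["id"] for r in train_rows], [r["id"] for r in test_rows])
```

Splitting at a date rather than at random keeps every test record later than every training record, which mirrors the situation of predicting events that have not yet been seen. The same reporter always maps to the same pseudonym, so records remain linkable without exposing the original value.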

We identified the problem of collecting information with different structures, labels, and variables, as shown in Figure 2. Here, the data are heterogeneous: bulky and growing sets of information from providers, the community, and security services. Therefore, there is a critical need for a data analysis system that can automatically analyze the data, classify it, and predict future trends in attack patterns.

Figure 2: Correspond with parameters.
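
The closest-match retrieval described earlier in this section, where a document is compared with stored documents without requiring an exact match, can be sketched with bag-of-words vectors and cosine similarity. This is one common similarity measure and is our assumption, not a technique prescribed by the approach; the stored texts and function names are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_matches(query, stored, k=2):
    """Return the k (score, document) pairs most similar to the query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in stored]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical stored documents.
STORED = [
    "botnet controlled via irc channel",
    "buffer overflow in ftp daemon",
    "patch released for browser flaw",
]

for score, doc in closest_matches("irc botnet detected", STORED, k=1):
    print(round(score, 3), doc)
```

The query shares only two words with the first stored document, yet that document is still ranked first, which is the behaviour an exact-match lookup cannot provide.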

5 CONCLUSIONS

A Data Warehouse (DW) is an important supporting technology for Data Mining (DM); a DW is essentially an integration of various data sources for decision support and analysis. There are several advantages to emphasizing TM as a solution: (1) TM engines currently include functionalities for text categorisation, language identification, text and document summarisation, text clustering, and similarity analysis; (2) TM can detect trends and calculate importance indices using information extraction methods; (3) TM allows requested functionality to be identified from differently structured text within a single paragraph, depending on the form of the text; and (4) TM can help find the best decision rules.

This approach still needs further exploration in future research, mainly on query correlation between the parameters; using a data mining approach for this is one of our primary focuses. Future research can also include more factors in order to implement our approach in a real environment and to benchmark it against other IPS software solutions, testing its effectiveness in terms of accuracy, attack containment, vulnerability measurement, and risk with respect to True Positive and False Positive values.

ACKNOWLEDGEMENTS

This research is supported by the Ministry of Higher Education Malaysia in collaboration with the Research Management Center (RMC), Universiti Teknologi Malaysia.

REFERENCES

Abe, H. & Tsumoto, S., 2009. Detection of Trends of

Technical Phrases in Text Mining. IEEE International

Conference on Granular Computing, pp. 7-12.

Adeva, J. G., Manuel, J. & Atxa, P., 2007. Intrusion

detection in web applications using text mining.

Engineering Applications of Artificial Intelligence, 20,

pp. 555-566.

Al Fawareh, H. M. et al., 2008. Ambiguity in Text

Mining. Proceedings of the International Conference

on Computer and Communication Engineering 2008,

pp. 1172-1176.

Junqi, W. & Zhengbing, H., 2008. Study of Intrusion Detection Systems (IDSs) in Network Security. IEEE Wireless Communications, Networking and Mobile Computing, WICOM 08, pp. 1-4.

Lopes, A. A. et al., 2007. Visual text mining using

association rules. Computers & Graphics, 31, pp. 316-

326.

Martin, R. A., 2001. Managing Vulnerabilities in

Networked Systems. Computer, 34(11), pp. 32-38.

Romero, C. & Ventura, S., 2007. Educational data mining:

A survey from 1995 to 2005. Expert Systems with

Applications, 33, pp. 135-146.

Sanchez, D. et al., 2008. Text Knowledge Mining: An

Alternative to Text Data Mining. Knowledge Creation

Diffusion Utilization.

Singhal, A., 2007. Data Warehousing and Data Mining Techniques for Cyber Security 31st ed., Advances in Information Security, Springer.

Tsang, P. P. et al., 2009. Nymble: Blocking Misbehaving Users in Anonymizing Networks. IEEE Transactions on Dependable and Secure Computing, pp. 1-15.

Witten, I. H. et al., 1999. Text mining: a new frontier for

lossless compression. Proceedings DCC 1999 Data

Compression Conference, pp. 198-207.

Zhang, W., Yoshida, T. & Tang, X., 2008. Text classification based on multi-word with support vector machine. Knowledge-Based Systems, 21(8), pp. 879-886.

Zhou, Z., Song, T. & Jia, Y., 2010. A High-Performance

URL Lookup Engine for URL Filtering Systems.

IEEE ICC 2010, pp. 1-5.
