Integrated Analytics for Application Management using Stream Clustering and Semantics

M. Omair Shafiq

School of Information Technology, Carleton University, Ottawa, ON, Canada

Keywords: Semantics, Streaming Clustering, Integrated Analytics, Application Execution and Management.

Abstract: Large-scale software applications produce an enormous amount of execution data in the form of logs, which

makes it challenging to manage the execution of such applications. There have been several semantically

enhanced analytical solutions proposed for enhanced monitoring and management of software applications.

In this paper, the author proposes a customized semantic model for representing application execution, and a

scalable stream clustering based processing solution. The stream clustering based approach acts as the key to

combine all the other analytical solutions using the proposed customized semantic model for logs. The

proposed approach works in an integrated manner that clusters log data that is produced, as a result of

events occurring during execution, at a large-scale and in a continuous streaming manner for managing

execution of software applications. The proposed solution utilizes semantics for better expressiveness of log

events, other related data and analytical approaches, through a stream clustering based integrated approach, to

process logs, which helps in enhancing the process of monitoring and management of software applications.

This paper presents the customized semantic logging model for scalable stream clustering, the algorithm design,

and a discussion of the scalable stream clustering based solution and its integration with other analytical

solutions. The paper also presents experimentation and evaluation, and demonstrates the applicability of the

proposed solution.

1 INTRODUCTION

Building analytical solutions is challenging but

making different analytical solutions work together

is even more challenging. Several analytical

solutions are proposed that focus on processing and

analyzing data in a particular manner. Different

analytical solutions may have different strengths and

speciality in analyzing data and could be beneficial

in different aspects. Some analytical solutions are

better in discovering different hidden correlations

among different features in data. Other analytical

solutions are better in categorizing data based on

different features. With large and complex systems

to be analyzed, multiple analytical solutions are

often built to analyze data in such systems from

different aspects. This brings another challenge in

making all the analytical solutions work together in

a meaningful and integrated manner.

For example, in an earlier work of the author, a

hybrid solution of semantically formalized logging

with advanced analytical solutions for enhanced

monitoring and management of software

applications (Shafiq, 2014b) was proposed. The

proposed solution was built using semantic models

to be able to formally describe components as well

as event descriptions in the execution logs of software

applications. Analytical solutions were then built to

effectively process such semantically formalized

logs. In this way, information available with higher

level of explicitness and expressiveness was better

utilized. Data described formally and with higher

level of expressivity makes it easier for the

analytical solutions to process and analyze such data,

which in turn enables enhanced and effective

monitoring and management of the execution of

software applications.

There are several possible analytical solutions

that can be integrated together in a meaningful way

to perform deep and extensive analysis in a

collective manner. However, in order to perform

integration of different analytical solutions, inputs

and outputs of different analytical solutions have to

be matched syntactically and semantically. In this

paper, the author shows how previously proposed

semantically enhanced analytical techniques can be

integrated together in a meaningful and effective

manner.

Shafiq, M.

Integrated Analytics for Application Management using Stream Clustering and Semantics.

DOI: 10.5220/0006334802800287

In Proceedings of the 19th International Conference on Enterprise Information Systems (ICEIS 2017) - Volume 1, pages 280-287

ISBN: 978-989-758-247-9

In (Shafiq, 2015) an Association Rule Mining

based approach was proposed. It is based on

Semantic extension of FP-Growth algorithm for

effective ranking and adaptation of Web Services.

The approach was hybrid, i.e., it partially used

semantic annotations of Web Services combined

with a semantically adapted FP-Growth for

Association Rule Mining, which allows the pre-processing

of requests for searching Web Services. It helps in

improving Web Service selection experience from

performance as well as precision perspectives. This

approach takes a set of log events as an input, and

outputs a set of association rules.

In (Shafiq, 2014b), a hybrid approach for

enhanced and automated monitoring and

management of applications was built by using

Semantics with Bayesian Classification. Semantics

were used to formalize and structure logs from

application execution which are then utilized by

Bayesian Classification to classify different types of

possible issues, with classification extended from

(Friedman, 1997). It helped in reducing the size of

problem space for system and application

administrators to focus on the problematic part of

application rather than the whole application, at the

time errors or faults occur. This approach takes a

given set of log events as input and uses its Bayesian

classification based learning mechanism to deduce

the state of the system as output.

In (Shafiq, 2014a), a social network based

solution with Semantic Logs to handle missing

values and incomplete data during execution of

applications was proposed. The proposed solution is based on

semantically formalized logging (Shafiq, 2014b) for

recording execution of applications and later-on

using it to deduce possibly new or hidden

information by analysing such logs. Key elements in

logs were identified and correlations were modelled

into a social network analysis hexagon. It was

further shown how such correlations between

different key elements of semantic logs can be used

to deduce new and non-obvious correlations

between other elements of semantic logs and then

utilize this information in monitoring and

management of applications. This approach takes a

set of log events and uses the proposed social

network analysis based solution to deduce any

hidden or missing correlations.

The proposed solution in this paper aims to show

how semantic logs can play an important role in

integrating all three techniques together in a

meaningful manner.

The integrated analytics solution is also required to

handle incoming events from logs as a stream. Such

incoming events can be large in number, arrive at

high velocity and vary widely in type. This

puts the events from logs at the scale of big

data. Therefore, our proposed solution also includes

a stream clustering based overall integration

approach for different analytical approaches. There

could be several other ways to perform integration

of all of the components together. However, in order

to keep the proposed integrated analytics solution

open and generic, stream clustering has been chosen

as the best candidate for the following reasons. First, it

allows processing of incoming event logs in the

form of a stream. Second, it performs categorization

of incoming log events into different categories (i.e.,

clusters) which can then be used by other analytical

approaches to perform further analysis. Third, it is an

unsupervised learning approach and does not require

prior knowledge of data for clustering, which

makes it a good candidate for acting as a broker-

style interface to process incoming data and

categorize it into different clusters and make it

available for further processing by other analytical

approaches.
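As a rough illustration of the broker-style role described above, the following Python sketch groups already-clustered log events by cluster label and hands each group to a set of downstream analyzers. The names `LogEvent`, `Analyzer` and `dispatch`, and the dictionary-based event representation, are assumptions made for this example only and are not part of the proposed solution.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

LogEvent = Dict[str, str]  # hypothetical representation of one log event
Analyzer = Callable[[int, List[LogEvent]], object]  # hypothetical analyzer interface


def dispatch(labelled_events: Iterable[Tuple[int, LogEvent]],
             analyzers: Dict[str, Analyzer]) -> Dict[str, Dict[int, object]]:
    """Group events by cluster label, then pass each cluster to every analyzer."""
    clusters: Dict[int, List[LogEvent]] = defaultdict(list)
    for label, event in labelled_events:
        clusters[label].append(event)
    return {
        name: {label: analyze(label, events) for label, events in clusters.items()}
        for name, analyze in analyzers.items()
    }


# Usage: a trivial analyzer that only counts the events in each cluster.
events = [
    (0, {"component": "db", "status": "error"}),
    (1, {"component": "ui", "status": "ok"}),
    (0, {"component": "db", "status": "timeout"}),
]
results = dispatch(events, {"count": lambda label, evs: len(evs)})
```

In this sketch the analyzers never see the raw stream, only per-cluster batches, which mirrors the idea of clustering acting as the interface between incoming events and the other analytical approaches.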

The rest of the paper is organized as follows.

Section 2 presents related work in the area of stream

clustering and monitoring and management of

software applications. Section 3 presents the proposed

solution of stream clustering on semantic logs for

integrated analytics. Section 4 presents experiments

and discusses the evaluation of results, and

compares them with those of existing solutions. Section 5

presents conclusions followed by references.

2 RELATED WORK

A number of related works have been studied and

analyzed that are carried out in the areas of

clustering of logs for different types of software

systems or software code management repositories

or monitoring and management of software

execution. These research works range from

monitoring and analysis of stand-alone applications

to large-scale applications with multiple

components, middleware-based solutions and

service based systems. A brief discussion and

analysis of some of the relevant

approaches follows.

In (Vaarandi, 2003), clustering of log events is

proposed based on different features of events in

logs. Different clustering algorithms (Hand, 2001)

and (Berkhin, 2002) have been used to cluster log

events into different categories. Authors categorize

different lines in log files as different objects and

then use clustering algorithms to cluster different

lines into different clusters. After the clusters of

event types have been identified, different analysis

techniques are further used for detecting temporal

associations between event types. A clustering tool

called SLCT (Simple Logfile Clustering Tool) has

been built based on these analysis techniques.

However, a limitation of this approach is that the authors

do not make any attempt to structure or formalize

data in logs. The solution built by the authors mostly

relies on unstructured and barely expressive

data.

In (Makanju, 2008), authors use logs from a

network management software and perform

clustering in order to have a better and meaningful

view for system and network administrators.

The authors believe that clustering allows system and

network administrators to view faulty parts of log

data easily rather than being overwhelmed with a

large amount of log data and then having to

manually find out faults. Large amounts of log data

with a lot of different and irrelevant information

may make process of monitoring difficult and may

also cause unnecessary delays as well as

inefficiencies. This work is also based on the Simple

Logfile Clustering Tool (SLCT) (Vaarandi, 2003),

and a visualization tool has been further

developed that can be used to view log files based

on the clusters produced by SLCT. The authors

claim that the results of their solution further help in easing

the summarization of the large amounts of data

contained in the log files from network devices. The

approach further helps in expediting analyses of

events to detect any possible errors, faults or

exceptions in networks. The drawbacks of this approach

are the same as those of the previous approach, i.e., it is also

based on unstructured and barely

expressive data. This limits the approach in the

detection of different possible events (i.e., faults).

In (Beeferman, 2000), clustering is applied to a log

of queries for a search engine. Clustering is used to

mine a collection of different and multiple user

transactions over the search engine to discover

clusters of similar queries as well as similar URLs.

The authors claim that identifying different queries

in the logs and then clustering them

enhances the process of web

search. Clustering of different queries into different

clusters in a meaningful manner helps in computing

results faster for new queries that are similar to the

queries that have already been recorded and

categorized in clustering. This approach helps in

enhancing the process of search, but it is

limited to unstructured and raw log data (which is

also sometimes referred to as click-through data).

That limits the approach for detection and

correlation of different events in terms of efficiency,

accuracy and effectiveness.

In addition to the above-mentioned solutions,

there are several other approaches that attempt to

model data using semantics for the purposes of

automating the process of Web Service discovery,

composition and execution. Ontology Web

Language for Services (OWL-S) (Paolucci, 2003),

extended from DAML (Fensel, 2002), is considered

a pioneering approach for semantically modelling web

service description. It is based on OWL ontologies

to describe different aspects of a web service to be

known as Semantic Web Service (SWS) (SWSF,

2005). Web Service Modeling Framework (WSMF)

(Fensel, 2002) is another similar and well-known

approach proposed as a comprehensive framework

to model different aspects of service consumers and

service providers, known as Semantic Web Services

(Roman, 2006). This approach is based on the

principles of maximizing de-coupling between

service consumers and service providers by

providing mediation (Mocan, 2006), (Cimpian,

2005). The WSMF is realized by modelling

ontology WSMO (Roman, 2006), description

language WSML (de Bruijn, 2005), and execution

environment WSMX (Recuerda, 2005), (Moran,

2004). Semantic Web Services Framework (SWSF)

is another approach, with the

Semantic Web Service Ontology (SWSO) as its conceptual model and the

Semantic Web Service Language (SWSL) as its language.

SWSO is based on three ontologies, i.e. service

profile, model and grounding. It enables formal

service descriptions and reasoning (Sirin, 2007) on

Web Services. WSDL-S (Akkiraju, 2005) proposes a

mechanism to enhance existing Web Services

Description languages with semantics, in particular

focusing on the services’ functional descriptions. All

these approaches attempted to formally describe

Web Services descriptions or other relevant aspects,

but none of these approaches attempt to formally

represent or describe execution logs during

execution of Web Services.

There are also several tools that attempt to

process logs regardless of structures of such logs.

Some of the tools are Adiscon LogAnalyzer

(Adiscon, 2011), WebLog Expert (WebLog, 2016),

GitHub Log-analyzer (Github, 2014), Retrospective

Log Viewer Software (Retrospective, 2016) and

XpoLog Log Analysis Platform (XpoLog, 2016).

These tools were found to be applicable for currently

available logging solutions. However, these tools

were not found to be able to employ one or more

analytical solutions to perform analysis in a collective

as well as meaningful manner.

To summarize the related work, most of the

solutions that have been reviewed so far

either attempt to cluster logs that are not formalized

and structured or, like Semantic Web

Service approaches, focus only on formalizing descriptions of

web services and user requests. Such approaches do

not address issues related to the processing of logs,

especially having more than one analytical solution

analysing data in a collective and meaningful

manner. This paper proposes to use semantic logs

and stream clustering to allow different analytical

techniques to analyse events in logs in an integrated as

well as meaningful manner.

3 PROPOSED SOLUTION

This section presents the proposed solution. The

proposed solution is two-fold. First, it employs stream

clustering for processing of log events. Stream

clustering was chosen because events are executed

in applications in a stream like manner where logs

are produced as event execution progresses in

applications. However, employing a stream clustering

based solution was not straightforward. In the case of

large-scale applications, the logs being produced are

also large in scale. That means incoming log events,

especially from large-scale applications, can be large

in number (volume), generated at high speed

(velocity) and varied in type

(variety). This fulfils the

definition of big data. Therefore, the proposed

solution should be able to handle log events not

only in a streaming manner but also at large scale.

For this purpose, a BIRCH based stream clustering

solution has been proposed.

3.1 BIRCH based Stream Clustering for Log Events

Logs are produced as events that occur while an

application is being executed. The events are

produced in a continuous and streaming manner.

Therefore, it is important to be able to process such

logs in a streaming manner. A BIRCH (Zhang, 1997)

based approach has been utilized to cluster log

events, streamed during execution of an application,

into different clusters. Events are categorized into

different clusters using stream clustering. The

categorization could be based on a particular

category, status, component, functional, non-

functional properties or any other application specific

features. Clustering of logs based on the data stream of

events from logs is carried out by the BIRCH approach

as described in Table 1. BIRCH uses a clustering

feature (CF), which is based on the number of data points

(N), their linear sum (LS) and their squared sum (SS).

Therefore, CF = {N, LS, SS}.
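To make the clustering feature concrete, the following Python sketch computes CF = {N, LS, SS} for numeric feature vectors and shows that two CFs can be merged by simple addition. It assumes that each log event has already been mapped to a numeric vector (the mapping itself is not shown and is an assumption of this example), and it treats SS as the scalar sum of squared norms, as in (Zhang, 1997). The class and method names are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """CF = {N, LS, SS} for a group of numeric feature vectors."""
    n: int            # N: number of data points
    ls: List[float]   # LS: per-dimension linear sum of the points
    ss: float         # SS: sum of the squared norms of the points

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging two groups adds their components.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root mean squared distance of the points from the centroid,
        # computed from N, LS and SS alone (no raw points needed).
        c2 = sum(c * c for c in self.centroid())
        return max(self.ss / self.n - c2, 0.0) ** 0.5


# Usage: two hypothetical event vectors summarized in a single CF.
cf = ClusteringFeature.from_point([1.0, 2.0]).merge(
    ClusteringFeature.from_point([3.0, 4.0])
)
```

Because the centroid and radius are derived from N, LS and SS only, a stream of events can be summarized without storing every individual event, which is what makes the CF suitable for streaming input.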

Table 1: Stream Clustering Algorithm for Log Events.

Inputs:

1. A set of n Log Events from Semantic Logs (LE1, LE2, LE3, … LEn).

2. An integer k for number of clusters to be formed.

Algorithm:

1. For n Log Events LE1 to LEn, compute clustering feature CF.

2. Build CF-Tree with a branching factor B and Threshold T using (Zhang, 1997).

3. Perform initial clustering using hierarchical clustering as in (Zhang, 1997).

4. Perform cluster refining by doing additional pass-overs over the data points and re-assigning points to closest centroids.

Output:

1. k clusters with each cluster containing a set of Log Events belonging to that cluster as {C1, C2, C3 … Ci … Ck}.

2. Ci = {LE1, LE2 … LEx}
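The steps of Table 1 can be sketched in Python as follows. This is a deliberately simplified illustration and not the author's implementation: it keeps a flat list of CF entries instead of a CF-Tree (so the branching factor B is not modelled), it assumes that log events have already been converted to numeric vectors, and all function names are hypothetical. The threshold T is applied to the radius of a candidate merged entry.

```python
from math import dist
from typing import List, Sequence, Tuple

CF = Tuple[int, List[float], float]  # (N, LS, SS)


def cf_of(point: Sequence[float]) -> CF:
    return (1, list(point), sum(x * x for x in point))


def merge(a: CF, b: CF) -> CF:
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf: CF) -> List[float]:
    return [x / cf[0] for x in cf[1]]


def radius(cf: CF) -> float:
    c2 = sum(c * c for c in centroid(cf))
    return max(cf[2] / cf[0] - c2, 0.0) ** 0.5


def absorb(entries: List[CF], point: Sequence[float], threshold: float) -> None:
    """Steps 1-2 (simplified): fold one streamed point into a flat list of CF entries."""
    new = cf_of(point)
    if entries:
        i = min(range(len(entries)), key=lambda j: dist(centroid(entries[j]), point))
        candidate = merge(entries[i], new)
        if radius(candidate) <= threshold:
            entries[i] = candidate
            return
    entries.append(new)


def global_clusters(entries: List[CF], k: int) -> List[List[float]]:
    """Step 3: agglomeratively merge the closest CF entries until k remain."""
    entries = list(entries)
    while len(entries) > k:
        pairs = [(i, j) for i in range(len(entries)) for j in range(i + 1, len(entries))]
        i, j = min(pairs, key=lambda p: dist(centroid(entries[p[0]]), centroid(entries[p[1]])))
        merged = merge(entries[i], entries[j])
        entries = [e for idx, e in enumerate(entries) if idx not in (i, j)] + [merged]
    return [centroid(e) for e in entries]


def refine(points: Sequence[Sequence[float]], centroids: List[List[float]]) -> List[int]:
    """Step 4: one extra pass that assigns each point to its closest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(centroids[c], p)) for p in points]


# Usage on a small, made-up stream of two-dimensional event vectors.
stream = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
entries: List[CF] = []
for event_vector in stream:
    absorb(entries, event_vector, threshold=0.5)
labels = refine(stream, global_clusters(entries, k=2))
```

Only the absorption step touches the stream one event at a time; the global clustering works on the much smaller set of CF entries, which is the property that lets this style of clustering scale to large volumes of log events.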

3.2 Clustering of Log Events from Semantic Logs

This section presents the extended semantic logging

model that is customized specifically for clustering

of log events. The proposed extension of the

semantic logging model, based on the previous work

of the author (Shafiq, 2014b), is shown in Table 2.

The semantic model encapsulates important and

relevant information such as the global clustering solution,

intermediate refined clustering solutions, centroids

for different clusters and so on. The rest of the semantic

logging model contains elements such as different types

of annotations, including semantic and syntactic or

simple annotations.
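For readers who prefer code to class tables, the clustering-related classes of the extended model listed in Table 2 (GlobalClustering, RefinementClustering and Cluster) could be mirrored by Python dataclasses along the following lines. This is only a sketch: the `event_id` field on `LogEvent` is a hypothetical placeholder, the single-valued reading of hasCentroid is an assumption, and the annotation and application classes are omitted because their attributes are defined elsewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEvent:
    # Hypothetical placeholder attribute; the model's actual LogEvent
    # attributes are not reproduced in this sketch.
    event_id: str


@dataclass
class Cluster:
    has_log_event: List[LogEvent] = field(default_factory=list)  # multi-valued
    has_centroid: Optional[LogEvent] = None  # assumed single-valued


@dataclass
class GlobalClustering:
    has_cluster: List[Cluster] = field(default_factory=list)  # multi-valued


@dataclass
class RefinementClustering:
    has_cluster: List[Cluster] = field(default_factory=list)  # multi-valued


# Usage: one global clustering holding a single cluster of two events.
e1, e2 = LogEvent("LE1"), LogEvent("LE2")
global_solution = GlobalClustering([Cluster([e1, e2], has_centroid=e1)])
```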

Table 2: Extended Semantic Logging Model for Stream Clustering.

Class GlobalClustering

hasCluster type Cluster

multiplicity = multi-valued

Class RefinementClustering

hasCluster type Cluster

multiplicity = multi-valued

Class Cluster

hasLogEvent type LogEvent

multiplicity = multi-valued

hasCentroid type LogEvent

Class SimpleAnnotation

attribute(s) as defined in (Shafiq, 2016).

Class SemanticAnnotation

attribute(s) as defined in (Shafiq, 2016).

Class Application