Fast Analysis and Prediction in Large Scale Virtual Machines Resource
Utilisation
Abdullahi Abubakar¹,², Sakil Barbhuiya², Peter Kilpatrick², Ngo Anh Vien² and Dimitrios S. Nikolopoulos³
¹Department of Computer Science, Waziri Umaru Federal Polytechnic, Birnin Kebbi, Nigeria
²School of Electronics, Electrical Engineering and Computer Science, Queen’s University, Belfast, U.K.
³John W. Hancock Professor of Engineering, Computer Science, Virginia Tech, U.S.A.
Keywords:
Virtual Machine, Cloud/Data Centre, Prediction, Scalability, Performance, Partitioning, Parallelism, Big Data.
Abstract:
Most cloud providers running Virtual Machines (VMs) have the constant goals of preventing downtime, increasing performance and managing power, among others. The most effective way to achieve these goals is to be proactive by predicting the behaviour of the VMs. Analysing VMs is important, as it can help cloud providers gain insights to understand the needs of their customers, predict their demands, and optimise the use of resources. To manage the resources in the cloud efficiently, and to ensure the performance of cloud services, it is crucial to predict the behaviour of VMs accurately. This will also help the cloud provider improve VM placement, scheduling, consolidation, power management, etc. In this paper, we propose a framework for fast analysis and prediction of large scale VM CPU utilisation. We use a novel approach, both in terms of the algorithms employed for prediction and in terms of the tools used to run these algorithms on a large dataset, to deliver a solid VM CPU utilisation predictor. We processed over two million VMs from the Microsoft Azure VM traces and filtered out the VMs with a complete month of data, which amount to 28,858 VMs. The filtered VMs were subsequently used for prediction. Our statistical analysis reveals that 94% of these VMs are predictable. Furthermore, we investigated the patterns and behaviours of those VMs and found that most VMs have one or several spikes, the majority of which are not seasonal. Of all 28,858 VMs analysed and forecasted, we accurately predicted 17,523 (61%) based on their CPU utilisation. We use Apache Spark for parallel and distributed processing to achieve fast processing. In terms of execution time, on average, each VM is analysed and predicted within three seconds.
1 INTRODUCTION
Organisations across various fields are producing and storing vast amounts of data. These large volumes of data are generated at an unprecedented rate from heterogeneous sources such as sensors, servers, social media, IoT, mobile devices, etc. (Assunção et al., 2015). These heterogeneities gave birth to the term “big data”. According to Rajaraman (Rajaraman, 2016), about 8 Zettabytes of digital data were generated in 2015. Kune et al. (Kune et al., 2016) estimated that by 2020, the volume of data will reach 40 Zettabytes. This implies that the volume of data being generated is doubling roughly every two years.
As cloud/data centres continue to grow in scale and complexity, effective monitoring and management of cloud services become critical. It is essential for cloud providers to efficiently manage their services to facilitate the extraction of reliable insight and to optimise cost (Oussous et al., 2018). The two major challenges of handling large volumes of data (big data) are management and processing. Managing the complexity of big data (velocity, volume and variety) and processing it in a distributed environment with a mix of applications remain open challenges (Khan et al., 2014). Analysing Virtual Machines (VMs) is important, as this can help cloud service providers such as Amazon, Google, IBM and Microsoft gain insights to understand the needs of their customers, predict their demands, and optimise the use of resources. To manage the resources in the cloud efficiently, and to ensure the performance of cloud services, it is crucial to predict the behaviour of VMs accurately. Based on the above motivation we set two major objectives:
• fast data analysis
• accurate prediction
Most cloud providers running VMs have the constant goals of preventing downtime, increasing performance and managing power, among others. Therefore, one of the most effective ways to achieve these goals is to be proactive by predicting the behaviour of the VMs. A good predictor can help improve VM placement, scheduling, consolidation, power management, etc. Furthermore, to ensure that the quality of service is maintained, a good predictor of VM resource utilisation can serve as a powerful tool for anomaly detection. In this paper, a prediction model for forecasting CPU utilisation is proposed.
The novelty of this work lies in using an adaptive predictive model which is both accurate and efficient. Additionally, our model is designed to efficiently handle realistic large scale datasets such as the Azure VM traces (Cortez et al., 2017). We use a novel approach both in terms of the algorithm employed for prediction and the tools used to manage our large scale dataset. We use Apache Spark for parallel and distributed processing to achieve fast processing, and ARIMA, a statistical time-series model, for prediction.
In terms of contribution, we address fast, scalable prediction of VM behaviour. In addition, we communicate some useful information extracted (discovered) from the dataset. These pieces of information are useful for other researchers who want to carry out further investigations, such as anomaly detection, among others. We also highlight the methods and tools employed to achieve scalability, fast data analytics, and accurate prediction. Additionally, a sanitised dataset for 28,858 VMs has been made available for other researchers (see the appendix).
We processed traces of over two million VMs from the Microsoft Azure VM dataset and filtered out the VMs for which there is a complete month of data, which amount to 28,858 VMs. We refer to these as long-running VMs. The filtered VMs were subsequently used for prediction. To the best of our knowledge, we are the first to analyse as many as 28,858 long-running VMs of the Azure VM traces and accurately predict over 17K VMs. The work of Comden et al. (Comden et al., 2019) is the closest to ours in terms of the number of VMs analysed from the same dataset. However, they only analysed 1003 Azure VMs. In their work, they proposed an automatic algorithm selection scheme, which can help cloud operators pick the right algorithm to manage cloud computing resources. They use four prediction methods, namely Random Forest, Naive, Seasonal Exponential Smoothing and SARMA. They empirically study the prediction errors from the Azure VM traces and propose a simple prediction error model. The model creates a simple online meta-algorithm that chooses the best algorithm (Comden et al., 2019).
The rest of this paper is organised as follows. Section 2 discusses related work. Section 3 presents our methodology and approach. Section 4 describes the experimental process pursued. Section 5 presents results demonstrating the capability of our method. Finally, we provide conclusions and possible future work in Section 6.
2 RELATED WORK
Many researchers (Wang et al., 2016; Comden et al., 2019; Calheiros et al., 2015) are working in the area of cloud computing to predict the behaviour of VMs. The two most popular VM prediction approaches are homeostatic prediction (Lingyun Yang et al., 2003) and history-based prediction. In homeostatic prediction, the incoming data point at the next time instance is predicted by adding or subtracting a value, such as the mean of previous data points, to or from the current data point, whereas history-based methods analyse previous data points and extract patterns to forecast the future (Kumar and Singh, 2018). To use historical information, VM CPU utilisation can be presented in the form of a time series. A time series is a set of data points measured at successive points in time, spaced at uniform intervals (Hyndman and Athanasopoulos, 2018). Studies (Kumar and Singh, 2018) show that data centre workloads tend to present behaviour that can be effectively captured by time series-based models.
Classical models have been widely used for time series prediction. These models, including Autoregression (AR) (Gersch and Brotherton, 1980), Holt-Winters (Yan-ming Yang et al., 2017), Exponentially Weighted Moving Average (EWMA) (Fehlmann and Kranich, 2014), Moving Average (MA), and Autoregressive Integrated Moving Average (ARIMA) (Ahmar et al., 2018), among others, can be used for VM prediction.
Li (Li, 2005) proposed an Autoregression (AR) based method to predict web server workload. The author used a linear combination of past values of the variable to forecast the value for upcoming time instances. The model was strictly linear in nature and therefore does not take into account non-linear behaviour that might occur in a cloud setting. Wang (Wang et al., 2016) used a combination of ARIMA and a BP neural network to predict the CPU utilisation of a single Virtual Machine. Furthermore, (Calheiros et al., 2015) developed a workload prediction module using the ARIMA model. The predicted load is used to dynamically provision VMs in an elastic cloud environment for serving the predicted requests, taking into consideration quality of service (QoS) parameters such as response time and rejection rate. The major drawback of their work is that they manually tweak the parameters of ARIMA(p,d,q), which is time-consuming and unsuitable for analysing multiple VMs concurrently. Therefore, to increase the applicability of our proposed work, we adopted ARIMA-based prediction. We developed a method to grid search the ARIMA hyperparameters to determine the optimal values for our model. We automate the process of training and evaluating ARIMA models on different combinations of model hyperparameters. We specify a grid of (p, d, q) ARIMA parameters to iterate over on the training data, which avoids manual parameter tweaking.
We are analysing a very large scale dataset with the goal of achieving fast processing and accurate prediction. It is therefore crucial, when choosing the appropriate framework for big data analytics, to consider several critical aspects, such as data size, computing capacity, scalability, fault tolerance and framework functionality (Singh and Reddy, 2015). Based on these critical aspects, we chose Apache Spark as our data processing framework. All of the studies reviewed here (Alkatheri et al., 2019; Merrouchi et al., 2018; Marcu et al., 2016; Singh and Reddy, 2015) indicate that Spark is the best in terms of processing time, task performance, low latency, fault tolerance and scalable computing capability, compared to Hadoop, Storm and Flink. The work of Comden et al. (Comden et al., 2019) is the closest to ours in terms of the number of VMs analysed from the same dataset. However, they only analysed 1003 Azure VMs, whereas we analysed 28,858 VMs.
3 METHODOLOGY
This section describes the approach (experimental workflow) implemented to achieve the set objectives. We present our framework, which is aimed at predicting VM resource utilisation, as depicted in Figure 1. The experimental workflow is separated into four phases, namely: data source, data management, modelling, and result analysis. These phases are briefly discussed below.
Data Source: A Microsoft Azure VM dataset is analysed. The dataset has 125 files which, when concatenated, contain over two million VMs and amount to 117 GB. The VMs ran during a 30-day interval within the period 16th November 2016 to 16th February 2017 (Cortez et al., 2017). The information includes identification numbers for each VM and resource utilisation (minimum, average, and maximum CPU utilisation) reported at five-minute intervals.
Data Management: Performing analytics on large volumes of data requires efficient methods to store, filter, transform, and retrieve the data (Assunção et al., 2015). It is therefore very important, when choosing the appropriate framework for big data analytics, to consider several critical aspects, such as data size, computing capacity, scalability, fault tolerance and framework functionality (Singh and Reddy, 2015). Based on these critical aspects, we chose Apache Spark as our data processing framework. The dataset analysed is divided into 125 files which, when concatenated, amount to 117 GB. In order to obtain a complete month of data, this large volume demands fast and accurate preprocessing tasks, including integration, filtering and transformation. These tasks were easily and accurately handled with Spark. The preprocessed data is then used for prediction, evaluation, visualisation and interpretation.
Modelling: Studies in (Assunção et al., 2015) show that analytics solutions can be classified as descriptive and/or predictive. Descriptive analytics is concerned with modelling past behaviour by using historical data to identify patterns and create management reports, whereas predictive analytics attempts to forecast the future by analysing current and historical data (Assunção et al., 2015). We adopted both solutions while analysing and predicting the Azure VMs. We first implemented descriptive analytics, which we refer to as data inspection, by using historical data to identify patterns in the prepared data (the result of the preprocessing). The main reason for implementing this phase is to help us select an appropriate predictive model. We carried out our data inspection by implementing autocorrelation analysis and decomposition. Autocorrelation is a characteristic of data which shows the degree of similarity between the values of the same variables over successive time intervals. It summarises the strength of the relationship between an observation in a time series and observation(s) at prior time steps (Andrews, 1991; Brownlee, 2017). Decomposition, on the other hand, is a mechanism to split a time series into several components, each representing an underlying pattern category (Hyndman and Athanasopoulos, 2018). Decomposition provides a structured way of thinking about how to best capture each of these components in a given model (Brownlee, 2017) and in terms of modelling complexity (Shmueli and Lichtendahl Jr, 2016).
After descriptive analytics, we proceeded to predictive analytics. Accurate prediction of VMs is vital for cloud resource allocation, so it is important to choose a good predictive model. Based on the outcomes of descriptive analytics, which comprises autocorrelation analysis and decomposition, we implemented ARIMA, a classical time series prediction algorithm, to forecast the behaviour of each VM.
Result Analysis & Visualisation: In this phase, the experimental results are analysed and interpreted in accordance with the guidance and recommendations of (Assunção et al., 2015). They suggest that the initial result generated by a first model is not enough to draw a final conclusion: results need to be re-evaluated, which may lead to modifications that generate new models or adjust existing ones. In our experiment, the re-evaluation was done by grid searching the ARIMA hyperparameters to determine the optimal values for our model.

Figure 1: Model Architecture (Analytic workflow).
4 EXPERIMENT
This section describes the details of the experiments conducted. These include tool selection, experimental steps, dataset description, prediction algorithm selection and implementation.
4.1 Experimental Tool
Data analytics is a complex process that demands expertise in data understanding, data cleaning, proper method selection, and analysing and interpreting the results. Tools are fundamental to help perform these tasks. Therefore, it is important to understand the requirements in order to choose appropriate tools (Assunção et al., 2015).
4.1.1 Data Analytic Tool
Apache Spark is one of the most widely used open-source big data processing engines (Armbrust et al., 2015). It is a hybrid processing framework that provides high-speed batch as well as real-time processing on large scale datasets. It has wide support, integrated libraries and flexible integrations (Matei, 2012). Apache Spark is claimed to be 100 times faster than Hadoop (Shvachko et al., 2010) due to in-memory computation (Vernik et al., 2018).
Spark’s generality has several important benefits. First, we can easily develop applications because Spark uses a unified API. Secondly, we can develop parallel and distributed applications to achieve faster processing: about 100x faster in memory and 10x faster on disk (Shvachko et al., 2010).
Given the above benefits, we propose a scalable data analytic framework and predictive model based on Apache Spark. Additionally, we chose Spark for the efficiency of quickly writing code that runs on a distributed system, and because Spark handles the parallelisation and organisation of the data processing tasks neatly. In our experiment, we use Spark version 2.3.1 with Scala version 2.11.8.
4.1.2 Prediction Tool
Classic time series models such as Autoregressive Integrated Moving Average (ARIMA) (Ahmar et al., 2018; Calheiros et al., 2015), Holt-Winters (Yan-ming Yang et al., 2017) and Exponentially Weighted Moving Average (EWMA) (Fehlmann and Kranich, 2014) can be used for VM prediction. In our experiment, the prediction models are developed based on ARIMA.
ARIMA is a powerful prediction model for forecasting a time series which can be stationarised by transformations such as differencing and lagging (Hyndman and Athanasopoulos, 2018). Differencing is useful in that it helps stabilise the mean of a time series by removing changes in its level (Hyndman and Athanasopoulos, 2018; Brockwell and Davis, 2016). ARIMA is an extension of the ARMA model, which is a combination of the Auto-Regressive (AR) and Moving Average (MA) models. The AR and MA models represent time series that are generated by passing the input through a linear filter which produces the output Y(t) at any time t. The major difference between the two is that AR models generate the time series using the previous output values Y(t − τ), whereas MA models use only the input values X(t − τ), where τ = 0, 1, 2, 3, ..., n (Yu et al., 2016). ARIMA applies the ARMA model not to the given time series directly, but after preliminary differencing, i.e. to the time series obtained by computing the differences between consecutive values of the original series (Cheboli et al., 2010). The model is generally specified as (p, d, q), where p is the autoregressive order, d the differencing order, and q the moving average order (Brockwell and Davis, 2016). The prediction equation is given by:
\hat{Y}_t = \alpha_0 + \sum_{i=1}^{p} \phi_i L^i Y_t + \sum_{i=1}^{q} \theta_i L^i \varepsilon_t + \varepsilon_t \quad (1)
Here \alpha_0 is a constant, \phi_i are the autoregressive coefficients, \theta_i the moving average coefficients, \varepsilon_t is white noise at time t, and L is the lag operator which, when applied to Y, returns the prior value.
The choice of ARIMA was motivated by its scalability demonstrated in related work (Yu et al., 2016; Wang et al., 2016; Ahmar et al., 2018; Calheiros et al., 2015; Schmidt et al., 2018). Moreover, it is specifically built for, and works well with, time series data (Hyndman and Athanasopoulos, 2018).
4.1.3 Environment
The experiments are set up and run on a High Performance Computing (HPC) cluster called “Kelvin2”. The operating system is CentOS 7 (64-bit). The compute nodes are HP Apollo 200 servers with a high-performance 40 Gbps InfiniBand fabric (32 Gbps effective bandwidth per link). Our experiment is limited to a cluster of four nodes. Table 1 summarises the cluster resources used in our experiment.
Table 1: Cluster Resources.
Node Name
Memory Capacity
(TB)
No of Core
Speed per Core
(GHz)
Smp01 1.0 40 2.60
Smp02 1.0 40 2.60
Smp03 1.5 24 2.60
Smp04 1.5 24 2.60
Total 5.0 128
4.2 Experiment Dataset
This study was conducted on a realistic dataset of Microsoft Azure Virtual Machines (Cortez et al., 2017). The dataset contains over two million VMs that ran during a 30-day interval within the period 16th November 2016 to 16th February 2017 (Cortez et al., 2017). The information includes identification numbers for each VM and resource utilisation (minimum, average, and maximum CPU utilisation) reported at five-minute intervals. The total number of VMs in the dataset is 2,013,767.
4.3 Experimental Steps
4.3.1 Data Preprocessing
The average CPU utilisation from the Microsoft Azure VMs is used in our experiment. The dataset has 125 files which, when concatenated, amount to 117 GB. In order to obtain a complete month of data, this large volume demands preprocessing tasks including integration, filtering and transformation.
We loaded and concatenated these files in Apache Spark as depicted in Figure 1. We first created a DataFrame with three columns, namely VM ID, Time and Average CPU. Then we filtered and grouped each record by VM ID. Finally, we converted our data to an array of dense vectors for model consumption. Figure 2 depicts what our transformed data looks like. It is important to note that not all of the 2,013,767 VMs ran continually throughout the entire 30-day period. We found that only 28,858 VMs have a complete month of data. Therefore, predictions were carried out on the VMs with a complete month of data.
Figure 2: Preprocessed result indicating both original and transformed data.
4.3.2 Data Inspection
As discussed in Section 3, and depicted in Figure 1, the prediction process started by modelling past behaviour of the filtered long-running VMs to identify patterns. The main reason for data inspection is to help us find out whether our VMs are predictable or not. Furthermore, it also helps us select an appropriate prediction model, since our goal is to develop a model that makes accurate predictions. We carried out our data inspection by implementing autocorrelation analysis and decomposition.
We started with the autocorrelation analysis, applying the Durbin-Watson (DW) statistical test to our filtered VMs. The Durbin-Watson statistic is a test for autocorrelation in the residuals from a statistical regression analysis. The test always produces a statistic between 0.0 and 4.0. A value of 2.0 (the middle of the range) suggests no autocorrelation. Values closer to 0.0 indicate positive autocorrelation, while values above 2.0 indicate negative autocorrelation. In our experiment, the DW test suggests that about 94 percent of the VMs have detectable autocorrelation, and hence are predictable.
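As an illustration, the DW statistic can be computed directly from the residuals. The sketch below is our own, assuming the residuals of the fitted regression are already available as a plain array:

```scala
// Minimal Durbin-Watson sketch on a residual series e(1..n) (at least two
// residuals assumed). DW near 2.0 means no autocorrelation; values towards
// 0.0 indicate positive autocorrelation, i.e. the series is predictable
// from its own past.
def durbinWatson(residuals: Array[Double]): Double = {
  val sumSquaredDiffs = residuals.sliding(2)
    .map { case Array(prev, cur) => (cur - prev) * (cur - prev) }
    .sum
  val sumSquares = residuals.map(e => e * e).sum
  sumSquaredDiffs / sumSquares
}

// A VM is flagged as predictable when positive autocorrelation is present:
// val predictable = durbinWatson(residuals) < 2.0
```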
Time series data can exhibit a variety of patterns, and it is often helpful to decompose a time series into several components, each representing an underlying pattern category (Hyndman and Athanasopoulos, 2018). This is because decomposition provides a structured way of thinking about how to best capture each of these components in a given model (Brownlee, 2017) and in terms of modelling complexity (Shmueli and Lichtendahl Jr, 2016). We decompose our time series into the four components below (a code sketch of this decomposition follows the list).
• Observed: the original observed data in the series.
• Trend: the long-term movement in the time series.
• Seasonal: the repeating short-term cycle in the series.
• Residuals: the time series after the trend and seasonal components are removed.
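The sketch below is a minimal illustration of classical additive decomposition (series = trend + seasonal + residual), assuming an hourly season of 12 five-minute samples; it shows the idea rather than the exact routine used in our experiments:

```scala
// Classical additive decomposition sketch for one VM's CPU series.
def decompose(series: Array[Double], period: Int = 12)
    : (Array[Double], Array[Double], Array[Double]) = {
  val n = series.length
  val half = period / 2

  // Trend: a simple centred moving average spanning one full season
  // (a (period + 1)-point window; NaN where the window does not fit).
  val trend = Array.tabulate(n) { t =>
    if (t < half || t >= n - half) Double.NaN
    else series.slice(t - half, t + half + 1).sum / (period + 1)
  }

  // Seasonal: average detrended value at each position within the season.
  val detrended = series.zip(trend).map { case (y, tr) => y - tr }
  val seasonal = Array.tabulate(n) { t =>
    val same = (t % period until n by period).map(i => detrended(i)).filterNot(_.isNaN)
    if (same.isEmpty) 0.0 else same.sum / same.size
  }

  // Residual: whatever remains after removing trend and seasonality.
  val residual = Array.tabulate(n)(t => series(t) - trend(t) - seasonal(t))
  (trend, seasonal, residual)
}
```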
To gain more insight into our data, we implemented decomposition. Figure 3 shows the decomposition result of a randomly selected VM. In Figure 3(a), we used the first day's data to check the properties of the observed series. The seasonality information extracted from the series indicates a pattern repeating itself every hour. The residuals are also interesting, showing periods of high variability towards the middle of the series. We extended our data to one week to check for daily properties, as shown in Figure 3(b). Here we can see that the pattern also repeats itself every day. The trend decreases towards the second day but gradually increases on the third day, and the increase is maintained throughout the week. Finally, Figure 3(c) shows the decomposition of a complete month of data (8640 data points) to check for trend, seasonality, and residue. The trend indicates a gradual upward movement in the series. The seasonality information extracted is also interesting, showing a pattern repeating every week.

Figure 3: Decomposition result of a sample VM (checking for hourly, daily and weekly behaviours): (a) first-day data; (b) first-week data; (c) one-month data.
Furthermore, we carried out a visual inspection and realised that most series contain one or several significant spikes, as shown in Figure 4. As most spikes are not seasonal, it could interest other researchers to investigate further whether those spikes are potential anomalies.
4.3.3 Prediction Method
Having confirmed that our VMs are predictable, based on the DW test and decomposition (see Figure 3), we proceeded to predictive analytics.
The filtered VMs with complete (one month) data, which amount to 28,858, are used for our experiment. Each of the 28,858 VMs is predicted, and each has 8,640 data points. We used ARIMA on Apache Spark to develop our prediction model, implemented in the Scala programming language. We modelled the relationship between CPU utilisation (dependent variable) and timestamp (independent variable). As recommended by (Brownlee, 2017), we split the data into a training set and a test set. The training set was used for model preparation and the test set to evaluate it. We respect the temporal ordering in which values were observed when splitting our data, since the time dimension of the observations does not allow a random split into groups. The dataset is split in two: 90% for training and 10% for testing.
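To make the split concrete, a minimal sketch (our own illustration; `series` holds one VM's 8,640-point average-CPU array):

```scala
// Temporal 90/10 split: the first 90% of a VM's readings form the training
// set and the final 10% the test set, preserving time order (no shuffling).
def temporalSplit(series: Array[Double], trainFraction: Double = 0.9)
    : (Array[Double], Array[Double]) =
  series.splitAt((series.length * trainFraction).toInt)

// For a complete month of 8640 points this yields 7776 training
// and 864 test observations per VM.
```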
After the train-test split, the ARIMA model is fitted to the training data following the Box-Jenkins method (Box et al., 2015). According to this method, the time series must be transformed into a stationary one, that is, for each pair (X_t, X_{t+\tau}), the mean and variance of the process must be constant and independent of t, where \tau is the time difference (lag) between the data points. This transformation is achieved by differencing the original time series.
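As a small illustration of that step (our own sketch, not library code):

```scala
// First-order differencing sketch: ARIMA(p, d, q) effectively fits an
// ARMA(p, q) model to the series obtained by differencing the original
// series d times, which removes changes in level and helps stationarise it.
def difference(series: Array[Double]): Array[Double] =
  series.sliding(2).map { case Array(prev, cur) => cur - prev }.toArray

def differenceTimes(series: Array[Double], d: Int): Array[Double] =
  (1 to d).foldLeft(series)((s, _) => difference(s))
```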
We developed a method to grid search the ARIMA hyperparameters and determine the optimal values for our model. We automate the process of training and evaluating ARIMA models on different combinations of model hyperparameters: we specify a grid of (p, d, q) ARIMA parameters to iterate over on the training data, which avoids manual parameter tweaking. A model is created for every parameter combination and its performance evaluated based on a scale-dependent error, as suggested by (Hyndman and Athanasopoulos, 2018); in our case, the Mean Square Error (MSE). We then select the best model based on the computed residuals: the best model is the one with the least MSE, and it is then applied to the data to generate predictions (forecasts).
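A sketch of this grid search follows. The fit-and-forecast step is injected as a function parameter because its concrete signature depends on the ARIMA implementation plugged in (for example, the spark-ts library); both that signature and the parameter ranges shown are illustrative assumptions, not the exact grid we used:

```scala
// Grid-search sketch over (p, d, q). The injected fitAndForecast function
// trains an ARIMA(p, d, q) model on the training series and returns a
// forecast of the requested length; its signature here is an assumption.
case class ArimaCandidate(p: Int, d: Int, q: Int, mse: Double)

def gridSearch(
    train: Array[Double],
    holdout: Array[Double],
    fitAndForecast: (Int, Int, Int, Array[Double], Int) => Array[Double])
    : ArimaCandidate = {
  val candidates = for {
    p <- 0 to 5       // illustrative ranges, not the exact grid used
    d <- 0 to 2
    q <- 0 to 5
  } yield {
    val forecast = fitAndForecast(p, d, q, train, holdout.length)
    val mse = forecast.zip(holdout)
      .map { case (f, a) => (f - a) * (f - a) }
      .sum / holdout.length
    ArimaCandidate(p, d, q, mse)
  }
  // The best model is the one with the lowest error.
  candidates.minBy(_.mse)
}
```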
After generating predictions, all predicted data points are compared with the expected values in our test dataset, and a forecast error score is calculated. The test dataset is not used during training, so it can be considered new data (in this case, actual real-life data from the Microsoft Azure VMs). It is important to note that forecast errors are different from residuals: forecast errors are calculated on the testing data, while residuals are calculated on the training data (Hyndman and Athanasopoulos, 2018). We again use MSE to measure forecast accuracy, because it is more sensitive than other measures and penalises large errors. MSE is calculated as the average of the squared forecast error values, as shown in Equation (2).
The forecast error function is defined as:

MSE = \frac{1}{n} \sum_{t=1}^{n} (\hat{y}_t - y_t)^2 \quad (2)
where n is the total number of observations, and \hat{y}_t and y_t are the predicted and expected data points at time t, respectively.
5 RESULTS & DISCUSSION
In this section, we present experimental results demonstrating the capability of our method, both in terms of accurate prediction and in terms of fast processing, as seen in Figure 5 and Figure 8 respectively. We also describe the techniques employed in our experiment to achieve fast analysis, followed by a presentation of the results obtained.
5.1 Results Analysis
The performance of prediction is evaluated by computing the prediction error. On a scale of 0 to 100 (units of CPU utilisation), we classified the prediction performance for each VM into three categories:
• Low Mean Square Error (LMSE): VMs with an error score (MSE) between 0.0 and 0.9.
• Medium Mean Square Error (MMSE): VMs with an error score (MSE) between 1.0 and 5.0.
• High Mean Square Error (HMSE): VMs with an error score (MSE) above 5.0.
The VMs that fall into the first category are considered VMs with an accurate prediction, because an MSE of zero, or a very small value near zero, indicates accurate prediction (Hyndman and Athanasopoulos, 2018; Aggarwal, 2018; Brownlee, 2017).
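A minimal sketch of this bucketing, assuming the per-VM forecast MSE has already been computed:

```scala
// Bucketing sketch mirroring the three categories above.
def category(mse: Double): String =
  if (mse < 1.0) "LMSE"        // accurate prediction
  else if (mse <= 5.0) "MMSE"
  else "HMSE"
```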
Figure 4: Time series CPU utilisation of four randomly selected VMs.

Figure 5: Forecast results: (a) MSE for VMs over different forecast periods; (b) percentage of MSE for VMs over different forecast periods.
Finally, we forecast the future of our time series after automating the (p, d, q) parameters and obtaining the best fit. We varied the forecast period and computed the forecast error for every VM for each period. The eight forecast periods implemented are 30 minutes, 1 hour, 2 hours, 5 hours, 12 hours, 1 day, 2 days and 3 days. We started with a short-term forecast of 30 minutes. With the 30-minute forecast, our model displays good prediction performance, accurately predicting the behaviour of more than seventeen thousand VMs, as shown in Figure 5(a): of all 28,858 VMs analysed, we accurately predicted 17,523 (61%). We gradually increased the length of the forecast period through a number of hours to days, as shown in Figure 5. We notice that the shorter the forecast period, the better the prediction, and vice versa, as expected.
Figure 6: MSE for ten randomly selected VMs on different forecast periods.

Figure 6 is an MSE plot for ten randomly selected VMs. The figure clearly shows that, for every VM, the shorter the forecast period, the lower the MSE. This is because we are analysing a realistic dataset: in the real world, data comes from a complex environment and evolves over time, so predictions become less accurate over time (Zambon et al., 2018).
However, with a five-hour forecast, which is ideal for cloud providers, more than thirteen thousand VMs were accurately predicted (about 49% of the total VMs analysed). Experiments by (Zheng et al., 2013) reveal that in cloud/data centres a five-hour forecast is optimal for cloud providers to take decisions such as VM migration. The authors designed a migration progress management system called “Pacer”, which manages migration by controlling the migration time of each migration process and coordinating the processes to finish at the desired time. With a bandwidth of 32 MBps, Pacer estimated approximately 2 hours (precisely 6500 seconds) for migrating a VM of size 160 GB (Zheng et al., 2013).
Figure 7: Processing our VMs on each partition in parallel.
5.2 Fast Processing and Scalability
In this section, we discuss how we achieved fast processing in our framework. Furthermore, we show that our system is a scalable, parallel and highly efficient system. As discussed in Section 4.1.1, our dataset is bulky and cannot be processed by conventional techniques. We need to use prevailing technology to perform massive-scale and complex computing. Parallel computing models, which provide parallel and distributed data processing, can accelerate the processing of large amounts of data.
5.2.1 Fast Processing
Spark uses a dispatcher called the “driver node” to manage jobs assigned to the distributed workers. Each job consists of several independent tasks which can be executed in parallel. If execution fails, the driver node relaunches the task. Moreover, Spark has a speculation feature which can identify when a task is too slow; in this case, the task is stopped and executed again (Vernik et al., 2018).
Data partitioning in Spark helps achieve more parallelism (Zaharia et al., 2015). Although Spark fits our problem domain, it is important to note that achieving fast processing and scalability requires careful data partitioning combined with parallel and distributed processing. In our experiment, we split our data (VMs) into partitions and execute computations on each partition in parallel, as shown in Figure 7. The partitions are represented by yellow, brown, green and blue colours; the blue, black and green arrows indicate the first, second and third sets of VMs to be processed for each partition. It is important to note that the processing is done in parallel. Insufficient partitions might lead to improper resource utilisation due to reduced concurrency: in Spark cluster mode, with insufficient partitions, data might be skewed onto a single partition, and one worker node might be doing more work than the others. Conversely, too many partitions may result in excessive overhead in managing many small tasks, so that task scheduling takes more time than the actual execution (Gounaris et al., 2017; Karau et al., 2015). Therefore, proper partitioning is needed to keep our Spark computations running efficiently. Reasonable partitioning leads to utilisation of the available cores in the cluster and avoids excessive overhead in managing small tasks (Petridis et al., 2016).
To achieve efficiency and fast processing, the Spark documentation recommends that each partition should hold a maximum of 128 MB of data (Karau et al., 2015). Based on the recommendation of (Karau et al., 2015), we keep our Spark computations running efficiently by choosing the number of partitions according to the following equation:
\theta = \frac{\gamma}{\lambda} \quad (3)

Here \theta represents the number of partitions, \gamma is the total input dataset size, and \lambda is the partition size (\lambda has a constant value of 128 MB).
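As a worked example of Equation (3): our raw input is 117 GB, i.e. roughly 117,000 MB in decimal units, so \theta = 117,000 / 128 \approx 915 partitions, which is the figure used below.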
We used a dynamic configuration mechanism (Gounaris et al., 2017) to partition our data. The number of partitions used for preprocessing was 915, the result of applying Equation (3) to our original 117 GB of data. After filtering out the long-running VMs, we dynamically re-partitioned the data into 140 partitions to run our predictive model on the filtered long-running VMs. With this data partitioning mechanism, we achieved efficient use of resources and faster processing: in our experiment, approximately three seconds per VM (from preprocessing to prediction and evaluation).
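A sketch of this re-partitioning with standard Spark APIs (the variable names and call sites are our own illustration; `spark` and `filtered` are assumed to be in scope):

```scala
// Partition the raw 117 GB input per Equation (3), then shrink the partition
// count after filtering; coalesce(140) reduces partitions without a full
// shuffle.
val rawPartitions = math.ceil(117000.0 / 128).toInt  // ≈ 915
val raw = spark.read.csv("vm_cpu_readings/*.csv").repartition(rawPartitions)
// ... filtering down to the 28,858 long-running VMs produces 'filtered' ...
val longRunning = filtered.coalesce(140)
```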
It is also important to note that we implemented sequential processing on a sample of 312 VMs to test the performance of our parallel processing. It took 206.46 minutes to analyse and forecast the 312 VMs when the data was not partitioned. Conversely, with the partitioning and parallelism mechanism, using 100 cores, it took only 11.51 minutes. That is about a 17.8x speed improvement, i.e. 93.28% faster than sequential processing, or a 1660% increase in performance.
5.2.2 Scalability
Scalability is a measure of a parallel system's capacity to increase speedup in proportion to the number of processors (Kumar et al., 1994). In our experiment, we checked for scalability in two ways. First, we kept the number of cores constant (100 cores) and varied the problem size by gradually increasing the data size; for each data size chosen, we computed the execution time to determine how long it takes to finish the job. Keeping the number of processors constant and increasing the problem size leads to a positive linear relationship (scalability), as seen in Figure 8(a). Second, we kept the problem size constant (a sample of 312 VMs) and varied the number of cores, as shown in Figure 8(b); for every chosen number of cores, we ran the experiment and computed the execution time. Our experimental results show that the execution time decreases in proportion to the increase in cores. This shows that our system is a scalable parallel system.

Figure 8: Scalability check: (a) varying the number of VMs while keeping the number of cores constant; (b) varying the number of cores while keeping the number of VMs constant.
6 CONCLUSION
In this paper, we proposed a framework for fast analysis and prediction of large scale VM CPU utilisation. Our model is designed to quickly handle realistic large scale datasets such as the Microsoft Azure VM traces. We processed over two million VMs from the Microsoft Azure VM traces and filtered out the VMs with a complete month of data, which amount to 28,858 VMs. The filtered VMs were subsequently used for prediction. For fast processing, we implemented our framework using Apache Spark: we partitioned our data and ran our models in parallel to achieve high scalability. Our framework provides an efficient data processing method for large scale Virtual Machines in a cloud setting. We use ARIMA, a statistical time series model, to predict the behaviour of our VMs. With short-term prediction, we accurately predicted 61% of the 28,858 VMs analysed. In terms of execution time, on average, each VM is analysed and predicted in three seconds.
To the best of our knowledge, we are the first to analyse as many as 28,858 long-running VMs of the Azure VM traces and accurately predict over 17K VMs.
We observed that most VMs have one or several spikes, the majority of which are not seasonal. Future work will investigate those spikes further to determine whether they are potential anomalies.
ACKNOWLEDGEMENTS
This project was funded by the Petroleum Technology Development Fund (PTDF) of Nigeria (PTDF/ED/PHD/AA/1133/17).
We thank Michael Davis and Mohsen Koohi Esfahani for comments that greatly improved the manuscript, and we thank all the anonymous reviewers for their insights.
REFERENCES
Aggarwal, C. C. (2018). A survey of stream clustering algo-
rithms. In Data Clustering, pages 231–258. Chapman
and Hall/CRC.
Ahmar, A. S., Guritno, S., Rahman, A., Minggi, I., Tiro,
M. A., Aidid, M. K., Annas, S., Sutiksno, D. U., Ah-
mar, D. S., Ahmar, K. H., et al. (2018). Modeling
data containing outliers using arima additive outlier
(arima-ao). In Journal of Physics: Conference Series,
volume 954, page 012010. IOP Publishing.
Alkatheri, S., Abbas, S., and Siddiqui, M. (2019). A com-
parative study of big data frameworks. International
Journal of Computer Science and Information Secu-
rity,, page 8.
Andrews, D. W. (1991). Heteroskedasticity and autocorre-
lation consistent covariance matrix estimation. Econo-
metrica: Journal of the Econometric Society, pages
817–858.
Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A.,
Rosen, J., Stoica, I., Wendell, P., Xin, R., and Zaharia,
M. (2015). Scaling spark in the real world: perfor-
mance and usability. Proceedings of the VLDB En-
dowment, 8:1840–1843.
Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., and Buyya, R. (2015). Big data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79-80:3-15.
Box, G. E., Jenkins, G. M., Reinsel, G. C., and Ljung, G. M.
(2015). Time series analysis: forecasting and control.
John Wiley & Sons.
Brockwell, P. J. and Davis, R. A. (2016). Introduction to time series and forecasting. Springer.
Brownlee, J. (2017). Introduction to time series forecasting
with python: how to prepare data and develop models
to predict the future. Machine Learning Mastery.
Calheiros, R. N., Masoumi, E., Ranjan, R., and Buyya, R.
(2015). Workload prediction using arima model and
its impact on cloud applications’ qos. IEEE Transac-
tions on Cloud Computing, 3:449–458.
Cheboli, D., Chandola, V., and Kumar, V. (2010). Anomaly detection for time series: A survey. Technical report, University of Minnesota.
Comden, J., Yao, S., Chen, N., Xing, H., and Liu, Z. (2019).
Online optimization in cloud resource provisioning:
Predictions, regrets, and algorithms. Proceedings of
the ACM on Measurement and Analysis of Computing
Systems, 3(1):16.
Cortez, E., Bonde, A., Muzio, A., Russinovich, M., Fon-
toura, M., and Bianchini, R. (2017). Resource cen-
tral: Understanding and predicting workloads for im-
proved resource management in large cloud platforms.
In Proceedings of the 26th Symposium on Operating
Systems Principles, pages 153–167. ACM.
Fehlmann, T. and Kranich, E. (2014). Exponentially
weighted moving average (ewma) prediction in the
software development process. In 2014 Joint Confer-
ence of the International Workshop on Software Mea-
surement and the International Conference on Soft-
ware Process and Product Measurement, pages 263–
270.
Gersch, W. and Brotherton, T. (1980). Ar model prediction
of time series with trends and seasonalities: A contrast
with box-jenkins modeling. In 1980 19th IEEE Con-
ference on Decision and Control including the Sympo-
sium on Adaptive Processes, pages 988–990.
Gounaris, A., Kougka, G., Tous, R., Montes, C. T., and Tor-
res, J. (2017). Dynamic configuration of partitioning
in spark applications. IEEE Transactions on Parallel
and Distributed Systems, 28:1891–1904.
Hyndman, R. J. and Athanasopoulos, G. (2018). Forecast-
ing: principles and practice. OTexts.
Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc.
Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, M.,
Kamaleldin, W., Alam, M., Shiraz, M., and Gani, A.
(2014). Big data: survey, technologies, opportunities,
and challenges. The Scientific World Journal, 2014.
Kumar, J. and Singh, A. K. (2018). Workload prediction
in cloud using artificial neural network and adaptive
differential evolution. Future Generation Computer
Systems, 81:41 – 52.
Kumar, V., Grama, A., Gupta, A., and Karypis, G. (1994). Introduction to parallel computing: Design and analysis of algorithms. Benjamin/Cummings Publishing Company, Redwood City, CA, 18:82-109.
Kune, R., Konugurthi, P. K., Agarwal, A., Chillarige, R. R.,
and Buyya, R. (2016). The anatomy of big data com-
puting. Software: Practice and Experience, 46(1):79–
105.
Li, T.-H. (2005). A hierarchical framework for modeling
and forecasting web server workload. Journal of the
American Statistical Association, 100:748–763.
Lingyun Yang, Foster, I., and Schopf, J. M. (2003). Homeostatic and tendency-based cpu load predictions. In Proceedings International Parallel and Distributed Processing Symposium, 9 pp.
Marcu, O.-C., Costan, A., Antoniu, G., and Pérez-Hernández, M. S. (2016). Spark versus flink: Understanding performance in big data analytics frameworks. In 2016 IEEE International Conference on Cluster Computing (CLUSTER), pages 433-442. IEEE.
Matei, Z. (2012). Discretized streams: A fault-tolerant model for scalable stream processing. Technical Report UCB/EECS-2012-259, University of California, Berkeley, Department of Electrical Engineering and Computer Science.
Merrouchi, M., Skittou, M., and Gadi, T. (2018). Popu-
lar platforms for big data analytics: A survey. In 2018
International Conference on Electronics, Control, Op-
timization and Computer Science (ICECOCS), pages
1–6.
Oussous, A., Benjelloun, F.-Z., Lahcen, A. A., and Belfkih,
S. (2018). Big data technologies: A survey. Journal
of King Saud University - Computer and Information
Sciences, 30(4):431 – 448.
Petridis, P., Gounaris, A., and Torres, J. (2016). Spark pa-
rameter tuning via trial-and-error. In INNS Conference
on Big Data, pages 226–237. Springer.
Rajaraman, V. (2016). Big data analytics. Resonance,
21(8):695–716.
Schmidt, F., Suri-Payer, F., Gulenko, A., Wallschläger, M., Acker, A., and Kao, O. (2018). Unsupervised anomaly event detection for vnf service monitoring using multivariate online arima. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 278-283.
Shmueli, G. and Lichtendahl Jr, K. C. (2016). Practical
Time Series Forecasting with R: A Hands-On Guide.
Axelrod Schnall Publishers.
Shvachko, K., Kuang, H., Radia, S., Chansler, R., et al.
(2010). The hadoop distributed file system. In MSST,
volume 10, pages 1–10.
Singh, D. and Reddy, C. K. (2015). A survey on platforms
for big data analytics. Journal of big data, 2:8.
Vernik, G., Factor, M., Kolodner, E. K., Ofer, E., Michiardi,
P., and Pace, F. (2018). Stocator: Providing high per-
formance and fault tolerance for Apache Spark over
object storage. In CCGRID 2018, 18th IEEE/ACM
International Symposium on Cluster, Cloud and Grid
Computing, May 1-4, 2018, Washington DC, USA.
Wang, J., Yan, Y., and Guo, J. (2016). Research on the
prediction model of cpu utilization based on arima-
bp neural network. In MATEC Web of Conferences,
volume 65, page 03009. EDP Sciences.
Yan-ming Yang, Hui Yu, and Zhi Sun (2017). Aircraft fail-
ure rate forecasting method based on holt-winters sea-
sonal model. In 2017 IEEE 2nd International Con-
ference on Cloud Computing and Big Data Analysis
(ICCCBDA), pages 520–524.
Yu, Q., Jibin, L., and Jiang, L. (2016). An improved arima-
based traffic anomaly detection algorithm for wireless
sensor networks. International Journal of Distributed
Sensor Networks, 12:9653230.
Zaharia, M., Wendell, P., Konwinski, A., and Karau, H.
(2015). Learning spark. O’Reilly Media.
Zambon, D., Alippi, C., and Livi, L. (2018). Concept
drift and anomaly detection in graph streams. IEEE
transactions on neural networks and learning systems,
29:5592–5605.
Zheng, J., Ng, T. S. E., Sripanidkulchai, K., and Liu, Z.
(2013). Pacer: A progress management system for live
virtual machine migration in cloud computing. IEEE
Transactions on Network and Service Management,
10:369–382.
APPENDIX
The sanitised dataset of the filtered long-running VMs can be downloaded from the GitHub repository: https://github.com/abafo22/Filtered-long-running-Azure-VM-traces