Multivariate Short Term Load Forecasting Strategy: Application to
Anomalous Days of ISO New England Data
Innocent Sizo Duma¹ and Bhekisipho Twala²
¹Department of Electrical Power Engineering, Faculty of Engineering and the Built Environment, Durban University of Technology, P.O. Box 1334, Durban, 4000, South Africa
²Faculty of Engineering and the Built Environment, Durban University of Technology, P.O. Box 1334, Durban, 4000, South Africa
Keywords:
Short Term Load Forecasting, Multivariate Denoising using Wavelet and Principal Component Analysis,
Bayesian Optimization Algorithm, Long Short-Term Memory Neural Networks, Feedforward Neural
Networks, Random Forest.
Abstract:
In this paper, we consider short-term electricity load forecasting, which makes forecasts from 1 hour to 7 days or a month ahead and is usually used for the day-to-day operations of the utility industry, such as scheduling the generation and transmission of electric energy. This is a three-step process: (1) data preprocessing, which includes feature extraction, (2) modeling, and (3) model evaluation. Electrical load time series are non-stationary and notoriously noisy because of the variety of factors that affect electricity markets. As a data preprocessing step, to remove the white noise on the multivariate predictor variables (which include historical load, weather, and holidays), we perform multivariate denoising using wavelets and principal component analysis (MWPCA). In the modeling step we propose three multivariate Bayesian Optimization (BO) based models, a Random Forest (RF), a Feedforward Neural Network (FFNN) and a Long Short-Term Memory (LSTM) neural network, for day ahead hourly load forecasting of the anomalous days system load of the ISO New England grid. For model evaluation we used three evaluation metrics: the Mean Absolute Percent Error (MAPE), the Mean Absolute Error (MAE), and the Root Mean Square Error (RMSE). All the trained models achieved superior results on the chosen model evaluation metrics, most notably achieving a MAPE of less than 1% on the data under study, and the FFNN model outperformed both the RF and LSTM models.
1 INTRODUCTION
Load forecasting is a central and integral process in the planning and operation of electric utilities. It involves the accurate prediction of both the magnitudes and geographical locations of electric load over the different periods (usually hours) of the planning horizon. The basic quantity of interest in load forecasting is typically the hourly total system load. However, load forecasting is also concerned with the prediction of hourly, daily, weekly and monthly values of the system load, the peak system load and the system energy. Load forecasting can be classified in terms of the planning horizon's duration: up to 1 hour for very short-term load forecasting (VSTLF), 24 hours to one week for short-term load forecasting (STLF), more than one week to a few months for medium-term load forecasting (MTLF), and 1-10 years for long-term load forecasting (LTLF).
Accurate load forecasting holds great saving potential for electric utility corporations; these savings are realised when load forecasting is used to control operations and decisions such as dispatch, unit commitment, fuel allocation and off-line network analysis. The accuracy of load forecasts has a significant effect on power system operations, as the economy of operation and control of power systems may be quite sensitive to forecasting errors: both positive and negative forecasting errors result in increased operating costs. Hobbs (Hobbs et al., 1999) quantified the dollar value of improved short-term load forecasting for a typical utility and observed that a 1% reduction in the average forecast error for a 10,000 MW utility can save up to $1.6 million annually, and that was 22 years ago.
Previous work on short-term load forecasting for anomalous days of ISO New England grid data was carried out by Raza (Raza et al., 2020), who proposed an ensemble forecast framework with a systematic
combination of three predictors, namely the Elman Neural Network (ELM), Feedforward Neural Network (FFNN) and Radial Basis Function (RBF) Neural Network. They trained these predictor models using Global Particle Swarm Optimization (GPSO) to improve their training capability in the ensemble framework. The outputs of the individual predictors were combined using a trim aggregation technique, removing forecasting anomalies. Their predictor variables, which include weather, seasonality and historical load, were subjected to univariate wavelet denoising to remove fluctuations and spikes. Their proposed model showed a significant improvement in prediction accuracy compared to autoregressive integrated moving average (ARIMA) and back-propagation neural networks (BPNN).
As Raza (Raza et al., 2020) put it succinctly:
For load forecasting of a normal day, a day with a
predictable load profile, the training data have enough
correlated training samples to train the model. How-
ever, an anomalous day load forecasting has a much
smaller number of patterns for effective training of
the model. Therefore, the anomalous day forecast-
ing model is more complex and difficult to design
for higher forecast accuracy. Generally, the predic-
tion accuracy of models for anomalous days is lower
due to multiple factors such as uncertainty in de-
mand, meteorological variables, unpredictable socio-
logical events and intermittency of renewable energy
resources, etc. It is these challenges that encourage us to put forward our methodology.
Nti (Nti et al., 2020) undertook a systematic and critical review of seventy-seven (77) relevant previous works on electricity load forecasting reported in academic journals over the nine years 2010-2020. Specifically, attention was given to the following themes: (i) the forecasting algorithms used and their fitting ability in this field, (ii) the theories and factors affecting electricity consumption and the origin of the research work, (iii) the relevant accuracy and error metrics applied in electricity load forecasting, and (iv) the forecasting period. Their results revealed that 90% of the top nine models used in electricity load forecasting were artificial intelligence based, with Artificial Neural Networks (ANN) representing 28%. They also observed that ANN models were primarily used for short-term electricity load forecasting, where electrical energy consumption is complicated.
Son (Son and Kim, 2020) proposed an LSTM model that can accurately forecast monthly residential electricity demand and compared its performance to four benchmark models: Support Vector Regression, Artificial Neural Network (ANN), Autoregressive Integrated Moving Average (ARIMA) and Multiple Linear Regression. The LSTM model showed superior performance, achieving the lowest MAPE value of less than or equal to 1%.
A comparative analysis of five commonly used short-term load forecasting techniques, i.e. Auto-Regressive Integrated Moving Average (ARIMA), Multiple Linear Regression (MLR), Recursive Partitioning Regression Trees with Bootstrap Aggregating (RPART+BAGGING), Conditional Inference Trees with Bootstrap Aggregating (CTREE+BAGGING), and Random Forest (RF), was performed by Kapoor (Kapoor and Sharma, 2018). On comparison of the MAPE of all techniques, they concluded that the error associated with RF was the least and that this approach produced more accurate results. In a comparative study by Kandananond (Kandananond, 2011), three methodologies, ARIMA, ANN and Multiple Linear Regression (MLR), were deployed for load forecasting in Thailand. The results showed that the ANN model reduced the MAPE to 0.996%, while those of ARIMA and MLR were 2.80981% and 3.2604527%, respectively.
Rana (Rana and Koprinska, 2016) presented an approach for very short-term load forecasting called Advanced Wavelet Neural Networks (AWNN), which uses a shift-invariant advanced wavelet packet transform for load decomposition, Mutual Information for feature selection, and a multi-layer perceptron Neural Network trained with the Levenberg-Marquardt algorithm for prediction. They evaluated the performance of their AWNN model using two different datasets covering two years each: Australian 5-min data and Spanish 60-min data. They chose the different geographic locations and time resolutions to better assess the robustness of their AWNN model. Using the two evaluation metrics MAE and MAPE, their AWNN models outperformed a number of methods used for comparison: three ARIMA methods, three Holt-Winters exponential smoothing methods, an industry model, several naïve baselines, and also single non-wavelet Neural Networks, Linear Regression, and Model Tree Rule.
Wang (Wang et al., 2014) proposed a novel approach for short-term load forecasting that applies univariate wavelet denoising in a combined model, a hybrid of the seasonal autoregressive integrated moving average (SARIMA) model and a back propagation neural network (BPNN). Electricity load data from New South Wales, Australia were used to evaluate the performance of their proposed approach. They compared their combined model with the SARIMA and BPNN models, and the results showed that their proposed model can effectively improve the forecasting accuracy.
Munem (Munem et al., 2020) proposed a multivariate Bayesian Optimization based Long Short-Term Memory (LSTM) neural network to forecast the residential electric power load for the upcoming hour. Their model surpassed a convolutional neural network (CNN), an artificial neural network (ANN) and a support vector machine (SVM) using MAE, RMSE and MSE (Mean Squared Error) as performance evaluation metrics. They also observed that, though their model performed conspicuously well, it could be improved by adding a feature selection technique to the model.
Overview. The main objective of this study is to achieve a MAPE of less than 1% on the day ahead hourly load forecast of the anomalous days of ISO New England grid data. To date, no monograph in short-term load forecasting has combined Multivariate Denoising Using Wavelet and Principal Component Analysis with Bayesian Optimization for model hyperparameter tuning, and this work seeks to fill that gap.
Section 2 briefly discusses Multivariate Denoising Using Wavelet and Principal Component Analysis, Bayesian Optimization for hyperparameter tuning, and the three models used in this study: Random Forest (RF), Feedforward Neural Networks (FFNN), and the Long Short-Term Memory (LSTM) Neural Network.
Section 3 outlines the methodology followed in
this study.
Section 4 outlines the experiments and findings of this study with regard to the objective specified above.
Section 5 concludes the paper and outlines future
directions.
2 BACKGROUND
In this section, we briefly describe Multivariate Denoising Using Wavelet and Principal Component Analysis, Bayesian Optimization for model hyperparameter tuning, Random Forest, the Feedforward Neural Network (FFNN) and the Long Short-Term Memory (LSTM) Neural Network.
2.1 Multivariate Wavelet Denoising
using Principal Component Analysis
The Multivariate Denoising Using Wavelet and Principal Component Analysis scheme of Aminghafari (Aminghafari et al., 2006) is performed by the MATLAB R2020b function WMULDEN and is outlined in Algorithm 1.
Algorithm 1: Multivariate Denoising Using Wavelet and Principal Component Analysis (WPCA).
Parameters: $J \in \mathbb{N}$.
1: Apply the level $J$ wavelet decomposition to each column of $X$, where $X$ is an $n \times p$ predictor matrix. This step produces $J + 1$ matrices $D_1, \ldots, D_J$, containing the detail coefficients at levels 1 to $J$ of the $p$ signals, and the approximation coefficients $A_J$ of the $p$ signals. Matrices $D_j$, $1 \leq j \leq J$, and $A_J$ are, respectively, of size $\frac{n}{2^j} \times p$ and $\frac{n}{2^J} \times p$.
2: Set $\hat{\Sigma}_\varepsilon$, the estimator of the noise covariance matrix $\Sigma_\varepsilon$, equal to the Minimum Covariance Determinant (MCD) estimator, $\hat{\Sigma}_\varepsilon = \mathrm{MCD}(D_1)$, and then compute the Singular Value Decomposition (SVD) of $\hat{\Sigma}_\varepsilon$, providing an orthogonal matrix $V$ such that $\hat{\Sigma}_\varepsilon = V \Lambda V^T$, where $\Lambda = \mathrm{diag}(\lambda_i,\ 1 \leq i \leq p)$. Apply to each detail matrix after the change of basis (namely $D_j V$, $1 \leq j \leq J$) the $p$ univariate thresholding strategies, using the threshold $t_i = \sqrt{2 \lambda_i \log(n)}$ for the $i$th column of $D_j V$.
3: Apply PCA to the matrix $A_J$ and then choose a convenient number $p_{J+1}$ of principal components.
4: Rebuild the denoised matrix $\breve{X}$ from the simplified detail and approximation matrices, by changing basis using $V^T$ and inverting the wavelet transform.
The wavelet decomposes the time series, producing approximation coefficients, which describe the overall shape of the signal by capturing low-frequency information, and detail coefficients, which describe finer local changes by capturing high-frequency information. Among the detail coefficients, low-intensity ones correspond to the noisy part of the signal. Principal component analysis is then performed on the approximation coefficients to keep only the most important features of the signal.
2.2 Bayesian Optimization Algorithm
Machine learning modeling involves training a model $M$ to minimize some predefined loss function $L(X^{(val)}; M)$ on a given validation data set $X^{(val)}$. The most common loss functions are the root mean squared error (RMSE) and the mean squared error (MSE). The model $M$ is constructed by a learning algorithm $A$ using a training data set $X^{(tr)}$. The learning algorithm $A$ may itself be parameterized by a set of hyperparameters $\lambda$, for example $M = A(X^{(tr)}; \lambda)$. The goal of hyperparameter search is to find the set of hyperparameters $\lambda^*$ that yields an optimal model $M^*$ (to
be used to make forecasts on an out-of-sample data set $X^{(test)}$) which minimizes $L(X^{(val)}; M)$. Formally this is described as follows:

$$\lambda^* = \arg\min_{\lambda} L(X^{(val)}; A(X^{(tr)}; \lambda)) \quad (1)$$
$$\phantom{\lambda^*} = \arg\min_{\lambda} F(\lambda; A, X^{(tr)}, X^{(val)}, L) \quad (2)$$

The objective function $F$ takes a tuple of hyperparameters $\lambda$ and returns the loss. The data sets $X^{(tr)}$ and $X^{(val)}$ are given, and the learning algorithm $A$ and loss function $L$ are chosen (Claesen and Moor, 2015).
Bayesian optimization finds $\lambda^*$ by estimating $F$ as a form of Gaussian Process. Bayesian optimization incorporates a prior belief about $F$ and updates the prior with samples drawn from $F$ to get a posterior that better approximates $F$. Bayesian optimization also uses an acquisition function that directs sampling to areas where an improvement over the current best observation is likely. Let us define $\lambda_i$ as the $i$th sample and $F(\lambda_i)$ as the observation of the objective function at $\lambda_i$. As we accumulate observations $D_{1:t} = \{(\lambda_i, F(\lambda_i)),\ i = 1, 2, \ldots, t\}$, the prior distribution is combined with the likelihood function $P(D_{1:t} \,|\, F)$ to obtain the posterior distribution:

$$P(F \,|\, D_{1:t}) \propto P(D_{1:t} \,|\, F)\, P(F) \quad (3)$$

The prior distribution $P(F)$ represents our belief about the space of possible objective functions, and the posterior distribution captures our updated beliefs about the unknown objective function.
Algorithm 2: Bayesian Optimization Algorithm.
Select: $T$, $u$
1: for $t = 1, 2, \ldots, T$ do
2: Find $\lambda_t$ by optimizing the acquisition function $u$ over function $F$: $\lambda_t = \arg\max_{\lambda} u(\lambda \,|\, D_{1:t-1})$
3: Sample the objective function: $y_t = F(\lambda_t)$
4: Augment the data: $D_{1:t} = \{D_{1:t-1}, (\lambda_t, y_t)\}$ and update the posterior of function $F$.
5: end for
There are two main parts to Algorithm 2: maximizing the acquisition function (step 2) and updating the posterior distribution (steps 3 and 4). The MATLAB R2020b function BAYESOPT is used to maximize the acquisition function, and for updating the posterior distribution BAYESOPT uses another MATLAB R2020b function, FITRGP, which fits a Gaussian process model to the data.
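As an illustration, the following minimal MATLAB sketch shows Algorithm 2 realized with BAYESOPT for a single generic hyperparameter; the objective function F and the variable name lr are illustrative placeholders of ours, not the models of Section 3.

% A minimal BAYESOPT sketch of Algorithm 2. The objective F and the
% variable 'lr' are illustrative placeholders, not this paper's models.
lr = optimizableVariable('lr', [1e-3, 1], 'Transform', 'log');
F = @(t) (log10(t.lr) + 2)^2;              % stand-in loss; minimum at lr = 0.01
results = bayesopt(F, lr, ...
    'AcquisitionFunctionName', 'expected-improvement-plus', ...
    'MaxObjectiveEvaluations', 30);        % T in Algorithm 2
lambdaStar = results.XAtMinObjective;      % lambda*, returned as a one-row table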
2.3 Random Forest (RF)
Random forests are a generalization of bagging. Instead of considering all of the predictors at each split of a tree, only a random sample of NumPTS (Number of Predictors to Sample) predictors is considered each time. The main advantage of random forests over bagging can be seen in the case of correlated predictors: predictions from the bagged trees will be highly correlated, so that bagging will not reduce the variance by much, whereas random forests overcome this problem by forcing each split to consider only a subset of the predictors. In the case of random forests, the efficiency of the method depends on a suitable selection of the number of trees n and the number of predictors NumPTS tested at each split. The out-of-bag (OOB) error can be used for searching for a suitable n as well as a suitable NumPTS. As with bagging, random forests will not overfit if we increase n, so the goal is to choose a value that is sufficiently large.
2.4 Feedforward Neural Network
(FFNN)
The FFNN architecture used in this study consists of three layers, known as the input layer, the hidden layer and the output layer. The activation functions used in the hidden and output layers are the hyperbolic tangent sigmoid transfer function (tansig) and the linear transfer function (purelin), respectively. During the training process, the input data (in the input layer) are trained and weighted by the neurons in the hidden layer. Then, the estimated output from the training process is compared with the desired target (in the output layer). The comparison is evaluated based on the mean squared error (MSE). The training process is repeated, adjusting the weights and biases inside the neurons, until the estimated output matches the desired target with minimum MSE. The main advantage of the FFNN is its ability to implicitly learn and model the complex relationships between inputs and outputs.
2.5 Long Short-Term Memory (LSTM)
Neural Network
LSTM is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The LSTM contains special units called memory blocks in the recurrent hidden layer. The memory blocks contain memory cells with self-connections storing the temporal state of the network, in addition to special multiplicative units called gates that control the flow of information (Sak et al., 2014). The LSTM memory cells are arranged sequentially to create a stable memory sequence, which eliminates the vanishing gradient problem. LSTMs have several hyperparameters that need to be fine-tuned, such as the number of hidden units in the LSTM layer, so as to learn the data well. In addition, the optimizer that updates the weights and biases of the LSTM neural network has hyperparameters, such as the initial learning rate, that need tuning as well. According to Greff (Greff et al., 2017), the learning rate and network size are the most crucial tunable LSTM hyperparameters. For a detailed discussion of LSTM neural networks for load forecasting, refer to Munem (Munem et al., 2020) and He (He et al., 2019).
3 METHODOLOGY
In this section we describe the model evaluation metrics used in this study, the data, and the RF, FFNN, and LSTM training.
3.1 Model Evaluation Metrics used in
This Study
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2} \quad (4)$$

$$\mathrm{MAPE} = \frac{100}{N} \sum_{j=1}^{N} \left| \frac{y_j - \hat{y}_j}{y_j} \right| \quad (5)$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| y_j - \hat{y}_j \right| \quad (6)$$

RMSE: Root Mean Square Error
MAPE: Mean Absolute Percent Error
MAE: Mean Absolute Error
$N$: total number of values
$y_j$: actual observed value to compare the forecast with
$\hat{y}_j$: forecast value, that is, the model output
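For concreteness, Equations (4)-(6) translate directly into a small MATLAB function; the function name evalMetrics is ours, and y and yhat are assumed to be N-by-1 vectors of actual and forecast loads.

function [mape, mae, rmse] = evalMetrics(y, yhat)
% Evaluation metrics of Equations (4)-(6); y and yhat are N-by-1 vectors.
e    = y - yhat;                 % forecast errors
rmse = sqrt(mean(e.^2));         % Root Mean Square Error, Eq. (4)
mape = 100 * mean(abs(e ./ y));  % Mean Absolute Percent Error, Eq. (5)
mae  = mean(abs(e));             % Mean Absolute Error, Eq. (6)
end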
3.2 Data
Real recorded hourly data of ISO New England grid
from 2004 to 2009 is used in this study. The raw data
files can be obtained directly from ISO New England
(www.iso-ne.com). The power generation and distribution system of the entire New England area is managed and operated by ISO New England. New England is a region comprising six states of the USA: Connecticut, Maine, Rhode Island, Vermont, Massachusetts and New Hampshire. A better description and analysis of the same data set used in this study is given in Raza (Raza et al., 2020). The predictor variables are:
Dry bulb temperature
Dew point temperature
Hour of day
Day of the week
Holiday/weekend indicator (0 or 1)
Previous 24-hr average load
24-hr lagged load
168-hr (previous week) lagged load
Denoising of the Predictor Variables. The denoising procedure is outlined in Algorithm 1:

Wavelet Decomposition Parameters: The first step in the wavelet denoising scheme is the selection of the proper wavelet type; here we selected Symlets 4. The next step is to select the level of decomposition, which is the number of times that the original signal is decomposed by the wavelet transform. Aminghafari (Aminghafari et al., 2006) proposed a maximum decomposition level of 8, and in this study we selected 5 levels. Both the selected wavelet and level are the default values in the MATLAB R2020b function WMULDEN. The trade-off is that with more decomposition we may remove more noise, but it is more likely that we also remove valuable time series fluctuations.
Denoising Parameters: The next step is to select a thresholding method. The intuition is that small wavelet coefficients are dominated by noise, while large wavelet coefficients contain more useful signal than noise. Hence, to remove the components with small coefficients and reduce the influence of components with large coefficients, we apply soft fixed form thresholding (the default in WMULDEN). The fixed form threshold is given by $\sqrt{2 \log(\mathrm{length}(X))}$, where $X$ is $n \times 1$.
Principal Components Parameters: Kaiser's rule (Aminghafari et al., 2006) defines the way to select principal components, both for the approximation at the chosen level in step 1 of Algorithm 1 in the wavelet domain and for the final PCA after wavelet reconstruction, by keeping the components associated with eigenvalues greater than the mean of all eigenvalues.
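With these choices, the whole denoising step reduces to a single WMULDEN call. The sketch below assumes the eight predictor variables are stored column-wise in a matrix X; the variable names are ours.

% Multivariate denoising of the predictor matrix X (n-by-8, one column
% per predictor variable) with the parameters selected in this study.
level   = 5;           % decomposition level
wname   = 'sym4';      % Symlets 4 wavelet
npc_app = 'kais';      % Kaiser's rule for the PCA on approximations
npc_fin = 'kais';      % Kaiser's rule for the final PCA
tptr    = 'sqtwolog';  % fixed form threshold sqrt(2*log(length(X)))
sorh    = 's';         % soft thresholding
X_den = wmulden(X, level, wname, npc_app, npc_fin, tptr, sorh);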
3.3 RF Training
The data set was divided as follows:
Training - From 1 January 2004 to 31 December
2008
Testing - From 1 January 2009 to 31 December
2009
It is noted that the performance of a random forest is closely related to the strength of each tree and the inter-tree correlation. As NumPTS goes up, the strength of each tree can be improved, but the inter-tree correlation increases and the random forest error rate goes up. As NumPTS goes down, both the inter-tree correlation and the strength of individual trees go down. So, the optimal value of NumPTS must be chosen carefully to make the trees as uncorrelated as possible. In addition, since the final regression result of the algorithm is produced by all decision trees, the number of decision trees in the random forest will also affect the accuracy of the model. Usually, the performance of a single tree in the forest is low, so if the number of decision trees is too small, the whole random forest model will perform badly. Hence, the hyperparameters NumPTS and n must be chosen carefully. These hyperparameters of the random forest model are tuned by Bayesian optimization, as sketched below. We choose the number of decision trees in the random forest, n, and the size of the predictor variable subset, NumPTS, as hyperparameters. The ranges of values for the hyperparameters are n in the range [1, 300] and NumPTS in the range [1, 7] (out of the 8 predictors), respectively. The model OOB error is chosen as the objective function.
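A minimal sketch of this tuning loop, assuming the 2004-2008 training predictors and loads are held in Xtr and Ytr (the variable and function names are ours), is:

% Bayesian optimization of the RF hyperparameters n and NumPTS,
% minimizing the out-of-bag (OOB) error of a TreeBagger ensemble.
vars = [optimizableVariable('n',      [1, 300], 'Type', 'integer'), ...
        optimizableVariable('numPTS', [1, 7],   'Type', 'integer')];
results = bayesopt(@(t) rfOOBError(t, Xtr, Ytr), vars, ...
                   'MaxObjectiveEvaluations', 30);

function err = rfOOBError(t, X, Y)
% Objective: OOB mean squared error of the full regression ensemble.
mdl = TreeBagger(t.n, X, Y, 'Method', 'regression', ...
                 'NumPredictorsToSample', t.numPTS, ...
                 'OOBPrediction', 'on');
oob = oobError(mdl);   % OOB error for ensembles of 1..n trees
err = oob(end);        % error of the full ensemble
end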
3.4 FFNN Training
The data set was divided as follows:
Training - From 1 January 2004 to 31 December
2006
Validation - From 1 January 2007 to 31 December
2008
Testing - From 1 January 2009 to 31 December
2009
The effectiveness of an FFNN is influenced by the algorithm and FFNN parameters such as the momentum constant, the learning rate and the number of neurons in the hidden layers. In this study we utilize the TRAINBR (Bayesian regularization backpropagation) MATLAB R2020b function to train the network. Bayesian regularization takes place within the Levenberg-Marquardt (LM) algorithm. The LM algorithm offers a trade-off between the benefits of the Gauss-Newton method and the steepest descent method. The LM algorithm updates the network weights using the Hessian matrix approximation:

$$w_{k+1} = w_k - \left( J^T J + \mu I \right)^{-1} J^T e \quad (7)$$

where $J$ is the Jacobian matrix (first-order derivatives of the errors), $I$ is the identity matrix, $e$ is the vector of the network errors, and $\mu$ is the Marquardt adjustment parameter. The Marquardt adjustment parameter controls the algorithm: if $\mu = 0$ the algorithm behaves as the Gauss-Newton method, while for high values of $\mu$ the algorithm uses the steepest descent method. We tune the hyperparameters, the number of neurons in the hidden layer as well as the Marquardt adjustment parameter, using Bayesian Optimization, with the Mean Squared Error (MSE) of the validation set as the objective function. The ranges of values for the hyperparameters are number of neurons in the range [1, 50] and Marquardt adjustment parameter in the range $[1 \times 10^{-3}, 1 \times 10^{10}]$, respectively. A sketch of this setup follows.
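The sketch below illustrates the setup just described, assuming training and validation sets Xtr/Ttr and Xval/Tval in the column-observation layout expected by FITNET; the variable and function names are ours.

% Bayesian optimization of the FFNN hyperparameters: hidden-layer size
% and the Marquardt adjustment parameter mu, minimizing validation MSE.
vars = [optimizableVariable('neurons', [1, 50], 'Type', 'integer'), ...
        optimizableVariable('mu', [1e-3, 1e10], 'Transform', 'log')];
results = bayesopt(@(t) ffnnValMSE(t, Xtr, Ttr, Xval, Tval), vars);

function mseVal = ffnnValMSE(t, Xtr, Ttr, Xval, Tval)
% Train with TRAINBR (Bayesian regularization within LM) and return
% the mean squared error on the validation set.
net = fitnet(t.neurons, 'trainbr');
net.trainParam.mu = t.mu;              % Marquardt adjustment parameter
net.trainParam.showWindow = false;     % suppress the training GUI
net = train(net, Xtr, Ttr);
mseVal = perform(net, Tval, net(Xval));
end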
3.5 LSTM Training
The data set was divided as follows:
Training - From 1 January 2004 to 31 December
2006
Validation - From 1 January 2007 to 31 December
2008
Testing - From 1 January 2009 to 31 December
2009
Deep learning models are typically trained by a stochastic gradient descent optimizer. There are many variations of stochastic gradient descent; in this study, we utilize the Adam optimizer. The Adam parameter (weights and biases) update is calculated with the following equation:
$$\theta_{n+1} = \theta_n - \frac{\alpha\, m_n}{\sqrt{\nu_n} + \varepsilon} \quad (8)$$

where

$$m_n = \beta_1 m_{n-1} + (1 - \beta_1) \nabla E(\theta_n) \quad (9)$$

and

$$\nu_n = \beta_2 \nu_{n-1} + (1 - \beta_2) \left[ \nabla E(\theta_n) \right]^2 \quad (10)$$

where
$n$ is the step of the iterative training process,
$\alpha$ is the learning rate,
$\theta$ is the vector of trained parameters,
$E(\theta)$ is the loss function,
$\beta_1$ is the Gradient Decay Factor, and
$\beta_2$ is the Squared Gradient Decay Factor.
The learning rate tells the optimizer how far to
move the weights in the direction opposite of the gra-
dient for a mini-batch. If the learning rate is low, then
training is more reliable, but optimization will take a
lot of time because steps towards the minimum of the
loss function are tiny. If the learning rate is high, then
training may not converge or even diverge. Weight
changes can be so big that the optimizer overshoots
the minimum and makes the loss worse.
There are multiple ways to select a good starting point for the learning rate, and in this study we let Bayesian Optimization select the best initial learning rate. We utilize one LSTM layer and, in addition to the initial learning rate, we also optimize the number of hidden units in the LSTM layer using Bayesian Optimization. The number of hidden units corresponds to the amount of information remembered between time steps (the hidden state). The hidden state can contain information from all previous time steps, regardless of the sequence length. If the number of hidden units is too large, the layer might overfit to the training data; to overcome this problem of overfitting we include a dropout layer after the LSTM layer. The ranges of values for the two hyperparameters are learning rate in the range $[1 \times 10^{-3}, 1]$ and number of hidden units in the range [1, 200], respectively. A sketch of the resulting network and training options follows.
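The sketch below assembles such a network with the hyperparameter values later selected by Bayesian Optimization (Table 3), assuming the sequences Xtr/Ytr and Xval/Yval are formatted as TRAINNETWORK expects; the dropout probability and epoch count are illustrative assumptions, not values reported in this study.

% One-LSTM-layer network with a dropout layer, trained with Adam.
numFeatures    = 8;          % predictor variables per time step
numHiddenUnits = 186;        % selected by Bayesian Optimization (Table 3)
initLearnRate  = 0.0046407;  % selected by Bayesian Optimization (Table 3)
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits, 'OutputMode', 'sequence')
    dropoutLayer(0.2)        % illustrative dropout probability
    fullyConnectedLayer(1)   % hourly load output
    regressionLayer];
options = trainingOptions('adam', ...
    'InitialLearnRate', initLearnRate, ...
    'MaxEpochs', 100, ...    % illustrative value
    'ValidationData', {Xval, Yval}, ...
    'Verbose', false);
net = trainNetwork(Xtr, Ytr, layers, options);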
4 EXPERIMENTAL RESULTS AND ANALYSIS
4.1 Model Hyperparameter Values Selected by the Bayesian Optimization Algorithm
Table 1: RF Model.
Hyperparameter | Selected Value
n (tree size) | 200
NumPTS (no. of predictors to sample) | 1

Table 2: FFNN Model.
Hyperparameter | Selected Value
Number of neurons in the hidden layer | 25
Marquardt adjustment parameter | 3753.8

Table 3: LSTM Model.
Hyperparameter | Selected Value
Number of hidden units in the LSTM layer | 186
Initial learning rate | 0.0046407
4.2 Anomalous Days Load Prediction
The selected anomalous days are Christmas Day (Friday, December 25, 2009), Memorial Day (Monday, May 25, 2009), Easter Day (Sunday, April 12, 2009), and Labor Day (Monday, September 7, 2009).
Figure 1: Day ahead hourly load forecast of Easter Day.
Table 4: MAPE, MAE and RMSE for Easter Day 2009 (out of sample).
Model | MAPE (%) | MAE (MWh) | RMSE (MWh)
LSTM | 0.65 | 74.66 | 97.07
FFNN | 6.24×10⁻⁷ | 7.37×10⁻⁵ | 8.68×10⁻⁵
RF | 0.045 | 5.20 | 7.03
The x-axis of each graph represents the hours of the anomalous day and the y-axis represents the load demand in megawatts. From the graphs, and as outlined in the tables of model evaluation metrics, the Feedforward Neural Network (FFNN) is the best fit to the data, followed by the Random Forest (RF), and last is the Long Short-Term Memory (LSTM) Neural Network.
Figure 2: Day ahead hourly load forecast of Memorial Day
2009.
Table 5: MAPE, MAE and RMSE for Memorial Day 2009 (out of sample).
Model | MAPE (%) | MAE (MWh) | RMSE (MWh)
LSTM | 0.89 | 101.27 | 117.19
FFNN | 6.00×10⁻⁷ | 6.52×10⁻⁵ | 9.23×10⁻⁵
RF | 0.042 | 4.62 | 7.48
Figure 3: Day ahead hourly load forecast of Labour Day
2009.
Table 6: MAPE, MAE and RMSE for Labour Day 2009 (out of sample).
Model | MAPE (%) | MAE (MWh) | RMSE (MWh)
LSTM | 0.48 | 56.42 | 72.78
FFNN | 4.42×10⁻⁷ | 5.51×10⁻⁵ | 6.10×10⁻⁵
RF | 0.032 | 3.86 | 4.94
Figure 4: Day ahead hourly load forecast of Christmas Day
2009.
Table 7: MAPE, MAE and RMSE for Christmas Day 2009 (out of sample).
Model | MAPE (%) | MAE (MWh) | RMSE (MWh)
LSTM | 0.52 | 85.27 | 95.61
FFNN | 3.73×10⁻⁷ | 6.17×10⁻⁵ | 7.04×10⁻⁵
RF | 0.014 | 2.33 | 2.68
The MAPE, MAE and RMSE values from Table 4 to Table 7 show that the FFNN achieved the lowest errors and hence the highest accuracy in predicting the load profile of the selected anomalous days; the predicted load demand of the FFNN model is very close to the actual load. The performance of the FFNN model implies that, with proper data preprocessing and excellent hyperparameter optimization, the FFNN is capable of approximating any measurable function to any desired degree of accuracy. The problem of overfitting was avoided during training by the use of the MATLAB R2020b function TRAINBR, which uses Bayesian regularization and enhances generalization as well. Although the Random Forest is the second best performing model, it is the most preferred one, the reasons being its flexibility, which suits both linear and non-linear relationships; its ability to take into account interactions among predictors at different levels; the fact that no assumptions on the data are needed; the very accurate predictions it provides, as evident in this study; and its interpretability. As the least performing model, the LSTM can however be improved by adding several LSTM layers and by optimizing hyperparameters such as the Gradient Decay Factor and the Squared Gradient Decay Factor of the Adam optimizer.
5 CONCLUSIONS
The paper presented three multivariate Bayesian Optimization (BO) based models, a Random Forest (RF), a Feedforward Neural Network (FFNN) and a Long Short-Term Memory (LSTM) neural network, for day ahead hourly load forecasting of the anomalous days system load of the ISO New England grid. The predictor variables were subjected to Multivariate Denoising using Wavelet and Principal Component Analysis. The methodology followed shows that the Feedforward Neural Network achieved superior results. The main objective of this study was also attained, which was to achieve a MAPE of less than 1% on the day ahead hourly load forecast of the anomalous days of ISO New England grid data. The major challenge going forward is to apply our methodology to different grid data sets.
ACKNOWLEDGEMENTS
I would like to acknowledge the Faculty of Engineer-
ing and the Built Environment at the Durban Univer-
sity of Technology for their financial support.
REFERENCES
Aminghafari, M., Cheze, N., and Poggi, J.-M. (2006). Mul-
tivariate de-noising using wavelets and principal com-
ponent analysis. In Computational Statistics and Data
Analysis, 50, pp. 2381–2398. ELSEVIER.
Claesen, M. and Moor, B. D. (2015). Hyperparame-
ter search in machine learning. In arXiv preprint
arXiv:1502.02127. CORNELL UNIVERSITY.
Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J. (2017). LSTM: A search space odyssey. In IEEE Transactions on Neural Networks and Learning Systems, vol. 28, issue 10, pp. 2222–2232. IEEEXplore.
He, F., Zhou, J., Feng, Z., Liu, G., and Yang, Y. (2019).
A hybrid short-term load forecasting model based on
variational mode decomposition and long short-term
memory networks considering relevant factors with
bayesian optimization algorithm. In Applied Energy,
237, pp. 103–116. SCIENCEDIRECT.
Hobbs, B. F., Jitprapaikulsarn, S., Konda, S., Chankong, V.,
Loparo, K. A., and Maratukulam, D. J. (1999). Analy-
sis of the value for unit commitment of improved load
forecasts. In IEEE Transactions on Power Systems,
vol. 14, no. 4, pp. 1342–1348. IEEEXplore.
Kandananond, K. (2011). Forecasting electricity demand in Thailand with an artificial neural network approach. In Energies, 4(8), 1246–1257. MDPI.
Kapoor, A. and Sharma, A. (2018). A comparison of short-
term load forecasting techniques. In 2018 IEEE In-
novative Smart Grid Technologies - Asia (ISGT Asia),
Singapore, pp. 1189–1194. UIASSIST.ORG.
Munem, M., Bashar, T. M. R., Roni, M. H., Shahriar,
M., Shawkat, T. B., and Rahaman, H. (2020). Elec-
tric power load forecasting based on multivariate long
short-term memory neural network using bayesian op-
timization. In 2020 IEEE Electric Power and Energy
Conference (EPEC), Edmonton, AB, Canada, pp. 1–6.
IEEEXplore.
Nti, I. K., Teimeh, M., Nyarko, O., and Adekoya, A. F.
(2020). Electricity load forecasting: A systematic re-
view. In Journal of Electrical Systems and Informa-
tion Technology 7, 13. Springer.
Rana, M. and Koprinska, I. (2016). Forecasting electricity
load with advanced wavelet neural networks. In Neu-
rocomputing, 182, pp. 118–132. ELSEVIER.
Raza, M. Q., Mithulananthan, N., Li, J., and Lee, K. Y.
(2020). Multivariate ensemble forecast framework for
demand prediction of anomalous days. In IEEE Trans-
actions on Sustainable Energy, vol. 11, no. 1, pp. 27–
36. IEEEXplore.
Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH).
Son, H. and Kim, C. (2020). A deep learning approach
to forecasting monthly demand for residential-sector
electricity. In Sustainability, 12, 3103. MDPI.
Wang, J., Wang, J., Li, Y., Zhu, S., and Zhao, J. (2014).
Techniques of applying wavelet de-noising into a
combined model for short-term load forecasting. In
Electrical Power and Energy Systems, 62, pp. 816–
824. ELSEVIER.