An Automated Clustering Process for Helping Practitioners to Identify
Similar EV Charging Patterns across Multiple Temporal Granularities
René Richard¹, Hung Cao¹ and Monica Wachowicz¹,²
¹ People in Motion Lab, University of New Brunswick, Canada
² RMIT, Australia
Keywords:
Agglomerative Hierarchical Clustering, EV Adoption, Charging Infrastructure Usage Patterns, Clustering
Process, Cluster Validity Indices.
Abstract:
Electric vehicles (EVs) are part of the solution towards cleaner transport and cities. Clustering EV charging
events has been useful for ensuring service consistency and increasing EV adoption. However, clustering
presents challenges for practitioners when first selecting the appropriate hyperparameter combination for an
algorithm and later when assessing the quality of clustering results. Ground truth information is usually not
available for practitioners to validate the discovered patterns. As a result, it is harder to judge the effectiveness
of different modelling decisions since there is no objective way to compare them. In this work, we propose
a clustering process that allows for the creation of relative rankings of similar clustering results. The overall
goal is to support practitioners by allowing them to compare a cluster of interest against other similar clusters
over multiple temporal granularities. The efficacy of this analytical process is demonstrated with a case study
using real-world Electric Vehicle (EV) charging event data from charging station operators in Atlantic Canada.
1 INTRODUCTION
Globally, national and local government commit-
ments to electrify the transport sector will have a posi-
tive impact on smart cities. The vision for smart cities
fosters advanced and modern urbanization, which re-
sults in a core infrastructure that enables a good qual-
ity of life for citizens and the sustainable management
of natural resources. Supporting the usage of EVs
contributes to improved air quality, sustainable mo-
bility and therefore contributes to this vision.
The high capital costs of setting up public charg-
ing infrastructure and the usage of public funds to fos-
ter a shift to EVs necessitate informed decision making at all stages of the adoption life-cycle. Given early EV adoption challenges, some charging stations can be under-utilized, while others will serve a disproportionate number of users. Clustering stations together based on
utilization patterns is a useful planning tool for opera-
tors. Additionally, as vehicle electrification grows, so
does the demand for electricity and the possible strain
on power grids. Utilities and other power generators
need to prepare for increased demand. Accurate load
forecasting is one tool which can help operators en-
sure service consistency.
Clustering is an unsupervised learning method
which assists practitioners in discovering hidden pat-
terns from a data set. It has been utilized by prac-
titioners in the energy domain to group similar con-
sumers, predict future demand, and increase EV
adoption. Statistical models built with data from charging stations having similar charging patterns will reportedly have superior accuracy (Straka and
Buzna, 2019). Therefore, energy load forecasting
methods might perform better when applied to ho-
mogeneous clusters of stations as opposed to all sta-
tions. The patterns in energy usage behavior are core
to improving services provided by utility companies,
which are responsible for managing peaks and imbal-
ances in charging infrastructure usage patterns (Igle-
sias and Kastner, 2013).
Although clustering is widely used in many
knowledge domains, it remains arduous for prac-
titioners to select the proper clustering algorithm
with hyperparameter combination and later assess the
quality of clustering results. The subjectivity found in
the required expert knowledge that is needed for de-
termining the level of “success” achieved during clus-
tering, is likely to be one of the main reasons why
existing AutoML frameworks tend to focus on super-
vised learning tasks that require labeled data as in-
put (Oliveira, 2019). One of the challenges is that the
identification of the most similar clusters can be sub-
jective and it usually requires multiple approaches to
automate this process (Poulakis, 2020). The difficulty
in clustering is finding results that align with a
practitioner’s needs because in many complex data
sets, there are several plausible clusters, and practi-
tioners may have different priorities and preferences.
An unsupervised clustering algorithm has no way to
intrinsically infer which clusters embody desired pri-
orities and preferences (Bae et al., 2020).
Additionally, in data with a temporal component, such as EV charging events, assessing the structure consistency of discovered clusters over different temporal granularities is often a lengthy man-
ual undertaking. Metrics such as inter-cluster separa-
tion, inter-cluster homogeneity, density, and uniform
cluster sizes can be computed to determine structure
consistency. However, the question of how to select
a particular clustering result that is more meaningful
than another based on user priorities and preferences,
still depends on the practitioner’s capacity of distin-
guishing similar clusters. Towards this challenge, this research work explores whether, given a clustering result of interest, a process of objectively highlighting and recommending similar clustering results can be automated in order to support practitioners in evaluating how clustering patterns persist over multiple temporal granularities and in finding meaningful clusters according to their preferences and priorities. The overall motivation of this
work is to assist the practitioner in navigating multiple
clustering results for different temporal partitions of
the same data. Providing the practitioner with an ini-
tial ranked list of clustering results and a mechanism
to identify clustering similarities can assist practition-
ers in downstream analytical tasks such as improving
regression or classification model performance.
Therefore, we propose a clustering process which
uses internal cluster validity indices to enable the
identification of similar clustering results across var-
ious temporal slices of data. Of primary concern in
this work is the comparison of clustering results from
a-priori selected temporal granularity (e.g weekly,
monthly and seasonal) and how to support practition-
ers in identifying similar results using a reference re-
sult of interest. A case study using real-world charg-
ing event data from EV station operators in Atlantic
Canada is used to evaluate the proposed clustering
process in identifying similar clusters of charging sta-
tions according to their usage patterns (e.g high vs low
usage).
The scientific contributions of this paper are as follows:
- Our work is unique in proposing a combination of eight internal cluster validity indices to characterize clusters at different granularities (e.g. weekly, monthly or seasonally). Previous research work has usually focused on using these indices apart from each other.
- These internal validity indices are then used to compute a proximity measure (i.e. Euclidean distance) for helping practitioners to identify similar clusters. To the best of our knowledge, this clustering procedure has never been used as an objective measure to reduce the cognitive load of practitioners in understanding clustering results.
- The use of real-world data from EV charging stations advances the understanding of charging behavior. To the best of our knowledge, no previous work has implemented an end-to-end automated clustering process that facilitates the comparison of clustering results by practitioners with different priorities and preferences.
The rest of the paper is organized as follows. In
Section 2, previous research work is described. Sec-
tion 3 describes the proposed clustering process un-
derpinning our work. Section 4 provides a detailed
description of the real-world EV charging event data
and the end-to-end automated implementation of our
proposed clustering process. In Section 5, we discuss
the results. Finally, Section 6 concludes and indicates
future research work.
2 RELATED WORK
In clustering, various steps must be taken by a prac-
titioner such as the selection of an appropriate algo-
rithm and its hyperparameters, the choice of an ad-
equate proximity measure, and how to validate the
modeling results. Fig. 1 outlines a typical cluster anal-
ysis process.
Additionally, the temporal granularity of an algo-
rithm’s input data can generate different clusters over
time. A common problem in clustering is how to ob-
jectively and quantitatively evaluate the results. Clus-
ter validation is an important task in the clustering
process because it aims to compare clustering results
and solve the question of optimal cluster count. Many
internal validity indices have been proposed to as-
sess the level of “success” that a clustering algorithm
achieves in finding the natural clusters in data without
any class label information (Rendón et al., 2011; Liu et al., 2010).
Figure 1: The Main Tasks of a Clustering Process as Described in (Messina, n.d.).
The preponderance of studies validating cluster results has focused on the computation of indi-
vidual cluster validity indices (CVI), which are usu-
ally selected to determine the relative performance of
clustering results. In (Arbelaitz et al., 2013), Arbe-
laitz et al. perform an extensive comparative study
of 30 CVI that are evaluated by using an experimen-
tal setup which recommends the “best” partitioning
in multiple data sets where ground truth information
exists. The optimal suggested number of partitions is
defined as the one that is the most similar to the cor-
rect one measured by partition similarity measures.
The authors found that noise and cluster overlap had
the greatest impact on CVI performance. Some in-
dices performed well with high dimensionality data
sets and in cases where homogeneity of the cluster
densities disappeared. The conclusion in this work
suggests using several CVI to obtain robust results.
In the energy domain, clustering has played an im-
portant role in revealing new insights in energy us-
age behavior, in particular, the EV charging demand
(Al-Ogaili et al., 2019). For example, in (Straka
and Buzna, 2019), the authors demonstrated the po-
tential of clustering to understand the usage patterns
related to segments of charging stations by compar-
ing k-means, hierarchical, and DBSCAN algorithms.
The clustering algorithms have successfully identified
four groups of EV charging stations characterized by
distinct usage patterns.
In contrast, very few attempts have been made at exploring CVI for evaluating clustering results.
In (Xydas et al., 2016), the Davies-Bouldin index
is used to determine the best value for the cluster
count parameter using the k-means algorithm. Sun
et al. (Sun et al., 2020) proposed a time series clus-
tering method using a modified Euclidean distance to
group the similar charging tails from ACN-Data col-
lected from smart EV charging stations. In this work,
they evaluated their clustering results, obtained with both the Dynamic Time Warping (DTW) distance and the Euclidean distance, using the silhouette coefficient.
In summary, the traditional usage of CVI has been
for validation purposes. However, utilizing multiple
CVI together in combination with a proximity mea-
sure such as Euclidean distance has a strong potential
to offer a new pairwise similarity measure that can
enhance the comparison of clustering results by prac-
titioners. Certainly, this is not a common practice in
Data Science as well as in the energy domain.
3 THE PROPOSED CLUSTERING
PROCESS
Our proposed clustering process extends the well-
known process introduced in the previous section.
Fig. 2 provides a conceptual overview of the main
tasks of our proposed clustering process. The num-
bered items in the figure link back to individual
Python scripts described in detail in the implemen-
tation section. At the end of the process, a database
is used to persist all clustering results and a RESTful
Application Programming Interface (API) facilitates
querying these results by different practitioners.
Figure 2: Our Proposed Clustering Process.
3.1 Data Preprocessing and Fusion
The data preprocessing and fusion task uses raw data
from the public EV charging stations. Preprocess-
ing consists of data cleaning and consolidation steps.
Data cleaning ensures good data quality and pro-
duces a set of cleaned files by eliminating errors, in-
consistencies, duplicated and redundant data rows,
and handling missing data. Data consolidation com-
bines data from various data files into a single data set.
A variety of files from the cleaned data set are used as
the input for this operation. The output of these steps
is a unique file that merges all attributes into one big
table.
Moreover, data fusion consists of combining mul-
tiple data sources followed by a reduction or replace-
ment for the purpose of better inference. In our
proposed clustering process, consolidated station lo-
cation information and charging event data files are
combined to produce more consistent, accurate, and
useful data files.
3.2 Feature Generation and Selection
The aim of the feature generation and selection task is
to enrich pre-processed and fused data files by adding
new attributes to each data row according to a spe-
cific context. This task is defined by a contextual-
ization function that can produce a set of new data
rows using contextualization parameters to add new
attributes to the fused data rows. Transformed data is
then partitioned using multiple temporal granularities
(e.g. weekly, monthly or seasonally).
3.3 Clustering
The aim of the clustering task is to find the pat-
terns from transformed input data using a hierarchical
agglomerative clustering algorithm. The algorithm
seeks to build a hierarchy of clusters by merging cur-
rent pairs of mutually closest input data points until all
the data points have been used in the computation.
The measure of inter-cluster similarity is updated af-
ter each step using complete Ward linkage. This a pri-
ori selected algorithm is utilized to fit the various tem-
poral granularities of input data, producing multiple
clustering results. Internal clustering validity indices
are recorded during each application of the clustering
algorithm.
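As a concrete illustration of this task, the sketch below fits scikit-learn's AgglomerativeClustering over a range of cluster counts on a toy feature matrix, assuming Ward linkage with its default Euclidean metric; the feature values and the 2-to-7 range are stand-ins for the per-station inputs and hyperparameter sweep described later in the implementation.

```python
# A minimal sketch of the clustering task, assuming Ward linkage with the
# default Euclidean metric; the toy feature matrix stands in for the
# per-station features produced by the feature generation task.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)
station_features = rng.random((25, 2))  # rows = stations, columns = features

# One clustering result per cluster-count value, as in the later implementation.
results = {}
for k in range(2, 8):
    model = AgglomerativeClustering(n_clusters=k, linkage="ward")
    results[k] = model.fit_predict(station_features)
```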
3.4 Harvesting and Processing Validity
Indices
Each application of the clustering algorithm gener-
ates a record consisting of the cluster count param-
eter value, the various cluster validity index values
and the input data used to generate the clusters. Pro-
cessing the validity indices involves selecting and nor-
malizing the index values in preparation for Euclidean
distance computations. This task utilizes the combi-
nation of eight cluster validity indices which are de-
scribed as follows:
3.4.1 Silhouette Index
The silhouette width of a data point measures how
similar the data point is to its own cluster compared
to other clusters. For clusters $X_j$, $j = 1, \dots, c$, the silhouette width of the $i$-th data point in cluster $X_j$ is defined as (Rendón et al., 2011):

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad (1)$$

where $a(i)$ is the average distance between the $i$-th data point and all data points included in $X_j$, and $b(i)$ is the minimum average distance between the $i$-th data point and all of the data points clustered in $X_k$, $k = 1, \dots, c$, $k \neq j$.
From individual silhouette width calculations, an
aggregated global silhouette index is obtained (Petro-
vic, 2006). The silhouette index values range from
-1 to 1 where a value closer to 1 indicates clusters
are well separated and clearly distinguished. A value
closer to -1 indicates data points are not properly clus-
tered.
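A short, self-contained sketch of Eq. (1) in practice: scikit-learn's silhouette_samples returns the per-point widths S(i) and silhouette_score the aggregated global index. The synthetic blobs are illustrative only.

```python
# Illustrative computation of the silhouette index on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

per_point = silhouette_samples(X, labels)  # S(i) for each data point, in [-1, 1]
global_sil = silhouette_score(X, labels)   # aggregated global silhouette index
```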
3.4.2 Caliński-Harabasz Index
The Caliński-Harabasz (CH) index is expressed as a ratio of the between-cluster variance and the overall within-cluster variance. A recent comparative study of available clustering indices demonstrated this index to be one of the best cluster validity indices (Arbelaitz et al., 2013). Well defined clusters yield high values of this index; therefore, the maximum value of the index is used to select the best partition. For $n$ data points and $k$ clusters, where $B$ and $W$ are the between-cluster and within-cluster scatter matrices, the index is computed as (Gurrutxaga et al., 2011):

$$CH = \frac{\operatorname{trace}(B)/(k-1)}{\operatorname{trace}(W)/(n-k)} \quad (2)$$
3.4.3 Davies-Bouldin Index
The Davies-Bouldin (DB) index is defined as follows (Gurrutxaga et al., 2011):

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j = 1, \dots, k;\; j \neq i} (d_{ij}) \quad (3)$$

where

$$d_{ij} = \frac{s_i + s_j}{d(c_i, c_j)} \quad (4)$$

In this formula, $k$ is the number of clusters, $s_i$ is the average distance of all data points in cluster $i$ to their cluster centroid, and $d(c_i, c_j)$ is the distance between the centroids of clusters $i$ and $j$. With this index, a minimum value denotes the best partitioning of the data.
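Both of these indices are available off the shelf; the sketch below computes Eq. (2) and Eq. (3) with scikit-learn on the same kind of synthetic input, purely as an illustration.

```python
# Illustrative computation of the Caliński-Harabasz and Davies-Bouldin indices.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

ch = calinski_harabasz_score(X, labels)  # higher values indicate better-defined clusters
db = davies_bouldin_score(X, labels)     # lower values indicate better partitioning
```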
3.4.4 Cohesion
Cohesion is measured by the sum of squared distances from each data point to its respective centroid. Also referred to as the within sum of squares (WSS), it measures how closely related the data points in a cluster are. The WSS is defined as (López et al., 2017):

$$WSS = \sum_{i=1}^{N_c} \sum_{x \in C_i} d(x, \bar{X}_{C_i})^2 \quad (5)$$

where $C_i$ is the $i$-th cluster, $N_c$ is the number of clusters, and $\bar{X}_{C_i}$ is the centroid of cluster $C_i$. The goal in clustering is to minimize the value of WSS.
3.4.5 Separation
Separation measures how distinct or well separated a cluster is from other clusters. Calculated as the sum of the squared deviations between the groups, it is defined as (López et al., 2017):

$$BSS = \sum_{i=1}^{N_c} |C_i| \cdot d(\bar{X}_{C_i}, \bar{X})^2 \quad (6)$$

In this formula, $|C_i|$ is the size of cluster $C_i$, $N_c$ is the number of clusters, $\bar{X}_{C_i}$ is the cluster centroid, and $\bar{X}$ is the sample mean. An optimal clustering will have a higher value of BSS.
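Cohesion and separation are not packaged as single scikit-learn calls, so the sketch below computes Eq. (5) and Eq. (6) directly from the cluster centroids; it is a straightforward reading of the definitions rather than the authors' exact code.

```python
# Illustrative computation of cohesion (WSS) and separation (BSS).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

overall_mean = X.mean(axis=0)
wss, bss = 0.0, 0.0
for c in np.unique(labels):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    wss += np.sum((members - centroid) ** 2)                      # Eq. (5)
    bss += len(members) * np.sum((centroid - overall_mean) ** 2)  # Eq. (6)
```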
3.4.6 Root Mean Square Standard Deviation
The Root Mean Square Standard Deviation (RMSSTD) measures homogeneity within clusters. A lower RMSSTD value means a better separation of clusters, while large values indicate that clusters are not homogeneous. The metric is computed as (Rujasiri and Chomtee, 2009):

$$RMSSTD = \sqrt{\frac{\sum_{i=1}^{k} \sum_{j=1}^{p} \sum_{a=1}^{n_{ij}} (x_a - \bar{x}_{ij})^2}{\sum_{i=1}^{k} \sum_{j=1}^{p} (n_{ij} - 1)}} \quad (7)$$

where $k$ is the number of clusters, $p$ is the number of independent variables in the data set, $\bar{x}_{ij}$ is the mean of the values of variable $j$ in cluster $i$, and $n_{ij}$ is the number of data points of variable $j$ in cluster $i$.
3.4.7 R-squared
The R-squared (RS) value captures whether there is a significant difference among data points in different clusters while data points in the same cluster have high similarity. RS values range from 0 to 1: a value of 0 indicates that there is no difference between clusters, whereas a value closer to 1 indicates that the partitioning is closer to an optimal allotment. The metric is computed as (Rujasiri and Chomtee, 2009):

$$RS = \frac{SS_t - SS_w}{SS_t} \quad (8)$$

$$SS_t = \sum_{j=1}^{p} \sum_{a=1}^{n_j} (x_a - \bar{x}_j)^2 \quad (9)$$

$$SS_w = \sum_{i=1}^{k} \sum_{j=1}^{p} \sum_{a=1}^{n_{ij}} (x_a - \bar{x}_{ij})^2 \quad (10)$$

In these equations, $SS_t$ is the sum of squared distances among all variables, $SS_w$ is the sum of squared distances among all data points in the same cluster, $k$ is the number of clusters, $p$ is the number of independent variables in the data set, $\bar{x}_j$ is the mean of the data in variable $j$, $\bar{x}_{ij}$ is the mean of the data of variable $j$ in cluster $i$, and $n_{ij}$ is the number of data points of variable $j$ in cluster $i$.
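The two indices above can be derived from the same within-cluster and total scatter quantities; the sketch below is one plausible implementation of Eqs. (7) to (10) on synthetic data, and not necessarily how the authors' pipeline computes them.

```python
# Illustrative computation of RMSSTD and R-squared from scatter sums.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

n, p = X.shape
k = len(np.unique(labels))

# within-cluster sum of squares over all variables (SS_w, numerator of Eq. 7)
ss_w = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
           for c in np.unique(labels))
# total sum of squares around the overall mean (SS_t)
ss_t = np.sum((X - X.mean(axis=0)) ** 2)

rmsstd = np.sqrt(ss_w / (p * (n - k)))  # denominator equals the sum of (n_ij - 1)
rs = (ss_t - ss_w) / ss_t               # values closer to 1 indicate better partitioning
```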
3.4.8 Xie-Beni Index
The Xie-Beni (XB) index is applicable to fuzzy and crisp clustering results. It is defined as the quotient between the mean quadratic error and the minimum of the minimal squared distances between the points in the clusters. The index is defined as (Chakrabarty, 2010):

$$XB(K) = \frac{\sum_{k=1}^{K} \sum_{j=1}^{n} (\mu_{kj})^m \, \lVert x_j - z_k \rVert^2}{n \times \min_{i \neq k} \lVert z_i - z_k \rVert^2} \quad (11)$$

where the numerator measures cluster compactness and the denominator measures the separation between different cluster centers. The value of the XB index should be at its minimum for the optimum number of clusters in the data. The parameter $m$ is called the fuzzifier and is usually set between 1 and 2.
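To make the harvesting and processing step concrete, the sketch below gathers the eight index values of each clustering result into a feature vector and normalizes them before the distance computations of the next task. The column names mirror Table 2 and the three example rows are borrowed from Table 3 for illustration; min-max scaling is an assumption, as any normalization preparing the vectors for distance computation would fit.

```python
# Illustrative harvesting of validity indices and min-max normalization.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

index_cols = ["silhouette_score", "calinski_harabasz", "davies_bouldin",
              "cohesion", "separation", "RMSSTD", "RS", "XB"]

# One row per clustering result (values borrowed from Table 3 for illustration).
records = pd.DataFrame(
    [[0.60, 51.37, 0.51, 1.12, 2.40, 0.15, 0.68, 0.09],
     [0.60, 49.35, 0.57, 0.19, 2.44, 0.16, 0.67, 0.10],
     [0.65, 55.51, 0.52, 1.14, 2.63, 0.15, 0.70, 0.07]],
    columns=index_cols,
)

scaled = pd.DataFrame(MinMaxScaler().fit_transform(records), columns=index_cols)
```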
3.5 Similarity Computations
Our work uses a proximity measure in the clustering
task and in the computation of the results similarity
matrix. Selecting this measure to determine how
similar or dissimilar two data points are is an important
step in any clustering process. Proximity measures af-
fect the shape of clusters as some data points may be
close to one another according to one measure and far
from each other according to another. Euclidean dis-
tance is a preferred distance measure by researchers in
the field of clustering and is defined as (Chakrabarty,
2010):
$$D(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} \quad (12)$$
In addition to the clustering task, the similarity
computation task uses Euclidean distance as the prox-
imity measure between clustering results. All index
values (e.g. multidimensional points in Euclidean
space) of each clustering result are used in the dis-
tance computations. The pair-wise similarity compar-
isons (e.g. the similarity matrix) are then persisted in
a database for down-stream results exploration via a
RESTful API.
The similarity matrix is stored in the database us-
ing two tables. The first table summarizes clustering
results with rows consisting of a unique clustering re-
sult ID (result_id) and meta-data about running the algorithm (e.g. input file name, clustering execution time, all validity index values, etc.). The second table, which is linked to the first table, contains rows consisting of a source result ID (from_result_id), a target result ID (to_result_id) and a Euclidean distance.
Links between result IDs are not duplicated as direc-
tionality is not considered.
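A minimal sketch of this computation, assuming the scaled index vectors from the previous task: pairwise Euclidean distances (Eq. 12) are computed with SciPy and flattened into (from_result_id, to_result_id, distance) rows matching the second table described above. The array values and result IDs below are illustrative stand-ins.

```python
# Illustrative pairwise similarity computation over scaled index vectors.
import numpy as np
from scipy.spatial.distance import pdist, squareform

scaled = np.random.default_rng(0).random((5, 8))  # stand-in for 5 results x 8 indices
dist = squareform(pdist(scaled, metric="euclidean"))

# Keep the upper triangle only, since directionality is not considered.
rows = [(i, j, float(dist[i, j]))
        for i in range(len(dist)) for j in range(i + 1, len(dist))]
```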
4 IMPLEMENTATION
This work makes use of real operational data from
public EV charging stations provided by the New
Brunswick Power Corporation. 9,505 EV charging
events that occurred between the dates of April 2019
and April 2020 at Level-2 (L2) and Level-3 (L3) pub-
lic charging stations were included in the analysis.
Table 1 describes the raw EV charging data set fea-
tures. Our practitioners are managers and planners
of a utility company who are responsible for coordi-
nating various projects including EV charging station
condition assessments, operating and capital budget
forecasting, and maintenance and operation practices
development. Fig. 3 describes the overall end-to-end
implementation of our EV use case.
Custom-written Python code and a scientific
Python stack were leveraged to implement the pro-
posed clustering process. Task elements were ex-
ecuted in sequence from a centralized management
Table 1: Raw Data.
Connection ID: Unique identifier for a connection
Recharge start time (local): Timestamp denoting start of charging event
Recharge end time (local): Timestamp denoting end of charging event
Account name: Unused (all null)
Card identifier: Unique identifier for a charging plan member
Recharge duration (hours:minutes): Duration of charge event
Connector used: Connection used during charge event
Start state of charge (%): State of charge % at beginning of charging event
End state of charge (%): State of charge % after charging event is complete
End reason: Charge event end reason
Total amount: Unused (all null)
Currency: Unused (all null)
Total kWh: Energy transferred to vehicle during charging event
Station: Unique identifier for charging station
script (Richard et al., 2020). The software pro-
grams used in this work were packaged using a
Docker (Boettiger, 2015) container in order to ensure
a reproducible and consistent computational environ-
ment.
Fig. 4 highlights noteworthy aspects of the imple-
mentation. The numbered boxes represent individual parameterized Python scripts. The data flow is
such that the output of one script is the input for the
next script. Input and output file names contain pa-
rameter values that were used when calling the work-
flow’s scripts. The grey elements represent a job’s in-
put file(s). The blue elements represent a job’s output
file(s). The detailed implementation of each script is
described as follows:
Script (1): The one_way_hash.py script imports
raw event data and casts column elements to ap-
propriate types. Additionally, a one-way hash
function is applied to the Card identifier column.
Script (2): The locations_to_parquet.py script
imports raw station location data and integrates
multiple input files into one.
Script (3): The fuse_location_w_events.py script fuses event data with charging station location information.
Figure 3: Overview of Our Implemented EV Use Case (pipeline from raw data through data cleaning, hashing and fusion, feature generation and selection, temporal partitioning, clustering, harvesting and processing of validation indices, and clustering similarities calculation into a results database).
Script (4): This work focuses on recharge
report event data in the downstream analysis.
The feat_eng_rech_report.py script creates new
features (contextualized) based on calculations
involving existing data attributes and removes
events with a duration of 5 minutes or less (elimi-
nating 11% of the raw records).
Script (5): The create_batch_ranges.py script
creates temporal partitions of the data. These
partitions facilitate the cluster analysis based on
charging events occurring during a particular
week, month or season of the year.
Script (6): The generate_ev_station_features.py script prepares the input data for clustering by calculating, for each charging station, station type and temporal granularity, the proportion of total charging events and the proportion of total power used to charge vehicles relative to all stations (a sketch of this computation follows this list).
Script (7): The cluster_data.py script applies the
agglomerative clustering algorithm to all temporal
slices of the data produced in the previous task.
This is done for a cluster count hyperparameter
that varies from 2 to 7. Other hyperparameter set-
tings are kept constant to simplify the experimen-
tal setup. Internal clustering validity indices are
recorded during each application of the clustering
algorithm (See Table 2 for the list of indices).
Script (8): The scale_indices.py script normalizes
the internal clustering validity indices in prepara-
tion for the downstream Euclidean distance com-
putations.
Script (9): The similarity_matrix.py script performs pairwise Euclidean distance computations for each clustering result. All index values (e.g. multidimensional points in Euclidean space) of each clustering result are used in the distance computations.
Script (10): The load_data.py script persists the
similarity matrix data produced in the previous
task in a relational database to enable querying of
clustering results and corresponding similarities
across months, weeks and seasons. The database
query functionality is made available via a REST-
ful API.
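As referenced in the Script (6) description above, the sketch below shows one plausible way to derive the per-station clustering features for a single station type and temporal slice. The column names follow Table 1, while the grouping logic and the function name are assumptions rather than the published script.

```python
# Hypothetical sketch of per-station feature generation (cf. Script (6)).
import pandas as pd

def station_features(events: pd.DataFrame) -> pd.DataFrame:
    """Share of charging events and of energy delivered, per station."""
    grouped = events.groupby("Station").agg(
        n_events=("Total kWh", "size"),  # count of charging events per station
        kwh=("Total kWh", "sum"),        # energy delivered per station
    )
    grouped["event_share"] = grouped["n_events"] / grouped["n_events"].sum()
    grouped["energy_share"] = grouped["kwh"] / grouped["kwh"].sum()
    return grouped[["event_share", "energy_share"]]
```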
After results are generated and persisted (e.g.
Script (10) in Fig. 4 is complete), the practitioner can
navigate these results via a RESTful interface. Fig. 5
illustrates how the practitioner interacts with the re-
sults system. First, the practitioner requests ranked
station clustering results for either L2 or L3 station
types (Step 1). The system then returns a sorted list
of clustering results ordered by silhouette score (Step
2). From this list, the practitioner selects one result
as the reference result for which comparable results
are desired and then requests these comparable results
from the system (Step 3). Finally, the system returns
a sorted list of comparable clustering results that is
ordered by Euclidean distance (Step 4). This sorted
list contains result specific artefacts such as scatter
plots, mapped station cluster memberships and sil-
houette plots.
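The interaction in Fig. 5 could be scripted as below; the host, endpoint paths and query parameters are purely hypothetical assumptions, since the paper does not publish the actual API routes or response schema.

```python
# Hypothetical client-side walk-through of the query sequence in Fig. 5.
import requests

BASE = "http://localhost:8000/api"  # assumed host and route prefix

# Steps 1-2: ranked clustering results for a station type, sorted by silhouette score.
ranked = requests.get(f"{BASE}/results", params={"station_type": "L3"}).json()

# Steps 3-4: comparable results for a chosen reference, sorted by Euclidean distance.
reference_id = ranked[0]["result_id"]  # assumed response field
similar = requests.get(f"{BASE}/results/{reference_id}/similar").json()
```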
The clustering process implementation and RESTful
API facilitate the comparison of clustering result sim-
ilarities across various temporal granularities. This
process is useful in identifying avenues for further
analysis. One Level 3 station clustering result for the
Figure 4: Data Flow Between Python Scripts of the Clus-
tering Process Implementation.
Figure 5: Results Query Sequence.
week of May 27th, 2019 has been selected as a case
study to demonstrate our approach. The case study is
presented in the next section.
5 RESULTS AND DISCUSSION
This section highlights the results of our proposed ap-
proach in identifying similar station clusterings over
multiple weeks with a case study. Table 3 highlights
similar clustering results relative to station clusterings
for a target week starting on May 27th, 2019. In all re-
sults, the number of clusters is 2 and the station type
is L3. The table is sorted in ascending order by Eu-
clidean distance relative to the target week. Accord-
ing to the multi-dimensional pairwise distance calcu-
lations obtained using the features described in Ta-
ble 2, the most similar clustering result to the week
starting on May 27th, 2019 is the result for the week starting on February 17th, 2020. The least similar clustering result is the result for the week starting on December 2nd, 2019.
Table 2: Clustering Validity Index Data.
file_name: File name for clustering results for station type and time granularity
n_cluster: K parameter value used in applying the clustering algorithm
silhouette_score: Silhouette index value for clustering result
calinski_harabasz: Caliński-Harabasz index for clustering result
davies_bouldin: Davies-Bouldin index for clustering result
cohesion: Cohesion index for clustering result
separation: Separation index for clustering result
RMSSTD: Root mean square standard deviation index for clustering result
RS: R-squared index for clustering result
XB: Xie-Beni index for clustering result
A corresponding visual presentation of the cluster-
ing results found in Table 3 can be seen in Figures 6
through 10. Each figure contains a silhouette and scat-
ter plot describing the clustered data. In the silhouette
plots, an observation with a silhouette width near 1,
means that the data point is well placed in its cluster;
an observation with a silhouette width closer to neg-
ative 1 indicates the likelihood that this observation
might really belong in some other cluster.
Table 3: Clustering Similarities - L3 - May 27th, 2019.
WEEK Sil CH DB C S RMS RS XB Dist
27/05/19 0.60 51.37 0.51 1.12 2.40 0.15 0.68 0.09 N/A
17/02/20 0.60 49.35 0.57 0.19 2.44 0.16 0.67 0.10 0.081
02/03/20 0.65 55.51 0.52 1.14 2.63 0.15 0.70 0.07 0.101
29/07/19 0.60 55.82 0.53 0.99 2.30 0.14 0.70 0.11 0.105
02/12/19 0.63 56.55 0.58 1.26 2.97 0.16 0.70 0.09 0.177
Column Name Abbreviations for Table 3
Sil: Silhouette index
CH: Caliński-Harabasz index
DB: Davies-Bouldin index
C: Cohesion
S: Separation
RMS: Root mean square standard deviation
RS: R-squared
XB: Xie-Beni index
Dist: Euclidean distance relative to the target week (first row)
Figure 6: L3 Station Clusters - MAY-27-2019 (a: silhouette plot; b: scatter plot).
Figure 7: L3 Station Clusters - FEB-17-2020 (a: silhouette plot; b: scatter plot).
We can see from Figures 6 and 7 that reasonable structures in the data have been found. Stations are
grouped in terms of relatively higher or lower uti-
lization rates. The average silhouette score is 0.60
in both clustering results. The number of observa-
tions in each cluster for both results are also the same.
Cluster 0 in both situations has more stations with relatively lower utilization rates. Results for the week of May 27th, 2019 are slightly better when consid-
ering all cluster validation indices. This can also be
observed visually. Data points seem to be closer to-
gether in the scatter plot of Fig. 6b than in Fig. 7b.
The between-cluster separation in both results is similar.
The silhouette plot in Fig. 8a suggests a less op-
timal clustering. This plot indicates that some obser-
vations would seemingly belong to clusters other than
the one they are in; these observations have a negative
silhouette width value.
The silhouette plot in Fig. 9a and the average silhouette score of 0.60 suggest that a reasonable structure in
Figure 8: L3 Station Clusters - MAR-02-2020 (a: silhouette plot; b: scatter plot).
Figure 9: L3 Station Clusters - JUL-29-2019 (a: silhouette plot; b: scatter plot).
Figure 10: L3 Station Clusters - DEC-02-2019 (a: silhouette plot; b: scatter plot).
the data has also been found in this week. The number
of observations in each cluster for both clustering re-
sults are different. Based on the various indices, clus-
tering results for the week of July 29th, 2019 are better in some aspects and inferior in others to results for the week of May 27th, 2019. This result was identified as the 3rd most similar result for our target week.
The decreasing relative similarity of results is es-
pecially visible when comparing the results for the
week of May 27th, 2019 with results having the least similarity (i.e., results for the week of December 2nd,
2019). In Fig. 10a we can see that all cluster 0’s
members have below average silhouette scores and
the clustering of stations is much less similar than the
other clusterings.
Individual index calculations embed implicit
trade-offs on what is prioritized when expressing
inter-cluster separation, inter-cluster homogeneity,
density, and compactness as one numeric value. One
can view the various indices as averages where a cer-
tain precision is lost in the summary. This can lead to
situations where one index will suggest a better clus-
tering relative to another grouping and another index
will reverse this assessment. This is illustrated in Table 3 where, for example, the silhouette and Caliński-Harabasz index values for December 2nd suggest a better clustering than the week starting on May 27th. However, the Davies-Bouldin and R-squared index values reverse this assessment.
Capital investments in public charging infrastructure involve the use of public funds and necessitate robust, informed decision making. Identifying similar
station utilization patterns over multiple weeks can be
useful planning information for station operators. The
cluster analysis presented in our case study provides
useful insights by identifying similar groupings of EV
charging stations according to their usage patterns in
time.
The results highlighted in the case study provided
in this section demonstrate that given a clustering re-
sult of interest, a process of objectively highlighting
and recommending similar clustering results can in-
deed be automated in order to support the practitioner
in evaluating how structure in data persists over mul-
tiple time slices in a data set with temporal proper-
ties. The relative ranking of similar clustering results
that our approach affords makes it easy to objectively
identify similar station groupings over multiple weeks
based on a reference week. Not highlighted in the case study are the clustering results for other a-priori selected temporal partitions in the data, which are also available as reference points for exploring monthly or seasonal clustering similarities. For example, see Fig. 11 for silhouette plots representing a reference month (where K=4) and a reference season (where K=3).
Figure 11: L3 Station Clustering References - August and Spring (a: reference month August, K=4; b: reference season Spring, K=3).
6 CONCLUSIONS
Although clustering has become a routine analytical
task in many research domains, it remains arduous for
practitioners to select a good algorithm with adequate
hyperparameters and to assess the quality of cluster-
ing and the consistency of identified structures over
various temporal slices of data. The process of clus-
tering data is often an iterative, lengthy, manual and
cognitively demanding task. The subjectivity in deter-
mining the level of “success” that unsupervised learn-
ing approaches are able to achieve and the required
expert knowledge during the modeling phase suggest
that a human-in-the-loop process of supporting the
practitioner during this activity would be beneficial.
Ascertaining whether a particular clustering of data is
meaningful or not requires expertise and effort. Doing
this for multiple results on data that has been sliced by
weekly, monthly or seasonal partitions prior to apply-
ing the clustering algorithm would be very time con-
suming. Manually identifying one meaningful result
of interest and then having an automated mechanism
to select similar results is extremely useful in reduc-
ing the amount of effort required to identify avenues
that merit further analysis and assist in downstream
analytical tasks such as improving regression or clas-
sification model performance.
A case study using real-world charging event data
from EV station operators in Atlantic Canada was
used to validate the approach and identify similar
groupings of charging stations according to their us-
age patterns. Our work demonstrates that given a
clustering result of interest, the process of objectively
highlighting and recommending similar clustering re-
sults can be automated in order to support the prac-
titioner in evaluating how structure in data persists
over multiple time slices and reduce the cognitive load
of identifying multiple meaningful clustering results
from a large number of modeling artifacts.
Presenting the practitioner with an initial ranked
list of clustering results leveraging all index values
simultaneously instead of just using the silhouette
scores (as described in Step 1 of Fig. 5) may improve
the initial results exploration process. Framing the
creation of the initial ranked list of results as a Multi-
ple Criteria Decision Making (MCDM) problem will
be included in future work. Additionally, we will ex-
plore if an expert can label a portion of the model-
ing artifacts as meaningful or not and whether a semi-
supervised or other algorithm can automatically label
the rest of the unseen modeling results from the labels
provided by the practitioner. Finally, other avenues
will explore whether this work can be adapted to im-
plement a novel change point detection approach in
identifying significant changes in station groupings in
temporal slices of the data.
ACKNOWLEDGEMENTS
The authors of this paper would like to thank the New
Brunswick Power Corporation for providing access
to station operator users and the EV charging data
referenced in this research. This work was partially
supported by the NSERC/Cisco Industrial Research
Chair, Grant IRCPJ 488403-1.
REFERENCES
Al-Ogaili, A. S., Hashim, T. J. T., Rahmat, N. A., Ra-
masamy, A. K., Marsadek, M. B., Faisal, M., and Han-
nan, M. A. (2019). Review on scheduling, clustering,
and forecasting strategies for controlling electric vehi-
cle charging: challenges and recommendations. IEEE Access, 7:128353–128371.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M.,
and Perona, I. (2013). An extensive comparative
study of cluster validity indices. Pattern Recognition,
46(1):243–256.
Bae, J., Helldin, T., Riveiro, M., Nowaczyk, S., Bouguelia,
M.-R., and Falkman, G. (2020). Interactive Cluster-
ing: A Comprehensive Review. ACM Computing Sur-
veys, 53(1):1–39.
Boettiger, C. (2015). An introduction to docker for repro-
ducible research. ACM SIGOPS Operating Systems
Review, 49(1):71–79.
Chakrabarty, A. (2010). An investigation of clustering al-
gorithms and soft computing approaches for pattern
recognition. PhD thesis, Assam University.
Gurrutxaga, I., Muguerza, J., Arbelaitz, O., Pérez, J. M., and Martín, J. I. (2011). Towards a standard method-
ology to evaluate internal cluster validity indices. Pat-
tern Recognition Letters, 32(3):505–515.
Iglesias, F. and Kastner, W. (2013). Analysis of similarity
measures in times series clustering for the discovery
of building energy patterns. Energies, 6(2):579–597.
Liu, Y., Li, Z., Xiong, H., Gao, X., and Wu, J. (2010). Un-
derstanding of internal clustering validation measures.
In 2010 IEEE International Conference on Data Min-
ing, pages 911–916. IEEE.
López, S. L. S., Redondo, R. P. D., and Vilas, A. F.
(2017). Discovering knowledge from student inter-
actions: clustering vs classification. In Proceedings
of the 5th International Conference on Technological
Ecosystems for Enhancing Multiculturality, pages 1–
8.
Messina, E. (n.d.). Cluster analysis, powerpoint slides. (Ac-
cessed 2020-11-07).
Oliveira, M. (2019). 3 Reasons Why AutoML Won’t Re-
place Data Scientists Yet. (March 2019).
Petrovic, S. (2006). A comparison between the silhouette
index and the davies-bouldin index in labelling ids
clusters. In Proceedings of the 11th Nordic Workshop
of Secure IT Systems, pages 53–64. Citeseer.
Poulakis, G. (2020). Unsupervised automl: a study on au-
tomated machine learning in the context of clustering.
Master’s thesis, Πανεπιστ
´
ηµιo Πειραι
´
ως.
Rendón, E., Abundez, I., Arizmendi, A., and Quiroz, E. M.
(2011). Internal versus external cluster validation in-
dexes. International Journal of computers and com-
munications, 5(1):27–34.
Richard, R., Cao, H., and Wachowicz, M. (2020). Discov-
ering ev recharging patterns through an automated an-
alytical workflow. In 2020 IEEE International Smart
Cities Conference (ISC2), pages 1–8.
Rujasiri, P. and Chomtee, B. (2009). Comparison of cluster-
ing techniques for cluster analysis. Nat. Sci, 43:378–
388.
Straka, M. and Buzna, L. (2019). Clustering algorithms
applied to usage related segments of electric vehicle
charging stations. Transportation Research Procedia,
40:1576–1582.
Sun, C., Li, T., Low, S. H., and Li, V. O. (2020). Clas-
sification of electric vehicle charging time series with
selective clustering. Electric Power Systems Research,
189:106695.
Xydas, E., Marmaras, C., Cipcigan, L. M., Jenkins, N.,
Carroll, S., and Barker, M. (2016). A data-driven
approach for characterising the charging demand of
electric vehicles: A uk case study. Applied energy,
162:763–771.