Clustering of Physical Behaviour Profiles using Knowledge-intensive Similarity Measures

Deepika Verma, Kerstin Bach (1, a) and Paul Jarle Mork (2, b)

(1) Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway

(2) Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway

Keywords: Case-based Reasoning, Knowledge Intensive Similarity Measures, Physical Activity, K-Means.

Abstract: In this paper, we reuse the Case-Based Reasoning model presented in our previous work (Verma et al., 2018) to create a new knowledge-intensive similarity-based clustering method that clusters a case base such that the intra-cluster similarity is maximized. In some domains, such as recommender systems, the most similar case may not always be the desired one, as a user may want to find the closest yet significantly different cases. Clustering a case base before retrieval is executed increases the diversity of the returned solutions. In this work, we demonstrate a methodology to optimize cluster coherence as well as to determine the optimal number of clusters for a given case base. Finally, we evaluate our clustering approach by comparing the quality of the clusters obtained using our knowledge-intensive similarity-based clustering approach against that of the state-of-the-art K-Means clustering method.

1 INTRODUCTION

With the unprecedented growth in popularity of wearable activity trackers, acquiring reliable and objective physical behaviour data from users over a long period of time has become feasible. Activity trackers provide objectively measured basic activity statistics such as daily step count, miles run and heart rate, among others, while some trackers additionally provide activity recommendations to help users stay active throughout the day. While the validity and reliability of activity trackers remain a topic of research (O'Driscoll et al., 2018), we conduct our research on the premise that physical behaviour is measured objectively, as opposed to self-reported (subjective), and this is the point of departure for our work. Such objectively measured data present the opportunity to identify groups of people (or clusters) with similar physical behaviour (Marschollek, 2013; Howie et al., 2018). Further, this may provide a foundation for gaining new insights into the driving forces of physical behaviour in a population.

Clustering methods provide a simple yet powerful way to reveal the underlying structure of the data and statistically understand the relationship between different data points. K-Means clustering (MacQueen, 1967) is one of the most commonly employed state-of-the-art unsupervised machine learning methods for partitioning a given dataset into k clusters. Simple similarity metrics are used to calculate the similarity of each data point to the cluster centroids in order to determine the cluster membership of each data point. The centroids are then recomputed, and the process repeats until no more changes in the position of the centroids are observed. However, K-Means has certain limitations. It has a tendency to overlook data complexity (Yang et al., 2016) and is sensitive to outliers (Singh et al., 2011), and can therefore fail to give meaningful clusters when the dataset contains many outliers.

a https://orcid.org/0000-0002-4256-7676

b https://orcid.org/0000-0003-3355-2680

The challenge for most state-of-the-art clustering methods is their use of knowledge-poor similarity metrics or simple distance metrics such as Hamming distance and Euclidean distance, among others. These metrics take into consideration only the syntactic difference between two data points, ignoring the coherence of each attribute or variable of a data point, and thus lead to an insufficient estimate of the similarity between them. Datasets in which each variable takes on a value within a specific range require the local dependency to be modelled for each variable. The similarity metric used must allow existing knowledge to be brought to bear on the assessment of similarity between data points in a dataset. Simple distance metrics can render the clusters in a complex dataset incoherent, as opposed to cohesive clusters wherein the data points within a cluster are more similar to each other than to data points in another cluster.

Verma, D., Bach, K. and Mork, P.
Clustering of Physical Behaviour Profiles using Knowledge-intensive Similarity Measures.
DOI: 10.5220/0008980406600667
In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART 2020) - Volume 2, pages 660-667
ISBN: 978-989-758-395-7; ISSN: 2184-433X

A solution to this problem can be formulated using Case-Based Reasoning (CBR) (Aamodt and Plaza, 1994), which employs a more knowledge-driven approach. By focusing on the semantic rather than the syntactic similarity between attributes, and by letting each variable's importance influence the final (global) similarity score, the existing knowledge about the dataset can be incorporated to improve the clustering quality (Adam and Blockeel, 2015). CBR thereby offers a more versatile approach to clustering complex datasets (Müller and Bergmann, 2014).
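The contrast between a knowledge-poor distance and a knowledge-intensive similarity can be illustrated with a minimal sketch. It assumes a linear local similarity per attribute and a weighted average as the global similarity; the attribute ranges, the weights and the function names are hypothetical and are not taken from the CBR model used in this paper.

```python
from math import sqrt

# Hypothetical attribute ranges (minutes over six days) and weights.
RANGES = {"lying": (0, 8640), "sitting": (0, 8640), "running": (0, 600)}
WEIGHTS = {"lying": 1.0, "sitting": 1.0, "running": 2.0}


def local_similarity(a, b, lo, hi):
    """Linear local similarity in [0, 1]: 1 when equal, 0 a full range apart."""
    return 1.0 - abs(a - b) / (hi - lo)


def global_similarity(x, y, ranges=RANGES, weights=WEIGHTS):
    """Weighted average of the per-attribute local similarities."""
    total = sum(weights.values())
    return sum(
        weights[k] * local_similarity(x[k], y[k], *ranges[k]) for k in ranges
    ) / total


def euclidean(x, y, keys=RANGES):
    """Knowledge-poor baseline: Euclidean distance on raw minutes."""
    return sqrt(sum((x[k] - y[k]) ** 2 for k in keys))


a = {"lying": 3000, "sitting": 3000, "running": 0}
b = {"lying": 3000, "sitting": 3000, "running": 300}  # differs only in running
c = {"lying": 3300, "sitting": 3000, "running": 0}    # differs only in lying

print(euclidean(a, b), euclidean(a, c))
print(global_similarity(a, b), global_similarity(a, c))
```

In this toy setting the Euclidean distance treats a 300-minute difference in running exactly like a 300-minute difference in lying, whereas the range-aware similarity rates the running difference as far more significant.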

In the sections that follow, we use both knowledge-intensive and knowledge-poor similarity measures for cluster computation. We hypothesize that using a knowledge-intensive similarity measure as the metric for clustering the cases in a case base creates clusters wherein the cases within each cluster are semantically more similar to each other than to cases in the other clusters. The main contribution of this paper is a knowledge-intensive similarity-based clustering method that can be used on any case base to compute clusters with high intra-cluster similarity. For brevity, any mention of the term similarity from this point onwards refers to knowledge-intensive similarity, unless otherwise stated; the two terms are used interchangeably.

This paper is organized as follows: section 2 discusses the related work on similarity-based clustering; section 3 presents the application domain and elaborates on how similarity-based clustering can be applied to identify clusters of physical behaviour profiles from objective physical behaviour data; section 4 is dedicated to our similarity-based clustering algorithm; section 5 describes the dataset we use to test our algorithm; section 6 presents a set of experiments to evaluate our clustering approach; and section 7 discusses and concludes our work.

2 RELATED WORK

The application of clustering methods has played a major role in discovering the underlying patterns in public health data sets and understanding the characteristic differences among clusters. Identifying different clusters of similar physical behaviour patterns is similarly pivotal in understanding the physical activity characteristics of a population and will facilitate the identification of different physical behaviour phenotypes. Clustering has previously been applied by (Marschollek, 2013) on objectively measured physical behaviour data to identify four activity phenotypes using regularity, duration and intensity of activities as the pivotal attributes. Similar to their work, we aim at applying clustering, albeit knowledge-intensive similarity-based, on objectively measured physical behaviour data to identify phenotypes. Using a more probabilistic approach, (Howie et al., 2018) identified five activity phenotypes for each gender using sex-specific latent class analysis. Although our approach differs from the one taken in their work, our long-term goals and the target data are quite similar.

Similar to the self-efficacy based activity recommendation approach adopted by (Baretta et al., 2019) to promote physical activity among adults, we aim to underpin activity recommendations based on the activity profile-assessed efficacy using a case-based approach in order to promote achievement of recommended physical activity goals. A case-based marathon profile recommendation approach has been presented by (Smyth and Cunningham, 2017) to help marathon runners achieve their personal best. Using a different approach for improving similarity-based retrieval in CBR, (Müller and Bergmann, 2014) present a cluster-based indexing approach to make retrieval of the most similar cases more efficient. While they use the similarity measure to construct a hierarchical cluster-tree which is used as an index for efficient retrieval, we use the similarity measure to create the clusters, which can then be used as an index for retrieving relevant cases. (Lucca et al., 2018) present a framework for developing an index on clustered cases for improving query accuracy in agent simulation systems and making retrieval of relevant cases more efficient by organizing a large case base into smaller sub-case bases. Similarly, (Cunningham, 2009) introduces similarity as a valid measure for selective sampling and generating solutions for unlabelled cases in clustered case bases.

Furthermore, (Fanoiki et al., 2010) presented a cluster-based approach which facilitates the identification of relevant cases for a given query problem by considering the similarity relation among the cases within the case base with respect to their problem space as well as the solution space. Their guiding principle is that the solutions of the most similar cases are likely to be similar if their problem descriptions are also similar. They formulate the solution by first selecting the cluster with the most similar problem description and then adapting the solution of the cases within that cluster. This is similar to what we intend to achieve for recommending activity goals. The approach taken in (Fanoiki et al., 2010) is somewhat similar to the selective case sampling approach presented in (Wiratunga et al., 2003), who show how similar unlabelled problems within the case base can be clustered and then labelled with solutions by the expert.

www.sciencedirect.com/topics/neuroscience/phenotype

https://www.who.int/ncds/prevention/physical-activity/guidelines-global-recommendations-for-health/en/

3 CLUSTERS OF PHYSICAL BEHAVIOUR PROFILES

Real-time activity tracking and systematic physical activity recommendations remind users to stay active throughout the day. This is especially useful for sedentary individuals (Lagersted-Olsen et al., 2013). Prolonged uninterrupted bouts of sedentary behaviour are known to be detrimental to health (Saunders et al., 2012). In addition to the type of physical activity, the intensity of the moderate to vigorous activity performed also has an impact on overall health outcomes (Ekelund et al., 2019).

The importance of sufficient sedentary behaviour has also been acknowledged (Coenen et al., 2018), since both the high and the low ends of the activity spectrum are necessary in the right balance in order to promote good health. However, existing state-of-the-art trackers provide approximately the same recommendations, with slight variation, to every user. Recommending activity goals that are challenging yet achievable is more beneficial for improving an individual's health than recommending goals that are either unachievable or not challenging enough (Baretta et al., 2019). Using an example from our dataset, we demonstrate how a CBR system can be used to identify unique clusters of physical behaviour profiles and how evidence-based experience of other similar profiles can be used to underpin activity recommendations for an individual.

Suppose we identify four clusters of physical behaviour profiles in our dataset (see section 5), as shown in figure 1 (we use a small subset of the original dataset for clarity in the visualization). The aim is to provide a user with a diverse set of adapted most similar profiles from other clusters as recommendations, ranked by their similarity (such that lower similarity indicates a more challenging goal). For instance, to recommend activity goals to case 20, the system can select the most similar case from each cluster other than its own and return the set of adapted profiles ranked by similarity, offering a diverse set of options for the user to choose their goal from. The most similar profile, case 67, appears to be a challenging as well as an achievable goal for case 20. Therefore, it might be advisable for case 20 to try to get closer to the adapted activity profile of case 67 if they wish to challenge themselves while at the same time achieving the recommended activity goals. Similarly, for case 85, case 12 appears to be a challenging and achievable goal, so it might be advisable for case 85 to try to get closer to the adapted activity profile of case 12 in order to become more physically active.

Figure 1: Example: A potential set of similarity-based clusters and how they can be utilised to recommend achievable activity goals to a user. The x-axis and y-axis show total sedentary duration and active duration (in minutes), respectively, over a period of six days.
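The selection step described above (one most similar case from every cluster other than the query's own, ranked by similarity) can be sketched as follows. The function name, the toy profiles and the placeholder similarity are hypothetical; a real system would obtain the similarity from the CBR model, and the adaptation of the retrieved profiles is not shown.

```python
def recommend(query_id, cases, labels, sim):
    """Return one most-similar case per foreign cluster, ranked by similarity.

    cases:  dict case_id -> profile
    labels: dict case_id -> cluster label
    sim:    function(profile, profile) -> similarity in [0, 1]
    """
    own = labels[query_id]
    best = {}  # cluster label -> (similarity, case_id)
    for cid, profile in cases.items():
        cluster = labels[cid]
        if cluster == own:
            continue  # skip the query's own cluster (and the query itself)
        s = sim(cases[query_id], profile)
        if cluster not in best or s > best[cluster][0]:
            best[cluster] = (s, cid)
    # Highest similarity first: the closest goal is the least challenging one.
    return sorted(((cid, s) for s, cid in best.values()), key=lambda t: -t[1])


# Toy data (hypothetical): profile = (sedentary minutes, active minutes).
cases = {20: (500, 100), 33: (520, 110), 67: (450, 150),
         12: (300, 300), 85: (700, 50)}
labels = {20: 0, 33: 0, 67: 1, 12: 1, 85: 2}


def toy_sim(p, q):
    """Placeholder similarity, standing in for the CBR model."""
    return 1.0 - (abs(p[0] - q[0]) + abs(p[1] - q[1])) / 1000.0


ranked = recommend(20, cases, labels, toy_sim)
print(ranked)
```

Returning the list sorted by decreasing similarity lets the user read it from the most easily reachable goal to the most challenging one.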

Large and complex datasets such as the objective measurements for the HUNT4 study (see section 5) require pre-processing and organization of the case base to improve the overall performance of a CBR system. We address this topic by identifying unique clusters of different physical behaviour within the HUNT4 dataset using our similarity-based clustering method. We direct our attention solely towards understanding the behavioural characteristics of a sample population that contribute to differences in physical activity and sedentary behaviour, which could allow for designing improved recommendations tailored to each phenotype for an innovative yet effective active lifestyle management intervention. To elicit greater improvements in the existing infrastructure of activity recommendations, a radical shift in the use and application of the existing methodologies may be required.

https://www.ntnu.no/hunt4


4 KNOWLEDGE INTENSIVE SIMILARITY BASED CLUSTERING

Unsupervised machine learning methods provide a way of inferring underlying patterns or structure in a given dataset without any reference to known outcomes and are therefore a viable option for our problem. We have a dataset consisting of 9034 physical behaviour profiles and look for clusters that represent meaningful physical behaviour types. Each cluster should be semantically coherent. While state-of-the-art clustering methods such as K-Means do provide a set of clusters, the profiles within each cluster are not guaranteed to be semantically similar to each other, since these methods use knowledge-poor similarity measures or simple distance measures.

As we have shown in our previous work (Verma et al., 2018), CBR outperforms the k-NN method in finding the most similar physical behaviour profiles. We therefore use the similarity score as the measure for clustering the profiles in our dataset. Our approach of using similarity as the metric for clustering extends the conventional use of similarity in CBR by making it possible to model and further utilize similarity measures that are aligned with domain expert knowledge. Algorithm 1 introduces the knowledge-intensive similarity-based clustering algorithm used in our work.

Algorithm 1: Knowledge Intensive Similarity-based Clustering Algorithm.

Input: case base $C$, number of clusters $n$
Output: $n$ clusters
Initialization: assign $n$ random cases as centroids $\{c_j\}$
Determine Cluster Membership
    for each case $k$ in $C$ do
        compute $sim(k, c_j)$, $\forall j \in 1, \dots, n$
        assign $k$ to most similar centroid
    end
Update Cluster Centroids
    for each $c_j$ in $\{c_j\}$ do
        compute $meanSim_j = \frac{1}{|S_j|} \sum_{\forall k_i \in S_j} sim(k_i, c_j)$
        find case $m$ in $S_j$ such that $sim(m, c_j) \approx meanSim_j$
        assign $m$ as the new centroid $c_j$
    end
Repeat until centroids converge

$S_j$ denotes the set of cases in cluster $c_j$.
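The steps of Algorithm 1 can be sketched in Python as below. This is an illustrative re-implementation under assumptions, not the Java/myCBR implementation used in this work: `sim` is any similarity function supplied by the caller, ties are broken by taking the first candidate, and `max_rounds` and `seed` are hypothetical parameters added because plain convergence of the centroids is not guaranteed.

```python
import random


def assign(cases, centroids, sim):
    """Put every case into the cluster of its most similar centroid."""
    clusters = [[] for _ in centroids]
    for i, case in enumerate(cases):
        j = max(range(len(centroids)),
                key=lambda j: sim(case, cases[centroids[j]]))
        clusters[j].append(i)
    return clusters


def update(cases, centroids, clusters, sim):
    """New centroid: the member whose similarity to the current centroid
    is closest to the cluster's mean similarity."""
    new_centroids = []
    for j, members in enumerate(clusters):
        if not members:  # keep the old centroid for an empty cluster
            new_centroids.append(centroids[j])
            continue
        sims = {i: sim(cases[i], cases[centroids[j]]) for i in members}
        mean_sim = sum(sims.values()) / len(sims)
        new_centroids.append(min(members, key=lambda i: abs(sims[i] - mean_sim)))
    return new_centroids


def similarity_clustering(cases, n, sim, max_rounds=100, seed=0):
    """Return (centroid indices, clusters as lists of case indices)."""
    rng = random.Random(seed)
    centroids = rng.sample(range(len(cases)), n)
    for _ in range(max_rounds):
        clusters = assign(cases, centroids, sim)
        new_centroids = update(cases, centroids, clusters, sim)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(cases, centroids, sim)


# Toy one-dimensional case base with two well-separated groups.
toy_cases = [0, 1, 2, 3, 4, 100, 101, 102, 103, 104]
toy_sim = lambda a, b: 1.0 - abs(a - b) / 200.0
centroids, clusters = similarity_clustering(toy_cases, 2, toy_sim)
print(centroids, clusters)
```

On this toy case base the centroids may keep moving within their group from round to round while the partition itself stays stable, which is one reason a stricter stopping rule than "centroids unchanged" can be useful.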

The algorithm initially assigns n cases as centroids at random and then computes the clusters using the similarity score of each case to each centroid. As the similarity-based clustering method operates only on the similarity score between each case and each centroid to determine cluster membership, it is independent of the data type. One advantage of this method is therefore that it can be applied to data sets that are not purely numerical, for example categorical or mixed datasets, which prove challenging for conventional clustering methods. Once the similarity measures are in place, the user no longer needs to handle the data types separately before applying this knowledge-intensive similarity-based clustering method.

5 DATASET

The data set used in this work is the objectively measured physical activity data collected during the fourth round of the HUNT cohort study. The data collection in HUNT4 spanned 18 months and was finished in February 2019. Each person who volunteered to participate in the objective physical activity data collection was fitted with two tri-axial accelerometers (AX3, Axivity), one on the lower back and another on the thigh, and wore them for a period of seven consecutive days. Objective measurements of a total of 35449 participants have been collected and basic physical activities have been assigned (see Table 1).

Table 1: Activity Descriptions.

Activity    Description
Lying       The person is lying down
Sitting     When the person's buttocks is on the seat of a chair or something similar
Standing    Upright, feet supporting the person's body weight
Walking     Locomotion towards a destination with one or more strides
Running     Locomotion towards a destination, with at least two steps where both feet leave the ground during each stride
Cycling     The person is riding a bicycle

Before populating the CBR system, we pre-process the data to obtain the same amount of data per participant. We therefore include only participants who have six full days of measured data. Furthermore, we remove any record containing zero minutes for lying, standing, sitting and less than one minute for walking activity, as well as records where the sum of all activities exceeds 1440 minutes for a day (the total number of minutes in a day). For various reasons (discomfort, sensor failure, loss or removal of the sensor), this reduces our dataset to 31113 participants, out of which we randomly sample 9034 participants while maintaining the overall distribution of activities for our experimental evaluation. Figure 2 shows the distribution of the six activities in the dataset.

https://www.ntnu.no/hunt

https://axivity.com/downloads/ax3

Figure 2: Summary of test dataset (9034 participants): Dis-

tribution of minutes spent per activity category over a period

of six days.
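The pre-processing rules described in this section can be sketched as a simple filter. The sketch rests on several assumptions that the text does not settle: that a record is one participant-day, that a record is dropped when any of lying, standing or sitting is zero, and that a participant is kept only with exactly six valid days. All names are hypothetical.

```python
MINUTES_PER_DAY = 1440
ACTIVITIES = ("lying", "sitting", "standing", "walking", "running", "cycling")


def is_valid_day(day):
    """day: dict activity -> minutes for one participant-day."""
    # Assumption: zero minutes in ANY of these three invalidates the record.
    if any(day[a] == 0 for a in ("lying", "standing", "sitting")):
        return False
    if day["walking"] < 1:
        return False
    # The activities of one day cannot add up to more than 1440 minutes.
    return sum(day[a] for a in ACTIVITIES) <= MINUTES_PER_DAY


def keep_participant(days, required_days=6):
    """Assumption: keep a participant only with six valid days of data."""
    return len(days) == required_days and all(is_valid_day(d) for d in days)


good_day = {"lying": 480, "sitting": 500, "standing": 200,
            "walking": 90, "running": 0, "cycling": 10}
no_walking = dict(good_day, walking=0)
too_long = dict(good_day, sitting=900)

print(keep_participant([good_day] * 6))
print(keep_participant([good_day] * 5 + [no_walking]))
```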

6 EXPERIMENTAL EVALUATION

We implemented the knowledge-intensive similarity-based clustering algorithm in Java (version 1.8) using the Java implementation of the myCBR tool. The CBR model for our dataset has been created in the myCBR workbench (Stahl and Roth-Berghofer, 2008) by importing the data from a CSV file. Similarity modelling of each activity attribute has been carried out in the same data-driven manner as presented in our previous work (Verma et al., 2018). We then used the CBR model in our Java implementation of the algorithm to compute any desired number of clusters.

6.1 Coherent Clusters

A new set of centroids in the knowledge-intensive similarity-based clustering algorithm may or may not give a better mean similarity of clusters than the previous centroids. As figure 3 shows, the mean similarity of clusters varies to a large degree with each progressive round of clustering, wherein each round represents a new set of centroids. These variations occur due to changes in the cluster membership of the cases. As the membership of cases in the case base evolves over several rounds, the movement of cases, especially edge cases, from one cluster to another may increase the mean similarity of the cluster they leave and decrease that of the cluster they join, or vice versa, thereby introducing positive as well as negative variations in the cluster mean similarity. These variations make it challenging to determine the optimal centroids and clusters at any given point in the algorithm.

https://github.com/ntnu-ai-lab/mycbr-sdk

Figure 3: Variation in average similarity of ten clusters over

multiple rounds.

Direct optimization of similarity-based clustering is an NP-hard problem (Yang et al., 2016). To overcome this challenge, we employ a strategy where the algorithm looks s steps (rounds) ahead to check whether it finds a set of centroids with a higher mean similarity than the current set of centroids. It declares convergence only when, after s steps, it has not found any new set of centroids with a higher mean similarity than the current maximum mean similarity. The objective now is to determine the step size s. It can be observed in figure 3 that the mean similarity undergoes a considerable amount of variation over multiple rounds. Therefore, s must be set large enough to look ahead a sufficient number of rounds before declaring convergence, but small enough to be computationally inexpensive for large datasets. The hypothesis here is that the probability of getting stuck in a local maximum is lower if the step size s is large enough to accommodate the variation observed in the mean similarity of clusters over multiple rounds, wherein each round consists of a new set of centroids.

Figure 4: Relative difference in the mean similarity of clus-

ters with the step size s for number of clusters n in the range

[2,100]: The y-axis of the graph represents the difference

between the maximum mean similarity and mean similarity

achieved at s, displayed by the x-axis, for each n.


We can observe in figure 4 that, as the number of clusters increases, the difference between the mean similarity achieved at any given s and the maximum mean similarity decreases. This indicates an inverse relation between the step size s and the number of clusters n. The value of s may differ depending on the size of the dataset and the number of clusters chosen; for our dataset, however, s = 50 seems to give a fair trade-off between time complexity and cluster coherence.
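The look-ahead stopping rule of this section can be sketched generically as follows. The sketch assumes that one clustering round is available as a function `step` and that the mean similarity of a clustering state can be computed by `mean_similarity`; both names, the `max_rounds` safeguard and the toy score sequence are hypothetical.

```python
def cluster_with_lookahead(step, mean_similarity, state, s=50, max_rounds=10_000):
    """Run rounds until no better state is found for s consecutive rounds.

    step(state)            -> next state (reassign cases, update centroids)
    mean_similarity(state) -> float to maximise
    """
    best_state, best_score = state, mean_similarity(state)
    since_improvement = 0
    for _ in range(max_rounds):
        state = step(state)
        score = mean_similarity(state)
        if score > best_score:
            best_state, best_score = state, score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= s:
                break  # looked s rounds ahead without finding anything better
    return best_state, best_score


# Toy stand-in: the "state" is a round index into a fluctuating score sequence.
scores = [0.50, 0.60, 0.55, 0.58, 0.70, 0.65, 0.60, 0.60, 0.60, 0.60]
step = lambda i: min(i + 1, len(scores) - 1)
score = lambda i: scores[i]

print(cluster_with_lookahead(step, score, 0, s=1))
print(cluster_with_lookahead(step, score, 0, s=3))
```

With s = 1 the toy run stops at the first dip and keeps the local maximum found in round 1, whereas s = 3 looks far enough ahead to reach the higher value in round 4, which mirrors the trade-off between step size and computational cost discussed above.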

6.2 Number of Clusters

Clustering splits a given data set into clusters according to a similarity metric, but the desired number of clusters must be specified in advance. Determining the optimal number of clusters in unsupervised clustering is a fundamental challenge and can be a daunting task. One way to determine the optimal number of clusters in K-Means is the elbow method, which involves plotting the sum of squared errors (SSE) against the number of clusters. As the SSE decreases with an increasing number of clusters, the optimal number of clusters is found by noting the elbow in the graph. In our case, however, we operate on the mean similarity of the clusters, which is expected to increase with the number of clusters, so we obtain a reverse elbow graph.

To determine the optimal number of similarity clusters, we plot the mean similarity of clusters against the number of clusters. With s = 50, we computed n clusters for n in the range [2,100] in order to learn the optimal number for our dataset. Five epochs were computed with n randomly chosen cases as initial centroids, wherein each epoch consists of reassigning cases and recomputing the centroids until the clusters converge. Afterwards, an average was computed from the mean similarity values of all five epochs. The results are shown in figure 5, where it can be observed that the mean similarity increases gradually until 20 clusters, followed by a slow but steady increase. This indicates that the optimal number of clusters for our dataset is at most 20. We need a more detailed analysis in order to uniquely identify the different phenotype clusters in our dataset and aim at achieving this goal using similarity-based clustering.

Figure 5: Similarity within clusters for the knowledge-intensive similarity-based clustering (step size s = 50).

Figure 6: Similarity within clusters for the K-Means clustering method.
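In this work the reverse elbow is read off the plot. As an illustration only, the averaging over epochs and a simple automatic heuristic for locating the bend can be sketched as below. The heuristic (the point with the largest vertical gap above the chord joining the first and last points of the curve) is a swapped-in technique, not the procedure used in this paper, and the toy values are invented.

```python
def average_epochs(epochs):
    """epochs: list of dicts n -> mean similarity; returns dict n -> average."""
    return {n: sum(e[n] for e in epochs) / len(epochs) for n in epochs[0]}


def knee(curve):
    """Return the n whose point lies farthest above the end-to-end chord."""
    ns = sorted(curve)
    n0, n1 = ns[0], ns[-1]
    y0, y1 = curve[n0], curve[n1]

    def gap(n):
        chord = y0 + (y1 - y0) * (n - n0) / (n1 - n0)
        return curve[n] - chord

    return max(ns, key=gap)


# Toy curve: steep rise up to n = 4, slow increase afterwards.
base = {2: 0.50, 3: 0.70, 4: 0.85, 5: 0.87, 6: 0.88, 7: 0.89, 8: 0.90}
epochs = [{n: y + 0.01 for n, y in base.items()},
          {n: y - 0.01 for n, y in base.items()}]

curve = average_epochs(epochs)
print(knee(curve))
```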

6.3 Assessment of Cluster Quality

We now evaluate the quality of the clusters computed within our dataset using our similarity-based clustering approach. We compare the performance of the proposed similarity-based clustering method with that of the state-of-the-art K-Means clustering method. The K-Means clustering algorithm was implemented using the scikit-learn library (Pedregosa et al., 2011) in Python (version 3.6.3).

To compare the results of both methods, we needed a common metric on which to base the comparison. Since our aim is to have clusters with a high degree of intra-cluster similarity, we decided to take the mean, minimum and maximum similarity as the metrics for comparing the methods. However, K-Means does not compute semantic similarity between two given data points. To overcome this hurdle, we implemented a REST API function in the myCBR Java package which allows us to compute the similarity of any two given cases, provided that the attribute values are within their respective ranges as defined in the CBR model. We then used POST calls to calculate the similarity between each case and its cluster centroid for each cluster obtained using the K-Means implementation. Five epochs were computed for both the K-Means and the similarity-based clustering methods. Each epoch consisted of reassigning cases and recomputing the means until the clusters converged. An average was then computed over all five epochs. The number of clusters n computed in each epoch was in the range [2,100].

Figures 5 and 6 show the minimum and maximum similarity for all the clusters in addition to the mean similarity for both similarity-based clustering and K-Means clustering. It can be observed from the results that the mean similarity and the minimum similarity for each number of clusters n are higher in similarity-based clustering, whereas there is not much difference in the maximum similarity. To further verify the difference in the results obtained by our algorithm and K-Means, we performed a t-test at significance levels α = 0.01 and α = 0.05 on the mean similarity values of the clusters obtained using both methods. The result is t-value = 2.87, p-value = 0.008, which is significant at both levels.
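The variant of the t-test used is not stated here. As one possible illustration, the sketch below computes Welch's two-sample t statistic and its approximate degrees of freedom with the Python standard library; the choice of Welch's test and the sample values are assumptions made for the example and do not reproduce the reported result.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Invented mean-similarity values for two clustering methods.
similarity_based = [0.90, 0.92, 0.94, 0.96]
k_means = [0.80, 0.82, 0.84, 0.86]

t, df = welch_t(similarity_based, k_means)
print(t, df)
```

The p-value follows from the t distribution with `df` degrees of freedom; with SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.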

Although the measurable difference between re-

sults obtained using K-Means and similarity-based

clustering appears to be small, the t-test results show

that the results obtained are signiﬁcantly different.

Moreover, the difference lies in the quality of the clus-

ters obtained using both the methods. As stated pre-

viously, our objective in this work has been to cre-

ate clusters wherein the cases within each cluster are

more similar to each other than to cases in other clus-

ters. In other words, if we were to query for m similar

cases for a particular case, say Participant 8921, we

would expect the most similar cases to be in the same

cluster as the queried case rather than in some other

cluster(s), except perhaps for the edge cases. We can

examine this by querying the case base in the myCBR

workbench and then verifying the cluster labels of the

m most similar cases in the clusters obtained using

both K-Means and similarity-based clustering meth-

ods. We choose n = 20 and make retrievals using two

randomly chosen cases with m = 6. Figure 7 presents

the results.

Figure 7: Examples showing the quality of clusters for K-Means vs similarity-based clusters. [Part.: Participant, Sim: Similarity].

Taking as reference the topmost record, which is the queried case itself, we can now compare the quality of the clusters obtained using both methods. In both examples, the most similar cases are placed in the same cluster by similarity-based clustering. K-Means, on the other hand, places most of the similar cases in different clusters. The examples presented in figure 7 support our hypothesis that our approach yields clusters of higher quality.
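The check illustrated in figure 7 (retrieve the m most similar cases for a query and inspect their cluster labels) can be expressed as a small score. The sketch below is a hypothetical helper, not part of the myCBR workbench: it excludes the query itself and returns the fraction of its m nearest cases that share its cluster, using an invented one-dimensional case base and a placeholder similarity.

```python
def same_cluster_fraction(query, cases, labels, sim, m=6):
    """Fraction of the m most similar cases that share the query's cluster."""
    others = [c for c in cases if c != query]
    top = sorted(others, key=lambda c: sim(cases[query], cases[c]),
                 reverse=True)[:m]
    return sum(labels[c] == labels[query] for c in top) / len(top)


# Toy case base: case id -> one-dimensional profile.
cases = {1: 0.0, 2: 1.0, 3: 2.0, 4: 10.0, 5: 11.0}
toy_sim = lambda a, b: 1.0 / (1.0 + abs(a - b))

coherent = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
incoherent = {1: "A", 2: "B", 3: "A", 4: "B", 5: "B"}

print(same_cluster_fraction(1, cases, coherent, toy_sim, m=2))
print(same_cluster_fraction(1, cases, incoherent, toy_sim, m=2))
```

Averaging this fraction over many randomly chosen query cases would give a quantitative counterpart to the two examples inspected manually.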

7 DISCUSSION & CONCLUSION

In this paper, we have presented a clustering algorithm which uses knowledge-intensive similarity as the metric for computing clusters in a case base. We evaluated the clustering method in a CBR application built for the HUNT4 physical behaviour dataset and demonstrated how coherent clusters can be obtained using an optimization strategy (see section 6.1). The experimental results shown in figures 5 and 6, along with the examples presented in figure 1, demonstrate the coherence as well as the diversity of the clusters obtained using our similarity-based clustering approach. As stated previously, conventional clustering methods such as K-Means have certain limitations which can be overcome using CBR. K-Means tends to overlook the complexity of the data and puts emphasis on the attributes which have a dominant presence in the data (such as lying) while ignoring the smaller but significant attributes (such as running). While a small change in the small attributes may not result in a very large difference in the similarity score, it can change the order of the similar cases. Thus, even though the cases in each K-Means cluster have a fairly high similarity to their cluster centroid, they are not necessarily very similar to each other.

We have demonstrated experimentally that the clusters obtained using our similarity-based clustering approach have higher intra-cluster similarity amongst the cases than the clusters obtained using the state-of-the-art K-Means clustering method. The difference in the results obtained has been found to be statistically significant. We therefore conclude that the results support our hypothesis and that the proposed similarity-based clustering algorithm provides better clusters than the K-Means clustering method for our dataset. The proposed algorithm is a suitable and viable option for our application and gives the desired coherent clusters. The proposed similarity-based clustering method can nevertheless be applied to other datasets as well, including mixed datasets, since the method is independent of the data types.

In future, we will investigate the physical be-

haviour proﬁles in more detail and use sequential

physical behaviour data for clustering profiles by

adding information such as the intensity, frequency

and duration of the activity bouts. Guidelines on

physical activity make it evident that recommendations

are needed that address the links among the type,

duration, intensity, frequency and total amount of

physical activity an individual needs in order to prevent

non-communicable diseases and general health issues. We

will extend our work to address this challenge by using

similarity-based clustering to determine more specialized

clusters and work towards identifying the physical

behaviour phenotypes in our dataset.

REFERENCES

Aamodt, A. and Plaza, E. (1994). Case-based reasoning:

Foundational issues, methodological variations, and

system approaches. Artificial Intelligence Communications,

7(1).

Adam, A. and Blockeel, H. (2015). Dealing with overlap-

ping clustering: A constraint-based approach to algo-

rithm selection. CEUR Workshop Proceedings, 1455.

Baretta, D., Sartori, F., Greco, A., D’Addario, M., Melen,

R., and Steca, P. (2019). Improving physical activity

mHealth interventions: Development of a computational

model of self-efficacy theory to define adaptive

goals for exercise promotion. Advances in Human-

Computer Interaction, 2019.

Coenen, P., Willenberg, L., Parry, S., Shi, J. W., Romero, L.,

Blackwood, D. M., Maher, C. G., Healy, G. N., Dun-

stan, D. W., and Straker, L. M. (2018). Associations

of occupational standing with musculoskeletal symp-

toms: a systematic review with meta-analysis. British

Journal of Sports Medicine, 52(3).

Cunningham, P. (2009). A taxonomy of similarity mecha-

nisms for case-based reasoning. IEEE Trans. Knowl.

Data Eng., 21.

Ekelund, U., Brown, W. J., Steene-Johannessen, J., Fager-

land, M. W., Owen, N., Powell, K. E., Bauman, A. E.,

and Lee, I.-M. (2019). A systematic review and har-

monised meta-analysis of data from 850 060 partici-

pants. British Journal of Sports Medicine, 53(14).

Fanoiki, T. O., Drummond, I., and Sandri, S. A. (2010).

Case-based reasoning retrieval and reuse using case

resemblance hypergraphs. In International Confer-

ence on Fuzzy Systems.

Howie, E. K., Smith, A. L., McVeigh, J. A., and Straker,

L. M. (2018). Accelerometer-derived activity pheno-

types in young adults: a latent class analysis. Interna-

tional Journal of Behavioral Medicine, 25(5).

Lagersted-Olsen, J., Korshøj, M., Skotte, J., Carneiro, I. G.,

Søgaard, K., and Holtermann, A. (2013). Comparison

of objectively measured and self-reported time spent

sitting. International Journal of Sports Medicine, 35(6).

Lucca, M. R. B., Junior, A. G. L., de Freitas, E. P., and Silva,

L. A. L. (2018). A case-based reasoning and clus-

tering framework for the development of intelligent

agents in simulation systems. In FLAIRS, Florida.

MacQueen, J. (1967). Some methods for classification and

analysis of multivariate observations. Berkeley, Calif.

University of California Press.

Marschollek, M. (2013). A semi-quantitative method to denote

generic physical activity phenotypes from long-term

accelerometer data – the ATLAS index. PLOS ONE,

8(5).

Müller, G. and Bergmann, R. (2014). A cluster-based

approach to improve similarity-based retrieval for

process-oriented case-based reasoning. ECAI’14. IOS

Press.

O’Driscoll, R., Turicchi, J., Beaulieu, K., Scott, S., Matu,

J., Deighton, K., Finlayson, G., and Stubbs, J. (2018).

How well do activity monitors estimate energy expen-

diture? a systematic review and meta-analysis of the

validity of current technologies. British Journal of

Sports Medicine.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,

Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,

Cournapeau, D., Brucher, M., Perrot, M., and Duch-

esnay, E. (2011). Scikit-learn: Machine learning in

Python. Journal of Machine Learning Research, 12.

Saunders, T. J., Larouche, R., Colley, R. C., and Tremblay,

M. S. (2012). Acute sedentary behaviour and markers

of cardiometabolic risk: a systematic review of intervention

studies. Journal of Nutrition and Metabolism.

Singh, K., Malik, D., and Sharma, N. (2011). Evolving

limitations in k-means algorithm in data mining and

their removal. International Journal of Computational

Engineering and Management, 12.

Smyth, B. and Cunningham, P. (2017). Running with cases:

A CBR approach to running your best marathon. In

Aha, D. W. and Lieber, J., editors, CBR Research and

Development, Cham. Springer.

Stahl, A. and Roth-Berghofer, T. R. (2008). Rapid prototyping

of CBR applications with the open source tool

myCBR. In ECCBR ’08. Springer-Verlag.

Verma, D., Bach, K., and Mork, P. J. (2018). Modelling

similarity for comparing physical activity profiles - a

data-driven approach. In Cox, M. T., Funk, P., and

Begum, S., editors, CBR Research and Development,

Cham. Springer.

Wiratunga, N., Craw, S., and Massie, S. (2003). Index

driven selective sampling for CBR. In Ashley, K. D.

and Bridge, D. G., editors, CBR Research and Devel-

opment. Springer.

Yang, Z., Corander, J., and Oja, E. (2016). Low-rank doubly

stochastic matrix decomposition for cluster analysis.

Journal of Machine Learning Research, 17(187).
