The Use of Ensemble Classiﬁcation and Clustering Methods of Machine

Learning in the Study of Internet Addiction of Students

Oksana V. Klochko

1 a

, Vasyl M. Fedorets

2 b

, Vitalii I. Klochko

3 c

and Maryna V. Kormer

4 d

Vinnytsia Mykhailo Kotsiubynskyi State Pedagogical University, 32 Ostrozhskogo Str., Vinnytsia, 21100, Ukraine

Vinnytsia Academy of Continuing Education, 13 Hrushevskoho Str., Vinnytsia, 21050, Ukraine

Vinnytsia National Technical University, 95 Khmelnytsky Highway, Vinnytsia, 21021, Ukraine

State University of Economics and Technology, 5 Stepana Tilhy Str., Kryvyi Rih, 50006, Ukraine

Keywords:

Machine Learning, Clustering Classiﬁcation, Internet Addiction, Detection of Internet Addiction, Internet

Disorders, Internet Addiction of Students, Expectation Maximization, Farthest First, K-Means, AdaBoost,

Bagging, Random Forest, Vote.

Abstract:

One of the relevant current vectors of study in machine learning is the analysis of the application peculiarities

for methods of solving a speciﬁc problem. We will study this issue on the example of methods of solving

the clustering and classiﬁcation problem. Currently, we have a considerable number of machine learning al-

gorithms – e.g. Expectation Maximization, Farthest First, K-Means, Expectation-Maximization, Hierarchical

Clustering, Support vector machines, K-nearest neighbor, Logistic regression, Random Forest etc. – which can

be used for clustering and classiﬁcation. However, not all methods can be used for solving a speciﬁc task. The

article describes the technology of empirical comparison of methods of clustering and classiﬁcation problems

solving using WEKA free software for machine learning. Empirical comparison of data clustering methods

was based on the results of a survey conducted among students majoring in Computer Studies and dedicated

to detecting signs of Internet Addiction (IA) (Internet Addiction is a behavioural disorder that occurs due to

Internet misuse). As a continuation of the study of Internet Addiction of students, a survey of students of other

specialties was conducted. Ensemble methods of machine learning classiﬁcation were used to analyze these

data. Empirical comparison of clustering algorithms (Expectation Maximization, Farthest First and K-Means)

and ensemble classiﬁcation algorithms (AdaBoost, Bagging, Random Forest and Vote) with the application

of the WEKA machine learning system had the following results: it described the peculiarities of application

of these methods in feature clustering and classiﬁcation, the authors developed data instances’ clustering and

classiﬁcation models to detect signs of Internet addiction among students, the study concludes that these meth-

ods may be applicable to development of models detecting respondents with signs of IA related disorders and

risk groups.

1 INTRODUCTION

One of the areas of research, particularly, such as

education, healthcare and life safety, is the empiri-

cal analysis of methods of solving a speciﬁc problem

with the using of the machine learning (Zahorodko

et al., 2021; Zelinska, 2020). Let us study this issue

on the example of methods of solving the clustering

and classiﬁcation problems (Tarasenko et al., 2019).

https://orcid.org/0000-0002-6505-9455

https://orcid.org/0000-0001-9936-3458

https://orcid.org/0000-0002-9415-4451

https://orcid.org/0000-0002-6509-0794

Clustering methods are statistic methods of data

analysis that enable people to group the given selec-

tion of data samples into clusters, classes, taxons de-

pending on the value of their attributes; each of these

groups has certain characteristics. The main idea is

to use several clustering methods in order to carry out

an empirical comparison study and determine which

methods ensure the most optimal data grouping while

solving a speciﬁc problem.

Machine learning classiﬁes clustering problems as

problems for unsupervised learning. Currently, there

is a considerable number of machine learning algo-

rithms that can be used for clustering, for instance,

Expectation Maximization, Farthest First, K-Means,

Klochko, O., Fedorets, V., Klochko, V. and Kormer, M.

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students.

DOI: 10.5220/0010923500003364

In Proceedings of the 1st Symposium on Advances in Educational Technology (AET 2020) - Volume 1, pages 241-260

ISBN: 978-989-758-558-6

241

K-Medians, Hierarchical Clustering etc. But not all

of them are suitable for solving a speciﬁc problem.

Data clustering algorithms differ by the cluster model

type, the algorithm model type, the nesting hierarchy

of clusters, the way of implementation depending on

the data set etc. Because of this, there are also certain

requirements to the data set parameters.

Classiﬁcation tasks are considered to be a differ-

ent group of machine learning tasks – namely, super-

vised learning. However, nowadays, there are devel-

oped classiﬁcation algorithms based on a combina-

tion of supervised learning and unsupervised learning

(e.g., Learning Vector Quantization) the classiﬁcation

algorithm in machine learning is built on the basis of

the preset ﬁnite number of objects divided into groups

and allows to classify an arbitrary object, if it is un-

known which group it belongs to. There is a con-

siderable number of machine learning algorithms that

can be applied to solve classiﬁcation tasks, e.g. Sup-

port vector machines (SVMs), Logistic regression, K-

nearest neighbor, Linear discriminant analysis (LDA),

etc. However, currently, one can get more accurate re-

sults by using ensemble methods.

Popular software products used in machine learn-

ing include TensorFlow, WEKA, MATLAB, MXNet,

Torch, PyTorch, Microsoft Azure Machine Learning

Studio and others.

In this study, we use the WEKA (Waikato Envi-

ronment for Knowledge Analysis) free machine learn-

ing software (Weka, 2021). The free WEKA machine

learning system gives direct access to the library of

implemented algorithms written in Java.

Analysis of contemporary studies and publica-

tions shows that the issue of analysis and selection

of the machine learning method, which would be op-

timal for processing a concrete data set, is popular in

the scientiﬁc circles. A considerable number of these

studies is dedicated to the application of machine

learning methods in the ﬁelds of education, healthcare

and life safety.

In healthcare and education sentiment analysis be-

comes more and more popular. Thus, Pacol and

Palaoag (Pacol and Palaoag, 2020) conducted the sen-

timent analysis of the Textual Feedback of students

regarding the work of the professors using Machine

Learning Techniques. In the sentiment classiﬁcation

study, the Random Forest algorithm proved to be most

effective; it proved to be more effective than base

models of Support vector machines, Naive Bayes, Lo-

gistic regression algorithms and their ensembles (Pa-

col and Palaoag, 2020).

Klochko et al. (Klochko et al., 2020) applied clus-

tering algorithms of machine learning for the anal-

ysis of typical mistakes pupils make, the selection

and adaptation of the content of learning to concrete

groups of pupils that was then supposed to be used for

ﬂipped learning with the use of virtual learning envi-

ronment. The analysis was done through the compar-

ison of clusters, deﬁned by learning results demon-

strated by the pupils, using Canopy, Expectation Max-

imization and Farthest First algorithms.

Souri et al. (Souri et al., 2020) suggested a model

based on the Internet of Things technologies for mon-

itoring the indicators of students’ health in order to

detect biological and behavioral changes. The de-

veloped model, when used together with the Support

Vector Machine (SVM) algorithm, reaches the high-

est accuracy of 99.1% (Souri et al., 2020).

Hussain et al. (Hussain et al., 2019) studied

the application of the machine learning classiﬁcation

methods to ﬁnd ways to ensure independent daily liv-

ing of people who have Alzheimer’s disease. The idea

of the study is to analyze the data registered by differ-

ent equipment in order to determine the changes in a

person’s behavior that are relevant for the daily life

and social interaction. The paper gives a comparison

of the efﬁciency levels of ﬁve machine learning clas-

siﬁcation techniques used for the recognition of a per-

son’s activity (and his/her psychological status). Ex-

perimental ﬁndings show that compared to traditional

methodologies, these approaches give better results in

determining the activity of the person and his/her psy-

chological and behavioral peculiarities.

amer et al. (Kr

amer et al., 2019) studied the

speed and efﬁciency of medical aid provision us-

ing the databases of the Hospital ER. Applying the

Random forest algorithm, the authors developed the

model based on the data about the patient’s pro-

visional diagnosis. The use of the controlled ma-

chine learning method and model training based on

the opinion of a specialized doctor allowed them to

achieve high forecasting accuracy (96%) as well as

the area under the receiver operating curve (>0.99).

Subasi et al. (Subasi et al., 2019) developed a hy-

brid model of detecting epileptic ﬁts using the Ge-

netic Algorithm (GA) and Particle Swarm Optimiza-

tion (PSO) to determine the optimal parameters of ap-

plication of the Support Vector Machine (SVM) algo-

rithm. The hybrid algorithm that they suggested can

demonstrate data set classiﬁcation accuracy of up to

99.38%.

A considerable number of papers appeared, which

are dedicated to diagnosing Internet addiction (IA)

and studying the mechanisms of this disorder among

various social groups. The appearance and use of

the Internet has many beneﬁts. However, at the same

time, disorders related to pathological use of the Inter-

net are becoming a social as well as a psychological

AET 2020 - Symposium on Advances in Educational Technology

242

problem. Currently, we face an important psycholog-

ical, sociocultural and educational issue of detection

and prevention of certain pathologies and steady pre-

morbid conditions (state before the disease) caused by

inadequate Internet use. Cases of IA were ﬁrst men-

tioned in 1995 and attracted considerable attention.

Issues related to this one became the research subject

of Yuryeva and Bolbot (Yuryeva and Bolbot, 2006),

Derhach (Derhach, 2016) and others. Internet Ad-

diction Disorder (IAD) is also called Pathological In-

ternet Use (PIU). The term “Internet Addiction” was

ﬁrst suggested by Ivan K. Goldberg in 1995. He de-

scribes net addiction as a speciﬁc pathology charac-

terized by a wide spectrum of behavioral and impulse

control disorders (lack of control, absence of volun-

tary regulation) (Abbott et al., 1995). In 1996 Gold-

berg made the ﬁrst attempt to determine groups of be-

havioural and psychological signs and symptoms of

IA (Wallis, 1997), namely: tolerance; abstinence syn-

drome; difﬁculties in voluntary regulation of Internet-

behaviour; increase of time and ﬁnancial investments

in things related to Internet or computer use; a shift

of a person’s interests towards Internet-related ac-

tivities; extensive Internet use that leads to malad-

justment. In 1998 Young (Young, 1998a,b) deﬁned

IAD as an impulsive-compulsive disorder, which has

speciﬁc signs or addictions: cyber-sexual addiction,

cyber-relationship addiction, net compulsions, infor-

mation overload and computer addiction. IAD is

not ofﬁcially included into ICD-11 for Mortality and

Morbidity Statistics (Version: 09/2020), however, in

section 6C51 Gaming disorder the “Gaming disor-

der” is described as a “pattern of persistent or recur-

rent gaming behaviour (‘digital gaming’ or ‘video-

gaming’), which may be online (i.e., over the Inter-

net)” (ICD-11 for Mortality and Morbidity Statistics ,

Version: 09/2020).

Even though the problem of IA is becoming more

and more relevant, there are not enough scientiﬁc pa-

pers dedicated to the study of this issue with the help

of machine learning methods. Let us look at some

of them. On the basis of the Support Vector Ma-

chine algorithm, including the C-SVM and v-SVM,

and applying the Student’s t-test to the data set of the

survey conducted among 2,397 Chinese students, Di

et al. (Di et al., 2019) proved the utility of using ma-

chine learning methods for detecting and forecasting

the risk of IA. Hsieh et al. (Hsieh et al., 2019) sug-

gested using the EMBAR protected system of web-

services based on the ensemble classiﬁcation meth-

ods and case-based reasoning to study the IA of the

users and prevent the development of this disorder at

the initial stages. Ji et al. (Ji et al., 2019) are cur-

rently continuing their research, which aims to cre-

ate an IA detector that would work in a real-time

mode. The authors suggest studying this issue using

an adapted system of continuous real-coded variables

(XCSR), which determines the level of Internet ad-

diction (high-risk and low-risk) on the basis of the in-

formation about the Internet users using the Chen In-

ternet addiction scale (CIAS) or respiratory instanta-

neous frequency (IF). Suma et al. (Suma et al., 2021)

studied the possibilities for predicting IA based on a

set of predictor variables using the Random forest al-

gorithm.

Thus, based on the above presented statement of

the problem as well as taking into consideration the

insufﬁcient amount of research on the application of

machine learning methods to IA diagnosing, we de-

termine the aim of our research, which is to determine

the ﬁelds of use and conduct an empirical comparison

of ensemble classiﬁcation and clustering methods of

machine learning in the study of IA disorder of stu-

dents.

2 SELECTION OF METHODS

AND DIAGNOSTICS

The study of IA disorder of pupils had two stages.

The ﬁrst sage was conducted in 2019, its purpose was

to determine the possible ﬁelds of use as well as an

empirical comparison of clustering methods of ma-

chine learning for studying IA disorders of students.

During the second stage, in 2019–2021, the authors

studied possible ﬁelds of use as well as an empir-

ical comparison of ensemble classiﬁcation methods

of machine learning for studying IA disorders of stu-

dents.

At the ﬁrst stage, data regarding the spread and

severity of IA among students majoring in Computer

Sciences were received from an online survey, which

used a questionnaire drafted with the help of Google

Forms. 262 students majoring in Computer Sciences

and coming from different regions of Ukraine partic-

ipated in the experimental study. The data set is pre-

sented in the ARFF format and consists of 8 attributes

(ﬁgure 1) (Klochko and Fedorets, 2019). The data set

contains the ﬁelds described in table 1.

Cluster analysis is one of the tasks of database

mining. Cluster analysis is a set of methods of mul-

tidimensional observations or objects classiﬁcation,

based on deﬁning the concept of distance between the

objects and their subsequent grouping (into clusters,

taxons, classes). The selection of a concrete cluster

analysis method depends on the purpose of classiﬁca-

tion (Klochko, 2019). At the same time, one does not

need a priori information about the statistical popula-

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

243

Table 1: Data structure on the state of IA among students majoring in Computer Sciences.

Attributes Contents/Questions Type Statistics

age Age of the student Numeric Minimum 16

Maximum 59

Mean 19.756

StdDev 6.806

sex Student’s sex Nominal Female 199

Male 63

3 Can’t imagine my life without the Internet Nominal yes 184

undeﬁned 39

no 39

4 When I cannot use the Internet I feel anxiety, irritation Nominal yes 81

undeﬁned 134

no 47

5 I like “surﬁng” the Net without a clearly deﬁned purpose Nominal yes 121

undeﬁned 112

no 29

6 I can give up from food, sleep, going to classes, if a have a Nominal yes 248

chance to use the Internet for free undeﬁned 7

no 7

7 I prefer meeting new people over the Internet rather than Nominal yes 185

in real life undeﬁned 37

no 40

8 I often feel that I’ve spent not enough time playing Nominal yes 178

computer games over the Internet, I constantly wish undeﬁned 61

to play longer no 23

Figure 1: Data set on the state of IA among students major-

ing in Computer Sciences, presented in the ARFF format.

tion. This approach is based on the following presup-

positions: objects that have a certain number of sim-

ilar (different) features group in one segment (clus-

ter). The level of similarity (difference) between the

objects that belong to one segment (cluster) must be

higher than the level of their similarity with the ob-

jects that belong to other segments (Klochko, 2019).

Let us look at one of cluster analysis algorithms

(Klochko, 2019).

Output matrix:

X =







. . . x







Let us move to the matrix of standardized Z values

with elements:

i j

− ¯x

;

where j = 1, 2, ·· · ,n — index number, i = 1, 2, ··· ,m

– observation number;

¯x

∑

i=1

i j

;

∑

i=1

i j

− x

)

i j

) − (x

)

There are several ways to deﬁne the distance be-

tween two observations z

and z

: weighted Euclidean

AET 2020 - Symposium on Advances in Educational Technology

244

distance, which is determined by the formula

, z

) =

∑

l=1

− z

)

where w

is the “weight” of index; 0 < w

≤ 1; if

= 1 for all l = 1, 2, . . . ,n, then we get the usual

Euclidean distance

, z

) =

∑

l=1

− z

)

Hamming distance:

, z

) =

∑

l=1

− z

in most cases this way of distance measuring gives the

same result as the usual Euclidean distance, but in this

case the inﬂuence of non-systemic large differences

(runouts) decreases.

Chebyshev distance:

BCH

, z

) = max

1≤l≤n

− z

it is best to apply this distance in order to determine

the differences existing between the two objects using

only one dimension.

Mahalanobis distance:

, z

) =

− z

)

−1

− z

where S is covariance matrix; this distance measure-

ment gives good results when applied to a concrete

data group, but it does not work very well, if the co-

variance matrix is calculated for the whole data set.

Distance between peaks:

, z

) =

∑

l=1

− z

+ z

presupposes independence of random variables,

which indicates the distance in the orthogonal space.

It is best to choose from the above described dis-

tance measures after the consideration of the structure

and characteristics of the data sample.

Let us present the received measurements in the

form of distance matrix:

R =







0 ρ

. . . ρ

0 ρ

. . . ρ

. . . 0







As the R matrix is symmetric, i.e. ρ

= ρ

we may conﬁne ourselves to off-diagonal matrix el-

ements. Using the distance matrix, we can imple-

ment the agglomerative hierarchic procedure of clus-

ter analysis. Distances between clusters are deter-

mined as the closest or the farthest ones. In the ﬁrst

case, the distance between the clusters is the one be-

tween the closest elements of these clusters, in the

second case, it is the one between the two farthermost

located. The principle of the work of agglomerative

hierarchic procedures lies in a consequent grouping of

elements, starting from the ones closest to each other

and those that are farther and farther apart. During

the ﬁrst step of the algorithm, every observation z

(i = 1, 2, ..., m) is viewed as a separate cluster. Then,

during every next step of the work of the algorithm,

two closest located clusters are grouped together and

then once again the distance matrix is built, but its

dimension decreases by one. The algorithm stops its

work when all the observations are grouped into clus-

ters.

Let us look at the algorithms we used while clus-

tering the data set regarding the state of IA disorder

among students majoring in Computer Sciences:

EM (Expectation Maximization):

Determines the probability distribution for every

object, which indicated its belongingness to each

cluster. EM methods (Keng, 2016): Maximum Like-

lihood Estimation (MLE) or Maximum a Posteriori

(MAP). Description of the algorithm is shown in ﬁg-

ure 2 (Keng, 2016): at the E-stage (expectation) we

calculate the estimated likelihood; at the M-stage

(Maximization) we calculate the maximum likelihood

estimation, increasing the expected likelihood, calcu-

lated at the E-stage; its value is used for the E-stage at

the next iteration. The algorithm is repeated until its

convergence.

K-Means algorithm:

Aims to partition n observations into k clusters in

such a way that each observation belongs to the clus-

ter with the nearest mean value. The shortest distance

between the observations and the nearest mean value

may be calculated by minimizing the sum of squares

of the distances (Linoff and Berry, 2011) (ﬁgure 3).

Farthest First algorithm:

This is a modiﬁcation of a K-Means algorithm, in

which the initial selection of centroids is 2 and higher.

Centroids are determined following the remoteness

principle, i.e. the point farthest from the rest is se-

lected ﬁrst. The Farthest First algorithm is described

in ﬁgure 4 (Dasgupta and Long, 2005).

During the second stage of the survey on the sit-

uation with IA among students of various specialties

was conducted with the help of Google Forms. 363

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

245

Figure 2: Description of how the algorithm EM works from 10,000 feet (Keng, 2016).

Figure 3: K-Means algorithm (Linoff and Berry, 2011).

students from different regions of Ukraine took part

in the survey. The data set is presented in the ARFF

format and consists of 9 attributes (ﬁgure 5). The data

set contains the ﬁelds described in table 2.

Cluster analysis is one of the tasks of database

mining. Cluster analysis is a set of methods of mul-

tidimensional observations or objects classiﬁcation,

based on deﬁning the concept of distance between the

objects and their subsequent grouping (into clusters,

taxons, classes). The selection of a concrete cluster

analysis method depends on the purpose of classiﬁca-

tion (Klochko, 2019). At the same time, one does not

need a priori information about the statistical popula-

tion. This approach is based on the following presup-

positions: objects that have a certain number of sim-

ilar (different) features group in one segment (clus-

ter). The level of similarity (difference) between the

objects that belong to one segment (cluster) must be

higher than the level of their similarity with the ob-

jects that belong to other segments (Klochko, 2019).

In order to analyze the IA phenomenon, we di-

vide the respondents into three groups (Signiﬁcant

Risk (SR), Insigniﬁcant Risk (IR), No Risk (NR)).

The division is based on the integrative use of qual-

itative and quantitative characteristics of the IA phe-

nomenon. The SR group is formed on the basis of

detecting and analysis of those IA features, which sig-

nify qualitative changes of the psychological status of

a personality. The selection of such features is car-

ries out on the basis of traditional understanding of

the fact that in-depth psychic changes related to the

formation of addictive behavior concern, primarily,

vital (Balatskiy, 2008) and existential “foundations”

of a personality. Such vital and existential (ontologi-

cal) “foundations” are relatively stable and are subject

to “external” transformation if the inﬂuence is sig-

niﬁcant and long-lasting. The changes concern the

existential dimension (Frankl, 1985), vital resources

(Balatskiy, 2008) and intentions as well as attitudes

and behavioral stereotypes aimed at survival and life

preservation. They are linked with the vital “founda-

tions” of life itself.

The questions which reﬂect the above stated life

“foundations” or vital resources are (table 2): “I can

give up food, sleep, going to classes, if I have a chance

to use the Internet for free” (1

SR) and “When I

cannot use the Internet, I feel anxiety, irritation” (2

SR). The 1

SR seems to be more important as food

and sleep are system organizing and basic vital needs.

The ability to give them up indicates not only the “to-

tal”, in-depth and comprehensive change of the hier-

archy of vital needs, values and senses (Frankl, 1985;

AET 2020 - Symposium on Advances in Educational Technology

246

Figure 4: Farthest-ﬁrst traversal of a data set (Dasgupta and Long, 2005). Take the distance from a point x to a set S to be

d(x, S) = min

y∈S

d(x, y) (Dasgupta and Long, 2005).

Figure 5: Data set on the state of IA of students, presented

in the ARFF format.

Leontyev, 2017), but also the deformation of very vi-

tal “foundation” of a personality. The stated issues of

food and sleep indirectly reﬂect the existential prob-

lems of a person. The is caused by the fact these

issues concern the existential problem of “life and

death” and the “I am” existential phenomenon. That

is why, while IA is being formed, the existential prob-

lem is also being developed, which is temporarily and

compensatory solved with a “potential possibility of

Internet access”.

In the ﬁrst (1

SR) question, the “can give up

classes” part is an important social and personal-

ity oriented aspect. If the answer is positive, that

means that the content embedded in the afore men-

tioned fragment is ignored and desemantized. This

discloses the presence of desemantization and depre-

ciation of the possible socio-economically “settled”

future and a conscious self-limitation in the ﬁeld of

self-actualization in studying and professional activ-

ity. Another relevant point is ignoring communica-

tion, social ties, possibilities for self-improvement

and “construction” of self in the educational dis-

course. The stated needs and aims are partially

changed and substituted by the Internet. At the same

time the “real” reality is replaced, deactualized and

desemantized. The second (2

SR) question reﬂects

the presence of neurotic anxiety, which is a mani-

festation of ”exhaustion” and ”overstrain” of the ner-

vous system as well as of certain changes in the sys-

tem of emotions and volition. Moreover, in this case,

the problem of sense formation and understanding

occurs as well as the corresponding changes in the

value-conceptual ﬁeld. This is what V. Frankl (Frankl,

1985) described and logoneurosis – the loss and / or

absence of sublime and vital senses. The presence

of a neurotic aspect in the form of neurotic anxiety

is particularized through actualization in the 2

question of the irritation phenomenon (“. . . feel . . .

irritation”).

In general, the 2

SR question supplements, par-

ticularizes and “strengthens” the deﬁcit and “narrow-

ing” of vitality, life creativity, nature correspond-

ing existence, healthy life preservation instinct (food,

sleep, communication). The stated vital resources and

life creativity, which are actualized and problematized

in the 1

SR and 2

SR question, act as a com-

plex diagnostic sub-system. It is aimed at detect-

ing systemic, comprehensive stable, in-depth, vital,

personal-psychological problems (i.e. disorders). The

stated problems may develop and together transform

into AI.

The 1

SR and 2

SR questions, which reﬂect

the vitality pf a person and his/her existence, disclose

the qualitative difference of the SR group from IR and

NR groups, which do not have the stated peculiarities.

Thus, with a certain degree of certainty, the SR group

may be represented as well as diagnosed with a rel-

atively limited number of questions (1

SR and 2

SR). The questions, which represent the SR group in-

dicate that the Internet has “penetrated” deep into the

consciousness and into the core of a personality, in

his/her vitality, into human existence (ﬁgure 6).

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

247

Table 2: Data structure on the state of IA of students.

Attributes Contents/Questions Type Statistics

age Age of the student Numeric Minimum 15

Maximum 59

Mean 20.306

StdDev 7.238

sex Student’s sex Nominal Female 260

Male 103

3 Can’t imagine my life without the Internet Nominal yes 245

undeﬁned 53

no 65

4 When I cannot use the Internet I feel anxiety, irritation Nominal yes 110

undeﬁned 194

no 59

5 I like “surﬁng” the Net without a clearly deﬁned purpose Nominal yes 169

undeﬁned 156

no 38

6 I can give up from food, sleep, going to classes, if a have a Nominal yes 343

chance to use the Internet for free undeﬁned 8

no 12

7 I prefer meeting new people over the Internet rather than Nominal yes 263

in real life undeﬁned 48

no 520

8 I often feel that I’ve spent not enough time playing Nominal yes 243

computer games over the Internet, I constantly wish undeﬁned 86

to play longer no 343

IA IA disorder Nominal nr 113

ir 217

sr 33

Thus, the SR group is qualitatively different from

IR and NR. The SR group represents either a con-

siderable risk of IA development or a transitive (pre-

morbid) state or even the presence of the actual IA

pathology whereas the IR and NR groups indicate a

greater or lesser possibility of its development.

The IR group is diagnosed by questions 3, 5, 7 and

8 of table 2, which reﬂect “weak” signs. The signs re-

ﬂected in these questions are not attributive or essen-

tial. Accordingly, they do not represent the in-depth

personality-psychological, vital and existential aspect

of IA. That is why, these questions, by summing up a

certain number of them (in this case, three), may form

a certain degree of probability for having IA risks.

These questions get a certain level of “consistency”

when there is a certain number of them, in this case,

not less than three.

Thus, the IR group is characterized (diagnosed)

by the presence of three questions out of questions

3 (1

IR), 5 (2

IR), 7 (3

IR) and 8 (4

IR) of

table 2. The 1

IR question discloses contemporary

reality of professional activity and communication, in

which the Internet component id relevant, systemic,

environmental and signiﬁcant. Thus, taking into con-

sideration current systemic Internet-oriented socio-

technological and technological contexts, this ques-

tion does not reveal the totality and explicitness of

personality-psychological changes. It primarily indi-

cates considerably high level or signiﬁcance and even

value of the Internet in the life of a person. The 2

IR question characterizes the peculiarity of the con-

temporary Internet-culture. It reﬂects the fact that

the person has an actualized orientation-searching re-

ﬂex and a corresponding search and cognitive behav-

ior, and not just that possibility of IA development.

While disclosing the peculiarities of modern-day In-

ternet communication, the 3

IR question also par-

tially characterizes the problem of the insufﬁciently

developed communicative competence and commu-

nicative culture of a personality, which, in turn, is ef-

fectively compensated with Internet-communication.

As for the 4

IR, the signiﬁcant question is the one

which indicates an integrative manifestation of the

activity (leading aspect), cognitive, value-conceptual,

creative and communicative dimensions of the psy-

chic reality. The positive answer partially indicates

AET 2020 - Symposium on Advances in Educational Technology

248

Figure 6: “Penetration” of the Internet into the core of a personality, into vitality, into existence, which illustrates the psycho-

logical mechanisms of risk formation in the SR group.

the presence of insufﬁcient risks. At the same time,

in its essence, a person is a creature that plays –

Homo ludens (Huizinga, 2016). What is important,

is the fact that at an early childhood a game is a spe-

ciﬁc integrative and integrating form of activity and

the essence of human existence. The states gaming

essence of a person can manifest itself in course of

solving complicated tasks, studying as well as dur-

ing leisure time. A computer game has a consider-

able mobilizing, emotional, orientation-search poten-

tial. A computer game can become addictive, both

due to the “gaming” peculiarities of human nature and

as a result of the professionally developed games that

take human psychology into account. A computer

game addiction thus indicates not only not as much

the risks of IA development, but rather the presence

of the “gaming essence” of Homo ludens. In addition,

a considerable number of teenagers and grown-up de-

velop computer game addiction as a consequence of

the fact that they did not have a chance to fully real-

ize their gaming potential in early childhood. This

often happens due to intense early learning, which

competes with gaming activity. The stated principle

of competition between different forms of activity is

described in Anokhin (Anokhin, 1968) study on func-

tional systems (Sudakov, 2011). At the same time,

constant interest in playing computer games poses a

certain threat of the development of a computer game

and Internet addictions as it “touches” different psy-

chic spheres.

At the same time, if a person gives at least three

positive answers to the 1

IR, 2

IR, 3

IR and 4

IR questions, this indicates a certain risk of IA devel-

opment. This is caused by the fact that each question

characterizes the inﬂuence of the Internet on a certain

aspect of a personality: the 1

IR – on the need and

value-conceptual aspect, the 2

– on the cognitive as-

pect, including the orientation ability; the 3

– on the

communicative aspect, the 4

– on the activity com-

ponent. Thus, actualization of the Internet as a need,

value-conceptual, cognitive, communicative and ac-

tivity phenomenon that is signiﬁcant for a personal-

ity speaks of its certain “spread” and “rootedness” in

the stated aspects (spheres) as well as about its corre-

sponding signiﬁcance and value. A certain “Internet

locus” is formed in various spheres of a personality.

Thus, as a result of systemic interiorization processes,

the Internet “integrates” into a person’s conscious-

ness, becoming a signiﬁcant phenomenon (ﬁgure 7).

At the same time, the stated “integration” is “super-

ﬁcial”, reversal, unstable, and such that does not lead

to maladaptation or personality-psychological and be-

havioral changes.

The totality of the “spread” or “expansion” of the

Internet on the psychic reality, as a signiﬁcant phe-

nomenon for several spheres of consciousness, cre-

ates certain (but considerably insigniﬁcant) risks of

IA development. At the same time, the more spheres

“contain” the “Internet locus”, the higher the risks are

as this means the increase of the number of opportuni-

ties for forming the summing up and synergy effects,

which lead to qualitative changes.

NR is a group of questions, which characterize

separate features of Internet inﬂuence. The personal-

ity demonstrates local, superﬁcial, reactive reactions.

That is why the people that fall into this group, the

risks are almost absent or are rather insigniﬁcant.

The classiﬁcation ensemble methods of machine

learning were used to study the presence of IA dis-

order in students. The ensemble methods combine

a few algorithms that learn simultaneously and com-

pensate or correct mistakes of one another. Such ap-

proaches as stacking, bagging (bootstrap aggregat-

ing) and boosting are used while developing ensem-

ble methods. Stacking used the approach of meta-

learning in order to best combine a few machine

learning models. On the basis of the basic-level mod-

els, the algorithm is taught. Using these results, the

meta-model learns to better combine the predictions

of basic models. Bagging uses multiple teaching of

an ensemble of classiﬁers on random data sets, which

is conducted simultaneously but independently from

one another. Then, a determined averaging of results

is conducted. The results are averaged on the basis of

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

249

Figure 7: Inﬂuence of the Internet on various spheres of human psychic.

a determined strategy. Boosting carries out a consec-

utive adaptive algorithm teaching. The next algorithm

learns through focusing on classiﬁcation mistakes of

the ﬁrst algorithm. In this research, the authors used

such ensemble classiﬁcation algorithms as AdaBoost,

Bagging, Random Forest and Vote.

The WEKA machine learning system uses the Ad-

aBoost algorithm.M1 (ﬁgure 8) (Freund and Schapire,

1996; Santos and de Barros, 2020). The Ad-

aBoost meta-algorithm improves the efﬁciency of ba-

sic learning algorithms by building their combination.

It uses adaptive boosting, building every next classi-

ﬁer according to the instance that were badly clas-

siﬁed by previous classiﬁers. Having determined a

weak classiﬁer in the cycle, AdaBoost re-assigns the

weights, and at every iteration the weights of incor-

rectly classiﬁed instances increase. By testing classi-

ﬁers in such a way, the AdaBoost algorithm selects a

classiﬁer that better identiﬁes the instances.

The Bagging (bootstrap aggregation) meta-

algorithm uses compositions of algorithms each of

which learns independently from one another; to

determine the ﬁnal result, the process called voting is

being implemented, as a result of which, the mistakes

of the classiﬁers are compensated (Breiman, 1996;

Tuysuzoglu and Birant, 2020) (ﬁgure 9).

According to Leo Breiman’s deﬁnition, “a Ran-

dom Forest is a classiﬁer consisting of a collection of

tree-structured classiﬁers {h(x, Θ

), k = 1, ... } where

the {Θ

} are independent identically distributed ran-

dom vectors and each tree casts a unit vote for the

most popular class at input x” (ﬁgure 10) (Breiman,

2001; Zhong et al., 2020).

The voting classiﬁer combines different classiﬁers

that learn and are assessed simultaneously (Kittler

et al., 1998; Kuncheva, 2014). The ﬁnal decision

regarding the prediction is taken by a majority vote

following two strategies. In hard voting (majority

voting), the class label is predicted, which is deter-

mined by a majority of votes of every classiﬁer (Kit-

tler et al., 1998; Kuncheva, 2014). In soft voting,

probability vectors for every predicted class (for all

classiﬁers) are summed up and averaged and the class

with the highest value is selected (Kittler et al., 1998;

Kuncheva, 2014).

3 RESULTS AND DISCUSSION

To cluster data using the WEKA platform, we will use

Weka.clusterers.EM, Weka.clusterers.SimpleKMeans

and Weka.clusterers.FarthestFirst algorithms (Weka,

2021).

We check the application of clustering algorithms

that can be assigned to two classes of clustering al-

gorithms, i.e. distribution based (Expectation Max-

imization) and centroid-based (K-Means, Farthest

First). Such selection is motivated by the fact these

algorithms have long been used to cluster different

types of data in many ﬁelds and are considered to be

effective.

Dunn, DB, SD, CDbw and S Dbw were selected

as validity indices for testing (da Silva et al., 2019;

Moshtaghi et al., 2019) (table 3). In the CDbw in-

dex the distance from the point to multitude set in

the course of selecting cluster element can be calcu-

lated in different ways. In this study, we use the sum

of distances of already existing “representatives” of

the cluster to each cluster element to calculate this

distance. The element, on which the maximum was

reached, was selected as the next “representative” of

the cluster.

If the data set has no cluster structure, then such

situation is not determined with the help of validity

metrics. While using K-Means and Farthest First (ta-

ble 2) the numbers of clusters for the two algorithms

that were selected as optimal by the majority of in-

dices, can only nominally be deﬁned as cluster struc-

ture. As the work of Expectation Maximization algo-

rithm is based on determining the probability of eval-

uating maximum similarity, the indices calculated for

this algorithm are more homogenous. The structure,

which is characterized by a small number of clusters

that also have to be compact and separable, is deter-

mined to be the best one. Judging by the results of

evaluation of clustering using the validity indices, we

may consider that k-Means and Farthest First algo-

rithms are most likely to give worse clustering results

than the Expectation Maximization algorithms.

To cluster the data, we select training/testing using

the percentage split option. As a data set for training

(model building) we select 66% of data from the set.

AET 2020 - Symposium on Advances in Educational Technology

250

Figure 8: The algorithm AdaBoost.M1 (Freund and Schapire, 1996).

Figure 9: The algorithm Bagging (Breiman, 1996; Lee et al., 2020).

As a data set for testing we select 34% of data from

the set. In addition, we select number of clusters “3”

in algorithm settings.

We received the following results:

In the course of application of the EM cluster-

ing algorithm, according to the built clustering model

based on the training data set, three clusters were de-

termined, their characteristics are given in table 4.

Cluster 0 (63% of respondents): The average age

of respondents in this cluster is 17. The group consists

predominantly of women. The characteristic feature

of the representatives of this group is that they are un-

able to imagine their life without the Internet. There

are variations in the levels of anxiety and irritation,

if there is no possibility to use the Internet. There

are also varying opinions regarding the aimless use of

the Internet. As for other attributes, disorders related

to IA may be observed in the insigniﬁcant number

of respondents, who belong to this cluster. The be-

havioural model of the representatives of this cluster

demonstrated Internet centration in the psychic real-

ity of a personality, which is accordingly reﬂected in

their activity and behavior, other life interests as well

as signiﬁcance of everyday activities lose their impor-

tance. The stated tendencies are linked to IA.

Cluster 1 (13% of respondents): For the represen-

tatives of this group the average value of the age at-

tribute is 36 and it varies greatly. This is the oldest

age group if compared with other clusters. This group

has the largest share of women. Representatives of

this group, predominantly, cannot imagine their life

without the Internet. Thus, according to the centroid

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

251

Figure 10: The algorithm Random Forest (Breiman, 2001).

Table 3: Optimal number of clusters, calculated with the help of quality indices.

Index

Algorithms

Expectation Maximization k-Means Farthest First

Dunn 3 6 6

DB 3 6 4

SD 3 3 3

CDbw 3 3 3

S Dbw 3 5 4

Table 4: Model and evaluation on test split by EM algo-

rithm.

Attributes Indications

Clusters

0 1 2

(0,63) (0,13) (0,24)

112.1491 24.7781 44.0727

age

mean 17.4469 36.2459 19.2906

std. dev. 1.5994 10.0785 2.243

sex

female 108.8714 16.0638 5.0648

male 2.2778 7.7143 38.0079

no 22.7034 3.1864 10.1102

undeﬁned 16.0405 6.4026 4.5569

yes 73.4052 15.1891 29.4057

no 54.392 13.8263 27.7817

undeﬁned 23.6012 5.1903 7.2085

yes 34.156 5.7615 9.0825

no 45.3302 19.3167 26.3531

undeﬁned 15.1791 2.1415 5.6794

yes 51.6398 3.32 12.0403

no 106.1573 22.7561 41.0866

undeﬁned 1.0117 1.0098 1.9785

yes 4.9802 1.0122 1.0076

no 81.1224 20.5492 27.3284

undeﬁned 11.5501 2.168 11.282

yes 19.4767 2.061 5.4624

no 89.4444 19.3333 9.2223

undeﬁned 7.2533 1.1937 9.553

yes 15.4514 4.2512 25.2975

values of the attributes, we may diagnose IA related

Internet centration in the psychic reality of a personal-

ity, which is accordingly reﬂected in their activity and

behavior; other life interests as well as signiﬁcance of

everyday activities lose their importance. There are

predominantly no other signs of IA related disorders.

Cluster 2 (24% of respondents): The probabilis-

tic average of the age attribute among the representa-

tives of this group is middle-aged in comparison with

other groups and is 19. Male representatives signiﬁ-

cantly dominate in this group. Regarding the inabil-

ity to imagine their life without the Internet, opinions

differed, however, predominantly respondents believe

they have this addiction. Judging by the values of

attributes 4, 5, 6 and 7, the vast majority of this

group’s representatives declare that they do not have

other signs of IA. However, the feeling of the lack

of time spent playing computer games over the In-

ternet, which was conﬁrmed by the vast majority of

respondents, is a warning signal that may signify the

existence of IA related disorders. Thus, the charac-

teristic feature of this group is that most of its rep-

resentatives have IA related disorders such as: Inter-

net centration in the psychic reality of a personality;

behavioral impulse control disorders related to online

gaming. These people are in the risk group for devel-

oping IA related disorders.

In the course of application of the Farthest First al-

gorithm, according to the built clustering model based

on the training data set, there have also been three

clusters formed; their characteristics are given in ta-

ble 5.

Table 5: Model and evaluation on test split by Farthest First

algorithm.

Attributes

Clusters

0 1 2

age 16.0 22.0 20.0

sex female male male

3 yes undeﬁned yes

4 undeﬁned no yes

5 no yes undeﬁned

6 no no undeﬁned

7 no undeﬁned undeﬁned

8 no undeﬁned no

AET 2020 - Symposium on Advances in Educational Technology

252

Cluster 0: Contains data instances of the youngest

age group, whose age centroid attribute is 16. Ac-

cording to the value of the sex centroid attribute, the

group is made up of mostly female data instances.

The representatives of this group cannot imagine their

life without the Internet, i.e. there is obvious Internet

centration in the psychic reality of a personality. Re-

spondents cannot clearly determine whether they feel

either anxiety or irritation if they do not have the pos-

sibility to use the Internet. Judging by other attributes,

data instances of this cluster do not have IA related

disorders.

Cluster 1: This cluster contains data instances of

an older age group, the age attribute centroid of which

is 22. The value of the sex attribute centroid in this

cluster is male. A characteristic feature of the cluster

is undecidedness regarding the vital need to use the

Internet, prevalence of Internet relations over actual

real interactions, feeling the lack of time spent play-

ing computer games over the Internet (attributes 3, 7,

8 equal undeﬁned). The value of the yes centroid of

attribute 5 shows inclination to use the Internet with-

out a concrete purpose. To give an overall character-

istic, this group has signs of IA, i.e. behavior control

disorders related to Internet use.

Cluster 2: By the value of the age attribute cen-

troid, 20, this cluster contains data instances of the

middle age group if compared with other clusters.

The sex attribute centroid in this cluster is male. The

representatives of this cluster cannot imagine their life

without the Internet and feel anxiety and irritation

when they do not have the possibility to use the In-

ternet. They are characterized by their undecidedness

regarding the vital need to use the Internet; giving up

other life interests and everyday activities for the sake

of free Internet use; prevalence of online relations of

real-life interactions (value of attributes 5, 6, 7 is un-

deﬁned). Thus, the representatives of this cluster have

signs of IA, the priority signiﬁcance of the Internet

and behavior control disorders, related to Internet use.

Compared to other groups, they are in the risk group

for developing IA related disorders.

In the course of application of the K-Means algo-

rithm to the clustering model built on the basis of the

training data set three clusters have also been formed,

their characteristics are presented in table 6.

Cluster 0: Contains data instances of the youngest

age group, whose age attribute centroid is about 18.

According to the sex attribute centroid, mostly female

data instances are present in the groups. The represen-

tatives of this group cannot clearly determine whether

they have a vital need to use the Internet. As for other

indices, respondents state absence of signs of IA re-

lated disorders.

Table 6: Model and evaluation on test split by K-Means

algorithm.

Attributes

Clusters

0 1 2

age 18.4194 21.8605 20.9552

sex female male female

3 undeﬁned yes yes

4 no no no

5 no no no

6 no no no

7 no no no

8 no yes no

Cluster 1: This cluster contains data instances of

the older age group, whose age attribute centroid is

about 22. The value of the sex attribute centroid in

this cluster is male. Characteristic features of data

instances that belong to this cluster include the vi-

tal need to use the Internet, feeling the lack of time

spent playing online computer games as well as the

systemic need to play longer. The overall characteris-

tic of this cluster is the presence of signs of IA, i.e. be-

havior control issues related to Internet use, namely,

gaming Internet addiction. If compared with other

cluster, they belong to the risk group that may develop

IA related disorders.

Cluster 2: By the value of age attribute centroid,

which is about 21 years, compared to other clusters,

this cluster contains data instances of medium age

group. The sex attribute centroid is female. The rep-

resentatives of this cluster cannot imagine their life

without the Internet. Judging by centroids of other

characteristics, respondents of this cluster do not have

Internet-related disorders. Thus, the representatives

of this cluster have only IA signs associated with the

utmost signiﬁcance of the Internet.

The cluster distribution of test data in the course of

application of the three algorithms – the Expectation

Maximization, Farthest First and K-Means – using the

built training models is presented in table 7. Thus,

as it can be seen from the table, the algorithms have

determined three data groups. Clusters were formed,

which included 71:12:7, 67:4:19 and 33:15:42 data

instances respectively. There is a cluster that has

the largest number of data instances; a group, which

has the least data instances (exceptions); a group that

includes several times more data instances than the

smallest group.

Figures 11, 12 and 13 present a graphic repre-

sentation of clusters by age characteristic of data in-

stances, which are built using the training data set

and received in the course of implementation of the

Expectation Maximization, the Farthest First and the

K-Means algorithm respectively. As we can see, the

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

253

Table 7: Clustered Instances determined using Expectation Maximization, K-Means and Farthest First algorithms.

Attributes

Expectation Maximization Farthest First algorithm K-Means

Instances % Instances % Instances %

0 67 74 71 79 33 37

1 4 4 12 13 15 17

2 19 21 7 8 42 47

formed clusters differ from each other by the age at-

tribute. For instance, Cluster 0, which contains most

data instances, contains instances of respondents of a

younger age, if formed through the application of the

Expectation Maximization algorithm (ﬁgure 11). On

the other hand, the same cluster received through the

implementation of the Farthest First algorithm, con-

tains data instance of various age groups (ﬁgure 12).

Also, a small number of data instances of various

age groups is present in Cluster 2, received in the

course of implementation of the K-Means algorithm

(ﬁgure 13). Cluster 0 and Cluster 2 formed with the

Expectation Maximization algorithm as well as Clus-

ter 1 and Cluster 2 formed with the Farthest First algo-

rithm contain homogeneous age groups, and Cluster 0

and Cluster 1, formed with K-Means algorithm.

Figure 11: Plot of cluster distribution applying the Expec-

tation Maximization algorithm depending on the age group

attribute.

Figure 12: Plot of cluster distribution applying the Farthest

First algorithm depending on the age group attribute.

Figures 14, 15 and 16 present a graphic represen-

Figure 13: Plot of cluster distribution applying the K-Means

algorithm depending on the age group attribute.

tation by sex attribute of clusters formed through the

application of the Expectation Maximization, Farthest

First and K-Means algorithm respectively. The analy-

sis of ﬁgure 14, which visualizes clustering through

application of the Expectation Maximization algo-

rithm, shows that Cluster 0 contains only female data

instances. Clusters 1 and 2 have date instances of both

sexes. Female data instances prevail in Cluster 1 and

male ones in Cluster 2. Unlike Clusters formed by the

Expectation Maximization algorithm, all the clusters

formed by the Farthest First algorithm contain data

instances of both sex groups (ﬁgure 15). Female data

instances signiﬁcantly prevail in Cluster 0. All the

clusters built using the K-Means algorithm, contain

both male and female data instances (ﬁgure 16).

Figure 14: Plot of cluster distribution applying the Expecta-

tion Maximization algorithm depending on the sex attribute.

To classify a data set that contains 363 data sets,

we break it with the help of random choice into a

training set, which contains 70% (254) data sets and a

test set, which contains 30% (109) data sets.

AET 2020 - Symposium on Advances in Educational Technology

254

Figure 15: Plot of cluster distribution applying the Farthest

First algorithm depending on the sex attribute.

Figure 16: Plot of cluster distribution applying the K-Means

algorithm depending on the sex attribute.

To classify using the WEKA machine learning

system, we create classiﬁcation models on the basis

of the training set with the help of AdaBoost, Bag-

ging, Random Forest and Vote algorithms.

The results are shown in tables 8, 9, 10, 11, 12,

13.

According to the results, that are reﬂected in ta-

bles 8-13, the highest percent of correctly classiﬁed

instances both by the results of the training model as

well as by the prediction results are received while

applying the Bagging (classiﬁcation algorithm classi-

ﬁers.trees.REPTree) and Random Forest algorithms,

with 94.4882% (testing – 96.3303%) and 96.8504%

(testing – 99.0826%) respectively. In this case, over-

ﬁtting is not observed as the stated models demon-

strate a higher level of efﬁciency on test data rather

than on training data. In addition, these models

demonstrate the highest Kappa statistic and ROC

Area indexes. At the same time, the best results are

received while using the Random Forest algorithm.

The results received in the course of application of the

Ada Boost (classiﬁers.trees.DecisionStump) model

are somewhat worse by all the criteria, but are still ac-

ceptable. According to the indexes provided in Tables

8 and 11, the worst results are received in the course

of application of the Vote (classiﬁers.rules.ZeroR)

model.

According to the Mean absolute error (MAE),

the data forecast that is closes to the actual results

both in the process pf learning as well as in the pro-

cess of testing was built using the Random Forest

0.0597 (testing – 0.0405) and Bagging 0.0728 (test-

ing – 0.0478) models; the worst result according to

this indicator is received in the course of application

of the Vote 0.3643 (testing – 0.3569) model. The ap-

proximately twice higher MAE value was received in

course of building and testing the Ada Boost model,

which is 0.1309 and 0.1029 respectively.

The Root mean squared error (RMSE) values also

indicate the supremacy of the Random Forest 0.1391

(testing – 0.0964) and Bagging 0.1765 (testing –

0.1362) algorithms. The worst value was received as a

result of building a model based on Vote 0.4263 (test-

ing – 0.4176).

According to the Relative absolute error (RAE)

and Root relative squared error (RRSE) the assess-

ment prioritization of classiﬁcation models is pre-

served with Random Forest and Bagging. It should

be noted that the worst indexes are received as a re-

sult of classiﬁcation using the M model (RAE=100%,

RSE=100%), which characterizes an almost random

prediction.

4 CONCLUSION

In the course of determine the ﬁelds of use and con-

duct an empirical comparison of ensemble classiﬁ-

cation and clustering methods using the WEKA ma-

chine learning system to study the signs of IA related

disorders of students, the following conclusions have

been made:

1. As a result of empirical comparison of Expecta-

tion Maximization, Farthest First and K-Means al-

gorithms using the WEKA machine learning sys-

tem, we developed models of data instances’ clus-

tering to determine the signs of internet addiction

disorders among students majoring in Computer

Sciences.

2. The implementation of the Expectation Maxi-

mization, the K-Means and the Farthest First algo-

rithms each resulted in the formation of 3 clusters.

The results of clustering demonstrate that Internet

centration in the psychic reality of a personality

is a characteristic feature of the respondents that

took part in the survey. This also reﬂects accord-

ingly in their activity and behavior, diminishing

other life interests and the signiﬁcance of every-

day activities. In addition, in the course of im-

plementation of the Expectation Maximization al-

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

255

Table 8: Evaluation of the results of the work of WEKA ensemble classiﬁcation training models.

Ensemble classiﬁ-

cation algorithm

scheme

Correctly

Classiﬁed

Instances

Incorrectly

Classiﬁed

Instances

Kappa

statis-

tic

Mean

abso-

lute

error

Root

mean

squared

error

Relative

absolute

error

Root

relative

squared

error

weka.classiﬁers.

meta.AdaBoostM1

224

(88.189%)

(11.811%)

0.7583 0.1309 0.2355 35.9359% 55.2562%

weka.classiﬁers.

meta.Bagging

240

(94.4882%)

(5.5118%)

0.8962 0.0728 0.1765 19.9917% 41.3999%

weka.classiﬁers.

trees.RandomForest

246

(96.8504%)

(3.1496%)

0.9411 0.0597 0.1391 16.375% 32.6236%

weka.classiﬁers.

meta.Vote

152

(59.8425%)

102

(40.1575%)

0 0.3643 0.4263 100% 100%

Table 9: Detailed Accuracy by Class of the WEKA ensemble classiﬁcation training models.

Ensemble classiﬁ-

cation algorithm

scheme

Rate

Precision Recall F-

Measure

MCC ROC

Area

PRC Area

Class

weka.classiﬁers.

meta.AdaBoostM1

0.973 0.000 1.000 0.973 0.986 0.981 0.978 0.982 SR

1.000 0.294 0.835 1.000 0.910 0.768 0.958 0.961 NR

0.000 0.000 - 0.000 - - 0.922 0.620 IR

Weighted Average 0.882 0.176 - 0.882 - - 0.960 0.929

weka.classiﬁers.

meta.Bagging

0.973 0.000 1.000 0.973 0.986 0.981 0.975 0.981 SR

0.980 0.108 0.931 0.980 0.955 0.886 0.972 0.965 NR

0.679 0.013 0.864 0.679 0.760 0.741 0.975 0.852 IR

Weighted Average 0.945 0.066 0.944 0.945 0.943 0.898 0.974 0.957

weka.classiﬁers.

trees.Random Forest

0.973 0.000 1.000 0.973 0.986 0.981 0.998 0.996 SR

0.993 0.069 0.956 0.993 0.974 0.935 0.994 0.995 NR

0.821 0.004 0.958 0.821 0.885 0.875 0.995 0.971 IR

Weighted Average 0.969 0.042 0.969 0.969 0.968 0.942 0.995 0.993

weka.classiﬁers.

meta.Vote

0.000 0.000 - 0.000 - - 0.484 0.285 SR

1.000 1.000 0.598 1.000 0.749 - 0.475 0.586 NR

0.000 0.000 - 0.000 - - 0.466 0.103 IR

Weighted Average 0.598 0.598 - 0.598 - - 0.477 0.445

gorithm, a cluster was formed, whose represen-

tatives have behavior control disorders, related to

online gaming. These respondents are in the risk

group for developing IA related disorders.

3. Expectation Maximization, Farthest First and K-

Means algorithms of data clustering differ by their

algorithm model, however, from the point of char-

acteristic features, they produce relatively simi-

lar clusters, thus implementing optimized clus-

tering. At the same time, when a data set was

grouped into three clusters by implementing these

algorithms, the clusters differed by cluster model,

namely, by the number of data instances in each

cluster, their structure and value of attribute cen-

troids.

4. Judging by the evaluation results of clustering va-

lidity using the validity indices, we can state that

most likely the K-Means and Farthest First algo-

rithms show worse clustering results than the Ex-

pectation Maximization algorithm.

5. Respondents are divided into three groups (Signif-

icant Risk (SR), Insigniﬁcant Risk (IR), No Risk

(NR)). Such division gives the possibility of pri-

mary general assessment of risks of IA develop-

ment based on the signiﬁcance of Internet inﬂu-

ence on the psychic of a person. The Signif-

icant Risk (SR) group is determined by asking

questions, which reﬂect the signs of “in-depth”,

maladaptive and, accordingly, a relatively long-

lasting inﬂuence of the Internet on the psychic,

vital resources, vitality, the existential level, the

personality in general in its vital and conceptual

basis. The Insigniﬁcant Risk (IR) is determined

by asking question, which disclose the signs of

“superﬁcial”, local, adaptive even though a rather

signiﬁcant inﬂuence on the psychic. In this group

AET 2020 - Symposium on Advances in Educational Technology

256

Table 10: Table of confusion matrix of WEKA ensemble classiﬁcation testing models.

Actual class

Ensemble classiﬁcation algorithm scheme Area Class SR NR IR

Predicted

class

weka.classiﬁers.meta.AdaBoostM1

SR 72 2 0

NR 0 152 0

IR 0 28 0

weka.classiﬁers.meta.Bagging

SR 72 2 0

NR 0 149 3

IR 0 9 19

weka.classiﬁers.trees.RandomForest

SR 72 2 0

NR 0 151 1

IR 0 5 23

weka.classiﬁers.meta.Vote

0 74 0

NR 0 152 0

IR 0 28 0

Table 11: Evaluation of the results of testing the WEKA ensemble classiﬁcation models.

Ensemble classiﬁ-

cation algorithm

scheme

Correctly

Classiﬁed

Instances

Incorrectly

Classiﬁed

Instances

Kappa

statis-

tic

Mean

abso-

lute

error

Root

mean

squared

error

Relative

absolute

error

Root

relative

squared

error

weka.classiﬁers.

meta.AdaBoostM1

103

(94.4954%)

(5.5046%)

0.8869 0.1029 0.1729 28.8318% 41.399%

weka.classiﬁers.

meta.Bagging

105

(96.3303%)

(3.6697%)

0.9276 0.0478 0.1362 13.3815% 32.6122%

weka.classiﬁers.

trees.RandomForest

108

(99.0826%)

(0.9174%)

0.982 0.0405 0.0964 11.3566% 23.0819%

weka.classiﬁers.

meta.Vote

(59.633%)

(40.367%)

0 0.3569 0.4176 100% 100%

the spheres, inﬂuence by the Internet are the cog-

nitive, activity, value-conceptual, need, commu-

nicative spheres of the psychic. The No Risk (NR)

group indicates the absence of risks for IA devel-

opment. Belongingness to this group is deﬁned

y asking questions, which reﬂect an insigniﬁcant,

local and short-tern inﬂuence of the Internet on

the psychic.

6. The model that gave the results that are the

closest ones to the actual classiﬁcation re-

sults is the model built using the Random

Forest algorithm. According to all the as-

sessments, the classiﬁcation model built us-

ing the Bagging algorithm (classiﬁcation al-

gorithm classiﬁers.trees.REPTree) is close to

it. Somewhat lower classiﬁcation indexes

are received in the course of building a

model using the Ada Boost algorithm (classi-

ﬁers.trees.DecisionStump). These models can be

considered suitable for diagnosing IA disorders

among students. The model built with the help of

the Vote algorithm (classiﬁers.rules.ZeroR) is not

suitable for use. Such a result indicates that the

application of this algorithm requires additional

modiﬁcations.

7. Intellectual analysis of the data set regarding the

situation with IA among students majoring in

Computer Sciences with the application of en-

semble classiﬁcation and clustering methods has

shown that the methods studied above may be

considered suitable for developing models for de-

tecting IA disorders and respondent groups with

the signs of IA related disorders.

8. The results of the research indicate the expedience

of the application of the intellectual data analy-

sis in medical research using the machine learning

systems. The presented methods may serve as the

basis for a strategic development of new vectors

of medical data processing as well as decision-

making in this ﬁeld.

The present-day medicine needs non-standard ap-

proaches to intellectual data analysis, complex appli-

cation of methods, their modiﬁcation, application the

ensemble of methods in order to be able to process

large data sets in digital systems. Our conclusions

may help to determine the signs of IA related disor-

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

257

Table 12: Detailed Accuracy by Class of testing the WEKA ensemble classiﬁcation models.

Ensemble classiﬁ-

cation algorithm

scheme

Rate

Precision Recall F-

Measure

MCC ROC

Area

PRC Area

Class

weka.classiﬁers.

meta.AdaBoostM1

0.974 0.000 1.000 0.974 0.987 0.980 0.983 0.984 SR

1.000 0.136 0.915 1.000 0.956 0.889 0.976 0.971 NR

0.000 0.000 - 0.000 - - 0.946 0.540 IR

Weighted Average 0.945 0.081 - 0.945 - - 0.977 0.956

weka.classiﬁers.

meta.Bagging

0.974 0.000 1.000 0.974 0.987 0.980 1.000 1.000 SR

0.985 0.068 0.955 0.985 0.970 0.924 0.995 0.997 NR

0.600 0.010 0.750 0.600 0.667 0.657 0.975 0.750 IR

Weighted Average 0.963 0.041 0.962 0.963 0.962 0.932 0.996 0.987

weka.classiﬁers.

trees.Random Forest

1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 SR

1.000 0.023 0.985 1.000 0.992 0.981 0.999 1.000 NR

0.800 0.000 1.000 0.800 0.889 0.890 0.998 0.967 IR

Weighted Average 0.991 0.014 0.991 0.991 0.990 0.984 0.999 0.998

weka.classiﬁers.

meta.Vote

0.000 0.000 - 0.000 - - 0.500 0.358 SR

1.000 1.000 0.596 1.000 0.747 - 0.500 0.596 NR

0.000 0.000 - 0.000 - - 0.500 0.046 IR

Weighted Average 0.596 0.596 - 0.596 - - 0.500 0.486

Table 13: Table of confusion matrix of testing the WEKA ensemble classiﬁcation models.

Actual class

Ensemble classiﬁcation algorithm scheme Area Class SR NR IR

Predicted

class

weka.classiﬁers.meta.AdaBoostM1

SR 38 1 0

NR 0 65 0

IR 0 5 0

weka.classiﬁers.meta.Bagging

SR 38 1 0

NR 0 64 1

IR 0 2 3

weka.classiﬁers.trees.RandomForest

SR 39 0 0

NR 0 65 0

IR 0 1 4

weka.classiﬁers.meta.Vote

0 39 0

NR 0 65 0

IR 0 5 0

ders among students majoring in Computer Sciences,

forecasting the risk of IA and development of services

aimed at IA prevention.

REFERENCES

Abbott, D. A., Cramer, S. L., and Sherrets, S. D. (1995).

Pathological gambling and the family: Practice impli-

cations. Families in Society, 76(4):213–219.

Anokhin, P. (1968). Cybern

etique, neurophysiologie et psy-

chologie. Social Science Information, 7(1):169–197.

Balatskiy, E. V. (2008). Vitalnyye resursy i kontury soz-

naniya (Vital resources and circuits of consciousness).

Vestnik Rossiyskoy akademii nauk, 78(6):531–537.

Breiman, L. (1996). Bagging predictors. Machine learning,

24(2):123–140.

Breiman, L. (2001). Random forests. Machine learning,

45(1):5–32.

da Silva, L. E. B., Melton, N. M., and au2, D. C. W. I.

(2019). Incremental cluster validity indices for hard

partitions: Extensions and comparative study.

Dasgupta, S. and Long, P. M. (2005). Performance guaran-

tees for hierarchical clustering. Journal of Computer

and System Sciences, 70(4):555 – 569.

Derhach, M. (2016). Cyber-addiction of students majoring

in computer science. Science and Education, (7):92–

98.

Di, Z., Gong, X., Shi, J., Ahmed, H. O. A., and Nandi, A. K.

(2019). Internet addiction disorder detection of Chi-

nese college students using several personality ques-

tionnaire data and support vector machine. Addictive

Behaviors Reports, 10:100200.

Frankl, V. E. (1985). Man’s search for meaning. Simon and

Schuster.

AET 2020 - Symposium on Advances in Educational Technology

258

Freund, Y. and Schapire, R. E. (1996). Experiments with

a new boosting algorithm. In Proceedings of the

Thirteenth International Conference on International

Conference on Machine Learning, ICML’96, page

148–156, San Francisco, CA, USA. Morgan Kauf-

mann Publishers Inc.

Hsieh, W.-H., Shih, D.-H., Shih, P.-Y., and Lin, S.-B.

(2019). An ensemble classiﬁer with case-based rea-

soning system for identifying internet addiction. Inter-

national journal of environmental research and public

health, 16(7):1233.

Huizinga, J. (2016). Homo Ludens: A Study of the Play-

Element in Culture. Angelico Press.

Hussain, R. G., Ghazanfar, M. A., Azam, M. A., Naeem, U.,

and Ur Rehman, S. (2019). A performance compari-

son of machine learning classiﬁcation approaches for

robust activity of daily living recognition. Artiﬁcial

Intelligence Review, 52(1):357–379.

ICD-11 for Mortality and Morbidity Statistics (Ver-

sion: 09/2020) (2020). 6C51 Gaming disor-

der. https://icd.who.int/browse11/l-m/en#/http://id.

who.int/icd/entity/1448597234.

Ji, H.-M., Chen, L.-Y., and Hsiao, T.-C. (2019). Real-

time detection of internet addiction using reinforce-

ment learning system. In Proceedings of the Genetic

and Evolutionary Computation Conference Compan-

ion, pages 1280–1288.

Keng, B. (2016). The expectation-maximization algo-

rithm. http://bjlkeng.github.io/posts/the-expectation-

maximization-algorithm.

Kittler, J., Hatef, M., Duin, R. P., and Matas, J. (1998). On

combining classiﬁers. IEEE transactions on pattern

analysis and machine intelligence, 20(3):226–239.

Klochko, O. and Fedorets, V. (2019). An empirical com-

parison of machine learning clustering methods in the

study of Internet addiction among students majoring

in Computer Sciences. CEUR Workshop Proceedings,

2546:58–75.

Klochko, O., Fedorets, V., Tkachenko, S., and Maliar, O.

(2020). The use of digital technologies for ﬂipped

learning implementation. CEUR Workshop Proceed-

ings, 2732:1233–1248.

Klochko, O. V. (2019). Matematychne modeliuvannia

system i protsesiv v osviti/pedahohitsi (Mathemat-

ical modeling of systems and processes in educa-

tion/pedagogy). Druk.

amer, J., Schrey

ogg, J., and Busse, R. (2019). Classiﬁca-

tion of hospital admissions into emergency and elec-

tive care: a machine learning approach. Health Care

Management Science, 22(1):85–105.

Kuncheva, L. I. (2014). Combining pattern classiﬁers:

methods and algorithms. John Wiley & Sons.

Lee, T.-H., Ullah, A., and Wang, R. (2020). Bootstrap ag-

gregating and random forest. In Fuleky, P., editor,

Macroeconomic Forecasting in the Era of Big Data:

Theory and Practice, pages 389–429. Springer Inter-

national Publishing, Cham.

Leontyev, D. A. (2017). Psikhologiya smysla: priroda.

stroyeniye i dinamika smyslovoy realnosti (The psy-

chology of meaning: nature, structure and dynamics

of meaningful reality). Litres.

Linoff, G. S. and Berry, M. J. A. (2011). Data mining tech-

niques: for marketing, sales, and customer relation-

ship management. John Wiley & Sons.

Moshtaghi, M., Bezdek, J. C., Erfani, S. M., Leckie, C.,

and Bailey, J. (2019). Online cluster validity indices

for performance monitoring of streaming data clus-

tering. International Journal of Intelligent Systems,

34(4):541–563.

Pacol, C. A. and Palaoag, T. D. (2020). En-

hancing sentiment analysis of textual feedback in

the student-faculty evaluation using machine learn-

ing techniques. In 3rd International Confer-

ence On Academic Research in Science, Technology

and Engineering. https://www.dpublication.com/wp-

content/uploads/2020/11/3-3003.pdf.

Santos, S. G. T. d. C. and de Barros, R. S. M. (2020).

Online AdaBoost-based methods for multiclass prob-

lems. Artiﬁcial Intelligence Review, 53(2):1293–

1322.

Souri, A., Ghafour, M. Y., Ahmed, A. M., Safara, F.,

Yamini, A., and Hoseyninezhad, M. (2020). A new

machine learning-based healthcare monitoring model

for student’s condition diagnosis in Internet of Things

environment. Soft Computing, 24(22):17111–17121.

Subasi, A., Kevric, J., and Abdullah Canbaz, M. (2019).

Epileptic seizure detection using hybrid machine

learning methods. Neural Computing and Applica-

tions, 31(1):317–325.

Sudakov, K. V. (2011). Funktsionalnyye sistemy (Func-

tional systems). Izdatelstvo RAMN.

Suma, S. N., Nataraja, P., and Sharma, M. K. (2021). Inter-

net addiction predictor: Applying machine learning in

psychology. In Advances in Artiﬁcial Intelligence and

Data Engineering, pages 471–481. Springer.

Tarasenko, A., Yakimov, Y., and Soloviev, V. (2019). Con-

volutional neural networks for image classiﬁcation.

CEUR Workshop Proceedings, 2546:101–114.

Tuysuzoglu, G. and Birant, D. (2020). Enhanced bagging

(ebagging): A novel approach for ensemble learning.

The International Arab Journal of Information Tech-

nology, 17(4):515–528.

Wallis, D. (1997). Just click no: Talk story about Dr. Ivan K.

Goldberg and the internet addiction disorder. The New

Yorker. http://www.newyorker.com/magazine/1997/

01/13/just-click-no.

Weka (2021). Weka 3: Machine Learning Software in Java.

http://old-www.cms.waikato.ac.nz/ ml/weka/.

Young, K. S. (1998a). Caught in the net: How to recognize

the signs of internet addiction–and a winning strategy

for recovery. John Wiley & Sons.

Young, K. S. (1998b). Internet addiction: The emergence of

a new clinical disorder. CyberPsychology & Behav-

ior, 1(3):237–244. https://www.liebertpub.com/doi/

pdf/10.1089/cpb.1998.1.237.

Yuryeva, L. N. and Bolbot, T. Y. (2006). Kompyuter-

naya zavisimost: formirovaniye, diagnos-

tika, korrektsiya i proﬁlaktika (Computer ad-

diction: formation, diagnostics, correction

The Use of Ensemble Classiﬁcation and Clustering Methods of Machine Learning in the Study of Internet Addiction of Students

259

and prevention). Porogi, Dnepropetrovsk.

http://kingmed.info/media/book/3/2673.pdf.

Zahorodko, P. V., Semerikov, S. O., Soloviev, V. N.,

Striuk, A. M., Striuk, M. I., and Shalatska, H. M.

(2021). Comparisons of performance between

quantum-enhanced and classical machine learning al-

gorithms on the IBM quantum experience. Journal of

Physics: Conference Series, 1840(1):012021.

Zelinska, S. (2020). Machine learning: Technologies and

potential application at mining companies. E3S Web

of Conferences, 166:03007.

Zhong, Y., Yang, H., Zhang, Y., and Li, P. (2020).

Online random forests regression with memories.

Knowledge-Based Systems, 201:106058.

AET 2020 - Symposium on Advances in Educational Technology

260