
 
(oracle) is presented with small sets of examples for labelling. The proposed algorithm is tested on streams of instances, which is suitable for scenarios where new instances need to be classified one at a time, i.e. an incremental and online learning setting. In this scenario, the goal is to achieve high accuracy while using as few labelled examples as possible.
This paper is organized as follows. The next section presents related work. We detail our active online ensemble method in Section 3. Section 4 describes the experimental evaluation. Finally, Section 5 concludes the paper.
2  RELATED WORK 
Classifiers construct models that describe the relationship between the observed variables of an instance and the target label. However, as stated above, in a data stream setting, the labels may often be missing, incorrect, or late arriving. Further, labelling requires domain expertise, and labels may be costly to obtain.
Predictive models can be generated using classification methods. However, the accuracy of the produced model depends strongly on the labelled instances in the training set. Incorrectly labelled instances can result in inaccurate or biased models. Further, a data set may be imbalanced, where one class dominates another. One suggested solution is to use active learning to guide the learning process (Stefanowski and Pachocki, 2009; Muhivumundo and Viktor, 2011). This type of learning aims to train on the most informative instances.
Active learning studies how to select the most informative instances, often by using multiple classifiers. Generally, informative examples are identified as those that cause high disagreement among the classifiers (Stefanowski and Pachocki, 2009). Thus, the main idea is to use the diversity of ensemble learning to focus the labelling effort. This usually works by obtaining information about the data from users, also known as oracles. In other words, the algorithm is initiated with a limited amount of labelled data, which is passed to the learning algorithm as a training set to produce the first classifier. In each of the following iterations, the algorithm analyses the remaining unlabelled instances and presents the most informative ones to the oracle (a human expert) for labelling. These newly labelled examples are added to the training set and used in the following iteration. The process is repeated until the user is satisfied or until a specific stopping criterion is met.
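To make this loop concrete, the sketch below gives one plausible realization of the generic iterative labelling process described above. It is an illustration, not any specific published procedure: the least-confidence selection rule, the query_oracle callable, and the choice of scikit-learn's LogisticRegression as the base learner are all assumptions.

# Sketch of the generic iterative active learning loop described above.
# The selection rule (least confidence) and the oracle interface are
# illustrative assumptions, not the exact procedure of any cited work.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_lab, y_lab, X_unlab, query_oracle,
                         budget, batch_size=10):
    model = LogisticRegression(max_iter=1000)
    while budget > 0 and len(X_unlab) > 0:
        model.fit(X_lab, y_lab)
        # Uncertainty = 1 - highest predicted class probability
        uncertainty = 1.0 - model.predict_proba(X_unlab).max(axis=1)
        k = min(batch_size, budget, len(X_unlab))
        idx = np.argsort(-uncertainty)[:k]      # most uncertain instances
        new_y = query_oracle(X_unlab[idx])      # ask the human expert
        X_lab = np.vstack([X_lab, X_unlab[idx]])
        y_lab = np.concatenate([y_lab, new_y])
        X_unlab = np.delete(X_unlab, idx, axis=0)
        budget -= k
    return model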
Past research in active learning has mainly focused on the pool-based scenario, in which a large collection of unlabelled instances is available and the main objective is to identify the best subset to be labelled and used as a training set (Sculley, 2007a; Chu et al., 2011). Much of this work stems from the Query-by-Committee method, a very effective active learning approach that is widely applied for labelling instances. Initially, a pool of unlabelled data is collected, from which instances are selected and presented to the oracle for labelling. A committee of classifiers is trained and models are generated based on the current training data. The samples presented for labelling are chosen based on the level of disagreement between the individual classifiers. In pool-based scenarios, the unlabelled data are collected in a candidate pool. However, in a data stream setting, maintaining the candidate pool may prove challenging, as a large amount of data may arrive at high speed.
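As an illustration of how committee disagreement may be quantified, the sketch below scores unlabelled instances by vote entropy. Building the committee from bootstrap samples is an assumed design choice; any construction that yields diverse members would serve.

# Sketch of Query-by-Committee disagreement scoring via vote entropy.
# Bootstrap-sampled decision trees are an assumed way to form a diverse
# committee; the entropy of the members' votes measures disagreement.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_committee(X, y, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        committee.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return committee

def vote_entropy(committee, X_unlab):
    votes = np.stack([m.predict(X_unlab) for m in committee])
    scores = []
    for col in votes.T:                  # votes cast for one instance
        _, counts = np.unique(col, return_counts=True)
        p = counts / len(committee)
        scores.append(-(p * np.log(p)).sum())
    return np.array(scores)              # higher = more disagreement

The instances with the highest vote entropy would then be presented to the oracle.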
One of the main challenges in data stream active learning is to reflect the underlying data distribution. This problem may be addressed by using active learning to balance the distribution of the incoming data, in order to increase model accuracy (Zliobaite et al., 2014). The distribution is adapted over time by redistributing the labelling weight, as opposed to actively labelling new instances. Learn++ (Polikar et al., 2001) is another algorithm that employs incremental ensemble learning in order to learn from data streams.
Furthermore, traditional active learning methods require many passes over the unlabelled data in order to select the informative instances (Sculley, 2007a). This can create a storage and computational bottleneck in data stream and big data settings. Thus, the active learning process needs to be modified for the online setting.
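A minimal sketch of such a modification follows: each arriving instance is inspected exactly once and the labelling decision is made immediately, so no candidate pool or second pass is needed. The fixed uncertainty threshold and the use of scikit-learn's SGDClassifier as the incremental learner are illustrative assumptions.

# Single-pass online active learning sketch: the decision to request a
# label is made as each instance arrives, so nothing is stored.
# Threshold and base learner are assumptions for illustration.
import numpy as np
from sklearn.linear_model import SGDClassifier

def online_active_learning(stream, query_oracle, X_seed, y_seed,
                           threshold=0.2):
    model = SGDClassifier(loss="log_loss")
    model.partial_fit(X_seed, y_seed, classes=np.unique(y_seed))
    n_queried = 0
    for x in stream:                          # each instance seen once
        uncertainty = 1.0 - model.predict_proba([x])[0].max()
        if uncertainty > threshold:           # informative enough to query
            y = query_oracle(x)               # spend labelling budget
            model.partial_fit([x], [y])       # incremental model update
            n_queried += 1
    return model, n_queried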
Another scenario is proposed by (Zhu et al., 2007) to address the data distribution associated with the data stream. Recall that a data stream has a dynamic data distribution because of the continuous arrival of data, and in data stream mining it is unrealistic to build a single model based on all examples. To address this problem, (Zhu et al., 2007) propose an ensemble active learning classifier with the goal of minimizing the ensemble variance in order to guide the labelling process.
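The selection principle can be sketched as follows: instances on which the ensemble members' probability estimates vary most are the ones queried. This only illustrates the idea of variance-guided labelling; the exact weighting and chunk handling in (Zhu et al., 2007) differ, and the binary-classification assumption is ours.

# Sketch of variance-guided selection: query the instances on which the
# ensemble members disagree most in their probability estimates.
# Binary classification is assumed here for simplicity.
import numpy as np

def ensemble_variance_scores(ensemble, X_chunk):
    # shape (n_members, n_instances): positive-class probabilities
    probs = np.stack([m.predict_proba(X_chunk)[:, 1] for m in ensemble])
    return probs.var(axis=0)                 # variance across members

def select_for_labelling(ensemble, X_chunk, n_queries):
    scores = ensemble_variance_scores(ensemble, X_chunk)
    return np.argsort(-scores)[:n_queries]   # highest-variance instances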
One of the main objectives of active learning in this setting is to decide which newly arrived instances should be labelled. According to the framework proposed in (Zhu et al., 2007), the