The main proposal of this work is to combine co-training with incremental learning in order to avoid this issue. The initial models are generic ones, trained on a large set of subjects. They are used to predict pseudo-labels for new data, corresponding to new subjects. After pseudo-labeling these new samples, we apply incremental learning techniques to adapt the models. These models are therefore no longer generic, but personalized to the new subjects. We call this process co-incrementation learning. One of its main advantages is to drastically reduce the computing time while improving the recognition accuracy on new subjects, thus reducing the identity bias.
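As a heavily simplified sketch of this process, the snippet below pseudo-labels a new subject's unlabeled samples with a generic model, keeps only the confident predictions, and updates the model incrementally. The stub classifier, the 0.8 confidence threshold, and all names are illustrative assumptions, not the paper's actual models or settings.

```python
# Illustrative sketch of one co-incrementation step: pseudo-label,
# filter by confidence, adapt incrementally. The stub classifier and
# the threshold are assumptions, not the paper's NCMF implementation.
import numpy as np

class NearestMeanStub:
    """Tiny incremental classifier keeping one running mean per class."""
    def __init__(self):
        self.means = {}                      # class label -> (mean, count)

    @property
    def classes_(self):
        return sorted(self.means)

    def partial_fit(self, X, y):
        for x, c in zip(np.asarray(X, float), y):
            mean, n = self.means.get(c, (np.zeros(x.shape), 0))
            self.means[c] = ((mean * n + x) / (n + 1), n + 1)  # running mean

    def predict_proba(self, X):
        d = np.array([[np.linalg.norm(x - self.means[c][0])
                       for c in self.classes_] for x in np.asarray(X, float)])
        s = np.exp(-d)                       # closer mean -> higher score
        return s / s.sum(axis=1, keepdims=True)

def personalize(model, X_new, threshold=0.8):
    """Pseudo-label a new subject's data, keep confident predictions,
    and adapt the generic model incrementally."""
    proba = model.predict_proba(X_new)
    conf = proba.max(axis=1) >= threshold    # keep confident samples only
    pseudo = np.array(model.classes_)[proba.argmax(axis=1)]
    if conf.any():
        model.partial_fit(np.asarray(X_new)[conf], pseudo[conf])
    return model
```

In the actual method, two such models cooperate (co-training) rather than one model labeling its own data, as developed in the following sections.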
The rest of the article is organized as follows. The next section presents the incremental learning field, with a particular focus on random forest (RF)-based algorithms, and details the co-training process. Section 3 is devoted to the data and the feature extraction process. Section 4 presents the nearest-class mean forest (NCMF) in detail, how it differs from a classical RF, and the way it can learn incrementally. In Section 5, we detail the original co-incrementation algorithm, which combines an incremental NCMF with co-training. We then present, in Section 6, results obtained with generic models (before adaptation) and after co-incrementation on specific chunks. Finally, we conclude in Section 7.
2 RELATED WORK
Automatic Facial Emotion Recognition (FER) has re-
ceived wide interest in a variety of contexts, espe-
cially for the recognition of action units, basic (or
compound) emotions and affective states. Although
considerable effort has been made, several questions
remain about which cues are important for interpret-
ing facial expressions and how to encode them. Af-
fect recognition systems most often aim to recognize
the appearance of facial actions, or the emotions con-
veyed by those actions (Sariyanidi et al., 2014). The
former are generally based on the Facial Action Coding System (FACS) (Ekman, 1997). The production of a facial action unit has a temporal evolution, typically modeled by four temporal segments: neutral, onset, apex, and offset (Ekman, 1997). Among them, neutral is the phase with no expression and no sign of muscle activity; the apex is a plateau where the expression reaches and holds its maximum intensity.
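The four-phase model above can be illustrated by classifying a single frame from its action-unit intensity. The thresholds (0.1, 0.8) and the frame-to-frame comparison below are assumptions made for illustration, not values from the FACS literature.

```python
# Illustrative frame-level phase labeling for one action unit.
# Thresholds and the rising/falling test are assumptions.
def temporal_phase(intensity, prev_intensity, low=0.1, high=0.8):
    """Return the temporal phase of one frame, given AU intensity in [0, 1]."""
    if intensity < low:
        return "neutral"          # no visible muscle activity
    if intensity >= high:
        return "apex"             # plateau at maximum intensity
    # between the two thresholds: rising -> onset, falling -> offset
    return "onset" if intensity >= prev_intensity else "offset"
```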
As seen before, identity bias results in performance losses for generic learning models. Strategies that group individuals by common traits such as gender, weight, or age and personalize models for these groups have already shown promising results in a wide range of areas, such as activity recognition (Chu et al., 2013) (Kollia, 2016) (Yang and Bhanu, 2011). However, quite often the strategy used consists of personalizing one model per user, since this ensures better results. This can quickly become complex when the number of subjects increases or when the amount of data collected per subject remains small. In the field of emotion recognition, different solutions to this challenge have been considered, personalization methods being the most promising (Chu et al., 2013) (Yang and Bhanu, 2011).
One of the main characteristics of incremental techniques is the ability to update models using only recent data. This is often the only practical solution when learning from data "on the fly", as it would be impossible to keep everything in memory and re-learn from scratch every time new information becomes available. This type of technique holds promise for personalizing models to individuals. It has been demonstrated that random forests (RF) (Breiman, 2001), in addition to their multi-class nature and ability to generalize, can also be incremented in both data and classes (Denil et al., 2013) (Hu et al., 2018) (Lakshminarayanan et al., 2014). Besides, random forest models have been used successfully for personalization (Chu et al., 2013) (Kollia, 2016) (Yang and Bhanu, 2011). Nearest-class mean forests (NCMF), derived from RF, have been shown to outperform RF and to allow incrementation in a simple way (Ristin et al., 2014), even in the emotion recognition field (Gonzalez and Prevost, 2021).
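The simplicity of nearest-class-mean incrementation comes from the fact that a class centroid can be updated online with a running mean. The sketch below uses our own notation, not code from (Ristin et al., 2014); it shows the update rule and how a previously unseen class is handled, which is how such models increment in classes as well as in data.

```python
# Running-mean update behind nearest-class-mean incrementation:
# mu' = (n * mu + x) / (n + 1). Names are illustrative assumptions.
def update_centroid(centroids, counts, label, x):
    """Fold one new sample x into the centroid of its class.
    A previously unseen class simply gets a fresh centroid."""
    if label not in centroids:
        centroids[label] = list(x)        # new class: centroid = sample
        counts[label] = 1
        return
    n = counts[label]
    centroids[label] = [(n * m + xi) / (n + 1)
                        for m, xi in zip(centroids[label], x)]
    counts[label] = n + 1
```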
In the era of big data, with the increasing size of databases, the field of machine learning faces a challenge: the creation of ground truth, which can be costly in time and effort. We therefore increasingly find ourselves in contexts of incomplete supervision, where we are given a small amount of labeled data, insufficient to train a good learner, while unlabeled data is available in abundance. To this end, different learning techniques have been proposed (Zhou, 2018), with human intervention, such as active learning (Settles, 2009), or without human intervention, such as semi-supervised methods. One of the latter families is based on disagreement (Zhou and Li, 2010), co-training being one of its most famous representatives.
Co-training is a learning technique proposed in 1998 by Blum and Mitchell (Blum and Mitchell, 1998), traditionally based on the use of two machine learning models. The main idea is that they complement each other: one helps the other to correct the mistakes that it itself does not make, and vice versa. A second idea is to exploit unlabeled data (present in large quantities), rather than processing only labeled data (present in small quantities). For