Automated Infant Monitoring based on R-CNN and HMM

Cheng Li¹ᵃ, A. Pourtaherian¹ᵇ, L. van Onzenoort² and P. H. N. de With¹ᶜ
¹ Eindhoven University of Technology, Eindhoven, The Netherlands
² Maxima Medical Center, Veldhoven, The Netherlands
ᵃ https://orcid.org/0000-0003-2900-637X
ᵇ https://orcid.org/0000-0003-4542-1354
ᶜ https://orcid.org/0000-0002-7639-7716

Keywords: Automated Infant Monitoring, R-CNN, HMM, GERD Diagnosis.
Abstract: Manual monitoring of young infants suffering from reflux requires significant effort, since infants can hardly articulate their feelings. This work proposes a near real-time video-based infant monitoring system for the analysis of infant expressions. The detected discomfort moments can be correlated with a reflux measurement for the diagnosis of gastroesophageal reflux disease. The system consists of two components: expression classification and expression-state stabilization. The expression classification is realized by a Faster R-CNN and the state stabilization is implemented with a Hidden Markov Model. The experimental results show a mean average precision of 82.3% and 83.4% for the classification of 7 different expressions, and up to 90% for discomfort detection, evaluated with both clinical and daily-life datasets. Moreover, when adopting temporal analysis, the false expression changes between frames are reduced by up to 65%, which significantly enhances the consistency of the system output.
1 INTRODUCTION
Young infant expression analysis is a difficult and im-
portant task within the field of pediatrics, since the
verbal ability of young infants is limited or not yet
even developed yet. Over the years, pain assessment tools for infants have been developed based on subjective descriptors such as facial expressions. However, these
systems require practitioners to observe infants over
time, which is laborious and time-consuming. Like-
wise, continuous monitoring and nursing of infants
by professionals would also be too expensive. To
address this, an automated video-based infant mon-
itoring system could analyze expressions as an aux-
iliary assessment tool. This approach would differ-
entiate from other methods, such as heart-rate mon-
itoring, since video-based expression analysis has a
non-interventional character. Besides, the discomfort
moments detected by such a system can be corre-
lated with disease diagnosis. For example, gastroesophageal reflux disease (GERD) is a common disease in young infants and manifests itself through unexpected vomiting, which causes discomfort and sometimes pain. Such an event can be measured and
combined with the generic monitoring for diagnostic purposes, which is attractive to pediatricians. However, to design such a system, several challenges have to be resolved.
First, for automated infant observation, the system should perform face detection, even when the infant shows large head-pose deviations or is positioned far from the camera view. Second, facial expressions of infants differ significantly from those of adults. Most state-of-the-art face detection and expression analysis methods are trained with adult datasets and therefore typically fail in pediatric applications. As a result, a specific approach is required for infants.
Third, public infant datasets are scarce, so that training a deep learning-based classifier with little data will not yield reliable results and may lead to over-fitting. Fourth, the trained classifier should be able to detect expressions when faces are partially occluded by objects. Finally, a steady expression-state output of the system is also important, since a false alarm can provide misleading information to caregivers and doctors for disease diagnosis. Moreover, frequent false alarms of an infant monitoring system are bothersome and make the system perceived as less reliable, even when the overall discomfort detection accuracy is high. To overcome the aforementioned challenges, we propose to use CNNs, because they have proven successful in multiple detection problems, offering a high accuracy, and they
can be robust when properly trained (Mueller et al.,
2017; Long et al., 2015). From these CNNs, we consider that an automated infant monitoring system can be constructed using Faster R-CNN (Ren et al., 2017), which can handle the variation in the aforementioned challenging situations. Faster R-CNN was adopted be-
cause it requires less memory than LSTM (Donahue
et al., 2017) and the processing is less complex. Ad-
ditionally, instead of expensive LSTM, we employ a
dynamic Hidden Markov Model (HMM) in combina-
tion with Faster R-CNN which models changes of the
expression over time, so that the frame-based dynam-
ics of the expression classification become more sta-
ble and avoid drastic changes. This combination en-
ables near real-time operation and provides both reliable detection and temporally stable behavior, which is a novel extension for this application.
In this work, we make the following contributions
to automated infant observations.
1. Multi-facial Expression Classification: Infant
expressions are subtle and instinctive, which ex-
plains why we distinguish infant expressions into
seven states, instead of a binary detection.
2. Hybrid CNN Model: We propose to combine
Faster R-CNN with HMM to model the dynamic
changes of the infant expression between video
frames over time, which improves the robust-
ness of expression classification and reduces time-
domain complexity considerably.
3. Large-scale Validation: The discomfort detec-
tion system is evaluated with large-scale datasets,
which encompass the datasets collected from both
clinical environments and normal daily life.
4. Consistent Expression Status: To the best of our knowledge, the temporal state consistency of an infant monitoring system output is evaluated for the first time in this work. The experimental results show
a significant improvement of the discomfort de-
tection accuracy as well as the output consistency
when using Faster R-CNN and HMM.
The remainder of this paper is organized as fol-
lows. Section 2 briefly introduces some related work.
Afterwards, the design details of the system are ex-
plained in Section 3. The experimental results for in-
fant expression analysis are provided in Section 4. Fi-
nally, Section 5 presents conclusions.
2 RELATED WORK
The absence of regular pain assessment for infants means that their pain is commonly undertreated. Un-
fortunately, no gold standard or universal approach
for infant pain assessment is available in pediatric
fields. For decades, researchers have paid significant
attention to devising multiple validated pain scoring
systems for facilitating the objective measurement of
pain. A thorough introduction of pain assessment
methodologies for neonates is provided in (Witt et al.,
2016). These assessment tools use facial expressions
as a main indicator, such as the revised FACES pain
scale (Hicks et al., 2001), FLACC (Jaskowski, 1998),
the Wong-Baker Faces scale, neonatal facial coding
system (NFCS) (Witt et al., 2016) and the 10-cm vi-
sual analog scale, and are widely used in healthcare
settings to assess patient pain.
Inspired by these pain assessments, some auto-
mated pain detection systems for neonates have been
studied based on facial expression analysis. In (Fo-
tiadou et al., 2014), a semi-automated system is pro-
posed for discomfort detection, which adopts an Ac-
tive Appearance Model (AAM) for facial appearance
modeling. Zhi et al. (Zhi et al., 2018) proposed an automatic pain detection method for infants using geometric features, after which a Support Vector Machine (SVM) is utilized to distinguish pain from no pain. In (Li
et al., 2016), it is proposed to exploit an automated
classification model, based on appearance features
extracted by local binary patterns. Sun et al. (Sun
et al., 2018) presented a discomfort detection system
based on a template matching method, in which neu-
tral faces of the specific subject are used as a tem-
plate. Frames containing different expressions com-
pared with the template are classified as discomfort.
However, all these mentioned methods can only han-
dle situations where infants show their frontal faces
without occlusions, hence giving limited robustness.
Recently, Convolutional Neural Networks
(CNNs) have become prevalent because of their
increased performance. Tavakolian et al. (Tavako-
lian and Hadid, 2018) proposed a pain expression
intensity estimation based on CNNs and a binary
coding. Lin et al. (Lin et al., 2018) have proposed
a CNN-based expression classifier trained with data
augmentation for adults, which is similar to our
work. However, these works target adults and, as already mentioned, are not applicable to infants. Therefore, we propose an infant expression analysis algorithm using Faster R-CNN and a temporal model for realizing temporal stability in decision making. Although CNNs can also be applied in the temporal domain, this approach is hampered by its high intrinsic complexity. Therefore, we explore an
alternative, since our aim is to implement a real-time
automated infant monitoring system suitable for
clinical practice.
Figure 1: Flowchart of the system design for infant discomfort analysis based on Faster R-CNN and HMM. The Faster R-CNN stage comprises convolutional layers, a proposal layer, an RoI pooling layer and fully connected layers that output bounding boxes (Bbox) and classes (cls); the stabilization stage consists of the HMM.
3 INFANT DISCOMFORT
DETECTION
3.1 R-CNN for Expression Detection
For expression detection and classification, the Faster R-CNN (Ren et al., 2017) framework combined with a VGG-net architecture (Simonyan and Zisserman, 2014) is adopted, since Faster R-CNN yields state-of-the-art performance for object detection. The training input of the expression detection network is represented as a tuple $(I_m, g_i, cls)$, where $I_m$ is the full-image frame containing the infant face and background, $g_i$ denotes the ground-truth bounding box encompassing the infant face, and $cls$ indicates the ground-truth expression label of that bounding box. In this application, $cls$ corresponds to one of the following expression states: Discomfort/pain, Unhappy, Neutral, Sleep, Joy, Open mouth and Pacifier, as defined in (Sullivan and Lewis, 2003). For training, the Regions of Interest (RoIs) that have an intersection over union (IoU) with the ground-truth bounding box larger than a specific threshold (IoU > 0.5) are assigned the corresponding ground-truth expression label $cls$ (positive samples), while all other RoIs are assigned as background (negative samples). To keep the training images balanced, we randomly sample negative RoIs from the labeled background RoIs. The total number of positive and negative RoIs should not exceed N = 128 for computing-platform reasons. These positive and selected negative RoIs form a minibatch for stochastic gradient descent (SGD). We adopt the same loss function as Faster R-CNN (Ren et al., 2017). For data augmentation purposes, we flipped all images of our infant expression training dataset in the horizontal direction.
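To make this sampling concrete, the following minimal Python sketch illustrates the RoI labeling and balanced minibatch construction described above. It is a simplified illustration under our own assumptions (plain (x1, y1, x2, y2) boxes, a single annotated face per image, and the illustrative function name sample_minibatch); it is not the actual Faster R-CNN training code.

```python
import random

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def sample_minibatch(rois, gt_box, gt_label, iou_thresh=0.5, max_rois=128):
    """Label proposed RoIs against one annotated face box and build a
    balanced minibatch of at most `max_rois` samples for SGD."""
    positives, negatives = [], []
    for box in rois:
        if iou(box, gt_box) > iou_thresh:
            positives.append((box, gt_label))   # expression label cls
        else:
            negatives.append((box, 0))          # background class
    # Randomly subsample the (usually abundant) negatives, so that the
    # total number of RoIs stays within the platform limit of N = 128.
    n_neg = max(0, min(len(negatives), max_rois - len(positives)))
    minibatch = positives + random.sample(negatives, n_neg)
    random.shuffle(minibatch)
    return minibatch
```

The horizontal flipping used for data augmentation then simply doubles this training set by mirroring every image together with its bounding boxes.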
3.2 Stabilization
Experiments have shown that a considerable num-
ber of false classifications occur between two cor-
rectly classified frames. Therefore, a Hidden Markov
Model (HMM) for modeling the dynamics of expres-
sion changes in a video sequence is utilized, to reduce
false positives and enhance the monitoring stability.
In our application, we model seven expressions of in-
terest and the background as states of the HMM. Be-
cause of limited video data availability with infants,
our HMM is trained separately from the CNN ex-
pression classifier. The transition probability between
two states is obtained by analyzing training sequences
with the ground truth of states. The expression esti-
mation updated by the HMM is calculated with a for-
ward method, which is computed in two steps: pre-
diction and update. In this work, we use and discuss only a first-order HMM, meaning that the current state depends only on the state of the previous frame.
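Since the transition probabilities are estimated from ground-truth state sequences, this step can be summarized in a few lines. The sketch below is a simplified illustration; the add-one (Laplace) smoothing is our own assumption, introduced so that transitions unseen in the limited training videos retain a small non-zero probability.

```python
import numpy as np

def estimate_transitions(sequences, k=8):
    """Estimate the k x k HMM transition matrix from ground-truth state
    sequences (seven expressions plus background, hence k = 8).
    Each sequence is a list of per-frame state indices in [0, k)."""
    counts = np.ones((k, k))                    # Laplace smoothing (assumption)
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev, cur] += 1              # count consecutive state pairs
    return counts / counts.sum(axis=1, keepdims=True)   # row-normalize
```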
Prediction. Given the observation sequence $O = \{o_1, ..., o_t\}$, the purpose of detecting and classifying the expression of a frame at time $t$ (abbreviated as frame $t$) is to find the maximum a-posteriori probability of the state $p(q_t \,|\, o_t)$. In our problem, $o_t$ corresponds to the union of all proposed RoIs in each frame, denoted as $o_t = \{b_{1,t}, ..., b_{n,t}\}$ with bounding boxes $b_{v,t}$, where $n$ is the total number of RoIs in frame $t$ and $d$ that number for frame $t-1$. For each individual RoI with index $u$, denoted as $b_{u,t-1}$ in frame $t-1$ ($1 \leq u \leq d$), the posterior state probability of that RoI for frame $t$ can be estimated by

$$p(q_t \,|\, b_{u,t-1}) = \sum_{i=1}^{k} p(q_t \,|\, q_{t-1} = s_i)\, p(q_{t-1} = s_i \,|\, b_{u,t-1}). \quad (1)$$

Here, $k$ denotes the total number of states $s_i$. This probability should be computed for all RoIs in frame $t-1$.
Update. For each RoI $b_{v,t}$ in frame $t$, where $1 \leq v \leq n$, the posterior probability $p(q_t \,|\, b_{v,t})$ is found by

$$p(q_t \,|\, b_{v,t}) = \frac{1}{Z} \frac{1}{m} \sum_{u=1}^{m} Ov(v, u)\, p(b_{v,t} \,|\, q_t)\, p(q_t \,|\, b_{u,t-1}), \quad (2)$$

where $Z$ is a normalization factor, $m$ denotes the number of considered RoIs in frame $t-1$, and the binary overlap function $Ov(v, u)$ indicates whether the IoU of RoIs $b_{u,t-1}$ and $b_{v,t}$ is at least 70%. In case there is no overlap, the contribution becomes zero, because $Ov(v, u) = 0$.
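The complete forward recursion of Eqs. (1) and (2) can then be sketched as follows. This is a minimal illustration under our own assumptions: the classifier scores stand in for the likelihoods $p(b_{v,t} \,|\, q_t)$, the iou() helper from the Section 3.1 sketch is reused, and falling back to the plain classifier score when no previous RoI overlaps is our simplification of the zero-contribution case.

```python
import numpy as np

def forward_step(prev_post, prev_boxes, cur_boxes, cur_lik, A, iou_thresh=0.7):
    """One prediction/update step of the first-order HMM.

    prev_post:  (d, k) array, p(q_{t-1} | b_{u,t-1}) for the d RoIs of frame t-1.
    prev_boxes: the d boxes of frame t-1; cur_boxes: the n boxes of frame t.
    cur_lik:    (n, k) array, classifier scores used as p(b_{v,t} | q_t).
    A:          (k, k) transition matrix, A[i, j] = p(q_t = j | q_{t-1} = i).
    """
    # Prediction, Eq. (1): propagate every previous RoI's state posterior
    # through the transition matrix.
    predicted = prev_post @ A                                  # (d, k)

    posteriors = np.zeros_like(cur_lik)                        # (n, k)
    for v, box in enumerate(cur_boxes):
        # Update, Eq. (2): average the predictions of overlapping previous
        # RoIs (Ov = 1 when IoU >= 70%), weight by the likelihood, normalize.
        overlaps = [predicted[u] for u, pb in enumerate(prev_boxes)
                    if iou(pb, box) >= iou_thresh]
        if overlaps:
            acc = cur_lik[v] * np.mean(overlaps, axis=0)
        else:
            acc = cur_lik[v].copy()    # no temporal evidence (our fallback)
        posteriors[v] = acc / acc.sum()                        # 1/Z factor
    return posteriors
```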
4 EXPERIMENTS AND RESULTS
4.1 Datasets
Public datasets containing only infants are rare, and datasets with ground-truth annotations for infant expressions are virtually unavailable. To the best of our knowledge, only two public datasets exist (Zamzmi et al., 2018) for infant pain/discomfort analysis: the COPE/iCOPE dataset (Brahnam et al., 2006) and a dataset collected from YouTube videos, described in (Harrison et al., 2014). However, neither of these two datasets can be used for our purpose. To solve this, we have manually collected a dataset consisting of 16,165 infant images from the Internet, in which the expressions are manually labeled by bounding boxes. After dataset augmentation, the total number of images used for training amounts to 32,330. Moreover, to evaluate the infant
monitoring system both clinically and practically, we have first randomly selected 11 out of 45 video sequences of different infants up to 2 years old, containing challenging situations such as large head-pose deviations and object occlusions. This set is denoted as Clinic and validates the clinical usage. The remaining video sequences are used to train the HMM. All these videos were recorded at the Maxima Medical Center (MMC), Veldhoven, the Netherlands, satisfying the ethical standards of the institution and allowing usage after obtaining written consent from the parents.
Videos of infant discomfort expressions are captured while the infant experiences pain from a heel prick, the placement of an intravenous line, or a vaccination, whereas other expressions are captured when infants stay at the hospi-
tal for medical care. All selected videos last at least
2 minutes, which is the required time for professional
observation and pain scoring. Besides this dataset,
similar to (Harrison et al., 2014), we have also collected 72 video sequences from YouTube, denoted as Youtube. The purpose of this dataset is to validate the practical application and to benchmark against state-of-the-art methods. Here, each video sequence contains one different infant.

Figure 2: Examples of infants in the training dataset, collected from the Internet. The bounding-box colors annotate the corresponding expressions.

All videos were uploaded by YouTube users, and therefore the video quality differs considerably (varying between good and poor) because of variable recording conditions. Moreover, these video sequences
also contain certain frames without infants, which can
be utilized to evaluate the reliability of all methods.
Videos in both testing datasets are fully annotated. The ground truth of each frame is annotated in the same way as the images in the training dataset. All ground-truth annotations of both the training and test sets are provided by professional practitioners at the MMC. The number of frames for each expression in both test datasets is listed in Table 1. As shown, the number of frames in each dataset exceeds 10,000 images, which is much larger than the experimental data of previous work (Sun et al., 2018; Zhi et al., 2018).
4.2 Metrics
In order to evaluate the accuracy of infant expres-
sion detection, the VOC2007 metrics (Everingham
et al., 2010) are utilized in all the evaluational exper-
iments, since VOC2007 is specifically designed for
detection and classification tasks. By definition, the
Average Precision (AP) indicates the detection accu-
racy of each expression, while the mean Average Pre-
cision (mAP) shows the overall performance of seven
expressions of interest. For analyzing the consistency
of the system output, the frequency of the False Ex-
pression Changes (FEC) is computed for each dataset.
A false expression exchange is defined as an occur-
rence of an expression change between one frame to
the next neighboring frame, that does not correspond
to the exchange indicated by the ground truth. This
can happen due to the false expression classification
and the localization of bounding boxes with false de-
tections. In addition to computing the frequency of
the FEC, a second metrics is introduced, defined as
the reduction rate, for analyzing the temporal stabil-
ity. This analysis applies to the system only using
R-CNNs and our proposed with temporal consistency
Table 1: Total number of frames within video sequences showing the respective expressions in both test datasets.

          Discomfort  Neutral  Sleep  Joy    Open mouth  Unhappy  Pacifier  All
Clinic    3,523       7,940    2,587  166    1,462       1,246    1,399     18,323
Youtube   3,499       4,622    631    2,592  1,046       1,185    821       14,396
The reduction rate is defined as

$$R = \frac{F - F_0}{F}, \quad (3)$$

where $F_0$ represents the frequency of occurrence of false changes when temporal analysis is used, and $F$ denotes the count of false expression changes without temporal processing.
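Both metrics are straightforward to compute from per-frame label sequences, as the following sketch shows. It is a simplified illustration assuming one expression label per frame; a predicted change counts as false whenever it does not coincide with the same change in the ground truth.

```python
def count_fec(pred, gt):
    """Count false expression changes: frame-to-frame label changes in the
    prediction that do not match the ground-truth change at the same frame."""
    return sum(1 for t in range(1, len(pred))
               if pred[t] != pred[t - 1]
               and (pred[t - 1], pred[t]) != (gt[t - 1], gt[t]))

def reduction_rate(f, f0):
    """Reduction rate R = (F - F0) / F of Eq. (3), where f counts the FECs
    without temporal processing and f0 those with temporal analysis."""
    return (f - f0) / float(f)
```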
4.3 Experimental Results
This section first provides results regarding the ex-
pression detection robustness using VOC2007 met-
rics (Everingham et al., 2010). Then, the consistency of the system output is evaluated with the metrics discussed above.
A. Accuracy Evaluation. The performance of discomfort detection achieved by our multi-class network is compared with conventional methods (Li et al., 2016; Sun et al., 2018), which adopt HOG and LBP features as facial descriptors. Tables 2 and 3 present the experimental results of multi-class classification based on CNN features, evaluated with the Clinic and Youtube datasets. The scores clearly indicate that the conventional methods using descriptors
cate that the conventional methods using descriptors
such as HOG and LBP, can hardly distinguish subtle
facial expressions, such as Joy, Open mouth and Un-
happy. Moreover, these methods are only capable of
detecting Discomfort and Neutral with a low preci-
sion, which is not sufficient for implementation in an
infant monitoring system. In contrast, the proposed
CNN-based framework achieves an attractive score
for the mAP of 82.3% and 83.4% for overall perfor-
mance in Tables 2 and 3, respectively. For discomfort
detection, it obtains an AP of 89.0% evaluated with
Clinic, and an AP of 90.3% with the Youtube set. It
can be observed that the accuracy and robustness of
discomfort detection using CNN architectures are sig-
nificantly increased compared with the conventional
methods.
When combining an HMM model with the detec-
tion, the overall system outperforms solely Faster R-
CNN (Ren et al., 2017) by 4.9% and 0.1% with the
Clinic and Youtube sets, respectively. Here, it can be
noticed that an increase in the AP occurs for Discom-
fort by 1% and 0.3% for the same datasets, respec-
tively, compared to only using Faster R-CNN (Ren
et al., 2017). This gain occurs because the trained R-CNN classifier performs less accurately on ambiguous expression pairs, such as Unhappy versus Joy and Discomfort versus Unhappy. However, by adopting a dynamic model (HMM), the performance for these expressions is enhanced, due to the stabilizing information of the previous frame. In addition,
the benefit of using HMM for Youtube is limited com-
pared to that for Clinic. This is explained by the du-
ration of video sequences in Youtube being shorter (a
few seconds) and consistent expressions occurring in
each sequence. Therefore, it is difficult for HMM to
improve the performance further. Fig. 3 shows exam-
ples of the expression detection obtained by the pro-
posed method evaluated with Clinic.
B. Output Consistency Evaluation. Besides the accuracy of expression detection for infant monitoring, the consistency of the proposed system output is evaluated when using the HMM, compared to solely using the R-CNN. Fig. 4 shows a bar plot of the false
changes between expressions (Discomfort, Unhappy
and others) occurring in both testing datasets. Dis-
comfort and Unhappy are mainly taken into consider-
ation for this experiment, because these two expres-
sions carry more clinical information for facilitating a
diagnosis. It can be seen from Fig. 4 that the false
exchanges are significantly reduced, especially be-
tween Unhappy and other expressions. Furthermore,
Table 4 provides the reduction ratios of FEC within
Discomfort, Unhappy and others. It can be noticed
that the FECs are reduced overall, with a reduction rate of up to 65%. Specifically, the false changes between Discomfort and other expressions, which are the main cause of false alarms in an infant monitoring system, are significantly reduced by 64.9% and 35.2% for Clinic and Youtube, respectively. Fig. 5 portrays an exemplary sequence of expression detections with and without the HMM. Comparing the two figures, it can be readily seen that after using the HMM, the detection becomes less noisy, resulting in a more stable and reliable performance.
C. Discussion: Although a high consistency is ob-
tained by using an HMM, certain drawbacks are still
noticeable. For example, the detection score of each
class updated by the HMM highly depends on the performance of the trained classifier. When the HMM is combined with a less accurate classifier, the detection can become stuck on false positives. However, by jointly training the Faster R-CNN and the HMM in an end-to-end fashion, these false positives can be further reduced.
Table 2: Average precision for 7 expressions and the corresponding mAP for each method, evaluated with Clinic. Boldface numbers indicate the highest score.

                  Discomfort  Neutral  Sleep  Joy    Open mouth  Unhappy  Pacifier  mAP
LBP+SVM           0.349       0.245    0.341  0.007  0.017       0.068    0.045     0.153
HOG+SVM           0.254       0.154    0.608  0.077  0.095       0.026    0.033     0.178
F. R-CNN          0.880       0.838    0.873  0.578  0.811       0.534    0.907     0.774
F. R-CNN + HMM    0.890       0.870    0.883  0.728  0.854       0.633    0.906     0.823

Table 3: Average precision for 7 expressions and the corresponding mAP for each method, evaluated with Youtube. Boldface numbers indicate the highest score.

                  Discomfort  Neutral  Sleep  Joy    Open mouth  Unhappy  Pacifier  mAP
LBP+SVM           0.201       0.513    0.091  0.433  0.037       0.054    0.001     0.190
HOG+SVM           0.129       0.377    0.091  0.466  0.162       0.048    0.045     0.188
F. R-CNN          0.900       0.752    0.969  0.846  0.720       0.742    0.904     0.833
F. R-CNN + HMM    0.903       0.783    0.983  0.831  0.608       0.713    0.902     0.834
Figure 3: Examples of expression detection obtained by the proposed algorithm, evaluated with Clinic.

Figure 4: Frequency of the false expression changes (FECs) between the expressions of interest. (a) Clinic; (b) Youtube.

Figure 5: Expression detection for an exemplary sequence. Left: detection by Faster R-CNN without HMM; right: detection with HMM. Red lines represent ground-truth information, blue lines denote detections by the proposed algorithms.
Table 4: Reduction rate of false expression changes (FECs) between frames of the R-CNN framework with the adoption of the HMM, compared to solely Faster R-CNN. Boldface numbers indicate the highest score.

                        Clinic  Youtube
Discomfort - Unhappy    0.337   0.474
Unhappy - Discomfort    0.409   0.563
Discomfort - Other      0.649   0.352
Other - Discomfort      0.610   0.325
Unhappy - Other         0.587   0.489
Other - Unhappy         0.651   0.483
In our experiment, the algorithm is executed on
a GTX-1080ti GPU, which achieves a frame rate of
7 fps. It can be assumed that with a more advanced GPU, or by combining the detector with a tracking method, the computation speed will increase, thereby allowing the model to be used in a real-time infant monitoring system.
5 CONCLUSIONS AND FUTURE
WORK
This paper has proposed a near real-time video-based
infant monitoring system, using Faster R-CNN com-
bined with a Hidden Markov Model. The HMM in-
creases the stability of decision making over time
and reduces the noise in the detected expressions. In contrast to conventional methods, which apply face detection and expression classification separately, we have trained a ConvNet detector that directly outputs the expressions. The experimental results have shown an
AP achieving up to 90.3% for discomfort detection,
which provides a dramatic accuracy increase com-
pared to conventional methods (larger than 50%). The
high-accuracy discomfort detection can be combined
with some disease analysis such as GERD. In addi-
tion, the consistency of the system output is evalu-
ated, and the experimental results have shown that the
false expression changes between frames can be sig-
nificantly reduced with a temporal analysis. In the
future, as more video sequences of infants become
available, we will train our expression classifier and
temporal analysis end-to-end.
REFERENCES
Brahnam, S., Chuang, C.-F., Shih, F. Y., and Slack, M. R.
(2006). SVM classification of neonatal facial images of
pain. In Bloch, I., Petrosino, A., and Tettamanzi, A.
G. B., editors, Fuzzy Logic and Applications, pages
121–128.
Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan,
S., Guadarrama, S., Saenko, K., and Darrell, T.
(2017). Long-term recurrent convolutional networks
for visual recognition and description. IEEE Trans.
on Pattern Anal. Mach. Intell., 39(4):677–691.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn,
J., and Zisserman, A. (2010). The PASCAL Visual Object Classes (VOC) challenge. International Journal of
Computer Vision, 88(2):303–338.
Fotiadou, E., Zinger, S., Tjon a Ten, W. E., Oetomo, S.,
and de With, P. H. N. (2014). Video-based facial dis-
comfort analysis for infants. In Proc. SPIE 9029, Vi-
sual Information Processing and Communication V,
90290F.
Harrison, D., Sampson, M., Reszel, J., Abdulla, K., Barrow-
man, N., Cumber, J., Li, C., Nicholls, S., and Pound, C. M. (2014). Too many crying babies: a systematic review of pain management practices during immunizations on YouTube. BMC Pediatrics, 14:134.
Hicks, C., Baeyer, C., Spafford, P., van Korlaar, I., and
Goodenough, B. (2001). The Faces Pain Scale - Revised: Toward a common metric in pediatric pain measurement. Pain, 93:173–183.
Jaskowski, S. K. (1998). The FLACC: A behavioral scale for
scoring postoperative pain in young children. AACN
Nursing Scan in Critical Care, 8:16.
Li, C., Zinger, S., Tjon a Ten, W. E., and de With, P. H. N.
(2016). Video-based discomfort detection for infants
using a constrained local model. In 2016 Int. Conf.
Systems, Signals and Image Proc. (IWSSIP), pages 1–
4.
Lin, F., Hong, R., Zhou, W., and Li, H. (2018). Facial ex-
pression recognition with data augmentation and com-
pact feature learning. In 2018 25th IEEE Int. Conf. on
Image Processing (ICIP), pages 1957–1961.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully
convolutional networks for semantic segmentation.
In 2015 IEEE Conf. on Comput. Vis. Patt. Recog.
(CVPR), pages 3431–3440.
Mueller, M., Smith, N., and Ghanem, B. (2017). Context-
aware correlation filter tracking. In 2017 IEEE Conf.
Comp. Vision Pattern Recogn. (CVPR), pages 1387–
1395.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region
proposal networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39(6):1137–1149.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv 1409.1556.
Sullivan, M. and Lewis, M. (2003). Emotional expressions
of young infants and children. Infants & Young Chil-
dren, 16:120–142.
Sun, Y., Shan, C., Tan, T., Long, X., Pourtaherian, A.,
Zinger, S., and de With, P. H. N. (2018). Video-based
discomfort detection for infants. Machine Vision and
Applications.
Tavakolian, M. and Hadid, A. (2018). Deep binary representation of facial expressions: A novel framework for automatic pain intensity recognition. In 2018 25th IEEE Int. Conf. on Image Proc. (ICIP), pages 1952–1956.
Witt, N., Coynor, S., Edwards, C., and Bradshaw, H. (2016).
A guide to pain assessment and management in the
neonate. Current emergency and hospital medicine
reports, 4:1–10.
Zamzmi, G., Kasturi, R., Goldgof, D., Zhi, R., Ashmeade,
T., and Sun, Y. (2018). A review of automated pain as-
sessment in infants: Features, classification tasks, and
databases. IEEE Reviews in Biomedical Engineering,
11:77–96.
Zhi, R., Zamzmi, G., Goldgof, D., Ashmeade, T., and Sun,
Y. (2018). Automatic infants’ pain assessment by dy-
namic facial representation: Effect of profile view,
gestational age, gender, and race. Journal of Clinical Medicine.