Let Me Take a Better Look: Towards Video-Based Age Estimation
Krešimir Bešenić¹, Igor S. Pandžić¹ and Jörgen Ahlberg²
¹Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia
²Computer Vision Laboratory, Linköping University, 58183 Linköping, Sweden
Keywords:
Age, Video, Benchmark, Semi-Supervised, Pseudo-Labeling.
Abstract:
Taking a better look at subjects of interest helps humans improve confidence in their age estimates. Unlike
still images, sequences offer spatio-temporal dynamic information that contains many cues related to age
progression. A review of previous work on video-based age estimation indicates that this is an underexplored
field of research. This may be caused by the lack of a well-defined and publicly accessible video benchmark
protocol, as well as the absence of video-oriented training data. To address the former issue, we propose
a carefully designed video age estimation benchmark protocol and make it publicly available. To address
the latter issue, we design a video-specific age estimation method that leverages pseudo-labeling and semi-
supervised learning. Our results show that the proposed method outperforms image-based baselines on both
offline and online benchmark protocols, while the online estimation stability is improved by more than 50%.
1 INTRODUCTION
Human visual and cognitive systems allow us to
perform many incredibly complex tasks effortlessly.
However, estimating biological age from unknown
faces based only on visual cues is a challenging task,
even for humans. Research suggests that humans mis-
estimate age from facial images by 4.7 to 7.2 years on
average (Han et al., 2013). Large variations in appar-
ent age for subjects of the same biological age can be
caused not only by genetic predispositions and health,
but also by many different external factors such as
living conditions, weather exposure, facial cosmetics,
surgical operations, facial hair, and even facial expres-
sions (Dibeklioğlu et al., 2015).
While automatic age estimation has prominent ap-
plication fields, such as human-computer interaction
(HCI), precision advertising, and the beauty industry,
adoption in stricter fields, such as security, forensics,
and law enforcement, remains limited.
Age estimation models often fail under in-the-wild
conditions by a margin that is not acceptable for such
demanding application fields.
Humans and ML models can estimate biological
age from a single image with comparable accuracy.
However, when humans are not confident in their es-
timation, they tend to take a better look by examining
the subject for a longer period of time and from differ-
ent viewpoints. This observation is supported by the
review in (Hadid, 2011) stating that psychological and
neural studies (Bassili, 1979; Hill and Johnston, 2001;
Knight and Johnston, 1997; O’Toole et al., 2002) in-
dicate that head-pose and facial expression changes
provide important cues for face analysis. Existing
image-based methods can be applied directly to video
frames. However, (Ji et al., 2018) point out that de-
ploying image-based age estimation models directly
to videos leads to estimation stability issues, while
(Hadid, 2011) states that the image-based approach
exploits the abundance of frames in videos but ignores
useful temporal correlations and facial dynamics.
This work explores the potential of videos and
video-based methods for refinement of automatic fa-
cial age estimation. Our main contributions are sum-
marized as follows: (i) We review previous research
and public datasets for video-based age estimation to
pinpoint the main obstacles in this underresearched
field. (ii) We reproduce the recent frame-based age
estimation benchmark from (Hazirbas et al., 2021),
achieve state-of-the-art results on their protocol, and
make the missing benchmark metadata publicly avail-
able. (iii) We design a new video-based age estima-
tion benchmark and ensure reproducibility by mak-
ing the framework and metadata publicly available.
(iv) We design a semi-supervised video age estima-
tion method that overcomes the lack of labeled train-
ing data and outperforms its image-based counterpart,
thus setting baseline results on the proposed bench-
mark.
2 RELATED WORK
The main premise of this study is that videos can pro-
vide extensive information useful for age estimation.
However, there is still only a handful of published
datasets and methods that leverage video information
to improve age estimation. Systematic surveys on
image-based age estimation methods and datasets can
be found in (Panis et al., 2016; Angulu et al., 2018;
Al-Shannaq and Elrefaei, 2019; Othmani et al., 2020;
Agbo-Ajala and Viriri, 2021). This section briefly re-
views video-based age estimation methods and video
datasets with age annotations.
2.1 Video-Based Age Estimation
Methods
According to (Hadid, 2011), there are two main
strategies for video-based face analysis. The simplest
strategy is to apply image-based methods to all video
frames or a set of sampled frames from a video. In-
dividual frame results are then fused across the se-
quence. A more elaborate strategy consists of lever-
aging both face appearance and face dynamics infor-
mation through spatio-temporal modeling. Hadid im-
plemented two baseline methods based on SVM clas-
sifiers, LBP features (Ojala et al., 1996; Ojala et al.,
2002) for static images, and Volume-LBP features
(Zhao and Pietikainen, 2007) for video sequences.
They used 2,000 web-scraped videos manually an-
notated with apparent age labels. Their experiments
indicated that the video-based approach can improve
performance for face recognition, gender classifica-
tion, and ethnicity classification tasks, but not for age
estimation.
(Dibeklioğlu et al., 2012a) focused on the dis-
criminative power of smile dynamics for age estima-
tion. They leveraged movement features of facial key
points to complement appearance-based LBP features
and improve SVM estimation accuracy. Their exper-
iments on the UvA-NEMO Smile Database (Dibeklioğlu et al., 2012b) showed significant performance
improvement over the image-based method, as well
as compared to the method from (Hadid, 2011). In
(Dibeklioğlu et al., 2015), they also considered IEF
(Alnajar et al., 2012), GEF (Alnajar et al., 2012), and
BIF (Guo et al., 2009) features. Surface area features
based on a mesh model were used instead of facial
key-point movement features, followed by a novel
two-level classifier. Experimentation was extended to
the introduced UvA-NEMO Disgust dataset. Video-
based methods significantly outperformed the image-
based baseline on both versions of the UvA-NEMO
dataset.
Instead of handcrafted features used in (Hadid,
2011; Dibeklioğlu et al., 2012a; Dibeklioğlu et al.,
2015), (Pei et al., 2019) used a combination of
CNN, RNN, and attention modules. Their proposed
Spatially-Indexed Attention Model (SIAM) used a
CNN for appearance modeling, a spatial attention
module for the detection of salient facial regions, an
RNN model for capturing facial dynamics, and a tem-
poral attention module for temporal saliency, trained
in an end-to-end manner. UvA-NEMO Smile and
UvA-NEMO Disgust datasets were once again used,
following the protocol from (Dibeklioğlu et al., 2015).
The proposed neural network-based approach outper-
formed previous methods based on handcrafted fea-
tures. (Ji et al., 2018) pointed out that deploying
image-based age estimation models directly to videos
often suffers from estimation stability issues. To ad-
dress this, they proposed a combination of CNN-
based feature extraction and attention-based feature
aggregation modules, sharing some similarities with
(Pei et al., 2019). Their loss function combined MSE
loss and a component for estimation stability. The
model was not trained in an end-to-end manner, as
the CNN was trained on the MORPH-II (Ricanek and
Tesafaye, 2006) image dataset. To train the feature
aggregation module, they built a video dataset com-
prising 18,282 frames from a single twelve-minute
video of one subject. Their experiments demonstrated
improvements with respect to both age estimation ac-
curacy and stability.
To attenuate the effect of head-pose on video-
based age estimation, (Han et al., 2021) based their
method on pose-invariant uv texture maps recon-
structed from video frames by a Wasserstein-based
GAN. They introduced the UvAge video dataset and
used it both for training and evaluation. (Zhang and
Bao, 2022) combined multi-loss CNN for head-pose
estimation (Ruiz et al., 2018) and DRF (Shen et al.,
2018) for age estimation. To mitigate the negative ef-
fect of head-pose variation, they trained a head-pose
model on the 300W-LP dataset (Zhu et al., 2016) and
used it to create multiple subsets of the CACD (Chen
et al., 2015) and AFAD (Niu et al., 2016) datasets.
Two 12-minute facial videos from two subjects were
collected for evaluation purposes. Using frontal mod-
els and selecting frames with near-frontal faces im-
proved both age estimation accuracy and stability.
Both this work and the work from (Han et al., 2021)
Table 1: Overview of the video-based age estimation datasets.
Dataset Subjects Videos Age range Demographics Head pose Illumination
UvA-NEMO Smile (Dibeklioğlu et al., 2012b) 400 1,240 8 - 76 unbalanced mostly frontal constrained
UvA-NEMO Disgust (Dibeklioğlu et al., 2015) 324 518 8 - 76 unbalanced mostly frontal constrained
UvAge (Han et al., 2021) 516 6,898 16 - 83 unconstrained unconstrained unconstrained
Casual Conversations (Hazirbas et al., 2021) 3,011 45,186 18 - 85 semi-balanced unconstrained unconstrained
Casual Conversations Mini (Hazirbas et al., 2021) 3,011 6,022 18 - 85 semi-balanced unconstrained balanced
Casual Conversations v2 (Porgali et al., 2023) 5,567 26,467 18 - 81 semi-balanced unconstrained unconstrained
were focused on achieving robustness to head pose
changes without exploiting any temporal information
from videos.
A recurring topic in the reviewed work is the lack
of publicly available video data, as almost all au-
thors resorted to data collection. However, the col-
lected data was not made publicly available in (Hadid,
2011; Ji et al., 2018; Zhang and Bao, 2022). Manual
data collection often results in very small datasets. Ji
et al. and Zhang et al. used very limited amounts
of video data (i.e., one or two videos), undermining
the reliability of their conclusions. Dibeklioğlu et al.
based their work on a larger video dataset, but used it
for both training and evaluation. The very specific
nature of the used data, where every subject transitions
from a neutral to a smiling facial expression, potentially
caused overfitting to this specific type of data. They
introduced a dataset related to a different facial ex-
pression (disgust) but, like Pei et al., did not perform
cross-domain testing. Ji et al., Han et al., and
Zhang et al. did not perform a comparison with pre-
vious video-based age estimation methods.
2.2 Video-Based Age Estimation
Datasets
A lack of a large public in-the-wild video-based age
estimation benchmark undermines the convincing-
ness of some of the reviewed findings and the ability
to fairly compare introduced methods. This section
reviews video-based age estimation datasets used in
the reviewed work, as well as some recent datasets
suitable for this purpose.
UvA-NEMO Smile Dataset was initially intro-
duced to study differences between spontaneous and
posed smiles in (Dibeklioğlu et al., 2012b). The
dataset consists of 597 spontaneous and 643 posed
smile recordings, totaling 1,240 videos. The videos
were collected from 400 volunteers (185 female and
215 male) with ages ranging from 8 to 76 years. Sub-
jects were mostly Caucasian. The recordings were
done in a controlled environment, with constrained il-
lumination and high-resolution cameras. Each video
segment starts with a neutral expression and transi-
tions to a smiling expression.
UvA-NEMO Disgust Dataset (Dibeklioğlu et al.,
2015) was collected concurrently with the UvA-
NEMO Smile Dataset following the same record-
ing setup. 324 volunteers (152 female, 172 male)
were recorded posing disgust facial expressions. 313
of them also participated in the UvA-NEMO Smile
Dataset collection. Similar to the Smile version of the
dataset, age varies from 8 to 76 years, and the subjects
are mostly Caucasian. Each of the 518 disgust video
segments once again starts with a neutral expression
and transitions to the target expression.
UvAge Dataset (Han et al., 2021) was created
specifically for age estimation from videos. It con-
sists of 6,898 videos from 516 subjects. The proposed
web-scraping technique was based on collecting
videos of celebrities with birth information available
on Wikipedia. To obtain reliable recording dates,
they used traceable public events such as the
Academy Awards or the G20 summit. The videos
were manually verified and segmented into sequences
containing only one subject. Along with age labels,
each video was also annotated with identity, gender,
ethnicity, and occupation.
Casual Conversations Dataset (CC) (Hazirbas
et al., 2021) is a recently published video dataset from
Facebook (Meta) AI designed for measuring fairness
of computer vision and audio models across a diverse
set of ages, genders, apparent skin tones, and ambi-
ent lighting conditions. It consists of 45,186 high-
quality videos collected from 3,011 paid individuals
who agreed to participate in the project and explicitly
provided age and gender labels themselves. A group
of trained annotators additionally labeled skin tone
according to the Fitzpatrick scale (Fitzpatrick, 1975) and
ambient light type. The authors also proposed a well-
balanced subset of the CC dataset, denoted as Ca-
sual Conversations Mini (CCMini). The subset was
formed by selecting one dark and one bright video
per subject (when possible) to have a balanced light-
ing distribution, with a total of 6,022 videos.
Casual Conversations v2 Dataset (Porgali et al.,
2023) is a follow-up to the original CC dataset
(CCv2), curated with special focus on geographical
diversity. 5,567 subjects from 7 countries were paid
to participate in data collection. The 26,467 col-
lected videos amount to 320 hours of scripted and 354
hours of non-scripted conversations. Some of the la-
bels, such as age, gender, language/dialect, disabil-
ity, physical adornments/attributes, and geo-location,
were self-provided by the participants. Trained an-
notators additionally labeled the data for Fitz-
patrick Skin Type (Fitzpatrick, 1975), Monk Skin
Tone (Monk, 2014), voice timbre, recording setup,
and per-second activity.
A summary of the reviewed datasets is presented
in Table 1. Our efforts to receive access to UvA-
NEMO Smile, UvA-NEMO Disgust, and UvAge
datasets were unsuccessful, making the Casual Con-
versation datasets the only publicly accessible re-
source suitable for video-based age estimation re-
search. However, the CC dataset licenses permit the
use of provided labels only for evaluation purposes.
3 VIDEO AGE ESTIMATION
BENCHMARK DATA
Not only are the Casual Conversations datasets the
only publicly accessible video-based datasets, they were also
curated with a special focus on ethical data collec-
tion and demographic fairness. Moreover, they are
the largest video datasets with precise self-provided
age annotations, and their licenses allow for evalua-
tion of both academic and commercial models¹. All
this makes them great candidates for a video age esti-
mation benchmark.
3.1 CCMiniIMG
The authors of the CC dataset explored its poten-
tial for age estimation evaluation on CCMini, a well-
balanced subset of the CC dataset. Their benchmark
data consists of 100 frames sampled from each of the
videos.
For further reference, we dub this image set
CCMiniIMG. Although this initial effort toward a
CC-based age estimation benchmark motivated our
work, we examine its adequacy for video-based age es-
timation. CCMiniIMG is made available in the form
of pre-extracted raw video frames, where 100 frames
are provided for each of 6,022 videos. However, the
authors did not provide face detection metadata or
clear information on the frame sampling procedure.
The lack of face detection metadata prevents the ex-
act reproduction of their evaluation protocol as some
frames contain multiple subjects. The lack of infor-
mation on the frame sampling method and the fact
¹ For specific conditions, please refer to the CC license agreements.
Figure 1: CCMiniVID video processing framework. All
faces are tracked with a commercial tracking system. The
metadata of multi-subject videos is manually filtered to con-
tain only data related to the subject of interest. The frame-
work extracts aligned face crop sequences using the raw CC
videos and the produced tracking data.
that the protocol relies only on frames rather than con-
tinuous sequences prevents the evaluation of video-
based age estimation methods that leverage temporal
information.
3.2 CCMiniVID
To design a video benchmark dataset for age esti-
mation based on the CC data, we follow CCMini-
IMG and select 6,022 well-balanced videos from the
CCMini subset. All videos were recorded at 30 FPS
and the mean video duration is 64.44±13.56 seconds,
with 99% of videos lasting for at least 20 seconds.
We set our target to extract sequences with 20 sec-
onds of continuous face presence from each video,
where possible. We utilize a commercial face tracking
system from Visage Technologies² to produce high-
quality frame-level tracking data comprising 75 facial
landmark points, apparent pitch, yaw, and roll head
angles, tracking quality, and face scale. We proceed
with semi-automatic tracking data analysis, as shown
in Figure 1.
The CC data does not provide correspondence be-
tween subjects and labels for the multi-subject videos,
potentially causing erroneous subject-level labels. We
perform automatic detection of multi-subject video
candidates, followed by manual verification and se-
lection of tracking streams in 108 videos. In two of
the videos, the subjects of interest are facing away from
the camera, making their faces fully self-occluded.
These videos are not suitable for a face-oriented eval-
uation benchmark. Continuous face tracking with a
target duration of 20 seconds was achieved for 5,932
of 6,022 videos. Manual verification of the remain-
ing 90 videos showed that in 21 videos continuous
tracking was not sustainable due to occlusions, self-
occlusions, camera issues, or extreme lighting. 69
videos were shorter than 20 seconds, with the shortest
one lasting for only 6 seconds.
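The continuous-sequence selection step can be sketched as follows. This is a minimal illustration assuming a boolean per-frame tracking status has already been derived from the tracking metadata; the status field, target length, and fallback behavior are our assumptions, not the exact extraction logic of the released framework.

```python
import numpy as np

TARGET_FRAMES = 600  # 20 s at 30 FPS

def first_continuous_window(tracked: np.ndarray, target: int = TARGET_FRAMES):
    """Return (start, end) of the first run of `target` continuously tracked
    frames, or the longest available run if the target cannot be met."""
    best = (0, 0)
    start = None
    for i, ok in enumerate(list(tracked) + [False]):   # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= target:
                return start, start + target            # first window of target length
            if i - start > best[1] - best[0]:
                best = (start, i)                       # remember the longest run so far
            start = None
    return best
```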
² https://visagetechnologies.com/facetrack/
Table 2: Age classification accuracy results for the baseline methods from (Hazirbas et al., 2021) and our image-based age
estimation model from (Bešenić et al., 2022) on the CCMiniIMG benchmark data according to the image-based evaluation
protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
(Levi and Hassner, 2015) 38.05 37.44 39.48 66.67 39.56 38.72 40.84 36.47 36.47 34.89 38.49 37.04
(Lee et al., 2018) 42.26 42.28 44.53 100.00 42.33 41.78 42.30 42.79 42.44 37.99 42.94 41.12
(Serengil and Ozpinar, 2020) 54.32 54.21 56.18 83.33 46.51 55.52 54.59 55.78 53.78 52.57 54.17 55.20
(Bešenić et al., 2022) 73.06 70.20 76.48 87.50 76.72 82.15 72.90 70.24 67.52 65.61 73.90 71.64
Table 3: Age classification accuracy results for our image-based age estimation model from (Bešenić et al., 2022) on the three
versions of CCMiniVID benchmark data according to the image-based evaluation protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
CCMiniIMG 73.06 70.20 76.48 87.50 76.72 82.15 72.90 70.24 67.52 65.61 73.90 71.64
CCMiniVID-O 73.60 70.90 76.75 93.75 77.59 81.31 74.25 74.09 65.82 67.48 75.18 70.96
CCMiniVID-A 73.50 70.56 77.02 87.50 77.59 81.55 73.13 74.09 66.67 67.24 75.00 70.98
CCMiniVID-R 73.59 70.85 76.81 93.55 77.23 81.38 74.43 73.88 65.70 67.49 75.17 70.99
Based on the obtained tracking data and man-
ual verification, we propose three versions of the
CCMiniVID benchmark data. CCMiniVID-O, where
O stands for ”original”, comprises only videos from
the original CCMini subset. It contains 5,932
sequences with continuously tracked 600 frames
(20s@30FPS) and 90 outlier videos that have be-
tween 100 and 599 continuously tracked frames. It
also contains the aforementioned 2 videos where sub-
jects of interest are not visible. In CCMiniVID-A,
where A stands for ”alternatives”, replacements for
92 problematic videos were manually selected from
the full CC dataset by looking for the most similar
substitutes in video galleries of target subjects. This
allows for consistent evaluation with 600 consecu-
tively tracked frames for all 3,011 subjects from the
CC dataset. Additionally, we propose CCMiniVID-
R, where R stands for ”reduced”; a simple subset
of CCMiniVID-O where the 92 problematic videos
were removed. This allows for precise and consis-
tent testing without the need for downloading the full
6.9 TB CC dataset. Figure 2 presents distributions
for CCMiniVID-R main labels, while a broader ex-
plorative analysis is available in Appendix B.
4 VIDEO AGE ESTIMATION
BENCHMARK PROTOCOL
To design the benchmark protocol, we first review the
image-based age classification protocol introduced in
(Hazirbas et al., 2021).
Figure 2: CCMiniVID-R label distributions for gender, age,
and skin tone (subject-level) and lighting (video-level).
4.1 Image-Based Benchmark Protocol
The authors of (Hazirbas et al., 2021) extracted 100
frames from each of the 6,022 CCMiniIMG videos
and detected faces with the DLIB face detector (King,
2009). The authors provide only video-level metadata
and raw frames. Face detection metadata was not pro-
vided, and neither face detection setup nor frame sam-
pling algorithm were specified. In the proposed pro-
tocol, age estimation is treated as a 3-class age classi-
fication task. The median of 100 image-level age es-
timations is used as the video-level estimation, which
is then mapped to 3 predefined age groups (i.e., 18-
30, 31-45, and 46-85). Classification accuracy was
selected as the main metric. The protocol addition-
ally relies on the dataset’s auxiliary labels to calcu-
late accuracy across 3 gender groups, 6 apparent skin
tone types, and 2 ambient lighting types. The proto-
col reproduction details and metadata are available in
Appendix A.
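For reference, the video-level aggregation and classification step of this protocol can be sketched as below; the function names and data layout are illustrative, assuming roughly 100 frame-level age estimates per video.

```python
import numpy as np

AGE_GROUPS = [(18, 30), (31, 45), (46, 85)]  # the 3 predefined age classes

def to_group(age: float) -> int:
    """Map a continuous age value to one of the 3 predefined groups."""
    for idx, (lo, hi) in enumerate(AGE_GROUPS):
        if lo <= age <= hi:
            return idx
    # Clamp values outside 18-85 to the nearest group.
    return 0 if age < 18 else len(AGE_GROUPS) - 1

def video_level_class(frame_estimates) -> int:
    """Aggregate frame-level age estimates into a video-level class via the median."""
    return to_group(float(np.median(frame_estimates)))

def classification_accuracy(frame_estimates_per_video, true_ages) -> float:
    """Classification accuracy over all videos, as in the CCMiniIMG protocol."""
    hits = [video_level_class(est) == to_group(age)
            for est, age in zip(frame_estimates_per_video, true_ages)]
    return float(np.mean(hits))
```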
4.1.1 Image-Based Baseline Results
We evaluate the performance of our best-performing
image-based age estimation model from (Bešenić
et al., 2022) and compare it to the previously re-
ported SOTA results from (Hazirbas et al., 2021).
The selected model is based on a CNN architecture
from MobileFaceNet family (Chen et al., 2018) and
trained on the large in-the-wild B3FD image dataset
(Bešenić et al., 2022) using the DLDLv2 age esti-
mation method (Gao et al., 2018). The model takes
128×128 RGB face crop inputs and outputs 101 val-
ues corresponding to ages from 0 to 100. More details
are available in (Bešenić et al., 2022).
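The age value is decoded from the 101-way output as the expectation of the predicted label distribution, in line with DLDLv2. A minimal sketch, assuming `logits` holds the raw 101-dimensional output for one face crop:

```python
import numpy as np

def expected_age(logits: np.ndarray) -> float:
    """Decode an age estimate from 101 logits (ages 0-100) as the
    expectation of the softmax label distribution (DLDLv2-style)."""
    logits = logits - logits.max()           # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    ages = np.arange(101, dtype=np.float64)  # ages 0..100
    return float((probs * ages).sum())
```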
Table 2 presents the baseline results. The selected
model outperforms the overall age classification ac-
curacy of the previously reported leading method
(Serengil and Ozpinar, 2020) by a large margin of
18.74 points, establishing the new state-of-the-art on
this benchmark. Although it outperforms the base-
line methods by a large margin, it shows a similar rela-
tive performance drop for female subjects, darker skin
tones, and poorly illuminated recordings.
To validate the CCMiniVID benchmark data from
Section 3, we apply the image-level baseline model
and protocol to its three versions. Table 3 shows
a comparison with results obtained on the original
CCMiniIMG data. The results are closely matched,
showing that while the CCMiniVID benchmark data
offers several benefits over CCMiniIMG, such as con-
tinuous video sequences, face tracking data, and clear
mapping between video subjects and the dataset’s
metadata, it retains a very similar difficulty level.
4.2 Video-Based Benchmark Protocol
Methods that work with sequential (i.e., temporal)
data can be either offline or online. Offline methods
process a video as a unit non-causally and produce a
single estimate for the whole video. Temporal stabil-
ity is not an issue since only one estimate per video
is produced. Therefore, we find Mean Absolute Error
(MAE) to be a sufficient metric for offline estimation
methods. The absolute error is calculated based on a
single estimate per video, while the mean is calculated
over all videos in the dataset. Online methods produce
updated age estimations in real-time as new video
frames are captured. For online methods, we propose
Temporal Mean Absolute Error (tMAE) and Tempo-
ral Standard Deviation (tSTD) as the benchmark met-
rics to evaluate the method's accuracy and online
estimation stability, respectively. These are standard
MAE and STD metrics calculated at the frame level,
as online methods produce new estimates for each of
the frames in the temporal dimension. Following the
CCMiniIMG protocol, we rely on the dataset’s addi-
tional labels to calculate the proposed metrics across
3 gender groups, 6 apparent skin tone types, and
2 ambient lighting types. To unambiguously define
the benchmark protocol, we make the metadata (in-
cluding the face tracking data from the commercial
tracking system), the test set extraction framework,
and the evaluation framework publicly available at
https://github.com/kbesenic/CCMiniVID.
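A minimal sketch of the proposed metrics is given below. It assumes frame-level age estimates are available for each video; tSTD is computed here as the per-video spread of the frame-level estimates averaged over videos, while the exact aggregation used for the benchmark is defined by the released evaluation framework.

```python
import numpy as np

def offline_mae(frame_estimates_per_video, true_ages) -> float:
    """Offline protocol: a single (median) estimate per video, MAE over videos."""
    errors = [abs(float(np.median(est)) - age)
              for est, age in zip(frame_estimates_per_video, true_ages)]
    return float(np.mean(errors))

def online_tmae_tstd(frame_estimates_per_video, true_ages):
    """Online protocol: frame-level accuracy (tMAE) and stability (tSTD)."""
    tmae, tstd = [], []
    for est, age in zip(frame_estimates_per_video, true_ages):
        est = np.asarray(est, dtype=np.float64)
        tmae.append(np.abs(est - age).mean())  # frame-level absolute error
        tstd.append(est.std())                 # spread of estimates across frames
    return float(np.mean(tmae)), float(np.mean(tstd))
```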
4.2.1 Video-Based Baseline Results
Tables 4 and 5 present results of our image-based age
estimation model described in Section 4.1.1 on the
three CCMiniVID benchmark dataset versions, fol-
lowing the previously described offline and online
video protocols, respectively. To comply with the of-
fline estimation protocol, which expects a single esti-
mate per video, we use the median of the frame-level
estimations to calculate MAE. The tMAE and tSTD
metrics for the online protocol are calculated directly
from the frame-level estimation errors.
The results obtained on the three versions are very
similar, and the relative performance drop can be ob-
served in the case of female subjects, darker skin
tones, and poor lighting. High tSTD values
indicate unstable age estimation across video frames.
The large gap between overall offline and online MAE
indicates that video-level estimations are more precise
than frame-level estimations.
5 TOWARDS VIDEO-BASED AGE
ESTIMATION METHOD
All previously presented results are obtained with
image-based age estimation methods. The main
premise of this work is that age can be estimated more
precisely by taking a better look, i.e. by using a longer
video sequence instead of a single image. To verify
this, we first analyze how video sequence duration af-
fects the age estimation accuracy.
5.1 Taking a Better Look
For this experiment, we use the CCMiniVID-A
video benchmark data which contains 600 frames
(20s@30FPS) of continuous face tracking for two
videos of each subject in the CCMini dataset. We
find this version of the benchmark data most suitable
Table 4: Offline age estimation MAE for our image-based age estimation model from (Bešenić et al., 2022) on the three
versions of CCMiniVID benchmark data according to the offline video-based evaluation protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
CCMiniVID-O 5.25 5.61 4.81 3.82 5.14 4.78 5.02 5.15 5.56 5.95 5.04 5.60
CCMiniVID-A 5.25 5.64 4.78 3.63 5.03 4.74 5.05 5.14 5.59 6.00 5.00 5.67
CCMiniVID-R 5.23 5.60 4.78 3.89 4.94 4.79 4.97 5.19 5.57 5.92 5.01 5.59
Table 5: Online age estimation tMAE and tSTD for our image-based age estimation model from (Bešenić et al., 2022) on the
three versions of CCMiniVID benchmark data according to the online video-based evaluation protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
CCMiniVID-O 6.19±3.25 6.77±3.73 5.47±2.66 4.77±2.61 6.19±3.44 5.64±2.98 5.84±2.99 5.98±3.01 6.63±3.60 7.04±3.71 5.97±3.13 6.56±3.46
CCMiniVID-A 6.10±3.12 6.68±3.58 5.39±2.54 4.43±2.54 6.03±3.33 5.53±2.84 5.78±2.88 5.84±2.84 6.58±3.44 6.98±3.57 5.85±3.02 6.52±3.29
CCMiniVID-R 6.17±3.25 6.75±3.73 5.45±2.66 4.83±2.61 6.02±3.44 5.64±2.98 5.79±2.98 6.01±3.02 6.64±3.60 7.01±3.71 5.94±3.13 6.54±3.46
Figure 3: Age estimation MAE of our 2D CNN image-based age estimation model from (Bešenić et al., 2022) and
the proposed video-based model (2D CNN + TCN4) in relation to video sequence duration on the CCMiniVID-A
offline video-based benchmark protocol.
since it contains all subjects while all video subse-
quences are of the exact same duration. The majority
of video-based methods rely on frame subsampling
techniques to avoid redundant processing of nearly
identical neighboring frames and to reduce compu-
tational complexity. We set the subsampling step to
6, following various video processing methods (Gao
et al., 2017; Xu et al., 2019; Eun et al., 2021; Xu
et al., 2021; Chen et al., 2022). We calculate age es-
timation MAE for video subsequences lasting from a
single frame (0 seconds) to 20 seconds. The results
are presented in Figure 3, denoted as 2D CNN.
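The duration sweep itself can be sketched as follows; this is a simplified illustration assuming 30 FPS source videos, the subsampling step of 6, and the median aggregation described above.

```python
import numpy as np

FPS, STEP = 30, 6  # source frame rate and temporal subsampling step

def mae_vs_duration(frame_estimates_per_video, true_ages, durations_s):
    """MAE when only the first `d` seconds of each video are used."""
    results = {}
    for d in durations_s:
        n_frames = max(1, int(d * FPS))  # 0 s corresponds to a single frame
        errors = []
        for est, age in zip(frame_estimates_per_video, true_ages):
            window = np.asarray(est[:n_frames])[::STEP]  # keep every 6th frame
            errors.append(abs(float(np.median(window)) - age))
        results[d] = float(np.mean(errors))
    return results

# Example: durations from a single frame (0 s) up to the full 20 s sequences.
# mae_curve = mae_vs_duration(estimates, ages, durations_s=[0, 1, 5, 10, 15, 20])
```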
By using video sequences of 1 second, the single-
frame-based estimation error is reduced by 5.59%.
By using 5 seconds, the error is reduced by 10.29%.
Increasing sequence duration from 5 to 15 seconds
reduces the error by an additional 2.67 percentage
points. However, increasing the sequence duration
from 15 to 20 seconds results in almost no improve-
ment. Using the full sequences reduces the estima-
tion error by 13.35%, compared to the single-frame
approach. The results demonstrate that taking a bet-
ter look is very useful for age estimation, even with a
basic image-based estimation method and simple me-
dian aggregation of the frame-level results.
5.2 Cherry-Picking Based on Tracking
Data
Cherry-picking is a popular name for a technique
that selects the best or most suitable elements
from a set. A frontal-face cherry-picking
approach was proposed by (Zhang and Bao, 2022),
where a fixed threshold was used on the sum of abso-
lute head-pose angles. To evaluate the cherry-picking
approach, we can utilize the produced frame-level
tracking data described in Section 3.2 (i.e., head-pose
angles, tracking quality, and face scale). Using a fixed
threshold is not suitable for our evaluation protocol
since it might cause biases in video-based or even
subject-based label distributions. For example, there
might be more female subjects with bad lighting in the
non-frontal subset. To mitigate this issue, we extract
three equally sized subsets from each of the bench-
mark videos. As a baseline, we extract the chronolog-
ically first 50% of the video frames. We also extract
the best and the worst 50% of the video frames based
on a certain criterion. This ensures that all videos and
all subjects are used in all three subsets and that the
subsets are of equal size. Results of this experiment
are presented in Table 6.
Table 6: Age estimation MAE of our image-based age
estimation model from (Bešenić et al., 2022) on the
CCMiniVID-A benchmark data for frame cherry-picking
approaches based on different face tracking data.
Criterion Yaw Pitch Roll Tracking quality Face scale
First 50% 5.29 5.29 5.29 5.29 5.29
Best 50% 5.32 5.26 5.29 5.27 5.27
Worst 50% 5.26 5.32 5.26 5.30 5.31
Contrary to the findings in (Zhang and Bao, 2022),
head-pose angles seem to have negligible influence
on age estimation error in our experiments. The used
image-based age estimation model was trained on the
B3FD dataset (Bešenić et al., 2022), a very large in-
the-wild age estimation dataset with unconstrained
head poses. The model was also trained with data
augmentations to make it robust to tracking instabil-
ities and low face resolution, further explaining why
filtering based on auxiliary tracking data did not result
in significant improvements.
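A minimal sketch of the subset extraction used in this experiment, assuming per-frame criterion values (e.g., absolute yaw or tracking quality) are available from the tracking metadata:

```python
import numpy as np

def cherry_pick_subsets(criterion: np.ndarray, lower_is_better: bool = True):
    """Split a video's frame indices into three equally sized 50% subsets:
    chronologically first, best, and worst according to `criterion`."""
    n = len(criterion)
    half = n // 2
    first = np.arange(half)               # baseline: first 50% of frames
    order = np.argsort(criterion)         # ascending criterion values
    if not lower_is_better:
        order = order[::-1]
    best, worst = np.sort(order[:half]), np.sort(order[-half:])
    return first, best, worst

# Example: frontal-face cherry-picking based on the absolute yaw angle per frame.
# first_idx, best_idx, worst_idx = cherry_pick_subsets(np.abs(yaw_per_frame))
```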
5.3 Video-Based Age Estimation
Method
The main obstacle to training an age estimation model
that leverages facial dynamics and temporal informa-
tion from videos is the lack of video data with age la-
bels. As mentioned in our data review in Section 2.2,
there are no publicly accessible video datasets that
permit training of age models. As reviewed in Sec-
tion 2.1, some researchers resorted to a manual col-
lection of very small video datasets (e.g., only one or
two subjects). Our opinion is that this is not enough to
train a reliable model with generalization capabilities.
To deal with the lack of annotated video data that
restricts usability of fully supervised learning, we
explore pseudo-labeling and semi-supervised learn-
ing. Whereas supervised learning requires labels for
all training samples, semi-supervised learning algo-
rithms aim at improving their performance by utiliz-
ing unlabeled data (Oliver et al., 2018). One approach
for making use of unlabeled data is generating pseudo
labels. Pseudo labels are weak labels generated by the
model itself and subsequently used to further train the
model (Hu et al., 2021). In our setup, we leverage
the fact that subject’s age does not change during a
single video recording. We rely on an image-based
age estimation model and apply it to every frame of
an unlabeled facial video dataset. The frame-level re-
sults are aggregated to get video-level pseudo labels.
These pseudo labels can then be used to supervise
learning of spatio-temporal models. In case of end-to-
end training, the image-based backbone used to gen-
erate the pseudo-labels is also further optimized and
improved.
Figure 4: The proposed video age estimation method based on a 2D CNN feature extractor and a 4-layer TCN temporal model (4L TCN), followed by a fully connected layer (FC) for dimensionality reduction. The model takes 16 input video frames (VID), denoted as x_t to x_{t-15}, to produce an age estimation probability vector y_t.
5.3.1 Pseudo-Labeled Training Data
The main advantage of the proposed approach is that
any source of unlabeled facial videos can be used for
model training. We combine two large sources of
raw videos. The video portion of the IJB-C dataset
(Maze et al., 2018) is selected since the dataset’s gen-
eral statistics indicate good age distribution. CelebV-
HQ dataset (Zhu et al., 2022) is selected for its size
and good distribution with respect to facial expres-
sions, appearance attributes, and actions. The raw
videos were once again processed with the commer-
cial face tracking system and frame-level estimations
were produced with the image-based age estimation
model from Section 4.1.1. The median was used
to aggregate frame-level estimations into video-level
pseudo labels. We filtered the video data with respect
to face scale, tracking quality, and sequence duration.
The produced set consists of 28,619 videos with a to-
tal of 7,535,299 frames, averaging 263 continuously
tracked frames per video. Pseudo-label distribution
ranges from 6 to 89 years.
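The pseudo-labeling step can be sketched as follows; the frame-level estimator interface, the tracking-quality fields, and the thresholds are illustrative placeholders rather than our exact filtering configuration.

```python
import numpy as np

MIN_FACE_SCALE, MIN_QUALITY, MIN_FRAMES = 64, 0.5, 32   # illustrative thresholds

def video_pseudo_label(frame_crops, frame_scales, frame_qualities, estimate_age):
    """Aggregate frame-level estimates of an unlabeled video into a single
    video-level pseudo age label, or return None if the video is filtered out."""
    keep = [i for i, (s, q) in enumerate(zip(frame_scales, frame_qualities))
            if s >= MIN_FACE_SCALE and q >= MIN_QUALITY]
    if len(keep) < MIN_FRAMES:              # filter by duration / tracking quality
        return None
    estimates = [estimate_age(frame_crops[i]) for i in keep]  # image-based model
    return float(np.median(estimates))      # median as the video-level pseudo label
```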
5.3.2 Semi-Supervised Video-Based Method
In Section 4.2.1 we demonstrated that offline me-
dian estimation across video sequences can be much
more precise than frame-level predictions. Our goal
is to use the median pseudo labels to train a temporal
model that will be able to replicate that performance
boost, but in an online fashion and based on a much
shorter time window.
Table 7: Offline age estimation MAE for our 2D CNN image-based age estimation model (Bešenić et al., 2022) and the
proposed video-based age estimation models (2D CNN + TCN) on the CCMiniVID-A benchmark data according to the
offline video-based evaluation protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
2D CNN 5.25 5.64 4.78 3.63 5.03 4.74 5.05 5.14 5.59 6.00 5.00 5.67
2D CNN + TCN3 5.07 5.39 4.68 3.72 4.98 4.64 4.81 5.03 5.40 5.72 4.78 5.57
2D CNN + TCN4 4.95 5.25 4.58 3.81 4.71 4.49 4.76 4.83 5.27 5.61 4.76 5.27
Table 8: Online age estimation tMAE and tSTD for our 2D CNN image-based age estimation model (Bešenić et al., 2022)
and the proposed video-based age estimation models (2D CNN + TCN) on the CCMiniVID-A benchmark data according to
the online video-based evaluation protocol.
Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
2D CNN 6.10±3.12 6.68±3.58 5.39±2.54 4.43±2.54 6.03±3.33 5.53±2.84 5.78±2.88 5.84±2.84 6.58±3.44 6.98±3.57 5.85±3.02 6.52±3.29
2D CNN + TCN3 5.36±1.79 5.75±2.07 4.88±1.44 4.12±1.70 5.33±1.82 4.91±1.64 5.07±1.68 5.23±1.64 5.73±1.98 6.05±2.03 5.06±1.71 5.85±1.93
2D CNN + TCN4 5.16±1.51 5.52±1.74 4.71±1.23 4.02±1.40 4.97±1.53 4.69±1.40 4.94±1.42 5.00±1.37 5.52±1.67 5.85±1.70 4.96±1.44 5.49±1.62
Our method follows the basic design principle
of many video-processing methods (De Geest et al.,
2016; De Geest and Tuytelaars, 2018; Xu et al., 2019;
Kim et al., 2021; Wang et al., 2022), including some
previously reviewed video age estimation methods
(Pei et al., 2019; Ji et al., 2018), meaning that the
model consists of an image-based 2D CNN feature
extractor and a temporal model that learns to aggre-
gate frame-level features in an optimal way.
For feature extraction, we once again rely on our
2D CNN model for image-based age estimation from
Section 4.1.1. The feature extractor backbone is
stripped of its last age classification layer to produce
features of 512 elements. Motivated by the results
of (Bai et al., 2018), we base our temporal model
on their implementation of Temporal Convolutional
Network (TCN)³. The proposed TCN can be parame-
terized with respect to number of layers, kernel size,
and number of input and output channels. We set the
kernel size to 2 and parameterize the TCN to input
and output feature vectors of size 512, matching the
chosen feature extractor output. The TCN’s output
is passed to a fully connected layer that translates the
temporal feature vectors to age estimations. The num-
ber of layers determines the network’s receptive field
and therefore the time window size used in online pro-
cessing. We experiment with 3 and 4 layers, resulting
in receptive fields of 8 and 16 frames, respectively.
An overview of the method is presented in Fig-
ure 4. The figure illustrates the method’s processing
pipeline for a 4-layer TCN with a receptive field of
16 frames. The method predicts a softmax probability vector y_t of size 1 × 101 (mapping to ages from 0
to 100) based on the current frame face crop x_t and 15
previous face crops (x_{t-1} to x_{t-15}). The weighted sum
of the softmax probabilities is used as the final age
prediction, according to the DLDLv2 method (Gao
et al., 2018).

³ github.com/locuslab/TCN/blob/master/TCN/tcn.py
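A simplified PyTorch sketch of this architecture is given below. The ResNet-18 trunk is only a stand-in for the MobileFaceNet-based extractor, and the plain causal dilated-convolution stack is a simplification of the referenced TCN implementation; with kernel size 2 and 4 layers it yields the stated receptive field of 16 frames.

```python
import torch
import torch.nn as nn
import torchvision

class VideoAgeEstimator(nn.Module):
    """2D CNN frame feature extractor + causal TCN + FC head over 101 age bins."""

    def __init__(self, num_layers: int = 4, channels: int = 512, kernel: int = 2):
        super().__init__()
        # Stand-in backbone producing 512-d per-frame features (MobileFaceNet in the paper).
        backbone = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # -> [B*T, 512, 1, 1]
        # Causal dilated 1D convolutions; dilations 1, 2, 4, ... give a
        # receptive field of 2**num_layers frames for kernel size 2.
        self.tcn = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            self.pads.append((kernel - 1) * dilation)  # left padding => causal
            self.tcn.append(nn.Conv1d(channels, channels, kernel, dilation=dilation))
        self.act = nn.ReLU()
        self.fc = nn.Linear(channels, 101)  # age distribution over 0..100

    def forward(self, clips: torch.Tensor):
        # clips: [B, T, 3, 128, 128] -> per-frame features arranged as [B, 512, T]
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).flatten(1)      # [B*T, 512]
        feats = feats.view(b, t, -1).transpose(1, 2)               # [B, 512, T]
        for pad, conv in zip(self.pads, self.tcn):
            feats = self.act(conv(nn.functional.pad(feats, (pad, 0))))
        logits = self.fc(feats[:, :, -1])                          # last time step only
        probs = logits.softmax(dim=-1)
        ages = torch.arange(101, device=logits.device, dtype=logits.dtype)
        return probs, (probs * ages).sum(dim=-1)                   # distribution, expected age
```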
We train the proposed model in an end-to-end
manner, meaning that we jointly optimize both the
image-pretrained feature extractor and the randomly
initialized TCN. The training data is divided into
training and validation subsets with a random 80:20
split. The model is trained with the DLDLv2 age esti-
mation method and Adam optimization (Kingma and
Ba, 2014), using a learning rate of 10^-6 and weight
decay of 10^-3. We adopt an early stopping approach,
where training is ended when the validation subset
MAE plateaus.
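Assuming the DLDLv2 objective combines a KL-divergence term on the label distribution with an L1 term on its expectation (Gao et al., 2018), the training setup can be sketched as follows; the Gaussian label width and the equal loss weighting are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def dldl_v2_loss(probs, expected, target_age, sigma=2.0):
    """KL divergence to a Gaussian label distribution plus L1 on its expectation.
    target_age: float tensor of shape [B] holding the (pseudo) age labels."""
    ages = torch.arange(101, device=probs.device, dtype=probs.dtype)
    gauss = torch.exp(-0.5 * ((ages[None] - target_age[:, None]) / sigma) ** 2)
    target = gauss / gauss.sum(dim=-1, keepdim=True)
    kl = F.kl_div(probs.clamp_min(1e-8).log(), target, reduction="batchmean")
    return kl + F.l1_loss(expected, target_age)

# model = VideoAgeEstimator(num_layers=4)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-3)
# Training iterates over 16-frame clips with video-level pseudo labels as targets
# and stops early once the validation MAE plateaus.
```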
5.3.3 Video-Based Benchmark Results
The results in Tables 7 and 8 show that we outper-
formed image-based baseline according to both of-
fline and online evaluation protocols, even though our
model trainings were supervised with pseudo labels
generated by that exact baseline model. Figure 3
presents consistent improvements of the 4-layer TCN
model (TCN4) on the offline protocol in relation to se-
quence duration. Significant improvements can also
be observed in the online protocol results in Table 8,
where the TCN4 overall tMAE is reduced by 15.41%,
while tSTD is reduced by an even larger margin of
51.60%.
The TCN3 model uses a time window of size 8,
which amounts to just 1.6 seconds of video data un-
der the proposed frame subsampling setting. By us-
ing the first 1.6 seconds of each video, TCN3 can
achieve MAE of 5.32 with a single online inference
pass, compared to 5.52 in the case of the aggregated
image-based results. Image-based result aggregation
on the full 20-second videos gives MAE of 5.25. By
using a single online TCN4 inference pass on the first
3.2 seconds of each video, we even outperform the
full-video results with MAE of 5.15, simultaneously
achieving 84% shorter estimation time and improved
estimation accuracy.
6 CONCLUSIONS
Aligned with our initial premise for this work, tak-
ing a better look at video sequences significantly im-
proves age estimation, compared to the single-image
approach. Our evaluation also confirmed previous
findings regarding inconsistent performance with re-
spect to gender, skin tone, and lighting type, high-
lighting the issue of demographic biases in ML. Con-
trary to the findings of previous work, our experi-
ments showed that the correlation between age esti-
mation error of contemporary age estimation models
and auxiliary face tracking data (i.e., head-pose an-
gles, tracking quality, and face scale) is negligible,
making the frame cherry-picking method ineffective.
The proposed video age estimation method, based on
pseudo-labeling and semi-supervised learning, over-
comes the lack of available annotated training data
and improves age estimation accuracy according to
both offline and online evaluation protocols, while
estimation stability is improved by more than 50%.
We hope that our carefully designed and publicly
available benchmark protocol, along with the baseline
video age estimation results, encourages and supports
further research on the topic of video-based age esti-
mation, as the potential of this underexplored field of
research is clearly demonstrated.
ACKNOWLEDGEMENTS
This research is funded by Visage Technologies AB.
REFERENCES
Agbo-Ajala, O. and Viriri, S. (2021). Deep learning
approach for facial age classification: a survey of
the state-of-the-art. Artificial Intelligence Review,
54:179–213.
Al-Shannaq, A. S. and Elrefaei, L. A. (2019). Comprehen-
sive analysis of the literature for age estimation from
facial images. IEEE Access, 7:93229–93249.
Alnajar, F., Shan, C., Gevers, T., and Geusebroek, J.-
M. (2012). Learning-based encoding with soft as-
signment for age estimation under unconstrained
imaging conditions. Image and Vision Computing,
30(12):946–953.
Angulu, R., Tapamo, J. R., and Adewumi, A. O. (2018).
Age estimation via face images: a survey. EURASIP
Journal on Image and Video Processing, 2018(1):1–
35.
Bai, S., Kolter, J. Z., and Koltun, V. (2018). An em-
pirical evaluation of generic convolutional and recur-
rent networks for sequence modeling. arXiv preprint
arXiv:1803.01271.
Bassili, J. N. (1979). Emotion recognition: the role of facial
movement and the relative importance of upper and
lower areas of the face. Journal of personality and
social psychology, 37(11):2049.
Bešenić, K., Ahlberg, J., and Pandžić, I. S. (2022). Picking
out the bad apples: unsupervised biometric data filter-
ing for refined age estimation. The Visual Computer,
pages 1–19.
Chen, B.-C., Chen, C.-S., and Hsu, W. H. (2015). Face
recognition and retrieval using cross-age reference
coding with cross-age celebrity dataset. IEEE Trans-
actions on Multimedia, 17(6):804–815.
Chen, J., Mittal, G., Yu, Y., Kong, Y., and Chen, M. (2022).
Gatehub: Gated history unit with background sup-
pression for online action detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 19925–19934.
Chen, S., Liu, Y., Gao, X., and Han, Z. (2018). Mobile-
facenets: Efficient cnns for accurate real-time face
verification on mobile devices. In Biometric Recogni-
tion: 13th Chinese Conference, CCBR 2018, Urumqi,
China, August 11-12, 2018, Proceedings 13, pages
428–438. Springer.
De Geest, R., Gavves, E., Ghodrati, A., Li, Z., Snoek, C.,
and Tuytelaars, T. (2016). Online action detection. In
Computer Vision–ECCV 2016: 14th European Con-
ference, Amsterdam, The Netherlands, October 11-
14, 2016, Proceedings, Part V 14, pages 269–284.
Springer.
De Geest, R. and Tuytelaars, T. (2018). Modeling temporal
structure with lstm for online action detection. In 2018
IEEE Winter Conference on Applications of Computer
Vision (WACV), pages 1549–1557. IEEE.
Dibeklioğlu, H., Alnajar, F., Salah, A. A., and Gevers, T.
(2015). Combining facial dynamics with appearance
for age estimation. IEEE Transactions on Image Pro-
cessing, 24(6):1928–1943.
Dibeklioğlu, H., Gevers, T., Salah, A. A., and Valenti, R.
(2012a). A smile can reveal your age: Enabling fa-
cial dynamics in age estimation. In Proceedings of
the 20th ACM international conference on Multime-
dia, pages 209–218.
Dibeklioğlu, H., Salah, A. A., and Gevers, T. (2012b). Are
you really smiling at me? spontaneous versus posed
enjoyment smiles. In Computer Vision–ECCV 2012:
12th European Conference on Computer Vision, Flo-
rence, Italy, October 7-13, 2012, Proceedings, Part III
12, pages 525–538. Springer.
Eun, H., Moon, J., Park, J., Jung, C., and Kim, C. (2021).
Temporal filtering networks for online action detec-
tion. Pattern Recognition, 111:107695.
Fitzpatrick, T. B. (1975). Soleil et peau. J. Med. Esthet.,
2:33–34.
Gao, B.-B., Zhou, H.-Y., Wu, J., and Geng, X. (2018).
Age estimation using expectation of label distribution
learning. In IJCAI, pages 712–718.
Gao, J., Yang, Z., and Nevatia, R. (2017). Red: Rein-
forced encoder-decoder networks for action anticipa-
tion. arXiv preprint arXiv:1707.04818.
Guo, G., Mu, G., Fu, Y., and Huang, T. S. (2009). Hu-
man age estimation using bio-inspired features. In
2009 IEEE conference on computer vision and pattern
recognition, pages 112–119. IEEE.
Hadid, A. (2011). Analyzing facial behavioral features from
videos. In Human Behavior Understanding: Sec-
ond International Workshop, HBU 2011, Amsterdam,
The Netherlands, November 16, 2011. Proceedings 2,
pages 52–61. Springer.
Han, H., Otto, C., and Jain, A. K. (2013). Age estimation
from face images: Human vs. machine performance.
In 2013 international conference on biometrics (ICB),
pages 1–8. IEEE.
Han, J., Wang, W., Karaoglu, S., Zeng, W., and Gevers, T.
(2021). Pose invariant age estimation of face images
in the wild. Computer Vision and Image Understand-
ing, 202:103123.
Hazirbas, C., Bitton, J., Dolhansky, B., Pan, J., Gordo, A.,
and Ferrer, C. C. (2021). Towards measuring fairness
in ai: the casual conversations dataset. IEEE Trans-
actions on Biometrics, Behavior, and Identity Science,
4(3):324–332.
Hill, H. and Johnston, A. (2001). Categorizing sex and iden-
tity from the biological motion of faces. Current biol-
ogy, 11(11):880–885.
Hu, Z., Yang, Z., Hu, X., and Nevatia, R. (2021). Simple:
Similar pseudo label exploitation for semi-supervised
classification. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 15099–15108.
Ji, Z., Lang, C., Li, K., and Xing, J. (2018). Deep age es-
timation model stabilization from images to videos.
In 2018 24th International Conference on Pattern
Recognition (ICPR), pages 1420–1425. IEEE.
Kim, Y. H., Nam, S., and Kim, S. J. (2021). Temporally
smooth online action detection using cycle-consistent
future anticipation. Pattern Recognition, 116:107954.
King, D. E. (2009). Dlib-ml: A machine learning toolkit.
The Journal of Machine Learning Research, 10:1755–
1758.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Knight, B. and Johnston, A. (1997). The role of movement
in face recognition. Visual cognition, 4(3):265–273.
Lee, J.-H., Chan, Y.-M., Chen, T.-Y., and Chen, C.-S.
(2018). Joint estimation of age and gender from un-
constrained face images using lightweight multi-task
cnn for mobile applications. In 2018 IEEE conference
on multimedia information processing and retrieval
(MIPR), pages 162–165. IEEE.
Levi, G. and Hassner, T. (2015). Age and gender classifica-
tion using convolutional neural networks. In Proceed-
ings of the IEEE conference on computer vision and
pattern recognition workshops, pages 34–42.
Maze, B., Adams, J., Duncan, J. A., Kalka, N., Miller, T.,
Otto, C., Jain, A. K., Niggel, W. T., Anderson, J., Ch-
eney, J., et al. (2018). Iarpa janus benchmark-c: Face
dataset and protocol. In 2018 international conference
on biometrics (ICB), pages 158–165. IEEE.
Monk, E. P., Jr. (2014). Skin Tone Stratification
among Black Americans, 2001–2003. Social Forces,
92(4):1313–1337.
Niu, Z., Zhou, M., Wang, L., Gao, X., and Hua, G. (2016).
Ordinal regression with multiple output cnn for age
estimation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4920–
4928.
Ojala, T., Pietikäinen, M., and Harwood, D. (1996). A com-
parative study of texture measures with classification
based on featured distributions. Pattern recognition,
29(1):51–59.
Ojala, T., Pietikainen, M., and Maenpaa, T. (2002). Mul-
tiresolution gray-scale and rotation invariant texture
classification with local binary patterns. IEEE Trans-
actions on pattern analysis and machine intelligence,
24(7):971–987.
Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and
Goodfellow, I. (2018). Realistic evaluation of deep
semi-supervised learning algorithms. Advances in
neural information processing systems, 31.
Othmani, A., Taleb, A. R., Abdelkawy, H., and Hadid, A.
(2020). Age estimation from faces using deep learn-
ing: A comparative analysis. Computer Vision and
Image Understanding, 196:102961.
O’Toole, A. J., Roark, D. A., and Abdi, H. (2002). Rec-
ognizing moving faces: A psychological and neural
synthesis. Trends in cognitive sciences, 6(6):261–266.
Panis, G., Lanitis, A., Tsapatsoulis, N., and Cootes, T. F.
(2016). Overview of research on facial ageing using
the fg-net ageing database. Iet Biometrics, 5(2):37–
46.
Pei, W., Dibeklioğlu, H., Baltrušaitis, T., and Tax, D. M.
(2019). Attended end-to-end architecture for age esti-
mation from facial expression videos. IEEE Transac-
tions on Image Processing, 29:1972–1984.
Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R.,
Geselowitz, A., Greer, T., ter Haar Romeny, B., Zim-
merman, J. B., and Zuiderveld, K. (1987). Adaptive
histogram equalization and its variations. Computer
vision, graphics, and image processing, 39(3):355–
368.
Porgali, B., Albiero, V., Ryda, J., Ferrer, C. C., and Hazir-
bas, C. (2023). The casual conversations v2 dataset.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 10–17.
Ricanek, K. and Tesafaye, T. (2006). Morph: A longitudi-
nal image database of normal adult age-progression.
In 7th international conference on automatic face and
gesture recognition (FGR06), pages 341–345. IEEE.
Ruiz, N., Chong, E., and Rehg, J. M. (2018). Fine-grained
head pose estimation without keypoints. In Proceed-
ings of the IEEE conference on computer vision and
pattern recognition workshops, pages 2074–2083.
Serengil, S. I. and Ozpinar, A. (2020). Lightface: A hy-
brid deep face recognition framework. In 2020 Inno-
vations in Intelligent Systems and Applications Con-
ference (ASYU), pages 1–5. IEEE.
Shen, W., Guo, Y., Wang, Y., Zhao, K., Wang, B., and
Yuille, A. L. (2018). Deep regression forests for age
estimation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2304–
2313.
Wang, W., Peng, X., Qiao, Y., and Cheng, J. (2022). An em-
pirical study on temporal modeling for online action
detection. Complex & Intelligent Systems, 8(2):1803–
1817.
Xu, M., Gao, M., Chen, Y.-T., Davis, L. S., and Crandall,
D. J. (2019). Temporal recurrent networks for online
action detection. In IEEE International Conference on
Computer Vision (ICCV).
Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., and
Soatto, S. (2021). Long short-term transformer for on-
line action detection. Advances in Neural Information
Processing Systems, 34:1086–1099.
Zhang, B. and Bao, Y. (2022). Age estimation of faces in
videos using head pose estimation and convolutional
neural networks. Sensors, 22(11):4171.
Zhao, G. and Pietikainen, M. (2007). Dynamic texture
recognition using local binary patterns with an ap-
plication to facial expressions. IEEE transactions on
pattern analysis and machine intelligence, 29(6):915–
928.
Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L.,
Liu, Z., and Loy, C. C. (2022). CelebV-HQ: A large-
scale video facial attributes dataset. In ECCV.
Zhu, X., Lei, Z., Liu, X., Shi, H., and Li, S. Z. (2016). Face
alignment across large poses: A 3d solution. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 146–155.
APPENDIX A: CCMiniIMG
Metadata Reproduction
As discussed in Sections 3.1 and 4.1, the authors of
(Hazirbas et al., 2021) extracted 100 frames from
each of the 6,022 videos and detected faces with the
DLIB face detector (King, 2009). The authors pro-
vide only video-level metadata and raw frames. Face
detection metadata was not provided, and neither face
detection setup nor frame sampling algorithm were
specified.
To reproduce the evaluation protocol, we first pro-
cess CCMiniIMG raw frames following the limited
information available in (Hazirbas et al., 2021) and
explore DLIB’s face detection options. DLIB offers
two types of face detectors: HOG-based and CNN-
based. The HOG-based face detector is very light,
but it is not able to detect many of the challenging
samples from this dataset, therefore we choose the
CNN-based detector. We experiment with DLIB’s
sole face detection parameter (i.e., upsampling factor)
and CLAHE image enhancement (Pizer et al., 1987)
with different tile sizes and clip limits to push the de-
tection rate to 99.82%. No protocol for handling mul-
tiple detections was specified, so in each frame, we
select the detection with the highest detection confi-
dence. 62 of the 3,011 subjects did not provide age
labels, eliminating 12,400 images from the test set.
This results in 588,720 video frames with success-
fully detected faces and valid age labels. We proceed
by extracting face crops with DLIB’s affine alignment
algorithm which uses 5 face landmarks detected by
DLIB’s shape predictor. Face crops were extracted
with 50% context and resized to 256×256 pixels.
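A sketch of this processing step is given below, assuming the standard DLIB CNN detector and 5-point shape predictor model files are available locally; the file names, CLAHE parameters, and the padding value used to approximate 50% context are illustrative rather than our exact settings.

```python
import cv2
import dlib
import numpy as np

# Assumed local model files for the DLIB CNN detector and 5-point shape predictor.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # illustrative settings

def extract_face_crop(bgr_frame: np.ndarray, upsample: int = 1):
    """Detect the most confident face, align with 5 landmarks, crop with context."""
    lab = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2LAB)
    lab[:, :, 0] = clahe.apply(lab[:, :, 0])                  # enhance luminance only
    rgb = cv2.cvtColor(cv2.cvtColor(lab, cv2.COLOR_LAB2BGR), cv2.COLOR_BGR2RGB)
    detections = detector(rgb, upsample)
    if not detections:
        return None
    best = max(detections, key=lambda d: d.confidence)        # highest-confidence face
    shape = predictor(rgb, best.rect)
    return dlib.get_face_chip(rgb, shape, size=256, padding=0.5)
```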
To enable easy reproduction of this evaluation
protocol, we make DLIB’s detection metadata and
CCMiniIMG frame processing scripts publicly avail-
able⁴ and encourage authors to utilize them in their work
to avoid inconsistencies.
APPENDIX B: CCMiniVID
Explorative Analysis
All three versions of the CCMiniVID are based on
videos from the same 3,011 subjects, with two videos
per subject selected from CC to create a balanced
subset. Videos of 62 subjects that did not provide
age information are not suitable for an age estima-
tion benchmark. That leaves us with 5,898 videos
of 2,949 unique subjects used in CCMiniVID bench-
mark datasets.
The original CC metadata provides some subject-
level labels (i.e., age, gender, and skin tone) and
video-level labels for lighting (i.e., bright or dark).
The authors encourage users to extend the annotations
on their dataset (Hazirbas et al., 2021). To this ex-
tent, we processed the dataset with a commercial face
tracking system⁵ and filtered the results by automat-
ically processing the tracking data and manually val-
⁴ https://github.com/kbesenic/CCMiniVID
⁵ https://visagetechnologies.com/facetrack/
idating edge cases. The tracking data consists of 75
facial landmark points, apparent pitch, yaw, and roll
head rotation, and tracking quality for each frame.
CCMiniVID-O label distributions for labels from
the original CC metadata are presented in Section 3.2.
We can see that gender and lighting label distributions
are fairly balanced, with some reasonable underrepre-
sentations in the skin tone distribution. CCMiniVID-
A shares the same distribution for the subject-level
labels (i.e., age, gender, and skin tone), but there is
a minor difference in the lighting distribution caused
by the selection of alternative videos (i.e., 62.67%
bright and 37.33% dark). Videos of 10 subjects were
dropped in the reduced CCMiniVID-R version, caus-
ing a very minor variation in all distributions.
Figure 5: CCMiniVID-O head-pose angle distributions,
based on the produced face tracking data.
Figure 6: CCMiniVID-O face scale distribution, based on
the produced face tracking data.
Figure 7: CCMiniVID-O tracking quality distribution,
based on the produced face tracking data.
Figures 5, 6 and 7 present distribution for rele-
vant tracking data from our extended CCMiniVID-
O metadata. Head-pose angle distributions show that
this conversational dataset is oriented towards frontal
and near-frontal faces, with some occurrences of ex-
treme head poses. The face scale distribution, where
the face scale is the width or height of a square face
bounding box in pixels, shows that the provided high-
quality videos contain faces that are mostly in the 100
to 400 pixel range. The tracking quality distribution
indicates continuous face tracking with high confi-
dence was sustained on a large majority of the videos.