Multitask Metamodel for Keypoint Visibility Prediction in Human Pose
Estimation
Romain Guesdon, Carlos Crispim-Junior and Laure Tougne
Univ Lyon, Lyon 2, LIRIS UMR 5205
Lyon, France, F-69676
Keywords:
Neural Networks, Human Pose Estimation, Keypoint Visibility Prediction.
Abstract:
The task of human pose estimation (HPE) aims to predict the coordinates of body keypoints in images. Even though high performance is achieved on HPE nowadays, some difficulties remain to be fully overcome. For instance, a strong occlusion can deceive the methods and make them predict false-positive keypoints with high confidence. This can be problematic in applications that require reliable detection, such as posture analysis in car-safety applications. Despite this difficulty, current HPE solutions are designed to always predict coordinates for each keypoint. To address this problem, we propose a new metamodel that predicts both keypoint coordinates and their visibility. Visibility is an attribute that indicates whether a keypoint is visible, non-visible, or not labeled. Our model is composed of three modules: the feature extraction, the coordinate estimation, and the visibility prediction modules. We study in this paper the performance of the visibility predictions and the impact of this task on the coordinate estimation. Baseline results are provided on the COCO dataset. Moreover, to measure the performance of this method in a more occluded context, we also use the driver dataset DriPE. Finally, we implement the proposed metamodel with several base models to demonstrate the generality of our metamodel.
1 INTRODUCTION
Human Pose Estimation (HPE) is the task that aims
to locate body keypoints on images. These keypoints
can be body joints (shoulders, elbows, hips, ankles,
etc.) or facial markers (eyes, ears, nose). Additional
keypoints on the face, hands or feet are sometimes
used (Hidalgo et al., 2019; Cao et al., 2019).
One of the difficulties of HPE is handling keypoint occlusion. Even if recent solutions have been able to reach high performance, state-of-the-art datasets contain many pictures with little occlusion, especially pictures presenting a single person (Andriluka
et al., 2014; Lin et al., 2015). In contrast, in some
specific contexts like crowds or narrow spaces, body
parts have a high probability of being occluded or get-
ting out of the field of view.
Strong occlusion can lead the network to predict
with high confidence keypoints that are not anno-
tated, as we can see in Figure 1. Furthermore, the
networks may predict many false-positive keypoints
(Guesdon et al., 2021), which can be problematic in
applications where reliable predictions with signifi-
cant precision are required, e.g., for action recognition
or driver’s posture analysis (Das et al., 2017; Zhao
et al., 2020b). Despite the difficulty caused by occlusion, current HPE networks are designed to predict coordinates for every keypoint during inference, even if the keypoint is outside of the image. Networks usually predict a confidence score; however, it covers the confidence of both the presence and the coordinates of the keypoint. Therefore, this score cannot be used to reliably identify keypoints that the network should consider absent from the image.
State-of-the-art datasets provide visibility labels,
an attribute that depicts the perceptibility of each key-
point. A labeled keypoint can be visible, or non-
visible when the keypoint is lightly occluded but with
enough information to be located. If the keypoint is
heavily occluded or out of the field of view, it is not
labeled. However, state-of-the-art networks do not
consider these visibility labels. Furthermore, the few
existing methods using visibility only consider binary
visibility, i.e., labeled or non-labeled keypoints (Stoffl et al., 2021; Kumar et al., 2020).
This paper proposes a novel HPE metamodel¹ that
can predict both the visibility and the coordinates of the keypoints. Our solution can be implemented with most deep-learning HPE methods and allows these base models to predict keypoint visibility. The model can predict the three classes of labels, which provides a finer description of the keypoint visibility.

¹ Source code is publicly available at: https://gitlab.liris.cnrs.fr/aura autobehave/vis-pred
Figure 1: HPE prediction. Red points represent false positives, i.e., keypoints that were predicted even if not annotated due to strong occlusion. Confidence scores (0.653, 0.699, 0.848) are provided in the boxes (maximum score = 1.0).
This paper is organized as follows. We present in
Section 2 the related work on human pose estimation
and visibility prediction. Section 3 presents our meta-
model and its detailed architecture, especially the vis-
ibility module. We describe in Section 4 the details
about the experiments, and present the results in Sec-
tion 5. Finally, we discuss in Section 6 our conclu-
sions and future work.
2 RELATED WORK
This section presents existing work on human pose estimation and keypoint visibility prediction.
The task of human pose estimation is divided into
two categories. Single-person HPE focuses on detection in pictures presenting one person, as opposed to multiperson detection. The first approach to
solve single-person HPE using deep learning was pro-
posed in (Toshev and Szegedy, 2014). This solution is
based on the deep architecture AlexNet (Krizhevsky
et al., 2012), which is used to estimate and refine
the coordinates. An Iterative Error Feedback net-
work was proposed in (Carreira et al., 2016) based
on the convolutional network GoogLeNet (Szegedy
et al., 2015). The authors of (Sun et al., 2017) used
ResNet50 (He et al., 2016) to predict a parametrized
bones representation. However, all these methods try to directly predict the keypoint coordinates from the images, which affects their robustness due to the high non-linearity of this approach. Other solutions, categorized as detection-based methods, aim to predict 2D matrices called heatmaps, where each pixel represents the probability that a joint is located at that position. The work of (Newell et al., 2016) pro-
posed an hourglass module that can be stacked to pre-
dict and refine features at several scales, which has in-
spired many other works (Chu et al., 2017; Ke et al.,
2018; Tang and Wu, 2019; Tang et al., 2018). Besides
hourglass architectures, other detection-based meth-
ods have been proposed. The architecture in (Chen
et al., 2017) combines a heatmap generator with two
discriminators. Simple Baseline (Xiao et al., 2018),
is an architecture based on the ResNet network (He
et al., 2016) with a deconvolution stage to generate
the final heatmaps. Finally, Unipose (Artacho and
Savakis, 2020) combines atrous and cascade convo-
lutions to produce a multi-scale representation.
In addition to finding the keypoints in the picture,
multiperson HPE brings a new difficulty: associating the detected keypoints with the different persons. State-
of-the-art performance is achieved by methods called
top-down approaches that first detect the subjects in
the picture and then locate the keypoints for each per-
son individually. These methods usually combine a
person detector with a single-person HPE architec-
ture (Xiao et al., 2018; Sun et al., 2019; Lin et al.,
2017; Cai et al., 2020; Li et al., 2019). Conversely,
the bottom-up approaches first detect every keypoint
in the image before associating them to form people
instances (Newell et al., 2017; Cao et al., 2017; Nie
et al., 2018). Top-down approaches tend to outper-
form the bottom-up methods while taking advantage
of both state-of-the-art person detectors and HPE ar-
chitectures.
Among top-down methods, the Simple Baseline
(SBl) network (Xiao et al., 2018) presents competi-
tive performance while preserving a small size, which
makes it practical for modifications and tests. In addi-
tion, it can be used for multiperson HPE by combin-
ing it with a person detector.
Recent work on human pose estimation has
mainly focused on improving the prediction of the
keypoints’ coordinates. Therefore, methods which
estimate the visibility of HPE keypoints are scarce.
In (Zhao et al., 2020a), visibility prediction is used
to propose a new evaluation method for multiperson
pose estimation in heavily occluded contexts. Visi-
bility is predicted as an occlusion score and is used
to compute a metric that highlights the performance
of the evaluated networks on occluded points. The
multi-instance HPE network in (Stoffl et al., 2021)
uses transformers to predict keypoint visibility, which
serves as a secondary task for end-to-end training.
Besides, keypoint visibility is predicted in (Kumar
et al., 2017; Kumar et al., 2020) as an auxiliary task for
face detection.

Figure 2: Architecture of our multitask metamodel for keypoint and visibility estimations. The base model consists of a feature extractor and a coordinate estimator; the visibility predictor takes the same feature vector as input and outputs one visibility label per keypoint, while the coordinate estimator outputs the keypoint coordinates.
However, prior works only predict binary visibil-
ity and do not take advantage of the three visibility
labels provided by the current datasets (visible, non-
visible, non-labeled). Furthermore, the authors pro-
vide few quantified results on the actual performance
of the visibility predictions. Finally, these works pro-
pose a fixed network where the visibility prediction
part is mostly ancillary. In this context, we propose a
metamodel that allows HPE methods to predict both
keypoint coordinates and ternary visibility.
3 PROPOSED METAMODEL
This section presents the architecture of the proposed
HPE visibility metamodel. First, we describe the
overall architecture. Then, we provide a more detailed
description of our visibility module.
3.1 Metamodel
The proposed architecture is split into three parts: the
feature extraction, the coordinate estimation, and the
visibility prediction modules. First, the feature ex-
traction module processes the input image to gen-
erate a feature vector. Examples of feature extractors are encoder architectures (Newell et al., 2016;
Tang and Wu, 2019; Artacho and Savakis, 2020; Li
et al., 2019), or image recognition backbones such
as ResNet (He et al., 2016) or EfficientNet (Tan and
Le, 2019). Then, the generated vector serves as the
input of the two other modules. Coordinate estima-
tion can be performed by modules such as decoders or
deconvolution stages, usually followed by a convolu-
tion layer which generates the final heatmaps (Newell
et al., 2016; Tang and Wu, 2019; Artacho and Savakis,
2020; Li et al., 2019). Final coordinate predictions
are computed from the location of the maximum of each heatmap.
The majority of HPE networks can be split into feature extraction and heatmap generation modules, which makes most architectures compatible with our metamodel.
In addition to these two regular modules, we add a
visibility branch (Figure 3). This module takes as in-
put the same feature vector as the coordinate estima-
tion module and outputs the visibility prediction for
each keypoint. The detailed architecture is presented
in the next section.
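As an illustration, this split can be sketched in a PyTorch-style pseudo-implementation (the class and attribute names below are illustrative assumptions, not the ones from our released code):

import torch.nn as nn

class VisibilityMetamodel(nn.Module):
    # Wraps a base HPE model split into a feature extractor and a coordinate
    # estimator, and adds the visibility prediction branch of Section 3.2.
    def __init__(self, feature_extractor, coordinate_estimator, visibility_predictor):
        super().__init__()
        self.feature_extractor = feature_extractor        # e.g., ResNet50 backbone
        self.coordinate_estimator = coordinate_estimator  # e.g., deconvolution stage + 1x1 conv
        self.visibility_predictor = visibility_predictor  # branch described in Section 3.2

    def forward(self, images):
        features = self.feature_extractor(images)         # shared feature vector
        heatmaps = self.coordinate_estimator(features)    # one heatmap per keypoint
        visibility = self.visibility_predictor(features)  # three-class scores per keypoint
        return heatmaps, visibility

Any base model that exposes such a two-stage split can be plugged into this wrapper without modifying its own layers.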
3.2 Visibility Branch
We model the visibility prediction problem as a classi-
fication task. We follow the COCO dataset formalism
and define the visibility using integer labels: 0 when
the keypoint is not labeled, 1 when it is labeled but
not visible, and 2 when it is fully visible. Therefore,
we associate one of the three labels with each keypoint.
The visibility module takes as input the feature vector
computed by the feature extraction module. It is com-
posed of a convolutional module, followed by a fully
connected network (FCN) that generates the final vis-
ibility predictions.
Figure 3: Architecture of our visibility predictor module (residual module, 1x1 convolution + max pooling, flatten, fully connected layer, final layer + SoftMax).

More precisely, a residual block (He et al., 2016) first processes the input features. This block is com-
posed of three successive convolution layers with re-
spective kernel sizes of 3x3, 1x1, and 3x3, which
form a bottleneck. An additional skip connection en-
ables the features to be directly propagated to the next
layer. We use this block in our branch since it has
shown good results in feature computation for HPE
(Newell et al., 2016; Tang and Wu, 2019). Then,
a convolutional layer of kernel size 1x1 with Batch-
Norm and 2x2 max pooling reduces the size and the
number of channels of the features. Finally, features
are flattened and a fully connected network with three
hidden layers (4096, 2048, and 1024 neurons) fol-
lowed by a SoftMax produces the predictions. Since
the COCO dataset provides 17 annotated keypoints
with three possible visibility classes, the output layer
is composed of 51 neurons. The SoftMax function
is applied to groups of three visibility neurons (one
group representing one keypoint).
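A possible implementation of this branch is sketched below, assuming a 2048-channel feature map (e.g., a ResNet50 backbone with a 256x192 input, giving 8x6 feature maps); the channel and spatial sizes are assumptions and would need to be adapted to the chosen feature extractor:

import torch
import torch.nn as nn

class VisibilityPredictor(nn.Module):
    def __init__(self, in_channels=2048, mid_channels=512, reduced_channels=256,
                 flattened_size=256 * 4 * 3, n_keypoints=17, n_classes=3):
        super().__init__()
        # Residual bottleneck: 3x3 -> 1x1 -> 3x3 convolutions, plus a skip connection.
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, in_channels, 3, padding=1),
        )
        # 1x1 convolution + BatchNorm + 2x2 max pooling to reduce channels and size.
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, reduced_channels, 1),
            nn.BatchNorm2d(reduced_channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Fully connected network with three hidden layers and a 51-neuron output layer.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flattened_size, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, n_keypoints * n_classes),
        )
        self.n_keypoints, self.n_classes = n_keypoints, n_classes

    def forward(self, features):
        x = features + self.residual(features)    # skip connection
        x = self.reduce(x)
        logits = self.head(x).view(-1, self.n_keypoints, self.n_classes)
        # SoftMax applied to each group of three visibility neurons (one group per keypoint).
        return torch.softmax(logits, dim=-1)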
3.3 Cost Function
The global cost function used to train the network is
defined as follows:

$L = (1 - \alpha) \cdot L_H + \alpha \cdot L_V \quad (1)$

where $L_H$ is an L2 distance between the predicted heatmaps and the ground truth. The ground-truth heatmaps are generated using a Gaussian centered around the location of each keypoint, with a standard deviation of 1 px.
The function $L_V$ is the cross-entropy loss applied to the predictions of the visibility classes. Weighted cross-entropy is used to compensate for the imbalanced distribution of keypoints within the three visibility classes. Therefore, the weights are computed as the size of the largest class divided by the size of each class. Finally, $\alpha$ is the parameter used to balance the ratio between the loss functions associated with the two tasks. It regulates the impact of each task on the training of the feature extractor weights.
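A sketch of this cost function, assuming the visibility branch outputs per-keypoint class probabilities as in the sketch above (shapes and helper names are assumptions):

import torch
import torch.nn.functional as F

def combined_loss(pred_heatmaps, gt_heatmaps, pred_visibility, gt_visibility,
                  class_weights, alpha=0.25):
    # L_H: L2 distance (mean squared error) between predicted and ground-truth heatmaps.
    loss_h = F.mse_loss(pred_heatmaps, gt_heatmaps)
    # L_V: weighted cross-entropy over the three visibility classes.
    # pred_visibility: (batch, n_keypoints, 3) probabilities;
    # gt_visibility: integer (long) labels in {0, 1, 2} of shape (batch, n_keypoints).
    log_probs = torch.log(pred_visibility.clamp(min=1e-8))
    loss_v = F.nll_loss(log_probs.flatten(0, 1), gt_visibility.flatten(), weight=class_weights)
    return (1 - alpha) * loss_h + alpha * loss_v

def visibility_class_weights(label_counts):
    # Weights: size of the largest class divided by the size of each class.
    counts = torch.tensor(label_counts, dtype=torch.float)
    return counts.max() / counts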
4 EXPERIMENTS
In this section, we provide details about how the experiments were carried out: the datasets used, training, network base models, and the evaluation procedure.
4.1 Datasets
We adopted two datasets for the experiments. First,
the COCO dataset (Lin et al., 2015), which is one
of the largest and most used datasets for 2D human
pose estimation in a general context. It is composed of
118k pictures for training and 5k for validation. How-
ever, because of the high number of pictures in this
dataset, the visibility annotations present some incon-
sistencies. Moreover, non-visible keypoints are underrepresented in the COCO dataset, accounting for only 7% of all keypoints. Therefore, we evaluated our ar-
chitecture on a second dataset called DriPE (Guesdon
et al., 2021). Figure 4 illustrates some samples. This
dataset contains 10k manually annotated images of drivers in consumer vehicles (7.4k images for training and 1.3k images each for validation and testing). The
car environment and the side view-angle of the cam-
eras produce strong occlusion which induces 19% of
non-visible keypoints.
Figure 4: Image samples from the DriPE dataset. Faces in the figure have been blurred for anonymity purposes.
4.2 Basic Training
Most of the results on our architecture are provided
using the Simple Baseline (SBl) network as the base
model (Xiao et al., 2018). This network combines
ResNet50 as feature extractor with a deconvolution
stage (as coordinate estimator) to generate the final
heatmaps. The feature extractor is initialized with
weights pre-trained on ImageNet. The networks are
trained on the COCO dataset for 140 epochs with a
learning rate of 1E-3, decreased by a factor of 10 at
epochs 90 and 120.
Finetuning on DriPE is performed for 10 epochs
with a learning rate of 1E-4. We use data augmen-
tation operations (rotation, flipping, etc.) for both
datasets. Following the state of the art, the input
images are cropped around the subjects using the
ground-truth, for both training and evaluation. Train-
ing is performed on a computer with an Nvidia GTX 1080 graphics card, an Intel Core i990k processor, and 32 GB of RAM.
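For concreteness, this schedule can be written as follows (a minimal sketch assuming a PyTorch-style setup; the optimizer choice and the train_one_epoch helper are assumptions, while the learning rates, milestones, and epoch counts follow the text):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # model: metamodel of Section 3.1
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90, 120], gamma=0.1)

for epoch in range(140):                   # training on COCO
    train_one_epoch(model, optimizer)      # one pass with the loss of Section 3.3 (assumed helper)
    scheduler.step()

finetune_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(10):                    # finetuning on DriPE
    train_one_epoch(model, finetune_optimizer)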
4.3 Multitask Training
We tested in our experiments three strategies for mul-
titask training. As detailed in the previous section,
weights of the feature extractor are initialized on Ima-
geNet and the visibility predictor’s weights are initial-
ized randomly. For the first strategy (S1), we train the
keypoint estimation and the visibility prediction tasks
jointly with a fixed α set to 0.25 (value chosen empiri-
cally). For the second and third strategies (S2 and S3),
we pre-train the feature extraction and coordinate pre-
diction modules on the COCO dataset, in the same way
as regular HPE networks are trained. Then, we re-
sume the training for 80 epochs, while incrementing
α by 0.1 every 20 epochs, starting from α=0. In S2,
the whole model is updated during these 80 epochs.
However, in S3, only the visibility predictor is trained
during this step, while the remaining weights (feature
extractor and coordinate estimator) are frozen.
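As an illustration, strategy S3 could be implemented along these lines (a sketch building on the components introduced in Section 3; the dataloader, optimizer, and class weights are assumed to be defined):

# S3: start from a model pre-trained for HPE, freeze the feature extractor and the
# coordinate estimator, and train only the visibility predictor for 80 epochs.
for module in (model.feature_extractor, model.coordinate_estimator):
    for param in module.parameters():
        param.requires_grad = False

for epoch in range(80):
    alpha = 0.1 * (epoch // 20)          # 0.0, 0.1, 0.2, 0.3: +0.1 every 20 epochs
    for images, gt_heatmaps, gt_visibility in train_loader:
        pred_heatmaps, pred_visibility = model(images)
        loss = combined_loss(pred_heatmaps, gt_heatmaps,
                             pred_visibility, gt_visibility,
                             class_weights, alpha=alpha)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

For S2, the same schedule is used but without freezing any weights.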
4.4 Base Models
For the experiments, we implemented our method with three base models besides Simple Baseline. We
first used EfficientNet as a feature extractor (Tan and
Le, 2019), which is more recent than ResNet. We em-
ployed two different sizes: B0 (the smallest) and B6
(the second largest). We followed the same training
strategy and reused the heatmap generator from the
Simple Baseline model.
We also set up our metamodel with the MSPN net-
work (Li et al., 2019), as a feature extractor and a
heatmap generator. Because MSPN uses a multi-stage
architecture, we extracted the feature vector from the
output of the last encoder to feed the visibility mod-
ule. We initialized the model with the weights already
trained on COCO for human pose estimation.
4.5 Evaluation
The performance of the coordinate prediction mod-
ule was measured using two metrics. First, we
used the regular metric for the COCO dataset called
AP OKS (Lin et al., 2015). This metric computes
the average precision and recall using a score called
OKS. However, this metric is person-centered and does not provide information on the model's performance for each individual keypoint. Furthermore, this
metric only considers labeled keypoints, i.e., visible
and non-visible keypoints, which puts aside false-
positive predictions. Therefore, we also evaluated the
models with the mAPK metric (Guesdon et al., 2021).
This metric provides an evaluation at the keypoint level and allows measuring the performance of the model on each body part separately.
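For reference, the OKS score compares a predicted pose with its ground truth over the labeled keypoints only; its usual form, following (Lin et al., 2015), can be written as

$\mathrm{OKS} = \dfrac{\sum_i \exp\!\left(-d_i^2 / (2\, s^2 k_i^2)\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$

where $d_i$ is the distance between the predicted and ground-truth positions of keypoint $i$, $s$ is the object scale, $k_i$ a per-keypoint constant, and $v_i$ the visibility label, so that only labeled keypoints ($v_i > 0$) contribute to the score.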
5 RESULTS
In this section, we present and discuss the perfor-
mance of the proposed metamodel. More precisely,
we first study the quality of the visibility predictions
using different strategies to train the models. Then,
we study the impact of the visibility prediction on
the keypoint detection task using both AP OKS and
mAPK metrics. Finally, we discuss the performance
of the proposed solution with different base models.
5.1 Visibility Prediction
We tried out several strategies to train the model, de-
scribed in Section 4.3. The performance of the three
resulting networks is presented in Table 1.
Table 1: F1-score of the network for visibility prediction on
COCO 2017 val set with different training strategies.
Strategy non-labeled non-visible visible total
S1 0.72 0.21 0.76 0.71
S2 0.75 0.34 0.79 0.74
S3 0.77 0.37 0.80 0.76
First, we can observe in Table 1 that pre-training
the network on the keypoint estimation task (S2 and
S3) outperforms the joint training of the three mod-
ules (S1). Indeed, we can notice an increase of 5% of
the total F1-score between S1 and S3. This improve-
ment is mostly perceptible in the non-visible class
(gain of 16%). However, training on the visibility task
while freezing the rest of the network (S3) does not
impact the overall performance. Indeed, we trained
several models and present in Table 1 the model for
each strategy with the best performance. Neverthe-
less, we observed little performance difference between the networks trained with and without freezing.
In the end, this experiment demonstrates that already
trained HPE networks can be used with our meta-
model and reach optimal performance. This allows
saving time and computing power, especially with a
large dataset like COCO.
Regarding the performance of visibility predic-
tion, results in Table 1 show that we are able to predict
keypoint visibility with a total F1-score up to 76%.
However, we can notice that the model has difficulty predicting the "non-visible" class, with a maximum F1-score of 37%. Two reasons can explain this
gap. First, non-visibility is a subjective notion, since it corresponds to keypoints which are occluded but where there is enough information in the
image to deduce the location of the keypoint. Because
the assessment of what counts as "enough information" is left to the annotator, it leads to inconsistencies in the annota-
tions. Secondly, the keypoints labeled as non-visible
represent only 7% of the COCO keypoints (Figure 5).
Even if this distribution gap is taken into considera-
tion in the computation of the weighted cross-entropy
cost function $L_V$, it still has a negative impact on the learning process.
Figure 5: Distribution of the keypoint visibility labels in the
COCO dataset.
To study the impact of the distribution of examples
of the three visibility classes, we finetuned our net-
work on the DriPE dataset (Guesdon et al., 2021). This
dataset presents a more homogeneous distribution of the visibility classes, as shown in Figure 6.
Figure 6: Distribution of the keypoint visibility labels in the
DriPE dataset.
Table 2: Performance of the network for visibility predic-
tion on DriPE dataset before and after finetuning.
F1-score non-labeled non-visible visible total
COCO baseline 0.71 0.34 0.64 0.60
Finetuned on DriPE 0.81 0.70 0.76 0.76
As we can see in Table 2, after finetuning, the
model achieves an F1-score of 70% for the non-
visible keypoints. These results demonstrate that with
a better distribution of the visibility classes and more
homogeneous images, our metamodel is able to bet-
ter estimate the visibility of keypoints, in particular
for the non-visible class.
5.2 Keypoint Estimation
We now study the impact of the addition of the
visibility module on the performance of the key-
point detection. We use for this study the mAPK
metric (Guesdon et al., 2021), which provides a
more keypoint-centered performance measurement
than AP OKS (Lin et al., 2015). Similar to AP OKS,
mAPK measures both average precision (AP) and av-
erage recall (AR). We provide results for both COCO
(Table 3) and DriPE (Table 4) datasets. The "SBl + visibility" network refers to the implementation of our metamodel with the Simple Baseline network. The "non-0" term denotes the experiment where all keypoint coordinates predicted by the visibility module as "non-labeled" are considered as not predicted for the
computation of the mAPK metric. This strategy aims to improve the precision of the keypoint prediction module, which is classically designed to predict coordinates for every keypoint during inference, in scenes where some keypoints are outside the image or strongly occluded.
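A sketch of this post-processing step (tensor shapes and the NaN convention for discarded keypoints are assumptions):

import torch

def filter_non_labeled(coords, visibility_probs):
    # coords: (n_keypoints, 2) predicted coordinates; visibility_probs: (n_keypoints, 3).
    predicted_class = visibility_probs.argmax(dim=-1)   # 0 (non-labeled), 1 or 2 per keypoint
    keep = predicted_class != 0                          # keep non-visible and visible keypoints
    filtered = coords.clone()
    filtered[~keep] = float('nan')                       # discarded keypoints count as not predicted
    return filtered, keep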
Table 3: HPE on the COCO 2017 validation set with
mAPK.
configuration Head Sho. Elb. Wri. Hip Knee Ank. Mean
AP
SBl 0.66 0.76 0.73 0.70 0.74 0.74 0.74 0.72
SBl + visibility 0.66 0.76 0.72 0.70 0.73 0.73 0.73 0.72
SBl + visibility + non-0 0.71 0.78 0.77 0.73 0.73 0.76 0.74 0.75
AR
SBl 0.73 0.77 0.73 0.70 0.70 0.72 0.72 0.72
SBl + visibility 0.73 0.76 0.73 0.69 0.70 0.72 0.72 0.72
SBl + visibility + non-0 0.43 0.72 0.58 0.68 0.68 0.66 0.35 0.59
Table 4: HPE on the DriPE test set with mAPK.
configuration Head Sho. Elb. Wri. Hip Knee Ank. Mean
AP
SBl 0.85 0.90 0.94 0.96 0.98 0.95 0.68 0.89
SBl + visibility 0.84 0.90 0.94 0.96 0.98 0.95 0.68 0.89
SBl + visibility + non-0 0.86 0.90 0.94 0.97 0.98 0.96 0.72 0.90
AR
SBl 0.87 0.96 0.96 0.97 0.98 0.95 0.80 0.93
SBl + visibility 0.87 0.96 0.96 0.97 0.98 0.95 0.80 0.93
SBl + visibility + non-0 0.44 0.96 0.85 0.97 0.98 0.93 0.77 0.84
Firstly, we can observe that our metamodel
(SBl + visibility) achieves performance similar to the
SBl baseline on keypoint detection. It indicates that
adding the visibility task has no negative impact on
the primary task, regardless of the dataset used.
Secondly, the non-0 strategy slightly improves the
average precision of the keypoint detection, which
denotes a decrease in the number of false positives.
However, this precision increase comes with a nega-
tive trade-off regarding the average recall, caused by
an increase of the false negatives. The decrease of the
recall is significant for the keypoints on the head, elbows, and ankles. Prediction of the visibility on the
face is a delicate task since almost none of these key-
points are labeled as non-visible due to the COCO an-
notation style. Ankles are also difficult keypoints to
predict in a general context, even if it is less observ-
able in the DriPE dataset due to the lower number of labeled ankles. In the end, an increase of precision can be useful in applications that require high confidence in the predicted keypoints.

Table 5: Performance of the network for keypoint detection on COCO 2017 with different base models.
Base model parameters AP AP50 AP75 APM APL AR AR50 AR75 ARM ARL
SBl 71.2M 71.9 91.5 79.0 69.2 76.4 75.3 92.8 81.8 72.1 80.1
EfficientNet B0 55.6M 67.1 90.4 74.9 63.9 71.7 70.3 91.1 77.0 66.8 75.5
EfficientNet B6 95.5M 72.5 92.4 80.1 69.8 76.9 75.8 93.0 82.7 72.6 80.7
MSPN 2-stg 104.6M 71.8 92.5 81.4 69.0 76.1 75.3 93.5 83.8 71.9 80.3

Table 6: Performance of the network for keypoint detection on DriPE with different base models.
Base model parameters AP AP50 AP75 APM APL AR AR50 AR75 ARM ARL
SBl 71.2M 96.5 99.9 99.9 - 96.5 97.5 99.9 99.9 - 97.5
EfficientNet B0 55.6M 91.8 99.0 99.0 - 91.8 94.7 99.9 99.6 - 94.7
EfficientNet B6 95.5M 99.4 99.0 99.0 - 94.4 96.5 99.9 99.6 - 96.5
MSPN 2-stg 104.6M 97.8 99.0 99.0 - 97.8 99.0 99.9 99.9 - 99.0

Table 7: Qualitative comparison of keypoints prediction filtered with a confidence threshold (top row) and with the visibility predicted by our metamodel (bottom row). Red dots represent the false-positive keypoints.
Table 8: Performance of the network for visibility predic-
tion on COCO 2017 with different base models.
Base model parameters non-labeled non-visible visible total
SBl 71.2M 0.77 0.37 0.80 0.76
EfficientNet B0 55.6M 0.74 0.32 0.77 0.73
EfficientNet B6 95.5M 0.75 0.34 0.80 0.76
MSPN 2-stg 104.6M 0.69 0.34 0.69 0.67
We present qualitative results in Table 7. As observed in Tables 3 and 4, the gain in precision comes mostly from face keypoints. This is illustrated by face keypoints that were predicted despite strong occlusion and a lack of information (Table 7-A,B). However, the precision for other body parts has also been improved, such as the knees (Table 7-C). Finally, the negative trade-off regarding the recall is caused by keypoints that were correctly predicted by the coordinate estimator but predicted as non-labeled by the visibility predictor (Table 7-D).
5.3 Other Base Models
We evaluated our metamodel with different HPE ar-
chitectures: EfficientNet B0 and B6, and MSPN. The
performance of these implementations can be found
in Tables 5 and 8. The two tasks were trained succes-
sively while freezing the feature extractor during the
visibility task training.
As we can observe, the models achieve good per-
formance on pose estimation while reaching perfor-
mance on visibility prediction similar to the one pre-
sented in Table 1. These results demonstrate that our metamodel can be deployed with networks of varied sizes and architectures while preserving the performance on both tasks. Please note that we
trained each network only once except SBl which is
used as the baseline for our study. Therefore, these re-
sults may not reflect the optimal performance of each
network.
Table 9: Performance of the network for visibility predic-
tion on DriPE with different base models.
Base model parameters non-labeled non-visible visible total
SBl 71.2M 0.81 0.70 0.76 0.76
EfficientNet B0 55.6M 0.72 0.54 0.72 0.69
EfficientNet B6 95.5M 0.78 0.58 0.63 0.67
MSPN 2-stg 104.6M 0.57 0.55 0.46 0.51
Finally, we finetuned and evaluated the networks
on the DriPE dataset (Tables 6 and 9). The models still achieve around 60% F1-score on visibility prediction while reaching over 90% precision and recall on the keypoint es-
timation. We can notice that the performance of the
MSPN network is below what we could expect for
such a large number of parameters. An adjustment
of the training and finetuning parameters could im-
prove performance, especially considering the size of
the network. Also, because of the multiscale and mul-
tistage architecture of MSPN, concatenating several
scale levels to extract the feature vector from the net-
work could improve the results.
6 CONCLUSIONS
In this paper, we have presented a new metamodel for
human pose estimation and visibility prediction. This
method achieves good performance on visibility pre-
diction while preserving the performance of the key-
point estimation of the base model. We demonstrated
that these results can be achieved using different base
models. We also showed that the metamodel performs
well on two public datasets regarding the visibility
prediction: the COCO dataset, a general and state-of-
the-art dataset, and the DriPE dataset which contains
images with stronger occlusion. Finally, we used the
predicted visibility to improve the keypoint detection,
by discarding the keypoints predicted as non-labeled.
Our results show that this strategy can improve the
precision of the detection, even though it may reduce
the recall, especially for head and ankle keypoints.
Future work will investigate strategies to improve the precision of keypoint coordinate estimation using visibility prediction with a smaller negative trade-off on
recall. For instance, we could combine the predicted
confidence of the two tasks for a final prediction. Fur-
thermore, it would be interesting to study the integra-
tion of the proposed metamodel into multi-scale architectures, such as the MSPN architecture. These architectures
present a higher performance on keypoint estimation,
but the proposed integration still does not take full ad-
vantage of the multiscale features available. Finally, it
would be interesting to study the influence of the gain in keypoint estimation accuracy on practical applications, such as action recognition or posture analysis in
car-safety applications.
ACKNOWLEDGEMENTS
This work was supported by the Pack Ambition
Recherche 2019 funding of the French AURA Region
in the context of the AutoBehave project.
REFERENCES
Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B.
(2014). 2d human pose estimation: New benchmark
and state of the art analysis. In 2014 IEEE Conference
on Computer Vision and Pattern Recognition, pages
3686–3693.
Artacho, B. and Savakis, A. (2020). Unipose: Unified hu-
man pose estimation in single images and videos. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
7035–7044.
Cai, Y., Wang, Z., Luo, Z., Yin, B., Du, A., Wang,
H., Zhang, X., Zhou, X., Zhou, E., and Sun, J.
(2020). Learning delicate local representations for
multi-person pose estimation. In Vedaldi, A., Bischof,
H., Brox, T., and Frahm, J.-M., editors, Computer Vi-
sion ECCV 2020, pages 455–472, Cham. Springer
International Publishing.
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y.
(2019). Openpose: realtime multi-person 2d pose esti-
mation using part affinity fields. IEEE transactions on
pattern analysis and machine intelligence, 43(1):172–
186.
Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). Real-
time multi-person 2d pose estimation using part affin-
ity fields. In 2017 IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 1302–
1310.
Carreira, J., Agrawal, P., Fragkiadaki, K., and Malik, J.
(2016). Human pose estimation with iterative error
feedback. In 2016 IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 4733–
4742.
Chen, Y., Shen, C., Wei, X.-S., Liu, L., and Yang, J. (2017).
Adversarial posenet: A structure-aware convolutional
network for human pose estimation. In 2017 IEEE
International Conference on Computer Vision (ICCV),
pages 1221–1230.
Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A. L., and
Wang, X. (2017). Multi-context attention for human
pose estimation. In 2017 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
5669–5678.
Das, S., Koperski, M., Bremond, F., and Francesca, G.
(2017). Action recognition based on a mixture of rgb
and depth based skeleton. In 2017 14th IEEE Inter-
national Conference on Advanced Video and Signal
Based Surveillance (AVSS), pages 1–6.
Guesdon, R., Crispim-Junior, C., and Tougne, L. (2021).
Dripe: A dataset for human pose estimation in
real-world driving settings. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV) Workshops, pages 2865–2874.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Hidalgo, G., Raaj, Y., Idrees, H., Xiang, D., Joo, H., Simon,
T., and Sheikh, Y. (2019). Single-network whole-body
pose estimation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
6982–6991.
Ke, L., Chang, M.-C., Qi, H., and Lyu, S. (2018). Multi-
scale structure-aware network for human pose esti-
mation. In Ferrari, V., Hebert, M., Sminchisescu,
C., and Weiss, Y., editors, Computer Vision ECCV
2018, pages 731–746, Cham. Springer International
Publishing.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc.
Kumar, A., Alavi, A., and Chellappa, R. (2017). Kepler:
Keypoint and pose estimation of unconstrained faces
by learning efficient h-cnn regressors. In 2017 12th
IEEE International Conference on Automatic Face
Gesture Recognition (FG 2017), pages 258–265.
Kumar, A., Marks, T. K., Mou, W., Wang, Y., Jones, M.,
Cherian, A., Koike-Akino, T., Liu, X., and Feng, C.
(2020). Luvli face alignment: Estimating landmarks’
location, uncertainty, and visibility likelihood. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 8236–8246.
Li, W., Wang, Z., Yin, B., Peng, Q., Du, Y., Xiao, T., Yu,
G., Lu, H., Wei, Y., and Sun, J. (2019). Rethinking on
multi-stage networks for human pose estimation.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017). Feature pyramid networks
for object detection. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR),
pages 936–944.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick,
R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L.,
and Dollár, P. (2015). Microsoft coco: Common ob-
jects in context.
Newell, A., Huang, Z., and Deng, J. (2017). Associative
embedding: End-to-end learning for joint detection
and grouping. In Guyon, I., Luxburg, U. V., Ben-
gio, S., Wallach, H., Fergus, R., Vishwanathan, S., and
Garnett, R., editors, Advances in Neural Information
Processing Systems 30, pages 2277–2287. Curran As-
sociates, Inc.
Newell, A., Yang, K., and Deng, J. (2016). Stacked Hour-
glass Networks for Human Pose Estimation. In Leibe,
B., Matas, J., Sebe, N., and Welling, M., editors, Com-
puter Vision ECCV 2016, pages 483–499, Cham.
Springer International Publishing.
Nie, X., Feng, J., Xing, J., and Yan, S. (2018). Pose partition
networks for multi-person pose estimation. In Com-
puter Vision ECCV 2018, pages 684–699, Cham.
Springer International Publishing.
Stoffl, L., Vidal, M., and Mathis, A. (2021). End-to-end
trainable multi-instance pose estimation with trans-
formers.
Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep high-
resolution representation learning for human pose es-
timation. In 2019 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
5686–5696.
Sun, X., Shang, J., Liang, S., and Wei, Y. (2017). Composi-
tional human pose regression. In 2017 IEEE Interna-
tional Conference on Computer Vision (ICCV), pages
2621–2630.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional Conference on Machine Learning, pages 6105–
6114. PMLR.
Tang, W. and Wu, Y. (2019). Does learning specific fea-
tures for related parts help human pose estimation? In
2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1107–1116.
Tang, W., Yu, P., and Wu, Y. (2018). Deeply learned compo-
sitional models for human pose estimation. In Ferrari,
V., Hebert, M., Sminchisescu, C., and Weiss, Y., edi-
tors, Computer Vision ECCV 2018, pages 197–214,
Cham. Springer International Publishing.
Toshev, A. and Szegedy, C. (2014). Deeppose: Human pose
estimation via deep neural networks. In 2014 IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 1653–1660.
Xiao, B., Wu, H., and Wei, Y. (2018). Simple baselines
for human pose estimation and tracking. In Ferrari,
V., Hebert, M., Sminchisescu, C., and Weiss, Y., edi-
tors, Computer Vision ECCV 2018, pages 472–487,
Cham. Springer International Publishing.
Zhao, L., Xu, J., Zhang, S., Gong, C., Yang, J., and Gao,
X. (2020a). Perceiving heavily occluded human poses
by assigning unbiased score. Information Sciences,
537:284–301.
Zhao, M., Beurier, G., Wang, H., and Wang, X. (2020b).
A pipeline for creating in-vehicle posture database
for developing driver posture monitoring systems.
In DHM2020: Proceedings of the 6th International
Digital Human Modeling Symposium, August 31-
September 2, 2020, volume 11, pages 187–196. IOS
Press.