FingerSeg: Highly-Efficient Dual-Resolution Architecture for Precise
Finger-Level Semantic Segmentation
Gibran Benitez-Garcia¹ and Hiroki Takahashi¹,²,³
¹Graduate School of Informatics and Engineering, The University of Electro-Communications, Japan
²Artificial Intelligence eXploration Research Center (AIX), The University of Electro-Communications, Japan
³Meta-Networking Research Center (MEET), The University of Electro-Communications, Japan
Keywords:
Semantic Segmentation, Finger Segmentation, DDRNet, Real-Time CNN, IPN-Hand Dataset.
Abstract:
Semantic segmentation at the finger level poses unique challenges, including the limited pixel representation
of some classes and the complex interdependency of the hand anatomy. In this paper, we propose FingerSeg,
a novel architecture inspired by Deep Dual-Resolution Networks, specifically adapted to address the nuances
of finger-level hand semantic segmentation. To this end, we introduce three modules: Enhanced Bilateral Fu-
sion (EBF), which refines low- and high-resolution feature fusion via attention mechanisms; Multi-Attention
Module (MAM), designed to augment high-level features with a composite of channel, spatial, orientational,
and categorical attention; and Asymmetric Dilated Up-sampling (ADU), which combines standard and asym-
metric atrous convolutions to capture rich contextual information for pixel-level classification. To properly
evaluate our proposal, we introduce IPN-Finger, a subset of the IPN-Hand dataset, manually annotated pixel-wise for 13 finger-related classes. Our extensive empirical analysis, including evaluations on the synthetic RHD dataset against current state-of-the-art methods, demonstrates that our proposal achieves top results.
FingerSeg reaches 73.8 and 71.1 mIoU on the IPN-Finger and RHD datasets, respectively, while maintaining
an efficient computational cost of about 7 GFLOPs and 6 million parameters at VGA resolution. The dataset,
source code, and a demo of FingerSeg will be available upon the publication of this paper.
1 INTRODUCTION
Hand segmentation represents a dense prediction
problem, focused on identifying each pixel asso-
ciated with a hand in binary segmentation frame-
works (Urooj and Borji, 2018), and in more advanced
applications, distinguishing between left and right
hands (Bandini and Zariffa, 2020). This segmenta-
tion is often a preliminary step in diverse tasks, rang-
ing from hand gesture recognition (HGR) to human
behavior analysis (Dadashzadeh et al., 2019; Lik-
itlersuang et al., 2019; Benitez-Garcia et al., 2021b).
While effective for broad categorization, conventional
left and right hand detection falls short in applications
requiring finer granularity. Specifically, finger-level
segmentation may provide a clearer distinction between gestures, particularly those that differ only in the number of fingers involved. This level of detail is
crucial for accurately distinguishing between nuanced
gestures, as illustrated in Figure 1, where the standard
hand segmentation can help localize the hand but fails to determine the number of fingers shown in the gesture. Therefore, in this paper, we propose FingerSeg, a novel segmentation architecture explicitly designed for finger-level hand semantic segmentation.
Figure 1: Comparison between gestures based on one and two fingers. The finger segmentation granularity helps to determine the number of fingers involved.
FingerSeg is inspired by the principles of Deep
Dual-Resolution Networks (DDRNet (Pan et al.,
2022)) and tailored to overcome the unique chal-
lenges posed by finger segmentation, such as the
limited pixel area of certain fingers and the com-
plex spatial relationships within hand anatomy. Fin-
gerSeg introduces advanced feature fusion and atten-
tion mechanisms to capture the subtle distinctions be-
tween fingers. This enhanced granularity in segmen-
tation is not only crucial for applications in HGR and
sign language recognition but also holds significant
promise in medical fields, where precise hand and
finger movements are essential (Rangesh and Trivedi,
2018; Likitlersuang et al., 2019).
Efficiency in computation is essential when in-
tegrating finger segmentation into real-time applica-
tions. The demand for lightweight models to ensure
prompt and responsive performance is paramount. In this regard, DDRNet offers an efficient architecture op-
timized for speed and accuracy. However, it does not
inherently address the mentioned challenges of finger
segmentation. To bridge this gap, we propose three
novel modules that do not impose excessive computa-
tional costs. Our proposal includes the Enhanced Bi-
lateral Fusion (EBF), which improves feature merg-
ing precision; the Multi-Attention Module (MAM),
which provides nuanced processing of features; and
the Asymmetric Dilated Up-sampling (ADU), which
enhances feature resolution effectively. Each module
is carefully designed to contribute to the overall ef-
ficacy, ensuring that FingerSeg remains highly effi-
cient while setting new standards in segmentation de-
tail and accuracy.
To thoroughly evaluate the capabilities of Fin-
gerSeg, we introduce IPN-finger, a curated subset of
the IPN-Hand dataset for HGR (Benitez-Garcia et al.,
2021a). Specifically, we defined 13 finger-related
classes and manually annotated 1000 frames at pixel
level. Moreover, our evaluation extends to the Ren-
dered Hand Pose Dataset (RHD) (Zimmermann and
Brox, 2017), which, to the best of our knowledge,
is the only publicly available dataset offering finger-
level pixel-wise annotations. Thus, the real-world samples of the IPN-Finger dataset serve as a critical benchmark to validate the precision and effectiveness of finger segmentation approaches.
In our experimental analysis, FingerSeg not only
demonstrates an exceptional balance between accu-
racy and computational efficiency but also presents
a significant improvement over our baseline model.
Specifically, FingerSeg outperforms DDRNet by approximately 3 mIoU points, achieving a mIoU of 73.8 on the
IPN-Finger dataset. FingerSeg demands only about 7
GFLOPs and 6 million parameters for a 640x480 im-
age resolution, which allows faster than real-time per-
formance across different platforms. Consequently,
FingerSeg presents a valuable solution for finger seg-
mentation, serving as an efficient preprocessing step
and a robust framework for real-time applications.
The main contributions are summarized as fol-
lows:
• Introduction of FingerSeg, an architecture specifically designed for efficient and accurate finger-level hand segmentation.
• Development and integration of three novel modules: Enhanced Bilateral Fusion (EBF), Multi-Attention Module (MAM), and Asymmetric Dilated Up-sampling (ADU), to achieve state-of-the-art results with minimal computational cost increase.
• Compilation of the IPN-Finger dataset, comprising 1000 frames with pixel-wise annotations across 13 classes, including the palm, all fingers of each hand, and the overall shape of the person.
• Superior performance of FingerSeg, with results of approximately 73.8 and 71.1 mIoU on the IPN-Finger and RHD datasets, respectively, surpassing real-time semantic segmentation approaches, including notable methods like DDRNet (Pan et al., 2022) and PIDNet (Xu et al., 2023).
2 RELATED WORK
2.1 Hand Segmentation
Hand segmentation has been an active research topic
in recent years, with significant implications for di-
verse applications. Notable among these are hand
gesture recognition (HGR), RGB-based hand pose
estimation, and the analysis of egocentric interac-
tions (Bandini and Zariffa, 2020).
Segmentation serves as a preprocessing step for
HGR, enhancing subsequent processes like classifi-
cation. Studies such as (Dadashzadeh et al., 2019)
and (Benitez-Garcia et al., 2021b) have illustrated
how effective segmentation can boost the accuracy of
HGR systems. Binary segmentation, in particular, has
been a staple in preprocessing for hand pose estima-
tion. For instance, HandSegNet (Zimmermann and
Brox, 2017) underpins the 3D hand pose estimation
from RGB frames. Likewise, the end-to-end train-
able framework proposed in (Baek et al., 2019) uti-
lizes segmentation masks to facilitate 3D hand pose
reconstructions from 2D joint estimations.
In the context of egocentric vision, robust hand
segmentation has proven critical for action and ac-
tivity recognition involving hands. This has led to
the development of methods that rely on accurate
hand segmentation as a precursor to activity recog-
nition (Li et al., 2019). A significant advancement in
binary hand segmentation is presented in (Cai et al.,
2020), which showcases a Bayesian CNN framework
enhancing generalizability across diverse domains.
However, a noticeable gap in the existing litera-
ture is the lack of focus on hand segmentation with
the granularity necessary to distinguish individual fin-
gers. This paper seeks to fill that gap by introduc-
ing an approach specifically designed for this pur-
pose, extending the scope and applicability of hand
segmentation in computational vision.
2.2 Real-Time Semantic Segmentation
Advanced semantic segmentation techniques typi-
cally depend on preserving high-resolution features
while implementing convolutions with extensive di-
lation rates to broaden receptive fields, as seen
in methods like PSPNet (Zhao et al., 2017) and
DeepLabV3+ (Chen et al., 2018). Despite their high
accuracy, the computational intensity and complex
pooling mechanisms of these methods often prohibit
real-time performance.
In contrast, real-time segmentation algorithms
consider more efficient architectures, such as
lightweight encoder-decoder or bilateral pathway de-
signs. These often incorporate compact pyramidal
pooling and depth-wise convolutions. DABNet (Li
and Kim, 2019), for example, leverages Depth-
wise Asymmetric Bottleneck modules, which com-
bine factorized depth-wise convolutions in a bottle-
neck structure to extract local and contextual infor-
mation jointly, obviating the need for extensive pool-
ing modules. Other approaches like HarDNet (Chao
et al., 2019) and FASSDNet (Rosas-Arias et al., 2021)
use classic encoder-decoder architectures relying on
Harmonic Dense Blocks. These blocks are engi-
neered to reduce memory usage and computational
density, addressing the challenges of the dense blocks
proposed by DenseNet. On the other hand, a sig-
nificant development in real-time semantic segmen-
tation is DDRNet (Pan et al., 2022), which intro-
duces a dual-resolution backbone comprising low-
and high-resolution branches with a one-to-one corre-
lation between paths. This includes bilateral connec-
tions to foster efficient information exchange between
context (low-resolution) and detail (high-resolution)
branches. More recently, PIDNet (Xu et al., 2023)
advances the field with a three-branch network archi-
tecture. This design parses detailed, contextual, and
boundary information through separate branches, uti-
lizing boundary attention to guide the fusion of de-
tailed and contextual information.
Our choice of DDRNet as a baseline derives from
its efficient dual-resolution approach and the potential
for enhancements in finger-level segmentation. This
framework allows for an optimal balance between de-
tail capture and computational efficiency, making it
an ideal foundation for our FingerSeg model.
3 FINGERSEG NETWORK
The proposed architecture for finger-level hand seg-
mentation is designed to address the specific chal-
lenges inherent in this task. Our methodology is in-
spired by the efficient architecture of DDRNet, which
we have significantly adapted and enhanced to cater
to the nuanced requirements of segmenting individ-
ual fingers. At the core of FingerSeg are three key
modules: Enhanced Bilateral Fusion (EBF), Multi-
Attention Module (MAM), and Asymmetric Dilated
Up-sampling (ADU). Each of these modules plays
a pivotal role in refining the segmentation process,
ensuring both high accuracy and computational effi-
ciency. In the following sections, we delve into the
intricacies of these modules, explaining how they col-
lectively contribute to the superior performance of
FingerSeg in finger-level segmentation tasks.
3.1 Network Overview
FingerSeg’s architecture, as illustrated in Figure 2,
is built upon a dual-resolution backbone that bifur-
cates from a single trunk into two parallel branches,
each operating at a distinct resolution. The high-
resolution branch aims to generate detailed feature
maps at 1/8 the resolution of the input image. No-
tably, this branch excludes any downsampling opera-
tions to preserve high-resolution information, main-
taining a one-to-one correspondence with the low-
resolution branch to form deep, detailed representa-
tions. Conversely, the low-resolution branch, akin to
DDRNet’s design, employs multiple downsampling
operations within its Residual Blocks (RB) to produce
feature maps at a reduced 1/64 resolution. This struc-
ture not only captures rich contextual information but
also contributes to the network’s overall efficiency.
EBF blocks are employed to integrate detailed and
contextual features at two critical points within the ar-
chitecture. Additionally, the end of the low-resolution
pathway incorporates the Deep Aggregation Pyra-
mid Pooling Module (DAPPM), as utilized in DDR-
Net (Pan et al., 2022). The DAPPM enriches se-
mantic information without compromising inference
speed by processing lower-resolution feature maps.
Figure 2: Overview of the FingerSeg architecture. The dual-resolution network starts as a single trunk and divides into two branches: the high-resolution branch (top) and the low-resolution branch (bottom). Enhanced Bilateral Fusion (EBF) blocks facilitate the integration of detailed and contextual features. The Deep Aggregation Pyramid Pooling Module (DAPPM) at the end of the low-resolution branch enriches semantic information. Feature maps are then refined by the Multi-Attention Module (MAM) and upsampled by the Asymmetric Dilated Up-sampling (ADU) module for final segmentation.
It operates on feature maps at a 1/64 resolution, utilizing large pooling kernels to create multi-scale feature
maps, and incorporates global average pooling for ad-
ditional image-level detail. A cascading fusion strat-
egy involving successive upsampling and 3×3 convo-
lutions is applied to integrate varying scales of fea-
tures, which are then unified and compacted using a
1×1 convolution.
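For reference, the DAPPM can be summarized with the following PyTorch sketch. It is a simplified reading of the public DDRNet implementation rather than our exact code: the pooling kernels and strides (5/2, 9/4, 17/8, plus global average pooling) and the pre-activation ordering are assumptions taken from that repository, and the shortcut branch is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DAPPMSketch(nn.Module):
    # Simplified DAPPM: multi-scale average pooling, cascading fusion with
    # 3x3 convolutions, and a final 1x1 compression (shortcut omitted).
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        def branch(pool):
            return nn.Sequential(pool, nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
                                 nn.Conv2d(in_ch, mid_ch, 1, bias=False))
        self.scales = nn.ModuleList([
            branch(nn.Identity()),
            branch(nn.AvgPool2d(5, stride=2, padding=2)),
            branch(nn.AvgPool2d(9, stride=4, padding=4)),
            branch(nn.AvgPool2d(17, stride=8, padding=8)),
            branch(nn.AdaptiveAvgPool2d(1))])
        self.process = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                          nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False))
            for _ in range(4)])
        self.compress = nn.Sequential(nn.BatchNorm2d(5 * mid_ch), nn.ReLU(inplace=True),
                                      nn.Conv2d(5 * mid_ch, out_ch, 1, bias=False))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.scales[0](x)]
        for i in range(1, 5):
            up = F.interpolate(self.scales[i](x), size=(h, w),
                               mode='bilinear', align_corners=False)
            # Cascading fusion: upsample, add the previous level, refine with a 3x3 conv.
            feats.append(self.process[i - 1](up + feats[-1]))
        return self.compress(torch.cat(feats, dim=1))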
Subsequently, the refined outputs from the
DAPPM and the high-resolution branch converge
within our MAM block before being upsampled by
the ADU module, leading to the final segmentation
prediction. The segmentation scores are subsequently
upscaled using bilinear interpolation, aligning with
the supervision of a standard cross-entropy loss func-
tion. The forthcoming subsections will delve into the
specifics of the proposed modules.
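As a compact reference for how these pieces connect, the tail of the network can be expressed as follows (a schematic PyTorch sketch; dappm, mam, and adu are placeholders for the modules detailed in the following subsections, and out_size is the input resolution):

import torch.nn.functional as F

def segmentation_head(high_res_feat, low_res_feat, dappm, mam, adu, out_size):
    # DAPPM enriches the 1/64-resolution context features.
    context = dappm(low_res_feat)
    # MAM upsamples the context internally and fuses it with the
    # 1/8-resolution detail features through the chained attention mechanisms.
    fused = mam(high_res_feat, context)
    # ADU refines, upsamples, and classifies the fused features.
    logits = adu(fused)
    # The segmentation scores are finally upscaled to the input resolution
    # and supervised with a standard cross-entropy loss.
    return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)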
3.2 EBF: Enhanced Bilateral Fusion
The EBF block, depicted in Figure 3, is designed
to effectively merge high-resolution detail with low-
resolution context features. High-to-low resolution
features undergo a transformation involving channel
expansion and spatial reduction through a 3×3 convolution with a stride of 2. Concurrently, low-to-high resolution features are channel-wise compacted using a 1×1 convolution and subsequently spatially enlarged
via bilinear interpolation (x2up).
Once the feature maps from both resolutions are
matched in spatial and channel dimensions, they are
combined through element-wise addition. Building
upon the concept introduced by CBAM (Woo et al.,
2018), we further refine the fused features by apply-
ing channel and spatial attention mechanisms to each
respective branch. This dual attention schema ensures that the most relevant features are emphasized, enhancing the quality of the subsequent feature representations for the precise segmentation tasks at hand.
Figure 3: The structure of the Enhanced Bilateral Fusion (EBF) block.
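A minimal PyTorch sketch of the EBF block is given below. The channel and spatial attention follow the standard CBAM formulation; applying both of them to each fused branch, as well as the ×2 resolution ratio between branches, is our reading of the description above rather than a verbatim transcription of the implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # CBAM-style channel attention: shared MLP over average- and max-pooled descriptors.
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // r, ch, 1))

    def forward(self, x):
        gate = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                             self.mlp(F.adaptive_max_pool2d(x, 1)))
        return x * gate

class SpatialAttention(nn.Module):
    # CBAM-style spatial attention over channel-wise average and max maps.
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.conv(s))

class EBFSketch(nn.Module):
    # High-to-low: 3x3 stride-2 conv (channel expansion + spatial reduction).
    # Low-to-high: 1x1 conv (channel compression) + bilinear x2 upsampling.
    # Each fused branch is then refined by channel and spatial attention.
    def __init__(self, hi_ch, lo_ch):
        super().__init__()
        self.hi2lo = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1, bias=False)
        self.lo2hi = nn.Conv2d(lo_ch, hi_ch, 1, bias=False)
        self.att_hi = nn.Sequential(ChannelAttention(hi_ch), SpatialAttention())
        self.att_lo = nn.Sequential(ChannelAttention(lo_ch), SpatialAttention())

    def forward(self, hi, lo):
        # Assumes hi has twice the spatial resolution of lo.
        lo_up = F.interpolate(self.lo2hi(lo), scale_factor=2,
                              mode='bilinear', align_corners=False)
        hi_dn = self.hi2lo(hi)
        return self.att_hi(hi + lo_up), self.att_lo(lo + hi_dn)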
3.3 MAM: Multi-Attention Module
The Multi-Attention Module (MAM), as shown in
Figure 4, is crucial in refining the feature fusion pro-
cess within FingerSeg. It starts by upscaling the out-
put feature maps from the DAPPM using bilinear in-
terpolation to match the spatial dimensions of the
high-resolution feature maps. These aligned feature
maps are then combined via element-wise summa-
tion. This fusion sets the stage for a series of spe-
cialized attention mechanisms, extending beyond the
scope of CBAM to address the unique challenges of
finger segmentation.
Figure 4: Configuration of the Multi-Attention Module (MAM).
We employ a Gabor Attention mecha-
nism (Richards et al., 2022) after channel and
spatial attention, specifically designed to be robust
to the orientation of features. This mechanism
utilizes Gabor-modulated convolutions, where con-
volutional weights are multiplied by Gabor filters
across different rotation parameters to generate
orientation-sensitive feature representations. The
Gabor Attention then calculates correlations along
the orientation axis, which are crucial for interpreting
the diverse positioning of fingers.
Further, the Category Attention Block (CAB) (He
et al., 2020) is integrated to address the challenges
associated with the imbalanced data characteristic of
finger segmentation, where pixel representation of
different fingers varies significantly. The Category
Attention of CAB operates in a class-specific man-
ner, allocating an equal number of feature channels to
each class. Thus, it mitigates channel bias and ampli-
fies inter-class feature distinction, ensuring each cate-
gory receives equal treatment.
MAM’s strategy, which includes channel, spatial,
orientational, and categorical attention, fortifies the
feature richness. This robustness is essential for Fin-
gerSeg’s capability to accurately segment fingers in
varying poses and alignments, which is critical for the
following ADU block and final segmentation process.
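The composition of the four attention mechanisms can be sketched as follows, reusing the ChannelAttention and SpatialAttention classes from the EBF sketch above. The orientation and category attention blocks are simplified stand-ins for the Gabor Attention of Richards et al. (2022) and the CAB of He et al. (2020); the Gabor parameters, the number of orientations, and the channels-per-class value are illustrative assumptions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrientationAttention(nn.Module):
    # Simplified Gabor-style orientation attention: a fixed bank of Gabor filters at
    # several orientations is applied depthwise, and the orientation responses are
    # re-weighted by a softmax computed along the orientation axis.
    def __init__(self, ch, n_orient=4, k=7):
        super().__init__()
        thetas = [i * math.pi / n_orient for i in range(n_orient)]
        bank = torch.stack([self._gabor(k, t) for t in thetas])    # (O, k, k)
        self.register_buffer('weight', bank[:, None, :, :].repeat_interleave(ch, dim=0))
        self.ch, self.n_orient, self.k = ch, n_orient, k

    @staticmethod
    def _gabor(k, theta, sigma=2.0, lam=4.0):
        r = torch.arange(k, dtype=torch.float32) - k // 2
        y, x = torch.meshgrid(r, r, indexing='ij')
        xr = x * math.cos(theta) + y * math.sin(theta)
        return torch.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) * torch.cos(2 * math.pi * xr / lam)

    def forward(self, x):
        b, c, h, w = x.shape
        resp = F.conv2d(x.repeat(1, self.n_orient, 1, 1), self.weight,
                        padding=self.k // 2, groups=self.n_orient * c)
        resp = resp.view(b, self.n_orient, c, h, w)
        att = torch.softmax(resp.mean(dim=(3, 4), keepdim=True), dim=1)  # orientation weights
        return (resp * att).sum(dim=1) + x                               # residual connection

class CategoryAttention(nn.Module):
    # CAB-like block: channels are split evenly across classes and each group is
    # gated by a per-class score, mitigating channel bias between classes.
    def __init__(self, ch, n_classes, k=8):
        super().__init__()
        self.n_classes, self.k = n_classes, k
        self.expand = nn.Conv2d(ch, n_classes * k, 1, bias=False)
        self.reduce = nn.Conv2d(n_classes * k, ch, 1, bias=False)

    def forward(self, x):
        f = self.expand(x)
        b, _, h, w = f.shape
        g = f.view(b, self.n_classes, self.k, h, w)
        score = torch.sigmoid(g.mean(dim=(2, 3, 4), keepdim=True))  # one score per class
        return x + self.reduce((g * score).view(b, -1, h, w))

class MAMSketch(nn.Module):
    # Upsample the DAPPM context, add it to the high-resolution features, and
    # chain channel, spatial, orientation, and category attention.
    def __init__(self, ch, n_classes):
        super().__init__()
        self.attend = nn.Sequential(ChannelAttention(ch), SpatialAttention(),
                                    OrientationAttention(ch), CategoryAttention(ch, n_classes))

    def forward(self, detail, context):
        context = F.interpolate(context, size=detail.shape[2:],
                                mode='bilinear', align_corners=False)
        return self.attend(detail + context)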
3.4 ADU: Asymmetric Dilated Up-Sampling
Figure 5: Schematic representation of the Asymmetric Dilated Up-sampling (ADU) block.
As shown in Figure 5, the process of ADU begins
with a 3×3 convolution that further refines the fea-
ture maps received from MAM. After this initial re-
finement, two pathways are used to process the fea-
tures in parallel. One branch utilizes asymmetric con-
volutions with dilation to effectively capture contex-
tual information from the feature maps, which have
been previously upsampled using bilinear interpola-
tion. This branch’s dilated convolutions enable a
broader receptive field, facilitating the assimilation of
context without loss of resolution. Simultaneously,
the second branch, employing a direct upsampling
approach without dilation, concentrates on learning
fine-grained details to enhance the spatial resolution
of the features. The outputs of both branches are then
concatenated, providing a composite feature map that
embodies contextual and detailed attributes.
The concatenated feature maps undergo a final
1×1 convolution for pixel-level classification. This
fusion of asymmetric dilation and direct detail learn-
ing in the ADU block optimizes the balance between
contextual understanding and detail preservation, a
critical aspect for accurately segmenting fingers.
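A minimal sketch of the ADU block follows; the kernel shapes of the asymmetric atrous convolutions, the dilation rate, and the ×2 upsampling factor are illustrative choices for the sketch rather than reported hyper-parameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ADUSketch(nn.Module):
    def __init__(self, in_ch, mid_ch, n_classes, up=2, dilation=3):
        super().__init__()
        # Initial 3x3 refinement of the features received from MAM.
        self.refine = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.up = up
        # Context path: asymmetric (1x3 then 3x1) atrous convolutions for a broad receptive field.
        self.context = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, (1, 3), padding=(0, dilation), dilation=dilation, bias=False),
            nn.Conv2d(mid_ch, mid_ch, (3, 1), padding=(dilation, 0), dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # Detail path: plain convolution without dilation for fine-grained details.
        self.detail = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        # Final 1x1 convolution for pixel-level classification.
        self.classify = nn.Conv2d(2 * mid_ch, n_classes, 1)

    def forward(self, x):
        x = self.refine(x)
        x = F.interpolate(x, scale_factor=self.up, mode='bilinear', align_corners=False)
        return self.classify(torch.cat([self.context(x), self.detail(x)], dim=1))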
4 IPN-FINGER DATASET
The scarcity of hand datasets offering annotations be-
yond binary pixel labels presents a significant chal-
lenge in advancing hand segmentation research. For
instance, EgoHands (Bambach et al., 2015) offers
pixel-wise annotations for activity recognition across
four classes, limited to distinguishing only between
the user’s and others’ left and right hands. Simi-
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications
246
Figure 6: Class labels in the IPN-Finger dataset.
larly, the dataset from (Benitez-Garcia et al., 2021b)
provides annotations for 500 frames of left and right
hands, targeted at touchless screen interactions. The
WorkingHands dataset (Kim et al., 2020) is another
substantial contribution, featuring over 400 thousand
frames of thermally and RGB-D captured hands using tools. However, these datasets do not address
finger-level granularity. On the other hand, to the best
of our knowledge, the Rendered Hand Pose Dataset
(RHD) (Zimmermann and Brox, 2017) is the only
publicly available dataset with finger-level pixel-wise
annotations, comprising 43,986 synthetically gener-
ated images.
To fill the void of real-world, finger-specific an-
notations, we have extended the IPN-Hand dataset
with finger-level semantic annotations. The IPN-
Hand dataset, known for capturing genuine interac-
tions with touchless screens, was the ideal candidate
for this subset. We selected 1000 frames showcas-
ing a range of finger positions, spanning the 13 static
and dynamic gestures and the variety of the 28 scenes
and backgrounds present in the dataset. This sub-
set, dubbed IPN-Finger, represents samples from 50
different subjects and defines 13 classes that com-
prise the palms, all fingers on each hand, and the per-
son’s shape. Figure 6 provides a visual guide to these
classes. The annotation was performed manually us-
ing the LabelMe toolbox (Russell et al., 2008), ensur-
ing precise and comprehensive pixel-level labeling.
4.1 Dataset Statistics
The IPN-Finger dataset exhibits a considerable class
imbalance in terms of the frequency of class appear-
ances across images and the pixel area occupied by
each class, as shown in Table 1. The table also high-
lights the pixel area covered by each class, given a
standard image size of 640x480 pixels, where the area
percentage reflects the proportion of the image that a
class occupies. Unsurprisingly, the ’person’ class is the most prevalent, appear-
ing in all 1000 images and covering 26.45% of the
image area on average. Conversely, the classes representing individual fingers occupy significantly less space, with most covering less than 1% of the image area. Notably, the left pinky class (’l pinky’) is the least represented in terms of area, averaging an area of approximately 29×29 pixels.
Table 1: Distribution of classes across the dataset.
Class       Images   Area (%)   Area (pix.)
person       1000     26.45       285²
l thumb        61      0.34        32²
l palm        107      1.30        63²
l index        77      0.34        32²
l medium       90      0.39        35²
l ring         97      0.37        34²
l pinky        96      0.28        29²
r thumb       729      0.55        41²
r palm        853      1.59        70²
r index       840      0.60        43²
r medium      851      0.66        45²
This imbalance extends to the visibility of hands
within the images: the left hand is notably less
present, appearing in fewer than 110 images. Such
disparities underscore the challenges of finger seg-
mentation, specifically in training robust models ca-
pable of accurately segmenting classes of varying
pixel representations and frequencies. The stark con-
trast between the dataset’s most and least represented
classes accentuates the necessity of a model like Fin-
gerSeg, which is adept at handling the intricacies of
finger-level segmentation within this uneven scenario.
5 EXPERIMENTAL RESULTS
5.1 Datasets
To assess the performance of our FingerSeg architec-
ture and compare it with state-of-the-art methods, we
evaluated two datasets: our IPN-Finger and the pub-
licly available RHD dataset (Zimmermann and Brox,
2017).
For the IPN-Finger dataset, we randomly chose
a fixed subset of 850 images for training and 150
for testing. To address the challenge of class repre-
sentation imbalance across images, we expanded the
dataset offline by mirroring each image, effectively
doubling the number of images for training and test-
ing to 1700 and 300, respectively. The RHD dataset’s
standard data split was employed, comprising 41,258
images for training and 2,728 for testing. It is impor-
tant to note that the RHD dataset’s original annota-
tions span 31 classes, reflecting individual finger pha-
langes. We consolidated these annotations into the
same 13 classes defined for our evaluation.
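The offline mirroring used to expand IPN-Finger can be reproduced with a few lines of NumPy. Because the classes are left/right-specific, mirroring an image should also swap the corresponding left and right label ids; this swap and the id layout below are illustrative assumptions, as the label encoding is not detailed here.

import numpy as np

# Hypothetical id layout: 0 = background, 1 = person, 2-7 = left palm/fingers,
# 8-13 = right palm/fingers (the actual IPN-Finger encoding may differ).
LEFT_IDS = np.arange(2, 8)
RIGHT_IDS = np.arange(8, 14)

def mirror_sample(image: np.ndarray, mask: np.ndarray):
    # Horizontally flip an (H, W, 3) image and its (H, W) label mask, swapping
    # left and right class ids so that mirrored right hands become left hands.
    image_m = image[:, ::-1].copy()
    mask_m = mask[:, ::-1].copy()
    swapped = mask_m.copy()
    for l_id, r_id in zip(LEFT_IDS, RIGHT_IDS):
        swapped[mask_m == l_id] = r_id
        swapped[mask_m == r_id] = l_id
    return image_m, swapped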
5.2 Implementation Details
All experiments were conducted using Python 3.7.16
and PyTorch 1.10.2, with CUDA 12.0 acceleration,
on an Intel Core i7-9700K desktop paired with an
Nvidia RTX 2080Ti GPU. To ensure a fair compar-
ison among all evaluated models, we standardized
the training settings across the board. The Stochas-
tic Gradient Descent (SGD) algorithm was employed
as the optimizer for all models, in conjunction with a
cross-entropy loss function as suggested by the online
bootstrapping strategy (Wu et al., 2016).
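A sketch of such a bootstrapped cross-entropy, in the spirit of (Wu et al., 2016), is shown below; the probability threshold and the minimum number of kept pixels are illustrative values rather than the settings used in our experiments.

import torch
import torch.nn.functional as F

def bootstrapped_ce(logits, target, min_kept=100000, thresh=0.7, ignore_index=255):
    # Per-pixel cross-entropy; only the hardest pixels contribute to the loss.
    pixel_loss = F.cross_entropy(logits, target, ignore_index=ignore_index,
                                 reduction='none').flatten()
    # A pixel is "hard" if the probability of its true class is below `thresh`.
    hard = pixel_loss > -torch.log(torch.tensor(thresh, device=logits.device))
    if hard.sum() < min_kept:
        # Fall back to the top-k hardest pixels to keep the batch informative.
        pixel_loss, _ = pixel_loss.topk(min(min_kept, pixel_loss.numel()))
    else:
        pixel_loss = pixel_loss[hard]
    return pixel_loss.mean()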
Data augmentation techniques were uniformly ap-
plied to each dataset, including random horizon-
tal flips, random scaling, and random cropping to
480x480 size. Before augmentation, images from
both datasets were upscaled to a uniform resolution
of 640x640 pixels. However, for testing, images were
evaluated at their native 640x480 resolution without
cropping. Each model was trained with a batch size of
32 images. Specifically for the RHD dataset, models
were trained from scratch over 60,000 iterations. In
contrast, for the IPN-Finger dataset, we fine-tuned the
models pre-trained on RHD for an additional 35,000
iterations.
5.3 Ablation Study
The efficacy of segmentation models is traditionally
measured by the mean intersection-over-union accu-
racy (mIoU). Alongside mIoU, we also provide in-
sights into the model complexity by reporting the
number of parameters and computational cost mea-
sured in GFLOPs.
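For completeness, IoU and mIoU follow the standard definitions, with C the number of evaluated classes and TP_c, FP_c, and FN_c the true-positive, false-positive, and false-negative pixel counts of class c accumulated over the test set:

\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad \mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c.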
This ablation study dissects the incremental con-
tributions of the proposed modules integrated into the
DDRNet architecture. Our baseline is the DDRNet23
slim variant (Pan et al., 2022), which comprises 5.73
million parameters with a computational cost of 5.55
GFLOPs. We systematically enhance this baseline
by sequentially incorporating our proposed modules.
The results of this study are shown in Table 2.
The addition of each module yields a consistent improvement in mIoU, as illustrated in Ta-
ble 2. The full implementation of our FingerSeg
model, which includes all three modules, shows a
mIoU of 73.79, outperforming the baseline by a sig-
nificant margin. It is noteworthy that FingerSeg
achieves these results with only a moderate increase
in parameters and GFLOPs.
Table 2: Ablation study showing the enhancements of the proposed modules compared with the baseline (DDRNet23 slim), including the complex DDRNet23 full for reference. Results on the IPN-Finger dataset.
Method                      Params    GFLOPs   mIoU
DDRNet23 slim                5.73M      5.55    70.70
+EBF                         5.84M      5.56    71.21
+EBF+MAM                     6.15M      6.31    72.52
+EBF+MAM+ADU (FingerSeg)     6.20M      7.30    73.79
DDRNet23 full               20.30M     21.79    74.37
We also compare FingerSeg against the more
complex DDRNet23 full model for a broader perspec-
tive. Although the full model has roughly triple the parameters and computational complexity of FingerSeg,
the improvement in mIoU is marginal. Specifically,
the DDRNet23 full model registers a mIoU of 74.37,
a modest increment over FingerSeg’s 73.79 mIoU.
FingerSeg’s design strategy demonstrates that strate-
gic module enhancements can yield near-comparable
accuracy while significantly reducing the computa-
tional burden of more complex models. This under-
scores the effectiveness of each integrated module in
FingerSeg and emphasizes the model’s capacity to
achieve high-level accuracy with a more efficient use
of resources.
5.4 Per-Class Analysis
The task of finger segmentation poses varying de-
grees of difficulty across different classes, mainly due
to the size and frequency of occurrence of each fin-
ger within the dataset. Our analysis, detailed in Ta-
ble 3, suggests that the smaller fingers, particularly
the left pinky—the least represented class—pose the
most significant challenge.
When examining the performance metrics, it is ev-
ident that our FingerSeg model outperforms the base-
line DDRNet23 slim, especially in the classes where
size and representation pose a challenge. Notably,
FingerSeg achieves a significant improvement in the
segmentation of the left palm, with an increase in IoU from 68.6 to 73.1, indicating a notable enhancement in distinguishing this particular re-
gion compared to the baseline. On the other hand,
the left pinky, being the smallest and least represented
class, shows only a marginal increase over the baseline per-
formance. Another point of interest is the anoma-
lously low accuracy for the right palm in DDRNet’s
results, which FingerSeg effectively addresses, im-
proving accuracy by over 20 percentage points.
Table 3: FingerSeg per-class mIoU comparison with the baseline on the IPN-Finger dataset.
Method l thu l pal l ind l med l rin l pin r thu r pal r ind r med r rin r pin
DDRNet 69.8 68.6 67.6 71.2 68.3 67.4 69.4 53.7 68.2 70.4 67.3 67.9
FingerSeg 69.9 73.1 71.2 69.4 67.8 67.5 69.8 75.6 70.8 71.4 67.7 68.3
Table 4: Comparative analysis of FingerSeg and state-of-the-art semantic segmentation methods on the IPN and RHD datasets.
Method Params (M) GFLOPs mIoU (IPN) mIoU (RHD)
FastSCNN (Poudel et al., 2019) 1.134 1.03 64.66 55.05
DABNet (Li and Kim, 2019) 0.755 6.12 64.26 61.22
FC-HarDNet (Chao et al., 2019) 4.119 5.19 68.93 68.59
FASSDNet (Rosas-Arias et al., 2021) 2.845 6.60 69.58 68.50
DDRNet23 slim (Pan et al., 2022) 5.734 5.55 70.70 69.03
PIDNet (Xu et al., 2023) 7.625 7.20 71.98 70.91
FingerSeg 6.196 7.30 73.79 71.15
Figure 7: Qualitative comparison of DDRNet (baseline) vs.
FingerSeg.
Figure 7 offers visual insights into the performance improvements by illustrating qualitative comparisons between FingerSeg and the baseline DDRNet on test images from the IPN-Finger dataset. The third row, in particular, shows that DDRNet erroneously classifies skin regions as part of the right and left palms, a significant error that underscores difficulties in discerning hand anatomy. Moreover, the baseline occasionally misclassifies objects and background elements as hands, suggesting limitations in its dual-
resolution architecture when learning fine anatomi-
cal structures. In stark contrast, FingerSeg exhibits
no such classification errors, reinforcing the notion
that our integrated modules considerably enhance the
model’s ability to segment and differentiate hand and
finger regions accurately.
In general, FingerSeg consistently maintains or
improves upon the baseline accuracies, affirming the
effectiveness of the integrated modules tailored to ad-
dress the intricacies of finger segmentation. This
per-class improvement demonstrates that FingerSeg
is well-suited for the detailed task at hand, capable
of precisely discerning between closely situated and
similarly sized classes.
5.5 Comparisons with State-of-the-Art
Approaches
The comparative analysis shown in Table 4 showcases
the performance of FingerSeg against several leading
semantic segmentation models: FastSCNN (Poudel
et al., 2019), DABNet (Li and Kim, 2019), FC-
HardNet (Chao et al., 2019), FASSDNet (Rosas-
Arias et al., 2021), DDRNet (Pan et al., 2022),
and PIDNet (Xu et al., 2023). Notably, FingerSeg
achieves the highest mIoU scores on both the IPN
and RHD datasets, reaching 73.79 and 71.15, respec-
tively. This marks a substantial improvement over
other approaches while maintaining competitive com-
putational efficiency.
FastSCNN, while being the most computationally
lightweight model with only 1.03 GFLOPs, falls short
in mIoU performance. DABNet offers the lowest pa-
rameter count at 0.755 million, yet its mIoU scores do
not compete with FingerSeg, emphasizing the latter’s
superior balance of model complexity and segmenta-
tion capability. FC-HarDNet and FASSDNet present
themselves as intermediate options in terms of param-
eters and GFLOPs. While these models offer compet-
itive mIoU scores, particularly FC-HarDNet’s perfor-
mance on the RHD dataset, they still do not reach the
benchmark set by FingerSeg.
Our baseline model, DDRNet, demonstrates ro-
bust performance with a mIoU of 70.70 on IPN
and 69.03 on RHD. However, PIDNet, one of the
most recent models, stands out with high mIoU
scores of 71.98 and 70.91 on the IPN and RHD
datasets, respectively. Yet, FingerSeg outperforms PIDNet in accuracy while requiring fewer parameters.
The GFLOPs of FingerSeg and PIDNet are closely
matched, underscoring FingerSeg’s architectural opti-
mizations that allow for high accuracy without a sub-
stantial increase in computational demand.
In summary, FingerSeg sets new standards in seg-
mentation accuracy and exhibits a notable balance
between computational requirements and model com-
plexity. This performance is particularly important
given the fine-grained nature of the finger segmenta-
tion task, proving the worth of FingerSeg’s design.
6 CONCLUSIONS AND FUTURE
WORK
In this paper, we introduced FingerSeg as an advanced
solution for finger-level hand segmentation. Through
meticulous design and the integration of specialized
modules (EBF, MAM, and ADU), FingerSeg has
demonstrated a significant leap forward in the accu-
racy and efficiency of semantic segmentation for nu-
anced hand gestures. The empirical results, bolstered
by thorough ablation studies and comparisons with
state-of-the-art methods, affirm FingerSeg’s standing
as a leading solution to the presented task. More-
over, the creation and annotation of the IPN-Finger
dataset have not only facilitated the development of
FingerSeg but also enriched the resources available to
the research community. By offering this dataset pub-
licly, alongside the FingerSeg model, we anticipate
stimulating further innovation and exploration in the
detailed segmentation of hands and fingers.
Looking ahead, the integration of FingerSeg into
multimodal hand gesture recognition (HGR) systems
presents promising future work. Its application as an
additional modality can potentially enrich the inter-
pretative capabilities of HGR, particularly in complex
or nuanced scenarios. Exploring the synergy between
FingerSeg’s detailed segmentation and other modali-
ties will be instrumental in developing more intuitive
and natural user interfaces, contributing significantly
to advancements in human-computer interaction.
ACKNOWLEDGEMENTS
This work is supported by a Research Grant (S) from the Tateisi Science and Technology Foundation.
REFERENCES
Baek, S., Kim, K. I., and Kim, T.-K. (2019). Pushing the
envelope for rgb-based dense 3d hand pose estimation
via neural rendering. In The IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1067–1076.
Bambach, S., Lee, S., Crandall, D. J., and Yu, C. (2015).
Lending a hand: Detecting hands and recognizing ac-
tivities in complex egocentric interactions. In The
IEEE International Conference on Computer Vision
(ICCV), pages 1949–1957.
Bandini, A. and Zariffa, J. (2020). Analysis of the hands
in egocentric vision: A survey. IEEE transactions on
pattern analysis and machine intelligence.
Benitez-Garcia, G., Olivares-Mercado, J., Sanchez-Perez,
G., and Yanai, K. (2021a). IPN Hand: A video dataset
and benchmark for real-time continuous hand gesture
recognition. In 2020 25th international conference on
pattern recognition (ICPR), pages 4340–4347. IEEE.
Benitez-Garcia, G., Prudente-Tixteco, L., Castro-Madrid,
L. C., Toscano-Medina, R., Olivares-Mercado, J.,
Sanchez-Perez, G., and Villalba, L. J. G. (2021b). Im-
proving real-time hand gesture recognition with se-
mantic segmentation. Sensors, 21(2):356.
Cai, M., Lu, F., and Sato, Y. (2020). Generalizing hand
segmentation in egocentric videos with uncertainty-
guided model adaptation. In The IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 14392–14401.
Chao, P., Kao, C.-Y., Ruan, Y.-S., Huang, C.-H., and Lin,
Y.-L. (2019). HarDNet: A Low Memory Traffic Net-
work. In The IEEE International Conference on Com-
puter Vision (ICCV).
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and
Adam, H. (2018). Encoder-decoder with atrous sep-
arable convolution for semantic image segmentation.
In The European Conference on Computer Vision
(ECCV), pages 801–818.
Dadashzadeh, A., Targhi, A. T., Tahmasbi, M., and Mirme-
hdi, M. (2019). HGR-Net: A fusion network for hand
gesture segmentation and recognition. IET Computer
Vision, 13(8):700–707.
He, A., Li, T., Li, N., Wang, K., and Fu, H. (2020). CABNet: Category attention block for imbalanced diabetic
retinopathy grading. IEEE Transactions on Medical
Imaging, 40(1):143–153.
Kim, S., Chi, H.-g., Hu, X., Vegesana, A., and Ramani, K.
(2020). First-person view hand segmentation of multi-
modal hand activity video dataset. In British Machine
Vision Conference (BMVC).
Li, G. and Kim, J. (2019). DABNet: Depth-wise Asymmet-
ric Bottleneck for Real-time Semantic Segmentation.
In British Machine Vision Conference (BMVC).
Li, M., Sun, L., and Huo, Q. (2019). Flow-guided feature
propagation with occlusion aware detail enhancement
for hand segmentation in egocentric videos. Computer
Vision and Image Understanding, 187:102785.
Likitlersuang, J., Sumitro, E. R., Cao, T., Visée, R. J., Kalsi-
Ryan, S., and Zariffa, J. (2019). Egocentric video: a
new tool for capturing hand use of individuals with
spinal cord injury at home. Journal of neuroengineer-
ing and rehabilitation, 16(1):1–11.
Pan, H., Hong, Y., Sun, W., and Jia, Y. (2022). Deep dual-
resolution networks for real-time and accurate seman-
tic segmentation of traffic scenes. IEEE Transactions
on Intelligent Transportation Systems, 24(3):3448–
3460.
Poudel, R. P., Liwicki, S., and Cipolla, R. (2019). Fast-
scnn: fast semantic segmentation network. In British
Machine Vision Conference (BMVC).
Rangesh, A. and Trivedi, M. M. (2018). Handynet: A one-
stop solution to detect, segment, localize & analyze
driver hands. In The IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW),
pages 1103–1110.
Richards, F., Xie, X., Paiement, A., Sola, E., and Duc, P.-A.
(2022). Multi-scale gridded gabor attention for cirrus
segmentation. In 2022 IEEE International Conference
on Image Processing (ICIP), pages 3733–3737. IEEE.
Rosas-Arias, L., Benitez-Garcia, G., Portillo-Portillo, J.,
Olivares-Mercado, J., Sanchez-Perez, G., and Yanai,
K. (2021). FASSD-Net: Fast and accurate real-time se-
mantic segmentation for embedded systems. IEEE
Transactions on Intelligent Transportation Systems,
23(9):14349–14360.
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman,
W. T. (2008). Labelme: a database and web-based
tool for image annotation. International journal of
computer vision, 77(1-3):157–173.
Urooj, A. and Borji, A. (2018). Analysis of hand segmenta-
tion in the wild. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 4710–
4719.
Woo, S., Park, J., Lee, J.-Y., and Kweon, I. S. (2018). Cbam:
Convolutional block attention module. In Proceed-
ings of the European conference on computer vision
(ECCV), pages 3–19.
Wu, Z., Shen, C., and Hengel, A. v. d. (2016).
High-performance semantic segmentation using very
deep fully convolutional networks. arXiv preprint
arXiv:1604.04339.
Xu, J., Xiong, Z., and Bhattacharyya, S. P. (2023). PIDNet:
A real-time semantic segmentation network inspired
by pid controllers. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 19529–19539.
Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyra-
mid Scene Parsing Network. In The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Zimmermann, C. and Brox, T. (2017). Learning to estimate
3d hand pose from single rgb images. In The IEEE
International Conference on Computer Vision (ICCV),
pages 4903–4911.