ALiSNet: Accurate and Lightweight Human Segmentation Network for Fashion E-Commerce

Amrollah Seifoddini¹ (https://orcid.org/0000-0002-6757-3175), Koen Vernooij¹, Timon Künzle¹, Alessandro Canopoli¹, Malte Alf¹, Anna Volokitin¹ and Reza Shirvany²

¹ Zalando SE, Switzerland
² Zalando SE, Germany

Keywords:
On-Device, Human-Segmentation, Privacy-Preserving, Fashion, e-Commerce.
Abstract:
Accurately estimating human body shape from photos can enable innovative applications in fashion, from
mass customization, to size and fit recommendations and virtual try-on. Body silhouettes calculated from
user pictures are effective representations of the body shape for downstream tasks. Smartphones provide a
convenient way for users to capture images of their body, and on-device image processing allows predicting
body segmentation while protecting users’ privacy. Existing off-the-shelf methods for human segmentation
are closed source and cannot be specialized for our application of body shape and measurement estimation.
Therefore, we create a new segmentation model by simplifying Semantic FPN with PointRend, an existing
accurate model. We finetune this model on a high-quality dataset of humans in a restricted set of poses
relevant for our application. We obtain our final model, ALiSNet, with a size of 4MB and 97.6 ± 1.0% mIoU,
compared to Apple Person Segmentation, which has an accuracy of 94.4 ± 5.7% mIoU on our dataset.
1 INTRODUCTION
Human segmentation has emerged as foundational to
applications ranging from autonomous driving to
social media, virtual and augmented reality, and
online fashion. In the case of online fash-
ion, giving users a way to easily capture their body
shape is valuable, since it can be used to recommend
appropriate clothing sizes or to enable virtual try-on.
However, to determine the right size and fit of cloth-
ing, the body shape needs to be determined with very
high accuracy in order to be of value. For example,
in an image of 2k resolution in height, a segmenta-
tion error of two pixels on the boundary can change a
measurement such as chest circumference by 10mm.
Users’ body shape can be more accurately determined
if they wear tight-fitting clothes, making it even more
important than in other applications to preserve pri-
vacy. Hence, mobile human segmentation is a good
fit for fashion applications, as images can be both cap-
tured and processed on-device.
In this paper we propose an approach to achieve an
accurate and lightweight human segmentation method
for these applications. Although off-the-shelf mobile human segmentation methods are available, such as Apple Person Segmentation (Apple, ) and Google MLKit's BlazePose (Bazarevsky et al., 2020), these methods are closed source and cannot be adapted to our task to achieve the required accuracy. Instead, we design a model based on Semantic FPN with PointRend for our task.

Figure 1: Ground truth body annotations. The boundary in particular is critical for body shape prediction.
Crucial to the success of our method is finetun-
ing on a task specific dataset of user-taken photos in
front and side views, as shown in Figure 1. Orthogonal views such as these are commonly used in various anthropometry setups, e.g. (Smith et al., 2019). Such
silhouettes can be used to model the 3D body shape of
users, as proposed in (Dibra et al., 2016; Dibra et al.,
2017; Smith et al., 2019) or to directly predict mea-
surements (Yan et al., 2021) for fashion applications.
While relying on the large body of publicly available
data for the segmentation task, we augment it by us-
ing a small yet specific dataset of 6147 high resolution
images with highly accurate annotations to overcome
the limitations of publicly available data.
Our main contributions are thus two-fold: First,
we demonstrate that a relatively small set of high-
quality annotations can boost segmentation accuracy.
Second, we simplify a large and high-quality baseline method, Semantic FPN (Kirillov et al., 2019) with PointRend refinement (Kirillov et al., 2020), in a few steps to achieve almost the same performance at roughly 1/100 of the model size. The main changes
to the original model are: exchanging the back-
bone with a modified version of the mobile-optimized
MnasNet (Tan et al., 2019), using quantization-aware
training, and removing network components that we
found not to be contributing to segmentation accu-
racy. Our final Accurate and Lightweight mobile human Segmentation Network (ALiSNet) achieves 97.6% mIoU and is 4MB in size, whereas an off-the-shelf method such as BlazePose-Segmentation achieves 93.7% mIoU on our data with a 6MB model, and Apple Person Segmentation achieves 94.4% mIoU. It was not possible to fine-tune either of these models on our data as they are closed source. Additionally, ALiSNet's accuracy is only marginally lower than the 97.8% mIoU achieved by the 350 MB baseline.
2 RELATED WORK
The categories of methods most relevant to our work
in the domain of on-device human segmentation are
portrait editing, video call background effects, and
general-purpose real-time whole-body segmentation
methods.
Many portrait editing methods predict alpha mattes, which
are masks that allow blending foreground and back-
ground regions. In this application, having accu-
rate segmentation of textures such as wisps of hair
is very important. Google Pixel’s alpha matting
method (Orts-Escolano and Ehman, 2022) relies on
data collected using a custom volumetric lighting
setup. Apple Person Segmentation (Apple, ) in accu-
rate mode also belongs to this category of methods.
However, such accuracy on the texture level is not
necessary for our application. Besides, most existing
alpha matting methods are trained only on faces.
Real-time portrait segmentation methods for
video calls focus on segmenting the human upper
body. ExtremeC3Net (Park et al., 2019) and SiNet (Li
et al., 2020) are examples of models that achieve very
good performance under a parameter count of 200K.
There are also several methods focused on real-
time segmentation of the whole body. One example is
Google MLKit BlazePose-Segmentation (Bazarevsky et al., 2020), which relies on correct prediction of the body bounding box. (Strohmayer et al., 2021) focuses on reducing latency for general-purpose human segmentation. (Liang et al., 2022) introduces Multi-domain TriSeNet Networks for real-time single-person segmentation in photo-editing applications. (Xi et al., 2019) uses a saliency map derived from accurate pose information to improve segmentation accuracy, especially in multi-person scenes.
(Han et al., 2020) categorizes the set of techniques
for reducing model size into model compression and
compact model design methods. Although we make
use of quantization-aware training (Wu et al., 2015)
in this paper, which is a compression technique, we
mostly take advantage of compact model compo-
nents. These include compact networks that can be
used as feature extractors, such as MnasNet (Tan
et al., 2019), FBNet (Wu et al., 2019a) and Mo-
bileNetv3 (Howard et al., 2019) which have been
found using Neural Architecture Search.
We recommend the related work section of
(Knapp, 2021) for a more extensive review of works
related to mobile person segmentation.
Finally, our work can be used in downstream ap-
plications for estimating body shape. This is an ac-
tive research area, with several approaches of estimat-
ing body shape from silhouette, such as (Song et al.,
2018; Song et al., 2016; Ji et al., 2019; Dibra et al.,
2016).
Datasets. We review several datasets with permis-
sive licenses that include person segmentation labels.
These include MS COCO (Lin et al., 2014), shown in Figure 2, LVIS (Gupta et al., 2019), which contains higher-quality annotations for COCO images, and Google Open Images (Kuznetsova et al., 2018). Each of these datasets has limitations. COCO annotations are based on polygons and are therefore not accurate around object boundaries. LVIS annotations are
very accurate and dense in each image, but they are not yet available for the entire COCO dataset; in the human category, only around 1.8k images are annotated. Finally, the Google Open Images dataset is sparse in segmentation coverage in each image compared to COCO, as many object instances are not yet segmented.

Figure 2: An example image and annotation from the COCO dataset. Note the low accuracy of the polygon annotation.

Figure 3: ALiSNet architecture. The first stage of our method is the feature extractor backbone, which produces features at 5 resolution levels. These features are convolved and added together to form the coarse features. The coarse features are projected to a coarse segmentation using a 1 × 1 convolution. Uncertain points on the coarse segmentation mask are selected and refined using PointRend. During a PointRend refinement step, coarse predictions and fine-grained features from these locations are concatenated and fed to an MLP to obtain refined segmentation predictions. This refinement is repeated two times. In the diagram, ConvBlock is a conv, batch-norm, ReLU sequence and 2up is a 2× bilinear upsampling.
3 METHOD
3.1 Model
Our model, ALiSNet, is a version of Semantic FPN
with PointRend (Kirillov et al., 2020), simplified
for on-device use. We chose Semantic FPN with
PointRend as a baseline because of its high accu-
racy in segmenting object boundaries. In theory, other
baseline methods could also be used to show the ef-
fectiveness of our approach.
Semantic FPN with PointRend. Semantic FPN
first extracts features using a backbone and further processes them with a Feature Pyramid Network
(FPN) (Lin et al., 2017a). Then, a coarse segmen-
tation map is computed from the aggregated coarse
features. PointRend then samples uncertain points
on the coarse segmentation and concatenates fine-
grained features from the FPN with the coarse pre-
dictions at each location and uses this as an input to a
classifier to refine the prediction at this location.
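To make the refinement step concrete, below is a minimal sketch of a PointRend-style refinement step in PyTorch. The tensor names (coarse_logits, fine_features) and the point_head module are illustrative assumptions; the reference implementation in detectron2 differs in several details (e.g. mixing random and uncertain points during training).

```python
import torch
import torch.nn.functional as F

def refine_with_points(coarse_logits, fine_features, point_head, num_points=1024):
    """One PointRend-style refinement step (simplified sketch).

    coarse_logits: (B, C, H, W) coarse segmentation logits
    fine_features: (B, F, Hf, Wf) fine-grained backbone features
    point_head:    a per-point MLP, e.g. Conv1d layers mapping (C + F) -> C channels
    """
    B, C, H, W = coarse_logits.shape
    num_points = min(num_points, H * W)

    # 1. Uncertainty: points where the two most likely classes have similar scores.
    probs = coarse_logits.softmax(dim=1)
    top2 = probs.topk(2, dim=1).values                 # (B, 2, H, W)
    uncertainty = -(top2[:, 0] - top2[:, 1])           # higher = more uncertain
    _, idx = uncertainty.reshape(B, -1).topk(num_points, dim=1)

    # 2. Convert flat indices to normalized (x, y) coordinates in [-1, 1].
    ys = torch.div(idx, W, rounding_mode="floor").float() / (H - 1) * 2 - 1
    xs = (idx % W).float() / (W - 1) * 2 - 1
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(2)  # (B, P, 1, 2)

    # 3. Sample coarse predictions and fine-grained features at those points.
    coarse_pts = F.grid_sample(coarse_logits, grid, align_corners=True).squeeze(-1)  # (B, C, P)
    fine_pts = F.grid_sample(fine_features, grid, align_corners=True).squeeze(-1)    # (B, F, P)

    # 4. Concatenate and re-classify each point with the MLP head.
    refined = point_head(torch.cat([coarse_pts, fine_pts], dim=1))                   # (B, C, P)

    # 5. Write the refined logits back into the coarse prediction map.
    out = coarse_logits.reshape(B, C, -1).clone()
    out.scatter_(2, idx.unsqueeze(1).expand(-1, C, -1), refined)
    return out.view(B, C, H, W)
```

In the full model, this step is repeated (with upsampling of the prediction in between), as described in Figure 3.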
Changes to Baseline for On-Device Use. In this
work, we use three approaches to address the prob-
lem of reducing the model size while preserving seg-
mentation accuracy. First, we take advantage of a
mobile feature extraction backbone to replace the
ResNeXt101 (Xie et al., 2017) feature extractor in
our baseline. Second, we quantize our model using
quantization-aware training, allowing us to replace
32bit floating point parameters with equivalent int8
representations. We choose to use quantization aware
training as opposed to post-training quantization as it
is known to lead to more accurate results. Third, we
replace the feature pyramid network used in the orig-
inal model with a simpler aggregation step, skipping
the FPN top-down path. The final ALiSNet architec-
ture is shown in Figure 3.
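As an illustration of the quantization-aware training step, the following sketch shows how a PyTorch model could be prepared for QAT with the QNNPACK configuration using PyTorch's eager-mode quantization API; it is a simplified example rather than our exact training setup (module fusion, for instance, is omitted).

```python
import torch

def prepare_for_qat(model: torch.nn.Module) -> torch.nn.Module:
    """Minimal quantization-aware-training setup (sketch, eager-mode API)."""
    model.train()
    # QNNPACK configuration, targeting ARM CPUs on mobile devices.
    model.qconfig = torch.quantization.get_default_qat_qconfig("qnnpack")
    # Insert fake-quantization observers; training then proceeds as usual.
    torch.quantization.prepare_qat(model, inplace=True)
    return model

def convert_to_int8(model: torch.nn.Module) -> torch.nn.Module:
    """After QAT finishes, fold the observers into real int8 modules."""
    model.eval()
    return torch.quantization.convert(model)
```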
Training Loss. Our training loss includes the seg-
mentation loss between the predicted and ground-
truth labels, and the PointRend loss. For the segmen-
tation loss, we experimented with cross-entropy and
focal loss (Lin et al., 2017b), and found cross-entropy
to be more stable during training. We therefore
used cross-entropy for all further training runs. The PointRend
loss is taken from the reference implementation of
PointRend in detectron2 and contains the sum of
cross-entropies between predicted and ground-truth
labels of all points refined during the refinement pro-
cess for each sample in the mini-batch.
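A rough sketch of how this objective can be assembled is shown below; point_logits_per_step and point_labels_per_step are assumed containers of the per-point predictions and labels collected at each refinement step, not names from the detectron2 implementation.

```python
import torch.nn.functional as F

def total_loss(seg_logits, seg_labels, point_logits_per_step, point_labels_per_step):
    """Sketch of the training objective: segmentation cross-entropy plus the
    sum of point-wise cross-entropies over all PointRend refinement steps."""
    loss = F.cross_entropy(seg_logits, seg_labels)
    for logits, labels in zip(point_logits_per_step, point_labels_per_step):
        loss = loss + F.cross_entropy(logits, labels)
    return loss
```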
3.2 Data
An important element in making our approach suc-
cessful is to pre-train on a large-scale coarsely anno-
tated dataset and fine-tune on a high-quality specific
dataset.
Pretraining on COCO. Following other segmenta-
tion methods (Kirillov et al., 2020) we base our work
on MS COCO. Out of all images in COCO, we only
use those containing at least one person (around 60k
images). The COCO default annotation format is de-
signed for instance segmentation. Thus, we merge the
segmentation masks of human instances in each im-
age to create corresponding segmentation masks for
semantic segmentation.
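This merging step can be sketched with pycocotools roughly as follows, assuming the standard COCO instance annotation files; it is an illustration rather than our exact preprocessing script.

```python
import numpy as np
from pycocotools.coco import COCO

def person_semantic_mask(coco: COCO, img_id: int) -> np.ndarray:
    """Union of all 'person' instance masks of one image into a binary semantic mask."""
    person_cat = coco.getCatIds(catNms=["person"])
    ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=person_cat, iscrowd=None)
    img = coco.loadImgs([img_id])[0]
    mask = np.zeros((img["height"], img["width"]), dtype=np.uint8)
    for ann in coco.loadAnns(ann_ids):
        mask |= coco.annToMask(ann)  # per-instance binary mask
    return mask

# Usage sketch:
# coco = COCO("annotations/instances_train2017.json")
# for img_id in coco.getImgIds(catIds=coco.getCatIds(catNms=["person"])):
#     mask = person_semantic_mask(coco, img_id)
```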
Small Scale In-House Dataset. We observe that
even though object instance coverage in COCO is
very high, there are disadvantages when using only
that data for our task: Segmentation annotations are
not pixel accurate since the objects are annotated us-
ing polygons. The scale of objects can be extremely
diverse, ranging from objects of 10 pixels in height
to objects covering almost the entire image. The di-
versity of human poses and occlusions is extreme and
most images contain only a small body part or crowds
of people. This is helpful for general segmentation
tasks but our experiments show that it limits the ac-
curacy of on-device models in more controlled tasks such as body shape estimation, where the pose, viewing angle and scale of the body do not vary much.
To overcome these limitations of large-scale gen-
eral purpose datasets, we make use of a small high-
quality dataset focused on enabling accurate human
body segmentation for our task. To build this com-
plementary dataset, we use an in-house mobile app
with interactive video features to guide the users to
stand in the correct front and side view poses at the
right distance to be fully in the frame. The app was
made available for both iOS and Android, powered
by real-time pose estimation models native to the OS.
Images are taken in portrait (vertical) orientation. The calculated pose keypoints are available for the captured images and can also be used in downstream tasks. To protect participants' privacy, the images were cropped to the bounding box containing the person. The bounding boxes are calculated from the pose keypoints predicted in the app and enlarged by a 10% margin on each side. This dataset
includes 6147 images which are randomly distributed
in train/validation/test splits with 60/20/20% ratios re-
spectively. The number of front and side view im-
ages is almost balanced, and the ratio of male/female
participants is 45/55%. The annotation is performed
by an expert annotation team, and the quality of seg-
mentation masks was checked by two quality expert
groups. An example of this data is shown in Figure 4.
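For illustration, the privacy-preserving crop described above could be computed from the pose keypoints along these lines (a sketch with an assumed (K, 2) keypoint array, not the app's actual code):

```python
import numpy as np

def crop_box_from_keypoints(keypoints, img_w, img_h, margin=0.10):
    """Bounding box around pose keypoints, enlarged by a relative margin on each side.

    keypoints: array-like of shape (K, 2) with (x, y) pixel coordinates.
    """
    kp = np.asarray(keypoints, dtype=float)
    x0, y0 = kp.min(axis=0)
    x1, y1 = kp.max(axis=0)
    dx, dy = margin * (x1 - x0), margin * (y1 - y0)
    x0, y0 = max(0.0, x0 - dx), max(0.0, y0 - dy)
    x1, y1 = min(float(img_w), x1 + dx), min(float(img_h), y1 + dy)
    return int(x0), int(y0), int(x1), int(y1)
```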
4 EXPERIMENTAL SETUP
4.1 Implementation Details
For this paper, we used the detectron2 framework (Wu
et al., 2019b) built on top of PyTorch, which includes
the reference implementation of PointRend.
On the mobile side, the model is executed by the
PyTorch Mobile interpreter to facilitate the deploy-
ment of developed models.

Figure 4: Left: Examples of front and side view images in the target poses. Ground-truth annotation masks are overlaid in green on the pictures.

To that end, the model is first converted by the TorchScript compiler, and the resulting model graph is loaded by the mobile interpreter. Because the iterative PointRend head contains control flow that depends on the input, we have to use scripting instead of tracing to generate the computation graph. Scripted models are not fully optimized
for runtime, so their performance is sometimes lower than that of traced models. For the quantization of the
model we use the QNNPACK (Dukhan et al., 2020)
backend which is optimized for ARM CPUs available
in mobile devices.
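A minimal export path along these lines is sketched below; the function and file names are illustrative, and the exact flow we used may differ in details.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

def export_for_mobile(model: torch.nn.Module, path: str = "alisnet.ptl") -> None:
    """Script and save a (quantized) model for the PyTorch Mobile lite interpreter (sketch)."""
    # Select the QNNPACK quantized backend (ARM CPUs on mobile devices).
    torch.backends.quantized.engine = "qnnpack"
    model.eval()
    # Scripting (not tracing) is required because the PointRend head contains
    # input-dependent control flow.
    scripted = torch.jit.script(model)
    scripted = optimize_for_mobile(scripted)
    scripted._save_for_lite_interpreter(path)
```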
4.2 Model Training
As described in subsection 3.2, we train our models on COCO and fine-tune them on our in-house dataset. We also experiment with CNN back-
bones that are pre-trained on ImageNet (Deng et al.,
2009) classification tasks. All experiments are done
in machines in Amazon AWS with 4 V100 GPUs
totalling 128 GB graphic memory. The mini-batch
size of training is set to 8 for large models (e.g.
ResNeXt101 backbone) and 16 for smaller backbones
(i.e. MnasNet, MobileNet, FBNet). The base learn-
ing rate is set to 0.01 for the case of batch-size = 8
and 0.02 for batch-size = 16, following the linear learning-rate scaling rule (Goyal et al., 2017). During training we augment the data
using default augmentation tools provided by detec-
tron2. This includes random resizing, horizontal flip,
color jitter and brightness and saturation change. The
target size for the shorter side of the image is chosen randomly from a predefined list (between 120 and 800), while keeping the longer side under 1024 and the scale factor between 0.5 and 4. This prevents
too much down-sampling of our high-resolution im-
ages during training.
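The augmentation pipeline can be expressed with detectron2's transform utilities roughly as follows; the exact list of short-edge sizes shown here is an illustrative assumption.

```python
from detectron2.data import transforms as T

# Sketch of the training-time augmentations (illustrative size list).
augmentations = [
    T.ResizeShortestEdge(
        short_edge_length=(120, 240, 480, 640, 800),  # sampled randomly per image
        max_size=1024,
        sample_style="choice",
    ),
    T.RandomFlip(horizontal=True),
    T.RandomBrightness(0.8, 1.2),
    T.RandomSaturation(0.8, 1.2),
]
```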
Table 1: Effect of each model change step on the accuracy and size of the model. mIoU values are reported as mean ± std in percent.

Model configuration | Size (MB) | mIoU
ResNeXt101 + FPN-SemSeg + PointRend | 351.7 | 97.8 ± 1.0
Replace ResNeXt101 with MnasNet-B1 | 52.0 | 97.7 ± 0.9
Quantization-Aware Training | 12.9 | 97.7 ± 0.9
Remove FPN top-down path (ALiSNet) | 4.0 | 97.6 ± 1.0
Google MLKit (BlazePose-Segmentation), accurate | 27.7 | 93.9 ± 5.3
Google MLKit (BlazePose-Segmentation), balanced | 6.4 | 93.7 ± 5.9
Apple Person Segmentation (accurate) | - | 94.3 ± 5.9
Apple Person Segmentation (balanced) | - | 94.4 ± 5.7

Table 2: Effect of fine-tuning on our dataset. First mIoU column: results after training on COCO only. Second mIoU column: results after fine-tuning on our dataset. (Q) indicates the model is quantized using quantization-aware training.

Model | Size (MB) | mIoU with COCO | mIoU with fine-tuning
ResNeXt101 + FPN-SemSeg + PointRend | 351.7 | 94.0 ± 3.8 | 97.8 ± 1.0
MobileNetV3 + FPN-SemSeg + PointRend | 35.1 | 91.2 ± 6.2 | 97.7 ± 1.1
(Q) MnasNet-B1 + SemSeg + PointRend (ALiSNet) | 4.0 | 90.0 ± 6.8 | 97.6 ± 0.9

For the fine-tuning step, as a standard practice we experimented with freezing the first N (N ≤ 2) stages of the backbone to improve generalization of the model and avoid over-fitting, but we have observed
that not freezing any layer can improve the general-
ization results in our case. We also experimented with reducing the learning rate by a factor of 10 for the fine-tuning step compared to the pre-training learning rate. However, we found that the model converged more quickly and to better results when we did not reduce the learning rate.
4.3 Evaluation
We evaluate our models with mean Intersection Over
Union (mIoU), which is defined in Equation 1. This metric is in the [0, 1] range and is then reported as a percentage:

$$\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\text{pred}_i \cap \text{GT}_i|}{|\text{pred}_i \cup \text{GT}_i|} \times 100 \qquad (1)$$
where pred and GT are the prediction and ground-
truth segmentation masks of sample i respectively.
During evaluation, images are resized to a height of 1024 after cropping them to the person bounding box.
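Equivalently, the per-image IoU and the resulting mIoU can be computed from binary masks as in this small sketch:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks, as used in Equation 1 (before averaging)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts) -> float:
    """mIoU over a dataset, reported as a percentage."""
    return 100.0 * float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```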
5 EXPERIMENTAL RESULTS
In this section, first we explore the effect of different
aspects of our model. Then we do a quantitative and
qualitative comparison of our method with two other
related person-segmentation methods.
5.1 Effect of Model Design Choices
Reduction of Size of Components. As shown in Table 1, starting from the baseline model (351.7MB), we first obtain a model size of 52.0MB by replacing the ResNeXt101 feature backbone with MnasNet, saving around 300 MB. Then we apply Quantization-Aware Training, which further shrinks the model size by a factor of 4, to 12.9MB, as we replace 32-bit floating-point parameters with int8 representations.
We then show that the top-down branch of the FPN, which combines high-level semantic features with low-level features, can be removed from the model with only a 0.1% reduction in accuracy. We argue that in our model, PointRend carries out the job of merging high-level and low-level features, thus making the top-down path of the FPN mostly redundant. Furthermore, the scale of persons in our data does not vary enough to require the FPN top-down features.
Fine-Tuning. We first train all the models on the
COCO person class. In Table 2, we show the effect
of fine-tuning these models on our high-quality task-
specific dataset. It is clear that the fine-tuning signif-
icantly improves the mIoU, and the effect is greater
for smaller models.
PointRend. Although the PointRend module adds 0.2MB to the model size and 20% to the runtime of our method, we use it, as Table 3 shows that it adds around 0.3% mIoU to the model accuracy.
Table 3: Effect of PointRend on the accuracy of the Semantic FPN model, without and with PointRend. Both models are quantized.

Backbone | FBNetV3 | MnasNet-B1
Without PointRend | 97.4 ± 1.1 | 97.4 ± 1.0
With PointRend | 97.7 ± 0.9 | 97.7 ± 0.9
CNN Backbone Choice. We compared the effect of using MobileNetV3 (Howard et al., 2019), MnasNet (Tan et al., 2019) and FBNetV3 (Dai et al., 2021) as mobile-friendly backbone feature extractors. Table 4 shows that all of the mobile feature extractors have around the same performance on our task, which is only 0.1% lower than the much larger ResNeXt101. We chose MnasNet due to its lower variance and its availability in the torchvision framework.
Table 4: Effect of different CNN backbones on accuracy of
the Semantic FPN segmentation model.
Model backbone | mIoU
ResNeXt101 | 97.8 ± 1.0
MobileNetV3 | 97.7 ± 1.1
(Q) FBNetV3 | 97.7 ± 0.9
(Q) MnasNet-B1 | 97.7 ± 0.9
5.2 Runtime on Mobile Devices
We evaluated our model on a set of real mobile devices provided by the AWS Device Farm (https://aws.amazon.com/device-farm/). The distribution of runtimes is shown in Figure 5. For this eval-
uation, 90 high-resolution images from the dataset
are processed using our in-house evaluation mobile
app. Images are resized to 2k resolution in height
while preserving the aspect ratio and then cropped to
the person bounding box. The cropped images are
passed to the model, where they are resized to 1024 in
height internally before segmentation. As the bound-
ing box of the person varies between images, a sig-
nificant variance of the runtime on a single device can
be noticed. On recent iPhones the model runs well be-
low 1s and for an older Android phone like the Moto
G4 the mean is around 8s. In iPhone SE and Galaxy
S21 there are some outliers in runtime, for reasons
we were not able to determine. The model is running
on CPU mode due to limited support of PyTorch for
mobile-GPU in quantized models.
5.3 Comparison to BlazePose and Apple
Person Segmentation
We compare with two on-device person segmentation
methods.
Figure 5: Runtime on mobile devices. Our method runs
in less than two seconds on most modern devices. As of
October 2022, iPhone 13 is the latest iPhone available on
the Amazon Device Farm.
Figure 6: Comparison of segmentation methods (columns: ours, BlazePose, Apple). Red arrows indicate mis-segmented regions. In these images, the threshold for the Apple segmentation algorithm (balanced) was set to 0.6 to obtain the best results; in the quantitative evaluation, the value is 0.3.
BlazePose (Bazarevsky et al., 2020) is an on-device real-time body pose tracking method, which provides
a segmentation prediction option. We compare two
settings of the BlazePose model, balanced and accu-
rate, which have model sizes of 6.4MB and 27.7MB
respectively.
We also compare to Apple Person Segmentation, which was made available in iOS 15. Details about this model are not publicly available.
The output segmentation maps of these methods
are probabilistic, and need to be thresholded to com-
pute the final binary silhouette maps.

Figure 7: Overlays of predicted segmentations over the ground-truth annotations (blue: intersection) from our in-house dataset, for our method, BlazePose (BP) and Apple, with the per-panel IoU (ours/BP/Apple: 95.7/93.8/87.3, 96.4/93.2/95.5, 98.6/96.7/97.6, 98.8/90.9/97.0). We ranked the test-set images by their mIoU using ALiSNet, and display the 5th, 10th, 90th and 95th-percentile images of this ranking. As the photos are confidential, we show only the silhouettes here. After data collection, all images were anonymized using face blur, which is visible in the Apple segmentation in the first image. In one of the BP segmentations, the bounding box prediction cuts off part of the feet.

Thresholds were determined using a sweep of values and were set to 0.5 for BlazePose and 0.3 for Apple.
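Such a sweep can be sketched as follows, where prob_maps and gt_masks are assumed lists of probability maps and ground-truth masks from a validation set:

```python
import numpy as np

def iou(pred, gt):
    # IoU between two binary masks (treating empty/empty as a perfect match).
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def best_threshold(prob_maps, gt_masks, candidates=np.arange(0.1, 0.95, 0.05)):
    """Sweep binarization thresholds and keep the one with the highest mean IoU
    on a validation set (sketch; not the exact sweep we ran)."""
    scores = [
        np.mean([iou(p >= t, g) for p, g in zip(prob_maps, gt_masks)])
        for t in candidates
    ]
    return float(candidates[int(np.argmax(scores))])
```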
It is not possible to fine-tune either of these meth-
ods on our data.
As seen in Table 1, both BlazePose and Apple Person Segmentation have a much lower mIoU than ALiSNet on our dataset, while having a much higher standard deviation. This indicates that fine-tuning on our data allows our model to avoid certain segmentation mistakes. Neither of the compared methods is designed for the use case of segmentation for accurate human body measurement. BlazePose is optimized for real-time pose estimation, which means it lies at a different point of the performance-accuracy trade-off. Apple Person Segmentation is designed to power Portrait
Mode. Segmentation examples from all three meth-
ods are shown in Figure 6. In the front view, both our
method and Apple produce more accurate segmenta-
tions than BlazePose. In the side view, we see that
all three methods have difficulty with background ob-
jects, and that the Apple method produces artifacts in
the lamp.
Figure 7 shows overlays from the test set of our
dataset. In this figure, we show both "bad" and "good" segmentations from our model; even so, our segmentations achieve a higher IoU than the other methods', as our model was trained on this dataset.
6 DISCUSSION
We presented a method for accurate mobile human
segmentation along with a set of general steps that
can be used to simplify existing large-scale models
for on-device applications.
Although our model handles most images well,
there are cases with confusing background textures
where our method and other methods fail, as shown
in Figure 8. Other challenging conditions include
dim lighting, dark shadows or other image distor-
tions. Improving the performance under these conditions would be an important future direction.

Figure 8: Examples of failure cases for the segmentation algorithms (columns: ours, BlazePose, Apple). In our user-collected dataset, people take pictures at home and sometimes have clothes (top) or mirrors (bottom) in the background, which causes the segmentation methods to fail. This is a shortcoming of all the methods we evaluate here.
In the future, we will experiment with on-device segmentation models for accurate body shape and measurement estimation.
ACKNOWLEDGEMENTS
We would like to thank Julia Friberg for her contri-
butions in evaluation of the models on real mobile
phones.
REFERENCES
Apple. Applying Matte Effects to People in Images and
Video.
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T.,
Zhang, F., and Grundmann, M. (2020). Blazepose:
On-device real-time body pose tracking. CoRR,
abs/2006.10204.
Dai, X., Wan, A., Zhang, P., Wu, B., He, Z., Wei, Z., Chen,
K., Tian, Y., Yu, M., Vajda, P., et al. (2021). Fbnetv3:
Joint architecture-recipe search using predictor pre-
training. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
16276–16285.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Dibra, E., Jain, H., Öztireli, C., Ziegler, R., and Gross,
M. (2016). Hs-nets: Estimating human body shape
from silhouettes with convolutional neural networks.
In 2016 fourth international conference on 3D vision
(3DV), pages 108–117. IEEE.
Dibra, E., Jain, H., Oztireli, C., Ziegler, R., and Gross, M.
(2017). Human shape from silhouettes using genera-
tive hks descriptors and cross-modal neural networks.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 4826–4836.
Dukhan, M., Wu, Y., and Lu, H. (2020). Qnnpack: open
source library for optimized mobile deep learning.
https://github.com/pytorch/QNNPACK.
Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P.,
Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and
He, K. (2017). Accurate, large minibatch SGD: train-
ing imagenet in 1 hour. CoRR, abs/1706.02677.
Gupta, A., Dollár, P., and Girshick, R. B. (2019). LVIS:
A dataset for large vocabulary instance segmentation.
CoRR, abs/1908.03195.
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., and Xu, C.
(2020). Ghostnet: More features from cheap opera-
tions. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 1580–
1589.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B.,
Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V.,
et al. (2019). Searching for mobilenetv3. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision, pages 1314–1324.
Ji, Z., Qi, X., Wang, Y., Xu, G., Du, P., Wu, X., and Wu,
Q. (2019). Human body shape reconstruction from
binary silhouette images. Computer Aided Geometric
Design, 71:231–243.
Kirillov, A., Girshick, R., He, K., and Dollar, P. (2019).
Panoptic feature pyramid networks. In Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR).
Kirillov, A., Wu, Y., He, K., and Girshick, R. (2020).
Pointrend: Image segmentation as rendering. In Pro-
ceedings of the IEEE/CVF conference on computer vi-
sion and pattern recognition, pages 9799–9808.
Knapp, J. (2021). Real-Time Person Segmentation on Mo-
bile Phones. PhD thesis, Wien.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J. R. R.,
Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Mal-
loci, M., Duerig, T., and Ferrari, V. (2018). The open
images dataset V4: unified image classification, object
detection, and visual relationship detection at scale.
CoRR, abs/1811.00982.
Li, Y., Luo, A., and Lyu, S. (2020). Fast portrait seg-
mentation with highly light-weight network. In 2020
IEEE International Conference on Image Processing
(ICIP), pages 1511–1515. IEEE.
Liang, Z., Guo, K., Li, X., Jin, X., and Shen, J. (2022).
Person foreground segmentation by learning multi-
domain networks. IEEE Transactions on Image Pro-
cessing, 31:585–597.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017a). Feature pyramid networks
for object detection. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2117–2125.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017b). Focal loss for dense object detection. In
Proceedings of the IEEE international conference on
computer vision, pages 2980–2988.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick,
R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L.,
and Dollár, P. (2014). Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312.
Orts-Escolano, S. and Ehman, J. (2022). Accurate Alpha Matting for Portrait Mode Selfies on Pixel 6. https://ai.googleblog.com/2022/01/accurate-alpha-matting-for-portrait.html.
Park, H., Sjösund, L. L., Yoo, Y., and Kwak, N. (2019).
Extremec3net: Extreme lightweight portrait segmen-
tation networks using advanced c3-modules. arXiv
preprint arXiv:1908.03093.
Smith, B. M., Chari, V., Agrawal, A., Rehg, J. M., and
Sever, R. (2019). Towards accurate 3d human body
reconstruction from silhouettes. In 2019 Interna-
tional Conference on 3D Vision (3DV), pages 279–
288. IEEE.
Song, D., Tong, R., Chang, J., Yang, X., Tang, M., and
Zhang, J. J. (2016). 3d body shapes estimation from
dressed-human silhouettes. In Computer Graphics Fo-
rum, volume 35, pages 147–156. Wiley Online Li-
brary.
Song, D., Tong, R., Du, J., Zhang, Y., and Jin, Y. (2018).
Data-driven 3-d human body customization with a
mobile device. IEEE Access, 6:27939–27948.
Strohmayer, J., Knapp, J., and Kampel, M. (2021). Efficient
models for real-time person segmentation on mobile
phones. In 2021 29th European Signal Processing
Conference (EUSIPCO), pages 651–655. IEEE.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M.,
Howard, A., and Le, Q. V. (2019). Mnasnet: Platform-
aware neural architecture search for mobile. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 2820–2828.
Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian,
Y., Vajda, P., Jia, Y., and Keutzer, K. (2019a). Fbnet:
Hardware-aware efficient convnet design via differen-
tiable neural architecture search. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 10734–10742.
Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J.
(2015). Quantized convolutional neural networks for mobile devices. arXiv preprint arXiv:1512.06473.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Gir-
shick, R. (2019b). Detectron2. https://github.com/
facebookresearch/detectron2.
Xi, W., Chen, J., Lin, Q., and Allebach, J. P. (2019). High-
accuracy automatic person segmentation with novel
spatial saliency map. In 2019 IEEE International
Conference on Image Processing (ICIP), pages 1560–
1564.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017).
Aggregated residual transformations for deep neural
networks. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 1492–
1500.
Yan, S., Wirta, J., and Kämäräinen, J.-K. (2021). Silhouette
body measurement benchmarks. In 2020 25th Inter-
national Conference on Pattern Recognition (ICPR),
pages 7804–7809. IEEE.