Study of an Expansion Method Based on an Image-Specific Classifier and
Multi-Features for Weakly Supervised Semantic Segmentation
Zhengyang Lyu (a), Pierre Beauseroy (b) and Alexandre Baussard (c)
Université de Technologie de Troyes, LIST3N, 12 Rue Marie Curie, 10300 Troyes, France
(a) https://orcid.org/0009-0001-8838-4170
(b) https://orcid.org/0000-0002-2883-1303
(c) https://orcid.org/0000-0002-6693-4282
Keywords:
Semantic Segmentation, Weak Supervision, Convolutional Neural Network, Support Vector Machine.
Abstract:
In this paper, we propose a study of an expansion method based on an image-specific classifier and multi-features for Weakly Supervised Semantic Segmentation (WSSS) with only image-level labels. Recent WSSS methods focus mainly on enhancing the pseudo masks to improve the segmentation performance, either by obtaining improved Class Activation Maps (CAM) or by applying post-processing methods that combine expansion and refinement. Most of these methods either lack consideration for the balance between resolution and semantics in the features used, or are carried out globally over the whole data set, without taking into account potential additional improvements based on the specific content of each image. Previously, we proposed an image-specific expansion method using multi-features to alleviate these limitations. This new study aims firstly at determining the upper performance limit of the proposed method using the ground truth masks, and secondly at analysing this performance limit in relation to the features chosen. Experiments show that our expansion method can achieve promising results when used with the ground truth (upper performance limit) and with features that strike a balance between semantics and resolution.
1 INTRODUCTION
Semantic segmentation is a popular task in computer
vision, with wide applications in various fields such
as autonomous driving, medical imaging, or remote sensing. However, training a Fully Supervised Semantic Segmentation (FSSS) model requires laborious pixel-level annotations.
Weakly Supervised Semantic Segmentation (WSSS) has been proposed to reduce the annotation burden. The weak supervision can be based on points (Bearman et al., 2016), scribbles (Vernaza and Chandraker, 2017), bounding boxes (Dai et al., 2015) or image-level labels (Chen et al., 2022). The latter, which is considered in this paper, is the most prevalent in research since such annotations are the easiest and cheapest to obtain.
Figure 1 illustrates a comparison between the
pipelines of FSSS and WSSS with only image-level
labels. The goal of both tasks is to obtain a segmentation model capable of making pixel-level predictions for a given image. Generally, segmentation models based on Convolutional Neural Networks (CNN)
(Chen et al., 2017; Chen et al., 2018) or more re-
cently, transformers (Strudel et al., 2021) are com-
monly used. In contrast to FSSS, where the ground
truth is available during training, WSSS methods
must first generate pseudo masks, which are used as
ground truth during the second step to train the seg-
mentation model. To generate the pseudo masks, we start by training a classification model with image-level labels and then generate Class Activation Maps (CAM) by processing the class-wise deep features from the trained network for each image in the training set (Zhou et al., 2016). Next, an expansion method, which includes interpolation and argmax operations, is used to get a seed. Finally, a refinement process (Krähenbühl and Koltun, 2011; Ahn et al., 2019) provides an improved pseudo mask by recovering details using image color features. Of course, in WSSS approaches, the quality of the pseudo masks directly influences the accuracy of the segmentation results, since they are used to train, in the second step, a fully supervised segmentation model. That is why recent WSSS methods focus on generating pseudo masks whose quality approaches that of the ground truth.
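To make this classic expansion step concrete, here is a minimal sketch in Python/PyTorch (the helper name and the fixed background score are our own assumptions, not the exact convention of the cited methods): the CAMs are bilinearly upsampled to image size, a constant background score is appended, and a per-pixel argmax produces the seed.

    import torch
    import torch.nn.functional as F

    def expand_cam_to_seed(cam, image_size, bg_score=0.3):
        # cam: (num_fg_classes, h, w) class activation maps.
        # image_size: (H, W) of the original image.
        # bg_score: hypothetical constant background score.
        cam = F.interpolate(cam[None], size=image_size,
                            mode="bilinear", align_corners=False)[0]
        # Normalize each class map to [0, 1] before comparison.
        cam = cam / (cam.flatten(1).max(dim=1).values[:, None, None] + 1e-5)
        bg = torch.full((1, *image_size), bg_score)
        # Per-pixel argmax over background + foreground classes;
        # label 0 is the background.
        return torch.cat([bg, cam], dim=0).argmax(dim=0)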
Figure 1: Pipelines of fully supervised semantic segmentation (FSSS) and weakly supervised semantic segmentation (WSSS) with image-level labels. Lseg is the segmentation loss function; Lcls is the classification loss function.

Figure 2: mIoU (%) of the pseudo masks on the train set (blue line), and mIoU of the segmentation model trained from those pseudo masks on the test set (red line), of PASCAL VOC 2012 for several WSSS methods.
Figure 2 shows the mean Intersection-over-Union (mIoU) obtained by several WSSS methods (Zhang et al., 2020; Li et al., 2022b; Li et al., 2022a; Chen et al., 2022; Qin et al., 2022; Lee et al., 2022a; Lee et al., 2021a; Lee et al., 2021b; Jo et al., 2023; Lee et al., 2022b), for the pseudo masks (blue line) and the final segmentation (red line). All the considered methods use a classic expansion process and commonly used refinement processes, such as dCRF (Krähenbühl and Koltun, 2011) and IRNet (Ahn et al., 2019), to generate the pseudo masks. DeepLabV2 (Chen et al., 2017), with a ResNet101 backbone, is used as the fully supervised segmentation model. The quality of the pseudo masks is still far from perfect, so better segmentation performance can be expected from improved pseudo masks.
Figure 3 provides some visual comparisons between the outputs of the segmentation models trained using the ground truth in the FSSS task or the pseudo masks in the WSSS task. The model trained using pseudo masks tends to generate predictions with imprecise and blurred boundaries, even if those pseudo masks seem relatively accurate. However, this effect is mitigated with full supervision (as can be seen in row 3 of Figure 3). This suggests that the features learned by the WSSS model are adversely affected by the labelling errors present in the pseudo masks.

Figure 3: Visual comparison between the outputs of the segmentation models trained with ground truth and with pseudo masks. From top to bottom: image, ground truth, output of the fully supervised segmentation network, pseudo mask, output of the segmentation network trained with pseudo masks.

When pseudo masks are deduced from a global model, the labelling errors observed on different pseudo masks can reinforce each other or have common causes (e.g., rail tracks labelled as train), and contribute to degrading the overall quality of the pseudo masks used to create the segmentation network training set. To reduce this effect, it is necessary to make better use of the individual properties of each image and to improve the pseudo masks.
Based on these observations, we proposed in a previous work an image-specific expansion method using multi-features to alleviate the limitations of expansion methods that either ignore the balance between resolution and semantics in the features used or disregard the image's specificity. The detailed pipeline is shown in Figure 5.a. We only describe this method briefly here: with only image-level labels, we designed a sample selection strategy using multiple features (CAM, seed and shallow features) to select data from high-resolution shallow features with sufficient semantics, and label them with the values in the seed. This yields a pixel-wise data set for training an image-specific Support Vector Machine (SVM) classifier, which then infers a pixel-wise prediction for the entire image. Using this expansion method, we get better predictions than the original seed. With a further refinement process, the enhanced pseudo masks are also promising. Some prediction examples from the SVM are shown in the last column of Figure 4.
The critical part of the proposed expansion method is the sample selection process. It directly influences the accuracy of the labels in the pixel-wise training set and thus the quality of the prediction.

Figure 4: Qualitative results from the proposed expansion method in the weakly supervised setting. From left to right: image, ground truth, CAM, seed, and the result from the SVM in our expansion method.

In order to evaluate the full potential of our expansion method, we conduct, in this study, experiments assuming that the ground truth is available, which enables us to define a pixel-wise training set free of labelling errors. This training set can be regarded as the best possible outcome of the sample selection process. Thus, the sample selection step of the original method described in Figure 5.a is replaced by a uniform random sampling from features of the classification network, labelled using the ground truth, as shown in Figure 5.b.
For the features, we explore various options by choosing activation maps from different layers of the classification network. Then, for each image in the training set, an SVM is trained to label its pixels. We show that, when the ground truth is available, the SVM predictions are particularly promising, provided that the features used strike a balance between resolution and semantics.
The rest of the paper is organized as follows: Sec-
tion 2 presents related work about the expansion pro-
cess in the WSSS framework. Section 3 describes the
proposed expansion method when assuming ground
truth is available. Section 4 provides the experimental
setup and substantial results. We conclude and outline
future research directions in Section 5.
2 RELATED WORK
Generally, weakly supervised semantic segmentation methods with only image-level labels require a 2-step pipeline: pseudo mask generation and segmentation model training. Since the quality of the pseudo masks directly impacts the performance of the segmentation model, most methods focus their efforts on improving the accuracy of those pseudo masks. These methods can be divided into 2 groups. The first one tries to adjust classification models to obtain improved CAMs, through strategies like improved training mechanisms (Wang et al., 2020) and contrastive learning (Yuan et al., 2023). The main challenge for these methods is finding an effective connection between the implemented modifications and the resulting enhancement of the CAMs.
The second one aims at designing better expansion methods and using refinement processes to obtain high-quality pseudo masks from CAMs (Krähenbühl and Koltun, 2011; Ahn and Kwak, 2018; Ahn et al., 2019; Li et al., 2021; Jo et al., 2023). For example, PMM (Li et al., 2021) strives to overcome the partial response problem in CAM by using a smoothing method to expand the localization area, and generates a class-specific background for each image to independently obtain the pixel-wise labels when building the pseudo mask. In the same vein, by generating class-wise centroid prototypes from unsupervised features over the whole dataset, the MARS method (Jo et al., 2023) excludes false activations caused by co-occurrences between background elements and associated objects in the CAM.
In contrast to the fuzzy localization maps provided by CAMs, details are much better recovered after a well-designed expansion method with a refinement process. We observe that improved expansion methods can be achieved by using high-resolution shallow features, such as the color information of the original image, and by relying on semantic information specific to the given image. However, we argue that color information may not contain sufficient semantics to represent the classes in the image well, which may lead to incorrect predictions (Krähenbühl and Koltun, 2011; Li et al., 2021). Besides, valuable image-specific features may not be effectively utilized in expansion methods that are implemented globally (Ahn and Kwak, 2018; Ahn et al., 2019; Jo et al., 2023).
The expansion method proposed in our previous work introduces an image-specific pixel-wise classifier with multiple features to overcome the limitations of recent expansion methods. In this paper, we study the potential of this expansion method by assuming the ground truth is available, and we look for the optimal choice of features by revealing the relation between performance and the resolution-semantics balance.
3 METHOD
In this section, we describe the details of the proposed expansion method for fully supervised semantic segmentation, as shown in Figure 5.b. We implement this method assuming the availability of the ground truth, with the goal of assessing its potential upper performance limit for the WSSS task and finding clues about the features to use.

Figure 5: Detailed pipeline of the proposed expansion method for WSSS (a) and its corresponding fully supervised approach (b). The solid lines are related to the training phase of the SVM classifier, and the dotted lines are related to the inference phase.

The proposed method is divided into 2 steps. Firstly, for each image we construct a training set composed of data vectors, which can be a set of selected pixels of the image itself or of resized features of the classification model (output activation maps of a layer in the trained classification network), and the corresponding training labels given by the ground truth. Secondly, we train a pixel-wise classifier, here an SVM model, using this training set. Then, we output pixel-wise predictions by inferring all the pixels of the original image or of the considered resized features of the classification model (called the test data in what follows). Finally, the pixel-wise predictions are reshaped into a two-dimensional prediction mask.
3.1 Construction of the Training Set
and Test Data
The first part of our method is to construct the pixel-wise training set, which includes training data and training labels, and the test data.
The training set is constructed by sampling data from the features generated by the classification model, or directly from the original image, and labeling the data with the ground truth. The detailed process is as follows. Firstly, a classification model is trained on the given images with image-level labels. Assuming that l classes are labelled in the data set, we define O = {1, 2, ..., l} as the set of classes for image-level labels. The data set is a collection of images I ∈ R^{W×H×3} with image-level labels Y, with Y ⊆ O. Then, we uniformly sample at random n training samples for each class in the image, as shown in Figure 6. Each selected sample S_i is the i-th pixel-wise feature deduced from the original image or from the output activation maps S of a layer in the trained classification network, after bilinear interpolation to the image size. Finally, the pixel-wise training set is built from the (|Y|+1)n pairs (S_i, M_i), where S_i is the pixel-wise feature vector and M_i is the ground truth label at pixel i; |Y| is the number of object classes in the image.
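As an illustration, the following minimal sketch (our own Python/NumPy helper, not code from a released implementation) builds such a balanced pixel-wise training set from a feature map and a ground truth mask:

    import numpy as np

    def build_pixelwise_training_set(feature, gt_mask, n=2500, rng=None):
        # feature: (H, W, C) image pixels or upsampled activation maps.
        # gt_mask: (H, W) integers, 0 = background, 1..l = object classes
        #          (a real VOC mask would also need the 'void' label 255
        #          filtered out beforehand).
        rng = np.random.default_rng() if rng is None else rng
        samples, labels = [], []
        for cls in np.unique(gt_mask):
            rows, cols = np.nonzero(gt_mask == cls)
            # Uniform random sampling of n pixels per class; sample with
            # replacement when a class covers fewer than n pixels, so the
            # classes stay balanced.
            idx = rng.choice(len(rows), size=n, replace=len(rows) < n)
            samples.append(feature[rows[idx], cols[idx]])
            labels.append(np.full(n, cls))
        return np.concatenate(samples), np.concatenate(labels)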
Notice that, in order to avoid training errors caused by imbalanced data, the number of samples is kept the same for each object class and for the background. It is essential to stress that the accuracy of the SVM results is related to the total number of samples: with a larger number of samples, the results get better, at the cost of training time. It could also be interesting to link the total number of samples to the resolution of the feature.
Depending on the layer, the resolution of the chosen feature varies from 1/2 to 1/16 of the original image. The detailed resolutions of the chosen features are given in Section 4.1. Typically, the shallow features generated by the early layers of the classification network carry shallow semantic information at high resolution. In contrast, the deep features carry rich semantic information at low resolution. Different features used to generate localization maps can yield varied performance. Low-resolution features are prone to produce smoothed boundaries and missing details (Zhou et al., 2016), while insufficient semantics can lead to noisy and inaccurate predictions (Krähenbühl and Koltun, 2011). Thereby, by comparing the predictions given by classifiers trained with different features, the appropriate balance between semantics and resolution for an ideal prediction can be observed.
To construct the so-called test data, we select all the pixels of the original image or of the feature S. The test data serve as the input of the trained pixel-wise classifier during the inference stage.
3.2 Training and Inference of the Pixel-Wise Classifier
As already mentioned and shown in Figure 5, the pixel-wise classifier, implemented as a Support Vector Machine (SVM) in our study, is trained on the constructed image-specific pixel-wise training set.

During training, the SVM learns to distinguish between the different classes based on the sampled feature data. The one-vs-all strategy is employed for the multi-class classification task, where a classifier is trained for each class against the others.

Figure 6: Labeled samples from the ground truth. From left to right: image, ground truth, and the selected labeled samples.

Once the SVM is trained, the pixel-wise features from the test data are classified to assign class labels. These predictions are then reshaped into a segmentation mask.
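A minimal sketch of this training and inference step using scikit-learn follows (our own helper; scikit-learn's SVC is multi-class via one-vs-one by default, so we wrap it in an explicit one-vs-rest scheme to match the one-vs-all strategy described above; the kernel and its parameters are assumptions, not the paper's exact settings):

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def predict_mask(feature, samples, labels):
        # feature: (H, W, C) pixel-wise features of the whole image.
        # samples, labels: the balanced training set built previously.
        svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))
        svm.fit(samples, labels)
        # The test data are all the pixels of the image (or feature map);
        # the predictions are reshaped into a 2D segmentation mask.
        h, w, c = feature.shape
        return svm.predict(feature.reshape(-1, c)).reshape(h, w)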
4 EXPERIMENTS
In this section, we first give the details of the experimental settings: the dataset, the evaluation metrics and the implementation details. Then, we present our experimental results quantitatively and qualitatively.
4.1 Experimental Settings
Dataset and Evaluation Metrics. Our preliminary study is evaluated on the PASCAL VOC 2012 dataset (Everingham et al., 2015), which has 20 foreground classes and 1 background class. The official dataset is split into a train set, a validation set and a test set, containing 1464, 1449 and 1456 images, respectively. Additional annotations provided by the Semantic Boundary Dataset (Hariharan et al., 2011) augment the train set to 10582 images. The augmented train set is used to train the classification network, which generates image-specific features for a given image. We evaluate the segmentation results on all 1464 images of the train set. Mean Intersection-over-Union (mIoU) is used to evaluate the segmentation results.
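For reference, mIoU can be computed as in the following sketch (our own helper, assuming integer masks and the usual VOC convention of ignoring the 'void' label 255):

    import numpy as np

    def mean_iou(preds, gts, num_classes=21, ignore_index=255):
        # Accumulate per-class intersections and unions over all images,
        # then average the IoU over the classes.
        inter = np.zeros(num_classes)
        union = np.zeros(num_classes)
        for pred, gt in zip(preds, gts):
            valid = gt != ignore_index
            p, g = pred[valid], gt[valid]
            for c in range(num_classes):
                inter[c] += np.sum((p == c) & (g == c))
                union[c] += np.sum((p == c) | (g == c))
        # A class that never appears gets IoU 0 here.
        return np.mean(inter / np.maximum(union, 1))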
Figure 7: Qualitative segmentation results on the PASCAL VOC 2012 train set. From left to right: image, ground truth, prediction when the SVM is trained using the original image as feature, and using features generated from the different layers of the backbone classification network, namely F1 to F5.
Table 1: Comparison of segmentation mIoU (%) scores using different features on the PASCAL VOC 2012 train set. The best score is highlighted in bold.

Feature bkg aero bike bird boat bottle bus car cat chair cow table dog horse motor person plant sheep sofa train tv mean
img  86.98 68.79 35.91 64.40 57.83 58.97 68.77 62.19 77.31 58.65 73.76 69.01 71.25 73.22 57.99 68.02 61.07 70.89 74.65 64.80 60.96 65.97
F1   98.98 96.70 89.97 96.95 96.39 95.61 96.80 96.74 98.32 95.76 97.77 97.74 97.82 97.75 95.22 96.87 96.17 97.63 97.82 96.70 96.13 96.67
F2   98.65 92.87 85.61 95.38 93.90 95.34 95.66 95.63 98.16 94.89 96.77 96.76 97.23 96.86 93.93 95.85 95.02 97.22 97.47 95.80 95.19 95.44
F3   98.58 87.62 84.83 93.02 91.31 96.36 96.74 95.63 97.91 95.57 95.73 97.82 96.44 96.31 93.88 96.27 95.31 96.53 98.29 96.52 97.50 95.15
F4   96.98 78.48 75.53 86.10 84.00 93.96 94.35 91.75 95.39 91.04 87.35 96.74 90.63 88.29 89.19 93.85 91.03 88.22 96.43 93.71 96.69 90.46
F5   90.19 55.38 66.41 61.51 60.54 77.25 80.48 75.81 82.78 80.74 72.03 91.36 72.52 71.02 73.94 85.84 81.31 72.71 82.62 73.93 86.00 75.92
Implementation Details. In our experiments, we use the classification network designed in the weakly supervised semantic segmentation framework called Self-supervised Image-specific Prototype Exploration (SIPE) (Chen et al., 2022) to obtain features for a given image. Following the settings of SIPE, we use ResNet-50 (He et al., 2016) as the backbone classification network. The architecture of ResNet-50 is divided into 5 stages, each consisting of convolutional and pooling layers. We name the output features of each stage of the backbone classification network F1 to F5, whose spatial sizes are 1/2, 1/4, 1/8, 1/16 and 1/16 of the original image's size, respectively. These features are z-score normalized to prevent certain parts of the features from dominating the training process due to their larger magnitude. For each category present in the image, including the background, we sample n = 2500 pixels, which corresponds to a small fraction of the image pixels.
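The feature extraction can be sketched as follows with torchvision's ResNet-50 (the stage grouping is our reading of the architecture; SIPE's exact feature taps, dilation settings and trained weights may differ):

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet50

    model = resnet50(weights="IMAGENET1K_V1").eval()
    stages = [
        torch.nn.Sequential(model.conv1, model.bn1, model.relu),  # F1, 1/2
        torch.nn.Sequential(model.maxpool, model.layer1),         # F2, 1/4
        model.layer2,                                             # F3, 1/8
        model.layer3,                                             # F4, 1/16
        model.layer4,  # F5; SIPE keeps 1/16 via a dilated layer4,
                       # whereas the standard torchvision model gives 1/32.
    ]

    def extract_features(image):
        # image: (1, 3, H, W) tensor; returns F1..F5 as (H, W, C) arrays,
        # each bilinearly upsampled to image size and z-score normalized.
        feats, x = [], image
        with torch.no_grad():
            for stage in stages:
                x = stage(x)
                f = F.interpolate(x, size=image.shape[-2:],
                                  mode="bilinear", align_corners=False)
                f = (f - f.mean(dim=(2, 3), keepdim=True)) \
                    / (f.std(dim=(2, 3), keepdim=True) + 1e-5)
                feats.append(f[0].permute(1, 2, 0).numpy())
        return feats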
4.2 Experimental Results

Figure 7 shows the SVM predictions obtained with features from the different stages. We observe that, when training the SVM using the original image as feature, high-resolution results with relatively detailed object contours are generated. However, due to the limited semantics, there is a higher occurrence of misclassifications in background regions, leading to relatively imprecise predictions. Using the F5 feature, which has low resolution and richer semantics, the predicted segmentation boundaries are blurred and less accurate. Each of these two features has an advantage in terms of either resolution or semantics for the SVM results, but the imbalance between the two factors leads to imperfect results. In comparison, as shown in columns 4 to 6 of Figure 7, when using shallow features with sufficient semantics, like F1, F2 and F3, most of the wrong predictions in the background are clearly avoided and the object boundaries are quite sharp.
Segmentation results using different features for each class category, and the mean performance on the PASCAL VOC 2012 train set, are shown in Table 1. Results given by features that strike a balance between resolution and semantics, i.e., F1, F2 and F3, show the best segmentation performance in all categories. The F1 feature yields the best performance in most categories, which suggests that, when the ground truth is available, a pixel-wise classifier trained on high-resolution features with sufficient semantics is able to generate sharp segmentation boundaries and reasonably accurate segmentation results. Besides, we also ran experiments that simply concatenate selected features, i.e., stacking more than one feature along the channel axis. However, the outputs of the classifier trained on the concatenated features are close to the results of the classifier trained on the feature with the largest number of channels among the selected ones. This reveals that, when using more than one feature, it is important to find an appropriate method to combine them, in particular one taking into account the differences in feature depths.
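The naive fusion we tested corresponds to the first helper below (our own code); the second illustrates one possible remedy, which is an assumption on our part and was not evaluated here:

    import numpy as np

    def concat_features(feature_list):
        # Stack (H, W, C_k) maps along the channel axis. Even after
        # per-channel z-scoring, the feature with the most channels
        # (e.g., F5 with 2048 channels in ResNet-50, versus 64 for F1)
        # contributes most dimensions to the pixel vectors, and hence
        # dominates kernel distances in the SVM.
        return np.concatenate(feature_list, axis=-1)

    def balanced_concat(feature_list):
        # Hypothetical fix: rescale each feature by the square root of
        # its channel count so every source contributes comparably to
        # squared Euclidean distances.
        scaled = [f / np.sqrt(f.shape[-1]) for f in feature_list]
        return np.concatenate(scaled, axis=-1)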
The experimental results show the great potential of our expansion method for a segmentation task. Since we implement the expansion method with the ground truth in this paper, the results can be regarded as the upper performance limit of the proposed approach. They indicate that the segmentation results in our WSSS task, which corresponds to the setting initially targeted, should improve when the labels of the selected samples are sufficiently precise and the feature used represents a good compromise between resolution and semantics.
5 CONCLUSIONS
When we investigated methods to improve WSSS performance, we observed that the most critical part is to obtain better pseudo masks. Compared with the ground truth, the pseudo masks are not perfect yet. We started from a previously proposed expansion method designed to improve them by training a pixel-wise SVM classifier for each image in the training set. In this study, we use the ground truth to label the samples, which can be regarded as the best possible sample selection process, to evaluate the potential of this expansion method. We found that promising predictions are generated when the feature used keeps a balance between semantics and resolution. This shows that our expansion method has real potential and that high-resolution shallow features with sufficient semantics bring an effective gain in generating high-quality pseudo masks.
Besides these considerations, it also appears that the performance of WSSS segmentation networks is close to the performance obtained with the pseudo masks; thus, improving pseudo masks is valuable if the segmentation network is able to take advantage of their quality, as shown in Figure 8. Indeed, fully supervised DeepLabV2, used as backbone in all considered WSSS methods, reaches almost 79%. The gain we can expect with this backbone from perfect pseudo masks is then significant but not that large. Our experiments show that, by improving the sample selection process, we could reach more than 95% mIoU for pseudo masks, which is worth the effort when using the best FSSS network, which reaches 90.6% mIoU. Finally, we can conclude that the effort put into improving pseudo masks should be adapted to the expected performance of the backbone network in FSSS.
We must also mention that another line of research is to consider segmentation models that self-correct the pseudo masks during training, so that post-processing could be avoided, leading to less computationally demanding methods. Work in that direction is under investigation.
Figure 8: mIoU (%) of the pseudo masks on the train set (red points), mIoU of the segmentation model trained from the pseudo masks on the test set (blue points), on PASCAL VOC 2012 for several WSSS methods, and mIoU of the fully supervised models DeepLabV2 and SegNeXt (Guo et al., 2022) on the train set (red crosses).
REFERENCES
Ahn, J., Cho, S., and Kwak, S. (2019). Weakly supervised
learning of instance segmentation with inter-pixel re-
lations. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
IEEE.
Ahn, J. and Kwak, S. (2018). Learning pixel-level semantic
affinity with image-level supervision for weakly su-
pervised semantic segmentation. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Bearman, A., Russakovsky, O., Ferrari, V., and Fei-Fei, L. (2016). What's
the point: Semantic segmentation with point supervi-
sion. In Proceedings of the European Conference on
Computer Vision (ECCV). Springer.
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., and Adam,
H. (2018). Encoder-decoder with atrous separable
convolution for semantic image segmentation. In Pro-
ceedings of the European conference on computer vi-
sion (ECCV). Springer.
Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2017). Deeplab: Semantic image seg-
mentation with deep convolutional nets, atrous convo-
lution, and fully connected crfs. IEEE Transactions
on Pattern Analysis and Machine Intelligence.
Chen, Q., Yang, L., Lai, J., and Xie, X. (2022). Self-
supervised image-specific prototype exploration for
weakly supervised semantic segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE.
Dai, J., He, K., and Sun, J. (2015). Boxsup: Exploiting
bounding boxes to supervise convolutional networks
for semantic segmentation. In Proceedings of the In-
ternational Conference on Computer Vision (ICCV).
IEEE.
Guo, M., Lu, C., Hou, Q., Liu, Z., Cheng, M., and Hu,
S. (2022). Segnext: Rethinking convolutional atten-
tion design for semantic segmentation. Proceedings
of the Conference on Neural Information Processing
Systems (NeurIPS), 35.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE.
Jo, S., Yu, I., and Kim, K. (2023). Mars: Model-agnostic
biased object removal without additional supervision
for weakly-supervised semantic segmentation. arXiv
preprint arXiv:2304.09913.
Krähenbühl, P. and Koltun, V. (2011). Efficient inference
in fully connected crfs with gaussian edge potentials.
In Proceedings of the Conference on Neural Infor-
mation Processing Systems (NeurIPS). Curran Asso-
ciates, Inc.
Lee, J., Choi, J., Mok, J., and Yoon, S. (2021a). Reducing
information bottleneck for weakly supervised seman-
tic segmentation. In Proceedings of the Conference
on Neural Information Processing Systems (NeurIPS).
Curran Associates, Inc.
Lee, J., Kim, E., Mok, J., and Yoon, S. (2022a). Anti-
adversarially manipulated attributions for weakly su-
pervised semantic segmentation and object localiza-
tion. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence.
Lee, M., Kim, D., and Shim, H. (2022b). Threshold mat-
ters in wsss: Manipulating the activation for the robust
and accurate segmentation model against thresholds.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). IEEE.
Lee, S., Lee, M., Lee, J., and Shim, H. (2021b). Railroad
is not a train: Saliency as pseudo-pixel supervision
for weakly supervised semantic segmentation. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). IEEE.
Li, J., Jie, Z., Wang, X., Wei, X., and Ma, L. (2022a).
Expansion and shrinkage of localization for weakly-
supervised semantic segmentation. In Proceedings
of the Conference on Neural Information Processing
Systems (NeurIPS). Curran Associates, Inc.
Li, J., Jie, Z., Wang, X., Zhou, Y., Wei, X., and Ma, L.
(2022b). Weakly supervised semantic segmentation
via progressive patch learning. IEEE Transactions on
Multimedia.
Li, Y., Kuang, Z., Liu, L., Chen, Y., and Zhang, W. (2021).
Pseudo-mask matters in weakly-supervised semantic
segmentation. In Proceedings of the International
Conference on Computer Vision (ICCV).
Qin, J., Wu, J., Xiao, X., Li, L., and Wang, X. (2022).
Activation modulation and recalibration scheme for
weakly supervised semantic segmentation. In Pro-
ceedings of the AAAI Conference on Artificial Intel-
ligence (AAAI). AAAI Press.
Strudel, R., Garcia, R., Laptev, I., and Schmid, C. (2021).
Segmenter: Transformer for semantic segmentation.
In Proceedings of the International Conference on
Computer Vision (ICCV). IEEE.
Vernaza, P. and Chandraker, M. (2017). Learning random-
walk label propagation for weakly-supervised seman-
tic segmentation. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR). IEEE.
Wang, Y., Zhang, J., Kan, M., Shan, S., and Chen, X.
(2020). Self-supervised equivariant attention mech-
anism for weakly supervised semantic segmentation.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Yuan, K., Schaefer, G., Lai, Y., Wang, Y., Liu, X., Guan,
L., and Fang, H. (2023). A multi-strategy contrastive
learning framework for weakly supervised semantic
segmentation. Pattern Recognition.
Zhang, D., Zhang, H., Tang, J., Hua, X., and Sun, Q. (2020).
Causal intervention for weakly-supervised semantic
segmentation. In Proceedings of the Conference on
Neural Information Processing Systems (NeurIPS).
Curran Associates, Inc.
Zhou, B., Khosla, A., Lapedriza, A., et al. (2016). Learn-
ing deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.