Authors: Felix Stillger 1,2; Frederik Hasecke 2; Lukas Hahn 2 and Tobias Meisen 1
Affiliations: 1 University of Wuppertal, Gaußstraße 20, Wuppertal, Germany; 2 APTIV, Am Technologiepark 1, Wuppertal, Germany
Keyword(s):
Diffusion Model, Self-Attention, Segmentation.
Abstract:
High-quality annotated datasets are crucial for training semantic segmentation models, yet their manual creation and annotation are labor-intensive and costly. In this paper, we introduce a novel method for generating class-agnostic semantic segmentation masks by leveraging the self-attention maps of latent diffusion models such as Stable Diffusion. Our approach is entirely learning-free and explores the potential of self-attention maps to produce semantically meaningful segmentation masks. Central to our method is the reduction of the individual self-attention maps to a condensed representation that retains the features required for semantic distinction. We then apply multiple instances of unsupervised k-means clustering to this representation, where increasing the cluster count yields progressively finer semantic abstraction. We evaluate our approach against state-of-the-art models such as Segment Anything (SAM) and Mask2Former, which are trained on extensive datasets of manually annotated masks. Our results, demonstrated on both synthetic and real-world images, show that our method generates high-resolution masks with adjustable granularity, relying solely on the intrinsic scene understanding of the latent diffusion model, without requiring any training or fine-tuning.
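The clustering step described in the abstract can be illustrated with a minimal sketch. Here each "pixel" carries a reduced self-attention descriptor (e.g. its attention row averaged over heads and layers; this preprocessing is a hypothetical stand-in, not the paper's exact reduction), and a plain k-means groups those descriptors into segmentation labels. Increasing `k` would split the scene into finer regions, mirroring the adjustable granularity the method describes.

```python
def kmeans(features, k, iters=10):
    """Cluster per-pixel attention descriptors into k segmentation labels.

    features: list of equal-length tuples/lists of floats, one per pixel.
    Returns a list of integer cluster labels, one per pixel.
    """
    # Farthest-point initialisation keeps this toy example deterministic.
    centers = [list(features[0])]
    while len(centers) < k:
        nxt = max(features, key=lambda x: min(
            sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers))
        centers.append(list(nxt))
    for _ in range(iters):
        # Assign each pixel descriptor to its nearest cluster centre.
        labels = [min(range(k), key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centers[c]))) for x in features]
        # Recompute each centre as the mean of its assigned descriptors.
        for c in range(k):
            members = [x for x, lab in zip(features, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy example: six "pixels" whose 2-D descriptors form two attention groups.
feats = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
         (5.0, 5.1), (5.1, 4.9), (5.0, 5.0)]
labels = kmeans(feats, k=2)  # pixels in the same group share one label
```

In the actual method the descriptors would come from the diffusion model's self-attention maps and the resulting labels would be reshaped back into an image-sized mask; this sketch only shows the unsupervised grouping itself.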