Unsupervised Image Decomposition with Phase-Correlation Networks
Angel Villar-Corrales (https://orcid.org/0000-0001-5805-2098) and Sven Behnke (https://orcid.org/0000-0002-5040-7525)
Autonomous Intelligent Systems, University of Bonn, Germany
Keywords: Object-Centric Representation Learning, Unsupervised Image Decomposition, Frequency-Domain Neural Networks, Phase Correlation.
Abstract:
The ability to decompose scenes into their object components is a desired property for autonomous agents,
allowing them to reason and act in their surroundings. Recently, different methods have been proposed to
learn object-centric representations from data in an unsupervised manner. These methods often rely on latent
representations learned by deep neural networks, hence requiring high computational costs and large amounts
of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-
Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object compo-
nents, which are represented as transformed versions of a set of learned object prototypes. The core building
block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation
of the images in order to estimate the transformation between an object prototype and its transformed version
in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsuper-
vised object discovery and segmentation on simple benchmark datasets and on more challenging data, while
using a small number of learnable parameters and being fully interpretable. Code and models to reproduce
our experiments can be found at https://github.com/AIS-Bonn/Unsupervised-Decomposition-PCDNet.
1 INTRODUCTION
Humans understand the world by decomposing
scenes into objects that can interact with each other.
Analogously, autonomous systems’ reasoning and
scene understanding capabilities could benefit from
decomposing scenes into objects and modeling each
of these independently. This approach has been
proven beneficial to perform a wide variety of com-
puter vision tasks without explicit supervision, in-
cluding unsupervised object detection (Eslami et al.,
2016), future frame prediction (Weis et al., 2021; Gr-
eff et al., 2019), and object tracking (He et al., 2019;
Veerapaneni et al., 2020).
Recent works propose extracting object-centric
representations without the need for explicit super-
vision through the use of deep variational auto-
encoders (Kingma and Welling, 2014) (VAEs) with
spatial attention mechanisms (Burgess et al., 2019;
Crawford and Pineau, 2019). However, training these
models often presents several difficulties, such as long training times, large numbers of trainable parameters, and the need for large curated datasets.
Furthermore, these methods suffer from the inherent
lack of interpretability which is characteristic of deep
neural networks (DNNs).
To address the aforementioned issues, we propose
a novel image decomposition framework: the Phase-
Correlation Decomposition Network (PCDNet). Our
method assumes that an image is formed as an ar-
rangement of multiple objects, each belonging to one
of a finite number of different classes. Following this
assumption, PCDNet decomposes an image into its
object components, which are represented as trans-
formed versions of a set of learned object prototypes.
The core building block of the PCDNet frame-
work is the Phase Correlation Cell (PC Cell). This
is a differentiable module that exploits the frequency-
domain representations of an image and a prototype to
estimate the transformation parameters that best align
a prototype to its corresponding object in the image.
The PC Cell localizes the object prototype in the im-
age by applying the phase-correlation method (Alba
et al., 2012), i.e., finding the peaks in the cross-
correlation matrix between the input image and the
prototype. Then, the PC Cell aligns the prototype to
its corresponding object in the image by performing
the estimated phase shift in the frequency domain.
Figure 1: PCDNet decomposition framework. First, the Phase Correlation (PC) Cell estimates the N translation parameters
that best align each learned prototype to the objects in the image, and uses them to obtain (P × N) object and mask candi-
dates. Second, the color module assigns a color to each of the transformed prototypes. Finally, a greedy selection algorithm
reconstructs the input image by iteratively combining the colorized object candidates that minimize the reconstruction error.
PCDNet is trained by first decomposing an image
into its object components, and then reconstructing
the input by recombining the estimated object com-
ponents following the “dead leaves” model approach,
i.e., as a superposition of different objects. The strong
inductive biases introduced by the network structure
allow our method to learn fully interpretable proto-
typical object-centric representations without any ex-
ternal supervision while keeping the number of learn-
able parameters small. Furthermore, our method also
disentangles the position and color of each object in a
human-interpretable manner.
In summary, the contributions of our work are as
follows:
- We propose the PCDNet model, which decomposes an image into its object components, represented as transformed versions of a set of learned object prototypes.
- Our proposed model exploits the frequency-domain representation of images so as to disentangle object appearance, position, and color without the need for any external supervision.
- Our experimental results show that our proposed framework outperforms recent methods for joint unsupervised object discovery, image decomposition, and segmentation on benchmark datasets, while significantly reducing the number of learnable parameters, allowing for high throughput, and maintaining interpretability.
2 RELATED WORK
2.1 Object-Centric Representation
Learning
The field of representation learning (Bengio et al.,
2013) has seen much attention in the last decade, giv-
ing rise to great advances in learning hierarchical rep-
resentations (Paschalidou et al., 2020; Stanic et al.,
2021) or in disentangling the underlying factors of
variation in the data (Locatello et al., 2019; Burgess
et al., 2018). Despite these successes, these methods
often rely on learning representations at a scene level,
rather than learning in an object-centric manner, i.e.,
simultaneously learning representations that address
multiple, possibly repeating, objects.
In the last few years, several methods have been
proposed to perform object-centric image decompo-
sition in an unsupervised manner.
A first approach to object-centric decomposition
combines VAEs with attention mechanisms to decom-
pose a scene into object-centric representations. The
object representations are then decoded to reconstruct
the input image. These methods can be further divided into two different groups depending on the class
of latent representations used. On the one hand, some
methods (Eslami et al., 2016; Kosiorek et al., 2018;
Stanic et al., 2021; He et al., 2019) explicitly en-
code the input into factored latent variables, which
represent specific properties such as appearance, po-
sition, and presence. On the other hand, other mod-
els (Burgess et al., 2019; Weis et al., 2021; Locatello
et al., 2020; Goyal et al., 2020) decompose the image
into unconstrained per-object latent representations.
Recently, several proposed methods (Greff et al.,
2019; Engelcke et al., 2020; Engelcke et al., 2021;
Veerapaneni et al., 2020; Lin et al., 2020) use param-
eterized spatial mixture models with variational infer-
ence to decode object-centric latent variables.
Despite these recent advances in unsupervised
object-centric learning, most existing methods rely on
DNNs and attention mechanisms to encode the input
images into latent representations, hence requiring a
large number of learnable parameters and high com-
putational costs. Furthermore, these approaches suf-
fer from the inherent lack of interpretability charac-
teristic of DNNs. Our proposed method exploits the
strong inductive biases introduced by our scene com-
position model in order to decompose an image into
object-centric components without the need for deep
encoders, using only a small number of learnable pa-
rameters, and being fully interpretable.
2.2 Layered Models
The idea of representing an image as a superposition
of different layers has been studied since the introduc-
tion of the “dead leaves” model by (Matheron, 1968).
This model has been extended to handle natural im-
ages and scale-invariant representations (Lee et al.,
2001), as well as video sequences (Jojic and Frey,
2001). More recently, several works (Yang et al., 2017; Lin et al., 2018; Zhang et al., 2020; Aksoy et al., 2017; Arandjelović and Zisserman, 2019; Sbai et al., 2020) combine deep neural networks and ideas
from layered image formation for different generative
tasks, such as editing or image composition. How-
ever, the aforementioned approaches are limited to foreground/background layered decomposition, or to representing images with a small number of layers.
The work most similar to ours was recently pre-
sented by (Monnier et al., 2021). The authors propose
a model to decompose an image into overlapping lay-
ers, each containing an object from a predefined set
of categories. The object layers are obtained with a
cascade of spatial transformer networks, which learn
transformations that align object sprites to the input
image.
While we also follow a layered image formation,
our PCDNet model is not limited to a small number
of layers, hence being able to represent scenes with
multiple objects. PCDNet represents each object in
its own layer, and uses learned alpha masks to model
occlusions and superposition between layers.
2.3 Frequency-domain Neural
Networks
Signal analysis and manipulation in the frequency do-
main is one of the most widely used tools in the field
of signal processing (Proakis and Manolakis, 2004).
However, frequency-domain methods are less established for solving computer vision tasks with neural networks, where they have mostly been used for specific applications such as compression (Xu et al., 2020; Gueguen
et al., 2018), image super-resolution and denois-
ing (Fritsche et al., 2019; Villar-Corrales et al., 2021;
Kumar et al., 2017), or accelerating the calculation of
convolutions (Mathieu et al., 2014; Ko et al., 2017).
In recent years, a particular family of frequency-
domain neural networks—the phase-correlation net-
works—has received interest from the research com-
munity and has shown promise for tasks such as fu-
ture frame prediction (Farazi et al., 2021; Wolter
et al., 2020) and motion segmentation (Farazi and
Behnke, 2020). Phase-correlation networks compute
normalized cross-correlations in the frequency do-
main and operate with the phase of complex signals in
order to estimate motion and transformation parame-
ters between consecutive frames, which can be used
to obtain accurate future frame predictions requir-
ing few learnable parameters. Despite these recent
successes, phase-correlation networks remain unex-
plored beyond the tasks of video prediction and mo-
tion estimation. Our proposed method presents a first
attempt at applying phase correlation networks for
the tasks of scene decomposition and unsupervised
object-centric representation learning.
3 PHASE-CORRELATION
DECOMPOSITION NETWORK
In this section, we present our image decomposition
model: PCDNet. Given an input image I, PCDNet
aims at decomposing it into $N$ independent objects $\mathcal{O} = \{O_1, O_2, \dots, O_N\}$. In this work, we assume that these objects belong to one out of a finite number $P$ of classes, and that there is a known upper bound $N_{\max}$ to the total number of objects present in an image.
(a) PC Cell
(b) Color Module
Figure 2: (a): Inner structure of the PC Cell. First, the translation parameters are estimated by finding the correlation peaks
between the object prototype and the input image. Second, the prototype is shifted by phase shifting in the frequency domain.
(b): The Color Module estimates color parameters from the input and aligns the color channels of a translated object prototype.
Inspired by recent works in prototypical learning
and clustering (Monnier et al., 2020), we design our
model such that the objects in the image can be repre-
sented as transformed versions of a finite set of object
prototypes $\mathcal{P} = \{P_1, P_2, \dots, P_P\}$. Each object prototype $P_i \in \mathbb{R}^{H \times W}$ is learned along with a corresponding alpha mask $M_i \in \mathbb{R}^{H \times W}$, which is used to model
occlusions and superposition of objects. Throughout
this work, we consider object prototypes to be in gray-
scale and of smaller size than the input image. PCD-
Net simultaneously learns suitable object prototypes,
alpha masks and transformation parameters in order
to accurately decompose an image into object-centric
components.
An overview of the PCDNet framework is dis-
played in Figure 1. First, the PC Cell (Section 3.1) es-
timates the candidate transformation parameters that
best align the object prototypes to the objects in the
image, and generates object candidates based on the
estimated parameters. Second, a Color Module (Sec-
tion 3.2) transforms the object candidates by applying
a learned color transformation. Finally, a greedy se-
lection algorithm (Section 3.3) reconstructs the input
image by iteratively selecting the object candidates
that minimize the reconstruction error.
3.1 Phase-correlation Cell
The first module of our image decomposition frame-
work is the PC Cell, as depicted in Figure 1. This
module first estimates the regions of an image where a
particular object might be located, and then shifts the
prototype to the estimated object location. Inspired
by traditional image registration methods (Reddy and
Chatterji, 1996; Alba et al., 2012), we adopt an ap-
proach based on phase correlation. This method esti-
mates the relative displacement between two images
by computing the normalized cross-correlation in the
frequency domain.
Given an image I and an object prototype P, the
PC Cell first transforms both inputs into the frequency
domain using the Fast Fourier Transform (FFT, F ).
Second, it computes the phase differences between
the frequency representations of image and prototype,
which can be efficiently computed as an element-wise
division in the frequency domain. Then, a localization
matrix $\mathbf{L}$ is found by applying the inverse FFT ($\mathcal{F}^{-1}$) to the normalized phase differences:
$$\mathbf{L} = \mathcal{F}^{-1}\left(\frac{\mathcal{F}(I) \odot \overline{\mathcal{F}(P)}}{\|\mathcal{F}(I) \odot \overline{\mathcal{F}(P)}\| + \epsilon}\right), \qquad (1)$$
where $\overline{\mathcal{F}(P)}$ denotes the complex conjugate of $\mathcal{F}(P)$, $\odot$ is the Hadamard product, $\|\cdot\|$ is the modulus operator, and $\epsilon$ is a small constant to avoid division by zero. Finally, the estimated relative pixel displacement $\boldsymbol{\delta}_{x,y} = (\delta_x, \delta_y)$ can then be found by locating the correlation peak in $\mathbf{L}$:
$$\boldsymbol{\delta}_{x,y} = \arg\max(\mathbf{L}). \qquad (2)$$
In practical scenarios, we do not know in advance which objects are present in the image, or whether there is more than one object of the same class. To account for this uncertainty, we pick the $N_{\max}$ largest correlation values from $\mathbf{L}$ and consider them as candidate locations for an object.
Finally, given the estimated translation parame-
ters, the PC Cell relies on the Fourier shift theorem
to align the object prototypes and the corresponding
alpha masks to the objects in the image. Given the
translation parameters $\delta_x$ and $\delta_y$, an object prototype is shifted using
$$T = \mathcal{F}^{-1}\big(\mathcal{F}(P)\exp(-i 2\pi(\delta_x f_x + \delta_y f_y))\big), \qquad (3)$$
where $f_x$ and $f_y$ denote the frequencies along the horizontal and vertical directions, respectively.
Figure 2a depicts the inner structure of the PC
Cell, illustrating each of the phase correlation steps
and displaying some intermediate representations, in-
cluding the magnitude and phase components of each
input, the normalized cross-correlation matrix, and
the localization matrix L.
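To make these steps concrete, the following PyTorch sketch implements the phase-correlation and prototype-shifting operations of Equations (1)-(3) under simplifying assumptions: single-channel inputs, a prototype zero-padded to the image size, and the candidate peaks taken directly from the localization matrix. The function name pc_cell and all variable names are illustrative and do not correspond to the released code.

import torch

def pc_cell(image, prototype, n_max=3, eps=1e-10):
    # image, prototype: real tensors of shape (H, W); the prototype is
    # assumed to be zero-padded to the same size as the image.
    H, W = image.shape
    f_img = torch.fft.fft2(image)
    f_proto = torch.fft.fft2(prototype)
    # Normalized cross-correlation in the frequency domain (Eq. 1)
    cross = f_img * torch.conj(f_proto)
    loc = torch.fft.ifft2(cross / (cross.abs() + eps)).real
    # Candidate displacements: the n_max largest correlation peaks (Eq. 2)
    peaks = torch.topk(loc.flatten(), n_max).indices
    shifts = [(int(p) // W, int(p) % W) for p in peaks]
    # Shift the prototype to each candidate location (Fourier shift theorem, Eq. 3)
    freq_y = torch.fft.fftfreq(H).view(H, 1)
    freq_x = torch.fft.fftfreq(W).view(1, W)
    candidates = []
    for dy, dx in shifts:
        angle = -2 * torch.pi * (dx * freq_x + dy * freq_y)
        phase = torch.polar(torch.ones_like(angle), angle)  # exp(i * angle)
        candidates.append(torch.fft.ifft2(f_proto * phase).real)
    return shifts, candidates

In the full model, the same phase shift is also applied to the corresponding alpha mask.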
3.2 Color Module
The PC Cell module outputs translated versions of
the object prototypes and their corresponding alpha
masks. However, these translated templates need not
match the color of the object represented in the image.
This issue is solved by the Color Module, which is il-
lustrated in Figure 2b. It learns color parameters from
the input image, and transforms the translated proto-
types according to the estimated color parameters.
Given the input image and the translated object
prototype and mask, the color module first obtains a
masked version of the image containing only the rele-
vant object. This is achieved through an element-wise
product of the image with the translated alpha mask.
The masked object is fed to a neural network, which
learns the color parameters (one for gray-scale and
three for RGB images). Finally, these learned param-
eters are applied to the translated object prototypes
with a channel-wise affine transform. Further details
about the color module are given in Appendix 5.
3.3 Greedy Selection
The PC Cell and color modules produce $T = N_{\max} \times P$ translated and colorized candidate objects $\mathcal{T} = \{T_1, \dots, T_T\}$ and their corresponding translated alpha masks $\{M_1, \dots, M_T\}$. The final module of the PCDNet framework selects, among all candidates, the objects that minimize the reconstruction error with respect to the input image.

The number of possible object combinations grows exponentially with the maximum number of objects and the number of object candidates ($T^{N_{\max}}$), which quickly makes it infeasible to evaluate all possible combinations. Therefore, similarly to (Monnier et al., 2021), we propose a greedy algorithm that selects the objects that minimize the reconstruction loss in a sequential manner. The greedy nature of the algorithm reduces the number of evaluated object combinations to $T \times N_{\max}$, hence scaling to images with a large number of objects and prototypes.
The greedy object selection algorithm operates as
follows. At the first iteration, we select the object that
minimizes the reconstruction loss with respect to the
input, and add it to the list of selected objects. Then,
for each subsequent iteration, we greedily select the
object that, combined with the previously selected
ones, minimizes the reconstruction error. This error
is computed using Equation (4), which corresponds
to the mean squared error between the input image $I$ and the combination of the selected candidates $G(\mathcal{T})$. The objects are combined recursively in an overlapping manner, as shown in Equation (5), so that the first selected object ($T_1$) corresponds to the one closest to the viewer, whereas the last selected object ($T_N$) is located the furthest from the viewer:
$$\mathcal{E}(I, \mathcal{T}) = \|I - G(\mathcal{T})\|_2^2, \qquad (4)$$
$$G(\mathcal{T}) = T_{i+1} \odot (1 - M_i) + T_i \odot M_i \quad \forall i \in \{N-1, \dots, 1\}. \qquad (5)$$
An example of this image composition is dis-
played in Figure 1. This reconstruction approach in-
herently models relative depths, allowing for a simple,
yet effective, modeling of occlusions between objects.
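As an illustration of Equations (4) and (5) and of the greedy selection, a simplified PyTorch sketch is given below; it assumes that all candidates and alpha masks are full-image tensors, and the names compose_layers and greedy_select are purely illustrative.

import torch

def compose_layers(objects, masks):
    # Layered ("dead leaves") composition of Eq. 5: the first object in the
    # list is closest to the viewer, the last one is furthest away.
    canvas = torch.zeros_like(objects[0])
    for obj, mask in zip(reversed(objects), reversed(masks)):
        canvas = obj * mask + canvas * (1 - mask)
    return canvas

def greedy_select(image, candidates, cand_masks, n_max):
    # Greedily pick the n_max candidates that minimize the
    # reconstruction error of Eq. 4 with respect to the input image.
    selected_objs, selected_masks = [], []
    for _ in range(n_max):
        errors = []
        for obj, mask in zip(candidates, cand_masks):
            recon = compose_layers(selected_objs + [obj], selected_masks + [mask])
            errors.append(torch.mean((image - recon) ** 2))
        best = int(torch.stack(errors).argmin())
        selected_objs.append(candidates[best])
        selected_masks.append(cand_masks[best])
    return selected_objs, selected_masks

Note that each newly selected candidate is appended at the end of the list, i.e., placed behind the previously selected objects, consistent with Equation (5).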
3.4 Training and Implementation
Details
We train PCDNet in an end-to-end manner to recon-
struct an image as a combination of transformed ob-
ject prototypes. The training is performed by mini-
mizing the reconstruction error (L
MSE
), while regu-
larizing the prototypes to with respect to the `
1
norm
(L
L1
), and enforcing smooth alpha masks with a total
variation regularizer (Rudin and Osher, 1994) (L
TV
).
Specifically, we minimize the following loss function:
L = L
MSE
+ λ
L1
L
L1
+ λ
TV
L
TV
, (6)
L
MSE
= ||I G(T
0
)||
2
2
, (7)
L
L1
=
1
P
PP
||P||
1
, (8)
L
TV
=
1
P
MM
i, j
|M
i+1, j
M
i, j
| + |M
i, j+1
M
i, j
|,
(9)
where T
0
are the object candidates selected by
the greedy algorithm, P are the learned object pro-
totypes, and M the corresponding alpha masks. Min-
imizing Equation (6) decreases the reconstruction er-
ror between the combination of selected candidates
(G(T
0
)) and the input image while keeping the proto-
types compact, and the alpha masks smooth.
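The complete objective of Equations (6)-(9) can be sketched as follows, assuming the prototypes and alpha masks are given as lists of 2D tensors; the default weights correspond to the Tetrominoes values reported in Table 2, and all names are illustrative.

import torch

def pcdnet_loss(image, reconstruction, prototypes, masks,
                lambda_l1=1e-3, lambda_tv=1e-3):
    # Reconstruction error (Eq. 7), computed here as a mean over pixels.
    loss_mse = torch.mean((image - reconstruction) ** 2)
    # L1 regularization of the object prototypes (Eq. 8).
    loss_l1 = torch.stack([p.abs().sum() for p in prototypes]).mean()
    # Total-variation regularization of the alpha masks (Eq. 9).
    tv_terms = []
    for m in masks:
        tv = (m[1:, :] - m[:-1, :]).abs().sum() + (m[:, 1:] - m[:, :-1]).abs().sum()
        tv_terms.append(tv)
    loss_tv = torch.stack(tv_terms).mean()
    return loss_mse + lambda_l1 * loss_l1 + lambda_tv * loss_tv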
In our experiments, we noticed that the initializa-
tion and update strategy of the object prototypes is
of paramount importance for the correct performance
of the PCDNet model. The prototypes are initialized
with a small constant value (e.g., 0.2), whereas the center pixel is assigned an initial value of one, enforcing the prototypes to emerge centered in the frame.

Table 1: Object discovery evaluation results on the Tetrominoes dataset. PCDNet outperforms SOTA methods, while using a small number of learned parameters. Moreover, our PCDNet has the highest throughput out of all evaluated methods. For each metric, the best result is highlighted in boldface, whereas the second best is underlined.

Model                                      ARI (%)   Params    Imgs/s
Slot MLP (Locatello et al., 2020)          35.1      -         -
Slot Attention (Locatello et al., 2020)    99.5      229,188   18.36
ULID (Monnier et al., 2021)                99.6      659,755   52.31
IODINE (Greff et al., 2019)                99.2      408,036   16.64
PCDNet (ours)                              99.7      28,130    58.96

Figure 3: Object prototypes (top) and alpha masks (bottom) learned on the Tetrominoes dataset. Our model is able to discover all 19 pieces in an unsupervised manner.
During the first training iterations, we notice that
the greedy algorithm selects some prototypes with a
higher frequency than others, hence learning much
faster. In practice, this prevents other prototypes from
learning relevant object representations, since they are
not updated often enough. To reduce the impact of un-
even prototype discovery, we add, with a certain prob-
ability, some uniform random noise to the prototypes
during the first training iterations. This prevents the
greedy algorithm from always selecting, and hence
updating, the same object prototypes and masks.
In datasets with a background, we add a special
prototype to model a static background. In these
cases, the input images are reconstructed by overlap-
ping the objects selected by the greedy algorithm on
top of the background prototype. This background
prototype is initialized by averaging all training im-
ages, and its values are refined during training.
4 EXPERIMENTAL RESULTS
In this section, we quantitatively and qualitatively
evaluate our PCDNet framework for the tasks of unsu-
pervised object discovery and segmentation. PCDNet
is implemented in Python using the PyTorch frame-
work (Paszke et al., 2017). A detailed report of the
hyper-parameters used is given in Appendix 5.
4.1 Tetrominoes Dataset
We evaluate PCDNet for image decomposition and
object discovery on the Tetrominoes dataset (Greff
et al., 2019). This dataset contains 60,000 training
images and 320 test images of size 35×35, each com-
posed of three non-overlapping Tetris-like sprites over
a black background. The sprites belong to one out of
19 configurations and have one of six random colors.
Figure 3 displays the 19 learned object prototypes
and their corresponding alpha masks from the Tetro-
minoes dataset. We clearly observe how PCDNet ac-
curately discovers the shape of the different pieces
and their tiled texture.
Figure 4 depicts qualitative results for unsuper-
vised object detection and segmentation. In the first
three rows, PCDNet successfully decomposes the im-
ages into their object components and precisely seg-
ments the objects into semantic and instance masks.
The bottom row shows an example in which the
greedy selection algorithm leads to a failure case.
For a fair quantitative comparison with previous
works, we evaluate our PCDNet model for object seg-
mentation using the Adjusted Rand Index (Hubert and
Arabie, 1985) (ARI) on the ground truth foreground
pixels. ARI is a clustering metric that measures the
similarity between two set assignments, ignoring la-
bel permutations, and ranges from 0 (random assign-
ment) to 1 (perfect clustering). We compare the per-
formance of our approach with several existing meth-
ods: Slot MLP and Slot Attention (Locatello et al.,
2020), IODINE (Greff et al., 2019) and Unsupervised
Layered Image Decomposition (Monnier et al., 2021)
(ULID).
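Since the metric is computed only over ground-truth foreground pixels, this evaluation can be sketched with scikit-learn as follows; the assumption that background pixels carry label 0 is ours, and all names are illustrative.

import numpy as np
from sklearn.metrics import adjusted_rand_score

def foreground_ari(gt_instances, pred_instances):
    # gt_instances, pred_instances: integer instance-label maps of equal shape,
    # where label 0 is assumed to denote the background.
    gt = np.asarray(gt_instances).ravel()
    pred = np.asarray(pred_instances).ravel()
    foreground = gt != 0
    return adjusted_rand_score(gt[foreground], pred[foreground])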
Table 1 summarizes the evaluation results for
object discovery on the Tetrominoes dataset. We
observe that PCDNet outperforms SOTA models,
achieving 99.7% ARI on the Tetrominoes dataset.
PCDNet uses only a small percentage of learnable pa-
rameters compared to other methods (e.g., only 6% of
the parameters from IODINE), and has the highest in-
ference throughput (images/s). Additionally, unlike other approaches, PCDNet obtains disentangled representations for the object appearance, position, and color in a human-interpretable manner.

Figure 4: Qualitative decomposition and segmentation results on the Tetrominoes dataset. The last row shows a failure case. (a): Original image. (b): PCDNet reconstruction. (c)-(e): Colorized transformed object prototypes. (f): Semantic segmentation masks. Colors correspond to the prototype frames in Figure 3. (g): Instance segmentation masks.

Figure 5: Object prototypes learned on the Space Invaders dataset. PCDNet discovers prototypes corresponding to the different elements of the game, e.g., aliens and ships.
4.2 Space Invaders Dataset
In this experiment, we use replays from humans play-
ing the Atari game Space Invaders, extracted from the
Atari Grand Challenge dataset (Kurin et al., 2017).
PCDNet is trained to decompose the Space Invaders
images into 50 objects, belonging to one of 14 learned
prototypes of size 20 × 20.
Figure 6 depicts a qualitative comparison of our PCDNet model with SPACE (Lin et al., 2020) and Slot Attention (Locatello et al., 2020).
Slot Attention achieves an almost perfect recon-
struction of the input image. However, it fails to de-
compose the image into its object components, uni-
formly scattering the object representations across
different slots. In Figure 6 (d) one can observe that
one of the slots simultaneously encodes information
from several different objects. SPACE successfully
decomposes the image into different object compo-
nents, which are recognized as foreground objects.
Nevertheless, the reconstructions appear blurred and several objects are reconstructed incorrectly. PCDNet achieves the best results among all compared methods: our model successfully decomposes the input image into accurate object-centric representations. Additionally, PCDNet learns a semantic understanding of the objects.
Figure 6 depicts a segmentation of an image from the
Space Invaders dataset. Further qualitative results on
the Space Invaders dataset are reported in Appendix 5.
4.3 NGSIM Dataset
In this third experiment, we apply our PCDNet model
to discover vehicle prototypes from real traffic cam-
era footage from the Next Generation Simulation
(NGSIM) dataset (Alexiadis, 2006). We decompose
each frame into up to 33 different objects, belonging
to one of 30 learned vehicle prototypes.
Figure 7 depicts qualitative results on the NGSIM
dataset. We see how PCDNet is applicable to real-
world data, accurately reconstructing the input image,
while learning prototypes for different types of vehi-
cles. Interestingly, we notice how PCDNet learns the
car shadow as part of the prototype. This is a reasonable observation, since the shadows are cast towards the bottom of the image throughout the whole video.
Figure 6: Comparison of different object-centric models on the Space Invaders dataset. PCDNet is the only one of the compared methods that successfully decomposes the image into accurate object components and has semantic knowledge of the objects. The color of each object corresponds to the frame of the corresponding prototype in Figure 5.
Figure 7: Object discovery on the NGSIM dataset. PCDNet learns different vehicle prototypes in an unsupervised manner.
5 CONCLUSION
We proposed PCDNet, a novel model that decomposes an image, in a fully unsupervised manner, into its object components, which are
represented as transformed versions of a set of learned
object prototypes. PCDNet exploits the frequency-
domain representation of images to estimate the trans-
lation parameters that best align the prototypes to the
objects in the image. The structured network used by
PCDNet allows for an interpretable image decompo-
sition, which disentangles object appearance, position
and color without any external supervision. In our ex-
periments, we show how our proposed model outper-
forms existing methods for unsupervised object discovery and segmentation on a benchmark synthetic
dataset, while significantly reducing the number of
learnable parameters, having a superior throughput,
and being fully interpretable. Furthermore, we
show that the PCDNet model can also be applied to
unsupervised prototypical object discovery on more
challenging synthetic and real datasets. We hope that
our work paves the way towards further research on
phase correlation networks for unsupervised object-
centric representation learning.
ACKNOWLEDGMENTS
This work was funded by grant BE 2556/18-2 (Re-
search Unit FOR 2535 Anticipating Human Behavior)
of the German Research Foundation (DFG).
REFERENCES
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M.
(2019). Optuna: A next-generation hyperparameter
optimization framework. In 25th ACM SIGKDD Inter-
national Conference on Knowledge Discovery & Data
Mining, pages 2623–2631.
Aksoy, Y., Aydin, T. O., Smolić, A., and Pollefeys, M.
(2017). Unmixing-based soft color segmentation for
image manipulation. ACM Transactions on Graphics
(TOG), 36(2):1–19.
Alba, A., Aguilar-Ponce, R. M., Vigueras-Gómez, J. F., and
Arce-Santana, E. (2012). Phase correlation based im-
age alignment with subpixel accuracy. In Mexican
International Conference on Artificial Intelligence,
pages 171–182. Springer.
Alexiadis, V. (2006). Video-based vehicle trajectory data
collection. In Transportation Research Board 86th
Annual Meeting.
Arandjelović, R. and Zisserman, A. (2019). Object discov-
ery with a copy-pasting GAN. arXiv:1905.11369.
Bengio, Y., Courville, A., and Vincent, P. (2013). Represen-
tation learning: A review and new perspectives. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence (PAMI), 35(8):1798–1828.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N.,
Desjardins, G., and Lerchner, A. (2018). Understand-
ing disentangling in β-vae. CoRR, abs/1804.03599.
Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins,
I., Botvinick, M., and Lerchner, A. (2019). Monet:
Unsupervised scene decomposition and representa-
tion. arXiv:1901.11390.
Crawford, E. and Pineau, J. (2019). Spatially invariant un-
supervised object detection with convolutional neu-
ral networks. In Conference on Artificial Intelligence
(AAAI), volume 33, pages 3412–3420.
Engelcke, M., Jones, O. P., and Posner, I. (2021). Genesis-
v2: Inferring unordered object representations without
iterative refinement. In International Conference on
Neural Information Processing Systems (NeurIPS).
Engelcke, M., Kosiorek, A. R., Jones, O. P., and Posner, I.
(2020). Genesis: Generative scene inference and sam-
pling with object-centric latent representations. In In-
ternational Conference on Learning Representations
(ICLR).
Eslami, S. A., Heess, N., Weber, T., Tassa, Y., Szepesvari,
D., Kavukcuoglu, K., and Hinton, G. E. (2016). At-
tend, infer, repeat: Fast scene understanding with gen-
erative models. In International Conference on Neural
Information Processing Systems (NeurIPS).
Farazi, H. and Behnke, S. (2020). Motion segmentation
using frequency domain transformer networks. In
European Symposium on Artificial Neural Networks,
Computational Intelligence and Machine Learning
(ESANN).
Farazi, H., Nogga, J., and Behnke, S. (2021). Local fre-
quency domain transformer networks for video pre-
diction. In European Symposium on Artificial Neural
Networks, Computational Intelligence and Machine
Learning (ESANN).
Fritsche, M., Gu, S., and Timofte, R. (2019). Fre-
quency separation for real-world super-resolution. In
IEEE/CVF International Conference on Computer Vi-
sion Workshop (ICCVW), pages 3599–3608.
Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S.,
Bengio, Y., and Schölkopf, B. (2020). Recurrent inde-
pendent mechanisms. In International Conference on
Learning Representations (ICLR).
Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess,
C., Zoran, D., Matthey, L., Botvinick, M., and Ler-
chner, A. (2019). Multi-object representation learn-
ing with iterative variational inference. In Inter-
national Conference on Machine Learning (ICML),
pages 2424–2433.
Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosin-
ski, J. (2018). Faster neural networks straight from
JPEG. In International Conference on Neural In-
formation Processing Systems (NeurIPS), volume 31,
pages 3933–3944.
He, Z., Li, J., Liu, D., He, H., and Barber, D. (2019). Track-
ing by animation: Unsupervised learning of multi-
object attentive trackers. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1318–1327.
Hubert, L. and Arabie, P. (1985). Comparing partitions.
Journal of Classification, 2(1):193–218.
Jaderberg, M., Simonyan, K., Zisserman, A., and
Kavukcuoglu, K. (2015). Spatial transformer net-
works. In International Conference on Neural Infor-
mation Processing Systems (NeurIPS).
Jojic, N. and Frey, B. J. (2001). Learning flexible sprites
in video layers. In IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition.
(CVPR).
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learn-
ing Representations (ICLR).
Kingma, D. P. and Welling, M. (2014). Auto-encoding vari-
ational Bayes. In International Conference on Learn-
ing Representations (ICLR).
Ko, J. H., Mudassar, B., Na, T., and Mukhopad-
hyay, S. (2017). Design of an energy-efficient ac-
celerator for training of convolutional neural net-
works using frequency-domain computation. In
54th ACM/EDAC/IEEE Design Automation Confer-
ence (DAC).
Kosiorek, A. R., Kim, H., Posner, I., and Teh, Y. W. (2018).
Sequential attend, infer, repeat: Generative modelling
of moving objects. In International Conference on
Neural Information Processing Systems (NeurIPS).
Kosiorek, A. R., Sabour, S., Teh, Y. W., and Hinton, G. E.
(2019). Stacked capsule autoencoders. In Interna-
tional Conference on Neural Information Processing
Systems (NeurIPS).
Kumar, N., Verma, R., and Sethi, A. (2017). Convolutional
neural networks for wavelet domain super resolution.
Pattern Recognition Letters, 90:65–71.
Kurin, V., Nowozin, S., Hofmann, K., Beyer, L., and
Leibe, B. (2017). The Atari grand challenge dataset.
arXiv:1705.10998.
Lee, A. B., Mumford, D., and Huang, J. (2001). Occlu-
sion models for natural images: A statistical study of a
scale-invariant dead leaves model. International Jour-
nal of Computer Vision, 41(1):35–59.
Lin, C.-H., Yumer, E., Wang, O., Shechtman, E., and Lucey,
S. (2018). St-gan: Spatial transformer generative ad-
versarial networks for image compositing. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 9455–9464.
Lin, Z., Wu, Y.-F., Peri, S. V., Sun, W., Singh, G., Deng, F.,
Jiang, J., and Ahn, S. (2020). Space: Unsupervised
object-oriented scene representation via spatial atten-
tion and decomposition. In International Conference
on Learning Representations (ICLR).
Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S.,
Schölkopf, B., and Bachem, O. (2019). Challenging
common assumptions in the unsupervised learning of
disentangled representations. In International Confer-
ence on Machine Learning (ICML), pages 4114–4124.
Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran,
A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and
Kipf, T. (2020). Object-centric learning with slot at-
tention. In International Conference on Neural Infor-
mation Processing Systems (NeurIPS).
Matheron, G. (1968). Schéma booléen séquentiel de partition aléatoire. N-83 CMM, Paris School of Mines publications.
Mathieu, M., Henaff, M., and LeCun, Y. (2014). Fast train-
ing of convolutional networks through FFTs. In In-
ternational Conference on Learning Representations
(ICLR).
Monnier, T., Groueix, T., and Aubry, M. (2020). Deep
transformation-invariant clustering. In International
Conference on Neural Information Processing Sys-
tems (NeurIPS).
Monnier, T., Vincent, E., Ponce, J., and Aubry, M. (2021).
Unsupervised Layered Image Decomposition into Ob-
ject Prototypes. In IEEE/CVF International Confer-
ence on Computer Vision (ICCV).
Paschalidou, D., Gool, L. V., and Geiger, A. (2020). Learn-
ing unsupervised hierarchical part decomposition of
3D objects from a single RGB image. In IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 1060–1070.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in Py-
Torch. In International Conference on Neural Infor-
mation Processing Systems Workshops (NeurIPS-W).
Proakis, J. G. and Manolakis, D. G. (2004). Digital signal
processing. PHI Publication: New Delhi, India.
Reddy, B. S. and Chatterji, B. N. (1996). An FFT-based
technique for translation, rotation, and scale-invariant
image registration. IEEE Transactions on Image Pro-
cessing, 5(8):1266–1271.
Rudin, L. I. and Osher, S. (1994). Total variation based im-
age restoration with free local constraints. In Proceed-
ings of 1st International Conference on Image Pro-
cessing, volume 1, pages 31–35. IEEE.
Sbai, O., Couprie, C., and Aubry, M. (2020). Unsupervised
image decomposition in vector layers. In IEEE In-
ternational Conference on Image Processing (ICIP),
pages 1576–1580. IEEE.
Stanic, A., Van Steenkiste, S., and Schmidhuber, J. (2021).
Hierarchical relational inference. In Conference on
Artificial Intelligence (AAAI), pages 9730–9738.
Veerapaneni, R., Co-Reyes, J. D., Chang, M., Janner, M.,
Finn, C., Wu, J., Tenenbaum, J., and Levine, S.
(2020). Entity abstraction in visual model-based rein-
forcement learning. In Conference on Robot Learning
(CoRL), pages 1439–1456. PMLR.
Villar-Corrales, A., Schirrmacher, F., and Riess, C.
(2021). Deep learning architectural designs for super-
resolution of noisy images. In IEEE International
Conference on Acoustics, Speech and Signal Process-
ing (ICASSP), pages 1635–1639.
Weis, M. A., Chitta, K., Sharma, Y., Brendel, W., Bethge,
M., Geiger, A., and Ecker, A. S. (2021). Benchmark-
ing unsupervised object representations for video sequences. Journal of Machine Learning Research, 22:1–61.
Wolter, M., Yao, A., and Behnke, S. (2020). Object-
centered fourier motion estimation and segment-
transformation prediction. In European Symposium
on Artificial Neural Networks, Computational Intelli-
gence and Machine Learning (ESANN).
Xu, K., Qin, M., Sun, F., Wang, Y., Chen, Y.-K., and Ren,
F. (2020). Learning in the frequency domain. In
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 1740–1749.
Yang, J., Kannan, A., Batra, D., and Parikh, D. (2017). LR-
GAN: Layered recursive generative adversarial net-
works for image generation. arXiv:1703.01560.
Zhang, Y., Tsang, I. W., Luo, Y., Hu, C.-H., Lu, X., and Yu,
X. (2020). Copy and paste GAN: Face hallucination
from shaded thumbnails. In IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 7355–7364.
APPENDIX
Model and Training Details
Table 2: Hyper-parameter values used for each dataset.
Param.       Tetrominoes   Space Invaders   NGSIM
LR           0.003         0.001            0.013
Scheduler    0.1 / 5       0.1 / 5          0.6 / 2
λ_L1         10^-3         0                10^-5
λ_TV         10^-3         0                10^-2
Batch size   64            3                3
Training Details
We run our experiments on an NVIDIA RTX 3090 GPU with 24 GB of memory, using the Adam (Kingma and Ba, 2015) update rule. Additionally, we use a learning rate scheduler that linearly decreases the learning rate. We determine the values of our hyper-parameters using Optuna (Akiba et al., 2019).^1 The selected hyper-parameter values for each dataset are listed in Table 2. We report the learning rate (LR), learning rate scheduler parameters (LR factor / epochs), batch size, and regularizer weights (λ_L1 and λ_TV).
The object prototypes are initialized with a con-
stant value of 0.2 and with the center pixel set to one.
This enforces the object prototypes to emerge cen-
tered. To prevent the greedy algorithm from always
selecting the same prototypes during the first itera-
tions, we add uniform random noise U[-0.5, 0.5) to the prototypes with a probability of 80%.
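A minimal sketch of this initialization and noise-injection scheme, assuming the prototypes are stored as a single (P, H, W) parameter tensor; names and the exact placement of the update are illustrative.

import torch

def init_prototypes(num_protos, size, init_value=0.2):
    # Constant initialization with the center pixel set to one,
    # encouraging prototypes to emerge centered in the frame.
    protos = torch.full((num_protos, size, size), init_value)
    protos[:, size // 2, size // 2] = 1.0
    return torch.nn.Parameter(protos)

def perturb_prototypes(protos, prob=0.8):
    # During the first training iterations, add uniform noise in
    # [-0.5, 0.5) to each prototype with probability `prob`.
    with torch.no_grad():
        for proto in protos:
            if torch.rand(1).item() < prob:
                proto.add_(torch.rand_like(proto) - 0.5)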
Color Module
The color module, depicted in Figure 2b, is imple-
mented in a similar fashion to a Spatial Transformer
Network (Jaderberg et al., 2015) (STN). The masked
image is fed to a neural network, which extracts cer-
tain color parameters corresponding to the masked
object. The architecture of this network is summa-
rized in Table 3. The extracted color parameters
are applied to the translated object prototypes with
a channel-wise affine transform. Our color module
shares similarities with other color transformation ap-
proaches (Kosiorek et al., 2019). Despite applying the
same affine channel transform, our method differs in
the way the color parameters are computed.
^1 Hyper-parameter ranges and further details can be found at https://github.com/AIS-Bonn/Unsupervised-Decomposition-PCDNet
Table 3: Implementation details of the Color Module CNN.
Layer                     Dimension
Input                     3 × H × W
Conv. (3 × 3) + ReLU      12 × H × W
Batch Norm.               12 × H × W
Conv. (3 × 3) + ReLU      12 × H × W
Batch Norm.               12 × H × W
Global Avg. Pooling       12 × 1 × 1
Flatten                   12
Fully Connected           3
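A possible PyTorch realization of the network in Table 3 is sketched below for the RGB case; the layer sizes and the ReLU/BatchNorm ordering follow the table, while details such as the convolution padding are assumptions, and the class name ColorNet is illustrative.

import torch
import torch.nn as nn

class ColorNet(nn.Module):
    # Sketch of the Color-Module CNN from Table 3 (RGB case).
    def __init__(self, in_channels=3, hidden=12, n_params=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(hidden),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(hidden),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
        )
        self.fc = nn.Linear(hidden, n_params)

    def forward(self, masked_object):
        # masked_object: (B, 3, H, W) image masked by the translated alpha mask
        feats = self.features(masked_object).flatten(1)  # (B, hidden)
        return self.fc(feats)                            # (B, n_params)

The predicted parameters are then applied to the translated object prototype with a channel-wise affine transform, as described above.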
Greedy Selection Algorithm
Algorithm 1 illustrates the greedy selection algorithm
used to select the colorized object candidates that best
reconstruct the input image.
Qualitative Results
Figure 8 displays PCDNet segmentation results on the
Space Invaders dataset. Figure 9 depicts further qual-
itative comparisons on the Space Invaders dataset be-
tween PCDNet, SPACE (Lin et al., 2020) and Slot At-
tention (Locatello et al., 2020). Figure 10 depicts sev-
eral object prototypes and their corresponding alpha
masks learned by PCDNet on the NGSIM dataset.
Algorithm 1: Greedy Selection Algorithm.

procedure GREEDY SELECTION ALGORITHM
    Inputs:
        I                          Input image
        T = [T_1, ..., T_T]        Object candidates
        N_max                      Max. number of objects
    Returns:
        O = [O_1, ..., O_{N_max}]  Selected objects
    Algorithm:
        O ← [ ]
        for i in range [1, N_max] do
            E ← [ ]
            for t in range [1, T] do
                E_t = ||I - G(O, T_t)||_2^2
            q ← argmin(E)
            O ← [O, T_q]
        return O
Figure 8: Additional PCDNet unsupervised segmentation qualitative results on the Space Invaders dataset.
Figure 9: Additional qualitative comparison on the Space Invaders dataset.
Figure 10: Several vehicle prototypes (top) and their corresponding alpha masks (bottom) learned on the NGSIM dataset.