Transformation-Equivariant Representation Learning with
Barber-Agakov and InfoNCE Mutual Information Estimation
Marshal Arijona Sinaga, T. Basarrudin and Adila Alfa Krisnadhi
Faculty of Computer Science, University of Indonesia, Depok, Indonesia
Keywords:
Representation Learning, Transformation-Equivariant, Mutual Information Estimation, Barber-Agakov,
InfoNCE.
Abstract:
The success of deep learning on computer vision tasks is due to the convolution layer being equivariant to translation. Several works attempt to extend the notion of equivariance to more general transformations. Autoencoding variational transformation (AVT) achieves the state of the art by approaching the problem from an information theory perspective. The model involves the computation of mutual information, which leads to a more general transformation-equivariant representation model. In this research, we investigate an alternative to AVT called variational transformation-equivariant (VTE). We utilize the Barber-Agakov and information noise-contrastive (InfoNCE) mutual information estimators to optimize VTE. Furthermore, we propose a sequential mechanism that involves a self-supervised learning model called predictive-transformation to train our VTE. Experimental results demonstrate that VTE outperforms AVT on image classification tasks.
1 INTRODUCTION
The success of convolutional neural networks (CNN)
in computer vision tasks is due to the equiv-
ariant property (Hinton et al., 2011; Cohen and
Welling, 2016). Specifically, the CNN extracts features/representations that are equivariant to translation. In general, transformation-equivariance guarantees that the obtained representation changes in the same way as the image is transformed. Figure 1 illustrates transformation-equivariance. This property enables the CNN to extract a better representation structure from a given image. Some efforts have been made so that CNNs can handle various types of transformations. However, current methods are restricted to discrete transformations. Such a restriction limits the capacity of CNNs to capture visual structure under more complex transformations, including continuous and non-linear ones.
An unsupervised approach addresses this limitation on the transformations. The state of the art utilizes an autoencoder called autoencoding transformation (AET) (Zhang et al., 2019). This autoencoder reconstructs the transformation t given the original image x and the transformed image tx. The transformation t is drawn from the affine and projective families of transformations. Another work extends AET to an information theory perspective, resulting in autoencoding
Figure 1: Illustration of transformation-equivariance. The representation $z_2$ can be obtained in two ways. The first approach is to feed the transformed image $x_2 = t(x_1)$ through a function $f_\theta$, where $t$ is the transformation in image space. The other approach is to transform the representation $z_1$ through a function $r$.
variational transformation (AVT) (Qi et al., 2019). AVT adopts the notion of steerability and extends it to an information theory perspective. Steerability guarantees that we can transform the representation $\hat{z}$ the same way we transform the image $x$
without requiring $x$ (Cohen and Welling, 2017). From an information theory point of view, we can view the steerability property as the MI between the representation of the original image $\hat{z}$, the representation of the transformed image $z$, and the transformation $t$ (Qi, 2019). The goal of AVT is to maximize the MI $I(z; \hat{z}, t)$. However, computing the closed form of MI is often intractable, especially for high-dimensional data. Instead, AVT estimates the MI by decomposing it into two terms and maximizing a lower bound of one of them:

$I(z; \hat{z}, t) = I(\hat{z}; z) + I(z; t \mid \hat{z})$

AVT maximizes $I(z; t \mid \hat{z})$ by using Barber-Agakov estimation (Agakov, 2004). Results show that AVT achieves promising results on image classification tasks. However, it remains unclear whether the term $I(\hat{z}; z)$ gives the same performance as AVT.
This research investigates $I(\hat{z}; z)$ as an alternative objective to train an unsupervised transformation-equivariant representation. Later on, we call the alternative models variational transformation-equivariant (VTE). Our finding shows that maximizing $I(\hat{z}; z)$ without prior information is not feasible. Instead, we train VTE in two stages. In the first stage, we build a self-supervised learning model that maximizes the MI between the transformation $t$ and the representation of the transformed image $z$. In the second stage, we build the VTE model and incorporate the previous self-supervised learning model to maximize $I(\hat{z}; z)$. Moreover, we apply Barber-Agakov (Agakov, 2004) and InfoNCE (van den Oord et al., 2018) MI estimation to maximize the MI, resulting in three different models. Finally, we evaluate the proposed models on image classification tasks. We conduct the classification on two datasets: CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011). The main contributions of this paper are as follows:
• We design a mechanism to train VTE, an alternative version of AVT.
• We build a self-supervised model to help train the VTE.
• We utilize Barber-Agakov and InfoNCE MI estimation to train our VTE.
• We apply the proposed models as feature extractors on image classification tasks.
• We make our code available on GitHub at https://github.com/MarshalArijona/VTE.
The rest of this paper is organized as follows. In
Section 2, we cover works related to self-supervised
learning, transformation-equivariant representation,
and MI estimation methods. Section 3 gives a detailed
explanation of how to train VTE. In Section 4, we dis-
cuss the settings and results of experiments. Finally,
Section 5 gives the conclusion of this research.
2 RELATED WORKS
2.1 Transformation-Equivariant
Representation
The capsule network initiated the idea of general transformation-equivariant representations (Hinton et al., 2011; Wang and Liu, 2018). A capsule network groups neurons into capsules, each responsible for capturing specific information about the image. Each capsule is designed to be equivariant to specific transformations. However, there was no rigorous algorithm to control and guarantee the equivariance property of each capsule.
Several works attempted to build a special con-
volution network that captures more types of trans-
formation operations. Group equivariant convolution
network (Cohen and Welling, 2016) introduces p4
and p4m groups to handle the equivariance for ro-
tation, translation, and reflection. This network pro-
duces a more complex visual structure which is bene-
ficial for the classification layer. Steerable CNNs (Cohen and Welling, 2017) utilize a filter bank that is responsible for capturing the equivariance property. Another work proposed dynamic routing for capsule networks (Lenssen et al., 2018) based on the concepts of equivariant pose vectors and invariant agreements.
2.2 Self-supervised Learning
In general, self-supervised learning attempts to gen-
erate a surrogate label to enable supervised learning.
The surrogate label is synthesized from part of the data. (Noroozi and Favaro, 2016) divide the image into several patches and permute their order; subsequently, they build a context-free network to predict the index of the permutation. (Doersch et al., 2015) also treat the image as a grid of patches; the idea is to predict the relative position of a random patch given another patch that acts as the context. (Noroozi et al., 2017) attempt to count the number of visual features of transformed images, restricting the transformation operations to scaling and tiling. (Gidaris et al., 2018) apply discrete rotations to the image and ask the model to predict the type of rotation. (Dosovitskiy et al., 2016) apply several transformations to the image and subsequently assign the same surrogate label to the original
image and the transformed images. Finally, a classifier is asked to predict the surrogate class. Most self-supervised methods require some transformation operations to train the model. However, the transformations are restricted by the pseudo-label task the model is trying to solve. Furthermore, there is no explicit algorithm that guarantees the model preserves the equivariance property.
2.3 Mutual Information Estimation
Given two random variables x and y, the mutual in-
formation I quantifies the amount of information (in
nat or bit) obtained about x after observing y or vice
versa. Mathematically, MI is defined by:
$I(x; y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x \mid y)}{p(x)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y \mid x)}{p(y)}\right]$ (1)
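For concreteness, the definition in Equation (1) can be evaluated exactly when the joint distribution is a small discrete table. The following sketch (our own illustration, not from the paper; the joint table is an arbitrary choice) computes $I(x; y)$ in nats directly from Equation (1):

```python
import numpy as np

# Illustrative 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, 2)

# I(x; y) = E_{p(x,y)}[ log p(x,y) / (p(x) p(y)) ], measured in nats.
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(x; y) = {mi:.4f} nats")
```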
Computing MI is often intractable. Specifically, the intractability stems from the unavailability of the conditional distribution $p(x \mid y)$. Furthermore, we often only have samples from the joint distribution. Therefore, sample-based methods have been developed to estimate MI. In general, there are two approaches to estimating MI: taking a variational lower bound of MI and taking a variational upper bound of MI.
Barber-Agakov MI estimation (Agakov, 2004) approximates $p(x \mid y)$ with a variational distribution $q(x \mid y)$ to obtain a lower bound of MI. This approach is relatively easy to compute but gives a high bias. Other approaches transform $q(x \mid y)$ into an unnormalized form by introducing a partition function $Z$ (Nguyen et al., 2010; Donsker and Varadhan, 1975). However, those approaches require samples from the marginal distribution, which we want to avoid. InfoNCE (van den Oord et al., 2018) obtains a lower bound by incorporating a contrastive loss. The advantage of this method is the low variance of the estimate. The method requires a positive pair and negative pairs of samples. However, it is heavily dependent on the number of samples.
We can adopt the Barber-Agakov method to ob-
tain the upper bound of MI. (Alemi et al., 2017) ap-
proximate the marginal distribution p(x) with a vari-
ational distribution q(x). However, approximating
the marginal distribution without prior information is
challenging, especially on high dimensional data. The
Leave-one-out (Poole et al., 2019) method attempts to approximate $p(x)$ by taking the sum of $p(x_i \mid y_j)$ over the samples $(x_i, y_j)$, excluding $y_i$, where $y_i$ is the corresponding pair of $x_i$. Another approach incorporates a contrastive method to derive a variational upper bound of MI (Cheng et al., 2020).
In this research, we perform MI maximization with the help of Barber-Agakov and InfoNCE estimation. Both methods only require samples from the joint distribution, which fits the problem we aim to solve.
3 VARIATIONAL
TRANSFORMATION-
EQUIVARIANT
3.1 The Generalization of
Transformation-Equivariant
Representation
Let $\hat{z} \in Z$ be the representation of an image $x \in X$ and $t : X \times T \rightarrow X$ be a transformation that involves the image $x$ and a transformation operation $t$. We let $z \in Z$ be the representation of the transformed image $t(x, t) = tx$. We can view the transformation $t$ as a matrix. The representations $\hat{z}$, $z$, and the transformation $t$ satisfy the transformation-equivariant property if there exists a function $r : Z \times T \rightarrow Z$ such that

$z = r(\hat{z}, t) = \tau(t)(\hat{z})$ (2)

where $\tau(t)$ denotes a function that enables $t$ to be applied in $Z$. Note that the representation $z$ is completely determined by $t$ and $\hat{z}$ (no access to $x$ is needed). This notion is called steerability (Cohen and Welling, 2017; Qi, 2019), which enables computing $z$ by applying an independent transformation $\tau(t)$ to $\hat{z}$.
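As a minimal numerical illustration of Equation (2) (our own example, not from the paper), consider a linear feature map $f(x) = Ax$ on 2-D points and a rotation $t$. The representation of the transformed point can be obtained without access to $x$ by applying $\tau(t) = A t A^{-1}$ to $\hat{z}$:

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.normal(size=(2, 2))             # invertible (almost surely) linear feature map f(x) = A x
angle = np.pi / 4
t = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])   # transformation in image space

x = rng.normal(size=2)
z_hat = A @ x                           # representation of the original input
z_from_image = A @ (t @ x)              # route 1: encode the transformed input
tau_t = A @ t @ np.linalg.inv(A)        # steered transformation acting in Z
z_from_repr = tau_t @ z_hat             # route 2: transform the representation directly

# Both routes agree: z = r(z_hat, t) = tau(t)(z_hat), without using x.
assert np.allclose(z_from_image, z_from_repr)
```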
From the information theory perspective, we can model the notion of steerability as the MI between $z$ and $(\hat{z}, t)$ (Qi, 2019). Here the MI is parameterized by $\theta$. Therefore, the goal is to find $\theta^*$ that maximizes $I_\theta(z; \hat{z}, t)$:

$\theta^* = \arg\max_\theta I_\theta(z; \hat{z}, t)$ (3)
The form $I_\theta(z; \hat{z}, t)$ is not feasible to compute. To enable the training, we decompose $I_\theta(z; \hat{z}, t)$ into:

$I_\theta(z; \hat{z}, t) = \mathbb{E}_{p_\theta(z, \hat{z}, t)}\left[\log \frac{p_\theta(z, \hat{z}, t)}{p_\theta(z)\, p_\theta(\hat{z}, t)}\right] = \mathbb{E}_{p_\theta(z, \hat{z}, t)}\left[\log \frac{p_\theta(\hat{z})\, p_\theta(z \mid \hat{z})\, p_\theta(t \mid z, \hat{z})}{p_\theta(z)\, p_\theta(t \mid \hat{z})\, p_\theta(\hat{z})}\right]$
$= \mathbb{E}_{p_\theta(z, \hat{z}, t)}\left[\log \frac{p_\theta(z \mid \hat{z})}{p_\theta(z)}\right] + \mathbb{E}_{p_\theta(z, \hat{z}, t)}\left[\log \frac{p_\theta(t \mid z, \hat{z})}{p_\theta(t \mid \hat{z})}\right] = I_\theta(z; \hat{z}) + I_\theta(z; t \mid \hat{z})$ (4)
AVT maximizes $I_\theta(z; t \mid \hat{z})$ as its objective function. In this paper, we investigate the performance of a transformation-equivariant representation model obtained by maximizing $I_\theta(\hat{z}; z)$. We name this model variational transformation-equivariant (VTE). Both AVT and VTE encounter the intractable computation of MI: AVT needs to compute the intractable posterior $p_\theta(t \mid \hat{z}, z)$, while VTE needs to compute the intractable posterior $p_\theta(z \mid \hat{z})$. Therefore, we need to estimate the MI. AVT utilizes Barber-Agakov estimation to maximize the MI. In this research, we maximize the MI by using Barber-Agakov (Agakov, 2004) and InfoNCE (van den Oord et al., 2018) MI estimation.
3.2 Transformation as Inductive Bias
Our preliminary experiments showed that maximizing the MI $I_\theta(\hat{z}; z)$ without a prior results in a model that fails to learn. Specifically, the model assigns a trivial posterior probability to any pair $(\hat{z}, z)$. Therefore, we need to explicitly involve the transformation $t$ to train VTE.
We propose a sequential mechanism to train the VTE that comprises two phases of training. In the first phase, we model the distribution of $z$. Recall that $z$ is the representation of the transformed image. We involve the transformation $t$ to train the model. Specifically, we build a self-supervised learning model that maximizes the MI $I_{\hat{\theta}}(z; t)$, parameterized by $\hat{\theta}$. We call this model the predictive-transformation. Note that this objective function is similar to that of AVT ($I_\theta(z; t \mid \hat{z})$), except without $\hat{z}$. Due to the absence of $\hat{z}$, the obtained representation is no longer guaranteed to be equivariant. We maximize the MI by using the Barber-Agakov lower bound MI estimation, which introduces a variational distribution $q_{\check{\phi}}(t \mid z)$, parameterized by $\check{\phi}$, to approximate $p_{\hat{\theta}}(t \mid z)$.
$I_{\hat{\theta}}(t; z) = H(t) - H(t \mid z) = H(t) + \mathbb{E}_{p_{\hat{\theta}}(t, z)}\left[\log p_{\hat{\theta}}(t \mid z)\right]$
$= H(t) + \mathbb{E}_{p_{\hat{\theta}}(t, z)}\left[\log q_{\check{\phi}}(t \mid z)\right] + \mathbb{E}_{p(z)}\left[\mathrm{KL}(p_{\hat{\theta}}(t \mid z) \,\|\, q_{\check{\phi}}(t \mid z))\right]$
$\geq H(t) + \mathbb{E}_{p_{\hat{\theta}}(t, z)}\left[\log q_{\check{\phi}}(t \mid z)\right]$ (5)
$H(\cdot)$ and $\mathrm{KL}(\cdot \| \cdot)$ denote the entropy and the Kullback-Leibler divergence between two distributions, respectively. Since $H(t)$ does not depend on $\hat{\theta}$ and $\check{\phi}$, we simply maximize

$\max_{\hat{\theta}, \check{\phi}} \mathbb{E}_{p_{\hat{\theta}}(t, z)}\left[\log q_{\check{\phi}}(t \mid z)\right]$ (6)

We implement the predictive-transformation in the framework of an autoencoder. Figure 2 shows the architecture of the predictive-transformation.
Figure 2: The architecture of the predictive-transformation model. The transformed image is fed through the encoder $p_{\hat{\theta}}$. The output of the encoder is the mean $\mu_{\hat{\theta}}$ and the standard deviation $\sigma_{\hat{\theta}}$. The representation $z$ is sampled and fed through the decoder $q_{\check{\phi}}$. The output of the decoder is the mean $\mu_{\check{\phi}}$, which corresponds to the transformation $t$.
The encoder $E_{\hat{\theta}}$ represents $p_{\hat{\theta}}(z \mid tx)$. We assume that $p_{\hat{\theta}}(z \mid tx)$ follows a factored multivariate Gaussian distribution $\mathcal{N}(z; \mu_{\hat{\theta}}, \sigma_{\hat{\theta}})$. Therefore, the output of the encoder is the mean $\mu_{\hat{\theta}}$ and the standard deviation $\sigma_{\hat{\theta}}$. The decoder $D_{\check{\phi}}$ represents the variational distribution $q_{\check{\phi}}(t \mid z)$. We also assume that $q_{\check{\phi}}(t \mid z)$ is a factored multivariate Gaussian distribution, with the constraint that the standard deviation is set to one: $\mathcal{N}(t; \mu_{\check{\phi}}, I)$, where $I$ denotes the identity matrix. Thus, the output of the decoder is the mean $\mu_{\check{\phi}}$.
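The description above can be summarized in a short PyTorch sketch (our own simplified version; the layer sizes, the transformation dimensionality `t_dim`, and the convolutional backbone are placeholders, not the architecture used in the experiments). The encoder outputs the Gaussian parameters of $p_{\hat{\theta}}(z \mid tx)$, while the decoder outputs only the mean of $q_{\check{\phi}}(t \mid z)$:

```python
import torch
import torch.nn as nn

class PredTransformEncoder(nn.Module):
    """Encoder p_theta_hat(z | t(x)): outputs mean and log-variance of a factored Gaussian."""
    def __init__(self, feat_dim=512, z_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(          # placeholder for the conv backbone
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.mu = nn.Linear(feat_dim, z_dim)
        self.logvar = nn.Linear(feat_dim, z_dim)   # log-variance head, per the experimental setup

    def forward(self, tx):
        h = self.backbone(tx)
        return self.mu(h), self.logvar(h)

class PredTransformDecoder(nn.Module):
    """Decoder q_phi_check(t | z): Gaussian with identity covariance, so only the mean is output."""
    def __init__(self, z_dim=512, t_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 2048), nn.ReLU(), nn.Linear(2048, t_dim))

    def forward(self, z):
        return self.net(z)   # mean of q(t | z)
```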
For the second phase of training, we capture the distribution of $\hat{z}$ by training the VTE network. The goal is to maximize the MI $I_\theta(\hat{z}; z)$. We estimate $I_\theta(\hat{z}; z)$ by using Barber-Agakov and InfoNCE estimation. The optimization can be done through a gradient-based method such as stochastic gradient descent.
3.3 Barber-Agakov Lower Bound
Estimation
Following Equation 5, we derive the lower bound of $I_\theta(\hat{z}; z)$ by introducing a variational distribution $q_\phi(z \mid \hat{z})$, parameterized by $\phi$, to approximate $p_\theta(z \mid \hat{z})$. Furthermore, we incorporate the predictive-transformation from the first phase of training to obtain the representation $z$. We call this model VTEBarber-Agakov (VTEBA).
$I_{\theta, \hat{\theta}}(\hat{z}; z) = H(z) - H(z \mid \hat{z}) = H(z) + \mathbb{E}_{p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log p_{\theta, \hat{\theta}}(z \mid \hat{z})\right]$
$= H(z) + \mathbb{E}_{p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log q_\phi(z \mid \hat{z})\right] + \mathbb{E}_{p(\hat{z})}\left[\mathrm{KL}(p_{\theta, \hat{\theta}}(z \mid \hat{z}) \,\|\, q_\phi(z \mid \hat{z}))\right]$
$\geq H(z) + \mathbb{E}_{p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log q_\phi(z \mid \hat{z})\right]$ (7)
In this phase, we do not optimize $\hat{\theta}$ anymore. Therefore, the objective function of VTEBA is to maximize

$\max_{\theta, \phi} \mathbb{E}_{p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log q_\phi(z \mid \hat{z})\right]$ (8)
Just like the predictive-transformation, we treat VTEBA as an autoencoder. Figure 3 shows the architecture of VTEBA.
Figure 3: The architecture of VTEBA. The transformed image is fed through the predictive-transformation's encoder $p_{\hat{\theta}}$, while the original image is fed through the encoder $p_\theta$. The output of the encoder is the mean $\mu_\theta$ and the standard deviation $\sigma_\theta$. The representation $\hat{z}$ is sampled and fed through the decoder $q_\phi$. The output of the decoder is the mean $\mu_\phi$, which corresponds to the representation $z$ of the transformed image, obtained from $p_{\hat{\theta}}$.
The encoder $E_\theta$ represents $p_\theta(\hat{z} \mid x)$. We assume that $p_\theta(\hat{z} \mid x)$ follows a factored multivariate Gaussian distribution $\mathcal{N}(\hat{z}; \mu_\theta, \sigma_\theta)$. Thus, the output of $E_\theta$ is the mean $\mu_\theta$ and the standard deviation $\sigma_\theta$. The encoder $E_{\hat{\theta}}$ of the predictive-transformation is responsible for inferring $z$. Note that we freeze $\hat{\theta}$ while training VTE. We use the samples $z$ to compute $q_\phi(z \mid \hat{z})$. The decoder $D_\phi$ represents $q_\phi(z \mid \hat{z})$ and takes $\hat{z}$ as its input. We assume that $q_\phi(z \mid \hat{z})$ follows $\mathcal{N}(z; \mu_\phi, I)$ to reduce the complexity of the model. Therefore, the output of $D_\phi$ is the mean $\mu_\phi$.
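Under these Gaussian assumptions, the VTEBA objective in Equation (8) reduces, up to additive constants, to a squared-error reconstruction of $z$ from $\hat{z}$. A hedged sketch of the per-batch loss follows (our own simplification; `encoder`, `frozen_pt_encoder`, and `decoder` are hypothetical modules following the description above, and we assume the identity-covariance choice for $q_\phi$):

```python
import torch

def vteba_loss(x, tx, encoder, frozen_pt_encoder, decoder):
    """Negative Barber-Agakov bound (up to constants) for VTEBA,
    assuming q_phi(z | z_hat) = N(z; mu_phi, I)."""
    with torch.no_grad():                        # theta_hat stays frozen in the second phase
        mu_z, logvar_z = frozen_pt_encoder(tx)   # p_theta_hat(z | t x)
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)

    mu_hat, logvar_hat = encoder(x)              # p_theta(z_hat | x)
    z_hat = mu_hat + torch.exp(0.5 * logvar_hat) * torch.randn_like(mu_hat)

    mu_phi = decoder(z_hat)                      # mean of q_phi(z | z_hat)
    # -log N(z; mu_phi, I) = 0.5 * ||z - mu_phi||^2 + const
    return 0.5 * ((z - mu_phi) ** 2).sum(dim=1).mean()
```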
3.4 InfoNCE Lower Bound Estimation
The second estimation method is InfoNCE. Note that we still utilize the predictive-transformation to infer $z$. This method adopts the notion of contrastive learning to estimate MI. Recall from Equation 1 that the MI is the expectation of a density ratio between the conditional distribution and the marginal distribution. Given $I_{\theta, \hat{\theta}}(\hat{z}; z)$, then:

$I_{\theta, \hat{\theta}}(\hat{z}; z) = \mathbb{E}_{p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log \frac{p_{\theta, \hat{\theta}}(\hat{z} \mid z)}{p(\hat{z})}\right]$ (9)
Furthermore, suppose that we have a batch of samples $\{(\hat{z}_i, z_i)\}_{i=1}^{N}$. For a particular pair $(\hat{z}_i, z_i)$, we can model the density ratio $\frac{p_{\theta, \hat{\theta}}(\hat{z} = \hat{z}_i \mid z = z_i)}{p(\hat{z} = \hat{z}_i)}$ as:

$\exp(f(\hat{z}_i, z_i)) \propto \frac{p_{\theta, \hat{\theta}}(\hat{z} = \hat{z}_i \mid z = z_i)}{p(\hat{z} = \hat{z}_i)}$ (10)
where $f$ is an estimator function that takes $(\hat{z}_i, z_i)$ as its input. Note that this approximation can be unnormalized, which means that the integral of the density ratio does not have to be 1. Therefore, for each pair $(\hat{z}_i, z_i)$, we normalize it by a partition function that sums over $(\hat{z}_j, z_i)$, $1 \leq j \leq N$, $i \neq j$. We call $(\hat{z}_i, z_i)$ a positive pair, while the $(\hat{z}_j, z_i)$ are negative pairs. Mathematically, we have
$I_{\theta, \hat{\theta}}(\hat{z}; z) \geq \mathbb{E}_{\prod_j p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log \frac{\exp(f(\hat{z}_i, z_i))}{\sum_{\hat{z}_j} \exp(f(\hat{z}_j, z_i))}\right]$ (11)
$\geq \mathbb{E}_{\prod_j p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log \frac{\exp(g(\hat{z}_i)^T h(z_i))}{\sum_{\hat{z}_j} \exp(g(\hat{z}_j)^T h(z_i))}\right]$ (12)
It is known that a neural network is a universal approximator for any function (Heaton, 2018). Therefore, we implement the estimator function $f$ as a neural network. We call this model the VTEInfoNCE concatenated version. Furthermore, we can decompose the function $f$ into two different functions $g$ and $h$, which take $\hat{z}$ and $z$ as input separately; we combine their outputs by an inner product. We call this model the VTEInfoNCE separated version. Let $\hat{\phi}$ be the parameter of the estimator function $f$. The objective of the VTEInfoNCE concatenated version is to maximize

$\max_{\theta, \hat{\phi}} \mathbb{E}_{\prod_j p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log \frac{\exp(f(\hat{z}_i, z_i))}{\sum_{\hat{z}_j} \exp(f(\hat{z}_j, z_i))}\right]$ (13)
Subsequently, let $\tilde{\phi}$ and $\phi'$ be the parameters of the estimator functions $g$ and $h$, respectively. The VTEInfoNCE separated version aims to maximize

$\max_{\theta, \tilde{\phi}, \phi'} \mathbb{E}_{\prod_j p_{\theta, \hat{\theta}}(t, \hat{z}, z)}\left[\log \frac{\exp(g(\hat{z}_i)^T h(z_i))}{\sum_{\hat{z}_j} \exp(g(\hat{z}_j)^T h(z_i))}\right]$ (14)
Note that we do not optimize the parameter $\hat{\theta}$. In the next subsection, we provide the algorithm to train the predictive-transformation and the VTE models. Figure 4 shows the architecture of the VTEInfoNCE separated version.
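A hedged sketch of the separated objective in Equation (14) follows (our own implementation of the standard InfoNCE loss; `g` and `h` stand for the two critic networks, and the other in-batch samples provide the negative pairs):

```python
import torch
import torch.nn.functional as F

def vte_infonce_separated_loss(z_hat, z, g, h):
    """InfoNCE loss with a separable critic f(z_hat, z) = g(z_hat)^T h(z).

    z_hat, z: tensors of shape (N, d); row i of each forms the positive pair,
    while the remaining rows in the batch act as negatives.
    """
    scores = g(z_hat) @ h(z).t()                 # (N, N) matrix: scores[j, i] = g(z_hat_j) . h(z_i)
    targets = torch.arange(scores.size(0), device=scores.device)
    # For each z_i, cross-entropy over the scores against all z_hat_j recovers
    # -log [exp(f(z_hat_i, z_i)) / sum_j exp(f(z_hat_j, z_i))], the negative of the bound.
    return F.cross_entropy(scores.t(), targets)
```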
3.5 Algorithm
3.5.1 Training Predictive-transformation
Suppose that we have a batch consisting of $N$ samples $X = \{x_i\}_{i=1}^{N}$.
Figure 4: The architecture of VTEInfoNCE. The decoder is replaced with estimator functions $g$ and $h$, parameterized by $\tilde{\phi}$ and $\phi'$, respectively.
For each sample, we draw a transformation $t_i$ from $p(t)$. We infer $p_{\hat{\theta}}(z_i \mid t_i x_i)$ by feeding $t_i x_i$ through the encoder $E_{\hat{\theta}}$. Note that the output of $E_{\hat{\theta}}$ consists of the parameters of a probability distribution. Therefore, we need to sample from that distribution to get an instance $z_i$. However, ordinary sampling prevents the model from computing the gradient of the objective function w.r.t. the encoder parameter $\hat{\theta}$. This is undesirable if we perform the optimization through a gradient-based method. Therefore, we apply the reparameterization trick to solve the problem; this trick is famously used by the variational autoencoder (Kingma and Welling, 2013). Given the mean $\mu_{\hat{\theta}}(t_i x_i)$ and the standard deviation $\sigma_{\hat{\theta}}(t_i x_i)$, we can write the reparameterization trick as:

$z_i = \mu_{\hat{\theta}}(t_i x_i) + \sigma_{\hat{\theta}}(t_i x_i) \odot \varepsilon_i$ (15)

where $\odot$ denotes pointwise multiplication. Recall that $\mu$ and $\sigma$ are obtained from the encoder $E_{\hat{\theta}}$. Furthermore, $\varepsilon_i$ refers to noise sampled from $\mathcal{N}(\varepsilon; 0, I)$. The decoder $D_{\check{\phi}}$ takes $z$ and outputs $q_{\check{\phi}}(t \mid z)$. Since we only have the samples $z$, we translate the expectation into an unbiased Monte Carlo estimate:

$\min_{\hat{\theta}, \check{\phi}} -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}(t_i; \mu_{\check{\phi}}(z_i), I)$ (16)

By the property of the monotonic function (log), we minimize the negative of the objective function. Furthermore, the decoder $D_{\check{\phi}}$ only outputs $\mu_{\check{\phi}}$ (by the assumption on $q_{\check{\phi}}(t \mid z)$).
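Putting Equations (15) and (16) together, one training step of the predictive-transformation can be sketched as follows (our own sketch; `encoder`, `decoder`, `optimizer`, and `sample_transform` are hypothetical stand-ins for $E_{\hat{\theta}}$, $D_{\check{\phi}}$, the optimizer, and the transformation sampler $p(t)$):

```python
import torch

def predictive_transformation_step(x, encoder, decoder, optimizer, sample_transform):
    """One gradient step on Equation (16) for a batch of images x."""
    t_params, tx = sample_transform(x)          # draw t_i ~ p(t) and apply it to each image
    mu, logvar = encoder(tx)                    # parameters of p_theta_hat(z | t x)

    # Reparameterization trick (Equation 15): z = mu + sigma * eps, eps ~ N(0, I).
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    mu_t = decoder(z)                           # mean of q_phi_check(t | z), identity covariance
    # Negative log-likelihood of N(t; mu_t, I) up to constants.
    loss = 0.5 * ((t_params - mu_t) ** 2).sum(dim=1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```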
3.5.2 Training Variational
Transformation-Equivariant
We follow the same settings as for the predictive-transformation. Given samples $X = \{x_i\}_{i=1}^{N}$, we draw a transformation $t_i$ from $p(t)$ for each $x_i$. Subsequently, we use the encoder $E_\theta$ to infer $p_\theta(\hat{z}_i \mid x_i)$ and perform the reparameterization trick to generate $\hat{z}_i$:

$\hat{z}_i = \mu_\theta(x_i) + \sigma_\theta(x_i) \odot \hat{\varepsilon}_i$ (17)

The noise $\hat{\varepsilon}$ is drawn from a standard Gaussian distribution $\mathcal{N}(\hat{\varepsilon}; 0, I)$.
We use the encoder $E_{\hat{\theta}}$ to infer $p_{\hat{\theta}}(z_i \mid t_i x_i)$. Note that $E_{\hat{\theta}}$ is the encoder of the predictive-transformation; we freeze the parameter $\hat{\theta}$ since we do not optimize it. For the Barber-Agakov MI estimation, we minimize

$\min_{\theta, \phi} -\frac{1}{N} \sum_{i=1}^{N} \log \mathcal{N}(z_i; \mu_\phi(\hat{z}_i), \sigma(\hat{z}_i))$ (18)
Here we translate the expectation into an unbiased Monte Carlo estimate since we only have the samples of $\hat{z}$ and $z$. We take advantage of the monotonicity of the logarithm by turning the objective function into a minimization problem. The decoder $D_\phi$ represents $q_\phi(z \mid \hat{z})$. By the assumption on $q_\phi(z \mid \hat{z})$, $D_\phi$ only outputs the mean $\mu_\phi$.
For the InfoNCE MI estimation, we have two versions: the concatenated version and the separated version. The objective function of the concatenated version is as follows:

$\min_{\theta, \hat{\phi}} -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(f(\hat{z}_i, z_i))}{\sum_{\hat{z}_j} \exp(f(\hat{z}_j, z_i))}$ (19)
$\hat{\phi}$ denotes the parameter of the estimator function $f$. Subsequently, the objective function of the separated InfoNCE is as follows:

$\min_{\theta, \tilde{\phi}, \phi'} -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(g(\hat{z}_i)^T h(z_i))}{\sum_{\hat{z}_j} \exp(g(\hat{z}_j)^T h(z_i))}$ (20)
$\tilde{\phi}$ and $\phi'$ denote the parameters of $g$ and $h$, respectively. Both VTEBA and VTEInfoNCE can utilize a gradient-based method to optimize the objective function.
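For completeness, a hedged sketch of the second training phase is given below (our own outline; the module names and the loss functions refer to the earlier sketches, which are themselves assumptions, and any InfoNCE variant would be wrapped to the same signature):

```python
import torch

def train_vte(loader, encoder, frozen_pt_encoder, head, loss_fn,
              sample_transform, epochs=200, lr=1e-4):
    """Second-phase training: optimize theta and the decoder/critic while theta_hat stays frozen.

    loss_fn is one of the VTE objectives (e.g. vteba_loss or an InfoNCE wrapper);
    head is the corresponding decoder or critic network.
    """
    frozen_pt_encoder.eval()
    for p in frozen_pt_encoder.parameters():    # theta_hat is not optimized
        p.requires_grad_(False)

    params = list(encoder.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(epochs):
        for x in loader:                        # batches of original images
            _, tx = sample_transform(x)         # t_i ~ p(t) applied to each image
            loss = loss_fn(x, tx, encoder, frozen_pt_encoder, head)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder
```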
The difference between AVT and VTE lies in their outputs. The decoder of AVT estimates the probability distribution $p_\theta(t \mid \hat{z}, z)$ through Barber-Agakov MI estimation. On the other hand, the decoder of VTEBA and the estimator function of VTEInfoNCE aim to estimate $p_\theta(\hat{z} \mid z)$ through Barber-Agakov estimation and a noise-contrastive loss, respectively. As a result, AVT takes both $z$ and $\hat{z}$ as inputs, while VTEBA only takes one of either $\hat{z}$ or $z$. Furthermore, AVT requires one stage of training, while VTE requires two stages, because VTE needs the predictive-transformation model as the inductive bias, which requires separate training.
Table 1: Comparison of the average error rate of each model for various numbers of training examples on CIFAR-10 image classification using an MLP.
Model 50K 5K 0.5K
AVT 0.147 ± 0.0001 0.355 ± 0.0018 0.797 ± 0.0021
predictive-transformation 0.142 ± 0.0006 0.383 ± 0.0028 0.655 ± 0.0036
VTEBA 0.140 ± 0.0000 0.354 ± 0.0001 0.508 ± 0.0006
VTEInfoNCE (1) 0.148 ± 0.0005 0.310 ± 0.0015 0.550 ± 0.0015
VTEInfoNCE (2) 0.152 ± 0.0000 0.362 ± 0.0016 0.476 ± 0.0006
From the optimization side, both VTE and AVT depend on a gradient-based method and the reparameterization trick to optimize their parameters.
4 EXPERIMENTS
For the experiments, we train the predictive-transformation, VTEBA, the VTEInfoNCE separated version, and the VTEInfoNCE concatenated version. We also reproduce AVT to give a fair comparison. We then evaluate each model on image classification tasks. We utilize a multi-layer perceptron (MLP), K-nearest neighbors (K-NN), and multinomial logistic regression as the classifiers. In these experiments, we use the CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011) datasets.
4.1 CIFAR-10 Experiment
Architecture. We follow the original architecture of AVT for each model. Specifically, we use the Network-In-Network architecture (Lin et al., 2013) for the convolution blocks (encoder). We represent the distributions of $z$ and $\hat{z}$ with an MLP output of size 1024: the first 512 neurons represent the mean and the rest represent the log variance. Replacing the standard deviation with the log variance preserves numerical stability; we can recover the standard deviation by exponentiation. We implement the decoders $D_{\check{\phi}}$ and $D_\phi$ and the estimator functions $f$, $g$, and $h$ as MLPs with three layers.
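The log-variance parameterization mentioned above can be sketched as follows (our own snippet; `backbone_features` and `linear` are placeholders for the Network-In-Network output and the 1024-unit linear head):

```python
import torch

def gaussian_head(backbone_features, linear):
    """Split a 1024-d linear output into a 512-d mean and a 512-d log-variance.

    The log-variance keeps the scale parameter unconstrained; the standard
    deviation is recovered by exponentiation, which is always positive.
    """
    stats = linear(backbone_features)           # shape (N, 1024)
    mu, logvar = stats.chunk(2, dim=1)          # two (N, 512) halves
    sigma = torch.exp(0.5 * logvar)             # the "exponentiation trick"
    return mu, sigma
```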
Implementation Details. All models are optimized with adaptive moment estimation (Adam) (Kingma and Ba, 2015) with a learning rate of 1e-4. We train the models for 200 epochs. For each iteration, we chunk the data into several mini-batches, each of size 256. We utilize 1 Tesla V-100 GPU as the source of computation. Furthermore, our models follow the same settings as AVT (Qi et al., 2019) for the type of transformation. For each image, we apply a projective transformation consisting of a random translation along the horizontal and vertical axes by [-0.125, 0.125] of the width and the height of the image, random scaling with a ratio in [0.8, 1.2], and a random rotation with an angle from {0°, 90°, 180°, 270°}.
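A minimal sketch of sampling the transformation parameters with the ranges stated above (our own code; how the sampled parameters are composed into the projective matrix and applied to the image follows AVT and is omitted here):

```python
import numpy as np

def sample_transformation_params(rng=None):
    """Draw the random translation, scaling, and rotation parameters used to build t ~ p(t)."""
    rng = rng if rng is not None else np.random.default_rng()
    tx = rng.uniform(-0.125, 0.125)                # horizontal shift, fraction of image width
    ty = rng.uniform(-0.125, 0.125)                # vertical shift, fraction of image height
    scale = rng.uniform(0.8, 1.2)                  # random scaling ratio
    angle = rng.choice([0.0, 90.0, 180.0, 270.0])  # rotation angle in degrees
    return tx, ty, scale, angle
```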
Evaluation. We perform image classification to evaluate the models. First, we feed the features extracted by the encoder of each model through an MLP-based classifier. The MLP consists of three fully connected layers; the first two layers have the same size of 2048 neurons, while the last layer has a size of 10. We train the classifier on various numbers of training examples: specifically, we train the MLP on 50000, 5000, and 500 training images, respectively. We then test the MLP on 10000 images. Since the encoder is a probabilistic model, we perform the classification 5 times for each image and take the average of the error rate. This approach differs slightly from AVT, which computes the error only once.
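Since the encoder defines a distribution over representations, the reported numbers average the classifier's error over several stochastic encodings. A hedged sketch of this evaluation loop (our own code; `encode_sample` draws one representation per image and `classifier` is the trained MLP):

```python
import torch

@torch.no_grad()
def average_error_rate(images, labels, encode_sample, classifier, runs=5):
    """Classify each image `runs` times with freshly sampled representations and average the error."""
    errors = []
    for _ in range(runs):
        features = encode_sample(images)        # one stochastic sample of the representation
        preds = classifier(features).argmax(dim=1)
        errors.append((preds != labels).float().mean().item())
    return sum(errors) / len(errors)
```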
Table 1 shows the classification results using MLP
on CIFAR-10 dataset. VTEInfoNCE (1) and VTE-
InfoNCE (2) refer to VTEInfoNCE separated ver-
sion and VTEInfoNCE concatenated version, respec-
tively. The results show that VTEBA outperforms the
other models on 50000 data with a 0.14 average er-
ror rate while VTEInfoNCE (2) gives the worst re-
sult. On 5000 data, VTEInfoNCE (1) outperforms
the others with a 0.31 ± 0.0015 average error rate,
while VTEInfoNCE (2) gives the worst result with a
0.355 ± 0.00018 average error rate. Finally, VTEIn-
foNCE (2) yields the best result with a 0.476±0.0006
average error rate on 500 data, while AVT yields the
worst result with a 0.797 ± 0.0021 average error rate.
In this experiment, the best model is different for each
number of data involved during the training. In gen-
eral, the proposed models give more satisfying results
compared to the baseline model.
Subsequently, we perform the image classification task using K-NN. In this experiment, we choose K = 5, and all neighbors have an equal impact on the classification result. Table 2 shows the classification results using K-NN on the CIFAR-10 dataset. From the table, VTEBA consistently outperforms the other models for every training-set size. Furthermore, AVT
Table 2: Comparison of the average error rate of each model for various numbers of training examples on CIFAR-10 image classification using K-NN.
Model 50K 5K 0.5K
AVT 0.693 ± 0.050 0.743 ± 0.006 0.803 ±0.009
predictive-transformation 0.527 ± 0.006 0.596 ± 0.002 0.69 ± 0.011
VTEBA 0.396 ± 0.003 0.488 ± 0.003 0.578 ± 0.008
VTEInfoNCE (1) 0.429 ± 0.030 0.508 ± 0.004 0.583 ± 0.004
VTEInfoNCE (2) 0.448 ± 0.003 0.519 ± 0.001 0.596 ± 0.006
Table 3: Comparison of the average error rate of each model for various numbers of training examples on CIFAR-10 image classification using multinomial logistic regression.
Model 50K 5K 0.5K
AVT 0.622 ± 0.0023 0.759 ± 0.0023 0.840 ± 0.0029
predictive-transformation 0.539 ± 0.0025 0.698 ± 0.0018 0.793 ± 0.0046
VTEBA 0.391 ± 0.0005 0.457 ± 0.0003 0.603 ± 0.0003
VTEInfoNCE (1) 0.439 ± 0.0010 0.550 ± 0.0014 0.681 ± 0.0012
VTEInfoNCE (2) 0.423 ± 0.0015 0.523 ± 0.0011 0.656 ± 0.0028
Table 4: Average error rate on the STL-10 dataset using different classifiers.
Model MLP K-NN Logistic regression
AVT 0.522 ± 0.0000 0.707 ± 0.004 0.632 ± 0.0000
predictive-transformation 0.492 ± 0.0009 0.711 ± 0.002 0.544 ± 0.0012
VTEBA 0.460 ± 0.0007 0.607 ± 0.004 0.54 ± 0.0002
VTEInfoNCE (1) 0.365 ± 0.0000 0.475 ± 0.005 0.477 ± 0.0000
VTEInfoNCE (2) 0.363 ± 0.0005 0.473 ± 0.003 0.462 ± 0.0003
yields the highest average error rate for every training-set size.
Finally, we evaluate the representation models using multinomial logistic regression. The classifier optimizes the cross-entropy loss for 100 iterations using stochastic average gradient descent (SAG) (Schmidt et al., 2017). Furthermore, we apply an $l_2$ norm penalty to regularize the weights of the classifier. Table 3 shows the classification results using multinomial logistic regression on the CIFAR-10 dataset. VTEBA consistently outperforms the other models for every training-set size. Moreover, AVT is again the model with the highest average error rate for every training-set size.
We argue that two main factors cause the inconsistent results in Table 1. The first factor is the tendency of the MLP to suffer from overfitting, especially if there is only a small amount of data. The second factor is related to the characteristics of the representations generated by VTEBA and VTEInfoNCE. We argue that the contrastive loss leads VTEInfoNCE to generate representations that lie sparsely apart from one another. On the other hand, the objective function of VTEBA does not consider the relation of a representation to its negative samples; thus, the generated representations are naturally denser. VTEInfoNCE has an advantage on the image classification task that involves a small amount of data, since the sparsity might reduce the overfitting to some degree. However, this model performs poorly if we have a large dataset, since the MLP has to find a more complex hypothesis (large and sparse). VTEBA performs worse than VTEInfoNCE on a small dataset since the MLP can fit the data too well. On the contrary, VTEBA can give a better result on a large dataset since the representations are more concentrated in some regions of the representation space.
4.2 STL-10 Experiment
Architecture. For the STL-10 experiment, we adopt the Alexnet architecture (Krizhevsky et al., 2017) for the convolution blocks (encoder). Each block comprises a convolution and a ReLU layer, followed by a pooling layer. For the mean and the standard deviation, we adopt the same settings as in the previous experiment.
Implementation Details. In this experiment, we train the proposed models and the baseline model on 100000 unlabeled images, each of size 96 × 96. We first resize each image to 32 × 32. Subsequently, we
apply the same transformations as in the previous experiment. All models are trained for 200 epochs with a batch size of 512. As in the previous experiment, the models use 1 Tesla V-100 GPU. We use Adam to optimize the models with a learning rate of 1e-4.
Figure 5: The first row shows sample images of the STL-10 dataset. The images in the second row are obtained by applying a projective transformation to the images in the first row.
Evaluation. We use the encoder of each model to extract features for the image classification task. We train an MLP, K-NN, and multinomial logistic regression on 5000 labeled images. All classifiers follow the same settings as in the previous experiment. Finally, we ask each classifier to predict the class of 8000 unseen images. We also compute the average error rate instead of just a single error rate. Table 4 shows the image classification results on the STL-10 dataset. The results show that VTEInfoNCE (2) yields the lowest average error rate for each classifier. Furthermore, AVT achieves the highest average error rate for each classifier. The results are quite surprising, since we expected AVT to outperform the predictive-transformation. Recall that the goal of AVT is to maximize $\mathbb{E}_{p_{\hat{\theta}}(t, \hat{z}, z)}[\log q_\phi(t \mid \hat{z}, z)]$. We argue that combining $z$ and $\hat{z}$ directly to reconstruct $t$ mutually restricts the expressive power of $z$ and $\hat{z}$, and thus reduces the generalization ability of the model as a feature extractor. In contrast, VTE models $z$ and $\hat{z}$ independently, allowing the representations to fully exploit the structure of the data $tx$ and $x$ without losing the equivariance property.
5 CONCLUSIONS
In this research, we investigate alternatives to autoencoding variational transformation (AVT). We call the models variational transformation-equivariant (VTE). We find that training VTE directly fails to make the model learn a representation of the data. The reason is the absence of a prior/inductive bias that gives context to the training. Instead, we propose training the model in two phases. In the first phase, we build a probabilistic self-supervised learning model to learn the representation of the transformed image. Theoretically, this model maximizes the mutual information (MI) between the transformation and the representation of the transformed image. We call the model predictive-transformation. In the second phase of training, we build a representation model that learns the representation of the original image. In theory, we maximize the MI between the representation of the original image and the representation of the transformed image, and we leverage the previous model to obtain the representation of the transformed image. However, computing MI directly is intractable. Therefore, we utilize the Barber-Agakov and InfoNCE MI estimation methods to maximize the MI. Barber-Agakov estimation approximates the true posterior distribution with a variational distribution that is easy to compute; we call this model VTEBarber-Agakov. The InfoNCE estimation method uses a deep neural network as an estimator function of the density ratio. In this research, we propose two versions of VTE with InfoNCE estimation, which we call the VTEInfoNCE concatenated version and the VTEInfoNCE separated version. Furthermore, we evaluate the proposed models and the baseline on image classification tasks. Results on the CIFAR-10 and STL-10 datasets show that our proposed models outperform the baseline.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support of the
Tokopedia-UI AI Center of Excellence, Faculty of
Computer Science, University of Indonesia, for al-
lowing us to use its NVIDIA DGX-1 for running our
experiments.
REFERENCES
Agakov, D. B. F. (2004). The im algorithm: a variational
approach to information maximization. Advances in
neural information processing systems, 16(320):201.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy,
K. (2017). Deep variational information bottleneck.
In 5th International Conference on Learning Rep-
resentations, ICLR 2017, Toulon, France, April 24-
26, 2017, Conference Track Proceedings. OpenRe-
view.net.
Cheng, P., Hao, W., Dai, S., Liu, J., Gan, Z., and Carin, L.
(2020). CLUB: A contrastive log-ratio upper bound
of mutual information. In Proceedings of the 37th In-
ternational Conference on Machine Learning, ICML
2020, 13-18 July 2020, Virtual Event, volume 119
of Proceedings of Machine Learning Research, pages
1779–1788. PMLR.
Coates, A., Ng, A. Y., and Lee, H. (2011). An analysis of
single-layer networks in unsupervised feature learn-
ing. In Gordon, G. J., Dunson, D. B., and Dudík, M.,
editors, Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics,
AISTATS 2011, Fort Lauderdale, USA, April 11-13,
2011, volume 15 of JMLR Proceedings, pages 215–
223. JMLR.org.
Cohen, T. and Welling, M. (2016). Group equivariant con-
volutional networks. In Balcan, M. and Weinberger,
K. Q., editors, Proceedings of the 33nd International
Conference on Machine Learning, ICML 2016, New
York City, NY, USA, June 19-24, 2016, volume 48 of
JMLR Workshop and Conference Proceedings, pages
2990–2999. JMLR.org.
Cohen, T. S. and Welling, M. (2017). Steerable cnns.
In 5th International Conference on Learning Rep-
resentations, ICLR 2017, Toulon, France, April 24-
26, 2017, Conference Track Proceedings. OpenRe-
view.net.
Doersch, C., Gupta, A., and Efros, A. A. (2015). Unsuper-
vised visual representation learning by context predic-
tion. In 2015 IEEE International Conference on Com-
puter Vision, ICCV 2015, Santiago, Chile, December
7-13, 2015, pages 1422–1430. IEEE Computer Soci-
ety.
Donsker, M. D. and Varadhan, S. S. (1975). Asymptotic
evaluation of certain markov process expectations for
large time, i. Communications on Pure and Applied
Mathematics, 28(1):1–47.
Dosovitskiy, A., Fischer, P., Springenberg, J. T., Riedmiller,
M. A., and Brox, T. (2016). Discriminative unsu-
pervised feature learning with exemplar convolutional
neural networks. IEEE Trans. Pattern Anal. Mach. In-
tell., 38(9):1734–1747.
Gidaris, S., Singh, P., and Komodakis, N. (2018). Unsuper-
vised representation learning by predicting image ro-
tations. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceed-
ings. OpenReview.net.
Heaton, J. (2018). Ian goodfellow, yoshua bengio, and
aaron courville: Deep learning - the MIT press, 2016,
800 pp, ISBN: 0262035618. Genet. Program. Evolv-
able Mach., 19(1-2):305–307.
Hinton, G. E., Krizhevsky, A., and Wang, S. D. (2011).
Transforming auto-encoders. In Honkela, T., Duch,
W., Girolami, M. A., and Kaski, S., editors, Artifi-
cial Neural Networks and Machine Learning - ICANN
2011 - 21st International Conference on Artificial
Neural Networks, Espoo, Finland, June 14-17, 2011,
Proceedings, Part I, volume 6791 of Lecture Notes in
Computer Science, pages 44–51. Springer.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. In Bengio, Y. and LeCun,
Y., editors, 3rd International Conference on Learn-
ing Representations, ICLR 2015, San Diego, CA, USA,
May 7-9, 2015, Conference Track Proceedings.
Kingma, D. P. and Welling, M. (2013). Auto-encoding vari-
ational bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple
layers of features from tiny images.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-
agenet classification with deep convolutional neural
networks. Commun. ACM, 60(6):84–90.
Lenssen, J. E., Fey, M., and Libuschewski, P. (2018). Group
equivariant capsule networks. In Bengio, S., Wallach,
H. M., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R., editors, Advances in Neural
Information Processing Systems 31: Annual Con-
ference on Neural Information Processing Systems
2018, NeurIPS 2018, December 3-8, 2018, Montréal,
Canada, pages 8858–8867.
Lin, M., Chen, Q., and Yan, S. (2013). Network in network.
arXiv preprint arXiv:1312.4400.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010).
Estimating divergence functionals and the likelihood
ratio by convex risk minimization. IEEE Trans. Inf.
Theory, 56(11):5847–5861.
Noroozi, M. and Favaro, P. (2016). Unsupervised learning
of visual representations by solving jigsaw puzzles.
In Leibe, B., Matas, J., Sebe, N., and Welling, M.,
editors, Computer Vision - ECCV 2016 - 14th Euro-
pean Conference, Amsterdam, The Netherlands, Octo-
ber 11-14, 2016, Proceedings, Part VI, volume 9910
of Lecture Notes in Computer Science, pages 69–84.
Springer.
Noroozi, M., Pirsiavash, H., and Favaro, P. (2017). Repre-
sentation learning by learning to count. In IEEE Inter-
national Conference on Computer Vision, ICCV 2017,
Venice, Italy, October 22-29, 2017, pages 5899–5907.
IEEE Computer Society.
Poole, B., Ozair, S., van den Oord, A., Alemi, A., and
Tucker, G. (2019). On variational bounds of mu-
tual information. In Chaudhuri, K. and Salakhutdi-
nov, R., editors, Proceedings of the 36th International
Conference on Machine Learning, ICML 2019, 9-15
June 2019, Long Beach, California, USA, volume 97
of Proceedings of Machine Learning Research, pages
5171–5180. PMLR.
Qi, G. (2019). Learning generalized transformation equiv-
ariant representations via autoencoding transforma-
tions. CoRR, abs/1906.08628.
Qi, G., Zhang, L., Chen, C. W., and Tian, Q. (2019). AVT:
unsupervised learning of transformation equivariant
representations by autoencoding variational transfor-
mations. In 2019 IEEE/CVF International Confer-
ence on Computer Vision, ICCV 2019, Seoul, Korea
(South), October 27 - November 2, 2019, pages 8129–
8138. IEEE.
Schmidt, M., Roux, N. L., and Bach, F. R. (2017). Minimiz-
ing finite sums with the stochastic average gradient.
Math. Program., 162(1-2):83–112.
van den Oord, A., Li, Y., and Vinyals, O. (2018). Repre-
sentation learning with contrastive predictive coding.
CoRR, abs/1807.03748.
Wang, D. and Liu, Q. (2018). An optimization view on
dynamic routing between capsules. In 6th Interna-
tional Conference on Learning Representations, ICLR
2018, Vancouver, BC, Canada, April 30 - May 3, 2018,
Workshop Track Proceedings. OpenReview.net.
Zhang, L., Qi, G., Wang, L., and Luo, J. (2019). AET vs.
AED: unsupervised representation learning by auto-
encoding transformations rather than data. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion, CVPR 2019, Long Beach, CA, USA, June 16-20,
2019. Computer Vision Foundation / IEEE.
APPENDIX
5.1 Models Architecture on CIFAR-10
Dataset
Table 5 shows the architecture used to build the predictive-transformation, AVT, VTEBA, VTEInfoNCE separated, and VTEInfoNCE concatenated models on the CIFAR-10 dataset. The architecture of the encoder follows the implementation of Network-In-Network.
Table 5: Architecture used in the CIFAR-10 experiment.
Encoder Decoder, f, g, and h
Block(3, 192, 5) Linear(512, 2048)
Block(192, 160, 1) ReLU
Block(160, 96, 1) Linear(2048, 512)
Max-Pool(3, 2, 1)
Block(96, 192, 5)
Block(192, 192)
Block(192, 8)
Avg-Pool(3, 2, 1)
$\mu_\phi$: Linear(512, 512)
$\log\sigma_\phi$: Linear(512, 512)
Block(in, out, kernel) is a module consisting of Conv2D(in, out, kernel, stride=1, padding=(kernel - 1) // 2) → BatchNorm2D → ReLU. The parameter in denotes the number of input channels, out denotes the number of output channels, and kernel denotes the kernel size, which has the same height and width.
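A hedged PyTorch sketch of this Block module (our own rendering of the description above):

```python
import torch.nn as nn

class Block(nn.Module):
    """Conv2D -> BatchNorm2D -> ReLU, with 'same'-style padding for odd kernel sizes."""
    def __init__(self, in_channels, out_channels, kernel):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel,
                      stride=1, padding=(kernel - 1) // 2),
            nn.BatchNorm2d(out_channels),
            nn.ReLU())

    def forward(self, x):
        return self.net(x)
```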
5.2 Models Architecture on STL-10
Dataset
Table 6 shows the architecture used to build the predictive-transformation, AVT, VTEBA, and VTEInfoNCE models on the STL-10 dataset. In this experiment, the encoder adopts the architecture of Alexnet. Alex-Block(in, out, kernel, stride, padding) is a module consisting of Conv2D(in, out, kernel, stride, padding) → ReLU.
Table 6: Architecture used in the STL-10 experiment.
Encoder Decoder, f, g, and h
Alex-Block(3, 64, 11, 1, 2) Linear(512, 2048)
Max-Pool(3, 2, 0) BatchNorm
ReLU
Alex-Block(64, 192, 5, 1, 2) Linear(2048, 1024)
Max-Pool(3, 2, 0) BatchNorm
ReLU
Alex-Block(192, 384, 3, 1, 1) Linear(1024, 512)
Alex-Block(96, 192, 5)
Alex-Block(192, 192)
Max-Pool(3, 2, 0)
$\mu_\phi$: Linear(1024, 512)
$\log\sigma_\phi$: Linear(1024, 512)