Dynamically Modular and Sparse General Continual Learning
Arnav Varma¹, Elahe Arani¹,² and Bahram Zonooz¹,²
¹Advanced Research Lab, NavInfo Europe, Eindhoven, The Netherlands
²Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
Keywords: Dynamic Neural Networks, Policy Gradients, Lifelong Learning.
Abstract: Real-world applications often require learning continuously from a stream of data under ever-changing conditions. When trying to learn from such non-stationary data, deep neural networks (DNNs) undergo catastrophic forgetting of previously learned information. Among the common approaches to avoid catastrophic forgetting, rehearsal-based methods have proven effective. However, they remain prone to forgetting due to task interference, as all parameters respond to all tasks. To counter this, we take inspiration from sparse coding in the brain and introduce dynamic modularity and sparsity (Dynamos) for rehearsal-based general continual learning. In this setup, the DNN learns to respond to stimuli by activating relevant subsets of neurons. We demonstrate the effectiveness of Dynamos on multiple datasets under challenging continual learning evaluation protocols. Finally, we show that our method learns representations that are modular and specialized, while maintaining reusability by activating subsets of neurons with overlaps corresponding to the similarity of stimuli. The code is available at https://github.com/NeurAI-Lab/DynamicContinualLearning.
1 INTRODUCTION
Deep neural networks (DNNs) have achieved human-level performance in several applications (Greenwald et al., 2021; Taigman et al., 2014). These networks are trained on the multiple tasks within an application with the data being received under an independent and identically distributed (i.i.d.) assumption. This assumption is satisfied by shuffling the data from all tasks and balancing and normalizing the samples from each task in the application (Hadsell et al., 2020). Consequently, DNNs can achieve human-level performance on all tasks in these applications by modeling the joint distribution of the data as a stationary process. Humans, on the other hand, can model the world from inherently non-stationary and sequential observations (French, 1999). Learning continually from the more realistic sequential and non-stationary data is crucial for many applications such as lifelong learning robots (Thrun and Mitchell, 1995) and self-driving cars (Nose et al., 2019). However, vanilla gradient-based training in such continual learning setups, with a continuous stream of tasks and data, leads to task interference in the DNN's parameters and, consequently, catastrophic forgetting of old tasks (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017). Therefore, there is a need for methods that alleviate catastrophic forgetting in continual learning.
Previous works have aimed to address these challenges in continual learning and can be broadly classified into three categories: first, regularization-based methods (Kirkpatrick et al., 2017; Schwarz et al., 2018; Zenke et al., 2017), which penalize changes to the parameters of DNNs to reduce task interference; second, parameter isolation methods (Adel et al., 2020), which assign distinct subsets of parameters to different tasks; and finally, rehearsal-based methods (Chaudhry et al., 2019), which co-train on current and stored previous samples. Among these, regularization-based and parameter isolation methods often require additional information (such as task identity at test time or task boundaries during training) or unconstrained growth of networks. These requirements fail to meet the general continual learning (GCL) desiderata (Delange et al., 2021; Farquhar and Gal, 2018), making these methods unsuitable for GCL.
Although rehearsal-based methods improve over other categories and meet the GCL desiderata, they still suffer from catastrophic forgetting through task interference in the DNN parameters, as all parameters respond to all examples and tasks. This could be resolved by inculcating task- or example-specific parameter isolation in rehearsal-based methods. However, it is worth noting that, unlike in parameter isolation methods, modularity and sparsity in the brain are not static. There is evidence that the brain responds to stimuli in a dynamic and sparse manner, with different modules or subsets of neurons responding "dynamically" to different stimuli (Graham and Field, 2006). The advantages of a dynamic and sparse response to stimuli have been explored in deep learning in stationary settings through mechanisms such as gating of modules (Veit and Belongie, 2018), early exiting (Li et al., 2017; Hu et al., 2020), and dynamic routing (Wang et al., 2018), along with training losses that incentivize sparsity of neural activations (Wu et al., 2018). These studies observed that DNNs trained to predict dynamically also learn to respond differently to different inputs. Furthermore, the learned DNNs demonstrate clustering of parameters in terms of factors such as similarity, difficulty, and resolution of inputs (Wang et al., 2018; Veit and Belongie, 2018), indicating dynamic modularity. Hence, we hypothesize that combining rehearsal-based methods with dynamic sparsity and modularity could help further mitigate catastrophic forgetting in a more biologically plausible fashion while adhering to the GCL desiderata.
To this end, we propose Dynamic Modularity and Sparsity (Dynamos), a general continual learning algorithm that combines rehearsal-based methods with dynamic modularity and sparsity. Concretely, we seek to achieve three objectives: a dynamic and sparse response to inputs with specialized modules, competent performance, and reduced catastrophic forgetting. To achieve dynamic and sparse responses to inputs, we define multiple agents in our DNN, each responsible for dynamically zeroing out filter activations of a convolutional layer based on the input to that layer. The agents are rewarded for choosing actions that remove activations (sparse responses) if the network predictions are accurate, but are penalized heavily for choosing actions that lead to inaccurate predictions. Agents also rely on prototype losses to learn specialized features. To reduce forgetting and achieve competent performance, we maintain a constant-size memory buffer in which we store previously seen examples. The network is retrained on previous examples alongside current examples, both to maintain performance on current and previous tasks and to enforce consistency between current and previous responses to stimuli. Dynamos demonstrates competent performance on multiple continual learning datasets under multiple evaluation protocols, including general continual learning. Additionally, our method demonstrates similar and overlapping responses for similar inputs and disparate responses for dissimilar inputs. Finally, we demonstrate that our method can simulate the trial-to-trial variability observed in humans (Faisal et al., 2008; Werner and Mountcastle, 1963).
2 RELATED WORK
Research in deep learning has approached the dynamic compositionality and sparsity observed in the human brain through dynamic neural networks, where different subsets of neurons or different subnetworks are activated for different stimuli (Bengio et al., 2015; Bolukbasi et al., 2017). This can be achieved through early exiting (Hu et al., 2020), dynamic routing through mixtures of experts or multiple branches (Collier et al., 2020; Wang et al., 2022), and gating of modules (Wang et al., 2018). Early exiting might force the DNN to learn specific features in its earlier layers and consequently hurt performance (Wu et al., 2018), as the earlier layers of DNNs are known to learn general-purpose features (Yosinski et al., 2014). Dynamic routing, on the other hand, would require either growing new experts in response to new tasks, which risks unconstrained growth, or initializing a larger DNN with branches corresponding to the expected number of tasks (Chen et al., 2020). Dynamic networks with gating mechanisms, meanwhile, have been shown to achieve competent performance in i.i.d. training with standard DNNs embedded with small gating networks (Veit and Belongie, 2018; Wu et al., 2018; Wang et al., 2018). These gating networks emit a discrete keep/drop decision for each module, depending on the input to the module or the DNN. As this operation is non-differentiable, a Gumbel-Softmax approximation (Veit and Belongie, 2018; Wang et al., 2018) or an agent trained with policy gradients (Wu et al., 2018; Sutton and Barto, 2018) is commonly used in each module to enable backpropagation. However, unlike the latter, the Gumbel-Softmax approximation induces an asymmetry between the forward-pass activations at inference and training (Wang et al., 2018). Furthermore, these methods are not applicable to continual learning.
Recent works have attempted to build dynamic networks for continual learning setups (Chen et al., 2020; Abati et al., 2020), where data arrive in a more realistic sequential manner. InstAParam (Chen et al., 2020), Random Path Selection (RPS) (Rajasegaran et al., 2019), and MoE (Collier et al., 2020) start with multiple parallel blocks at each layer and find input-specific or task-specific paths within this large network. Nevertheless, this requires knowledge of the number of tasks to be learned ahead of training. More importantly, initializing a large network might be unnecessary, as indicated by the competent performance of dynamic networks with gating mechanisms in i.i.d. training. In contrast, MNTDP (Veniat et al., 2021), LMC (Ostapenko et al., 2021), and CCGN (Abati et al., 2020) start with a standard architecture and grow units to respond to new data or tasks. Of these, MNTDP and LMC develop task-specific networks where all inputs from the same task elicit the same response, and therefore they do not show a truly dynamic response to stimuli. CCGN, however, composes convolutional filters dynamically to respond to stimuli, using a task-specific vector for every convolutional filter, and task boundaries to freeze frequently active filters. However, this leads to unrestrained growth and fails in the absence of task boundaries, which makes it unsuitable for general continual learning.
Therefore, we propose a general continual learning method with dynamic modularity and sparsity (Dynamos), induced through reinforcement learning agents trained with policy gradients.
3 METHODOLOGY
Humans learn continually from inherently non-stationary and sequential observations of the world without catastrophic forgetting, even without supervision about the tasks to be performed or the arrival of new tasks, while maintaining a bounded memory throughout. This involves, among other things, making multi-scale associations between current and previous observations (Goyal and Bengio, 2020) and responding sparsely and dynamically to stimuli (Graham and Field, 2006). The former concerns the consolidation of previous experiences and ensuring that learned experiences evoke a similar response. The latter concerns dynamically composing a subset of the available specialized neural modules to respond to stimuli, reusing only the relevant previously learned information. This also avoids the erasure of information irrelevant to current stimuli but relevant to previous experiences. We now formulate an approach for dynamic, sparse, and modular general continual learning that mimics these procedures with DNNs.
3.1 Dynamic, Modular, and Sparse Response to Stimuli
To achieve a dynamic, modular, and sparse response to inputs, we use a DNN F with a policy to compose a subset of the available modules in each layer to respond to the input to that layer. More specifically, we use a CNN that is incentivized to drop some channels in its activations adaptively using policy gradients (Sutton and Barto, 2018; Williams, 1992).

Let us consider the l-th convolutional layer with c_l output channels, l ∈ {1, 2, ..., L}, where L is the total number of convolutional layers in the network. The input to the convolutional layer is processed by an agent module that outputs actions a_l ∈ {0, 1}^{c_l}, where each action represents the decision to drop (action = 0) or keep (action = 1) the corresponding channel of the output of the convolutional layer. The agent module uses a self-attention network to obtain a channel-wise attention vector v_l of dimension c_l, which is converted into "action probabilities" using a probability layer. The policy for choosing actions is then sampled from a c_l-dimensional Bernoulli distribution:

p_l = \sigma(v_l), \quad \pi_l(a_l) = \prod_{i=1}^{c_l} p_{l,i}^{a_{l,i}} (1 - p_{l,i})^{1 - a_{l,i}},   (1)

where p_l ∈ (0, 1)^{c_l} is the output of the probability layer σ, and π_l is the policy function. The final output of the convolutional layer is the channel-wise product of the actions with the output of the convolution. This policy formulation is used at each convolutional layer in the CNN, leading to L agents in total. The overall structure of an agent for a convolutional layer is shown in Figure 1.
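The agent structure of Figure 1 can be written down compactly. Below is a minimal PyTorch sketch of one agent; the class name ChannelAgent, the hidden-layer reduction factor of 16, and the default temperature follow Tables 3 and 4 in the Appendix, but the exact module boundaries are our illustrative reading, not the reference implementation.

```python
import torch
import torch.nn as nn

class ChannelAgent(nn.Module):
    """One agent (bottom path of Figure 1): self-attention network,
    probability layer, and Bernoulli action sampling."""
    def __init__(self, c_in, c_out, tau=0.15, reduction=16):
        super().__init__()
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)  # match output channels
        self.pool = nn.AdaptiveAvgPool2d(1)                     # global average pooling
        self.mlp = nn.Sequential(                               # MLP with one hidden layer
            nn.Linear(c_out, c_out // reduction), nn.ReLU(),
            nn.Linear(c_out // reduction, c_out), nn.Sigmoid(),
        )
        self.tau = tau

    def forward(self, x):
        z = self.pool(self.pointwise(x)).flatten(1)   # B x c_out channel-length vector
        v = self.mlp(z) * z                           # channel-wise attention vector v_l
        p = torch.sigmoid(v / self.tau)               # probability layer sigma (Eq. 1)
        a = torch.bernoulli(p)                        # actions a_l ~ Bernoulli(p_l)
        return a, p, v
```

The sampled actions would then mask the convolutional output channel-wise, e.g. conv_out * a.view(a.size(0), -1, 1, 1), applied after batch normalization but before the ReLU (see Table 4).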
Figure 1: An overview of the Dynamos dynamic and sparse response mechanism at the l-th convolutional layer. Blacked-out activations are removed. The agent (bottom path) self-attention network uses a pointwise convolution to match output channels and global average pooling to obtain a channel-length flattened vector. This is sent through an MLP with one hidden layer and a Sigmoid activation, and multiplied with the original channel-length representation to get the channel-wise self-attention vector.

These agents are rewarded for dropping channels while making accurate predictions through a reward function. For an input X to the DNN, applied to classification with label Y:

Z, V = F(X), \quad V = [v_1 | v_2 | \ldots | v_L], \quad \hat{Y} = \operatorname{argmax} Z,   (2)

where Z refers to the logits. The ratio of activations or channels retained in layer l is given by \frac{1}{c_l} \sum_{i=1}^{c_l} a_{l,i}. So, for a target activation retention rate per layer, or "keep ratio" kr, the reward function is:

R_l(X, Y) = \begin{cases} -\left(kr - \frac{1}{c_l} \sum_{i=1}^{c_l} a_{l,i}\right)^2, & \text{if } \hat{Y} = Y \\ -\lambda \left(kr - \frac{1}{c_l} \sum_{i=1}^{c_l} a_{l,i}\right)^2, & \text{otherwise.} \end{cases}   (3)

Therefore, when the DNN's predictions are correct, each agent is rewarded for dropping enough activations to match the "keep ratio" in its corresponding convolutional layer. However, when the prediction is incorrect, each agent is penalized for the same, scaled by a constant penalty factor λ. The global nature of the reward function, achieved through its dependence on the correctness of the prediction, also enforces coordination between the agents. Following REINFORCE (Williams, 1992), the loss from all agents l = 1, 2, ..., L is:
\mathcal{L}_R(X, Y) = -\mathbb{E}_l \mathbb{E}_{\pi}\left[R_l(X, Y) \log \pi_l(a_l)\right]
= -\mathbb{E}_l \mathbb{E}_{\pi}\left[R_l(X, Y) \log \prod_{i=1}^{c_l} \left(p_{l,i} a_{l,i} + (1 - p_{l,i})(1 - a_{l,i})\right)\right]
= -\mathbb{E}_l \mathbb{E}_{\pi}\left[R_l(X, Y) \sum_{i=1}^{c_l} \log\left(p_{l,i} a_{l,i} + (1 - p_{l,i})(1 - a_{l,i})\right)\right].   (4)
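As a concrete reading of Equations 3 and 4, the following sketch computes the per-layer reward and the REINFORCE loss from the actions and probabilities collected during a forward pass. The sign convention (rewards as negative squared deviations) is one consistent reconstruction of the equations above, and the helper name reward_loss is ours.

```python
import torch

def reward_loss(actions, probs, correct, kr=0.7, lam=500.0):
    """actions, probs: lists of (B x c_l) tensors, one pair per agent;
    correct: (B,) bool tensor marking samples where Y_hat == Y."""
    per_layer = []
    for a, p in zip(actions, probs):
        dev = (kr - a.mean(dim=1)) ** 2                  # (kr - (1/c_l) sum_i a_li)^2
        r = torch.where(correct, -dev, -lam * dev)       # Eq. 3: scaled penalty when wrong
        # log pi_l(a_l) = sum_i log(p a + (1 - p)(1 - a)), valid for a in {0, 1}
        log_pi = torch.log(p * a + (1 - p) * (1 - a) + 1e-8).sum(dim=1)
        per_layer.append(-(r.detach() * log_pi).mean())  # REINFORCE term (Eq. 4)
    return torch.stack(per_layer).mean()                 # expectation over layers
```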
Although the agents, along with this loss, ensure sparse and dynamic responses from the DNN, they do not explicitly impose the specialization of compositional neural modules seen in humans. As the channel-wise "modules" activated in the DNN depend directly on the channel-wise attention vectors, we finally apply a specialization loss, which we call the prototype loss, to these vectors. Concretely, for classification, in any batch of inputs, we pull the vectors belonging to the same class together while pushing those from different classes apart. This causes different subsets of channel-wise modules to be used for inputs of different classes. When combined with a sufficiently high "keep ratio", this encourages overlap and, therefore, the reuse of relevant previously learned information (for example, reusing channels corresponding to a learned class for a newly observed class) and, consequently, the learning of general-purpose features by the modules. For an input batch X with corresponding labels Y, and the corresponding batch of concatenated channel-wise attention vectors V (Equation 2), the prototype loss is given by:

\mathcal{L}_P(X, Y) = \frac{1 + \sum_{(V_1, V_2) \in V^2 : Y_1 = Y_2} \text{MSE}(V_1, V_2)}{1 + \sum_{(V_1, V_2) \in V^2 : Y_1 \neq Y_2} \text{MSE}(V_1, V_2)},   (5)

where MSE refers to the Mean Squared Error estimator. Note that we only apply this loss to samples for which the predictions were correct.
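A minimal sketch of the prototype loss of Equation 5 on a batch of attention vectors, assuming only correctly predicted samples are passed in; using torch.cdist for the pairwise squared errors is an implementation choice of ours.

```python
import torch

def prototype_loss(V, Y):
    """V: (B, D) concatenated channel-wise attention vectors, Y: (B,) labels."""
    d2 = torch.cdist(V, V).pow(2) / V.shape[1]          # pairwise MSE(V1, V2)
    same = (Y.unsqueeze(0) == Y.unsqueeze(1)).float()
    diff = 1.0 - same
    same = same - torch.eye(len(Y), device=V.device)    # exclude self-pairs
    # minimizing pulls same-class vectors together and pushes different-class apart
    return (1.0 + (d2 * same).sum()) / (1.0 + (d2 * diff).sum())
```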
3.2 Multi-Scale Associations
As discussed earlier, one of the mechanisms employed by humans to mitigate forgetting is making multi-scale associations between current and previous experiences.

With this goal in mind, we follow recent rehearsal-based approaches (Buzzega et al., 2020; Riemer et al., 2019) that comply with GCL and use a memory buffer during training to store previously seen examples and responses. The buffer is updated using reservoir sampling (Vitter, 1985), which helps approximate the distribution of the samples seen so far (Isele and Cosgun, 2018). However, we only consider for addition to the memory buffer the subset of batch samples on which the prediction was correct. These buffer samples are replayed through the DNN alongside new samples, with losses that associate the current response with the stored previous response, resulting in consistent responses over time.
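For reference, a minimal sketch of the reservoir sampling update (Vitter, 1985) used for the buffer; the class and method names are illustrative. Each offered example ends up in the buffer with probability capacity/num_seen, which approximates the distribution of the samples seen so far.

```python
import random

class ReservoirBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []          # list of (x, y, logits, attention_vector) tuples
        self.num_seen = 0       # total examples offered so far

    def add(self, example):
        """Offer one (correctly predicted) example to the reservoir."""
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)        # fill phase
        else:
            j = random.randint(0, self.num_seen - 1)
            if j < self.capacity:            # keep with prob capacity / num_seen
                self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```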
Let M denote the memory buffer and D_T denote the current task stream, from which we sample batches (X_M, Y_M, Z_M, V_M) and (X_B, Y_B), respectively. Here, Z_M and V_M are the saved logits and channel-wise attention vectors corresponding to X_M when it was initially observed. The consistency losses associating current and previous responses are obtained during task T as follows:

Z'_M, V'_M = F(X_M),
\mathcal{L}_C(Z_M, Z'_M) = \mathbb{E}_{X_M}\left[\|Z_M - Z'_M\|_2^2\right],
\mathcal{L}_C(V_M, V'_M) = \mathbb{E}_{X_M}\left[\|V_M - V'_M\|_2^2\right].   (6)
In addition to the consistency losses, we also enforce accuracy, and dynamic sparsity and modularity, on the memory samples. Therefore, we have four sets of losses:

- Task performance loss on current and memory samples to ensure correctness on current and previous tasks. For classification, we use the cross-entropy loss (L_CE).
- Reward losses (Equation 4) on current and memory samples to ensure dynamic modularity and sparsity on current and previous tasks.
- Prototype losses (Equation 5) on current and memory samples to ensure the specialization of modules on current and previous tasks.
- Consistency losses (Equation 6) for multi-scale associations between current and previous samples.
Putting everything together, the total loss becomes:

\mathcal{L}_{total} = \mathcal{L}_{CE}(X_B, Y_B) + \gamma \mathcal{L}_R(X_B, Y_B)
+ \beta \left[\mathcal{L}_{CE}(X_M, Y_M) + \gamma \mathcal{L}_R(X_M, Y_M)\right]
+ \alpha \mathcal{L}_C(Z_M, Z'_M) + \alpha_p \mathcal{L}_C(V_M, V'_M)
+ w_p \left[\mathcal{L}_P(X_B, Y_B) + \mathcal{L}_P(X_M, Y_M)\right].   (7)
The weights given to the losses (α, α_p, β, w_p, and γ), as well as the penalty for misclassification (λ) and the keep ratio (kr) in Equation 3, are hyperparameters. Note that we employ a warm-up stage at the beginning of training, during which neither the memory buffer nor the agents are employed. This is equivalent to training using only the cross-entropy loss for this period, while the agents are kept frozen; it gives the agents a better search space when they start searching for a solution. We call the method described above Dynamic Modularity and Sparsity (Dynamos).
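The pieces above combine into a training step roughly as follows. This sketch assumes the model returns logits, concatenated attention vectors, and the agents' actions and probabilities; the helpers reward_loss and prototype_loss and the buffer interface refer to the earlier sketches and are illustrative, and edge cases (e.g., batches with no correct predictions) are ignored for brevity.

```python
import torch.nn.functional as TF

def train_step(net, buffer, x_b, y_b, hp, warmup):
    """One Dynamos step on a current batch (x_b, y_b); hp holds the
    hyperparameters of Eq. 7 (alpha, alpha_p, beta, gamma, w_p, kr, lam)."""
    z_b, v_b, acts_b, probs_b = net(x_b)
    loss = TF.cross_entropy(z_b, y_b)                 # L_CE on current batch
    if warmup:
        return loss                                   # agents and buffer unused
    ok_b = z_b.argmax(1) == y_b
    loss = loss + hp['gamma'] * reward_loss(acts_b, probs_b, ok_b, hp['kr'], hp['lam'])
    loss = loss + hp['w_p'] * prototype_loss(v_b[ok_b], y_b[ok_b])
    # replay: stored logits z_m and attention vectors v_m act as targets (Eq. 6)
    x_m, y_m, z_m, v_m = buffer.sample_batch(hp['mem_bs'])
    z_m2, v_m2, acts_m, probs_m = net(x_m)
    ok_m = z_m2.argmax(1) == y_m
    loss = loss + hp['beta'] * (TF.cross_entropy(z_m2, y_m)
                                + hp['gamma'] * reward_loss(acts_m, probs_m, ok_m,
                                                            hp['kr'], hp['lam']))
    loss = loss + hp['alpha'] * TF.mse_loss(z_m2, z_m)    # consistency on logits
    loss = loss + hp['alpha_p'] * TF.mse_loss(v_m2, v_m)  # consistency on v vectors
    loss = loss + hp['w_p'] * prototype_loss(v_m2[ok_m], y_m[ok_m])
    return loss
```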
4 EXPERIMENT DETAILS
Datasets. We show results on sequential variants of MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011): Seq-MNIST and Seq-SVHN, respectively. Seq-MNIST and Seq-SVHN divide their respective datasets into 5 tasks, with 2 classes per task. Furthermore, to test the applicability of Dynamos under general continual learning, we also use the MNIST-360 dataset (Buzzega et al., 2020).
Architecture. We use a network based on the ResNet-18 (He et al., 2016) structure, removing the latter two of its four blocks and reducing the number of filters per convolutional layer from 64 to 32. The initial convolution is reduced to 3 × 3 to work with smaller image sizes. For the baseline experiments, we do not use any agents. For our method, while agents can be used for all convolutional layers, we only use agents in the second block. We make this choice based on recent studies observing that earlier layers undergo minimal forgetting (Davari et al., 2022), are highly transferable (Yosinski et al., 2014), and are used for most examples even when learned with dynamic modularity (Abati et al., 2020). We use a sigmoid with a temperature as the probability layer in the agents, and a probability of 0.5 as the threshold for picking actions, i.e., channels, during inference. The temperature tunes the range of outputs of the self-attention layers, ensuring that the probabilities sampled to choose the actions are not too small and that enough activations are chosen to enable learning. The exact network structure used for each experiment, including the self-attention networks of the agents, can be found in the Appendix, in Table 3 and Table 4.
Settings. All methods are implemented in the Mammoth repository (https://github.com/aimagelab/mammoth/) in PyTorch 1.6 and were trained on Nvidia V100 GPUs. The hyperparameters corresponding to each experiment can be found in the Appendix, Table 5. We always maintain a keep ratio higher than 1/(number of tasks) to allow the learning of overlapping, reusable, and general-purpose modules. The temperature of the sigmoid activation of the probability layers is kept at 0.15 unless mentioned otherwise.
5 RESULTS
We evaluate Dynamos under two standard evaluation protocols that adhere to the core desiderata of GCL.
5.1 Class-Incremental Learning (CIL)
Class-incremental learning (CIL) refers to the evaluation protocol in which mutually exclusive sets of classes are presented sequentially to the network and the identity of the task is not provided at test time, which meets the core desiderata of GCL (Farquhar and Gal, 2018).

Figure 2: Quantitative results under the class-incremental learning protocol. Results are averaged across three seeds. CCGN values are taken from the original paper. The precise accuracies can be found in Table 2.

We compare against Conditional Convolutional Gated Network (CCGN) (Abati et al., 2020), which also dynamically composes convolutional filters for continual learning. We observe in Figure 2 that Dynamos shows higher accuracies on both the Seq-MNIST and Seq-SVHN datasets under all buffer sizes. Moreover, CCGN requires a separate task vector for every task per convolutional layer, resulting in unrestricted growth during training, whereas we maintain a bounded memory throughout training. Furthermore, unlike CCGN, we do not leverage task boundaries or a validation set during training. Therefore, Dynamos outperforms the previous state-of-the-art for dynamic compositional continual learning in class-incremental learning, while showing bounded memory consumption during training.
5.2 General Continual Learning (GCL)
So far, we have evaluated Dynamos under the CIL protocol. Unlike in CIL, real-world data arrives as a stream without clear task boundaries, and the same data may reappear under different distributions (e.g., different poses). Following (Buzzega et al., 2020), we approximate this setting using MNIST-360, where tasks overlap in digits (i.e., classes), reappear under different rotations (i.e., distributions), and each example is seen exactly once during training. This serves as a verification of adherence to the GCL desiderata (Farquhar and Gal, 2018; Delange et al., 2021). We study the impact of both dynamic modularity and multi-scale associations by removing them incrementally from Dynamos. When neither is used, learning is done using vanilla gradient-based training, with no strategy to counter forgetting. When only dynamic modularity is removed, the learning strategy forms our baseline, where no agents are used, simplifying the total training loss from Equation 7 to:

\mathcal{L}_{base} = \mathcal{L}_{CE}(X_B, Y_B) + \beta \mathcal{L}_{CE}(X_M, Y_M) + \alpha \mathcal{L}_C(Z_M, Z'_M).   (8)

Table 1 shows that Dynamos outperforms the baseline for all buffer sizes, proving that dynamic modularity is advantageous in GCL. Furthermore, when multi-scale associations are also removed, no buffer is used, and the DNN undergoes catastrophic forgetting. Thus, Dynamos is applicable to general continual learning, with dynamic modularity improving over the baseline. We hypothesize that dynamic modularity makes dealing with the blurred task boundaries of GCL easier by adaptively reusing relevant previously learned information, which in this case corresponds to learned filters.
6 MODEL CHARACTERISTICS
We now analyze some of the characteristics and advantages of Dynamos. For all experiments in this section, we use our model trained on Seq-MNIST with a buffer size of 500.
6.1 Dynamic Modularity and Compositionality
Humans show modular and specialized responses to stimuli (Meunier et al., 2010), with a dynamic and sparse response to inputs (Graham and Field, 2006) - a capability that we instilled in our DNN while learning a sequence of tasks by dynamically removing channel activations of convolutional layers. Therefore, we examine the task- and class-wise tendencies of the firing rates of each neuron (filter) in Figure 3.

Table 1: General continual learning results for multiple buffer sizes. All results are averaged across five seeds.

Multi-Scale Associations | Dynamic Modularity | Buffer 100 | Buffer 200 | Buffer 500
✓ | ✓ | 64.418 ± 4.095 | 79.638 ± 2.853 | 90.519 ± 0.737
✓ | ✗ | 61.192 ± 3.072 | 75.364 ± 1.259 | 88.150 ± 0.888
✗ | ✗ | 18.712 ± 0.690 (no buffer)

Figure 3: Filter activation rates on the test set for each filter with respect to classes (a) and tasks (b). For ease of visualization, we only look at the last 40 filters. Full visualizations can be found in the Appendix (Figure 7).

Figure 4: Jensen-Shannon divergences (×100) of the activation rates of class pairs on the test set.

It can be seen that Dynamos learns a soft separation of both tasks and classes, as evidenced by the per-task and per-class firing rates, respectively, of each filter. This is in contrast to static methods, where all filters react to all examples. Figure 3a further shows that this allows the learning of similar activation patterns for similar examples. For example, the MNIST digit pairs 1 and 7, and 6 and 8, which share some shape similarities, also share similarities in their activation patterns and rates. This could be attributed to the ability to reuse and drop learned filters dynamically, which causes the DNN to react similarly to similar inputs, partitioning its responses based on example similarities. Additionally, the ability to dynamically reuse filters allows DNNs to learn overlapping activation patterns for dissimilar examples and classes, instead of using completely disparate activation patterns. This also facilitates the learning of sequences of tasks without having to grow the DNN capacity or start with a larger capacity at initialization, as opposed to static parameter isolation methods for continual learning.
Following (Abbasi et al., 2022), we quantify the overlap between the activation rates of each class pair in the final layer using the Jensen-Shannon divergence (JSD) between them in Figure 4. Lower JSDs signify a higher overlap. The JSD is lowest for the class pair (1, 7) (both digits look like vertical lines): it is 1/15th of the average JSD across class pairs, and 1/42nd of that of the least overlapping class pair (1, 8) (1 is a line, 8 is formed of loops).
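A sketch of how such a class-pair overlap score can be computed, assuming per-filter activation rates are first normalized into distributions (our assumption); scipy's jensenshannon returns the distance, i.e., the square root of the divergence, hence the squaring.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def class_pair_jsd(rates_a, rates_b):
    """rates_a, rates_b: arrays of per-filter activation rates for two classes."""
    p = rates_a / rates_a.sum()      # normalize rates into distributions (assumption)
    q = rates_b / rates_b.sum()
    return jensenshannon(p, q) ** 2  # squared distance = divergence, x100 in Figure 4
```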
As per Equation 1, filters in the layer are activated based on the channel-wise attention vector v_L (see Equation 2), and these vectors are pulled together for examples of the same class and pushed away from each other for examples of different classes using the prototype loss (Equation 5). We visualize the t-SNEs of these v_L's on the test set in Figure 5 and observe that samples belonging to the same class are clustered, confirming the effectiveness of our prototype loss. Moreover, the clusters of visually similar classes are close together, which is concomitant with the JSDs and class-wise activation rates seen earlier. Class similarities are also reflected through multiple clusters for the digit 9, indicating its similarity with the digits 6 (loop) and 1 (line) in one cluster, but also with 7 (line) and 4 (line + loop) in another cluster. Finally, we observe that some examples are scattered away from their class clusters and overlap with other clusters, probably indicating that these particular examples are visually closer to other digits. Note, however, that these similar examples and classes are distributed across tasks, which explains the lower similarities in activation patterns between task pairs in Figure 3b compared to the class pairs in Figure 3a.

Figure 5: t-SNEs on the test set of class prototypes learned from channel-wise self-attention vectors for all classes.
Therefore, Dynamos is capable of learning modular and specialized units that result in an input-adaptive dynamic separation and overlap of activations, based on the extent of similarity with previously learned examples. We also contend that the overlapping activations for digits of similar shape suggest the learning of general-purpose features.
6.2 Trial-to-Trial Variability
The brain is known to show variability in its responses across trials (Faisal et al., 2008; Werner and Mountcastle, 1963): for the same stimulus, the precise neuronal response can differ between trials, a behavior absent in most conventional DNNs. In our method, this aspect of the brain can be mimicked by using Bernoulli sampling instead of thresholding to pick the keep/drop decisions at each convolutional layer, as sketched below. In Figure 6, we plot the response variability in the last convolutional layer of our DNN for the same example in four trials. We only pick responses for which the predictions were correct. It can be seen that each trial evoked a different response from the DNN. Furthermore, despite the differences, there are also some similarities in the responses: some filters are repeatedly left unused, and some filters are used in every trial. This demonstrates that Dynamos can additionally simulate the trial-to-trial variability observed in brains.
Figure 6: Trial-to-trial variability of responses to the same input in Dynamos.
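A minimal sketch of the two inference modes: thresholding at 0.5 gives the deterministic behavior used in the main experiments, while Bernoulli sampling from the same probabilities reproduces the trial-to-trial variability described above.

```python
import torch

def pick_actions(p, stochastic=False):
    """p: (B, c_l) action probabilities from an agent's probability layer."""
    if stochastic:
        return torch.bernoulli(p)   # resamples: a different channel subset per trial
    return (p > 0.5).float()        # deterministic thresholding used at inference
```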
7 CONCLUSION AND FUTURE WORK

We propose Dynamos, a method for general continual learning that simulates the dynamically modular and sparse response to stimuli observed in the brain. Dynamos rewards the input-adaptive removal of channel activations of convolutional layers using policy gradients for dynamic and sparse responses. To further induce modularity, the channel-wise self-attention vectors corresponding to each convolutional layer are pulled together for examples from the same class and pushed apart for examples from different classes; these vectors are then used to sample the keep/drop decision for the corresponding channels. Using a memory buffer, we enforce multi-scale consistency between previous and current responses to prevent forgetting. Dynamos outperforms previous baselines on multiple datasets when evaluated using the class-incremental learning (CIL) and general continual learning (GCL) protocols. Dynamos exhibits similar and overlapping responses for similar inputs, yet distinct responses to dissimilar inputs, by utilizing subsets of learned filters in an adaptive manner. We quantified the extent of class-wise overlaps and showed that the semantic similarity of classes (digits in MNIST, e.g., 1 and 7) is reflected in higher representation overlaps. We additionally visualized the channel-wise attention vectors and observed that they are clustered by class, and that the clusters of semantically similar classes lie together or overlap. Finally, we also demonstrated the ability of our method to mimic the trial-to-trial variability seen in the brain, where the same inputs produce the same outputs through different "responses", i.e., activations. Thus, we consider our work a step toward achieving dynamically modular and general-purpose continual learning.
REFERENCES
Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., and Bejnordi, B. E. (2020). Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3931-3940.
Abbasi, A., Nooralinejad, P., Braverman, V., Pirsiavash, H., and Kolouri, S. (2022). Sparsity and heterogeneous dropout for continual learning in the null space of neural activations. arXiv preprint arXiv:2203.06514.
Adel, T., Zhao, H., and Turner, R. E. (2020). Continual learning with adaptive weights (CLAW). In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015). Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297.
Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. (2017). Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527-536. PMLR.
Buzzega, P., Boschini, M., Porrello, A., Abati, D., and Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems, 33:15920-15930.
Chaudhry, A., Ranzato, M., Rohrbach, M., and Elhoseiny, M. (2019). Efficient lifelong learning with A-GEM. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Chen, H.-J., Cheng, A.-C., Juan, D.-C., Wei, W., and Sun, M. (2020). Mitigating forgetting in online continual learning via instance-aware parameterization. Advances in Neural Information Processing Systems, 33:17466-17477.
Collier, M., Kokiopoulou, E., Gesmundo, A., and Berent, J. (2020). Routing networks with co-training for continual learning. arXiv preprint arXiv:2009.04381.
Davari, M., Asadi, N., Mudur, S., Aljundi, R., and Belilovsky, E. (2022). Probing representation forgetting in supervised and unsupervised continual learning. arXiv preprint arXiv:2203.13381.
Delange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Faisal, A. A., Selen, L. P., and Wolpert, D. M. (2008). Noise in the nervous system. Nature Reviews Neuroscience, 9(4):292-303.
Farquhar, S. and Gal, Y. (2018). Towards robust evaluations of continual learning. Lifelong Learning: A Reinforcement Learning Approach Workshop at ICML.
French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135.
Goyal, A. and Bengio, Y. (2020). Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091.
Graham, D. J. and Field, D. J. (2006). Sparse coding in the neocortex. Evolution of Nervous Systems, 3:181-187.
Greenwald, N. F., Miller, G., Moen, E., Kong, A., Kagel, A., Dougherty, T., Fullaway, C. C., McIntosh, B. J., Leow, K. X., Schwartz, M. S., et al. (2021). Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nature Biotechnology, pages 1-11.
Hadsell, R., Rao, D., Rusu, A. A., and Pascanu, R. (2020). Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24(12):1028-1040.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778.
Hu, T., Chen, T., Wang, H., and Wang, Z. (2020). Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference. In ICLR.
Isele, D. and Cosgun, A. (2018). Selective experience replay for lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521-3526.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324.
Li, X., Liu, Z., Luo, P., Change Loy, C., and Tang, X. (2017). Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3193-3202.
McCloskey, M. and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109-165. Elsevier.
Meunier, D., Lambiotte, R., and Bullmore, E. T. (2010). Modular and hierarchically modular organization of brain networks. Frontiers in Neuroscience, 4:200.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning.
Nose, Y., Kojima, A., Kawabata, H., and Hironaka, T. (2019). A study on a lane keeping system using CNN for online learning of steering control from real time images. In 2019 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), pages 1-4. IEEE.
Ostapenko, O., Rodriguez, P., Caccia, M., and Charlin, L. (2021). Continual learning via local module composition. Advances in Neural Information Processing Systems, 34:30298-30312.
Rajasegaran, J., Hayat, M., Khan, S. H., Khan, F. S., and Shao, L. (2019). Random path selection for continual learning. Advances in Neural Information Processing Systems, 32.
Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., and Tesauro, G. (2019). Learning to learn without forgetting by maximizing transfer and minimizing interference. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4528-4537. PMLR.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701-1708.
Thrun, S. and Mitchell, T. M. (1995). Lifelong robot learning. Robotics and Autonomous Systems, 15(1-2):25-46.
Veit, A. and Belongie, S. (2018). Convolutional networks with adaptive inference graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3-18.
Veniat, T., Denoyer, L., and Ranzato, M. (2021). Efficient continual learning with modular networks and task-driven priors. In International Conference on Learning Representations.
Vitter, J. S. (1985). Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37-57.
Wang, X., Yu, F., Dou, Z.-Y., Darrell, T., and Gonzalez, J. E. (2018). SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 409-424.
Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., and Pfister, T. (2022). Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139-149.
Werner, G. and Mountcastle, V. B. (1963). The variability of central neural activity in a sensory system, and its implications for the central reflection of sensory events. Journal of Neurophysiology, 26(6):958-977.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256.
Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L. S., Grauman, K., and Feris, R. (2018). BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817-8826.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? Advances in Neural Information Processing Systems, 27.
Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987-3995. PMLR.
APPENDIX
Table 2: Class-incremental learning accuracies for CCGN and Dynamos corresponding to Figure 2.

Buffer Size | Model | Seq-SVHN | Seq-MNIST
500  | CCGN    | 67.45  | 94.01
500  | Dynamos | 84.815 | 97.19
1000 | CCGN    | 73.99  | 95.94
1000 | Dynamos | 87.38  | 97.51
2000 | CCGN    | 81.02  | 95.94
2000 | Dynamos | 88.54  | 97.57
Table 3: Agent architecture with input features of shape B × C_in × H × W and output features of shape B × C_out × 1 × 1, where B is the batch size, C_in is the number of channels in the input, and C_out is the number of channels expected in the output, which is the same as the number of keep/drop actions required for the corresponding convolutional layer.

Operation | Input size | Output size
Pointwise Conv          | B × C_in × h × w   | B × C_out × h × w
Average Pooling         | B × C_out × h × w  | B × C_out × 1 × 1
Reshape                 | B × C_out × 1 × 1  | B × C_out
Linear                  | B × C_out          | B × C_out/16
ReLU                    | B × C_out/16       | B × C_out/16
Linear                  | B × C_out/16       | B × C_out
Reshape                 | B × C_out          | B × C_out × 1 × 1
Sigmoid (temperature τ) | B × C_out × 1 × 1  | B × C_out × 1 × 1
Table 4: Architectures used in our experiments. For baseline experiments without dynamic compositionality, we do not use the "Agent" branch. Conv(k, n, s, p) refers to a convolutional layer with kernel size k, number of filters n, stride s, and padding p. BN refers to Batch Normalization. Linear(M, N) refers to a linear layer with M-dimensional input and N-dimensional output. Agent(C_in, C_out, τ) refers to the Agent subnetwork with C_in input channels, C_out output channels, and sigmoid temperature τ in the probability layer (see Figure 1, Section 4). The element-wise multiplication of the actions from the agents with the output of a convolutional layer is done after the application of batch normalization, if present, but before the ReLU activation function, if present. For a complete description of the Agent architecture, refer to Table 3. num_classes refers to the number of classes to be predicted.

Component | Main branch | Residual branch | Agent branch
Conv1      | Conv(3, 32, 1, 1), BN, ReLU | |
Block1     | Conv(3, 32, 1, 1), BN, ReLU; Conv(3, 32, 1, 1), BN, ReLU | Identity; Identity | Agent(64, 64, 0.15); Agent(64, 64, 0.15)
Block2     | Conv(3, 32, 1, 1), BN, ReLU; Conv(3, 32, 1, 1), BN, ReLU | Conv(1, 64, 2, 0), BN; Conv(1, 64, 2, 0), BN | Agent(64, 64, 0.15); Agent(64, 64, 0.15)
Classifier | Linear(64, num_classes) | |
Table 5: Hyperparameters for all datasets for Dynamos.

Dataset | Buffer Size | lr | #Epochs | #Warmup Ep./Itr. | Batch Size | Memory Batch Size | α | β | α_P | λ | γ | w_P | kr
Seq-MNIST | 500  | 0.07 | 1  | 10 it | 10 | 10 | 0.2 | 2.0 | 0.2 | 500 | 0.5 | 0.3 | 0.7
Seq-MNIST | 1000 | 0.07 | 1  | 10 it | 10 | 10 | 0.1 | 2.5 | 0.2 | 200 | 0.7 | 0.5 | 0.7
Seq-MNIST | 2000 | 0.07 | 1  | 10 it | 10 | 10 | 0.5 | 3.0 | 0.2 | 200 | 0.5 | 0.5 | 0.7
SVHN      | 500  | 0.07 | 70 | 10 ep | 16 | 16 | 2.0 | 3.0 | 1.0 | 500 | 1.0 | 0.5 | 0.7
SVHN      | 1000 | 0.07 | 70 | 10 ep | 16 | 16 | 2.5 | 2.0 | 0.2 | 500 | 1.0 | 0.5 | 0.7
SVHN      | 2000 | 0.07 | 70 | 10 ep | 16 | 16 | 2.5 | 2.0 | 0.2 | 500 | 1.0 | 0.5 | 0.7
MNIST-360 | 100  | 0.07 | 1  | 10 it | 16 | 16 | 0.2 | 1.0 | 0.1 | 200 | 0.5 | 0.5 | 0.7
MNIST-360 | 200  | 0.07 | 1  | 10 it | 16 | 16 | 0.2 | 1.5 | 0.1 | 200 | 1.0 | 0.5 | 0.7
MNIST-360 | 500  | 0.07 | 1  | 10 it | 16 | 16 | 0.1 | 1.5 | 0.1 | 200 | 0.3 | 0.3 | 0.7
Figure 7: Filter activation rates for each filter in each convolutional layer of Block 2, with respect to MNIST tasks and classes (class-wise and task-wise panels for Layers 1-4). Overlapping activations of tasks and classes, indicative of similarities between them, can still be observed; e.g., 1 and 7 still show very similar responses.