Lifelong Machine Learning with Adaptive Multi-Agent Systems
Nicolas Verstaevel, Jérémy Boes, Julien Nigon,
Dorian d'Amico and Marie-Pierre Gleizes
IRIT, Université de Toulouse, Toulouse, France
Keywords:
Ambient Systems, Multi-Agent Systems, Lifelong Learning, Self-adaptive Systems, Self-organization.
Abstract:
Sensors and actuators are progressively invading our everyday life as well as industrial processes. They form
complex and pervasive systems usually called "ambient systems" or "cyber-physical systems". These systems
are supposed to efficiently perform various and dynamic tasks in an ever-changing environment. They need
to be able to learn and to self-adapt throughout their life, because designers cannot specify a priori all the
interactions and situations they will face. These are strong requirements that push the need for lifelong machine
learning, where devices can learn models and behaviours during their whole lifetime and are able to transfer
them to perform other tasks. This paper presents a multi-agent approach for lifelong machine learning.
1 INTRODUCTION
The rapid increase in the computational capabilities of electronic components and the drastic reduction of their costs have made it possible to populate our environment with thousands of smart devices, inducing a revolution in the way we interact with an increasingly digital world. The emergence of a socio-technical world, where technology is a key component of our environment, impacts many aspects of our existence, changing the way we live, work and interact. Applications of these Ambient Systems are wide and various, covering a spectrum from industrial applications to end-user satisfaction in a home automation context (Nigon et al., 2016a). The will to improve manufacturing processes under the Industry 4.0 banner (Jazdi, 2014), or the emergence of a Green Economy (Rifkin, 2016), are good illustrations of the usage of these cyber-physical systems.
A key feature of these systems is their pervasiveness: they are composed of a huge variety of electronic components, with sensing and actuating capacities, that interact with each other and generate increasingly complex data, all to control dynamic and interdisciplinary processes. They perform their tasks throughout their life without any interruption. The complexity of these systems, increased by the heterogeneity of their components and the difficulty of specifying a priori all the interactions that may occur, induces the need for lifelong learning features.
In this paper, we address the lifelong learning
problem in ambient systems as a problem of self-
organization of the different components. Giving
each component the ability to self-adapt to both the
task and the environment is the key toward a lifelong
learning process. First, we argue that work on lifelong
machine learning and self-organization shares similar
objectives. Then, we propose a generic framework,
based on our previous and current works on Adaptive Multi-Agent Systems, that enables lifelong learning through interactions in a set of distributed agents.
We then study the relevance of our approach in a sim-
ulation of a robotic toy. Finally, we conclude with the
challenges and perspectives opened by this work.
2 AMBIENT SYSTEMS,
LIFELONG MACHINE
LEARNING AND
SELF-ORGANIZATION
In this section, we introduce the basics of lifelong machine learning and self-organization in order to highlight the similar problems addressed by these two domains.
2.1 Machine Learning and Ambient
Systems
Machine Learning is an increasingly trending topic in
the industry, becoming an indispensable tool to design
artificial systems. Designers use Machine Learning
to face the inability to specify a priori all the interactions that could occur, or to handle the complexity of a task. The idea of designing learning machines lies at the very heart of Artificial Intelligence (AI) and has been studied since the very beginning of computer science. (Mitchell, 2006) defines the field of machine learning as the science seeking to answer the questions "how can we build computer systems that automatically improve with experience?" and "what are the fundamental laws that govern all learning processes?". Per (Mitchell, 2006), a machine is said to learn with respect to a task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E. The field of Machine Learning is traditionally classified into three categories (Russell et al., 2003), depending on how the experience E is gathered:
Supervised Learning which infers a function
from labelled data provided by an oracle.
Unsupervised Learning which looks for hidden
structures in unlabelled data provided by an exter-
nal entity.
Reinforcement Learning which learns through
the interaction between a learner and its environ-
ment to maximize a utility function.
While they differ in the way experience is gathered, all these learning algorithms intend to learn a task. The learning system is then specialised through its experience to increase its performance. The notion of task in the AI community is highly discussed, and in the absence of a common definition or test framework, it is difficult to truly compare these different approaches (Thorisson et al., 2016).
Learning in an ambient system differs in that the task to learn is a priori unknown and dynamic. These systems must handle unanticipated tasks whose specifics cannot be known beforehand. Indeed, these systems are characterized by their non-linearity (the behaviour of such systems is not defined by simple mathematical rules), the high number of entities that compose them, the high number of interactions between these entities, and their unpredictability. In addition, ambient systems are generally used in the presence of human users. Those human users have varied and multiple needs, which implies dealing with various tasks at runtime. Moreover, the devices of ambient systems are made to last several years, during which their usage may change. We have no way to know to what end they will be used five or ten years from now. Thus, ambient systems require machine learning algorithms that handle the dynamics of their environments and never stop learning. In the next sections, we introduce lifelong machine learning and draw a parallel with self-organization.
2.2 Lifelong Machine Learning:
Definitions and Objectives
Definition: Lifelong Machine Learning, or LML,
considers systems that can learn many tasks over a
lifetime from one or more domains. They efficiently
and effectively retain the knowledge they have learned
and use that knowledge to more efficiently and effec-
tively learn new tasks. (Silver et al., 2013)
The idea of designing artificial systems with Lifelong Learning capacities is not new (Thrun and Mitchell, 1995), but there is an upsurge of work, probably helped by the recent advances in neural networks (Pentina and Lampert, 2015).
Artificial systems must face increasingly dynamic and varied environments. The impact of these quick evolutions imposes severe limitations on designers' capacity to determine a priori all the events that can occur in the environment. While traditional machine learning focuses on optimising a system's performance to achieve a task, LML proposes to deal with various tasks in a variety of environments.
Research on LML is thus multidisciplinary, associating a wide range of researchers from computer science to cognitive science. Learning is considered as an ontogenetic process through which more and more complex structures are built through experience, and LML addresses the problem of reusing previously learned knowledge (Zhang, 2014).
(Thrun and Mitchell, 1995) categorise LML approaches into two categories, depending on the hypotheses put on the environment. The first one considers learners that must learn various tasks in similar environments. These algorithms build an action model which learns the impact of the learner's actions on the environment. As the environment is the same, this model is independent of the task to perform and can be reused across tasks. The second one considers that the same learner has to face multiple environments. This approach intends to model invariant features observed by the learner in its environments. In both categories, the objective is to build a model from the different experiences and to transfer or reuse previously learned knowledge to speed up the learning process when the learner has to learn a new task.
2.3 Machine Learning and
Self-Organization
The behaviour and the interactions of the components
of a system define the function of the system. When
the components change their behaviour or their in-
teractions, the function of the system is necessarily
modified. If these changes are only controlled by the
components themselves, the system is said to be self-
organizing. The concept of self-organization is stud-
ied in various domains, from biology to computer sci-
ence. The following definition is given by (Serugendo
et al., 2011).
Definition: Self-organization is the process with
which a system changes its structure without any ex-
ternal control to respond to changes in its operating
conditions and its environment.
In other words, self-organization enables a system to
change its internal behaviour to reach and maintain ef-
ficient interactions with its environment. This is very
close to Machine Learning, where a software system tries to improve its function by adjusting itself according to past experiences. Self-organization is a fine way
to achieve machine learning. It is particularly effi-
cient when a system has many heterogeneous compo-
nents with various degrees of autonomy, such as an
ambient system and its many connected devices. In
this case, self-organization implies the distribution of
control and helps to tackle complexity by focusing on
the local problems of the components rather than the
global task of the system. Therefore, designing self-
organizing systems is a key challenge to tackle com-
plexity in ambient systems.
Multi-agent systems are a particularly suitable paradigm for designing and implementing self-organizing systems. The next section presents our self-organizing multi-agent pattern for Machine Learning.
3 LIFELONG MACHINE
LEARNING WITH ADAPTIVE
MULTI-AGENT SYSTEMS
3.1 Adaptive Multi-Agent Systems
The Adaptive Multi-Agent Systems (AMAS) approach aims at solving problems in dynamic non-linear environments by a bottom-up design of cooperative and autonomous agents, where cooperation is the engine of the self-organisation process (Georgé et al., 2011). Over the years, the approach has been applied to design and develop various self-adaptive multi-agent systems to tackle real-world complex problems, such as robot control (Verstaevel et al., 2016) or heat engine optimization (Boes et al., 2014). A recurrent key feature of these systems is their ability to continuously learn how to handle the context they are plunged in, in other words to map the current state of their perceptions to actions and effects. In these applications, a set of distributed agents locally learns and exploits a model of an agent's behaviour from the interactions between the agent and its environment. In the next section, we present the Self-Adaptive Context Learning Pattern (Boes et al., 2015), our proposal to design agents with lifelong learning capacities.
3.2 The Self-Adaptive Context Learning
Pattern
The Self-Adaptive Context Learning Pattern is a re-
current pattern, designed with the AMAS approach,
that we have been using to solve complex problems
based on the autonomous observation by an agent of
the impacts of its own activity on the environment
(Nigon et al., 2016b). The pattern is composed of two
coupled entities:
An Exploitation Mechanism, whose function is to perform actions on the environment. It is the acting entity, which must decide and apply the action to be performed by a controlled system in the current context. Its decision is based on its own knowledge, including constraints from the application domain, and on additional information provided by the Adaptation Mechanism.
An Adaptation Mechanism, which builds and
maintains a model describing the current context
of the environment and its possible evolutions.
This model is dynamically built by correlating the
activity of the Exploitation Mechanism to the ob-
servation of the environment.
The Adaptation Mechanism and the Exploitation Mechanism are coupled and form a control system (Figure 1).

Figure 1: The Self-Adaptive Context Learning Pattern: view of one agent and its interaction with the environment (Boes et al., 2015).

The Adaptation Mechanism perceives information from the environment (Arrow 3) and uses it to provide information about the current context of the system to the Exploitation Mechanism. Using this information and its own internal knowledge, the Exploitation Mechanism decides on the action to perform and applies it (Arrow 2). The Exploitation Mechanism provides a feedback to the Adaptation Mechanism about the adequacy of the information it received (Arrow 4). This adequacy is evaluated through the Exploitation Mechanism's observation of the environment. The pattern is highly generic and task-independent. The Adaptation Mechanism models information that is useful for the Exploitation Mechanism, which exploits this information to decide on the action to perform. Through its actions, the Exploitation Mechanism modifies the environment, enabling the Adaptation Mechanism to update its knowledge about the environment and to provide new information about the current context.
This loop enables the system to continuously acquire
knowledge about its environment through its own ex-
perience. Through this loop, the Adaptation Mech-
anism models information about the interaction be-
tween the Exploitation Mechanism and its environ-
ment. This information is autonomously provided to
the Exploitation Mechanism which uses it to improve
its behaviour.
The pattern results from an abstraction of various applications of the AMAS approach to different problems. Thus, it is independent of the implementation of the Adaptation Mechanism and the Exploitation Mechanism. It only focuses on how these two entities must interact to learn over time, throughout their activity. In this pattern, learning is performed by the Adaptation Mechanism, whereas acting is performed by the Exploitation Mechanism. However, to be truly functional, the SACL pattern requires that the Adaptation Mechanism continuously learns from the interactions, in a lifelong way.
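To make this interaction loop concrete, here is a minimal Python sketch of how the two mechanisms could be wired together. It is only an illustration of the pattern described above: the class names, the method names (provide_context, decide, receive_feedback, evaluate) and the environment interface (perceive, apply) are our own assumptions, not an actual implementation.

```python
from abc import ABC, abstractmethod

class AdaptationMechanism(ABC):
    """Learning side: builds and maintains a model of the current context."""

    @abstractmethod
    def provide_context(self, perception):
        """Return information about the current context for the exploitation side."""

    @abstractmethod
    def receive_feedback(self, feedback):
        """Adjust the internal model according to the adequacy feedback."""

class ExploitationMechanism(ABC):
    """Acting side: decides and applies actions on the environment."""

    @abstractmethod
    def decide(self, context_info):
        """Choose an action from the context information and internal knowledge."""

    @abstractmethod
    def evaluate(self, perception, context_info):
        """Return a feedback on the adequacy of the received information."""

def sacl_loop(adaptation, exploitation, environment, steps):
    """One possible rendering of the SACL control loop (Arrows 2, 3 and 4 of Figure 1)."""
    for _ in range(steps):
        perception = environment.perceive()                      # Arrow 3
        context_info = adaptation.provide_context(perception)
        action = exploitation.decide(context_info)
        environment.apply(action)                                # Arrow 2
        feedback = exploitation.evaluate(environment.perceive(), context_info)
        adaptation.receive_feedback(feedback)                    # Arrow 4
```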
3.3 Solving Problems with SACL
The pattern has been applied to various problems where learning is a way toward self-organisation (Boes et al., 2015). In each of these problems, the recurrent key feature is to learn a model of the interactions between the agent and its environment, by extracting the context in which actions are performed and associating information with this context, so that this information can later be used to behave well. To illustrate this recurrent key feature, we show how the pattern is used to design Supervised Learning agents and Reinforcement Learning agents, and we point out similarities and differences of design.

Figure 2: The Self-Adaptive Context Learning Pattern applied to Supervised Learning problems.
3.3.1 Supervised Learning Agent
A supervised learning algorithm infers a function from labelled data provided by an oracle. Supervised learning algorithms perform either a classification or a regression, depending on whether the output is discrete or continuous. Traditionally, supervised learning algorithms separate the learning phase, where labelled situations are provided by the oracle, from the exploitation phase, where new situations have to be labelled by the agent itself.
With the SACL pattern, these two phases are not separated: the supervised learning process is performed online. To do so, the oracle is considered as an autonomous entity which is part of the environment and which may provide information to the Exploitation Mechanism. The function of the SACL pattern is then to mimic the behaviour of this external entity. Thus, the Exploitation Mechanism observes the behaviour of the oracle and, when the oracle provides a label describing a situation, compares this label given by the oracle to the label proposed by the Adaptation Mechanism. Then, it generates a feedback toward the Adaptation Mechanism to indicate whether or not the current situation is correctly mapped by the Adaptation Mechanism. The Adaptation Mechanism uses this feedback and the observation of the data that describe the current state of the environment to adjust its model. Whenever the oracle provides information about the current situation, the output of the SACL agent is the label of the oracle. On the other hand, when the oracle does not give information about the current situation, the Exploitation Mechanism exploits the information provided by the Adaptation Mechanism to label the current situation.
Figure 2 summarizes the usage of the SACL pattern for the design of Supervised Learning agents. Thus, when performing supervised learning with SACL, the Adaptation Mechanism must associate the adequate label with the current state of the environment, in other words, extract the context of each label in order to propose the adequate label to its Exploitation Mechanism when the oracle does not provide it.
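A minimal sketch of this online supervised loop is given below, reusing the illustrative interfaces from Section 3.2; the oracle interface (returning a label or None at each step) and all names are assumptions made for the example.

```python
def supervised_step(adaptation, environment, oracle):
    """One step of online supervised learning with SACL (illustrative).

    The oracle may or may not label the current situation; when it does,
    the comparison drives the feedback sent to the Adaptation Mechanism.
    """
    perception = environment.perceive()
    proposed_label = adaptation.provide_context(perception)   # label proposed by the model
    oracle_label = oracle.label(perception)                    # None if the oracle stays silent

    if oracle_label is not None:
        # The oracle spoke: output its label and tell the Adaptation Mechanism
        # whether its own proposal mapped the situation correctly.
        adaptation.receive_feedback({
            "situation": perception,
            "expected": oracle_label,
            "correct": proposed_label == oracle_label,
        })
        return oracle_label
    # No oracle: exploit the model built so far to label the situation.
    return proposed_label
```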
This form of learning with the SACL pattern has been applied to the problem of learning user preferences in ambient systems (Guivarch et al., 2012) and to learning from a set of demonstrations performed by a human operator in ambient robotic applications (Verstaevel, 2016). In these applications, examples are provided by the oracle throughout the activity of the system, without restarting or clearing the learning process.
3.3.2 Reinforcement Learning Agent
Reinforcement Learning algorithms learn through their own experience to maximize a utility function. Basically, each action is associated with a numerical reward provided by the environment, which evaluates the utility of performing this action in the current context. Learning a model of the consequences of performing an action on the utility value allows the agent to select actions that maximize its reward.
Figure 3: The Self-Adaptive Context Learning Pattern applied to Reinforcement Learning problems.

Figure 3 illustrates how we use the SACL pattern to design a Reinforcement Learning agent. Here, the Exploitation Mechanism performs actions on the environment and receives a feedback about the utility of each action. The action and the utility value are provided to the Adaptation Mechanism, which associates the current state of the environment and the performance of the action with the utility value. The Adaptation Mechanism is used to build a model of the consequences of performing an action in a certain context. This model is then used to provide the Exploitation Mechanism with information about the consequences of performing a set of actions in the current context. Using this knowledge, the Exploitation Mechanism can select the action which maximizes its reward.
Similarly to the Supervised Learning agent, the Adaptation Mechanism observes both the current state of the environment and the activity of the Exploitation Mechanism. It dynamically builds a model of the interaction between the Exploitation Mechanism and its environment. This information is then used to improve the Exploitation Mechanism's behaviour.
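A comparable sketch for the reinforcement variant is given below, again under the illustrative interfaces used so far; here provide_context is assumed to return, for each candidate action, the utility predicted by the model, and environment.apply is assumed to return the observed utility.

```python
def reinforcement_step(adaptation, environment, candidate_actions):
    """One step of reinforcement learning with SACL (illustrative).

    The Adaptation Mechanism predicts a utility for each candidate action;
    the Exploitation Mechanism greedily picks the best one, applies it, and
    feeds the observed utility back so the model can be refined.
    """
    perception = environment.perceive()
    predictions = adaptation.provide_context(perception)   # {action: predicted utility}
    action = max(candidate_actions, key=lambda a: predictions.get(a, 0.0))
    utility = environment.apply(action)                     # observed utility (reward)
    adaptation.receive_feedback({
        "situation": perception,
        "action": action,
        "utility": utility,
    })
    return action, utility
```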
Reinforcement Learning agents with SACL have been used for the optimization of heat-engine control (Boes et al., 2014).
3.4 AMOEBA: Agnostic MOdEl
Builder by self-Adaptation
Building a model of the interactions that occur between the Exploitation Mechanism and the environment is the core of the SACL pattern and its learning process. The Adaptation Mechanism must dynamically model these interactions. To do so, we designed a generic multi-agent system based on the AMAS approach, AMOEBA (Agnostic MOdEl Builder by self-Adaptation), for building models of complex systems. In AMOEBA, a set of Context Agents is dynamically created to discretize the environmental state into events. To design this system, we made some hypotheses about the environment:
The observed world is deterministic: the same
causes produce the same effects.
The environment perceived by AMOEBA is not
the world as a whole. AMOEBA therefore has
only an incomplete view of the world.
Perceived information may be inaccurate.
The collected data are orderable values (real, inte-
ger ...).
The dynamics of evolution of the observed world
variables are potentially very different from one
to another.
In this section, we present the structure and behaviour of Context Agents and the different mechanisms that enable Context Agents to collectively build a model of the interactions. In a didactic way, we consider that the environmental state is observed through a set of sensors described by a vector E ∈ R^n. However, the model is functional with any orderable values. Basically, each component e_i ∈ E describes the value of one datum.
3.4.1 Context Agents: Definitions
Context Agents are the core of learning in AMOEBA. A Context Agent may be assimilated to a unit of knowledge which locally describes the evolution of a particular variable of the environment. At start, the system contains no Context Agents, as it does not possess any a priori knowledge about the environment. Context Agents are created at runtime to model new interactions. They are composed of a set of validity ranges and a local model.

Definition: A Context Agent c ∈ C is composed of a set of validity ranges V and a local model L.

Each validity range v is an interval associated with a particular value e_i ∈ E of the environmental state E. Each e_i is modelled by a validity range within each Context Agent. The intervals are defined by a minimal and a maximal float value.

Definition: A validity range v_i ∈ V is associated with a value e_i ∈ E. A particular v is an interval [v_min, v_max] ⊆ [e_min, e_max].

A validity range enables to discretize the environmental state into two types of events: valid or invalid. These events are determined by comparing the current value of the variable e_i with the range of v.

Definition: A validity range v_i ∈ V is said to be valid if and only if e_i ∈ v_i, and invalid otherwise.

In the same way, a Context Agent is a discretization of the environmental state E into the same two types of events.

Definition: A Context Agent c ∈ C is said to be valid if and only if ∀ v_i ∈ V, e_i ∈ [v_min, v_max], and invalid otherwise.

The local model L is a linear regression of the environmental state, L : E → R. Each Context Agent disposes of its own local model and thus of its own linear regression.
The function of a Context Agent is dual:
To send the value computed by its local model to
the Exploitation Mechanism whenever the Con-
text Agent is valid and,
Thanks to the feedbacks from the Exploitation
Mechanism, to adjust both its local model and its
validity ranges so that the information sent to the
Exploitation Mechanism is useful for its activity.
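To make these definitions concrete, the following Python sketch shows a Context Agent with its validity ranges and linear local model. The attribute and method names are ours, and the local model is reduced to a plain linear function; AMOEBA's actual data structures and update rules are not reproduced here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ValidityRange:
    """Interval [v_min, v_max] attached to one perceived variable e_i."""
    v_min: float
    v_max: float

    def is_valid(self, value: float) -> bool:
        return self.v_min <= value <= self.v_max

@dataclass
class ContextAgent:
    """One unit of knowledge: a set of validity ranges plus a linear local model."""
    ranges: List[ValidityRange]
    weights: List[float]          # coefficients of the local linear model
    bias: float = 0.0

    def is_valid(self, state: List[float]) -> bool:
        # The agent is valid iff every perceived value lies inside its validity range.
        return all(r.is_valid(e) for r, e in zip(self.ranges, state))

    def propose(self, state: List[float]) -> float:
        # Value computed by the local linear model L : E -> R.
        return self.bias + sum(w * e for w, e in zip(self.weights, state))
```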
3.4.2 Context Agents: Creation and
Self-Organisation
At start, AMOEBA does not contain any Context Agent. Indeed, all the Context Agents are created and build themselves using self-organization mechanisms. These changes and creations of agents take place when the agents have trouble performing their function efficiently. The main adaptations performed by a Context Agent are:
Changing its validity range. Indeed, when the re-
sults obtained by the local model are not satisfac-
tory for the Exploitation Mechanism, it may be
preferable for the agent to no longer be active in
the current situation. To do this, it reduces its va-
lidity ranges. Conversely, if it finds that its pro-
posal would have been relevant in a close situa-
tion, the agent may broaden its ranges of validity
to include the situation in question (figure 4).
Changing its local model. When the results ob-
tained by the local model are not satisfactory for
the Exploitation Mechanism, another way to im-
prove the results of the Context Agent is to change
the local model. This consists in adding to the
linear regression of the local model the point rep-
resented by the current situation. This change is
smoother than the previous one, and will therefore
be preferred by the Context Agent when the error
is small.
Destroying itself. When the validity ranges of the Context Agent are too small, the agent may consider that it has no chance of being useful again. It then destroys itself to save resources.
Another mechanism is implemented to complement
those mentioned earlier. When no proposal is made
by the Context Agents, or when all their propositions
are false, a new Context Agent is created to carry
the information when a similar situation occurs again.
This is how new agents are added to the system.
Figure 4: Representation of the different adaptation mech-
anisms of the Context Agents in two dimensions. Each di-
mension represents the values that can be taken by a toy
sensor, and the size of the validity range determines the size
of the rectangle along this dimension.
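The sketch below extends the illustrative ContextAgent of Section 3.4.1 with these adaptations. The thresholds, the range-shrinking rule and the creation policy are assumptions chosen for readability; they only mimic the qualitative behaviour described above, not AMOEBA's actual tuning.

```python
SMALL_ERROR = 0.1        # below this, the agent prefers adjusting its local model (assumed)
MIN_RANGE_WIDTH = 1e-3   # below this, the agent considers itself useless (assumed)

def adapt(agent: ContextAgent, state, observed_value, predicted_value) -> bool:
    """Apply one adaptation step to a valid Context Agent.

    Returns False if the agent destroys itself, True otherwise.
    """
    error = abs(observed_value - predicted_value)
    if error <= SMALL_ERROR:
        # Small error: smoothly adjust the local linear model toward the observed point.
        rate = 0.05
        for i, e in enumerate(state):
            agent.weights[i] += rate * (observed_value - predicted_value) * e
        agent.bias += rate * (observed_value - predicted_value)
    else:
        # Large error: shrink the validity ranges so the agent is no longer valid here.
        for r, e in zip(agent.ranges, state):
            mid = (r.v_min + r.v_max) / 2
            if e < mid:
                r.v_min = min(e + 1e-6, r.v_max)
            else:
                r.v_max = max(e - 1e-6, r.v_min)
    # Self-destruction when the validity ranges have become too small to be useful.
    if any(r.v_max - r.v_min < MIN_RANGE_WIDTH for r in agent.ranges):
        return False
    return True

def create_agent(state, observed_value, margin=0.5) -> ContextAgent:
    """Create a new Context Agent around an uncovered (or badly covered) situation."""
    ranges = [ValidityRange(e - margin, e + margin) for e in state]
    return ContextAgent(ranges=ranges, weights=[0.0] * len(state), bias=observed_value)
```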
3.4.3 Synthesis
AMOEBA is a generic model builder for modelling the evolution of a variable. It is based on the Adaptive Multi-Agent Systems approach. Its key feature is to be dynamically populated by a set of Context Agents which evolve dynamically to maintain an up-to-date model of the variables it observes.
The reader may see that Context Agents are similar to neurons in Artificial Neural Networks. Artificial Neural Networks are composed of interconnected neurons, where each neuron is a small computational unit with inputs, outputs, an internal state and parameters. One could see Context Agents as "advanced" neurons, and the architecture of AMOEBA, where messages flow from the environment to the different Context Agents, as a form of multilayer perceptron. However, the information flow within AMOEBA is not feed-forward (contrary to neural networks). Thus, the main difference between AMOEBA and Artificial Neural Networks is the organization of the interactions between the entities. With Artificial Neural Networks, the topology of the network must be fixed a priori with regard to the task to learn, whereas within AMOEBA, agents self-organize and the topology (the number of agents and the way they interact) evolves dynamically through experience.
4 COMBINING SUPERVISED
LEARNING AND
REINFORCEMENT
LEARNING: AN EXPERIMENT
WITH A CHILDREN’S
ROBOTIC TOY
We propose to study the usage of SACL in a lifelong reinforcement learning problem. We put ourselves in the context of child entertainment with a robotic toy. Firstly, we discuss the motivations for this experiment. Then, we present the experimental process and the results we obtained. Finally, we conclude with some perspectives highlighted by this experiment.
4.1 Motivations
Robotic toys are becoming more widespread, and the scientific community is working on their usage as educational and therapeutic tools (Billard, 2003). Robins and Dautenhahn (Robins and Dautenhahn, 2007) have notably shown that their use encourages interactions among autistic children. But as each human is different, the social skills needed to maintain attention in the human/toy interaction require the toys to be able to self-adapt to anybody (Huang, 2010).
In this respect, an interesting approach is Imitation Learning (Billard, 2003) (Robins and Dautenhahn, 2007). Imitation Learning is a form of Learning from Demonstration, a paradigm mainly studied in robotics, allowing a system to acquire new behaviours from the observation of human activity (Argall et al., 2009). It is inspired by the natural tendency of some animal species and humans to learn from the observation of their congeners. Imitation has a major advantage: it is a natural step of child development and a medium for social interactions (Piaget, 1962). Imitating the child then seems to be a good way to initiate and maintain interactions. However, imitation poses a correspondence problem between what is observed (the object to imitate) and what can be done by the imitating entity (Nehaniv and Dautenhahn, 2002). Imitation also imposes the existence of a function allowing to associate the observed state of the world with the corresponding state of the toy. Such a function is complex and limits the interaction modalities.
Another form of Learning from Demonstration proposes to use Reinforcement Learning algorithms combined with Learning from Demonstration (Knox and Stone, 2009). Reinforcement Learning algorithms bring an interesting solution to the correspondence problem by letting the learning system discover, through its own experience, what the interesting interactions are. The toy can then explore its environment and learn from its interactions with the child. However, in their original form, Reinforcement Learning algorithms require the existence of a feedback function which associates a reward value with each state of the world. This function is often complex to specify because it requires evaluating a priori the distance to an objective. In our context, however, such an objective is a priori unknown and has to be discovered. On the other hand, the Inverse Reinforcement Learning approach (Argall et al., 2009) proposes to infer this reward function from a set of reward examples. The system is fed with a set of situations and their associated rewards, and it infers a function that models this reward. This function can then be used to optimize a policy. But this solution implies that an external entity can provide the required demonstrations to learn the reward function, which is, in the case being considered, non-trivial. To face this challenge, some approaches propose to use the child's laugh as a reinforcement metric (Niewiadomski et al., 2013).
Robotic toys that learn to interact with children are good illustrations of the need for Lifelong Learning agents, as maintaining a child's attention requires a constant adaptation to the child's needs.
4.2 Description of the Experiment
We study a simulation built with the Unity3D game physics engine (Figure 5), in which a spherical robotic toy is immersed in an arena. A child, also in the arena, observes the behaviour of the robotic toy. The arena contains 3 blocks of different colours. The contact between the toy and a block produces a sound, which depends on the texture of the block (determined by its colour). We postulate that the sound produced by the toy colliding with an obstacle causes the child to laugh. The noise caused by the collision and the laugh intensity depend on the nature of the collided obstacle. Each object causes a different combination of high or low noise and strong or weak laugh. The laugh, observed through the volume intensity of the sound it produces, is used as a utility function, and the objective is to maximize laugh over time.

Figure 5: A bird's eye view of the simulation. A robotic ball is immersed in an arena composed of 3 colored blocks. The contact between the toy and a block induces a laugh which is interpreted as a reward function. The long-term objective is to maximize laugh production.
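As a toy illustration of this utility signal, the reward could be pictured as a simple lookup from the collided obstacle to a laugh intensity; the block names and numeric values below are invented for the sketch and are not the simulation's actual parameters.

```python
# Hypothetical mapping from collided obstacle to laugh intensity (the utility signal).
LAUGH_INTENSITY = {
    "yellow_block": 4.0,   # strong laugh
    "blue_block": 3.0,
    "red_block": 1.5,      # weak laugh
    None: 0.0,             # no collision, no laugh
}

def utility(collided_obstacle) -> float:
    """Volume intensity of the laugh, used as the utility to maximize over time."""
    return LAUGH_INTENSITY.get(collided_obstacle, 0.0)
```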
The aim of the experiment is more to study how the AMOEBA model is built and maintained up to date during the experiment than to propose a new reinforcement learning method.
4.3 Description of the SACL
Architecture
The SACL pattern is instantiated to perform this experimentation (Figure 6). The Adaptation Mechanism is implemented with an instance of AMOEBA (as described in section 3.4). The Exploitation Mechanism is a simple controller whose objective is to maximize the volume intensity of the environment. In order to do so, it has to decide on the (x, y) coordinates it wants to reach. A simple control loop determines, as a function of the coordinates to reach and the current state of the toy, the action to apply. The toy is controlled in velocity, which is then determined by the position to reach. To behave well, the Exploitation Mechanism must receive from the AMOEBA instance a set of reachable positions and their associated volume. The AMOEBA instance uses as input the current x and y positions of the sphere and the current volume of the simulation, to build a model of the actions to be taken to increase the volume intensity depending on the context. At each time step, the Context Agents inside AMOEBA propose a set of coordinates (x, y) and their associated volume intensity to the Exploitation Mechanism. In this experiment, all the Context Agents inside AMOEBA are sent to the Exploitation Mechanism.

Figure 6: The Self-Adaptive Context Learning Pattern implemented in order to control a robotic toy.
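For completeness, the control loop mentioned above can be pictured as a simple proportional velocity controller; the gain and the speed cap are illustrative values, not those of the Unity3D simulation.

```python
def velocity_command(toy_position, target_position, gain=1.0, max_speed=2.0):
    """Velocity command driving the toy toward the target (x, y) coordinates (illustrative)."""
    dx = target_position[0] - toy_position[0]
    dy = target_position[1] - toy_position[1]
    distance = (dx ** 2 + dy ** 2) ** 0.5
    if distance < 1e-6:
        return (0.0, 0.0)                       # already at the target
    speed = min(gain * distance, max_speed)     # proportional, capped
    return (speed * dx / distance, speed * dy / distance)
```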
The selection of the coordinates to reach by the Exploitation Mechanism is based on a combination of three factors:
The number of control loops elapsed since the last selection of the Context Agent, δ. Each time a Context Agent is valid, its δ value is reset to 0. The Exploitation Mechanism selects the Context Agents whose δ value is higher than 95% of the maximum δ value.
The volume intensity computed by the Context Agent, ω. The second selection step then selects, among the previous Context Agent set, the Context Agents with a value ω higher than 95% of the maximum ω value.
The reachability of the coordinates proposed by the Context Agent, D, which is here computed with the Euclidean distance. At last, the Exploitation Mechanism selects the Context Agent which proposes to reach the closest position.
Once the coordinates are chosen, the control loop is
applied to reach those coordinates. The selection is
performed at each time step of the control loop, al-
lowing the Exploitation Mechanism to continuously
update its coordinate objectives. At the same time the
AMOEBA updates its model by observing the evolu-
tion of the environment.
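The following sketch renders this three-stage filtering in Python. The Context Agent fields (delta, omega, position) and the 95% thresholds follow the description above; the data structures themselves are illustrative.

```python
import math

def select_target(context_agents, toy_position):
    """Three-stage selection of the coordinates to reach (illustrative).

    context_agents: iterable of objects exposing .delta (control loops since last
    selection), .omega (predicted volume intensity) and .position ((x, y) proposal).
    """
    agents = list(context_agents)
    if not agents:
        return None

    # 1) Keep the agents not selected for the longest time (delta >= 95% of the max delta).
    max_delta = max(a.delta for a in agents)
    stale = [a for a in agents if a.delta >= 0.95 * max_delta]

    # 2) Among them, keep those predicting a high volume (omega >= 95% of the max omega).
    max_omega = max(a.omega for a in stale)
    loud = [a for a in stale if a.omega >= 0.95 * max_omega]

    # 3) Finally, pick the agent proposing the closest position (Euclidean distance).
    best = min(loud, key=lambda a: math.dist(a.position, toy_position))
    return best.position
```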
4.4 Experimentation and Results
The experiment is carried out in two phases of three
minutes. At first, an exploration phase allows the
robotic toy to discover its environment. This phase
is equivalent to a supervised learning approach where
examples are provided to the learning algorithm by an
oracle (which is the environment itself).
Two different exploration strategies are studied: a random exploration, in which there is no guarantee of visiting all the blocks, and a scripted path, in which all the blocks are knocked at least once. In both cases, the Adaptation Mechanism uses the data observed during this step to construct and update its model.
In a second step, an exploitation phase allows the Exploitation Mechanism to use the built model to maximize the volume intensity. This does not prevent the Adaptation Mechanism from continuing to
enrich itself by observing the environment. This sec-
ond phase is then similar to a reinforcement learning
approach.
Thus, the experiment is a combination of supervised
and reinforcement learning.
In the rest of this section, we present and discuss some results we obtained. More precisely, we focus on the evolution of the model built by AMOEBA and on the evolution of the global utility value.
4.4.1 Evolution of the Model Built by AMOEBA
Figures 7 and 8 show two-dimensional visualizations of the Context Agent paving of the dimension space at the end of the exploration and exploitation phases for each experiment. This visualization is a graphical representation of the different Context Agents and their associated utility level. Each rectangle represents a Context Agent and the sub-space in which this Context Agent is valid. Their colour depends on the utility level associated with them. White means that the utility level is zero. Green shows a utility level above zero; the more intense the green, the higher the utility level. The grey areas without contours are areas that the toy has not yet explored, so the model has no information about them, which explains the absence of Context Agents.
The upper part of Figure 7 illustrates the model built by AMOEBA after a random exploration of the environment. By comparing this projection with the location of the obstacles in the arena, we observe that only the top right corner is populated with green squares. This implies that only the yellow block has been discovered during the exploration phase.

Figure 7: 2D visualisation of the Context Agent paving after a random exploration phase and after an exploitation phase.
On the contrary, the upper part of Figure 8, which shows the model built at the end of a scripted exploration phase, shows three distinct areas populated by green squares, which correspond to the blue and yellow blocks and to the upper wall of the arena. These two figures illustrate how AMOEBA has populated its model with Context Agents. This shows that the model built by AMOEBA depends on the situations it has experienced.
If we observe the evolution of the same model at the end of the exploitation phases (lower parts of Figures 7 and 8), we can see a significant evolution of the distribution of Context Agents in the areas of contact with obstacles. In both cases, better details of the edges of the obstacles are obtained. In the case of random exploration, only one obstacle was used, but all its edges were observed. Conversely, only a few edges of several obstacles were exploited following the scripted exploration.
4.4.2 Evolution of Global Utility Value
To evaluate this SACL implementation, we propose to compare the global utility level generated during the exploration phase with the utility level generated during the exploitation phase. Each exploration phase and exploitation phase lasts three minutes for the two experiments, and each experiment is run ten times.
Figure 8: 2D visualisation of the Context Agent paving after a scripted exploration phase and after an exploitation phase.

Figure 9: Comparison of the utility value between exploration and exploitation phases.

Figure 9 shows the mean utility value obtained in each phase for the two different exploration strategies. The mean utility value during the random exploration is 1.74 and grows to 3.16 during the exploitation phase. During the scripted exploration, the mean utility value goes from 3.67 to 3.61 during the exploitation phase.
In the case of random exploration, there is significant progress in average utility during the exploitation phase. Conversely, utility decreases (very slightly) during the exploitation phase following the scripted exploration. This can be explained by the fact that the scripted scenario is already relatively efficient. Nevertheless, this remains an interesting result, because the system has learned to produce a behaviour that is almost as effective as an ad hoc scenario.
5 CONCLUSIONS AND
PERSPECTIVES
The evolution of technologies now makes it possible to consider that artificial systems will face increasingly complex and dynamic environments, where they must perform increasingly varied tasks. As the variety of environments and tasks increases, these systems need to constantly adapt their behaviour to maintain the adequacy of their interactions with their environment. This requirement to constantly learn from the interaction with the environment will be a key component of Ambient Systems, notably because of their socio-technological aspect. Indeed, the notion of task in Ambient Systems is ambiguous and depends on the users these systems interact with. The incapacity to specify a priori all the interactions that can occur in these systems, combined with the high dynamics of these types of environments, impedes an ad hoc design. On the contrary, these artificial systems have to constantly learn, through their own experiences, to interact with their environment.
In this paper, we present our use of the SACL pat-
tern to design artificial systems with Lifelong Learn-
ing capacities. It proposes to design artificial sys-
tems in which a model is dynamically built by ex-
perience. This model is both exploited and enriched
by the mechanism that uses the model to decide on its behaviour.
This experiment illustrates how a model can be both
built and exploited in real-time. The simulation we
performed also shows that our approach is suitable
for both supervised and reinforcement learning ap-
proaches.
The work introduced in this paper is currently being
deployed in the neOCampus initiative which intends
to transform the University of Toulouse into a smart
living lab. This deployment will allow real use-case applications and comparative analyses to evaluate the benefits of our approach.
ACKNOWLEDGEMENTS
This work is partially funded by the Midi-Pyrénées region, within the neOCampus initiative
(www.irit.fr/neocampus/) and supported by the Uni-
versity of Toulouse.
REFERENCES
Argall, B. D., Chernova, S., Veloso, M., and Browning, B.
(2009). A survey of robot learning from demonstra-
tion. Robotics and autonomous systems, 57(5):469–
483.
Billard, A. (2003). Robota: Clever toy and educational tool.
Robotics and Autonomous Systems, 42(3):259–269.
Boes, J., Migeon, F., Glize, P., and Salvy, E. (2014).
Model-free Optimization of an Engine Control Unit
thanks to Self-Adaptive Multi-Agent Systems (reg-
ular paper). In International Conference on Em-
bedded Real Time Software and Systems (ERTS2),
Toulouse, 05/02/2014-07/02/2014, pages 350–359.
SIA/3AF/SEE.
Boes, J., Nigon, J., Verstaevel, N., Gleizes, M.-P., and Mi-
geon, F. (2015). The self-adaptive context learning
pattern: Overview and proposal. In Modeling and Us-
ing Context, pages 91–104. Springer.
Georgé, J.-P., Gleizes, M.-P., and Camps, V. (2011). Cooperation. In Di Marzo Serugendo, G., Gleizes, M.-P., and Karageogos, A., editors, Self-organising Software, Natural Computing Series, pages 7–32. Springer Berlin Heidelberg.
Guivarch, V., Camps, V., and Péninou, A. (2012). AMADEUS: an adaptive multi-agent system to learn a user's recurring actions in ambient systems. Advances in Distributed Computing and Artificial Intelligence Journal, Special Issue no. 3 (ISSN: 2255-2863), electronic medium.
Huang, C.-M. (2010). Joint attention in human-robot inter-
action. Association for the Advancement of Artificial
Intelligence.
Jazdi, N. (2014). Cyber physical systems in the context
of industry 4.0. In Automation, Quality and Test-
ing, Robotics, 2014 IEEE International Conference
on, pages 1–4. IEEE.
Knox, W. B. and Stone, P. (2009). Interactively shaping
agents via human reinforcement: The tamer frame-
work. In Proceedings of the Fifth International Con-
ference on Knowledge Capture, pages 9–16. ACM.
Mitchell, T. M. (2006). The discipline of machine learn-
ing, volume 9. Carnegie Mellon University, School of
Computer Science, Machine Learning Department.
Nehaniv, C. L. and Dautenhahn, K. (2002). The correspon-
dence problem. Imitation in animals and artifacts, 41.
Niewiadomski, R., Hofmann, J., Urbain, J., Platt, T., Wag-
ner, J., Piot, B., Cakmak, H., Pammi, S., Baur, T.,
Dupont, S., et al. (2013). Laugh-aware virtual agent
and its impact on user amusement. In Proceedings
of the 2013 international conference on Autonomous
agents and multi-agent systems, pages 619–626. In-
ternational Foundation for Autonomous Agents and
Multiagent Systems.
Nigon, J., Gleizes, M.-P., and Migeon, F. (2016a). Self-
adaptive model generation for ambient systems. Pro-
cedia Computer Science, 83:675–679.
Nigon, J., Glize, E., Dupas, D., Crasnier, F., and Boes,
J. (2016b). Use cases of pervasive artificial intelli-
gence for smart cities challenges. In IEEE Workshop
on Smart and Sustainable City, Toulouse, juillet.
Pentina, A. and Lampert, C. H. (2015). Lifelong learning
with non-iid tasks. In Advances in Neural Information
Processing Systems, pages 1540–1548.
Piaget, J. (1962). Play, dreams and imitation in childhood.
New York : Norton.
Rifkin, J. (2016). How the third industrial revolution will
create a green economy. New Perspectives Quarterly,
33(1):6–10.
Robins, B. and Dautenhahn, K. (2007). Encouraging social
interaction skills in children with autism playing with
robots. Enfance, 59(1):72–81.
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and
Edwards, D. D. (2003). Artificial intelligence: a mod-
ern approach, volume 2. Prentice hall Upper Saddle
River.
Serugendo, G. D. M., Gleizes, M.-P., and Karageorgos, A.
(2011). Self-organising systems. In Self-organising
Software, pages 7–32. Springer.
Silver, D. L., Yang, Q., and Li, L. (2013). Lifelong ma-
chine learning systems: Beyond learning algorithms.
In AAAI Spring Symposium: Lifelong Machine Learn-
ing, pages 49–55. Citeseer.
Thorisson, K. R., Bieger, J., and Thorarensen, T. (2016).
Why artificial intelligence needs a task theory. In Ar-
tificial General Intelligence: 9th International Con-
ference, AGI 2016, New York, NY, USA, July 16-19,
2016, Proceedings, volume 9782, page 118. Springer.
Thrun, S. and Mitchell, T. M. (1995). Lifelong robot learn-
ing. In The biology and technology of intelligent au-
tonomous agents, pages 165–196. Springer.
Verstaevel, N. (2016). Self-Organization of Robotic Devices
Through Demonstrations. Doctoral thesis, Université de Toulouse, Toulouse, France.
Verstaevel, N., Régis, C., Gleizes, M.-P., and Robert,
F. (2016). Principles and experimentations of self-
organizing embedded agents allowing learning from
demonstration in ambient robotics. Future Genera-
tion Computer Systems.
Zhang, B.-T. (2014). Ontogenesis of agency in machines:
A multidisciplinary review. In AAAI 2014 Fall Sym-
posium on The Nature of Humans and Machines: A
Multidisciplinary Discourse.