SNAPE: A Sequential Non-Stationary Decision Process Model for
Adaptive Explanation Generation
Amelie Sophie Robrecht (https://orcid.org/0000-0001-5622-8248) and Stefan Kopp (https://orcid.org/0000-0002-4047-9277)
Social Cognitive Systems Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
Keywords:
Explainable AI, Explanation Planning, Non-Stationary Hierarchical Decision Processes.
Abstract:
The automatic generation of explanations is an increasingly important problem in the field of Explainable AI
(XAI). However, while most work looks at how complete and correct information can be extracted or how it
can be presented, the success of an explanation also depends on the person the explanation is targeted at. We
present an adaptive explainer model that constructs and employs a partner model to tailor explanations during
the course of the interaction. The model incorporates different linguistic levels of human-like explanations in
a hierarchical, sequential decision process within a non-stationary environment. The model is based on online
planning (using Monte Carlo Tree Search) to solve a continuously adapted MDP for explanation action and
explanation move selection. We present the model as well as first results from explanation interactions with
different kinds of simulated users.
1 INTRODUCTION
Current AI systems are increasingly required to be
explainable, i.e. to be understandable by stakehold-
ers when provided with some kind of additional ex-
planation. Such explanations can range from visu-
alizations of internal parameters, to dedicated anal-
ysis techniques to identify a classifier’s sensitivities,
to textual information on natural language requests.
Which kind of explanation is suited best depends on
different aspects including the to-be-explained object
as well as the recipient of the explanation (Gunning
et al., 2021). A recent trend in XAI is thus to con-
sider explanations as an adaptive and interactive pro-
cess (Miller, 2019). Ultimately, this requires a tech-
nical system and an end user to clarify the user’s in-
formation needs and to tailor the explanation in an
interactively co-constructed process (Rohlfing et al.,
2021). Yet, it is not clear how such a process can be
established and supported by an intelligent explana-
tion system.
We address this problem for the case of natural
language explanations that are constructed in a dia-
logical interaction between a human addressee (the
explainee, henceforth) and an A.I. system. That is,
we aim to support a natural language interaction that
is co-constructed by an intelligent agent and a user,
resulting in an effective and successfully personal-
ized explanation. We focus specifically on the prob-
lem of generating explanations in a way that is adap-
tive to the explainee’s understanding and information
needs, as they become apparent through the course
of a dialogical interaction. This requires an agent to
autonomously decide upon the explanation’s content
(what is to be explained next), the general strategy
(e.g. whether to introduce new information, to elab-
orate on previously given information, or to provide
an example), and the specific realization (the concrete
verbal utterance). Adaptivity must be enabled at each
of these levels. We will focus on the dialog level at
which two things are determined: the information that
needs to be explained next and how it is done best.
Both decisions depend on a model of the current situa-
tion as well as the partner. To that end, the explainee’s
understanding and other relevant characteristics are
extracted from the interaction by analysing the feed-
back and are stored in a dynamic partner model. As
this partner model becomes more and more specific
over time, the decision process and the partner model
influence each other. As a result, the generation problem changes over time or, in other words, is non-stationary.
Adopting a broad view on explanations, we define the goal of an explanation not only as giving reasons (i.e. answering why-questions) but also as enabling the
recipient to act or to help her understand (i.e. answer-
ing how-questions or what-questions). One argument
for this is that providing a why-explanation requires
a certain degree of common ground with regard to
what-/how-understanding. More generally, we build
on models of rational communication and treat the
generation of explanations as a problem of rational
decision-making. In this, an agent makes decisions
by considering possible actions and choosing the one
that maximizes a desired outcome weighed against its
potential costs (Russell and Norvig, 2021). The stan-
dard approach is to model this as a (partially observ-
able) Markov decision process ((PO)MDP), whose so-
lution is an action policy that can be determined via
reinforcement learning (RL) (Sutton and Barto, 2018)
or through some form of (usually approximate) plan-
ning such as value iteration (Lapan, 2018) or Monte
Carlo tree search (Silver and Veness, 2010). A partic-
ular challenge arises when the environment dynamics, i.e. the way specific actions bring about specific effects, change over time. This is the case for the co-constructed explanation process as outlined above, due to (1) the growing (non-)understanding of the explainee and (2) the correspondingly changing effects of explanation moves the agent may take. Indeed, different explanation strategies lead to different results depending on the expertise, interest, or communication skills of the explainee (Buhl, 2001; Grice, 1975): While presenting the information in small pieces and elaborating on it with reformulations or additional information might suit a novice explainee, it becomes increasingly inappropriate as the explainee gets more informed.
In this paper we propose SNAPE, a model that for-
malizes explanation generation as sequential, hierar-
chical decision-making in a non-stationary environ-
ment. In this model, a global explanation plan deter-
mines blocks of information to be explained. Then,
local decision problems (MDPs) concerning the se-
lection of explanation actions and moves are contin-
uously revised and reformulated due to the changing
partner model, and then solved efficiently through on-
line planning. We demonstrate the approach in the
domain of explaining the board game Quarto!, which
is of a manageable size, comprises both rules as well
as strategic elements, and allows for testing the ex-
plainee’s understanding by letting her play the board
game subsequent to the explanation. We start by dis-
cussing related work and underlying concepts in Sec-
tion 2. Section 3 will describe the concept and struc-
ture of the model focusing on how it provides expla-
nations sequentially and adaptively, based on contin-
uous updates of a partner model. We will then present
first results (Section 4) obtained with the model in
simulations of explanation processes with different
explainees. We will also show dialogue examples displaying different strategies SNAPE uses as a result of varying partner models. Finally, a conclusion and discussion of future plans are given in Section 5.
2 RELATED WORK
2.1 Explanation Generation
In an explanation, an explainer and an explainee
are interacting to establish sufficient understanding
(Rohlfing et al., 2021). The object to be explained
is referred to as the explanandum, while the explanans refers to the particular way in which the explanandum is explained. Generally, explanations can vary widely
with regard to the explainer’s stance or goal, or the
causal patterns laid out by the explanans (Keil, 2006).
Most current research in XAI looks at introspective,
post-hoc explanations of a single decision of an AI
system using, e.g., contrastive or counterfactual ex-
planation patterns. This is fueled by the intrinsic in-
terest of understanding deep neural networks (DNNs)
and related “blackbox” algorithms, but also by ex-
ternal requirements such as the General Data Protec-
tion Regulation (GDPR). Consequently, current approaches in XAI aim at creating transparency, interpretability, and explainability, among other goals (Anjomshoae et al., 2019; Stepin et al., 2021). We fo-
cus on achieving explainability, which is mainly con-
cerned with how to make a description understand-
able to the explainee (Rohlfing et al., 2021) through
an appropriate, adaptive explanation process.
An explanation process can be defined as a se-
quence of phases that consist of explanation and veri-
fication blocks (El-Assady et al., 2019). The optimal
pathway through an explanation and the best explana-
tion strategy can differ depending on the targeted level
of detail, the targeted explainee, or the desired level of
interactivity. Current work in XAI has started to de-
velop context-aware methods, making up 43% of ex-
planation studies, while user-aware methods are still
relatively underrepresented (only 10%) (Anjomshoae
et al., 2019). One major challenge for the latter is
that (1) there is not one global explanation that fits
all (Sokol and Flach, 2020), and (2) prior assump-
tions about an explainee (user models) need to be
adapted continuously as the explanation process un-
folds. Therefore an explanation cannot be fully pre-
planned. Recent research has also been directed at explanations in human-robot interaction (Stange et al., 2022) or at eliciting explanations towards an artificial explainee. In Zare et al. (2019) the agent learns
a game by playing and asking questions to a human
explainer. Other approaches aim at developing an ar-
tificial explainer (Allameh and Zaman, 2021), e.g. in
the form of a chatbot that is giving in-game support
rather than an explanation for a given domain.
2.2 Markov Decision Processes
MDPs are a common approach to model sequen-
tial decision processes and agent-environment inter-
actions (Lapan, 2018), with only few explanation gen-
eration approaches using MDPs so far (Anjomshoae
et al., 2019). Classically, an MDP (Puterman and Feinberg, 1994) is defined by a tuple $\langle S, A, T, R \rangle$, where $S$ is a finite set of possible states and $A$ is a finite set of actions. The transition model $T(s' \mid s, a)$ describes the probability of reaching a new state $s'$ after performing action $a$ in state $s$. The reward model $R(a, s)$ defines the immediate reward that is obtained after performing action $a$ in state $s$.¹ Various extensions or sub-types of MDPs have been defined for specific domains, two of which are relevant for the present model and are briefly discussed in the following.
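To make the formalism concrete, the following minimal Python sketch (our own illustration, not the implementation discussed later in the paper) represents an MDP as the tuple $\langle S, A, T, R \rangle$ together with the one-step lookahead a planner would evaluate:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str    # placeholder state type
Action = str   # placeholder action type

@dataclass
class MDP:
    """A finite MDP given as the tuple <S, A, T, R>."""
    states: List[State]                                        # S: finite set of states
    actions: List[Action]                                      # A: finite set of actions
    transition: Callable[[State, Action], Dict[State, float]]  # T: distribution over successor states
    reward: Callable[[State, Action], float]                   # R(a, s): immediate reward

def one_step_value(mdp: MDP, s: State, a: Action,
                   value: Dict[State, float], gamma: float = 0.95) -> float:
    # Expected return of taking action a in state s: immediate reward plus
    # the discounted value of the successor states, weighted by T.
    return mdp.reward(s, a) + gamma * sum(p * value[s2]
                                          for s2, p in mdp.transition(s, a).items())
```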
2.2.1 Non-Stationary MDPs
MDPs typically assume the environment to be stationary, i.e. the transition probability $T(s' \mid s, a)$ to be constant throughout the whole decision process. Likewise, $A$ and $S$ are assumed to be fixed. However, in
market forecasting, or dialogue management, the en-
vironment is non-stationary. Different approaches to
deal with non-stationary MDPs (NSMDPs) can be
used (Lecarpentier and Rachelson, 2019). One option
is to model several experts or policies and to compare
their predictions in order to minimize regret (Cesa-
Bianchi and Lugosi, 2006; Fruit et al., 2017). Usually, regret minimization is used in competitive domains with at least one adversary. A second approach
called Hidden-Mode MDP (HM-MDP) (Hadoux,
2015) is to break down the NSMDP into different
modes. Depending on the current mode the MDP is
in, the transition and reward model differ. HM-MDPs
have been used in several domains (Chades et al.,
2012) and are employed in our model to adapt the
transition probabilities according to the current part-
ner model. Note that this is different from using so-
called “non-stationary policies” (Scherrer and Lesner,
2012), which can change for certain parts of a deci-
sion sequence and were shown to be sometimes more
efficient than classical policies.
¹ Alternate reward models may be $R(s' \mid s, a)$ or $R(s)$, defined with regard to the domain.
No matter which kind of MDP is used, a crucial
question is how to solve them for a policy. Classi-
cal offline planning aims to find the optimal policy
beforehand. This is computationally costly in larger
state-action spaces and infeasible in non-stationary
environments, because each time the environment
changes a new policy needs to be determined. An al-
ternative is to use online planning with approximate
methods such as Monte Carlo Tree Search (MCTS)
(Pettet et al., 2022). This approach is described
along with suitable application domains in Section
2.2.3. Here, we employ MCTS within a hierarchical
model, in which local MDP models are formulated
with smaller action-state spaces so that MCTS is ap-
plicable.
2.2.2 Hierarchical MDPs
A key challenge with MDPs is that the state-action
space tends to increase rapidly. One approach to deal
with this, besides using Partially Observable Monte
Carlo Planning (POMCP) (Silver and Veness, 2010)
or combining RL and MCTS (Pettet et al., 2022), is
to use hierarchical MDPs. This has been proposed,
e.g., for a robot navigation task (Bakker et al., 2005):
The key idea is to group states into clusters, which are
states of a higher-level MDP but full state spaces of
lower-level MDPs. At the lower level, the state space
is decreased due to only solving the sub-problem of
the associated cluster. The idea of using a hierarchy
in an MDP is well established, for example in plan-
ning algorithms and decision-making. The challenge
is to cluster the correct states together. To that end,
Hierarchically Determinized MDPs (HDMDP) were
introduced (Barry et al., 2010). Further, higher level
clusters or macro-actions can be seen as local policies
for solving sub-problems of the MDP (Hauskrecht
et al., 2013). This allows for reusing the lower level
policy several times. Nevertheless, in some cases us-
ing macro-actions can be costly too and lead to a less
flexible and more abstract model (Hauskrecht et al.,
2013). Notwithstanding these problems, many approaches to modelling hierarchical MDPs have been
put forward (Kessler et al., 2017), all pointing out two
main advantages: (i) the reduction of state and ac-
tion spaces by breaking the problem down into sub-
problems and (ii) the reusability of higher level ac-
tions.
2.2.3 Monte Carlo Tree Search (MCTS)
MCTS provides a popular way of solving MDPs on-
line. An MCTS policy aims to find a balance between searching for new options (exploration) and trying promising ones (exploitation). Coulom (2007) first combined
Monte-Carlo evaluation with tree search, and Koc-
sis et al. (2006) introduced the Upper Confidence Bounds for Trees (UCT) algorithm, which treats the choice of a child node as a multi-armed bandit problem. Generally,
the MCTS algorithm consists of four steps that are re-
peated in each iteration: selection, expansion, simula-
tion and backpropagation (Browne et al., 2012). The
first step of the algorithm is the selection. Starting at
the root node the algorithm moves down the tree until
it reaches a node that is expandable, which means it is
nonterminal and has unseen children. After selecting
the node it is expanded, which means that at least one
child node is added. How many children can be added
to the tree depends on the available actions. Starting
at the new node a simulation is run using the default
policy and the result of the simulation is propagated
back to the root node to update the probabilities.
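As an illustration of these four steps, the sketch below runs one MCTS iteration over generic problem callbacks (available_actions, step, rollout, is_terminal); it is a simplified sketch of the standard scheme with a UCB1-style selection rule, not the exact procedure used in SNAPE:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(child, c=1.4):
    # Balance exploitation (mean value) against exploration (visit counts).
    return child.value / child.visits + c * math.sqrt(math.log(child.parent.visits) / child.visits)

def mcts_iteration(root, available_actions, step, rollout, is_terminal, depth=20):
    # 1. Selection: descend while the current node is fully expanded.
    node = root
    while node.children and len(node.children) == len(available_actions(node.state)):
        node = max(node.children, key=ucb1)
    # 2. Expansion: add one unseen child if the node is non-terminal.
    if not is_terminal(node.state):
        tried = {c.action for c in node.children}
        action = random.choice([a for a in available_actions(node.state) if a not in tried])
        node.children.append(Node(step(node.state, action), parent=node, action=action))
        node = node.children[-1]
    # 3. Simulation: run the default (random) policy from the new node.
    reward = rollout(node.state, depth)
    # 4. Backpropagation: update the statistics along the path back to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```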
3 THE SNAPE MODEL
In this section we introduce the SNAPE model which
formalizes adaptive explanation generation as a se-
quential non-stationary decision process. It builds on
the view that producing an explanation can be seen as
a hierarchical process (Figure 1).
Figure 1: Assumed hierarchical structure of an explanation.
At the highest level of an explanation, the ex-
plainer follows an explanation plan that determines
the next block. A block can either consist of ex-
plaining or verifying a given content (El-Assady et al.,
2019). While providing an explanation block is pos-
sible as long as the block is not assumed to be under-
stood by the explainee, providing a verification block
is possible only if the respective content has been explained before.
As soon as a block is selected, it is passed down to the
next level which determines an appropriate explana-
tion action. Here, the explainer either performs a val-
idate action or, for an explanation block, chooses be-
tween providing new content or deepening the current
one further. The action is transmitted to the explana-
tion move level, at which the explainer decides for the
best move to perform the given action. While, at the
explanation action level, the focus is on which infor-
mation to discuss, the move level evaluates how the
information can be presented best. Finally, the move
is passed down to the explanation utterance level,
where a verbal utterance is generated. In sum, the ex-
planation plan and action concern the content struc-
ture of the explanation, while the explanation move
and utterance levels mainly determine the linguistic
representation.
Figure 2: Structure of the SNAPE model.
The SNAPE model follows this hierarchical struc-
ture and formalizes the corresponding decision prob-
lems (Figure 2). At the highest level, a partially-
ordered plan (POP) specifies the logical order of the
explanation blocks. In a partially-ordered plan some
of the elements have a strict order, while others can
be swapped to a certain extent. Our approach to forming these explanation blocks differs from the ones described in Section 2.2.2, as we receive the clus-
ters from a data-set of human-human explanations of
Quarto! that we translated into an ontology using Web
Ontology Language (OWL 2) (Motik et al., 2009).
The knowledge graph built from this ontology can be
seen in Figure 3. While the ontology contains all the
information connected to the domain, the knowledge
graph includes only a subset of the obligatory triples
(e.g., information like the board is made of wood or
you can play Quarto online is included in the ontol-
ogy but not in the knowledge graph for the game ex-
planation). All the triples in the knowledge graph are
assigned to the corresponding explanation block. In
Figure 3 the blocks are indicated as gray areas. Corre-
spondingly, the state space of the MDP is a subgraph
of the knowledge graph and consists of a set of triples.
The global explanation plan determines which block
to perform next. The block includes a list of all triples
that have to be understood by the explainee in order
to complete the explanation successfully. The figure
illustrates that a block corresponds to a state at the ex-
planation plan level, while determining a state space
at the action and move selection level.
Figure 3: Knowledge graph of triples and associated expla-
nation blocks for the game domain.
Each triple describes a relation (predicate) be-
tween two elements (a subject and an object). For
illustration, the triple underlying the semantics of the sentence "Quarto is a board game." can be seen in Figure 4.
Figure 4: The semantics of the sentence "Quarto is a board game." in the form of an (entity-relation) triple.
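For illustration, such a triple, together with the block it is assigned to and its complexity (used later for action and move selection), could be represented as follows; the field names and values are our own and merely hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str       # e.g. "Quarto"
    predicate: str     # e.g. "is a"
    obj: str           # e.g. "board game"
    block: str         # explanation block the triple belongs to (a gray area in Figure 3)
    complexity: float  # c_t, estimated from the human-human explanation data

# The triple underlying the semantics of "Quarto is a board game."
t_game = Triple("Quarto", "is a", "board game", block="introduction", complexity=1.0)
```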
Given the block, an MDP is built for solving ex-
planation action and explanation move selection to-
gether. While the explanation action selection decides which triple to deal with next, the explanation move concerns the linguistic representation of the triple. After choosing which explanation move to perform on which triple, the information is passed down to the lowest level, the utterance generation. In its current state, the utterance generation is a dictionary of move-template pairs, but it will be extended to a probabilistic grammar (Collins, 2011) with probability distributions being influenced by the partner model.
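A minimal sketch of this dictionary of move-template pairs, using utterances that appear in the dialog examples of Section 4 (the keys and the lookup function are our own illustration):

```python
# Hypothetical excerpt: surface templates indexed by (move, triple id).
TEMPLATES = {
    ("provide", "t0"): "Quarto has a game board.",
    ("provide", "t1"): "The unused figures are put next to the board, the stable.",
    ("repeat",  "t1"): "Quarto has a stable.",
    ("add",     "t1"): "The stable is not part of the actual board.",
}

def realize(move: str, triple_id: str) -> str:
    """Look up the verbal utterance for a given explanation move and triple."""
    return TEMPLATES[(move, triple_id)]
```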
One key innovation of the SNAPE model is the
integration of an online partner model in the decision
process in order to allow for creating an explanation
adapted to the current state of the explainee’s under-
standing and preferences. Different features of the
partner model are directly connected to certain expla-
nation levels of the model, while others serve as input
for multiple levels (see below). In general, the part-
ner model consists of two components that represent
central assumptions an explainer holds about an ex-
plainee: (1) the partner’s domain model covering the
parts of the knowledge domain that are assumed to
be understood, and (2) more global features that de-
scribes the partner’s general ability and willingness to
follow the explanation. In its current state, the partner
model’s features are the explainee’s expertise, atten-
tiveness and cooperation. The expertise refers to the
general knowledge the explainee has of the domain.
The attentiveness assesses how easily the explainee
gets distracted, while cooperation estimates the extent
to which the explainee is willing to give information
about her current level of understanding.
Initially, the partner model starts as a very generic
model (based on prior assumptions) but is then up-
dated through the feedback the explainee provides.
The power of integrating the partner model will become clear in the next two subsections, which take a closer look at the MDP, the partner model, and the adaptivity resulting from their combination.
3.1 Adaptive Explanation Action and
Move Selection
As the explanation block is given by the global expla-
nation plan, the set of triples and a list of templates
for each action is passed down to the explanation ac-
tion and move selection. Then, a local MDP is com-
posed whose state consists of all triples of the respec-
tive explanation block, along with a representation of
the level of understanding (lou) for each triple. In the
initial state all lou values are set to None. A list of
all the actions and their moves that can potentially be
performed at the current state is composed. Possible
actions always consist of a type (action type) and the
triple that is currently under discussion (action cud).
There are two actions available: provide and deepen,
which are mutually exclusive. Provide is available for
all triples with the value None, while deepen is avail-
able only if the triple has been introduced already.
Moves are more complex than actions. An explanation move describes the way in which to perform the action at the dialog level (see examples in Table 1) and can be seen as a sub-action. This means that not only the full explanation process but also the MDP has a hierarchical structure, with actions at the higher and moves at the lower level. If the action deepen is chosen, all the moves (repeat, add, example) are possible, but with different transition probabilities and following states.
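The composition of such a local MDP state and its action-move options can be sketched as follows (our own illustration of the scheme just described):

```python
from typing import Dict, List, Optional, Tuple

def initial_state(block_triples: List[str]) -> Dict[str, Optional[float]]:
    # The MDP state holds one lou entry per triple of the block, initially None.
    return {t: None for t in block_triples}

def available_action_moves(state: Dict[str, Optional[float]]) -> List[Tuple[str, str, str]]:
    """Enumerate the (action_type, move, action_cud) options possible in a state."""
    options = []
    for triple, lou in state.items():
        if lou is None:
            options.append(("provide", "provide", triple))   # introduce a new triple
        else:
            for move in ("repeat", "add", "example"):         # deepen an already introduced triple
                options.append(("deepen", move, triple))
    return options
```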
Apart from the actions that are available, the MDP
needs to know about the likely results of perform-
ing an action-move pair. To that end, all actions and
moves available at the current state are simulated tak-
ing into account that an action can either be success-
ful and lead to a new state, or can be unsuccessful
so that the MDP remains in its current state. Cru-
cially, the transition probabilities depend on the partner model's value of attentiveness $p_a$, as shown in Table 1. The action provide always leads to a new state, where the level of understanding (lou) is the initial value ($lou_{init}$). The move repeat has a high probability to result in a state change, as we assume a repetition or reformulation does not need a high level of attentiveness or focus. The move example has a higher probability to fail than the repeat move, as it is more complex and engages the user to actively transfer knowledge.²
The third move is to give additional
information about the triple. This is not necessarily
needed for the explanation but might help to under-
stand or connect the information to prior knowledge.
The transition probability of this move is in between
those of the other two moves. It needs more attentive-
ness than a repetition but less than an example.
Upon conveying information to the explainee, the MDP changes to a new state that is also dynamically determined based on the complexity of the triple ($c_t$), the expertise of the user ($p_e$), and the kind of move (see Table 1). The complexity of the triple is stored in the dictionary and taken from human-human explanations; triples that caused confusion or frequently led to a misunderstanding are assigned a high complexity level. The expertise captures the explainee's domain-specific ability to understand. The kind of move is important as we hypothesise that different moves require different levels of communicative activity and therefore lead to different improvements of the level of understanding (lou). If the current move is to repeat a triple, the level of improvement is comparatively low, as a reformulation is not a very activating move for the explainee. Giving an example does require a high level of activity and therefore leads to a higher improvement. Again, adding further information is the move in between: it is more activating than a repetition, but less activating than the example.

² Also, examples tend to be longer than repetitions of information. At the current state of the model this is implicitly included in the complexity level of the triple; later it will be encoded by the utterance complexity and influence the probability distribution of the probabilistic grammar.
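Read as code, and under our reconstruction of Table 1 below, the transition probabilities and the predicted lou values reduce to a small function of the attentiveness $p_a$, the expertise $p_e$ and the triple complexity $c_t$:

```python
# Per-move success factors and activity weights (cf. Table 1).
SUCCESS  = {"repeat": 0.9, "add": 0.6, "example": 0.4}
ACTIVITY = {"repeat": 0.4, "add": 0.6, "example": 0.9}

def transition_prob(move: str, p_a: float) -> float:
    """Probability t(s'|s) that the move leads to a new state, scaled by attentiveness."""
    return 1.0 if move == "provide" else SUCCESS[move] * p_a

def predicted_lou(move: str, lou: float, c_t: float, p_e: float) -> float:
    """Predicted level of understanding of the triple after a successful move."""
    if move == "provide":
        return p_e / c_t                                      # lou'_i = p_e / c_t
    return lou + (1 - lou) / c_t * (ACTIVITY[move] + p_e) / 2
```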
Further, the MDP needs to define what a terminal
state looks like. In our case a state is terminal as soon
as all the values have exceeded a certain level of un-
derstanding lou (in our case 0.75). The last element of
the MDP that needs to be defined is the reward func-
tion, which assigns a reward to each possible state.
For the MCTS only the final reward is needed, which
in our case is the sum of all the rewards during the it-
eration. The reward is negative until the terminal state
is reached. A non-terminal reward is set to the difference between the number of triples that exceed the needed level of understanding (0.75) and the total number of triples. The reward at a final state is set to 100.
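A minimal sketch of the terminal test and reward, with the constants as stated above:

```python
LOU_THRESHOLD = 0.75   # required level of understanding per triple
FINAL_REWARD = 100.0

def is_terminal(state: dict) -> bool:
    # Terminal once every triple of the block reaches the required lou.
    return all(lou is not None and lou >= LOU_THRESHOLD for lou in state.values())

def reward(state: dict) -> float:
    if is_terminal(state):
        return FINAL_REWARD
    understood = sum(1 for lou in state.values() if lou is not None and lou >= LOU_THRESHOLD)
    return understood - len(state)   # negative until all triples are sufficiently understood
```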
The decision process runs as follows: The best
move (in terms of maximal expected reward) is cal-
culated using MCTS and then performed by process-
ing the template and offering the explainee the option
to provide feedback. Currently the user can provide
positive, negative or no feedback. If the feedback is
positive, the information is assumed to be grounded
(lou value of 0.9), while it is reset to its initial state
if the feedback is negative. If no feedback is given, the MDP computes the new state according to the predictions described above. Then the current state is updated and the MCTS, again, finds the best move to perform. Once the current explanation block is fully grounded, the MDP reaches its final state. Now the partner model is updated and the global explanation plan picks the next block to explain. How the partner model update works in detail is the subject of the next subsection.
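One step of this loop could be sketched as follows; mcts_best_move, realize and get_feedback stand in for the components described above, and the exact bookkeeping is our own simplification:

```python
def explanation_step(state, complexity, partner, mcts_best_move, realize, get_feedback):
    """One decision cycle: select, verbalize, and interpret the explainee's feedback."""
    action, move, triple = mcts_best_move(state, partner)   # best pair by expected reward (MCTS)
    print(realize(move, triple))                             # realize the move via its template
    feedback = get_feedback()                                # "+", "-" or None
    if feedback == "+":
        state[triple] = 0.9                                  # positive feedback: treat as grounded
    elif feedback == "-":
        state[triple] = None                                 # negative feedback: reset the triple
    else:                                                    # no feedback: adopt the predicted state
        state[triple] = predicted_lou(move, state[triple] or 0.0,
                                      complexity[triple], partner.expertise)
    return state
```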
3.2 Partner Model Update
As described above, different variables in the partner
model influence the model’s probability of changing
into a new state as well as the new state itself. As a result, changes to the partner model cause the gen-
eration process to be non-stationary by yielding new
MDPs that steer the explanation actions/moves for a
limited period of time. Initially, the partner model
holds predefined values and an empty domain model.
Then, the partner’s domain model is updated con-
stantly during the explanation process. Each triple
Table 1: Overview of action-move combinations with examples, along with the (predefined) formulae for calculating new states and transition probabilities of the adapted MDP.

Provide-Provide
    Example: "To win you need to build a row."
    New state: $lou'_i = \frac{p_e}{c_t}$
    Transitions: $t(s'|s) = 1$, $t(s|s) = 0$

Deepen-Repeat
    Example: "You can win by creating a row."
    New state: $lou'_i = lou_i + \frac{1 - lou_i}{c_t} \cdot \frac{0.4 + p_e}{2}$
    Transitions: $t(s'|s) = 0.9 \cdot p_a$, $t(s|s) = 1 - 0.9 \cdot p_a$

Deepen-Add
    Example: "A row has a length of 4 figures."
    New state: $lou'_i = lou_i + \frac{1 - lou_i}{c_t} \cdot \frac{0.6 + p_e}{2}$
    Transitions: $t(s'|s) = 0.6 \cdot p_a$, $t(s|s) = 1 - 0.6 \cdot p_a$

Deepen-Example
    Example: "For example: If there are three small figures in a row and you have a small one, you can build a line."
    New state: $lou'_i = lou_i + \frac{1 - lou_i}{c_t} \cdot \frac{0.9 + p_e}{2}$
    Transitions: $t(s'|s) = 0.4 \cdot p_a$, $t(s|s) = 1 - 0.4 \cdot p_a$
that is introduced (provide) is added to the knowledge graph; each triple that is acknowledged with positive feedback or exceeds the needed lou is additionally labelled as grounded. Every time an explanation
block is passed back to the global explanation plan
the global variables in the partner model are updated
as well. If the frequency of feedback given by the user
is high, the level of attentiveness rises, which results
in higher transition probabilities in the next MDP. In
contrast, the attributed expertise is mainly influenced
by the kind of feedback that is generated. Positive
feedback on a complex triple hints at a high level of expertise, while the opposite, negative feedback on an easy piece of information, indicates a lower level of expertise. Expertise can also be assessed by asking open questions about the explainee's previous domain knowledge (in our game domain, e.g., "You probably know Best of Four, right?"), which was observed frequently in human-human game explanations.³ Note also that
the selection of the next explanation block depends
on the current partner model and is hence subject to
changes due to partner model updates. Specifically,
whether to validate or to move on and explain the next
block is significantly influenced by the partner’s at-
tributed cooperation. The cooperation value of the
partner is derived from the quality of the produced
feedback (see below).
³ This kind of question is also a building block for explaining through analogy, which we leave to future work.

The detailed update rules for expertise $E_b$ and attentiveness $A_b$ with respect to a block $b$ are as follows ($|FB_p|$: number of positive feedback signals, $|FB_n|$: number of negative feedback signals, $|FB|$: number of all feedback signals including None):
Expertise:

$$E_{b+1} = E_b \cdot \frac{1}{|FB_{p+n}|} + \frac{act}{|FB_{p+n}|} \cdot \left(1 - \frac{1}{|FB_{p+n}|}\right)$$

with

$$act = 0.5 + \frac{\tanh(2 \cdot pos - neg)}{2}, \qquad pos = \sum_{i=1}^{|FB_p|} c_i, \qquad neg = \sum_{i=1}^{|FB_n|} (4 - c_i) \quad (1)$$
Attentiveness:

$$A_{b+1} = \left(A_b + \frac{|FB_{p+n}|}{|FB|}\right) \cdot 0.5 \quad (2)$$
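Under our reconstruction of Equations (1) and (2) above, the block-wise update can be sketched as follows ($c_i$ being the complexity of the triple the $i$-th feedback signal refers to; the guard for blocks without feedback is our own addition):

```python
import math

def update_expertise(E_b: float, pos_complexities: list, neg_complexities: list) -> float:
    """Expertise update after a block, following Eq. (1) as reconstructed above."""
    n = len(pos_complexities) + len(neg_complexities)    # |FB_{p+n}|
    if n == 0:
        return E_b                                        # no informative feedback in this block
    pos = sum(pos_complexities)                           # positive feedback on complex triples
    neg = sum(4 - c for c in neg_complexities)            # negative feedback on easy triples
    act = 0.5 + math.tanh(2 * pos - neg) / 2
    return E_b * (1 / n) + (act / n) * (1 - 1 / n)

def update_attentiveness(A_b: float, n_feedback: int, n_total: int) -> float:
    """Attentiveness update after a block, following Eq. (2)."""
    return (A_b + n_feedback / n_total) * 0.5
```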
4 RESULTS
In order to evaluate the SNAPE model we have imple-
mented the described approach and tested it in expla-
nation interactions with simulated explainees. Gen-
erally, our hypothesis is that the adaptive model will
be able to generate user-specific explanations that can
yield a better understanding of the domain, higher
user satisfaction, and trust. While this is left to an
evaluation study with human users, in this paper we
present first results to demonstrate the model’s ability
to adapt to different explainees and to produce explanations accordingly.
We created three partner models Harry, Ron and
Hermione (Table 2). Hermione is highly interested,
always stays focused and is an expert in nearly every-
thing. Harry often gets distracted and is not much of
an expert in the game domain. Finally, Ron does not
have a high level of attentiveness as he easily loses
track or gets bored when listening to an explanation,
but he has a high level of game expertise. Note that,
as we focus on the explanation blocks only, the co-
operation value is not considered here. Also, due to
the simplified feedback, it is currently not possible to
analyse the quality of feedback.
Table 2: The three partner models used for simulation.
Partner Attentiveness Expertise
Hermione 0.9 0.9
Ron 0.3 0.8
Harry 0.3 0.3
These models were used as different initial partner models of the explainer. Further, during the expla-
nation process, feedback was generated according to
three different feedback distribution types (Table 3).
Table 3: The three kinds of feedback distribution.
FB Negative Positive None
A 0.2 0.6 0.2
B 0.01 0.39 0.6
C 0.39 0.01 0.6
In feedback type A there is a high frequency of
feedback overall, which would be correlated with a
high level of attentiveness. The feedback is mainly
positive, which should indicate a high level of exper-
tise. Feedback type B rarely produces feedback and
therefore displays a low level of attentiveness. If feed-
back is produced, it is mainly positive, this correlates
with a high degree of expertise. Finally, feedback type
C should correspond to a low level of both attentive-
ness and expertise.
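In the simulation, a feedback signal is drawn from the chosen distribution; a minimal sketch (one draw per explanation move is our assumption about the sampling granularity):

```python
import random
from typing import Optional

FEEDBACK_TYPES = {          # (negative, positive, none), cf. Table 3
    "A": (0.2, 0.6, 0.2),
    "B": (0.01, 0.39, 0.6),
    "C": (0.39, 0.01, 0.6),
}

def sample_feedback(fb_type: str) -> Optional[str]:
    """Draw one feedback signal for the simulated explainee."""
    p_neg, p_pos, p_none = FEEDBACK_TYPES[fb_type]
    return random.choices(["-", "+", None], weights=[p_neg, p_pos, p_none])[0]
```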
Figure 5: Expertise development during explanation.
In our evaluation, we have SNAPE explaining
Quarto starting out with one of the three partner mod-
els and then receiving feedback according to one of
the three types. If the feedback matches the initially
assumed partner model (i.e. Hermione - feedback
A; Ron - feedback B; Harry - feedback C), SNAPE
should explain in a way suited to the partner it thinks it is talking to. Adaptive explanation generation, however, would be observed if the partner does not behave as expected, i.e. produces feedback that does not match the initial attributions, and the system changes its explanation behavior as a result.
We combined each initial partner model with each
of the possible feedback variants and let SNAPE explain ten times each. As long as the partner model
matches the feedback type, we expect the develop-
ment of the partner model during the explanation to
be stable. In the other cases we expect the partner
model to update to the feedback and cause changes in
the explanation behavior, which will be shown in this
section as well.
Figure 5 shows how SNAPE adjusts the expertise
value to the unexpected feedback distribution. The
main changes appear in the updates after the first three
explanation blocks. It can also be observed that, no matter which initial partner model SNAPE started with, the same feedback leads to a more or less equal expertise value at the end of the explanation.
Figure 6: Attentiveness development during explanation.
The second partner model value we were inter-
ested in was attentiveness, which differed in the feed-
back distributions as well. While B and C generated comparatively little feedback, we had a lot of feedback
in A. Figure 6 confirms that, no matter what the ini-
tial partner model was, SNAPE ends up at a higher
attentiveness value after getting feedback A. If the
initial partner model does not match the actual feed-
back, the partner model is adjusted after the first four
sequences.
To illustrate how SNAPE adapts to different part-
ners, we now compare two example explanations of
the explanation block board. SNAPE explains the
block to Hermione and Harry, respectively, who give
feedback in accordance with what SNAPE's partner model expects. First, let us consider the conversation between SNAPE and Hermione. We can see two interesting behaviors in the dialog. The first confirms what we previously expected: The explanation is
comparatively short and mainly consists of the move
provide (P), as Hermione tends to give positive feed-
back, which generates a high level of understanding
(lou) and does not leave any need to deepen the con-
tent currently under discussion (cud).
After SNAPE provides the triple t1 (P-t1), Hermione does not give any feedback, so he repeats the triple. Then, even though there is still no feedback given by Hermione, SNAPE moves on to the next triple. As he does not come back to the triple during the explanation, it is taken as grounded.
Snape: "Quarto has a game board." (P-t0)
Hermione: +
Snape: "The unused figures are put next
to the board, the stable." (P-t1)
Hermione: None
Snape: "Quarto has a stable." (R-t1)
Hermione: None
Snape: "The board consists of individual
fields." (P-t2)
Hermione: +
This differs from the explanation behavior toward
Harry. As SNAPE provides the second triple (P-t2)
Harry does not give any feedback, so SNAPE gives
some additional information (A-t2). Harry again does
not provide any feedback and, in contrast to the pre-
vious example, SNAPE sticks to the triple currently
under discussion (cud) and generates an example (E-
t2) for the same triple.
Snape: "Quarto has a game board." (P-t0)
Harry: -
Snape: "The board consists of individual
fields." (P-t2)
Harry: None
Snape: "It consists of 16 fields." (A-t2)
Harry: None
Snape: "The board is structured like a
net for example." (E-t2)
Harry: +
Snape: "The board is made of wood."
(A-t0)
Harry: None
Snape: "The board is placed on the
table." (A-t0)
Harry: +
Snape: "The unused figures are put next
to the board, the stable." (P-t1)
Harry: None
Snape: "The stable is not part of the
actual board." (A-t1)
Harry: +
Now Harry gives positive feedback and SNAPE
moves to the next triple (A-t0). Also the explanation
is longer than the one for Hermione, which is due to
the lower expertise and attentiveness of Harry as the
explainee. The example also shows that SNAPE is capable of changing to a new triple, even though the item currently under discussion (cud) is not grounded, and of coming back to it at a later point.
5 CONCLUSIONS
In this paper we introduced SNAPE, an explana-
tion generation model using hierarchical decision pro-
cesses that integrate flexible, online partner models.
This online update results in a non-stationary deci-
sion process, which is realized sequentially by com-
posing and solving local, updated models of rational
decision-making in explanation generation and com-
munication. By embedding those in a hierarchical decision model we mirror everyday explanations, which are roughly preplanned but planned in detail only from sequence to sequence. Such hierarchical decision models reduce the state and action spaces significantly, so that local models can be solved online using MCTS.
tem updates its partner model and the corresponding
decision-making during the conversation, based on
initial assumptions as well as the kind and frequency
of the feedback provided by the explainee through-
out the running interaction. We also reported two
dialog examples produced for the same explanation
block with different underlying partner models. In
direct comparison it becomes clear how SNAPE already adapts its explanation strategy to the partner model of the user.
In future work, we will improve the feedback in-
terpretation to enable evaluation studies with human
explainees. So far, we have only allowed simplified feedback, such as "+", "-" and "None". We want the
agent to be able to process backchannels as well as
more complex feedback, such as open questions, sen-
tence completion or examples generated by the ex-
plainee, which may indicate and influence the level
of cooperation. A high level of cooperation leads
to a smaller need for verification blocks and hence
is a partner model feature that influences the highest
level of global explanation planning. Further, the ut-
terance generation, which is currently based on pre-
defined templates, will be improved towards a prob-
abilistic grammar. This will allow for adapting to
the partner also at the level of lexico-grammatical
choices in accordance with the various levels of the
partner model already incorporated in SNAPE. Ongo-
ing work focuses on defining and including precondi-
tions for each triple into the model. The level of un-
derstanding (lou) of the preconditions directly influ-
ences the reward that is given for a state. This results
in lower costs for generating triples for which more
preconditions have been introduced.
ACKNOWLEDGEMENTS
We are grateful to Fabian Heinrich and Markus Rothgänger for help with developing the model, and
to Katharina Rohlfing, Heike Buhl, Josephine Fisher
and Erick Ronoh for valuable conceptual discus-
sions. This research was funded by the Deutsche
Forschungsgemeinschaft (DFG, German Research
Foundation): TRR 318/1 2021 – 438445824.
REFERENCES
Allameh, M. and Zaman, L. (2021). Jessy: A Conversa-
tional Assistant for Tutoring Digital Board Games. In
Extended Abstracts of the 2021 Annual Symposium
on Computer-Human Interaction in Play, pages 168–
173, Virtual Event, Austria. Association for Comput-
ing Machinery.
Anjomshoae, S., Najjar, A., Calvaresi, D., and Främling,
K. (2019). Explainable Agents and Robots: Re-
sults from a Systematic Literature Review. In Pro-
ceedings of the 18th International Conference on Au-
tonomous Agents and MultiAgent Systems, AAMAS
’19, pages 1078–1088. International Foundation for
Autonomous Agents and Multiagent Systems.
Bakker, B., Zivkovic, Z., and Krose, B. (2005). Hierarchi-
cal dynamic programming for robot path planning. In
2005 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems, pages 2756–2761.
Barry, J., Kaelbling, L. P., and Lozano-Perez, T. (2010).
Hierarchical Solution of Large Markov Decision
Processes. ICAPS-10 Workshop on Planning and
Scheduling Under Uncertainty, 2010, page 8.
Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M.,
Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D.,
Samothrakis, S., and Colton, S. (2012). A Survey of
Monte Carlo Tree Search Methods. IEEE Transac-
tions on Computational Intelligence and AI in Games,
4(1):1–43.
Buhl, H. M. (2001). Partner Orientation and Speaker’s
Knowledge as Conflicting Parameters in Language
Production. Journal of Psycholinguistic Research,
pages 549–567.
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learn-
ing, and games. Cambridge University Press, Cam-
bridge. OCLC: 70056026.
Chades, I., Carwardine, J., Martin, T., Nicol, S., Sabbadin,
R., and Buffet, O. (2012). MOMDPs: A Solution for
Modelling Adaptive Management Problems. Proceed-
ings of the AAAI Conference on Artificial Intelligence,
26(1):267–273.
Collins, M. (2011). Probabilistic Context-Free Grammars
(PCFGs). Columbia University.
Coulom, R. (2007). Efficient Selectivity and Backup Oper-
ators in Monte-Carlo Tree Search. In van den Herik,
H. J., Ciancarini, P., and Donkers, H. H. L. M. J.,
editors, Computers and Games, pages 72–83, Berlin,
Heidelberg. Springer Berlin Heidelberg.
El-Assady, M., Jentner, W., Kehlbeck, R., Schlegel, U.,
Sevastjanova, R., Sperrle, F., Spinner, T., and Keim,
D. (2019). Towards XAI: Structuring the Processes of
Explanations. Proceedings of the ACM Workshop on
Human-Centered Machine Learning, Glasgow, UK,
4:13.
Fruit, R., Pirotta, M., Lazaric, A., and Brundskill, E. (2017).
Regret Minimization in MDPs with Options without
Prior Knowledge. Advances in Neural Information
Processing Systems, 30.
Grice, H. P. (1975). Logic and Conversation. Speech Acts,
3:41 – 58.
Gunning, D., Vorm, E., Wang, J. Y., and Turek, M. (2021).
DARPA’s explainable AI (XAI) program: A retro-
spective. Applied AI Letters, 2(4):e61.
Hadoux, E. (2015). Markovian sequential decision-making
in non-stationary environments: application to argu-
mentative debates. PhD thesis, Université Pierre et Marie Curie - Paris.
Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean,
T., and Boutilier, C. (2013). Hierarchical Solution
of Markov Decision Processes using Macro-actions.
arXiv preprint arXiv:1301.7381, page 10.
Keil, F. C. (2006). Explanation and Understanding. Annual
Review of Psychology, 57(1):227–254.
Kessler, C., Capocchi, L., Santucci, J.-F., and Zeigler, B.
(2017). Hierarchical Markov decision process based
on DEVS formalism. In 2017 Winter Simulation Con-
ference (WSC), pages 1001–1012, Las Vegas, NV.
IEEE.
Kocsis, L., Szepesvari, C., and Willemson, J. (2006). Im-
proved Monte-Carlo Search. Univ. Tartu, Tech. Rep,
1:22.
Lapan, M. (2018). Deep reinforcement learning hands-
on: apply modern RL methods, with deep Q-networks,
value iteration, policy gradients, TRPO, AlphaGo
Zero and more. Packt Publishing Ltd.
Lecarpentier, E. and Rachelson, E. (2019). Non-Stationary
Markov Decision Processes, a Worst-Case Approach
using Model-Based Reinforcement Learning. Ad-
vances in neural information processing systems,
32:10.
Miller, T. (2019). Explanation in Artificial Intelligence: In-
sights from the Social Sciences. Artificial Intelligence,
267:1 – 38. arXiv:1706.07269 [cs].
Motik, B., Patel-Schneider, P. F., Parsia, B., Bock, C., Fok-
oue, A., Haase, P., Hoekstra, R., Horrocks, I., Rutten-
berg, A., Sattler, U., et al. (2009). Owl 2 web ontology
language: Structural specification and functional-style
syntax. W3C recommendation, 27(65):159.
Pettet, G., Mukhopadhyay, A., and Dubey, A. (2022).
Decision Making in Non-Stationary Environments
with Policy-Augmented Monte Carlo Tree Search.
arXiv:2202.13003 [cs].
Puterman, M. L. and Feinberg, E. A. (1994). Markov De-
cision Processes: Discrete Stochastic Dynamic Pro-
gramming. SIAM Review, 38(4):689.
Rohlfing, K. J., Cimiano, P., Scharlau, I., Matzner, T.,
Buhl, H. M., Buschmeier, H., Esposito, E., Grim-
minger, A., Hammer, B., Haeb-Umbach, R., Horwath,
I., Hüllermeier, E., Kern, F., Kopp, S., Thommes, K.,
Ngonga Ngomo, A.-C., Schulte, C., Wachsmuth, H.,
Wagner, P., and Wrede, B. (2021). Explanation as a
Social Practice: Toward a Conceptual Framework for
the Social Design of AI Systems. IEEE Transactions
on Cognitive and Developmental Systems, 13(3):717–
728.
Russell, S. and Norvig, P. (2021). Artificial Intelligence,
Global Edition : A Modern Approach. Pearson
Deutschland.
Scherrer, B. and Lesner, B. (2012). On the Use of Non-
Stationary Policies for Stationary Infinite-Horizon
Markov Decision Processes. Advances in Neural In-
formation Processing Systems, 25.
Silver, D. and Veness, J. (2010). Monte-Carlo Planning
in Large POMDPs. Advances in Neural Information
Processing Systems, 23:9.
Sokol, K. and Flach, P. (2020). One Explanation Does Not
Fit All: The Promise of Interactive Explanations for
Machine Learning Transparency. KI - Künstliche Intelligenz, 34(2):235–250.
Stange, S., Hassan, T., Schröder, F., Konkol, J., and
Kopp, S. (2022). Self-Explaining Social Robots:
An Explainable Behavior Generation Architecture for
Human-Robot Interaction. Frontiers in Artificial In-
telligence, page 87.
Stepin, I., Alonso, J. M., Catala, A., and Pereira-Farina, M.
(2021). A Survey of Contrastive and Counterfactual
Explanation Generation Methods for Explainable Ar-
tificial Intelligence. IEEE Access, 9:11974–12001.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing. The MIT Press, 2 edition.
Zare, M., Ayub, A., Wagner, A. R., and Passonneau, R. J.
(2019). Show me how to win: a robot that uses dialog
management to learn from demonstrations. In Pro-
ceedings of the 14th International Conference on the
Foundations of Digital Games, pages 1–7.