Computing Maxmin Strategies in Extensive-form Zero-sum Games with
Imperfect Recall
Branislav Bošanský, Jiří Čermák, Karel Horák and Michal Pěchouček
Department of Computer Science, Czech Technical University in Prague, Prague, Czech Republic
Keywords:
Game Theory, Imperfect Recall, Maxmin Strategies.
Abstract:
Extensive-form games with imperfect recall are an important game-theoretic model that allows a compact
representation of strategies in dynamic strategic interactions. Practical use of imperfect recall games is limited
due to negative theoretical results: a Nash equilibrium does not have to exist, computing maxmin strategies is
NP-hard, and they may require irrational numbers. We present the first algorithm for approximating maxmin
strategies in two-player zero-sum imperfect recall games without absentmindedness. We modify the well-
known sequence-form linear program to model strategies in imperfect recall games resulting in a bilinear
program and use a recent technique to approximate the bilinear terms. Our main algorithm is a branch-and-
bound search that provably reaches the desired approximation after an exponential number of steps in the size
of the game. Experimental evaluation shows that the proposed algorithm can approximate maxmin strategies
of randomly generated imperfect recall games of sizes beyond toy problems within a few minutes.
1 INTRODUCTION
The extensive form is a well-known representation
of dynamic strategic interactions that evolve in time.
Games in the extensive form (extensive-form games;
EFGs) are visualized as game trees, where nodes
correspond to states of the game and edges to ac-
tions executed by players. This representation is gen-
eral enough to model stochastic events and imperfect
information when players are unable to distinguish
among several states. Recent years have seen ad-
vancements in algorithms for computing solution con-
cepts in large zero-sum extensive-form games (e.g.,
solving heads-up limit Texas hold'em poker (Bowling
et al., 2015)).
Most of the algorithms for finding optimal strate-
gies in EFGs assume that players remember all infor-
mation gained during the course of the game (Zinke-
vich et al., 2008; Hoda et al., 2010; Bošanský et al., 2014). This assumption is known as perfect recall
and has a significant impact on theoretical proper-
ties of finding optimal strategies in EFGs. Namely,
there is an equivalence between two types of strategies in perfect recall games: mixed strategies (probability distributions over pure strategies¹) and behavioral strategies (probability distributions over actions in each decision point) (Kuhn, 1953). This equivalence guarantees that a Nash equilibrium (NE) exists in behavioral strategies in perfect recall games (the proof of the existence of NE deals with mixed strategies only (Nash, 1950)) and it is exploited by algorithms for computing a NE in zero-sum EFGs with perfect recall, namely the well-known sequence-form linear program (Koller et al., 1996; von Stengel, 1996).

¹ A pure strategy in an EFG is an assignment of an action to play in each decision point.
The caveat of perfect recall is that remember-
ing all information increases the number of decision
points (and consequently the size of a behavioral strat-
egy) exponentially with the number of moves in the
game. One possibility for tackling the size of per-
fect recall EFGs is to create an abstracted game where
certain decision points are merged together, solve this
abstracted game, and then translate the strategy from
the abstracted game into the original game (e.g., see
(Gilpin and Sandholm, 2007; Kroer and Sandholm,
2014; Kroer and Sandholm, 2016)). However, devis-
ing abstracted games that have imperfect recall is de-
sirable due to the reduced size. One then must com-
pute behavioral strategies in order to exploit the re-
duced size (mixed strategies already operate over an
exponentially large set of pure strategies).
Solving imperfect recall games has several fun-
damental problems. The best known game-theoretic
solution concept, a Nash equilibrium (NE), does not
have to exist even in zero-sum games (see (Wichardt,
2008) for a simple example) and standard algo-
rithms (e.g., a Counterfactual Regret Minimization
(CFR) (Zinkevich et al., 2008)) can converge to in-
correct strategies (see Example 1). Therefore, we focus on finding a strategy that guarantees the best possible expected outcome for a player: a maxmin strategy. However, computing a maxmin strategy is
NP-hard and such strategies may require irrational
numbers even when the input uses only rational num-
bers (Koller and Megiddo, 1992).
Existing works avoid these negative results by cre-
ating very specific abstracted games so that perfect
recall algorithms are still applicable. One example
is a subset of imperfect recall games called (skewed)
well-formed games, motivated by the poker domain,
in which the standard perfect-recall algorithms (e.g.,
CFR) are still guaranteed to find an approximate
Nash behavioral strategy (Lanctot et al., 2012; Kroer
and Sandholm, 2016). The restrictions on games
to form (skewed) well-formed games are, however,
rather strict and can prevent us from creating suffi-
ciently small abstracted games. To fully explore the
possibilities of exploiting the concept of abstractions
and/or other compactly represented dynamic games
(e.g., Multi-Agent Influence Diagrams (Koller and
Milch, 2003)), a new algorithm for solving imperfect
recall games is required.
1.1 Our Contribution
We advance the state of the art and provide the
first approximate algorithm for computing maxmin
strategies in imperfect recall games (since maxmin
strategies might require irrational numbers (Koller
and Megiddo, 1992), finding exact maxmin has fun-
damental difficulties). We assume imperfect recall
games with no absentmindedness, which means that
each decision point in the game can be visited at
most once during the course of the game and it is
arguably a natural assumption in finite games (see,
e.g., (Piccione and Rubinstein, 1997) for a detailed
discussion). The main goal of our approach is to
find behavioral strategies that maximize the expected
outcome of player 1 against an opponent that min-
imizes the outcome. We base our formulation on
the sequence-form linear program for perfect recall
games (Koller et al., 1996; von Stengel, 1996) and
we extend it with bilinear constraints necessary for
the correct representation of strategies of player 1 in
imperfect recall games. We approximate the bilinear
terms using recent Multiparametric Disaggregation
Technique (MDT) (Kolodziej et al., 2013) and pro-
vide a mixed-integer linear program (MILP) for ap-
proximating maxmin strategies. Finally, we consider
a linear relaxation of the MILP and propose a branch-
and-bound algorithm that (1) repeatedly solves this
linear relaxation and (2) tightens the constraints that
approximate bilinear terms as well as relaxed binary
variables from the MILP. We show that the branch-
and-bound algorithm ends after exponentially many
steps while guaranteeing the desired precision.
Our algorithm approximates maxmin strategies
for player 1 having generic imperfect recall without
absentmindedness and we give two variants of the al-
gorithm depending on the type of imperfect recall of
the opponent. If the opponent, player 2, has either a
perfect recall or so-called A-loss recall (Kaneko and
Kline, 1995; Kline, 2002), the linear program solved
by the branch-and-bound algorithm has a polynomial
size in the size of the game. If player 2 has a generic
imperfect recall without absentmindedness, the linear
program solved by the branch-and-bound algorithm
can be exponentially large.
We provide a short experimental evaluation to
demonstrate that our algorithm can solve games far
beyond the size of toy problems. Randomly gener-
ated imperfect recall games with up to $5 \cdot 10^3$ states can typically be solved within a few minutes.
All the technical proofs can be found in the ap-
pendix or in the full version of this paper.
2 TECHNICAL PRELIMINARIES
Before describing our algorithm we define extensive-
form games, different types of recall, and describe the
approximation technique for the bilinear terms.
A two-player extensive-form game (EFG) is a tuple $G = (\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{A}, u, \mathcal{C}, \mathcal{I})$. $\mathcal{N} = \{1, 2\}$ is a set of players; by $i$ we refer to one of the players and by $-i$ to his opponent. $\mathcal{H}$ denotes a finite set of histories of actions taken by all players and chance from the root of the game. Each history corresponds to a node in the game tree; hence, we use the terms history and node interchangeably. We say that $h$ is a prefix of $h'$ ($h \sqsubseteq h'$) if $h$ lies on a path from the root of the game tree to $h'$. $\mathcal{Z} \subseteq \mathcal{H}$ is the set of terminal states of the game. $\mathcal{A}$ denotes the set of all actions. An ordered list of all actions of player $i$ from the root to $h$ is referred to as a sequence, $\sigma_i = \mathrm{seq}_i(h)$; $\Sigma_i$ is the set of all sequences of $i$. For each $z \in \mathcal{Z}$ we define a utility function $u_i : \mathcal{Z} \rightarrow \mathbb{R}$ for each player $i$ ($u_i(z) = -u_{-i}(z)$ in zero-sum games). The chance player selects actions based on a fixed probability distribution known to all players. Function $\mathcal{C} : \mathcal{H} \rightarrow [0,1]$ is the probability of reaching $h$ due to chance.
Imperfect observation of player i is modeled via
information sets $\mathcal{I}_i$ that form a partition over the set of nodes $h \in \mathcal{H}$ where $i$ takes action. Player $i$ cannot distinguish between nodes in any $I_i \in \mathcal{I}_i$. $\mathcal{A}(I_i)$ denotes the actions available in each $h \in I_i$. The action $a$ uniquely identifies the information set where it is available. We use $\mathrm{seq}_i(I_i)$ as the set of all sequences of player $i$ leading to $I_i$. Finally, we use $\mathrm{inf}_i(\sigma_i)$ to denote the set of all information sets to which sequence $\sigma_i$ leads.
A behavioral strategy $\beta_i \in \mathcal{B}_i$ is a probability distribution over actions in each information set $I \in \mathcal{I}_i$. We use $u_i(\beta) = u_i(\beta_i, \beta_{-i})$ for the expected outcome of the game for $i$ when players follow $\beta$. A best response of player $i$ against $\beta_{-i}$ is a strategy $\beta_i^{BR} \in BR_i(\beta_{-i})$, where $u_i(\beta_i^{BR}, \beta_{-i}) \geq u_i(\beta_i', \beta_{-i})$ for all $\beta_i' \in \mathcal{B}_i$. $\beta_i(I,a)$ is the probability of playing $a$ in $I$; $\beta(h)$ denotes the probability that $h$ is reached when both players play according to $\beta$ and due to chance. We say that $\beta_i$ and $\beta_i'$ are realization equivalent if for any $\beta_{-i}$ and any $z \in \mathcal{Z}$ it holds that $\beta(z) = \beta'(z)$, where $\beta = (\beta_i, \beta_{-i})$ and $\beta' = (\beta_i', \beta_{-i})$.
A maxmin strategy $\beta_i$ is defined as $\beta_i = \arg\max_{\beta_i \in \mathcal{B}_i} \min_{\beta_{-i} \in \mathcal{B}_{-i}} u_i(\beta_i, \beta_{-i})$. Note that when a Nash equilibrium in behavioral strategies exists in a two-player zero-sum imperfect recall game, then $\beta_i$ is a Nash equilibrium strategy for $i$.
2.1 Types of Recall
We now briefly define types of recall in EFGs and
state several lemmas and observations about charac-
teristics of strategies in imperfect recall EFGs that are
later exploited by our algorithm.
In perfect recall, all players remember the history
of their own actions and all information gained during
the course of the game. As a consequence, all nodes
in any information set $I_i$ have the same sequence for player $i$. If the assumption of perfect recall does not hold, we talk about games with imperfect recall. In imperfect recall games, mixed and behavioral strategies are not comparable (Kuhn, 1953). However, in games without absentmindedness (AM), where each information set is encountered at most once during the course of the game, the following observation allows us to consider only pure best responses of the opponent when computing maxmin strategies:
Lemma 1. Let $G$ be an imperfect recall game without AM and $\beta_1$ a strategy of player 1. Then there exists an ex ante (i.e., when evaluating only the expected value of the strategy) pure behavioral best response of player 2.
The proof is in the full version of the paper.
This lemma is applied when a mathematical program for computing maxmin strategies is formulated: strategies of player 2 can be considered as constraints using pure best responses. Note that this is not true in general imperfect recall games; in games with AM, an ex ante best response may need to be randomized (e.g., in the game with the absentminded driver (Piccione and Rubinstein, 1997)).
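For a concrete picture of how Lemma 1 is used, the sketch below (a toy illustration with made-up payoffs and encodings, not the authors' implementation) enumerates the pure strategies of player 2, one action per information set, and picks an ex ante best response against a fixed behavioral strategy of player 1.

```python
from itertools import product

# Toy zero-sum game: player 1 picks a in {L, R} according to beta1, player 2
# then picks an action in a single information set J without observing a.
# u[(a, b)] is player 1's utility; all numbers are made up for illustration.
u = {("L", "l"): 2.0, ("L", "r"): -1.0, ("R", "l"): -2.0, ("R", "r"): 3.0}
beta1 = {"L": 0.6, "R": 0.4}
infosets2 = {"J": ["l", "r"]}          # player 2's information sets and actions

def expected_value(pi2):
    """Expected utility of player 1 when she plays beta1 and player 2 plays
    the pure strategy pi2 (a dict: information set -> action)."""
    return sum(beta1[a] * u[(a, pi2["J"])] for a in beta1)

# Lemma 1: an ex ante best response of player 2 can be found among pure
# strategies, i.e., one deterministic action per information set.
pure_strategies = [dict(zip(infosets2, choice))
                   for choice in product(*infosets2.values())]
best = min(pure_strategies, key=expected_value)
print(best, expected_value(best))      # {'J': 'l'} with value 0.4
```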
A disadvantage of using pure best responses as
constraints for the minimizing player is that there
are exponentially many pure best responses in the
size of the game. In perfect recall games, this can
be avoided by formulating best-response constraints
such that the opponent is playing the best action in
each information set. However, this type of response,
termed time consistent strategy (Kline, 2002), does
not have to be an ex ante best response in general im-
perfect recall games (see (Kline, 2002) for an exam-
ple). A class of imperfect recall games where it is sufficient to consider only time consistent strategies when computing best responses is termed A-loss recall games (Kaneko and Kline, 1995; Kline, 2002).
Definition 1. Player $i$ has A-loss recall if and only if for every $I \in \mathcal{I}_i$ and nodes $h, h' \in I$ it holds that either (1) $\mathrm{seq}_i(h) = \mathrm{seq}_i(h')$, or (2) $\exists I' \in \mathcal{I}_i$ and two distinct actions $a, a' \in \mathcal{A}_i(I')$, $a \neq a'$, such that $a \in \mathrm{seq}_i(h) \wedge a' \in \mathrm{seq}_i(h')$.
Condition (1) in the definition says that if player i
has perfect recall then she also has A-loss recall. Con-
dition (2) requires that each loss of memory of an A-loss recall player can be traced back to some loss of memory of the player's own previous actions.
The equivalence between time consistent strate-
gies and ex ante best responses allows us to simplify
the best responses of player 2 in case she has A-
loss recall. Formally, it is sufficient to consider best
responses that correspond to the best response in a
coarsest perfect-recall refinement of the imperfect re-
call game when computing best response for a player
with A-loss recall. By the coarsest perfect recall refinement of an imperfect recall game $G$ we mean a perfect recall game $G'$ where we split the imperfect recall information sets into the largest subsets that still satisfy perfect recall.
Definition 2. The coarsest perfect recall refinement $G'$ of the imperfect recall game $G = \{\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{A}, u, \mathcal{C}, \mathcal{I}\}$ is a tuple $\{\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{A}', u, \mathcal{C}, \mathcal{I}'\}$, where for every $i \in \mathcal{N}$ and every $I_i \in \mathcal{I}_i$, $\mathcal{H}(I_i) = \{H_1, \ldots, H_n\}$ is a disjoint partition of all $h \in I_i$ such that $\bigcup_{j=1}^{n} H_j = I_i$, $\forall H_j \in \mathcal{H}(I_i)\; \forall h_k, h_l \in H_j : \mathrm{seq}_i(h_k) = \mathrm{seq}_i(h_l)$, and $\forall h_k \in H_k, h_l \in H_l : H_k \cap H_l = \emptyset \Rightarrow \mathrm{seq}_i(h_k) \neq \mathrm{seq}_i(h_l)$.

Each set from $\mathcal{H}(I_i)$ corresponds to an information set $I_i' \in \mathcal{I}_i'$. Moreover, $\mathcal{A}'$ is a modification of $\mathcal{A}$
guaranteeing $\forall I \in \mathcal{I}'\; \forall h_k, h_l \in I : \mathcal{A}'(h_k) = \mathcal{A}'(h_l)$, while for all distinct $I_k, I_l \in \mathcal{I}' : \mathcal{A}(I_k) \neq \mathcal{A}(I_l)$.
Note that we can restrict the coarsest perfect recall refinement only to player $i$ by splitting only the information sets of $i$ (information sets of $-i$ remain unchanged). Finally, we assume that there is a mapping between the actions $\mathcal{A}'$ from the coarsest perfect recall refinement and the actions $\mathcal{A}$ in the original game, so that we can identify to which actions from $\mathcal{A}'$ an original action $a \in \mathcal{A}$ maps. We keep this mapping implicit since it is clear from the context.
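As an illustration of Definition 2, the following sketch (with a hypothetical history encoding, not taken from the paper) computes the coarsest perfect recall refinement of a single information set by grouping its nodes according to the player's own action sequence.

```python
from collections import defaultdict

# Hypothetical encoding: a history is a tuple of (player, action) pairs, and
# seq_i(h) is the tuple of player i's own actions along h.
def seq_i(history, i):
    return tuple(action for (player, action) in history if player == i)

def coarsest_refinement(information_set, i):
    """Split one imperfect recall information set into the largest groups of
    nodes that share the same player-i sequence (cf. Definition 2)."""
    groups = defaultdict(list)
    for h in information_set:
        groups[seq_i(h, i)].append(h)
    return list(groups.values())

# Two histories reach the set via different own actions of player 1, so the
# refinement splits the set into two perfect recall information sets.
I1 = [((1, "a"), (2, "x")), ((1, "b"), (2, "x"))]
print(coarsest_refinement(I1, 1))
```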
Lemma 2. Let $G$ be an imperfect recall game where player 2 has A-loss recall, let $\beta_1$ be a strategy of player 1, and let $G'$ be the coarsest perfect recall refinement of $G$ for player 2. Let $\beta_2'$ be a pure best response in $G'$ and let $\beta_2$ be a realization equivalent behavioral strategy in $G$; then $\beta_2$ is a pure best response to $\beta_1$ in $G$.
The proof is in the full version of the paper.
Note that the NP-hardness proof of computing
maxmin strategies due to Koller (Koller and Megiddo,
1992) still applies, since we assume the maximizing
player to have generic imperfect recall and the reduc-
tion provided by Koller results in a game where the
maximizing player has generic imperfect recall while
the minimizing player has perfect recall, which is a
special case of both settings assumed in our paper.
Finally, let us show that CFR cannot be applied
in these settings. This is caused by the fact that CFR
iteratively minimizes per information set regret terms
(counterfactual regrets). Since in perfect recall games
the sum of counterfactual regrets provides an upper
bound on the external regret, this minimization is
guaranteed to converge to a strategy profile with 0 ex-
ternal regret. In imperfect recall games, however, the
sum of counterfactual regrets no longer forms an up-
per bound on the external regret (Lanctot et al., 2012),
and the minimization of these regret terms can, there-
fore, lead to a strategy profile with a non-zero external
regret.
Example 1: Consider the A-loss recall game in Figure 1. When x > 2, one of the strategy profiles with zero counterfactual regret (and therefore a profile to which CFR can converge) is mixing uniformly between both a, b and g, h, while player 2 plays d, e deterministically. By setting the utility x to some large number, this strategy profile can have an expected utility arbitrarily worse than the maxmin value 1. The reason is the presence of conflicting outcomes for some action in an imperfect recall information set, which cannot be generally avoided, or easily detected, in imperfect recall games.
Figure 1: An A-loss recall game where CFR finds a strategy
with the expected utility arbitrarily distant from the maxmin
value.
2.2 Approximating Bilinear Terms
The final technical tool that we use in our algorithm is the Multiparametric Disaggregation Technique (MDT) (Kolodziej et al., 2013) for approximating bilinear constraints.
The main idea of the approximation is to use a digit-
wise discretization of one of the variables from a bi-
linear term. The main advantage of this approxima-
tion is a low number of newly introduced integer vari-
ables and an experimentally confirmed speed-up over
the standard technique of piecewise McCormick en-
velopes (Kolodziej et al., 2013).
$\sum_{k=0}^{9} w_{k,\ell} = 1 \quad \forall \ell \in \mathbb{Z}$ (1a)
$w_{k,\ell} \in \{0,1\}$ (1b)
$\sum_{\ell \in \mathbb{Z}} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot w_{k,\ell} = b$ (1c)
$c^{L} \cdot w_{k,\ell} \leq \hat{c}_{k,\ell} \leq c^{U} \cdot w_{k,\ell} \quad \forall \ell \in \mathbb{Z},\; k \in 0..9$ (1d)
$\sum_{k=0}^{9} \hat{c}_{k,\ell} = c \quad \forall \ell \in \mathbb{Z}$ (1e)
$\sum_{\ell \in \mathbb{Z}} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot \hat{c}_{k,\ell} = a$ (1f)
Let $a = bc$ be a bilinear term. MDT discretizes variable $b$ and introduces new binary variables $w_{k,\ell}$ that indicate whether the digit at the $\ell$-th position is $k$. Constraint (1a) ensures that for each position $\ell$ exactly one digit is chosen. All digits must sum to $b$ (Constraint (1c)). Next, we introduce variables $\hat{c}_{k,\ell}$ that are equal to $c$ for such $k$ and $\ell$ where $w_{k,\ell} = 1$, and $\hat{c}_{k,\ell} = 0$ otherwise; $c^{L}$ and $c^{U}$ are bounds on the value of variable $c$. The value of $a$ is given by Constraint (1f).
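A minimal numeric sketch of this digit-wise idea (illustrative only, with an ad hoc encoding): select the digits $w_{k,\ell}$ of $b$ for finitely many positions $\ell \in \{-P, \ldots, 0\}$, set $\hat{c}_{k,\ell} = c$ exactly where $w_{k,\ell} = 1$, and recover the product $a = b\,c$ up to the truncation of $b$.

```python
def digits(b, P):
    """Digit-wise representation of b (0 <= b <= 1) truncated to positions -P..0:
    w[(k, l)] = 1 iff the digit of b at position l is k (cf. constraints 1a-1c)."""
    w = {}
    for l in range(-P, 1):
        digit = int(b * 10 ** (-l)) % 10
        for k in range(10):
            w[(k, l)] = 1 if k == digit else 0
    return w

b, c, P = 0.374, 2.5, 2
w = digits(b, P)
# c_hat equals c where the digit is selected, 0 otherwise (constraints 1d-1e).
c_hat = {kl: c * w_kl for kl, w_kl in w.items()}
b_trunc = sum(10 ** l * k * w[(k, l)] for (k, l) in w)           # constraint (1c), truncated
a_approx = sum(10 ** l * k * c_hat[(k, l)] for (k, l) in c_hat)  # constraint (1f)
print(b_trunc, a_approx, b * c)   # ~0.37 and ~0.925 vs 0.935: a lower-bound approximation
```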
This is an exact formulation that requires infinite sums and an infinite number of constraints. However, by restricting the set of all possible positions $\ell$ to a finite set $\{P^{L}, \ldots, P^{U}\}$ we get a lower bound approximation. Following the approach in (Kolodziej et al., 2013), we can extend the lower bound formulation to compute an upper bound:
Constraints (1a), (1d), (1e)
$\sum_{\ell \in \{P^{L},\ldots,P^{U}\}} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot w_{k,\ell} + \Delta b = b$ (2a)
$0 \leq \Delta b \leq 10^{P^{L}}$ (2b)
$\sum_{\ell \in \{P^{L},\ldots,P^{U}\}} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot \hat{c}_{k,\ell} + \Delta a = a$ (2c)
$c^{L} \cdot \Delta b \leq \Delta a \leq c^{U} \cdot \Delta b$ (2d)
$(c - c^{U}) \cdot 10^{P^{L}} + c^{U} \cdot \Delta b \leq \Delta a$ (2e)
$(c - c^{L}) \cdot 10^{P^{L}} + c^{L} \cdot \Delta b \geq \Delta a$ (2f)
Here, a variable $\Delta b$ is assigned to every discretized variable $b$, allowing it to take a value between two discretization points created due to the minimal value of $\ell$ (Constraints (2a)–(2b)). Similarly, we allow the product variable $a$ to be increased by a variable $\Delta a = \Delta b \cdot c$. To approximate the product of the delta variables, we use the McCormick envelope defined by Constraints (2d)–(2f).
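The following sketch (illustrative only, with hypothetical numbers) checks numerically that the McCormick envelope in Constraints (2d)–(2f) indeed contains the exact product $\Delta a = \Delta b \cdot c$ for any feasible $\Delta b$ and $c$.

```python
import random

cL, cU, PL = 0.0, 1.0, -2                 # bounds on c and the lowest digit position
for _ in range(1000):
    db = random.uniform(0.0, 10 ** PL)    # delta_b in [0, 10^PL], constraint (2b)
    c = random.uniform(cL, cU)
    da = db * c                           # exact product we want to envelope
    # McCormick envelope (2d)-(2f) for the product of the delta variables
    assert cL * db <= da <= cU * db + 1e-12                      # (2d)
    assert (c - cU) * 10 ** PL + cU * db <= da + 1e-12           # (2e)
    assert (c - cL) * 10 ** PL + cL * db >= da - 1e-12           # (2f)
print("McCormick envelope holds on all sampled points")
```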
3 MATHEMATICAL PROGRAMS
FOR APPROXIMATING
MAXMIN STRATEGIES
We now state the mathematical programs for approx-
imating maxmin strategies. The main idea is to add
bilinear constraints into the sequence form LP to re-
strict to imperfect recall strategies. We formulate an
exact bilinear program, followed by the approxima-
tion of bilinear terms using MDT.
3.1 Exact Bilinear Sequence Form
Against A-loss Recall Opponent
$\max_{x,r,v} \; v(\mathrm{root}, \emptyset)$ (3a)
s.t. $r(\emptyset) = 1$ (3b)
$0 \leq r(\sigma) \leq 1 \quad \forall \sigma \in \Sigma_1$ (3c)
$\sum_{a \in \mathcal{A}(I)} r(\sigma a) = r(\sigma) \quad \forall \sigma \in \Sigma_1, \forall I \in \mathrm{inf}_1(\sigma)$ (3d)
$\sum_{a \in \mathcal{A}(I)} x(a) = 1 \quad \forall I \in \mathcal{I}_1^{IR}$ (3e)
$0 \leq x(a) \leq 1 \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I)$ (3f)
$r(\sigma) \cdot x(a) = r(\sigma a) \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall \sigma \in \mathrm{seq}_1(I)$ (3g)
$\sum_{\sigma_1 \in \Sigma_1} g(\sigma_1, \sigma_2 a)\, r(\sigma_1) + \sum_{I' \in \mathrm{inf}_2(\sigma_2 a)} v(I', \sigma_2 a) \geq v(I, \sigma_2) \quad \forall I \in \mathcal{I}_2, \forall \sigma_2 \in \mathrm{seq}_2(I), \forall a \in \mathcal{A}(I)$ (3h)
Constraints (3a)–(3h) represent a bilinear reformulation of the sequence-form LP due to (von Stengel, 1996) applied to the information set structure of an imperfect recall game $G$. The objective of player 1 is to find a strategy that maximizes the expected utility of the game. The strategy is represented by variables $r$ that assign a probability to each sequence: $r(\sigma_1)$ is the probability that $\sigma_1 \in \Sigma_1$ will be played, assuming that the information sets in which the actions of sequence $\sigma_1$ are applicable are reached due to player 2. Probabilities $r$ must satisfy the so-called network flow Constraints (3c)–(3d). Finally, a strategy of player 1 is constrained by the best-responding opponent that selects an action minimizing the expected value in each $I \in \mathcal{I}_2$ and for each $\sigma_2 \in \mathrm{seq}_2(I)$ that was used to reach $I$ (Constraint (3h)). These constraints ensure that the opponent plays a best response in the coarsest perfect recall refinement of $G$ and thus also in $G$ due to Lemma 2. The expected utility of each action is a sum of the expected utility values from immediately reachable information sets $I'$ and from immediately reachable leaves. For the latter we use the generalized utility function $g : \Sigma_1 \times \Sigma_2 \rightarrow \mathbb{R}$ defined as $g(\sigma_1, \sigma_2) = \sum_{z \in \mathcal{Z} \,|\, \mathrm{seq}_1(z) = \sigma_1 \wedge \mathrm{seq}_2(z) = \sigma_2} u(z)\,\mathcal{C}(z)$.
In imperfect recall games, multiple $\sigma_i$ can lead to some imperfect recall information set $I_i \in \mathcal{I}_i^{IR} \subseteq \mathcal{I}_i$; hence, realization plans over sequences do not have to induce the same behavioral strategy in $I_i$. Therefore, for each $I_i \in \mathcal{I}_i^{IR}$ we define a behavioral strategy $x(a)$ for each $a \in \mathcal{A}(I_i)$ (Constraints (3e)–(3f)). To ensure that the realization probabilities induce the same behavioral strategy in $I_i$, we add the bilinear constraint $r(\sigma_i a) = x(a) \cdot r(\sigma_i)$ (Constraint (3g)).
3.1.1 Player 2 without A-Loss Recall
If player 2 does not have A-loss recall, the mathematical program must use each pure best response $\pi_2 \in \Pi_2$ of player 2 as a constraint, as follows:
$\max_{x,r,v} \; v(\mathrm{root})$ (4a)
Constraints (3b)–(3f)
$\sum_{z \in \mathcal{Z} \,|\, \pi_2(z) = 1} u(z)\,\mathcal{C}(z)\, r(\mathrm{seq}_1(z)) \geq v(\mathrm{root}) \quad \forall \pi_2 \in \Pi_2$ (4b)
Since the modification does not change the parts
of the program related to the approximation of strate-
gies of player 1, all the following approximation
methods, theorems, and the branch-and-bound al-
gorithm are applicable for general imperfect recall
games without absentminded players.
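To give a concrete feel for program (4), the sketch below (illustrative only; it covers the degenerate special case where player 1 has a single decision point, so her realization plan is just a probability distribution and the bilinear constraints vanish) solves the maxmin LP against an enumerated set of pure responses of player 2 with scipy.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of program (4): player 1 mixes over two sequences (r1, r2),
# player 2 has two pure strategies (columns); U[i, j] aggregates u(z) * C(z)
# per (player-1 sequence, player-2 pure strategy). Numbers are made up.
U = np.array([[2.0, -1.0],
              [-2.0, 3.0]])

# Variables: [r1, r2, v]; maximize v  <=>  minimize -v.
c = np.array([0.0, 0.0, -1.0])
# Constraint (4b): for every pure response j,  -(U[:, j] . r) + v <= 0.
A_ub = np.hstack([-U.T, np.ones((U.shape[1], 1))])
b_ub = np.zeros(U.shape[1])
# Constraints (3b)-(3d) degenerate here to r1 + r2 = 1.
A_eq = np.array([[1.0, 1.0, 0.0]])
b_eq = np.array([1.0])
bounds = [(0, 1), (0, 1), (None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x[:2], -res.fun)   # maxmin realization plan and game value (0.625, 0.375; 0.5)
```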
3.2 Upper Bound MILP Approximation
The upper bound formulation of the bilinear program follows the MDT example and uses ideas similar to Section 2.2. In accord with the MDT, we represent every variable $x(a)$ using a finite number of digits. Binary variables $w^{I_1,a}_{k,\ell}$ correspond to the $w_{k,\ell}$ variables from the example shown in Section 2.2 and are used for the digit-wise discretization of $x(a)$. Finally, $\hat{r}(\sigma_1)^{a}_{k,\ell}$ correspond to the $\hat{c}_{k,\ell}$ variables used to discretize the bilinear term $r(\sigma_1 a)$. In order to allow variable $x(a)$ to attain an arbitrary value from the $[0,1]$ interval using a finite number of digits of precision, we add an additional real variable $0 \leq \Delta x(a) \leq 10^{-P}$ that can span the gap between two adjacent discretization points. Constraints (5d) and (5e) describe this loosening. Variables $\Delta x(a)$ also have to be propagated to the bilinear terms $r(\sigma_1) \cdot x(a)$ involving $x(a)$. We cannot represent the product $\Delta r(\sigma_1 a) = r(\sigma_1) \cdot \Delta x(a)$ exactly and therefore we give bounds based on the McCormick envelope (Constraints (5i)–(5j)).
$\max_{x,r,v} \; v(\mathrm{root}, \emptyset)$ (5a)
s.t. Constraints (3b)–(3f), (3h)
$w^{I,a}_{k,\ell} \in \{0,1\} \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall k \in 0..9, \forall \ell \in -P..0$ (5b)
$\sum_{k=0}^{9} w^{I,a}_{k,\ell} = 1 \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall \ell \in -P..0$ (5c)
$\sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot w^{I,a}_{k,\ell} + \Delta x(a) = x(a) \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I)$ (5d)
$0 \leq \Delta x(a) \leq 10^{-P} \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I)$ (5e)
$0 \leq \hat{r}(\sigma)^{a}_{k,\ell} \leq w^{I,a}_{k,\ell} \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall \sigma \in \mathrm{seq}_1(I), \forall \ell \in -P..0$ (5f)
$\sum_{k=0}^{9} \hat{r}(\sigma)^{a}_{k,\ell} = r(\sigma) \quad \forall I \in \mathcal{I}_1^{IR}, \forall \sigma \in \mathrm{seq}_1(I), \forall \ell \in -P..0$ (5g)
$\sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot \hat{r}(\sigma)^{a}_{k,\ell} + \Delta r(\sigma a) = r(\sigma a) \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall \sigma \in \mathrm{seq}_1(I)$ (5h)
$(r(\sigma) - 1) \cdot 10^{-P} + \Delta x(a) \leq \Delta r(\sigma a) \leq 10^{-P} \cdot r(\sigma) \quad \forall I \in \mathcal{I}_1^{IR}, \forall a \in \mathcal{A}(I), \forall \sigma \in \mathrm{seq}_1(I)$ (5i)
$0 \leq \Delta r(\sigma a) \leq \Delta x(a) \quad \forall I \in \mathcal{I}_1^{IR}, \forall \sigma \in \mathrm{seq}_1(I), \forall a \in \mathcal{A}(I)$ (5j)
Due to this loose representation of $r(\sigma_1 a)$, the reformulation of bilinear terms is no longer exact and this MILP therefore yields an upper bound on the bilinear sequence-form program (3). Note that the MILP has both the number of variables and the number of constraints bounded by $O(|\mathcal{I}| \cdot |\Sigma| \cdot P)$, where $|\Sigma|$ is the number of sequences of both players. The number of binary variables is equal to $10 \cdot |\mathcal{I}_1^{IR}| \cdot A_1^{max} \cdot P$, where $A_1^{max} = \max_{I \in \mathcal{I}_1} |\mathcal{A}_1(I)|$.
3.3 Theoretical Analysis of the Upper
Bound MILP
The variables $\Delta x(a)$ and $\Delta r(\sigma a)$ ensure that the optimal value of the MILP is an upper bound on the value of the bilinear program. The drawback is that the realization probabilities do not have to induce a valid strategy in the imperfect recall game $G$, i.e. if $\sigma_1, \sigma_2$ are two sequences leading to an imperfect recall information set $I_1 \in \mathcal{I}_1^{IR}$ where action $a \in \mathcal{A}(I_1)$ can be played, $r(\sigma_1 a)/r(\sigma_1)$ need not equal $r(\sigma_2 a)/r(\sigma_2)$. We will show that it is possible to create a valid strategy in $G$ which decreases the value by at most $\varepsilon$, while deriving a bound on this $\varepsilon$.
Let $\beta^1(I_1), \ldots, \beta^k(I_1)$ be the behavioral strategies in the imperfect recall information set $I_1 \in \mathcal{I}_1^{IR}$ corresponding to the realization probabilities of continuations of the sequences $\sigma^1, \ldots, \sigma^k$ leading to $I_1$. These probability distributions can be obtained from the realization plan as $\beta^j(I_1, a) = r(\sigma^j a)/r(\sigma^j)$ for $\sigma^j \in \mathrm{seq}_1(I_1)$ and $a \in \mathcal{A}(I_1)$. We will omit the information set and use $\beta(a)$ whenever it is clear from the context. If the imperfect recall is violated in $I_1$, $\beta^j(a)$ may not be equal to $\beta^l(a)$ for some $j, l$ and action $a \in \mathcal{A}(I_1)$.
Proposition 1. It is always possible to construct a strategy $\beta(I_1)$ such that $\|\beta(I_1) - \beta^j(I_1)\|_1 \leq |\mathcal{A}(I_1)| \cdot 10^{-P}$ for every $j$.²
We now connect the distance of a corrected strategy $\beta(I_1)$ from a set of behavioral strategies $\beta^1(I_1), \ldots, \beta^k(I_1)$ in $I_1 \in \mathcal{I}_1^{IR}$ to the expected value of the strategy.
Theorem 1. The error of the Upper Bound MILP is bounded by
$\varepsilon = 10^{-P} \cdot d \cdot A_1^{max} \cdot \frac{v_{max}(\emptyset) - v_{min}(\emptyset)}{2},$
where $d$ is the maximum number of player 1's imperfect recall information sets encountered on a path from the root to a terminal node, $A_1^{max} = \max_{I_1 \in \mathcal{I}_1^{IR}} |\mathcal{A}(I_1)|$ is the branching factor, and $v_{min}(\emptyset)$, $v_{max}(\emptyset)$ are the lowest and highest utilities for player 1 in the whole game, respectively.
The idea of the proof is to bound the error in every $I_1 \in \mathcal{I}_1^{IR}$ and propagate the error in a bottom-up fashion.
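For a rough sense of the bound (an illustrative calculation with made-up numbers, not taken from the paper): with precision $P = 2$, at most $d = 3$ imperfect recall information sets on any root-to-leaf path, branching factor $A_1^{max} = 2$, and utilities in $[-10, 10]$, Theorem 1 gives
$\varepsilon = 10^{-2} \cdot 3 \cdot 2 \cdot \frac{10 - (-10)}{2} = 0.6.$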
² The L1 norm is taken as $\|x_1 - x_2\|_1 = \sum_{a \in \mathcal{A}(I_1)} |x_1(a) - x_2(a)|$.
The error of the Upper Bound MILP is bounded if the precision of all approximations of bilinear terms is $P$. However, we can increase the precision for each term separately and thus design the following iterative algorithm (termed simply MILP in the experiments): (1) start with the precision set to 0 for all bilinear terms, (2) for each approximation of a bilinear term calculate the current error contribution (the difference between $r(\sigma_1 a)$ and $r(\sigma_1) \cdot x(a)$ multiplied by the expected utility) and increase the precision only for the term that contributes to the overall error the most. Once the term with maximal error has reached the maximal precision $P$, Theorem 1 guarantees that we are $\varepsilon$-close to the optimal solution. Our algorithm in the following section simultaneously increases the precision for approximating bilinear terms together with searching for optimal values of the binary variables.
4 BRANCH-AND-BOUND
ALGORITHM
We now introduce a branch-and-bound (BNB) search for approximating maxmin strategies that exploits the observation below and thus improves the performance compared to the previous MILP formulations. Additionally, we provide bounds on the overall runtime as a function of the desired precision.
The BNB algorithm works on the linear relaxation of the Upper Bound MILP and searches the BNB tree in a best-first-search manner. In every node $n$, the algorithm solves the relaxed LP corresponding to node $n$, heuristically selects the information set $I$ and action $a$ contributing to the current approximation error the most, and creates successors of $n$ by restricting the probability $\beta_1(I,a)$ that $a$ is played in $I$. The algorithm adds new constraints to the LP depending on the value of $\beta_1(I,a)$ by constraining (and/or introducing new) variables $w^{I_1,a}_{k,\ell}$ and creating successors of the BNB node in the search tree. Note that the $w^{I_1,a}_{k,\ell}$ variables correspond to the binary variables in the MILP formulation. This way, the algorithm simultaneously searches for the optimal approximation of bilinear terms as well as the assignment of binary variables. The algorithm terminates when an $\varepsilon$-optimal strategy is found (using the difference of the global upper bound and the lower bound computed as described in Observation 1 below).
Observation 1. Even if the current assignment to variables $w^{I_1,a}_{k,\ell}$ is not feasible (they are not set to binary values), the realization plan produced is valid in the perfect recall refinement. We can fix it in the sense of Proposition 1 and use it to estimate the lower bound for the BNB subtree rooted in the current node without a complete assignment of all $w^{I_1,a}_{k,\ell}$ variables to either 0 or 1.
Algorithm 1: BNB algorithm.
input: Initial LP relaxation $LP_0$ of the Upper Bound MILP using a $P = 0$ discretization
output: $\varepsilon$-optimal strategy for a player having imperfect recall
parameters: Bound on maximum error $\varepsilon$, precision bounds $P_{max}(I_1,a)$ for the $x(a)$ variables

1:  fringe $\leftarrow$ {CreateNode($LP_0$)}
2:  opt $\leftarrow$ (nil, $-\infty$, $-\infty$)
3:  while fringe $\neq \emptyset$ do
4:      $n = (LP, lb, ub) \leftarrow \arg\max_{n \in \mathrm{fringe}} n.ub$
5:      fringe $\leftarrow$ fringe $\setminus \{n\}$
6:      if opt.lb $\geq$ n.ub then
7:          return ReconstructStrategy(opt)
8:      if opt.lb $<$ n.lb then
9:          opt $\leftarrow$ n
10:     if n.ub $-$ n.lb $\leq \varepsilon$ then
11:         return ReconstructStrategy(opt)
12:     else
13:         $(I_1, a) \leftarrow$ SelectAction($n$)
14:         $P \leftarrow$ number of digits of precision representing $x(a)$ in $LP$
15:         fringe $\leftarrow$ fringe $\cup$ {CreateNode($LP \cup \{\sum_{k=0}^{\lfloor (a_{ub}+a_{lb})/2 \rfloor_P} w^{I_1,a}_{k,P} = 1\}$)}
16:         fringe $\leftarrow$ fringe $\cup$ {CreateNode($LP \cup \{\sum_{k=\lfloor (a_{ub}+a_{lb})/2 \rfloor_P}^{9} w^{I_1,a}_{k,P} = 1\}$)}
17:         if $P < P_{max}(I_1,a)$ then
18:             fringe $\leftarrow$ fringe $\cup$ {CreateNode($LP \cup \{w^{I_1,a}_{LP.x(a)_P,\,P} = 1$, introduce vars $w^{I_1,a}_{0,P+1}, \ldots, w^{I_1,a}_{9,P+1}$ and corresponding constraints from MDT$\}$)}
19: return ReconstructStrategy(opt)

20: function CreateNode($LP$)
21:     ub $\leftarrow$ Solve($LP$)
22:     $\beta_1 \leftarrow$ ReconstructStrategy($LP$)
23:     lb $\leftarrow u_1(\beta_1, \mathrm{BestResponse}(\beta_1))$
24:     return ($LP$, lb, ub)
Algorithm 1 depicts the complete BNB algorithm. It takes an LP relaxation of the Upper Bound MILP as its input. Initially, the maxmin strategy is approximated using 0 digits of precision after the decimal point (i.e. precision $P(I_1,a) = 0$ for every variable $x(a)$). The algorithm maintains a set of active BNB nodes (fringe) and a candidate with the best guaranteed value, opt. The algorithm selects the node with the highest upper bound from the fringe at each iteration (lines 4–5). If there is no potential for improvement in the unexplored parts of the branch-and-bound tree, the current best solution is returned (line 7) (up-
per bounds of the nodes added to the fringe in the future will never be higher than the current upper bound). Next, we check whether the current solution has a better lower bound than the current best; if so, we replace it (line 9). Since we always select the most promising node with respect to the upper bound, we are sure that if the lower bound and upper bound have distance at most $\varepsilon$, we have found an $\varepsilon$-optimal solution and we can terminate (line 11) (upper bounds of the nodes added to the fringe in the future will never be higher than the current upper bound). Otherwise, we heuristically select an action having the highest effect on the gap between the current upper and lower bound (line 13). We obtain the precision used to represent the behavioral probability of this action. By default we add two successors of the current BNB node, each with one of the following constraints: $x(a) \leq \lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_P$ (line 15) and $x(a) \geq \lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_P$ (line 16), where $\lfloor \cdot \rfloor_p$ is flooring of a number towards $p$ digits of precision and $a_{lb}$ and $a_{ub}$ are the lowest and highest allowed values of $x(a)$. This step performs binary halving, restricting the allowed values of $x(a)$ in the current precision. Additionally, if the current precision is lower than the maximal precision $P_{max}(I_1,a)$, the gap between the bounds may be caused by the lack of discretization points; hence, we add one more successor with $\lfloor v \rfloor \leq x(a) \leq \lceil v \rceil$, where $v$ is the current probability of playing $a$, while increasing the precision used for representing $x(a)$ (line 18) (all the restrictions of $x(a)$ in all 3 cases are done via the $w^{I_1,a}_{k,\ell}$ variables).
The function CreateNode computes the upper bound by solving the given LP (line 21) and the lower bound by using the heuristic construction of a valid strategy $\beta_1$, returning some convex combination of the strategies found for $\sigma_1^k \in \mathrm{seq}_1(I_1)$ (line 22), and computing the expected value of $\beta_1$ against a best response to it.
Note that this algorithm allows us to plug in a custom heuristic for the reconstruction of strategies (line 22) and for the action selection (line 13).
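The best-first structure of Algorithm 1 boils down to a priority queue over upper bounds. The sketch below is schematic only: the callbacks (LP solving, strategy reconstruction in the sense of Proposition 1, best-response evaluation, and the branching of lines 15–18) are passed in as hypothetical parameters.

```python
import heapq
from itertools import count

def bnb(lp0, epsilon, solve_lp, reconstruct, best_response_value, branch):
    """Best-first branch-and-bound skeleton of Algorithm 1 (all callbacks hypothetical)."""
    tie = count()                                  # tie-breaker, never compare LP objects

    def make_node(lp):
        ub = solve_lp(lp)                          # upper bound (line 21)
        beta1 = reconstruct(lp)                    # valid strategy, Proposition 1 (line 22)
        lb = best_response_value(beta1)            # lower bound (line 23)
        return (-ub, next(tie), lb, lp, beta1)     # max-heap via negated upper bound

    fringe = [make_node(lp0)]
    opt_lb, opt = float("-inf"), None
    while fringe:
        neg_ub, _, lb, lp, beta1 = heapq.heappop(fringe)   # highest upper bound (line 4)
        if opt_lb >= -neg_ub:                      # nothing left can improve (line 6)
            return opt
        if lb > opt_lb:                            # new best candidate (line 9)
            opt_lb, opt = lb, beta1
        if -neg_ub - lb <= epsilon:                # eps-optimal (line 10)
            return opt
        for child in branch(lp):                   # successors, lines 15-18
            heapq.heappush(fringe, make_node(child))
    return opt
```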
4.1 Theoretical Properties of the BNB
Algorithm
The BNB algorithm takes the error bound $\varepsilon$ as an input. We provide a method for setting the $P_{max}(I_1,a)$ parameters appropriately to guarantee $\varepsilon$-optimality. Finally, we provide a bound on the number of steps the algorithm needs to terminate.
Theorem 2. Let $P_{max}(I_1,a)$ be the maximum number of digits of precision used for representing variable $x(a)$, set as
$P_{max}(I_1,a) = \max_{h \in I_1} \left\lceil \log_{10} \frac{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon} \right\rceil,$
where $v_{diff}(h) = v_{max}(h) - v_{min}(h)$. With this setting, Algorithm 1 terminates and it is guaranteed to return an $\varepsilon$-optimal strategy for player 1.
The proof provides a bound on the error per node in every $I \in \mathcal{I}_i^{IR}$ and propagates this bound through the game tree in a bottom-up fashion.
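As an illustrative calculation (numbers chosen for the example only): for an information set with $|\mathcal{A}(I_1)| = 2$, depth parameter $d = 3$, $v_{diff}(h) = 20$ for every $h \in I_1$, and target error $\varepsilon = 0.1$, Theorem 2 prescribes
$P_{max}(I_1,a) = \left\lceil \log_{10} \frac{2 \cdot 3 \cdot 20}{2 \cdot 0.1} \right\rceil = \lceil \log_{10} 600 \rceil = 3.$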
Theorem 3. When using $P_{max}(I_1,a)$ from Theorem 2 for all $I_1 \in \mathcal{I}_1$ and all $a \in \mathcal{A}(I_1)$, the number of iterations of the BNB algorithm needed to find an $\varepsilon$-optimal solution is in $O\!\left(3^{4S_1(\log_{10}(S_1 \cdot v_{diff}(\emptyset))+1)} \cdot 2^{-5S_1} \cdot \varepsilon^{-5S_1}\right)$, where $S_1 = |\mathcal{I}_1| A_1^{max}$.
The proof derives a bound on the number of nodes in the BNB tree dependent on $\max_{I_1 \in \mathcal{I}_1, a \in \mathcal{A}_1(I_1)} P_{max}(I_1,a)$ and uses the formula from Theorem 2 to transform the bound into a function of $\varepsilon$.
5 EXPERIMENTS
We now demonstrate the practical aspects of our main
branch-and-bound algorithm (BNB) described in Sec-
tion 4 and the iterative MILP variant described in Sec-
tion 3. We compare the algorithms on a set of ran-
dom games where player 2 has A-loss recall. Both
algorithms were implemented in Java; each algorithm uses a single thread and an 8 GB memory limit. We use IBM
ILOG CPLEX 12.6 to solve all LPs/MILPs.
5.1 Random Games
Since there is no standardized collection of bench-
mark EFGs, we use randomly generated games in
order to obtain statistically significant results. We
randomly generate a perfect recall game with vary-
ing branching factor and fixed depth of 6. To control
the information set structure, we use observations assigned to every action: for player $i$, nodes $h$ with the same observations generated by all actions in the history belong to the same information set. In order to obtain imperfect recall games with a non-trivial information set structure, we run a random abstraction algorithm which randomly merges information sets with the same action count, as long as the merge does not cause absentmindedness. We generate a set of experimental instances by varying the branching factor. Such games are rather difficult to solve since (1) information sets can span multiple levels of the game tree (i.e., the nodes in an information set often have histories
Table 1: Average runtime and standard error in seconds needed to solve at least 100 different random games in every setting, while increasing the branching factor (b.f.) with a fixed depth of 6.

Algs \ b.f.    3                   4
MILP           157.87 ± 61.47      459.09 ± 75.95
BNB            9.52 ± 3.78         184.50 ± 48.22
with differing sizes) and (2) actions can easily lead to leaves with very different utility values. We always devise an abstraction which results in A-loss recall for the minimizing player.
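A hypothetical sketch of one step of such a merging procedure follows (the encoding and the merge rule are illustrative, not the generator used in the experiments): two information sets may be merged only if they offer the same number of actions and no history in one set is a prefix of a history in the other, which would create absentmindedness.

```python
import random
from itertools import combinations

def causes_absentmindedness(set_a, set_b):
    # Merging creates absentmindedness if some history in one set is a prefix
    # of a history in the other (the merged set would be visited twice on a path).
    return any(h1[:len(h2)] == h2 or h2[:len(h1)] == h1
               for h1 in set_a for h2 in set_b)

def random_merge_step(infosets, rng):
    """One step of a random abstraction: merge two information sets that have
    the same number of actions and whose union stays free of absentmindedness."""
    candidates = [(i, j) for i, j in combinations(range(len(infosets)), 2)
                  if len(infosets[i]["actions"]) == len(infosets[j]["actions"])
                  and not causes_absentmindedness(infosets[i]["nodes"],
                                                  infosets[j]["nodes"])]
    if not candidates:
        return infosets
    i, j = rng.choice(candidates)
    infosets[i]["nodes"] |= infosets[j]["nodes"]
    del infosets[j]
    return infosets

# Histories encoded as tuples of actions; the two singleton sets get merged here.
infosets = [{"nodes": {("a", "x")}, "actions": ["c", "d"]},
            {"nodes": {("b", "x")}, "actions": ["c", "d"]}]
print(random_merge_step(infosets, random.Random(0)))
```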
5.2 Results
Table 1 reports the average runtime and its standard
error in seconds for both algorithms over at least 100
different random games with increasing branching
factor and fixed depth of 6. We limited the runtime
for one instance to 2 hours (in the table we report results only for instances where both algorithms finished under 2 hours). The BNB algorithm was terminated 17 and 30 times after 2 hours for the reported settings, respectively, while the MILP algorithm was terminated 19 and 34 times.
an unfavorable scenario for both algorithms since the
construction of the abstraction is completely random,
which makes conflicting behavior in merged informa-
tion sets common. As we can see, however, even in
these scenarios we are typically able to solve games
with approximately $5 \cdot 10^3$ states in several minutes.
6 CONCLUSIONS
We provide the first algorithm for approximat-
ing maxmin strategies in imperfect recall zero-sum
extensive-form games without absentmindedness. We
give a novel mathematical formulation for computing
approximate strategies by means of a mixed-integer
linear program (MILP) that uses recent methods in the
approximation of bilinear terms. Next, we use a linear
relaxation of this MILP and introduce a branch-and-
bound search (BNB) that simultaneously looks for the
correct solution of binary variables and increases the
precision for the approximation of bilinear terms. We
provide guarantees that both MILP and BNB find an
approximate optimal solution. Finally, we show that
the algorithms are capable of solving games of sizes
far beyond toy problems (up to $5 \cdot 10^3$ states), typically within a few minutes in practice.
Results presented in this paper provide the first baseline algorithms for the class of imperfect recall games that are of great importance in solving large extensive-form games with perfect recall. As such,
our algorithms can be further extended to improve
the current scalability, e.g., by employing incremental
strategy generation methods.
ACKNOWLEDGEMENTS
This research was supported by the Czech Sci-
ence Foundation (grant no. 15-23235S) and by
the Grant Agency of the Czech Technical Univer-
sity in Prague, grant No. SGS16/235/OHK3/3T/13.
Computational resources were provided by the
CESNET LM2015042 and the CERIT Scientific
Cloud LM2015085, provided under the programme
”Projects of Large Research, Development, and Inno-
vations Infrastructures”.
REFERENCES
Bošanský, B., Kiekintveld, C., Lisý, V., and Pěchouček, M. (2014). An Exact Double-Oracle Algorithm for Zero-Sum Extensive-Form Games with Imperfect Information. Journal of Artificial Intelligence Research, 51:829–866.
Bowling, M., Burch, N., Johanson, M., and Tammelin, O.
(2015). Heads-up limit hold’em poker is solved. Sci-
ence, 347(6218):145–149.
Gilpin, A. and Sandholm, T. (2007). Lossless Abstraction
of Imperfect Information Games. Journal of the ACM,
54(5).
Hoda, S., Gilpin, A., Peña, J., and Sandholm, T. (2010). Smoothing Techniques for Computing Nash Equilibria of Sequential Games. Mathematics of Operations Research, 35(2):494–512.
Kaneko, M. and Kline, J. J. (1995). Behavior Strategies,
Mixed Strategies and Perfect Recall. International
Journal of Game Theory, 24:127–145.
Kline, J. J. (2002). Minimum Memory for Equivalence
between Ex Ante Optimality and Time-Consistency.
Games and Economic Behavior, 38:278–305.
Koller, D. and Megiddo, N. (1992). The Complexity
of Two-Person Zero-Sum Games in Extensive Form.
Games and Economic Behavior, 4:528–552.
Koller, D., Megiddo, N., and von Stengel, B. (1996). Effi-
cient Computation of Equilibria for Extensive Two-
Person Games. Games and Economic Behavior,
14(2):247–259.
Koller, D. and Milch, B. (2003). Multi-agent influence di-
agrams for representing and solving games. Games
and Economic Behavior, 45(1):181–221.
Kolodziej, S., Castro, P. M., and Grossmann, I. E. (2013).
Global optimization of bilinear programs with a mul-
tiparametric disaggregation technique. Journal of
Global Optimization, 57(4):1039–1063.
Kroer, C. and Sandholm, T. (2014). Extensive-Form Game
Abstraction with Bounds. In ACM conference on Eco-
nomics and computation.
Kroer, C. and Sandholm, T. (2016). Imperfect-Recall Ab-
stractions with Bounds in Games. In EC.
Kuhn, H. W. (1953). Extensive Games and the Problem of
Information. Contributions to the Theory of Games,
II:193–216.
Lanctot, M., Gibson, R., Burch, N., Zinkevich, M.,
and Bowling, M. (2012). No-Regret Learning in
Extensive-Form Games with Imperfect Recall. In
ICML.
Nash, J. F. (1950). Equilibrium Points in n-person Games.
Proc. Nat. Acad. Sci. USA, 36(1):48–49.
Piccione, M. and Rubinstein, A. (1997). On the Interpre-
tation of Decision Problems with Imperfect Recall.
Games and Economic Behavior, 20:3–24.
von Stengel, B. (1996). Efficient Computation of Behavior
Strategies. Games and Economic Behavior, 14:220–
246.
Wichardt, P. C. (2008). Existence of nash equilibria in fi-
nite extensive form games with imperfect recall: A
counterexample. Games and Economic Behavior,
63(1):366–369.
Zinkevich, M., Johanson, M., Bowling, M., and Piccione,
C. (2008). Regret Minimization in Games with In-
complete Information. In NIPS.
APPENDIX
Proposition 1. It is always possible to construct a strategy $\beta(I_1)$ such that $\|\beta(I_1) - \beta^j(I_1)\|_1 \leq |\mathcal{A}(I_1)| \cdot 10^{-P}$ for every $j$.
Proof. The probabilities of playing action $a$ in $\beta^1, \ldots, \beta^k$ can differ by at most $10^{-P}$, i.e. $|\beta^j(a) - \beta^l(a)| \leq 10^{-P}$ for every $j, l$ and action $a \in \mathcal{A}(I_1)$. This is based on the MDT we used to discretize the bilinear program. Let us denote
$\bar{r}(\sigma_1 a) = \sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot \hat{r}(\sigma_1)^{a}_{k,\ell}$ (6)
$\bar{x}(I_1, a) = \sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot w^{I_1,a}_{k,\ell}.$ (7)
Constraints (5f) and (5g) ensure that $\bar{r}(\sigma_1 a) = r(\sigma_1) \cdot \bar{x}(I_1, a)$. The only way the imperfect recall can be violated is thus in the usage of $\Delta r(\sigma_1 a)$. We know, however, that $\Delta r(\sigma_1 a) \leq 10^{-P} \cdot r(\sigma_1)$, which ensures that the amount of imbalance in $\beta^1, \ldots, \beta^k$ is at most $10^{-P}$. Taking any of the behavioral strategies $\beta^1, \ldots, \beta^k$ as the corrected behavioral strategy $\beta(I_1)$ therefore satisfies $\|\beta(I_1) - \beta^j(I_1)\|_1 \leq \sum_{a \in \mathcal{A}(I_1)} 10^{-P} = |\mathcal{A}(I_1)| \cdot 10^{-P}$.
We now provide the technical proof of Theorem 1. First, we connect the distance of a corrected strategy $\beta(I_1)$ from a set of behavioral strategies $\beta^1(I_1), \ldots, \beta^k(I_1)$ in $I_1 \in \mathcal{I}_1^{IR}$ to the expected value of the strategy. We start by bounding this error in a single node.
Lemma 3. Let $h \in I_1$ be a history and let $\beta^1, \beta^2$ be behavioral strategies (possibly prescribing different behavior in $I_1$) prescribing the same distribution over actions for all subsequent histories $h' \sqsupseteq h$. Let $v_{max}(h)$ and $v_{min}(h)$ be the maximal and minimal utilities of player 1 in the subtree of $h$, respectively. Then the following holds:
$|v_{\beta^1}(h) - v_{\beta^2}(h)| \leq \frac{v_{diff}(h)}{2} \cdot \|\beta^1(I_1) - \beta^2(I_1)\|_1,$
where $v_{\beta^j}(h)$ is the maxmin value $u(\beta^j, \beta_2^{BR})$ of strategy $\beta^j$ of player 1 given that the play starts in $h$, and $v_{diff}(h) = v_{max}(h) - v_{min}(h)$.
Proof. Let us study strategies $\beta^1$ and $\beta^2$ in node $h$. Let us take $\beta^1(I_1)$ as a baseline and transform it towards $\beta^2(I_1)$. We can identify two subsets of $\mathcal{A}(I_1)$: a set of actions $A^+$ where the probability of playing the action in $\beta^2$ was increased, and $A^-$ where the probability was decreased. Let us denote
$C^{\circ} = \sum_{a \in A^{\circ}} |\beta^1(I_1,a) - \beta^2(I_1,a)| \qquad \forall \circ \in \{+, -\}.$
We know that $C^+ = C^-$ (as strategies have to be probability distributions). Moreover, we know that $\|\beta^1(I_1) - \beta^2(I_1)\|_1 = C^+ + C^-$. In the worst case, decreasing the probability of playing action $a \in A^-$ risks losing a quantity proportional to the amount of this decrease multiplied by the highest utility in the subtree, $v_{max}(h)$. For all actions $a \in A^-$ this loss is equal to
$v_{max}(h) \cdot \sum_{a \in A^-} |\beta^1(I_1,a) - \beta^2(I_1,a)| = v_{max}(h) \cdot C^-.$
Similarly, the increase of the probabilities of actions in $A^+$ can add in the worst case $v_{min}(h) \cdot C^+$ to the value of the strategy. Combined together, this yields
$v_{\beta^2}(h) - v_{\beta^1}(h) \geq -v_{max}(h) \cdot C^- + v_{min}(h) \cdot C^+ = [-v_{max}(h) + v_{min}(h)] \cdot C^+ = \frac{-v_{max}(h) + v_{min}(h)}{2} \cdot 2C^+ = -\frac{v_{max}(h) - v_{min}(h)}{2} \cdot \|\beta^1(I_1) - \beta^2(I_1)\|_1.$
The strategies $\beta^1$, $\beta^2$ are interchangeable, which results in the final bound on the difference of $v_{\beta^2}(h)$ and $v_{\beta^1}(h)$.
Now we are ready to bound the error in the whole
game tree.
Theorem 1. The error of the Upper Bound MILP is bounded by
$\varepsilon = 10^{-P} \cdot d \cdot A_1^{max} \cdot \frac{v_{max}(\emptyset) - v_{min}(\emptyset)}{2},$
where $d$ is the maximum number of player 1's imperfect recall information sets encountered on a path from the root to a terminal node, $A_1^{max} = \max_{I_1 \in \mathcal{I}_1^{IR}} |\mathcal{A}(I_1)|$ is the branching factor, and $v_{min}(\emptyset)$, $v_{max}(\emptyset)$ are the lowest and highest utilities for player 1 in the whole game, respectively.
Proof. We show an inductive way to compute the bound on the error and we show that the bound from Theorem 1 is its upper bound. Throughout the derivation we assume that the opponent plays to maximize the error bound. We proceed in a bottom-up fashion over the nodes in the game tree, computing the maximum loss $L(h)$ player 1 could have accumulated by correcting his behavioral strategy in the subtree of $h$, i.e.
$L(h) \geq u_h(\beta') - u_h(\beta^{IR}),$
where $\beta'$ is the (incorrect) behavioral strategy of player 1 acting according to the realization probabilities $r(\sigma)$ from the solution of the Upper Bound MILP, $\beta^{IR}$ is its corrected version, and $u_h(\beta)$ is the expected utility of a play starting in history $h$ when player 1 plays according to $\beta$ and his opponent best responds (without knowing that the play starts in $h$). The proof follows in a case-by-case manner.

(1) No corrections are made in subtrees of leaves $h$, thus the loss $L(h) = 0$.

(2) The chance player selects one of the successor nodes based on the fixed probability distribution. The loss is then the expected loss over all child nodes, $L(h) = \sum_{a \in \mathcal{A}(h)} L(h \cdot a) \cdot \mathcal{C}(h \cdot a)/\mathcal{C}(h)$. In the worst case, the chance player selects the child with the highest associated loss, therefore $L(h) \leq \max_{a \in \mathcal{A}(h)} L(h \cdot a)$.

(3) Player 2 wants to maximize player 1's loss. Therefore she selects an action in her node $h$ that leads to a node with the highest loss, $L(h) \leq \max_{a \in \mathcal{A}(h)} L(h \cdot a)$. This is a pessimistic estimate of the loss, as she may not be able to pick the maximizing action in every state because of the imperfection of her information.

(4) If player 1's node $h$ is not a part of an imperfect recall information set, no corrective steps need to be taken. The expected loss at node $h$ is therefore $L(h) = \sum_{a \in \mathcal{A}(h)} \beta'(h,a) L(h \cdot a)$. Once again, in the worst case player 1's behavioral strategy $\beta'(h)$ selects deterministically the child node with the highest associated loss, therefore $L(h) \leq \max_{a \in \mathcal{A}(h)} L(h \cdot a)$.

(5) So far we have considered cases that only aggregate losses from child nodes. If player 1's node $h$ is part of an imperfect recall information set, a correction step may have to be taken. Let $\beta^{\bar{h}}$ be a behavioral strategy where corrective steps have been taken for successors of $h$ and let us construct a strategy $\beta^{h}$ where the strategy was corrected in the whole subtree of $h$ (i.e. including $h$). Note that ultimately we want to construct the strategy $\beta^{\emptyset} = \beta^{IR}$.

We know that the values of children have been decreased by at most $\max_{a \in \mathcal{A}(h)} L(h \cdot a)$, hence $v_{\beta'}(h) - v_{\beta^{\bar{h}}}(h) \leq \max_{a \in \mathcal{A}(h)} L(h \cdot a)$. Then we have to take the corrective step at the node $h$ and construct strategy $\beta^{h}$. From Lemma 3 and the observation about the maximum distance of behavioral strategies within a single imperfect recall information set $I_1$, we get:
$v_{\beta^{\bar{h}}}(h) - v_{\beta^{h}}(h) \leq \frac{v_{diff}(h)}{2} \cdot 10^{-P} |\mathcal{A}_1(I_1)| \leq \frac{v_{diff}(\emptyset)}{2} \cdot 10^{-P} A_1^{max}.$
The loss in the subtree of $h$, incurred by player 1 acting according to the realization probabilities, is equal to $v_{\beta'}(h) - v_{\beta^{h}}(h)$, which is bounded by
$L(h) = v_{\beta'}(h) - v_{\beta^{h}}(h) = \left[ v_{\beta^{\bar{h}}}(h) - v_{\beta^{h}}(h) \right] + \left[ v_{\beta'}(h) - v_{\beta^{\bar{h}}}(h) \right] \leq \frac{v_{diff}(\emptyset)}{2} \cdot 10^{-P} A_1^{max} + \max_{a \in \mathcal{A}(h)} L(h \cdot a).$

We will now provide an explicit bound on the loss in the root node $L(\emptyset)$. We have shown that in order to prove the worst-case bound it suffices to consider a deterministic choice of action at every node; this means that a single path in the game tree is pursued during the propagation of the loss. The loss is increased exclusively in imperfect recall nodes and we can encounter at most $d$ such nodes on any path from the root. The increase in such nodes is constant ($[v_{max}(\emptyset) - v_{min}(\emptyset)] \cdot 10^{-P} A_1^{max}/2$), therefore the bound is $\varepsilon = L(\emptyset) \leq [v_{max}(\emptyset) - v_{min}(\emptyset)] \cdot d \cdot 10^{-P} A_1^{max}/2$.

We now know that the expected value of the strategy we have found lies within the interval $[v^* - \varepsilon, v^*]$, where $v^*$ is the optimal value of the Upper Bound MILP. As $v^*$ is an upper bound on the solution of the original bilinear program, no strategy can be better than $v^*$, which means that the strategy we found is $\varepsilon$-optimal.
Theorem 2. Let $P_{max}(I_1,a)$ be the maximum number of digits of precision used for representing variable $x(a)$, set as
$P_{max}(I_1,a) = \max_{h \in I_1} \left\lceil \log_{10} \frac{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon} \right\rceil,$
where $v_{diff}(h) = v_{max}(h) - v_{min}(h)$. With this setting, Algorithm 1 terminates and it is guaranteed to return an $\varepsilon$-optimal strategy for player 1.
Proof. We start by proving that Algorithm 1 with this choice of $P_{max}(I_1,a)$ terminates. We will show that every branch of the branch-and-bound search tree is finite. This, together with the fact that every node is visited at most once and the branching factor of the search tree is finite (every node of the search tree has at most 3 child nodes), ensures that the algorithm terminates.

Every node of the search tree is tied to branching on some variable $x(a)$. Let $p$ be the current precision used to represent $x(a)$ and let us consider the first node on the branch where $x(a)$ is represented with such precision. At such a point, $p-1$ digits are fixed and thus $x \in [c, c + 10^{-(p-1)}]$ for some $c \in [0,1]$. On line 18 an interval of size $10^{-p}$ is handled; every left/right operation (lines 15 and 16) may thus handle an interval whose size is reduced at least by $10^{-p}$. We can conduct at most 9 left/right branching operations (lines 15 and 16) before the size of the interval drops below $10^{-p}$, which forces us to increase $p$. At most 10 operations can be performed on every $x(a)$ for every precision $p$, the limit on $p$ is finite for every such variable, and the number of variables is finite as well; the branch therefore has to terminate.

Let us now show that these limits on the number of refinements $P_{max}(I_1,a)$ are enough to guarantee $\varepsilon$-optimality. We refer the reader to the proof of Theorem 1 for details while we focus exclusively on the behavior in nodes from imperfect recall information sets.

Let $I_1 \in \mathcal{I}_1^{IR}$ and $h \in I_1$. We know that the L1 distance between behavioral strategies in $I_1$ is at most $10^{-P_{max}(I_1,a)} \cdot |\mathcal{A}(I_1)|$ (for any $a \in \mathcal{A}(I_1)$). This means that the bound on $L(h)$ in $h$ from the proof of Theorem 1 is modified to:
$L(h) = v_{\beta'}(h) - v_{\beta^{h}}(h) = \left[ v_{\beta^{\bar{h}}}(h) - v_{\beta^{h}}(h) \right] + \left[ v_{\beta'}(h) - v_{\beta^{\bar{h}}}(h) \right] \leq \frac{v_{diff}(h)}{2} \cdot 10^{-P_{max}(I_1,a)} \cdot |\mathcal{A}(I_1)| + \max_{a \in \mathcal{A}(h)} L(h \cdot a) \leq \frac{v_{diff}(h)}{2} \cdot \frac{|\mathcal{A}(I_1)| \cdot 2\varepsilon}{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(h)} + \max_{a \in \mathcal{A}(h)} L(h \cdot a) = \frac{\varepsilon}{d} + \max_{a \in \mathcal{A}(h)} L(h \cdot a).$
Similarly to the reasoning in the proof of Theorem 1, it suffices to assume players choosing an action at every node in a deterministic way. The path induced by these choices contains at most $d$ imperfect recall nodes, thus $L(\emptyset) \leq d \cdot \varepsilon/d = \varepsilon$.
Theorem 3. When using $P_{max}(I_1,a)$ from Theorem 2 for all $I_1 \in \mathcal{I}_1$ and all $a \in \mathcal{A}(I_1)$, the number of iterations of the BNB algorithm needed to find an $\varepsilon$-optimal solution is in $O\!\left(3^{4S_1(\log_{10}(S_1 \cdot v_{diff}(\emptyset))+1)} \cdot 2^{-5S_1} \cdot \varepsilon^{-5S_1}\right)$, where $S_1 = |\mathcal{I}_1| A_1^{max}$.
Proof. We start by proving that there are $N \in O(3^{4|\mathcal{I}_1| A_1^{max} P_{max}})$ nodes in the BnB tree, where $A_1^{max} = \max_{I \in \mathcal{I}_1} |\mathcal{A}(I)|$ and $P_{max} = \max_{I \in \mathcal{I}_1, a \in \mathcal{A}(I)} P_{max}(I,a)$. This holds since in the worst case we branch for every action in every information set (hence $|\mathcal{I}_1| A_1^{max}$). We can bound the number of branchings for a fixed action by $4 P_{max}$, since there are 10 digits (we branch at most 4 times using binary halving) and we might require $P_{max}$ digits of precision. $4 |\mathcal{I}_1| A_1^{max} P_{max}$ is therefore the maximum depth of the branch-and-bound tree. Finally, the branching factor of the branch-and-bound tree is at most 3.

By substituting
$\max_{I_1 \in \mathcal{I}_1} \left\lceil \max_{h \in I_1} \log_{10} \frac{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon} \right\rceil$
for $P_{max}$ in the above bound (Theorem 2), we obtain
$N \in O\!\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \max_{h \in I_1} \log_{10} \frac{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon} \right\rceil}\right)$, where $S_1 = |\mathcal{I}_1| A_1^{max}$,
$\subseteq O\!\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \log_{10} \frac{|\mathcal{A}(I_1)| \cdot d \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right)$
$\subseteq O\!\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right)$
$\subseteq O\!\left(3^{4S_1 \left\lceil \log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} + 1\right)}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10}(S_1 \cdot v_{diff}(\emptyset)) - \log_{10}(2\varepsilon) + 1\right)}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1\right)} \cdot 3^{-10 S_1 \log_{10}(2\varepsilon)}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1\right)} \cdot 3^{-10 S_1 \frac{\log_{3}(2\varepsilon)}{\log_{3}(10)}}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1\right)} \cdot (2\varepsilon)^{-\frac{10 S_1}{\log_{3}(10)}}\right)$
$\subseteq O\!\left(3^{4S_1 \left(\log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1\right)} \cdot (2\varepsilon)^{-5 S_1}\right).$