Computing Maxmin Strategies in Extensive-form Zero-sum Games with Imperfect Recall

Branislav Bošanský, Jiří Čermák, Karel Horák and Michal Pěchouček

Department of Computer Science, Czech Technical University in Prague, Prague, Czech Republic

Keywords: Game Theory, Imperfect Recall, Maxmin Strategies.

Abstract: Extensive-form games with imperfect recall are an important game-theoretic model that allows a compact representation of strategies in dynamic strategic interactions. Practical use of imperfect recall games is limited due to negative theoretical results: a Nash equilibrium does not have to exist, computing maxmin strategies is NP-hard, and such strategies may require irrational numbers. We present the first algorithm for approximating maxmin strategies in two-player zero-sum imperfect recall games without absentmindedness. We modify the well-known sequence-form linear program to model strategies in imperfect recall games, resulting in a bilinear program, and use a recent technique to approximate the bilinear terms. Our main algorithm is a branch-and-bound search that provably reaches the desired approximation after a number of steps exponential in the size of the game. Experimental evaluation shows that the proposed algorithm can approximate maxmin strategies of randomly generated imperfect recall games of sizes beyond toy problems within a few minutes.

1 INTRODUCTION

The extensive form is a well-known representation

of dynamic strategic interactions that evolve in time.

Games in the extensive form (extensive-form games;

EFGs) are visualized as game trees, where nodes

correspond to states of the game and edges to ac-

tions executed by players. This representation is gen-

eral enough to model stochastic events and imperfect

information when players are unable to distinguish

among several states. Recent years have seen ad-

vancements in algorithms for computing solution con-

cepts in large zero-sum extensive-form games (e.g.,

solving heads-up limit Texas hold’em poker (Bowling

et al., 2015)).

Most of the algorithms for ﬁnding optimal strate-

gies in EFGs assume that players remember all infor-

mation gained during the course of the game (Zinke-

vich et al., 2008; Hoda et al., 2010; Bošanský et al., 2014). This assumption is known as perfect recall

and has a signiﬁcant impact on theoretical proper-

ties of ﬁnding optimal strategies in EFGs. Namely,

there is an equivalence between two types of strate-

gies in perfect recall games – mixed strategies (probability distributions over pure strategies; a pure strategy in an EFG is an assignment of an action to play in each decision point) and behavioral strategies (probability distributions over actions

in each decision point) (Kuhn, 1953). This equiva-

lence guarantees that a Nash equilibrium (NE) exists

in behavioral strategies in perfect recall games (the

proof of the existence of NE deals with mixed strate-

gies only (Nash, 1950)) and it is exploited by algo-

rithms for computing a NE in zero-sum EFGs with

perfect recall – the well-known sequence-form linear

program (Koller et al., 1996; von Stengel, 1996).

The caveat of perfect recall is that remember-

ing all information increases the number of decision

points (and consequently the size of a behavioral strat-

egy) exponentially with the number of moves in the

game. One possibility for tackling the size of per-

fect recall EFGs is to create an abstracted game where

certain decision points are merged together, solve this

abstracted game, and then translate the strategy from

the abstracted game into the original game (e.g., see

(Gilpin and Sandholm, 2007; Kroer and Sandholm,

2014; Kroer and Sandholm, 2016)). However, devis-

ing abstracted games that have imperfect recall is de-

sirable due to the reduced size. One then must com-

pute behavioral strategies in order to exploit the re-

duced size (mixed strategies already operate over an

exponentially large set of pure strategies).

Solving imperfect recall games has several fundamental problems. The best known game-theoretic solution concept, a Nash equilibrium (NE), does not have to exist even in zero-sum games (see (Wichardt, 2008) for a simple example), and standard algorithms (e.g., Counterfactual Regret Minimization (CFR) (Zinkevich et al., 2008)) can converge to incorrect strategies (see Example 1). Therefore, we focus on finding a strategy that guarantees the best possible expected outcome for a player – a maxmin strategy. However, computing a maxmin strategy is NP-hard and such strategies may require irrational numbers even when the input uses only rational numbers (Koller and Megiddo, 1992).

Bosansky B., Cermak J., Horak K. and Pechoucek M. Computing Maxmin Strategies in Extensive-form Zero-sum Games with Imperfect Recall. DOI: 10.5220/0006121200630074. In Proceedings of the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017), pages 63-74. ISBN: 978-989-758-220-2. Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.

Existing works avoid these negative results by cre-

ating very speciﬁc abstracted games so that perfect

recall algorithms are still applicable. One example

is a subset of imperfect recall games called (skewed)

well-formed games, motivated by the poker domain,

in which the standard perfect-recall algorithms (e.g.,

CFR) are still guaranteed to ﬁnd an approximate

Nash behavioral strategy (Lanctot et al., 2012; Kroer

and Sandholm, 2016). The restrictions on games

to form (skewed) well-formed games are, however,

rather strict and can prevent us from creating sufﬁ-

ciently small abstracted games. To fully explore the

possibilities of exploiting the concept of abstractions

and/or other compactly represented dynamic games

(e.g., Multi-Agent Inﬂuence Diagrams (Koller and

Milch, 2003)), a new algorithm for solving imperfect

recall games is required.

1.1 Our Contribution

We advance the state of the art and provide the

ﬁrst approximate algorithm for computing maxmin

strategies in imperfect recall games (since maxmin

strategies might require irrational numbers (Koller

and Megiddo, 1992), ﬁnding exact maxmin has fun-

damental difﬁculties). We assume imperfect recall

games with no absentmindedness, which means that each decision point in the game can be visited at most once during the course of the game; this is arguably a natural assumption in finite games (see, e.g., (Piccione and Rubinstein, 1997) for a detailed discussion).

ﬁnd behavioral strategies that maximize the expected

outcome of player 1 against an opponent that min-

imizes the outcome. We base our formulation on

the sequence-form linear program for perfect recall

games (Koller et al., 1996; von Stengel, 1996) and

we extend it with bilinear constraints necessary for

the correct representation of strategies of player 1 in

imperfect recall games. We approximate the bilinear

terms using recent Multiparametric Disaggregation

Technique (MDT) (Kolodziej et al., 2013) and pro-

vide a mixed-integer linear program (MILP) for ap-

proximating maxmin strategies. Finally, we consider

a linear relaxation of the MILP and propose a branch-

and-bound algorithm that (1) repeatedly solves this

linear relaxation and (2) tightens the constraints that

approximate bilinear terms as well as relaxed binary

variables from the MILP. We show that the branch-

and-bound algorithm ends after exponentially many

steps while guaranteeing the desired precision.

Our algorithm approximates maxmin strategies

for player 1 having generic imperfect recall without

absentmindedness and we give two variants of the al-

gorithm depending on the type of imperfect recall of

the opponent. If the opponent, player 2, has either perfect recall or so-called A-loss recall (Kaneko and

Kline, 1995; Kline, 2002), the linear program solved

by the branch-and-bound algorithm has a polynomial

size in the size of the game. If player 2 has a generic

imperfect recall without absentmindedness, the linear

program solved by the branch-and-bound algorithm

can be exponentially large.

We provide a short experimental evaluation to

demonstrate that our algorithm can solve games far

beyond the size of toy problems. Randomly gener-

ated imperfect recall games with up to 5·10³ states can typically be solved within a few minutes.

All the technical proofs can be found in the ap-

pendix or in the full version of this paper.

2 TECHNICAL PRELIMINARIES

Before describing our algorithm we deﬁne extensive-

form games, different types of recall, and describe the

approximation technique for the bilinear terms.

A two-player extensive-form game (EFG) is a tu-

ple G = (N ,H ,Z,A, u, C , I). N = {1, 2} is a set of

players, by i we refer to one of the players, and by

−i to his opponent. H denotes a ﬁnite set of histo-

ries of actions taken by all players and chance from

the root of the game. Each history corresponds to a

node in the game tree; hence, we use terms history

and node interchangeably. We say that h is a prefix of h′ (h ⊑ h′) if h lies on a path from the root of the game tree to h′. Z ⊆ H is the set of terminal states of

the game. A denotes the set of all actions. An ordered

list of all actions of player i from root to h is referred to as a sequence, σ_i = seq_i(h); Σ_i is the set of all sequences of i. For each z ∈ Z we define a utility function u_i : Z → R for each player i (u_i(z) = −u_{−i}(z) in zero-sum games). The chance player selects actions

zero-sum games). The chance player selects actions

based on a ﬁxed probability distribution known to all

players. Function C : H → [0,1] is the probability of

reaching h due to chance.

Imperfect observation of player i is modeled via


information sets I_i that form a partition over the nodes h ∈ H where i takes an action. Player i cannot distinguish between nodes in any I_i ∈ I_i. A(I_i) denotes the actions available in each h ∈ I_i. An action a uniquely identifies the information set where it is available. We use seq_i(I_i) for the set of all sequences of player i leading to I_i. Finally, we use inf_i(σ_i) for the set of all information sets to which sequence σ_i leads.

A behavioral strategy β_i ∈ B_i is a probability distribution over actions in each information set I ∈ I_i. We use u_i(β) = u_i(β_i, β_{−i}) for the expected outcome of the game for i when players follow β. A best response of player i against β_{−i} is a strategy β_i^{BR} ∈ BR_i(β_{−i}), where u_i(β_i^{BR}, β_{−i}) ≥ u_i(β_i′, β_{−i}) for all β_i′ ∈ B_i. β_i(I, a) is the probability of playing a in I; β(h) denotes the probability that h is reached when both players play according to β and due to chance. We say that β_i and β_i′ are realization equivalent if for any β_{−i} and all z ∈ Z, β(z) = β′(z), where β = (β_i, β_{−i}) and β′ = (β_i′, β_{−i}).

A maxmin strategy β_i^* is defined as β_i^* = argmax_{β_i ∈ B_i} min_{β_{−i} ∈ B_{−i}} u_i(β_i, β_{−i}). Note that when a Nash equilibrium in behavioral strategies exists in a two-player zero-sum imperfect recall game, then β_i^* is a Nash equilibrium strategy for i.

2.1 Types of Recall

We now brieﬂy deﬁne types of recall in EFGs and

state several lemmas and observations about charac-

teristics of strategies in imperfect recall EFGs that are

later exploited by our algorithm.

In perfect recall, all players remember the history

of their own actions and all information gained during

the course of the game. As a consequence, all nodes

in any information set I

i

have the same sequence for

player i. If the assumption of perfect recall does not

hold, we talk about games with imperfect recall. In

imperfect recall games, mixed and behavioral strate-

gies are not comparable (Kuhn, 1953). However, in

games without absentmindedness (AM) where each

information set is encountered at most once during

the course of the game, the following observation al-

low us to consider only pure best responses of the op-

ponent when computing maxmin strategies:

Lemma 1. Let G be an imperfect recall game without AM and β_1 a strategy of player 1. There exists an ex ante (i.e., when evaluating only the expected value of the strategy) pure behavioral best response of player 2.

The proof is in the full version of the paper.

This lemma is applied when a mathematical pro-

gram for computing maxmin strategies is formulated

– strategies of player 2 can be considered as con-

straints using pure best responses. Note that this is not

true in general imperfect recall games – in games with

AM, an ex ante best response may need to be random-

ized (e.g., in the game with absentminded driver (Pic-

cione and Rubinstein, 1997)).

A disadvantage of using pure best responses as

constraints for the minimizing player is that there

are exponentially many pure best responses in the

size of the game. In perfect recall games, this can

be avoided by formulating best-response constraints

such that the opponent is playing the best action in

each information set. However, this type of response,

termed time consistent strategy (Kline, 2002), does

not have to be an ex ante best response in general im-

perfect recall games (see (Kline, 2002) for an exam-

ple). A class of imperfect recall games where it is

sufﬁcient to consider only time consistent strategies

when computing best responses was termed as A-loss

recall games (Kaneko and Kline, 1995; Kline, 2002).

Definition 1. Player i has A-loss recall if and only if for every I ∈ I_i and nodes h, h′ ∈ I it holds that either (1) seq_i(h) = seq_i(h′), or (2) ∃I′ ∈ I_i and two distinct actions a, a′ ∈ A_i(I′), a ≠ a′, such that a ∈ seq_i(h) ∧ a′ ∈ seq_i(h′).

Condition (1) in the deﬁnition says that if player i

has perfect recall then she also has A-loss recall. Con-

dition (2) requires that each loss of memory of A-loss

recall player can be traced back to some loss of mem-

ory of the player’s own previous actions.

The equivalence between time consistent strate-

gies and ex ante best responses allows us to simplify

the best responses of player 2 in case she has A-

loss recall. Formally, when computing a best response for a player with A-loss recall, it is sufficient to consider best responses that correspond to the best response in the coarsest perfect-recall refinement of the imperfect recall game. The coarsest perfect recall refinement of an imperfect recall game G is a perfect recall game G′ in which the imperfect recall information sets are split into the largest subsets that still fulfill perfect recall.

Definition 2. The coarsest perfect recall refinement G′ of the imperfect recall game G = (N, H, Z, A, u, C, I) is a tuple (N, H, Z, A′, u, C, I′), where for each i ∈ N and each I_i ∈ I_i, H(I_i) = {H_1, ..., H_n} is a disjoint partition of all h ∈ I_i such that ⋃_{j=1}^{n} H_j = I_i, ∀H_j ∈ H(I_i) ∀h_k, h_l ∈ H_j: seq_i(h_k) = seq_i(h_l), and ∀h_k ∈ H_k, h_l ∈ H_l: H_k ∩ H_l = ∅ ⇒ seq_i(h_k) ≠ seq_i(h_l). Each set from H(I_i) corresponds to an information set I_i′ ∈ I_i′. Moreover, A′ is a modification of A guaranteeing ∀I ∈ I′ ∀h_k, h_l ∈ I: A′(h_k) = A′(h_l), while for all distinct I_k, I_l ∈ I′: A(I_k) ≠ A(I_l).

Note that we can restrict the coarsest perfect re-

call reﬁnement only for i by splitting only information

sets of i (information sets of −i remain unchanged).

Finally, we assume that there is a mapping between actions from the coarsest perfect recall refinement A′ and actions in the original game A, so that we can identify to which actions from A′ an original action a ∈ A maps. We leave this mapping implicit since it is clear from the context.

Lemma 2. Let G be an imperfect recall game where player 2 has A-loss recall, let β_1 be a strategy of player 1, and let G′ be the coarsest perfect recall refinement of G for player 2. Let β_2′ be a pure best response in G′ and let β_2 be a realization equivalent behavioral strategy in G; then β_2 is a pure best response to β_1 in G.

The proof is in the full version of the paper.

Note that the NP-hardness proof of computing

maxmin strategies due to Koller (Koller and Megiddo,

1992) still applies, since we assume the maximizing

player to have generic imperfect recall and the reduc-

tion provided by Koller results in a game where the

maximizing player has generic imperfect recall while

the minimizing player has perfect recall, which is a

special case of both settings assumed in our paper.

Finally, let us show that CFR cannot be applied

in these settings. This is caused by the fact that CFR

iteratively minimizes per information set regret terms

(counterfactual regrets). Since in perfect recall games

the sum of counterfactual regrets provides an upper

bound on the external regret, this minimization is

guaranteed to converge to a strategy proﬁle with 0 ex-

ternal regret. In imperfect recall games, however, the

sum of counterfactual regrets no longer forms an up-

per bound on the external regret (Lanctot et al., 2012),

and the minimization of these regret terms can, there-

fore, lead to a strategy proﬁle with a non-zero external

regret.

Example 1: Consider the A-loss recall game in Figure 1. When x > 2, one of the strategy profiles with zero counterfactual regret (and therefore a profile to which CFR can converge) mixes uniformly between a, b and between g, h, while player 2 plays d and e deterministically. By setting the utility x to some large number, this strategy profile can have expected utility arbitrarily worse than the maxmin value −1. The reason is the presence of conflicting outcomes for some action in an imperfect recall information set, which cannot in general be avoided or easily detected in imperfect recall games.

Figure 1: An A-loss recall game where CFR ﬁnds a strategy

with the expected utility arbitrarily distant from the maxmin

value.

2.2 Approximating Bilinear Terms

The ﬁnal technical tool that we use in our algorithm

is the approximation of bilinear terms by Multipara-

metric Disaggregation Technique (MDT) (Kolodziej

et al., 2013) for approximating bilinear constraints.

The main idea of the approximation is to use a digit-

wise discretization of one of the variables from a bi-

linear term. The main advantage of this approxima-

tion is a low number of newly introduced integer vari-

ables and an experimentally conﬁrmed speed-up over

the standard technique of piecewise McCormick en-

velopes (Kolodziej et al., 2013).

∑_{k=0}^{9} w_{k,ℓ} = 1    ∀ℓ ∈ ℤ    (1a)
w_{k,ℓ} ∈ {0, 1}    (1b)
∑_{ℓ∈ℤ} ∑_{k=0}^{9} 10^ℓ · k · w_{k,ℓ} = b    (1c)
c^L · w_{k,ℓ} ≤ ĉ_{k,ℓ} ≤ c^U · w_{k,ℓ}    ∀ℓ ∈ ℤ, ∀k ∈ 0..9    (1d)
∑_{k=0}^{9} ĉ_{k,ℓ} = c    ∀ℓ ∈ ℤ    (1e)
∑_{ℓ∈ℤ} ∑_{k=0}^{9} 10^ℓ · k · ĉ_{k,ℓ} = a    (1f)

Let a = bc be a bilinear term. MDT discretizes variable b and introduces new binary variables w_{k,ℓ} that indicate whether the digit at the ℓ-th position is k. Constraint (1a) ensures that for each position ℓ exactly one digit is chosen. All digits must sum to b (Constraint (1c)). Next, we introduce variables ĉ_{k,ℓ} that are equal to c for such k and ℓ where w_{k,ℓ} = 1, and ĉ_{k,ℓ} = 0 otherwise. c^L and c^U are bounds on the value of variable c. The value of a is given by Constraint (1f).
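The digit-wise scheme above can be checked numerically. The sketch below (our own illustration, not the paper's implementation) picks one digit per position for b and shows that constraints (1c) and (1f) recover b and the product a = b·c linearly:

```python
# Illustrative sketch (not the paper's implementation): digit-wise
# discretization of b in a bilinear term a = b * c, as used by MDT.
def discretize(b, positions):
    """Choose one digit k per position l: w[(k, l)] in {0, 1}."""
    scale = 10 ** (-min(positions))
    n = round(b * scale)            # b in integer units of 10^min(positions)
    w = {}
    for l in positions:
        digit = (n // 10 ** (l - min(positions))) % 10
        for k in range(10):
            w[(k, l)] = 1 if k == digit else 0
    return w

positions = [-3, -2, -1]            # finite {P_L, ..., P_U}: lower-bound variant
b, c = 0.625, 4.0
w = discretize(b, positions)

# Constraint (1c): the chosen digits sum back to b.
b_rec = sum(10 ** l * k * w[(k, l)] for (k, l) in w)
# With c_hat[k,l] = c wherever w[k,l] = 1 (constraints (1d)-(1e)),
# constraint (1f) recovers the product a = b * c linearly.
a_rec = sum(10 ** l * k * (c if w[(k, l)] else 0.0) for (k, l) in w)

assert abs(b_rec - 0.625) < 1e-9
assert abs(a_rec - 2.5) < 1e-9
```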

This is an exact formulation that requires inﬁnite

sums and an inﬁnite number of constraints. However,

by restricting the set of all possible positions ℓ to a finite set {P_L, ..., P_U} we get a lower bound approximation. Following the approach in (Kolodziej et al., 2013), we can extend the lower bound formulation to compute an upper bound:


Constraints (1a), (1d), (1e)
∑_{ℓ∈{P_L,...,P_U}} ∑_{k=0}^{9} 10^ℓ · k · w_{k,ℓ} + Δb = b    (2a)
0 ≤ Δb ≤ 10^{P_L}    (2b)
∑_{ℓ∈{P_L,...,P_U}} ∑_{k=0}^{9} 10^ℓ · k · ĉ_{k,ℓ} + Δa = a    (2c)
c^L · Δb ≤ Δa ≤ c^U · Δb    (2d)
c − c^U · 10^{P_L} + c^U · Δb ≤ Δa    (2e)
c − c^L · 10^{P_L} + c^L · Δb ≥ Δa    (2f)

Here, Δb is assigned to every discretized variable b, allowing it to take a value between two discretization points created due to the minimal position ℓ (Constraints (2a)–(2b)). Similarly, we allow the product variable a to be increased by a variable Δa = Δb · c. To approximate the product of the delta variables, we use the McCormick envelope defined by Constraints (2d)–(2f).
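A small numeric sketch (our own, with hypothetical values) of how Δb fills the gap below the last kept digit, and how the bracketing in (2d) contains the true product Δa = Δb·c:

```python
# Sketch (hypothetical values): with a finite digit range down to position P_L,
# Delta-b spans the gap below the last kept digit (constraints (2a)-(2b)),
# and (2d) brackets Delta-a = Delta-b * c.
P_L = -2
b, c = 0.637, 0.5
b_trunc = int(b * 10 ** (-P_L)) / 10 ** (-P_L)   # digits down to position P_L
delta_b = b - b_trunc                            # remainder below 10^{P_L}
assert 0 <= delta_b <= 10 ** P_L                 # constraint (2b)

c_L, c_U = 0.0, 1.0                              # bounds on c
delta_a = delta_b * c                            # the exact (bilinear) product
assert c_L * delta_b <= delta_a <= c_U * delta_b  # constraint (2d)
```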

3 MATHEMATICAL PROGRAMS

FOR APPROXIMATING

MAXMIN STRATEGIES

We now state the mathematical programs for approx-

imating maxmin strategies. The main idea is to add

bilinear constraints into the sequence form LP to re-

strict to imperfect recall strategies. We formulate an

exact bilinear program, followed by the approxima-

tion of bilinear terms using MDT.

3.1 Exact Bilinear Sequence Form

Against A-loss Recall Opponent

max_{x,r,v} v(root, ∅)    (3a)
s.t.  r(∅) = 1    (3b)
0 ≤ r(σ) ≤ 1    ∀σ ∈ Σ_1    (3c)
∑_{a∈A(I)} r(σa) = r(σ)    ∀σ ∈ Σ_1, ∀I ∈ inf_1(σ)    (3d)
∑_{a∈A(I)} x(a) = 1    ∀I ∈ I_1^{IR}    (3e)
0 ≤ x(a) ≤ 1    ∀I ∈ I_1^{IR}, ∀a ∈ A(I)    (3f)
r(σ) · x(a) = r(σa)    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀σ ∈ seq_1(I)    (3g)
∑_{σ_1∈Σ_1} g(σ_1, σ_2 a) r(σ_1) + ∑_{I′∈inf_2(σ_2 a)} v(I′, σ_2 a) ≥ v(I, σ_2)    ∀I ∈ I_2, ∀σ_2 ∈ seq_2(I), ∀a ∈ A(I)    (3h)

Constraints (3a)–(3h) represent a bilinear reformulation of the sequence-form LP due to (von Stengel, 1996) applied to the information set structure of an imperfect recall game G. The objective of player 1 is to find a strategy that maximizes the expected utility of the game. The strategy is represented by variables r that assign a probability to each sequence: r(σ_1) is the probability that σ_1 ∈ Σ_1 will be played, assuming that the information sets in which the actions of sequence σ_1 are applicable are reached due to player 2. Probabilities r must satisfy the so-called network flow Constraints (3c)–(3d). Finally, a strategy of player 1 is constrained by the best-responding opponent that selects an action minimizing the expected value in each I ∈ I_2 and for each σ_2 ∈ seq_2(I) that was used to reach I (Constraint (3h)). These constraints ensure that the opponent plays the best response in the coarsest perfect recall refinement of G and thus also in G, due to Lemma 2. The expected utility of each action is the sum of the expected values of immediately reachable information sets I′ and of immediately reachable leaves. For the latter we use the generalized utility function g : Σ_1 × Σ_2 → R defined as g(σ_1, σ_2) = ∑_{z∈Z : seq_1(z)=σ_1 ∧ seq_2(z)=σ_2} u(z) C(z).
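The network-flow conditions (3b)–(3d) can be sanity-checked on a tiny hypothetical realization plan (sequences written as tuples; the single infoset follows the empty sequence):

```python
# Sketch: the network-flow conditions (3b)-(3d) checked on a hypothetical
# realization plan of player 1 with one infoset after the empty sequence.
r = {(): 1.0, ("a",): 0.3, ("b",): 0.7}   # r(sigma) for player 1's sequences
assert r[()] == 1.0                                  # (3b): empty sequence
assert all(0.0 <= v <= 1.0 for v in r.values())      # (3c): probabilities
assert abs(r[("a",)] + r[("b",)] - r[()]) < 1e-12    # (3d): flow conservation
```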

In imperfect recall games multiple σ_i can lead to the same imperfect recall information set I_i ∈ I_i^{IR} ⊆ I_i; hence, realization plans over sequences do not have to induce the same behavioral strategy for I_i. Therefore, for each I_i ∈ I_i^{IR} we define a behavioral strategy x(a) for each a ∈ A(I_i) (Constraints (3e)–(3f)). To ensure that the realization probabilities induce the same behavioral strategy in I_i, we add the bilinear constraint r(σ_i a) = x(a) · r(σ_i) (Constraint (3g)).
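The role of constraint (3g) can be seen with two hypothetical sequences reaching the same imperfect recall infoset (all numbers invented for illustration):

```python
# Sketch: the bilinear consistency constraint (3g) with hypothetical numbers.
# Two sequences reach the same imperfect recall infoset I; a single behavioral
# probability x_a must be consistent with both realization-plan branches.
r = {"s1": 0.4, "s2": 0.6}     # r(sigma) for the two sequences into I
x_a = 0.25                     # behavioral probability of action a in I
r["s1a"] = r["s1"] * x_a       # (3g) applied to sigma_1
r["s2a"] = r["s2"] * x_a       # (3g) applied to sigma_2
# both branches now induce the same behavioral strategy in I
assert abs(r["s1a"] / r["s1"] - r["s2a"] / r["s2"]) < 1e-12
```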

3.1.1 Player 2 without A-Loss Recall

If player 2 does not have A-loss recall, the mathematical program must use each pure best response π_2 ∈ Π_2 of player 2 as a constraint as follows:

max_{x,r,v} v(root)    (4a)
s.t.  Constraints (3b)–(3f)
∑_{z∈Z : π_2(z)=1} u(z) C(z) r(seq_1(z)) ≥ v(root)    ∀π_2 ∈ Π_2    (4b)

Since the modiﬁcation does not change the parts

of the program related to the approximation of strate-

gies of player 1, all the following approximation

methods, theorems, and the branch-and-bound al-

gorithm are applicable for general imperfect recall

games without absentminded players.


3.2 Upper Bound MILP Approximation

The upper bound formulation of the bilinear program follows the MDT example and uses ideas similar to Section 2.2. In accord with the MDT, we represent every variable x(a) using a finite number of digits. Binary variables w^{I_1,a}_{k,ℓ} correspond to the w_{k,ℓ} variables from the example shown in Section 2.2 and are used for the digit-wise discretization of x(a). Similarly, r̂(σ_1)^a_{k,ℓ} corresponds to the ĉ_{k,ℓ} variables used to discretize the bilinear term r(σ_1 a). In order to allow variable x(a) to attain an arbitrary value from the [0, 1] interval using a finite number of digits of precision, we add an additional real variable 0 ≤ Δx(a) ≤ 10^{−P} that can span the gap between two adjacent discretization points. Constraints (5d) and (5e) describe this loosening. Variables Δx(a) also have to be propagated to the bilinear terms r(σ_1) · x(a) involving x(a). We cannot represent the product Δr(σ_1 a) = r(σ_1) · Δx(a) exactly, and we therefore bound it using the McCormick envelope (Constraints (5i)–(5j)).

max_{x,r,v} v(root, ∅)    (5a)
s.t.  Constraints (3b)–(3f), (3h)
w^{I,a}_{k,ℓ} ∈ {0, 1}    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀k ∈ 0..9, ∀ℓ ∈ −P..0    (5b)
∑_{k=0}^{9} w^{I,a}_{k,ℓ} = 1    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀ℓ ∈ −P..0    (5c)
∑_{ℓ=−P}^{0} ∑_{k=0}^{9} 10^ℓ · k · w^{I,a}_{k,ℓ} + Δx(a) = x(a)    ∀I ∈ I_1^{IR}, ∀a ∈ A(I)    (5d)
0 ≤ Δx(a) ≤ 10^{−P}    ∀I ∈ I_1^{IR}, ∀a ∈ A(I)    (5e)
0 ≤ r̂(σ)^a_{k,ℓ} ≤ w^{I,a}_{k,ℓ}    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀σ ∈ seq_1(I), ∀ℓ ∈ −P..0    (5f)
∑_{k=0}^{9} r̂(σ)^a_{k,ℓ} = r(σ)    ∀I ∈ I_1^{IR}, ∀σ ∈ seq_1(I), ∀ℓ ∈ −P..0    (5g)
∑_{ℓ=−P}^{0} ∑_{k=0}^{9} 10^ℓ · k · r̂(σ)^a_{k,ℓ} + Δr(σa) = r(σa)    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀σ ∈ seq_1(I)    (5h)
(r(σ) − 1) · 10^{−P} + Δx(a) ≤ Δr(σa) ≤ 10^{−P} · r(σ)    ∀I ∈ I_1^{IR}, ∀a ∈ A(I), ∀σ ∈ seq_1(I)    (5i)
0 ≤ Δr(σa) ≤ Δx(a)    ∀I ∈ I_1^{IR}, ∀σ ∈ seq_1(I), ∀a ∈ A(I)    (5j)

Due to this loose representation of Δr(σ_1 a), the reformulation of bilinear terms is no longer exact and this MILP therefore yields an upper bound on the bilinear sequence-form program (3). Note that the MILP has both the number of variables and the number of constraints bounded by O(|I| · |Σ| · P), where |Σ| is the number of sequences of both players. The number of binary variables is equal to 10 · |I_1^{IR}| · A_1^{max} · P, where A_1^{max} = max_{I∈I_1} |A_1(I)|.

3.3 Theoretical Analysis of the Upper Bound MILP

The variables Δx(a) and Δr(σ) ensure that the optimal value of the MILP is an upper bound on the value of the bilinear program. The drawback is that the realization probabilities do not have to induce a valid strategy in the imperfect recall game G, i.e., if σ_1, σ_2 are two sequences leading to an imperfect recall information set I_1 ∈ I_1^{IR} where action a ∈ A(I_1) can be played, r(σ_1 a)/r(σ_1) need not equal r(σ_2 a)/r(σ_2). We will show that it is possible to create a valid strategy in G which decreases the value by at most ε, and we derive a bound on this ε.

Let β^1(I_1), ..., β^k(I_1) be the behavioral strategies in the imperfect recall information set I_1 ∈ I_1^{IR} corresponding to the realization probabilities of the continuations of the sequences σ^1, ..., σ^k leading to I_1. These probability distributions can be obtained from the realization plan as β^j(I_1, a) = r(σ^j a)/r(σ^j) for σ^j ∈ seq_1(I_1) and a ∈ A(I_1). We omit the information set and use β^j(a) whenever it is clear from the context. If perfect recall is violated in I_1, β^j(a) may not be equal to β^l(a) for some j, l and action a ∈ A(I_1).

Proposition 1. It is always possible to construct a strategy β(I_1) such that ‖β(I_1) − β^j(I_1)‖_1 ≤ |A(I_1)| · 10^{−P} for every j.²
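One simple construction in the spirit of Proposition 1 (a sketch with invented numbers, assuming the induced strategies already agree up to 10^{−P} per action) is to average the induced strategies and check the L1 bound:

```python
# Sketch of a construction behind Proposition 1 (hypothetical numbers):
# average the near-identical strategies induced by different sequences.
P = 2
betas = [{"a": 0.300, "b": 0.700},
         {"a": 0.304, "b": 0.696}]   # induced strategies, agreeing up to 10^-P
beta = {a: sum(bj[a] for bj in betas) / len(betas) for a in betas[0]}
l1 = max(sum(abs(beta[a] - bj[a]) for a in beta) for bj in betas)
assert l1 <= len(beta) * 10 ** (-P)  # ||beta - beta^j||_1 <= |A(I_1)| * 10^-P
```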

We now connect the distance of a corrected strategy β(I_1) from a set of behavioral strategies β^1(I_1), ..., β^k(I_1) in I_1 ∈ I_1^{IR} to the expected value of the strategy.

Theorem 1. The error of the Upper Bound MILP is bounded by

ε = 10^{−P} · d · A_1^{max} · (v^{max}(∅) − v^{min}(∅)) / 2,

where d is the maximum number of player 1's imperfect recall information sets encountered on a path from the root to a terminal node, A_1^{max} = max_{I_1∈I_1^{IR}} |A(I_1)| is the branching factor, and v^{min}(∅), v^{max}(∅) are the lowest and highest utilities for player 1 in the whole game, respectively.

The idea of the proof is to bound the error in every I_1 ∈ I_1^{IR} and propagate the error in a bottom-up fashion.

² The L1 norm is taken as ‖x_1 − x_2‖_1 = ∑_{a∈A(I_1)} |x_1(a) − x_2(a)|.
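The bound of Theorem 1 is easy to evaluate numerically; the sketch below plugs in invented parameters for a hypothetical game to show how ε shrinks with the precision P:

```python
# Sketch: the error bound of Theorem 1 as a function of the precision P,
# evaluated for a hypothetical game (all parameter values invented here).
def ub_milp_error(P, d, A_max, v_min, v_max):
    """epsilon = 10^-P * d * A_max * (v_max - v_min) / 2"""
    return 10 ** (-P) * d * A_max * (v_max - v_min) / 2

# depth d = 3 imperfect recall infosets per path, branching 2, utilities in [-1, 1]
eps = ub_milp_error(P=2, d=3, A_max=2, v_min=-1.0, v_max=1.0)
assert abs(eps - 0.06) < 1e-9
```

Each extra digit of precision divides the guaranteed error by 10, at the cost of 10·|I_1^{IR}|·A_1^{max} additional binary variables.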

ICAART 2017 - 9th International Conference on Agents and Artiﬁcial Intelligence

68

The error of the Upper Bound MILP is bounded if the precision of all approximations of bilinear terms is P. However, we can increase the precision for each term separately and thus design the following iterative algorithm (termed simply MILP in the experiments): (1) start with the precision set to 0 for all bilinear terms; (2) for each approximation of a bilinear term, calculate the current error contribution (the difference between Δr(σ_1 a) and r(σ_1)Δx(a), multiplied by the expected utility) and increase the precision only for the term that contributes most to the overall error. Once the term with maximal error has reached the maximal precision P, Theorem 1 guarantees that we are ε-close to the optimal solution. Our algorithm in the following section simultaneously increases the precision of the bilinear-term approximation while searching for optimal values of the binary variables.
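Step (2) of this iterative scheme can be sketched as follows (our own bookkeeping with hypothetical values; the field names are not from the paper):

```python
# Sketch (hypothetical values): refine only the bilinear-term approximation
# with the largest error contribution.
def error_contribution(t):
    # difference between Delta r(sigma a) and r(sigma) * Delta x(a),
    # weighted by the magnitude of the expected utility below the term
    return abs(t["delta_r"] - t["r"] * t["delta_x"]) * abs(t["util"])

terms = [
    {"name": "t1", "delta_r": 0.009, "r": 0.5, "delta_x": 0.01, "util": 2.0, "precision": 0},
    {"name": "t2", "delta_r": 0.010, "r": 0.2, "delta_x": 0.01, "util": 5.0, "precision": 0},
]
worst = max(terms, key=error_contribution)
worst["precision"] += 1          # step (2): refine only the worst offender
```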

4 BRANCH-AND-BOUND

ALGORITHM

We now introduce a branch-and-bound (BNB) search

for approximating maxmin strategies that exploits the observation below and thus improves performance

compared to the previous MILP formulations. Addi-

tionally, we provide bounds on the overall runtime as

a function of the desired precision.

The BNB algorithm works on the linear relaxation of the Upper Bound MILP and searches the BNB tree in a best-first-search manner. In every node n, the algorithm solves the relaxed LP corresponding to node n, heuristically selects the information set I and action a contributing most to the current approximation error, and creates successors of n by restricting the probability β_1(I, a) that a is played in I. The algorithm adds new constraints to the LP depending on the value of β_1(I, a) by constraining (and/or introducing new) variables w^{I_1,a}_{k,ℓ}, and creates successors of the BNB node in the search tree. Note that the w^{I_1,a}_{k,ℓ} variables correspond to the binary variables in the MILP formulation. This way, the algorithm simultaneously searches for the optimal approximation of the bilinear terms as well as the assignment of the binary variables. The algorithm terminates when an ε-optimal strategy is found (using the difference between the global upper bound and the lower bound computed as described in Observation 1 below).
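The best-first loop described above can be sketched with a priority queue (illustrative only; LP solving, action selection, and strategy reconstruction are abstracted into node tuples and a hypothetical expand function):

```python
# Best-first branch-and-bound skeleton in the spirit of Algorithm 1
# (illustrative only: LP solving, action selection and strategy
# reconstruction are abstracted away).
import heapq
import itertools

def branch_and_bound(root, epsilon, expand):
    """Nodes are (ub, lb, data); expand(node) yields refined child nodes."""
    counter = itertools.count()                 # tie-breaker for the heap
    fringe = [(-root[0], next(counter), root)]  # max-heap on ub via negation
    opt = (float("-inf"), float("-inf"), None)
    while fringe:
        _, _, node = heapq.heappop(fringe)      # node with highest upper bound
        ub, lb, _ = node
        if opt[1] >= ub:                        # incumbent dominates the rest
            break
        if lb > opt[1]:                         # new best guaranteed value
            opt = node
        if ub - lb <= epsilon:                  # epsilon-optimal node found
            break
        for child in expand(node):
            heapq.heappush(fringe, (-child[0], next(counter), child))
    return opt

# Toy run: the root relaxation is split into two tighter children.
def expand(node):
    return {"root": [(6.0, 5.0, "a"), (8.0, 7.9, "b")]}.get(node[2], [])

best = branch_and_bound((10.0, 0.0, "root"), epsilon=0.2, expand=expand)
```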

Observation 1. Even if the current assignment to

variables w

I

1

,a

k,`

is not feasible (they are not set to bi-

nary values), the realization plan produced is valid

in the perfect recall reﬁnement. We can ﬁx it in the

Algorithm 1: BNB algorithm.

input      : Initial LP relaxation $LP_0$ of the Upper Bound MILP using a P = 0 discretization
output     : ε-optimal strategy for a player having imperfect recall
parameters : Bound on maximum error ε, precision bounds for x(a) variables $P_{max}(I_1,a)$

 1  fringe ← {CreateNode($LP_0$)}
 2  opt ← (nil, −∞, ∞)
 3  while fringe ≠ ∅ do
 4      (LP, lb, ub) ← $\arg\max_{n \in fringe}$ n.ub
 5      fringe ← fringe \ {(LP, lb, ub)}
 6      if opt.lb ≥ n.ub then
 7          return ReconstructStrategy(opt)
 8      if opt.lb < n.lb then
 9          opt ← n
10      if n.ub − n.lb ≤ ε then
11          return ReconstructStrategy(opt)
12      else
13          $(I_1, a)$ ← SelectAction(n)
14          P ← number of digits of precision representing x(a) in LP
15          fringe ← fringe ∪ {CreateNode(LP ∪ {$\sum_{k=0}^{\lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_{-P}} w^{I_1,a}_{k,P} = 1$})}
16          fringe ← fringe ∪ {CreateNode(LP ∪ {$\sum_{k=\lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_{-P}}^{9} w^{I_1,a}_{k,P} = 1$})}
17          if P < $P_{max}(I_1,a)$ then
18              fringe ← fringe ∪ {CreateNode(LP ∪ {$w^{I_1,a}_{LP.x(a)_{-P},\,P} = 1$, introduce vars $w^{I_1,a}_{0,P+1}, \ldots, w^{I_1,a}_{9,P+1}$ and corresponding constraints from MDT})}
19  return ReconstructStrategy(opt)

20  function CreateNode(LP)
21      ub ← Solve(LP)
22      $\beta_1$ ← ReconstructStrategy(LP)
23      lb ← $u_1(\beta_1, \mathrm{BestResponse}(\beta_1))$
24      return (LP, lb, ub)

sense of Proposition 1 and use it to estimate the lower

bound for the BNB subtree rooted in the current node

without a complete assignment of all $w^{I_1,a}_{k,\ell}$ variables to either 0 or 1.

Algorithm 1 depicts the complete BNB algorithm.

It takes an LP relaxation of the Upper Bound MILP

as its input. Initially, the maxmin strategy is approx-

imated using 0 digits of precision after the decimal

point (i.e., precision $P(I_1,a) = 0$ for every variable x(a)). The algorithm maintains a set of active BNB

x(a)). The algorithm maintains a set of active BNB

nodes (fringe) and a candidate with the best guaran-

teed value opt. The algorithm selects the node with

the highest upper bound from fringe at each itera-

tion (lines 4–5). If there is no potential for improve-

ment in the unexplored parts of the branch and bound

tree, the current best solution is returned (line 7) (up-


per bounds of the nodes added to the fringe in the

future will never be higher than the current upper

bound). Next, we check whether the current solution has a better lower bound than the current best; if so, we replace it (line 9). Since we always select

the most promising node with respect to the upper

bound, we are sure that if the lower bound and up-

per bound have distance at most ε, we have found

an ε-optimal solution and we can terminate (line 11). Otherwise, we heuristically select an action having the highest effect on the gap between the current upper and lower bound (line 13). We obtain the precision used to represent the behavioral probability of this action. By default, we add two successors of the current BNB node, each with one of the following constraints: $x(a) \le \lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_{-P}$ (line 15) and $x(a) \ge \lfloor \frac{a_{ub}+a_{lb}}{2} \rfloor_{-P}$ (line 16), where $\lfloor \cdot \rfloor_{p}$ is flooring of a number towards p digits of precision and $a_{lb}$ and $a_{ub}$ are the lowest and highest allowed values of x(a). This step performs binary halving, restricting the allowed values of x(a) in the current precision. Additionally, if the current precision is lower than the maximal precision $P_{max}(I_1,a)$, the gap between the bounds may be caused by the lack of discretization points; hence, we add one more successor $\lfloor v \rfloor \le x(a) \le \lceil v \rceil$, where v is the current probability of playing a, while increasing the precision used for representing x(a) (line 18). (All the restrictions of x(a) in all three cases are imposed via the $w^{I_1,a}_{k,\ell}$ variables.)
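The flooring operator $\lfloor \cdot \rfloor_p$ and the midpoint split can be made concrete with a short helper. This is an illustrative sketch in plain arithmetic; the algorithm itself imposes these restrictions through the w variables, and the function names are ours:

```python
import math

def floor_to_precision(x, p, tol=1e-12):
    """Floor x towards p digits after the decimal point, e.g.
    floor_to_precision(0.378, 2) -> 0.37.  The tol guard absorbs
    binary floating-point error before flooring."""
    scale = 10 ** p
    return math.floor(x * scale + tol) / scale

def halve(a_lb, a_ub, p):
    """One left/right branching step at precision p: split the allowed
    interval for x(a) at the floored midpoint (cf. lines 15-16)."""
    mid = floor_to_precision((a_lb + a_ub) / 2, p)
    return (a_lb, mid), (mid, a_ub)
```

Starting from the full interval [0, 1] at one digit of precision, `halve(0.0, 1.0, 1)` yields the two subproblems (0, 0.5) and (0.5, 1).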

The function CreateNode computes the upper bound by solving the given LP (line 21) and the lower bound by using a heuristic construction of a valid strategy $\beta_1$ returning some convex combination of the strategies found for $\sigma^k_1 \in seq_1(I_1)$ (line 22) and computing the expected value of $\beta_1$ against a best response to it.

Note that this algorithm allows us to plug in custom heuristics for the reconstruction of strategies (line 22) and for the action selection (line 13).
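The control flow of Algorithm 1 (best-first selection on upper bounds, incumbent tracking, ε-termination) can be sketched generically. The `solve_relaxation` and `branch` callables are placeholders we introduce for the LP solver and the three-way branching rule; only the bound bookkeeping follows the pseudocode:

```python
import heapq

def branch_and_bound(root, solve_relaxation, branch, epsilon):
    """Best-first search over relaxed subproblems.

    solve_relaxation(node) -> (lb, ub): lb from a repaired feasible
    strategy, ub from the LP relaxation.  branch(node) -> children.
    """
    fringe = []                      # max-heap on ub via negated keys
    lb0, ub0 = solve_relaxation(root)
    heapq.heappush(fringe, (-ub0, 0, lb0, root))
    best_lb, best_node, counter = float("-inf"), root, 1
    while fringe:
        neg_ub, _, lb, node = heapq.heappop(fringe)
        ub = -neg_ub
        if best_lb >= ub:            # nothing left can improve (line 6)
            return best_node, best_lb
        if lb > best_lb:             # new incumbent (lines 8-9)
            best_lb, best_node = lb, node
        if ub - lb <= epsilon:       # epsilon-optimal (lines 10-11)
            return best_node, best_lb
        for child in branch(node):   # expand (lines 13-18)
            clb, cub = solve_relaxation(child)
            heapq.heappush(fringe, (-cub, counter, clb, child))
            counter += 1
    return best_node, best_lb
```

A toy instantiation maximizing x(1−x) over [0, 1] by interval splitting (standing in for the LP relaxation) converges to value 0.25.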

4.1 Theoretical Properties of the BNB Algorithm

The BNB algorithm takes the error bound ε as an input. We provide a method for setting the $P_{max}(I_1,a)$ parameters appropriately to guarantee ε-optimality.

Finally, we provide a bound on the number of steps

the algorithm needs to terminate.

Theorem 2. Let $P_{max}(I_1,a)$ be the maximum number of digits of precision used for representing variable x(a), set as

$$P_{max}(I_1,a) = \max_{h \in I_1} \log_{10} \frac{|A(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon},$$

where $v_{diff}(h) = v_{max}(h) - v_{min}(h)$. With this setting, Algorithm 1 terminates and it is guaranteed to return an ε-optimal strategy for player 1.

The proof provides a bound on the error per node in every $I \in \mathcal{I}^{IR}_i$ and propagates this bound through the game tree in a bottom-up fashion.
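For intuition, the formula of Theorem 2 translates into a helper computing the digits of precision needed for a given ε. We take the ceiling to obtain an integer digit count (the theorem statement leaves the rounding implicit), and all names are our own:

```python
import math

def p_max(eps, num_actions, d, v_diff_values):
    """Digits of precision required for x(a) in an information set I_1:
    ceiling of max over h in I_1 of
    log10(|A(I_1)| * d * v_diff(h) / (2 * eps))."""
    raw = max(math.log10(num_actions * d * v / (2 * eps))
              for v in v_diff_values)
    return max(0, math.ceil(raw))
```

For example, with 3 actions, depth d = 4, utility range 2 and ε = 0.01, four digits of precision suffice.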

Theorem 3. When using $P_{max}(I_1,a)$ from Theorem 2 for all $I_1 \in \mathcal{I}_1$ and all $a \in A(I_1)$, the number of iterations of the BNB algorithm needed to find an ε-optimal solution is in $O(3^{4S_1(\log_{10}(S_1 \cdot v_{diff}(\emptyset))+1)} \cdot 2^{-5S_1} \cdot \varepsilon^{-5S_1})$, where $S_1 = |\mathcal{I}_1| A^{max}_1$.

The proof derives a bound on the number of nodes in the BNB tree dependent on $\max_{I_1 \in \mathcal{I}_1, a \in A_1(I_1)} P_{max}(I_1,a)$ and uses the formula from Theorem 2 to transform the bound into a function of ε.
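To get a feel for the growth of this worst-case bound, one can evaluate its closed form for small parameters (a sketch with our own names; real instances terminate far earlier, as the experiments in Section 5 show):

```python
import math

def iteration_bound(s1, v_diff_root, eps):
    """Worst-case BNB iterations per Theorem 3:
    3**(4*s1*(log10(s1*v_diff_root) + 1)) * (2*eps)**(-5*s1)."""
    exponent = 4 * s1 * (math.log10(s1 * v_diff_root) + 1)
    return 3 ** exponent * (2 * eps) ** (-5 * s1)
```

For s1 = 1, a utility range of 10 and ε = 0.5, the bound evaluates to 3^8 = 6561 iterations.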

5 EXPERIMENTS

We now demonstrate the practical aspects of our main

branch-and-bound algorithm (BNB) described in Sec-

tion 4 and the iterative MILP variant described in Sec-

tion 3. We compare the algorithms on a set of ran-

dom games where player 2 has A-loss recall. Both

algorithms were implemented in Java; each algorithm uses a single thread and an 8 GB memory limit. We use IBM

ILOG CPLEX 12.6 to solve all LPs/MILPs.

5.1 Random Games

Since there is no standardized collection of bench-

mark EFGs, we use randomly generated games in

order to obtain statistically signiﬁcant results. We

randomly generate a perfect recall game with vary-

ing branching factor and ﬁxed depth of 6. To control

the information set structure, we use observations as-

signed to every action – for player i, nodes h with

the same observations generated by all actions in his-

tory belong to the same information set. In order to

obtain imperfect recall games with a non-trivial in-

formation set structure, we run a random abstraction

algorithm which randomly merges information sets that have the same number of actions and whose merging does not cause absentmindedness. We generate a set of experimental instances by varying the branching factor. Such

games are rather difﬁcult to solve since (1) informa-

tion sets can span multiple levels of the game tree (i.e.,

the nodes in an information set often have histories

ICAART 2017 - 9th International Conference on Agents and Artiﬁcial Intelligence


Table 1: Average runtime and standard error in seconds needed to solve at least 100 different random games in every setting, while increasing the branching factor with fixed depth of 6.

Algs \ b.f.        3                 4
MILP        157.87 ± 61.47   459.09 ± 75.95
BNB           9.52 ± 3.78    184.50 ± 48.22

with differing sizes) and (2) actions can easily lead to

leaves with very different utility values. We always devise an abstraction which results in A-loss recall for the minimizing player.
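A minimal sketch of the observation-driven grouping of histories into information sets (our simplification of the generator described above: each (level, action) pair carries a fixed random observation, and histories with identical observation sequences share an information set):

```python
import random

def observation_infosets(branching, depth, num_obs, seed=0):
    """Group decision histories (tuples of actions) by the observation
    sequence their actions generate; each group is one information set."""
    rng = random.Random(seed)
    # a fixed random observation for each (level, action) pair
    obs = {(lvl, a): rng.randrange(num_obs)
           for lvl in range(depth) for a in range(branching)}
    infosets = {}

    def walk(history, signal):
        if len(history) == depth:
            return
        infosets.setdefault(signal, []).append(history)
        for a in range(branching):
            walk(history + (a,), signal + (obs[(len(history), a)],))

    walk((), ())
    return infosets
```

An imperfect recall abstraction can then be produced by further merging groups with the same action count, as long as no history visits the merged set twice.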

5.2 Results

Table 1 reports the average runtime and its standard

error in seconds for both algorithms over at least 100

different random games with increasing branching

factor and ﬁxed depth of 6. We limited the runtime

for 1 instance to 2 hours (in the Table we report results

only for instances where both algorithms ﬁnished un-

der 2 hours). The BNB algorithm was terminated 17

and 30 times after 2 hours for the reported settings

respectively, while the MILP algorithm was terminated

19 and 34 times. Note that the random games form

an unfavorable scenario for both algorithms since the

construction of the abstraction is completely random,

which makes conﬂicting behavior in merged informa-

tion sets common. As we can see, however, even in

these scenarios we are typically able to solve games

with approximately $5 \cdot 10^3$ states in several minutes.

6 CONCLUSIONS

We provide the ﬁrst algorithm for approximat-

ing maxmin strategies in imperfect recall zero-sum

extensive-form games without absentmindedness. We

give a novel mathematical formulation for computing

approximate strategies by means of a mixed-integer

linear program (MILP) that uses recent methods in the

approximation of bilinear terms. Next, we use a linear

relaxation of this MILP and introduce a branch-and-

bound search (BNB) that simultaneously looks for the

correct solution of binary variables and increases the

precision for the approximation of bilinear terms. We

provide guarantees that both MILP and BNB ﬁnd an

approximate optimal solution. Finally, we show that

the algorithms are capable of solving games of sizes

far beyond toy problems (up to $5 \cdot 10^3$ states), typically within a few minutes in practice.

Results presented in this paper provide the ﬁrst

baseline algorithms for the class of imperfect recall

games that are of great importance in solving large

extensive-form games with perfect recall. As such,

our algorithms can be further extended to improve

the current scalability, e.g., by employing incremental

strategy generation methods.

ACKNOWLEDGEMENTS

This research was supported by the Czech Sci-

ence Foundation (grant no. 15-23235S) and by

the Grant Agency of the Czech Technical Univer-

sity in Prague, grant No. SGS16/235/OHK3/3T/13.

Computational resources were provided by the

CESNET LM2015042 and the CERIT Scientiﬁc

Cloud LM2015085, provided under the programme

"Projects of Large Research, Development, and Innovations Infrastructures".


APPENDIX

Proposition 1. It is always possible to construct a strategy $\beta(I_1)$ such that $\|\beta(I_1) - \beta_j(I_1)\|_1 \le |A(I_1)| \cdot 10^{-P}$ for every j.

Proof. Probabilities of playing action a in $\beta_1, \ldots, \beta_k$ can differ by at most $10^{-P}$, i.e. $|\beta_j(a) - \beta_l(a)| \le 10^{-P}$ for every j, l and action $a \in A(I_1)$. This is based on the MDT we used to discretize the bilinear program. Let us denote

$$r(\sigma_1 a) = \sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot \hat{r}(\sigma_1)^{a}_{k,\ell} \qquad (6)$$

$$x(I_1,a) = \sum_{\ell=-P}^{0} \sum_{k=0}^{9} 10^{\ell} \cdot k \cdot w^{I_1,a}_{k,\ell}. \qquad (7)$$

Constraints (5f) and (5g) ensure that $r(\sigma_1 a) = r(\sigma_1) \cdot x(I_1,a)$. The only way the imperfect recall can be violated is thus in the usage of $\Delta r(\sigma_1 a)$. We know, however, that $\Delta r(\sigma_1 a) \le 10^{-P} \cdot r(\sigma_1)$, which ensures that the amount of imbalance in $\beta_1, \ldots, \beta_k$ is at most $10^{-P}$. Taking any of the behavioral strategies $\beta_1, \ldots, \beta_k$ as the corrected behavioral strategy $\beta(I_1)$ therefore satisfies $\|\beta(I_1) - \beta_j(I_1)\|_1 \le \sum_{a \in A(I_1)} 10^{-P} = |A(I_1)| \cdot 10^{-P}$.
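Equation (7) can be checked numerically: with one digit indicator selected per precision level, the weighted sum recovers the encoded probability. The dict-based encoding below is our own illustration, not the MILP's variable layout:

```python
def decode(w, P):
    """Recover x(I_1,a) from digit-indicator variables per Eq. (7):
    sum over l in [-P, 0] and k in 0..9 of 10**l * k * w[(k, l)]."""
    return sum(10 ** l * k * w.get((k, l), 0)
               for l in range(-P, 1) for k in range(10))

# x = 0.37 at P = 2: digit 0 at level 0, 3 at level -1, 7 at level -2
w = {(0, 0): 1, (3, -1): 1, (7, -2): 1}
```

Perturbing a single indicator changes x by a multiple of $10^{\ell}$, which is the granularity the branching in Algorithm 1 exploits.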

We now provide the technical proof of Theorem 1. First, we connect the distance of a corrected strategy $\beta(I_1)$ from a set of behavioral strategies $\beta_1(I_1), \ldots, \beta_k(I_1)$ in $I_1 \in \mathcal{I}^{IR}_1$ to the expected value of the strategy. We start by bounding this error in a single node.

Lemma 3. Let $h \in I_1$ be a history and $\beta_1$, $\beta_2$ be behavioral strategies (possibly prescribing different behavior in $I_1$) prescribing the same distribution over actions for all subsequent histories $h' \sqsupseteq h$. Let $v_{max}(h)$ and $v_{min}(h)$ be the maximal and minimal utilities of player 1 in the subtree of h, respectively. Then the following holds:

$$|v_{\beta_1}(h) - v_{\beta_2}(h)| \le \frac{v_{diff}(h)}{2} \cdot \|\beta_1(I_1) - \beta_2(I_1)\|_1,$$

where $v_{\beta_j}(h)$ is the maxmin value $u(\beta_j, \beta^{BR}_2)$ of strategy $\beta_j$ of player 1 given the play starts in h and $v_{diff}(h) = v_{max}(h) - v_{min}(h)$.

Proof. Let us study strategies $\beta_1$ and $\beta_2$ in node h. Let us take $\beta_1(I_1)$ as a baseline and transform it towards $\beta_2(I_1)$. We can identify two subsets of $A(I_1)$: a set of actions $A^+$ where the probability of playing the action in $\beta_2$ was increased and $A^-$ where the probability was decreased. Let us denote

$$C^{\circ} = \sum_{a \in A^{\circ}} |\beta_1(I_1,a) - \beta_2(I_1,a)| \qquad \forall \circ \in \{+,-\}.$$

We know that $C^+ = C^-$ (as strategies have to be probability distributions). Moreover, we know that $\|\beta_1(I_1) - \beta_2(I_1)\|_1 = C^+ + C^-$. In the worst case, decreasing the probability of playing action $a \in A^-$ risks losing a quantity proportional to the amount of this decrease multiplied by the highest utility in the subtree, $v_{max}(h)$. For all actions $a \in A^-$ this loss is equal to

$$v_{max}(h) \cdot \sum_{a \in A^-} |\beta_1(I_1,a) - \beta_2(I_1,a)| = v_{max}(h) \cdot C^-.$$

Similarly, the increase of the probabilities of actions in $A^+$ can add in the worst case $v_{min}(h) \cdot C^+$ to the value of the strategy. Combined, this yields

$$v_{\beta_2}(h) - v_{\beta_1}(h) \ge -v_{max}(h) \cdot C^- + v_{min}(h) \cdot C^+ = [-v_{max}(h) + v_{min}(h)] \cdot C^+ = \frac{-v_{max}(h) + v_{min}(h)}{2} \cdot 2C^+ = \frac{-v_{max}(h) + v_{min}(h)}{2} \cdot \|\beta_1(I_1) - \beta_2(I_1)\|_1.$$

The strategies $\beta_1$, $\beta_2$ are interchangeable, which results in the final bound on the difference of $v_{\beta_2}(h)$ and $v_{\beta_1}(h)$.
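Lemma 3 can be sanity-checked on a one-shot toy: a single node where the value of a strategy is just the expected payoff over actions (our own instantiation; with an adversarial best response the bound only becomes looser):

```python
def value(beta, u):
    """Expected utility of behavioral distribution beta under utilities u."""
    return sum(beta[a] * u[a] for a in u)

u  = {"a": 1.0, "b": -1.0, "c": 0.25}   # utilities below node h
b1 = {"a": 0.6, "b": 0.3, "c": 0.1}
b2 = {"a": 0.2, "b": 0.5, "c": 0.3}

l1_dist = sum(abs(b1[a] - b2[a]) for a in u)   # ||b1 - b2||_1
v_diff  = max(u.values()) - min(u.values())    # v_max(h) - v_min(h)
gap     = abs(value(b1, u) - value(b2, u))
```

Here the value gap is 0.55 while the lemma's bound evaluates to (2.0 / 2) * 0.8 = 0.8, consistent with the inequality.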


Now we are ready to bound the error in the whole

game tree.

Theorem 1. The error of the Upper Bound MILP is bounded by

$$\varepsilon = 10^{-P} \cdot d \cdot A^{max}_1 \cdot \frac{v_{max}(\emptyset) - v_{min}(\emptyset)}{2},$$

where d is the maximum number of player 1's imperfect recall information sets encountered on a path from the root to a terminal node, $A^{max}_1 = \max_{I_1 \in \mathcal{I}^{IR}_1} |A(I_1)|$ is the branching factor and $v_{min}(\emptyset)$, $v_{max}(\emptyset)$ are the lowest and highest utilities for player 1 in the whole game, respectively.

Proof. We show an inductive way to compute the bound on the error and we show that the bound from Theorem 1 is its upper bound. Throughout the derivation we assume that the opponent plays to maximize the error bound. We proceed in a bottom-up fashion over the nodes in the game tree, computing the maximum loss L(h) player 1 could have accumulated by correcting his behavioral strategy in the subtree of h, i.e.

$$L(h) \ge u_h(\beta^0) - u_h(\beta^{IR}),$$

where $\beta^0$ is the (incorrect) behavioral strategy of player 1 acting according to the realization probabilities $r(\sigma)$ from the solution of the Upper Bound MILP, $\beta^{IR}$ is its corrected version and $u_h(\beta)$ is the expected utility of a play starting in history h when player 1 plays according to β and his opponent best responds (without knowing that the play starts in h). The proof proceeds in a case-by-case manner.

(1) No corrections are made in subtrees of leaves h, thus the loss L(h) = 0.

(2) The chance player selects one of the successor nodes based on the fixed probability distribution. The loss is then the expected loss over all child nodes, $L(h) = \sum_{a \in A(h)} L(h \cdot a) \cdot \mathcal{C}(h \cdot a)/\mathcal{C}(h)$. In the worst case, the chance player selects the child with the highest associated loss, therefore

$$L(h) \le \max_{a \in A(h)} L(h \cdot a).$$

(3) Player 2 wants to maximize player 1’s loss.

Therefore she selects such an action in her node h

that leads to a node with the highest loss, L(h) ≤

max

a∈A(n)

L(h · a). This is a pessimistic estimate of

the loss as she may not be able to pick the maximiz-

ing action in every state because of the imperfection

of her information.

(4) If player 1's node h is not a part of an imperfect recall information set, no corrective steps need to be taken. The expected loss at node h is therefore $L(h) = \sum_{a \in A(h)} \beta^0(h,a) L(h \cdot a)$. Once again, in the worst case player 1's behavioral strategy $\beta^0(h)$ selects deterministically the child node with the highest associated loss, therefore $L(h) \le \max_{a \in A(h)} L(h \cdot a)$.

(5) So far we have considered cases that only aggregate losses from child nodes. If player 1's node h is part of an imperfect recall information set, a correction step may have to be taken. Let $\beta^{-h}$ be a behavioral strategy where corrective steps have been taken for successors of h and let us construct a strategy $\beta^{h}$ where the strategy was corrected in the whole subtree of h (i.e. including h). Note that ultimately we want to construct the strategy $\beta^{\emptyset} = \beta^{IR}$.

We know that the values of children have been decreased by at most $\max_{a \in A(h)} L(h \cdot a)$, hence $v_{\beta^0}(h) - v_{\beta^{-h}}(h) \le \max_{a \in A(h)} L(h \cdot a)$. Then we have to take the corrective step at the node h and construct strategy $\beta^{h}$. From Lemma 3 and the observation about the maximum distance of behavioral strategies within a single imperfect recall information set $I_1$, we get:

$$v_{\beta^{-h}}(h) - v_{\beta^{h}}(h) \le \frac{v_{diff}(h)}{2} \cdot 10^{-P} |A_1(I_1)| \le \frac{v_{diff}(\emptyset)}{2} \cdot 10^{-P} A^{max}_1.$$

The loss in the subtree of h incurred by player 1 acting according to the realization probabilities is thus bounded by

$$L(h) = v_{\beta^0}(h) - v_{\beta^{h}}(h) = \left[ v_{\beta^{-h}}(h) - v_{\beta^{h}}(h) \right] + \left[ v_{\beta^0}(h) - v_{\beta^{-h}}(h) \right] \le \frac{v_{diff}(\emptyset)}{2} \cdot 10^{-P} A^{max}_1 + \max_{a \in A(h)} L(h \cdot a).$$

We will now provide an explicit bound on the loss in the root node $L(\emptyset)$. We have shown that in order to prove the worst-case bound it suffices to consider a deterministic choice of action at every node; this means that a single path in the game tree is pursued during the propagation of the loss. The loss is increased exclusively in imperfect recall nodes and we can encounter at most d such nodes on any path from the root. The increase in such nodes is constant ($[v_{max}(\emptyset) - v_{min}(\emptyset)] \cdot 10^{-P} A^{max}_1 / 2$), therefore the bound is

$$\varepsilon = L(\emptyset) \le [v_{max}(\emptyset) - v_{min}(\emptyset)] \cdot d \cdot 10^{-P} A^{max}_1 / 2.$$

We now know that the expected value of the strategy we have found lies within the interval $[v^* - \varepsilon, v^*]$, where $v^*$ is the optimal value of the Upper Bound MILP. As $v^*$ is an upper bound on the solution of the original bilinear program, no strategy can be better than $v^*$, which means that the strategy we found is ε-optimal.
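The case analysis above is a bottom-up recursion: leaves contribute zero, chance, opponent and perfect-recall nodes propagate the maximum child loss, and player 1's imperfect recall nodes add the constant per-node term. A toy recursion over a tuple-encoded tree (our own encoding, purely illustrative):

```python
def loss_bound(node, per_node_term):
    """Worst-case loss L(h) following the case analysis: leaves cost 0,
    chance/opponent/perfect-recall nodes propagate the max child loss,
    player 1's imperfect recall nodes add the constant per-node term."""
    kind, children = node
    if kind == "leaf":
        return 0.0                                    # case (1)
    worst_child = max(loss_bound(c, per_node_term) for c in children)
    if kind == "p1_ir":                               # case (5)
        return per_node_term + worst_child
    return worst_child                                # cases (2)-(4)

leaf = ("leaf", [])
tree = ("p1_ir", [("chance", [("p1_ir", [leaf]), leaf]), leaf])
```

With two imperfect recall nodes on the deepest path (d = 2) and a per-node term of 0.1, the accumulated bound is 0.2, i.e. d times the constant, as in the theorem.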


Theorem 2. Let $P_{max}(I_1,a)$ be the maximum number of digits of precision used for representing variable x(a), set as

$$P_{max}(I_1,a) = \max_{h \in I_1} \log_{10} \frac{|A(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon},$$

where $v_{diff}(h) = v_{max}(h) - v_{min}(h)$. With this setting, Algorithm 1 terminates and it is guaranteed to return an ε-optimal strategy for player 1.

Proof. We start by proving that Algorithm 1 with this choice of $P_{max}(I_1,a)$ terminates. We will show that every branch of the branch-and-bound search tree is finite. This, together with the fact that every node is visited at most once and that the branching factor of the search tree is finite (every node of the search tree has at most 3 child nodes), ensures that the algorithm terminates.

Every node of the search tree is tied to branching on some variable x(a). Let p be the current precision used to represent x(a) and let us consider the first node on the branch where x(a) is represented with such precision. At that point, p − 1 digits are fixed and thus $x \in [c, c + 10^{-(p-1)}]$ for some $c \in [0,1]$. On line 18 an interval of size $10^{-p}$ is handled; every left/right operation (lines 15 and 16) may thus handle an interval whose size is reduced at least by $10^{-p}$. We can conduct at most 9 left/right branching operations (lines 15 and 16) before the size of the interval drops below $10^{-p}$, which forces us to increase p. At most 10 operations can be performed on every x(a) for every precision p, the limit on p is finite for every such variable and the number of variables is finite as well; the branch therefore has to terminate.

Let us now show that these limits on the number of refinements $P_{max}(I_1,a)$ are enough to guarantee ε-optimality. We refer the reader to the proof of Theorem 1 for details while we focus exclusively on the behavior in nodes from imperfect recall information sets.

Let $I_1 \in \mathcal{I}^{IR}_1$ and $h \in I_1$. We know that the L1 distance between behavioral strategies in $I_1$ is at most $10^{-P_{max}(I_1,a)} \cdot |A(I_1)|$ (for any $a \in A(I_1)$). This means that the bound on L(h) in h from the proof of Theorem 1 is modified to:

$$L(h) = v_{\beta^0}(h) - v_{\beta^{h}}(h) = \left[ v_{\beta^{-h}}(h) - v_{\beta^{h}}(h) \right] + \left[ v_{\beta^0}(h) - v_{\beta^{-h}}(h) \right] \le \frac{v_{diff}(h)}{2} \cdot 10^{-P_{max}(I_1,a)} \cdot |A(I_1)| + \max_{a \in A(h)} L(h \cdot a) \le \frac{v_{diff}(h)}{2} \cdot \frac{|A(I_1)| \cdot 2\varepsilon}{|A(I_1)| \cdot d \cdot v_{diff}(h)} + \max_{a \in A(h)} L(h \cdot a) = \frac{\varepsilon}{d} + \max_{a \in A(h)} L(h \cdot a).$$

Similarly to the reasoning in the proof of Theorem 1, it suffices to assume players choosing an action at every node in a deterministic way. The path induced by these choices contains at most d imperfect recall nodes, thus $L(\emptyset) = d \cdot \varepsilon/d = \varepsilon$.

Theorem 3. When using $P_{max}(I_1,a)$ from Theorem 2 for all $I_1 \in \mathcal{I}_1$ and all $a \in A(I_1)$, the number of iterations of the BNB algorithm needed to find an ε-optimal solution is in $O(3^{4S_1(\log_{10}(S_1 \cdot v_{diff}(\emptyset))+1)} \cdot 2^{-5S_1} \cdot \varepsilon^{-5S_1})$, where $S_1 = |\mathcal{I}_1| A^{max}_1$.

Proof. We start by proving that there are $N \in O(3^{4|\mathcal{I}_1| A^{max}_1 P_{max}})$ nodes in the BNB tree, where $A^{max}_1 = \max_{I \in \mathcal{I}_1} |A(I)|$ and $P_{max} = \max_{I \in \mathcal{I}_1, a \in A(I)} P_{max}(I,a)$. This holds since in the worst case we branch for every action in every information set (hence $|\mathcal{I}_1| A^{max}_1$). We can bound the number of branchings for a fixed action by $4 P_{max}$, since there are 10 digits (we branch at most 4 times using binary halving) and we might require $P_{max}$ digits of precision. $4 |\mathcal{I}_1| A^{max}_1 P_{max}$ is therefore the maximum depth of the branch-and-bound tree. Finally, the branching factor of the branch-and-bound tree is at most 3.
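The step "10 digits, at most 4 halvings" can be verified with a two-line counter (our own toy; in the worst case the larger half of the candidate digits survives each split):

```python
def halvings_needed(digits=10):
    """Binary halvings required to narrow `digits` candidate digit
    values down to a single one, assuming the larger half survives."""
    steps = 0
    while digits > 1:
        digits = (digits + 1) // 2  # worst-case surviving half
        steps += 1
    return steps
```

Indeed, 10 → 5 → 3 → 2 → 1 takes exactly 4 halvings, matching the ceiling of log2(10).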

By substituting

$$\max_{I_1 \in \mathcal{I}_1} \max_{h \in I_1} \log_{10} \frac{|A(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon}$$

for $P_{max}$ in the above bound (Theorem 2), we obtain

$$\begin{aligned}
N &\in O\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \max_{h \in I_1} \log_{10} \frac{|A(I_1)| \cdot d \cdot v_{diff}(h)}{2\varepsilon} \right\rceil}\right), \text{ where } S_1 = |\mathcal{I}_1| A^{max}_1 \\
&\in O\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \log_{10} \frac{|A(I_1)| \cdot d \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right) \\
&\in O\left(3^{4S_1 \max_{I_1 \in \mathcal{I}_1} \left\lceil \log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right) \\
&\in O\left(3^{4S_1 \left\lceil \log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} \right\rceil}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10} \frac{S_1 \cdot v_{diff}(\emptyset)}{2\varepsilon} + 1 \right)}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10}(S_1 \cdot v_{diff}(\emptyset)) - \log_{10}(2\varepsilon) + 1 \right)}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1 \right)} \cdot 3^{-10 S_1 \log_{10}(2\varepsilon)}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1 \right)} \cdot 3^{-10 S_1 \frac{\log_3(2\varepsilon)}{\log_3(10)}}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1 \right)} \cdot (2\varepsilon)^{-\frac{10 S_1}{\log_3(10)}}\right) \\
&\in O\left(3^{4S_1 \left( \log_{10}(S_1 \cdot v_{diff}(\emptyset)) + 1 \right)} \cdot (2\varepsilon)^{-5 S_1}\right)
\end{aligned}$$
