
 
both these technologies together. Reinforcement 
learning combined with neural networks has been 
applied successfully to various control problems, 
such as gas turbine control (Schafer, 2008) and a 
motor-control task (Coulom, 2002). 
This research describes the development of a poker 
agent using RL with an ANN. 
4 STATE OF THE ART 
Texas Hold’em is one of the most popular forms of 
poker. It is also a very complex game. Several 
factors make poker an uncertain environment 
(concealed cards, bluffing). These characteristics 
make poker a partially observable Markov decision 
process that has no ready-made solution. 
Reinforcement learning with a neural network for 
value function approximation provides a way to build 
an agent for such an uncertain environment. This 
paper gives a brief description of the information 
needed to develop a poker game agent: 
  Poker game rules; 
  Definition of the partially observable Markov 
decision process; 
  Neural network theory; 
  Reinforcement learning theory. 
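
As a rough illustration of value function approximation with a neural network (the layer sizes, learning rate, and the assumption that a game state is encoded as a fixed-length feature vector are choices made for this sketch, not the method described later in this paper), a state-value estimator can be a small network trained toward TD-style targets:

```python
import numpy as np

class ValueNetwork:
    """Tiny one-hidden-layer network approximating the state-value function V(s).
    The state is assumed to be encoded as a fixed-length feature vector x."""

    def __init__(self, n_features, n_hidden=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def predict(self, x):
        self._h = np.tanh(x @ self.W1 + self.b1)  # hidden activations (cached for update)
        return float(self._h @ self.W2 + self.b2)

    def update(self, x, target):
        """One gradient step toward a scalar target, e.g. a TD target r + gamma * V(s')."""
        v = self.predict(x)
        err = v - target                          # gradient of 0.5 * (v - target)^2 w.r.t. v
        # Backpropagate through the output and hidden layers.
        grad_W2 = np.outer(self._h, err)
        grad_b2 = np.array([err])
        dh = (err * self.W2.ravel()) * (1 - self._h ** 2)
        grad_W1 = np.outer(x, dh)
        grad_b1 = dh
        self.W2 -= self.lr * grad_W2
        self.b2 -= self.lr * grad_b2
        self.W1 -= self.lr * grad_W1
        self.b1 -= self.lr * grad_b1

# Hypothetical usage: push the value estimate of one state toward 1.0.
vnet = ValueNetwork(n_features=10)
vnet.update(np.random.rand(10), target=1.0)
```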
4.1 Poker Game 
Poker is a game of imperfect information in which 
players have only partial knowledge about the 
current state of the game (Johanson, 2007). Poker 
involves betting and individual play, and the winner 
is determined by the rank and combination of cards. 
Poker has many variations; in the experiments and data 
analyses the author uses the Texas hold’em version of 
the game. Texas hold’em combines two cards dealt to 
each player with five community (table) cards. Texas hold’em is an 
extremely complicated form of poker. This is 
because the exact manner in which a hand should be 
played is often debatable. It is not uncommon to 
hear two expert players argue the pros and cons of a 
certain strategy (Sklansky, Malmuth, 1999). 
A poker game consists of four phases: pre-flop, flop, 
turn, and river. In the first phase (pre-flop) two cards 
are dealt to every player. In the second phase 
(flop) three table cards are revealed. In the next phase 
(turn) a fourth table card is revealed, and finally in the last 
phase (river) the fifth table card is revealed and the winner is 
determined. The winner is the player with the 
strongest five-card combination. The possible card 
combinations are (starting from the highest rank): 
Royal flush, Straight flush, Four of a kind, Full 
house, Flush, Straight, Three of a kind, Two pair, 
One pair, High card. 
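
The betting phases and hand ranks above could be encoded, for instance, as simple enumerations; the following Python sketch is illustrative only, and the class and member names are assumptions chosen for this example rather than part of any cited implementation:

```python
from enum import IntEnum

class Phase(IntEnum):
    """The four betting rounds of a Texas hold'em hand."""
    PRE_FLOP = 0   # two hole cards dealt to each player
    FLOP = 1       # three community cards revealed
    TURN = 2       # fourth community card revealed
    RIVER = 3      # fifth community card revealed, then showdown

class HandRank(IntEnum):
    """Five-card hand categories; a higher value means a stronger hand."""
    HIGH_CARD = 0
    ONE_PAIR = 1
    TWO_PAIR = 2
    THREE_OF_A_KIND = 3
    STRAIGHT = 4
    FLUSH = 5
    FULL_HOUSE = 6
    FOUR_OF_A_KIND = 7
    STRAIGHT_FLUSH = 8
    ROYAL_FLUSH = 9
```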
4.2 Partially Observable Markov Decision Process 
A Markov decision process can be described as a tuple 
(S, A, P, R), where 
  S is a set of states of the world; 
  A is a set of actions; 
  P: S × S × A → [0, 1] specifies the dynamics. 
This is written P(s'|s,a), where 
∀s ∈ S, ∀a ∈ A: ∑_{s'∈S} P(s'|s,a) = 1. 
In particular, P(s'|s,a) specifies the probability of 
transitioning to state s' given that the agent is in 
state s and does action a; 
  R: S × A × S → ℝ, where R(s,a,s') gives the 
expected immediate reward from doing action 
a and transitioning to state s' from state s 
(Poole and Mackworth, 2010). 
 
 
Figure 1: Decision network representing a finite part of an 
MDP (Poole and Mackworth, 2010). 
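
As a minimal, hypothetical illustration of this tuple in code (the state names, action names, and numbers below are invented for the example), the dynamics P and reward R can be stored as nested dictionaries and checked against the constraint ∑_{s'∈S} P(s'|s,a) = 1:

```python
import random

# Transition probabilities P(s' | s, a): P[s][a] maps each successor state s' to its probability.
P = {
    "s0": {"a0": {"s0": 0.2, "s1": 0.8},
           "a1": {"s0": 0.9, "s1": 0.1}},
    "s1": {"a0": {"s0": 0.5, "s1": 0.5},
           "a1": {"s0": 0.0, "s1": 1.0}},
}

# Expected immediate reward R(s, a, s'); unlisted triples default to 0.
R = {
    ("s0", "a0", "s1"): 1.0,
    ("s1", "a1", "s1"): 5.0,
}

def check_dynamics(P):
    """Verify that for every (s, a) the successor probabilities sum to 1."""
    for s, actions in P.items():
        for a, dist in actions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

def step(s, a):
    """Sample s' ~ P(. | s, a) and return (s', immediate reward)."""
    dist = P[s][a]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R.get((s, a, s_next), 0.0)

check_dynamics(P)
print(step("s0", "a0"))
```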
A partially observable Markov decision process 
(POMDP) is a formalism for representing decision 
problems for agents that must act under uncertainty 
(Sandberg, Lo, Fancourt, Principe, Katagiri, Haykin, 2001). 
A POMDP can be formally described as a tuple 
(S, A, T, R, Ω, O), where 
  S - finite set of states of the environment; 
  A - finite set of actions; 
  T: S × A →  ∆(S) - state-transition function, 
giving a distribution over states of the 
environment, given a starting state and an 
action performed by the agent; 
  R: S × A → ℝ - the reward function, giving a 
real-valued expected immediate reward, given 
a starting state and an action performed by the 
agent; 
  Ω - finite set of observations the agent can 
experience; 
  O: S × A → ∆(Ω) - the observation function, 
giving a distribution over possible 
observations, given a starting state and an 
action performed by the agent. 
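
Although not spelled out in this section, the standard way an agent uses T and O is to maintain a belief state b, a probability distribution over S, and update it after every action a and observation o. The sketch below assumes the common convention in which the observation probability depends on the action and the resulting state, written O(s', a, o); the function and argument names are illustrative only:

```python
def belief_update(b, a, o, states, T, O):
    """One step of the standard POMDP belief update:
        b'(s') ∝ O(s', a, o) * sum_s T(s, a, s') * b(s)
    where
        b            - dict mapping states to probabilities (current belief),
        T(s, a, s')  - transition probability to s' from s under action a,
        O(s', a, o)  - probability of observing o after action a lands in s'.
    """
    b_new = {}
    for s_next in states:
        predicted = sum(T(s, a, s_next) * b[s] for s in states)
        b_new[s_next] = O(s_next, a, o) * predicted
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation o has zero probability under belief b")
    return {s: p / norm for s, p in b_new.items()}
```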