Example 4.1 Consider the $4 \times 4$ gridworld shown below.
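[Figure: gridworld with states numbered row by row; T marks the shaded terminal state, drawn in the top-left and bottom-right corners.

     T    1    2    3
     4    5    6    7
     8    9   10   11
    12   13   14    T

Actions: up, down, right, left. $R_t = -1$ on all transitions.]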
The nonterminal states are $\mathcal{S} = \{1, 2, \dots, 14\}$. There are four actions possible in each
state, $\mathcal{A} = \{\text{up, down, right, left}\}$, which deterministically cause the corresponding
state transitions, except that actions that would take the agent off the grid in fact leave
the state unchanged. Thus, for instance, $p(6, -1 \mid 5, \text{right}) = 1$, $p(7, -1 \mid 7, \text{right}) = 1$,
and $p(10, r \mid 5, \text{right}) = 0$ for all $r \in \mathcal{R}$. This is an undiscounted, episodic task. The
reward is $-1$ on all transitions until the terminal state is reached. The terminal state is
shaded in the figure (although it is shown in two places, it is formally one state). The
expected reward function is thus $r(s, a, s') = -1$ for all states $s, s'$ and actions $a$. Suppose
the agent follows the equiprobable random policy (all actions equally likely). The left side
of Figure 4.1 shows the sequence of value functions $\{v_k\}$ computed by iterative policy
evaluation. The final estimate is in fact $v_\pi$, which in this case gives for each state the
negation of the expected number of steps from that state until termination.
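The computation described in this example can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code. It makes several assumptions: cells are indexed 0 to 15 row by row, so that cells 1 to 14 coincide with the state numbers above and cells 0 and 15 both represent the single terminal state; the names `step` and `evaluate_random_policy` and the stopping threshold `theta` are hypothetical; and each sweep uses a separate copy of the value array rather than updating in place.

```python
# Sketch of iterative policy evaluation for the 4x4 gridworld of Example 4.1.
# Assumption: cells 0..15 are indexed row by row; cells 0 and 15 both stand
# for the one terminal state, and cells 1..14 match the state numbers in the text.

N = 4                                  # grid side length
TERMINAL = {0, 15}                     # two drawings of the single terminal state
ACTIONS = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1)}


def step(state, action):
    """Deterministic dynamics: return (next_state, reward)."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    # A move that would leave the grid keeps the state unchanged.
    if not (0 <= new_row < N and 0 <= new_col < N):
        new_row, new_col = row, col
    return new_row * N + new_col, -1   # reward is -1 on every transition


def evaluate_random_policy(theta=1e-6, gamma=1.0):
    """Repeat expected updates until the largest change in a sweep is below theta."""
    v = [0.0] * (N * N)                # v_0 is zero everywhere
    while True:
        new_v = list(v)
        for s in range(N * N):
            if s in TERMINAL:
                continue               # the terminal value stays at 0
            # Equiprobable random policy: each of the four actions has weight 1/4.
            new_v[s] = sum(
                0.25 * (r + gamma * v[s_next])
                for s_next, r in (step(s, a) for a in ACTIONS)
            )
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if delta < theta:
            return v


if __name__ == "__main__":
    values = evaluate_random_policy()
    for row in range(N):
        print(" ".join(f"{values[row * N + col]:7.2f}" for col in range(N)))
```

Under these assumptions the printed grid should approach the final estimate discussed above; for instance, the first row comes out close to 0, -14, -20, -22. Calling `step(5, "right")` returns `(6, -1)` and `step(7, "right")` returns `(7, -1)`, matching the two transition probabilities quoted in the example.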
Exercise 4.1 In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_\pi(11, \text{down})$?
What is $q_\pi(7, \text{down})$?
Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below
state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14,
and 15, respectively. Assume that the transitions from the original states are unchanged.
What, then, is $v_\pi(15)$ for the equiprobable random policy? Now suppose the dynamics of
state 13 are also changed, such that action down from state 13 takes the agent to the new
state 15. What is $v_\pi(15)$ for the equiprobable random policy in this case?
Exercise 4.3 What are the equations analogous to (4.3), (4.4), and (4.5) for the action-
value function $q_\pi$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \dots$?