(3) Markov Decision Processes (Value Iteration
and/or Policy Iteration)
0.3, 1
A
0.4,3
0.7,-1
C
0.2, 5
B
Action P
0.8,1
0.6,0
A
0.1,-1
0.9,-1
0.8, 4
C
0.6, 3
Action Q
0.4,-1
0.2, 2
B
To avoid cluttering the diagram, I have drawn the arrows of
Action P and Action Q on separate diagrams (in the assignments
we drew the green and black arrows on the same figure). The
probabilities and the rewards are given with each of the arrows.
(a) Take the discount rate, and let the initial values of all the
three states equal 0. Perform one iteration of value
iteration, and update the values of the three states.
(b) Take the initial policy as action P from state A, and action
Q from states B and C. Take the same discount rate, and
initial values as part (a). Perform one iteration of policy
iteration \textemdash that is, compute three equations in the three
unknowns under this policy. (don't solve the equations. You
don't even have to simplify the equations).
(c) Assume that you got the values:, and (I am just making
these values up, and didn't solve the above equations).
Assuming these values, compute the next policy.
(To compute the next policy, you need to compute the
best action from each of the states A, B and C, based
on the values computed in part (b). To save time, just
compute the best action from state A only).