Markov Decision Process - Value Iteration
Consider an infinite-horizon Markov Decision Process with three states S1, S2, and S3 and two actions a1 and a2. The reward function is R(S1, a1) = R(S2, a1) = 1, R(S1, a2) = R(S2, a2) = -1, and R(S3, a1) = R(S3, a2) = 3. The state transition probabilities under actions a1 and a2 are given by the matrices below, respectively.
a1:
     S1   S2   S3
S1  0.8  0.9  0.2
S2  0.1  0.0  0.7
S3  0.1  0.1  0.1
a2:
     S1   S2   S3
S1  0.1  0.9  0.7
S2  0.8  0.1  0.2
S3  0.1  0.0  0.1
(1) Starting with a value function that is 0 for all states, perform 10 iterations of value iteration with discount factor γ = 0.9. Repeat with γ = 0.1. You can work it out either by hand using a table or by computer.
(2) Compare the values of the states after 10 iterations with those in Problem 1. Are the values of the states always no lower? Why or why not?
(3) Does value iteration converge faster in one of the two cases? If so, why?
(4) What is the optimal policy?
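For the "by computer" route, the following is a minimal sketch rather than a reference solution. It applies the standard Bellman backup V_{k+1}(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) V_k(s') ], where gamma stands for the discount factor. It rests on one assumption about the tables: since the columns (not the rows) of each matrix sum to 1, column j is read as the current state and row i as the next state. The names P, R, ACTIONS, q_values, and value_iteration are illustrative and do not come from the problem statement.

```python
import numpy as np

# Transition matrices copied from the problem statement.
# ASSUMPTION: P[a][i, j] = Pr(next state = S(i+1) | current state = S(j+1), action a),
# chosen because each column of the printed matrices sums to 1.
P = {
    "a1": np.array([[0.8, 0.9, 0.2],
                    [0.1, 0.0, 0.7],
                    [0.1, 0.1, 0.1]]),
    "a2": np.array([[0.1, 0.9, 0.7],
                    [0.8, 0.1, 0.2],
                    [0.1, 0.0, 0.1]]),
}
# R[a][j] is the reward for taking action a in state S(j+1).
R = {
    "a1": np.array([1.0, 1.0, 3.0]),
    "a2": np.array([-1.0, -1.0, 3.0]),
}
ACTIONS = ("a1", "a2")

# Sanity check for the column convention assumed above.
for a in ACTIONS:
    assert np.allclose(P[a].sum(axis=0), 1.0)


def q_values(V, gamma):
    """Return a (num_actions, num_states) array of one-step lookahead values."""
    # P[a].T @ V gives, for each current state, the expected value of the next state.
    return np.stack([R[a] + gamma * (P[a].T @ V) for a in ACTIONS])


def value_iteration(gamma, n_iters=10):
    """Run n_iters Bellman backups starting from V = 0.

    Returns the final values, the greedy action per state with respect to
    those values, and the list of value vectors after every iteration.
    """
    V = np.zeros(3)
    history = [V.copy()]
    for _ in range(n_iters):
        V = q_values(V, gamma).max(axis=0)
        history.append(V.copy())
    greedy = [ACTIONS[i] for i in q_values(V, gamma).argmax(axis=0)]
    return V, greedy, history


if __name__ == "__main__":
    for gamma in (0.9, 0.1):
        V, greedy, history = value_iteration(gamma)
        # Size of the last update, a rough indicator of how close V is to a fixed point.
        last_change = np.max(np.abs(history[-1] - history[-2]))
        print(f"gamma={gamma}: V={np.round(V, 4)}, greedy={greedy}, "
              f"last change={last_change:.2e}")
```

Running the script prints the values after 10 iterations, the greedy action in each state, and the size of the final update for both discount factors, which is the information parts (1), (3), and (4) ask about. The greedy policy after only 10 iterations is not guaranteed to be optimal, and np.argmax breaks exact ties in favour of a1, so it is worth checking the result against a longer run.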