Consider the n-state MDP in the figure below. In state n, there is just one action, which collects a reward of +10 and terminates the episode. In all the other states, there are two actions: float, which moves deterministically one step to the right, and reset, which deterministically goes back to state 1. There is a reward of +1 for float and 0 for reset. The discount factor is γ = 1/2.
(i) What is the optimal policy?
(ii) What is the optimal value of state n, V*(n)?
(iii) Compute the optimal value function, V*(k), for all k = 1, ..., n-1.
(iv) Suppose you are doing value iteration to figure out these values. You start with all value estimates equal to 0. Show all the non-zero values after 1 and 2 iterations, respectively.
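
To check hand computations for part (iv), the Bellman backup for this chain can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `value_iteration` and its parameters are hypothetical, updates are assumed to be synchronous (every state is updated from the previous iteration's estimates), and the value after termination in state n is taken to be 0.

```python
def value_iteration(n, gamma=0.5, iterations=1):
    """Run synchronous value iteration on the n-state chain.

    Returns a list [V(1), ..., V(n)] after the given number of iterations,
    starting from all-zero estimates. Assumes n >= 2.
    """
    # Index 0 is unused so that v[k] is the estimate for state k.
    v = [0.0] * (n + 1)
    for _ in range(iterations):
        new_v = v[:]
        for k in range(1, n):
            # float: reward +1, move deterministically to state k + 1
            float_value = 1.0 + gamma * v[k + 1]
            # reset: reward 0, move deterministically back to state 1
            reset_value = 0.0 + gamma * v[1]
            new_v[k] = max(float_value, reset_value)
        # State n: single action, reward +10, then the episode ends
        # (assumed value 0 after termination).
        new_v[n] = 10.0
        v = new_v
    return v[1:]


if __name__ == "__main__":
    # Example with a small chain; n = 4 is an arbitrary choice.
    for t in (1, 2):
        print(t, value_iteration(4, iterations=t))
```

Running the sketch with a larger `iterations` value gives a numerical check for parts (ii) and (iii) as well. An in-place (asynchronous) update order could produce different intermediate values after 1 and 2 iterations.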