Consider the following deterministic MDP (deterministic meaning that each action has a certain outcome: if we are in state 1 and take action a12, the probability of reaching state 2 is 1). Each state is labeled with the reward the agent receives for that state. If we use a discount factor (gamma, γ) of 0.1, what is the optimal policy (pi*, π*)?
Write your answer in English. For each state, there is only one action in the optimal policy, so just give a list of actions. For instance: a01, a12, a21, a43, a31.
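[Figure: state-transition diagram of the MDP.
States (reward): S0 (1), S1 (-2), S2 (2), S3 (1), S4 (1).
Actions, where aij moves from state i to state j: a01, a02, a03, a04, a12, a13, a20, a21, a24, a31, a34, a43.]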
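
One way to check an answer is to run value iteration on the graph. The following is a minimal sketch, not an official solution. It rests on two assumptions that the question does not state explicitly: that each action aij moves the agent from state i to state j (read off the action labels), and that the agent collects the reward written in the state it enters. The names REWARDS, ACTIONS, value_iteration and greedy_policy are made up for this illustration.

```python
# Assumed reading of the diagram: action "aij" moves from state i to state j,
# and the agent collects the reward written in the state it enters.
REWARDS = {0: 1.0, 1: -2.0, 2: 2.0, 3: 1.0, 4: 1.0}

# For each state, the states reachable by one action (taken from the labels).
ACTIONS = {
    0: [1, 2, 3, 4],  # a01, a02, a03, a04
    1: [2, 3],        # a12, a13
    2: [0, 1, 4],     # a20, a21, a24
    3: [1, 4],        # a31, a34
    4: [3],           # a43
}

GAMMA = 0.1


def value_iteration(rewards, actions, gamma, tol=1e-12):
    """Iterate V(s) = max_t [r(t) + gamma * V(t)] until the values stop changing."""
    values = {s: 0.0 for s in rewards}
    while True:
        new_values = {
            s: max(rewards[t] + gamma * values[t] for t in actions[s])
            for s in actions
        }
        delta = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        if delta < tol:
            return values


def greedy_policy(values, rewards, actions, gamma):
    """For each state, pick the successor with the best one-step lookahead value."""
    return {
        s: max(actions[s], key=lambda t: rewards[t] + gamma * values[t])
        for s in actions
    }


values = value_iteration(REWARDS, ACTIONS, GAMMA)
policy = greedy_policy(values, REWARDS, ACTIONS, GAMMA)
print(", ".join(f"a{s}{t}" for s, t in sorted(policy.items())))
```

Under these assumptions the sketch prints a02, a12, a20, a34, a43. With γ = 0.1 future rewards are discounted so heavily that the policy is almost greedy: each state mainly prefers the neighbor with the largest immediate reward, and the discounted term only breaks the tie between a20 and a24 in state 2.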