Question

[Figure: deterministic MDP graph. States and rewards: s_0 (1), s_1 (-2), s_2 (2), s_3 (1), s_4 (1). Action labels: a01, a02, a03, a04, a12, a13, a20, a21, a24, a31, a34, a43.]

Added by Evan H.

Computer Science and Information Technology

Trishna Knowledge Systems 2018 Edition

Instant Answer

Solved by Expert Jennifer Hudspeth

Step 1

… 1 * max(V(S0), V(S3))
V(S2) = 2 + 0.1 * max(V(S1), V(S4))
V(S3) = 1 + 0.1 * max(V(S1), V(T))
V(S4) = 1 + 0.1 * V(S2)
V(T) = 0
…
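
The non-terminal equations in this step all share one shape. As a reading aid, here is a sketch of that general form, assuming the reward R(s) is collected in the current state and each action a leads deterministically to one successor state. The name next(s, a) is a hypothetical helper introduced for illustration and does not come from the source.

```latex
V(s) = R(s) + \gamma \max_{a \in A(s)} V\bigl(\mathrm{next}(s, a)\bigr), \qquad \gamma = 0.1
```

Under this form, the optimal policy picks, in each state, the action whose successor state has the largest value.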


Consider the following deterministic MDP (deterministic meaning: if we are in state 1 and take action a12, then the probability of getting to state 2 is 1). Each state contains the reward of the agent for that particular state. If we use a discount factor (gamma, γ) of 0.1, what will the optimal policy (pi*, π*) be? Write your answer in English. For each state, there is only one action in the optimal policy, so just give a list of actions. For instance: a01, a12, a21, a43, a31.
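
One way to check an answer to this question is to run value iteration on the graph. The sketch below is not the expert's method and rests on assumptions that are not confirmed by the source: an action label aij is read as a deterministic move from state i to state j (following the a12 example in the question), the available transitions are exactly the twelve action labels in the figure, no terminal state is modelled, and the reward R(s) is collected in the current state. The names GAMMA, REWARD, ACTIONS, successors, value_iteration and greedy_policy are illustrative.

```python
# Assumption: action "aij" moves deterministically from state i to state j,
# and the reward R(s) is collected in the current state s.
GAMMA = 0.1
REWARD = {0: 1.0, 1: -2.0, 2: 2.0, 3: 1.0, 4: 1.0}
# Assumption: these twelve labels from the figure are the only transitions.
ACTIONS = ["a01", "a02", "a03", "a04", "a12", "a13",
           "a20", "a21", "a24", "a31", "a34", "a43"]


def successors(state):
    """Return {action: next_state} for every action that starts in `state`."""
    return {a: int(a[2]) for a in ACTIONS if int(a[1]) == state}


def value_iteration(tol=1e-12):
    """Iterate V(s) = R(s) + GAMMA * max over successors until values settle."""
    values = {s: 0.0 for s in REWARD}
    while True:
        new_values = {
            s: REWARD[s] + GAMMA * max(values[t] for t in successors(s).values())
            for s in REWARD
        }
        delta = max(abs(new_values[s] - values[s]) for s in REWARD)
        values = new_values
        if delta < tol:
            return values


def greedy_policy(values):
    """For each state, pick the action whose successor has the largest value."""
    policy = {}
    for s in REWARD:
        succ = successors(s)
        policy[s] = max(succ, key=lambda a: values[succ[a]])
    return policy


values = value_iteration()
policy = greedy_policy(values)
print([policy[s] for s in sorted(policy)])
```

Under these assumptions the sketch prints `['a02', 'a12', 'a20', 'a34', 'a43']`, with values of roughly 1.21, -1.79, 2.12, 1.11 and 1.11 for states 0 through 4. If the figure contains transitions or a terminal state that the action labels do not capture, the result may differ.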