Please provide the complete solution, thank you.
Let us consider an MDP with a fixed start state $s_0$.
Let us consider the conservative policy update rule:
$$\pi_{\text{new}}(s,a) = (1-\alpha)\,\pi(s,a) + \alpha\,\pi'(s,a)$$
for some $\alpha \in [0,1]$.
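As an illustration of the update rule, here is a minimal Python sketch. It assumes tabular policies stored as nested lists, where `pi[s][a]` is the probability of action `a` in state `s`. The function name `conservative_update` and the example numbers are hypothetical and not part of the problem.

```python
def conservative_update(pi, pi_prime, alpha):
    """Return the mixture (1 - alpha) * pi + alpha * pi_prime, entry by entry.

    pi and pi_prime are tabular policies: pi[s][a] is the probability of
    action a in state s (an assumed representation for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1.0 - alpha) * p + alpha * q for p, q in zip(row, row_prime)]
        for row, row_prime in zip(pi, pi_prime)
    ]


# Two states, two actions; the numbers are made up for illustration.
pi = [[0.5, 0.5], [0.9, 0.1]]
pi_prime = [[1.0, 0.0], [0.0, 1.0]]
pi_new = conservative_update(pi, pi_prime, alpha=0.25)
```

Each row of `pi_new` is a convex combination of two probability vectors, so it is again a valid distribution over actions.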
(a) [10 points] What is $\pi_{\text{new}}(s,a)$ when $\alpha = 1$?
3. [60 points] Conservative Policy Iteration
(b) [10 points] Recall that $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. Let $P(s_t; \pi)$ be the distribution over states at time $t$ while following $\pi$ from the start state $s_0$. Recall that the discounted stationary state distribution of a policy $\pi$ is
$$d^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t P(s_t = s; \pi).$$
We now define the policy advantage of some policy $\pi'$ with respect to a policy $\pi$ as
$$A_\pi(\pi') = \mathbb{E}_{s \sim d^\pi}\left[\mathbb{E}_{a \sim \pi'}\left[A^\pi(s,a)\right]\right].$$
Recall that for policies $\pi'$ and $\pi$, we have that
$$V^{\pi'}(s_0) - V^\pi(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi'}}\left[\mathbb{E}_{a \sim \pi'}\left[A^\pi(s,a)\right]\right].$$
How does $V^{\pi'}(s_0) - V^\pi(s_0)$ differ from the policy advantage $A_\pi(\pi')$? A high-level description in words will suffice.
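To make the definition of the discounted state distribution concrete, the following Python sketch approximates it by truncating the infinite sum. It assumes a finite state space and a state-to-state transition matrix `P_pi` already induced by the policy; the function name, the `horizon` parameter, and the two-state chain are hypothetical choices for illustration only.

```python
def discounted_state_distribution(P_pi, s0, gamma, horizon=1000):
    """Approximate d^pi(s) = (1 - gamma) * sum_t gamma^t * P(s_t = s; pi)
    by truncating the sum after `horizon` terms.

    P_pi[s][s2] is the probability of moving from s to s2 under pi
    (an assumed representation for this sketch).
    """
    n = len(P_pi)
    p_t = [0.0] * n
    p_t[s0] = 1.0  # fixed start state: P(s_0 = s0) = 1
    d = [0.0] * n
    weight = 1.0 - gamma  # equals (1 - gamma) * gamma^t at t = 0
    for _ in range(horizon):
        for s in range(n):
            d[s] += weight * p_t[s]
        # One step forward: P(s_{t+1} = s2) = sum_s P(s_t = s) * P_pi[s][s2]
        p_t = [sum(p_t[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return d


# Made-up two-state chain: state 1 is absorbing.
P_pi = [[0.9, 0.1], [0.0, 1.0]]
d = discounted_state_distribution(P_pi, s0=0, gamma=0.9)
```

The factor $(1-\gamma)$ normalizes the weights $\gamma^t$ so that the entries of `d` sum to one, up to the truncation error of order $\gamma^{\text{horizon}}$.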
(c) [10 points] Compute a simplified expression for $A_\pi(\pi_{\text{new}})$ in terms of the policy advantage of $\pi'$.
(d) [10 points] With $\pi_{\text{new}}$, at any given timestep, the probability that we select an action according to $\pi'$ is $\alpha$. Let us define the random variable $c_t$ as the number of actions chosen from $\pi'$ before time $t$. Let us denote $\rho_t = \Pr(c_t \geq 1)$. Compute an expression for $\rho_t$ in terms of $\alpha$ and $t$.
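A closed-form answer to part (d) can be sanity-checked by simulation. The sketch below estimates the probability that at least one action before time $t$ was drawn from $\pi'$. It assumes this is the event meant in part (d), that "before time $t$" means timesteps $0, \dots, t-1$, and that the choice between the two policies is made independently at each timestep. The function name `estimate_rho` and its parameters are hypothetical.

```python
import random


def estimate_rho(alpha, t, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr(c_t >= 1) under the mixture policy.

    Assumption: at each of the t timesteps 0, ..., t - 1 the action is
    taken from pi' with probability alpha, independently of other steps.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # c_t >= 1 exactly when at least one of the t draws picks pi'.
        if any(rng.random() < alpha for _ in range(t)):
            hits += 1
    return hits / trials


for t in (0, 1, 5, 20):
    print(t, estimate_rho(alpha=0.1, t=t))
```

Comparing these estimates with a candidate formula for several values of $\alpha$ and $t$ gives a quick check before writing up the derivation.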
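(e) [20 points] Now let $\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(s)}\left[A^\pi(s,a)\right] \right|$. Prove that:
$$\mathbb{E}_{s \sim P(s_t; \pi_{\text{new}})}\left[\sum_a \pi_{\text{new}}(s,a)\, A^\pi(s,a)\right] \geq \alpha\, \mathbb{E}_{s \sim P(s_t; \pi)}\left[\sum_a \pi'(s,a)\, A^\pi(s,a)\right] - 2\alpha\epsilon\rho_t$$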