Q3. Model-based RL (10 points)
An agent lives in a grid world as shown in the figure below. The agent tries out a policy π which is indicated by the arrows in the figure. After four trials, the agent observes four episodes.
Input Policy π
Observed Episodes (Training)
Episode 1
A, south, C, -1
C, south, E, -1
E, exit, x, +10
Episode 2
B, east, C, -1
C, south, D, -1
D, exit, x, -10
Episode 3
B, east, C, -1
C, south, E, -1
E, exit, x, +10
Episode 4
A, south, C, -1
C, south, E, -1
E, exit, x, +10
What model (transition function and reward function) would be learned from the observed episodes above?
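One way to approach this is the usual count-based (maximum-likelihood) estimate: for each observed (s, a), normalize the counts of each outcome s' to get the transition estimate, and average the observed rewards for each (s, a, s') to get the reward estimate. The sketch below assumes that estimator; the names `estimate_model`, `T_hat`, and `R_hat` are illustrative and not part of the question.

```python
from collections import Counter, defaultdict

# Observed episodes, transcribed from the question as (s, a, s', r) tuples.
episodes = [
    [("A", "south", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "south", "D", -1), ("D", "exit", "x", -10)],
    [("B", "east", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
    [("A", "south", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
]


def estimate_model(episodes):
    """Count-based estimate of T(s, a, s') and R(s, a, s') (assumed estimator)."""
    sa_counts = Counter()             # times each (s, a) was tried
    sas_counts = Counter()            # times each (s, a, s') was observed
    reward_sums = defaultdict(float)  # total reward seen for each (s, a, s')

    for episode in episodes:
        for s, a, s_next, r in episode:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s_next)] += 1
            reward_sums[(s, a, s_next)] += r

    # T_hat: fraction of times (s, a) led to s'.
    T_hat = {k: n / sa_counts[k[:2]] for k, n in sas_counts.items()}
    # R_hat: average reward observed on each (s, a, s') transition.
    R_hat = {k: reward_sums[k] / n for k, n in sas_counts.items()}
    return T_hat, R_hat


T_hat, R_hat = estimate_model(episodes)
for key in sorted(T_hat):
    print(key, "T =", T_hat[key], "R =", R_hat[key])
```

Under this estimator the only stochastic entry is (C, south): it led to E in three of four observations and to D in one. Every other observed (s, a) pair always led to the same successor.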