Question

Q3. Model-based RL (10 points)
An agent lives in a grid world as shown in the figure below. The agent tries out a policy π, which is indicated by the arrows in the figure. After four trials, the agent observes four episodes.
Input Policy π
Observed Episodes (Training)
Episode 1
A, south, C, -1
C, south, E, -1
E, exit, x, +10
Episode 2
B, east, C, -1
C, south, D, -1
D, exit, x, -10
Episode 3
B, east, C, -1
C, south, E, -1
E, exit, x, +10
Episode 4
A, south, C, -1
C, south, E, -1
E, exit, x, +10
What model would be learned from the above observed episodes (transition/reward functions)?
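
One common way to approach this is to estimate the model by counting: the transition estimate for (s, a, s') is the fraction of times s' followed (s, a), and the reward estimate is the average reward seen on that transition. The sketch below assumes that maximum-likelihood counting scheme and treats "x" as the terminal state label used in the episodes. The names learn_model, T_hat, and R_hat are illustrative, not part of the original question.

```python
from collections import Counter, defaultdict

# Observed (state, action, next_state, reward) samples, copied from the four episodes.
episodes = [
    [("A", "south", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "south", "D", -1), ("D", "exit", "x", -10)],
    [("B", "east", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
    [("A", "south", "C", -1), ("C", "south", "E", -1), ("E", "exit", "x", 10)],
]


def learn_model(episodes):
    """Estimate T(s, a, s') and R(s, a, s') by counting (assumed scheme)."""
    sa_counts = Counter()             # times (s, a) was tried
    sas_counts = Counter()            # times (s, a) led to s'
    reward_sums = defaultdict(float)  # total reward seen on (s, a, s')
    for episode in episodes:
        for s, a, s_next, r in episode:
            sa_counts[(s, a)] += 1
            sas_counts[(s, a, s_next)] += 1
            reward_sums[(s, a, s_next)] += r
    # Normalize counts into probabilities and average the rewards.
    T = {k: n / sa_counts[k[:2]] for k, n in sas_counts.items()}
    R = {k: reward_sums[k] / n for k, n in sas_counts.items()}
    return T, R


T_hat, R_hat = learn_model(episodes)
for key in sorted(T_hat):
    print(key, "T =", T_hat[key], "R =", R_hat[key])
```

Under these assumptions, the only stochastic transition in the data is from C: T(C, south, E) = 3/4 and T(C, south, D) = 1/4. Every other observed transition has an estimated probability of 1, with rewards of -1 for the moves, +10 for exiting from E, and -10 for exiting from D.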


Added by Eric D.
