Consider the following gridworld:
[Figure: gridworld containing the states s1, s2, s3, s4 and one cell marked 10]
Objective: Use the Value Iteration Algorithm to calculate the values for the states over 4 iterations and determine the optimal policy based on your calculations.
Scenario:
When the agent chooses a direction, it moves in that direction with probability 1/3. Otherwise, it moves in one of the two perpendicular directions, each with probability 1/3.
For example, if the action is to move left, then:
• P(move left) = 1/3
• P(move down) = 1/3
• P(move up) = 1/3
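The slip model above can be written down as a small lookup. This is only a sketch: the direction names and the helper `slip_distribution` are illustrative assumptions, not part of the assignment, and exact fractions are used so the probabilities sum to exactly 1.

```python
from fractions import Fraction

# Perpendicular directions for each intended move (names are illustrative).
PERPENDICULAR = {
    "left": ("up", "down"),
    "right": ("up", "down"),
    "up": ("left", "right"),
    "down": ("left", "right"),
}


def slip_distribution(action):
    """Return {actual_direction: probability} for an intended action.

    The intended direction and each of the two perpendicular
    directions all receive probability 1/3, as stated in the scenario.
    """
    third = Fraction(1, 3)
    side_a, side_b = PERPENDICULAR[action]
    return {action: third, side_a: third, side_b: third}


# Reproduce the "move left" example from the text.
for direction, prob in slip_distribution("left").items():
    print(f"P(move {direction}) = {prob}")
```

Note that the opposite direction (here, right) never occurs under this model.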
Reward Structure:
- The immediate reward for moving in any direction is -1.
Tasks:
1. Value Iteration: Perform value iteration for 4 iterations to calculate the value of each state.
2. Optimal Policy: Based on your value calculations, derive the optimal policy for each state.
Guidelines for Value Iteration:
- Initialization: Start with an initial value function V(s) for all states s.
- Update Rule: Update the value of each state V(s) using the Bellman equation:
  V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V(s') ]
- \gamma is the discount factor (assume \gamma = 1 for this assignment).
- Iteration Process: Repeat the update rule for 4 iterations.
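The update rule and iteration process can be sketched in a few lines of Python. Several things here are assumptions rather than part of the assignment: the function name `value_iteration`, the representation of the model as `transitions[(s, a)]` holding `(probability, next_state, reward)` triples, the choice V(s) = 0 as the initial value function, and the convention that a state with no listed actions is terminal. The two-state model used to exercise it is a made-up toy example, not the gridworld from the figure.

```python
def value_iteration(states, actions, transitions, gamma=1.0, n_iters=4):
    """Synchronous value iteration.

    transitions[(s, a)] is a list of (prob, next_state, reward) triples.
    A state with no entry for any action is treated as terminal and
    keeps value 0 (an assumption of this sketch).
    Returns the list of value tables [V0, V1, ..., Vn].
    """
    V = {s: 0.0 for s in states}  # assumed initialization: all zeros
    history = [dict(V)]
    for _ in range(n_iters):
        new_V = {}
        for s in states:
            # One backup per available action: sum over s' of P * (R + gamma * V(s')).
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
                for a in actions
                if (s, a) in transitions
            ]
            new_V[s] = max(q_values) if q_values else 0.0
        V = new_V
        history.append(dict(V))
    return history


# Hypothetical toy model: "go" reaches terminal state T half the time.
toy_states = ["A", "T"]
toy_actions = ["stay", "go"]
toy_transitions = {
    ("A", "stay"): [(1.0, "A", -1.0)],
    ("A", "go"): [(0.5, "T", 9.0), (0.5, "A", -1.0)],
}

history = value_iteration(toy_states, toy_actions, toy_transitions, gamma=1.0, n_iters=4)
for k, table in enumerate(history):
    print(f"iteration {k}: {table}")
```

Each iteration builds the new table from the previous one only, so the update order of the states does not affect the result.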
Guidelines for Optimal Policy:
- Policy Derivation: After completing the value iteration, determine the optimal policy \pi(s) for each state s by choosing the action a that maximizes the expected value:
  \pi(s) \leftarrow \arg\max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V(s') ]
where:
- P^a_{ss'} is the transition probability from state s to state s' under action a.
- R^a_{ss'} is the immediate reward for that transition.
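Policy derivation is a single greedy pass over the final value table. The sketch below assumes a model stored as `transitions[(s, a)]` with `(probability, next_state, reward)` triples; the function name `greedy_policy`, the toy model, and the value table passed in are all hypothetical and are not the assignment's gridworld.

```python
def greedy_policy(states, actions, transitions, V, gamma=1.0):
    """Return {state: best_action} for every state that has at least one action.

    The best action maximizes sum over s' of P * (R + gamma * V(s')).
    Ties go to the action listed first in `actions`.
    """
    policy = {}
    for s in states:
        best_action, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in transitions:
                continue  # action not available in this state
            q = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
            if q > best_q:
                best_action, best_q = a, q
        if best_action is not None:
            policy[s] = best_action
    return policy


# Hypothetical toy model and value table, for illustration only.
demo_states = ["A", "T"]
demo_actions = ["stay", "go"]
demo_transitions = {
    ("A", "stay"): [(1.0, "A", -1.0)],
    ("A", "go"): [(0.5, "T", 9.0), (0.5, "A", -1.0)],
}
demo_values = {"A": 7.5, "T": 0.0}

policy = greedy_policy(demo_states, demo_actions, demo_transitions, demo_values)
print(policy)
```

Terminal states are left out of the returned policy because no action is available there.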
Submission:
- Calculation Details: Show your calculations for the value of each state for all 4 iterations.
- Optimal Policy: Clearly indicate the optimal policy for each state based on your final value iteration results.