Chapter 21, Reinforcement Learning Video Solutions, Artificial Intelligence: A Modern Approach [Global Edition]

Problem 1

Implement a passive learning agent in a simple environment, such as the 4 × 3 world. For the case of an initially unknown environment model, compare the learning performance of the direct utility estimation, TD, and ADP algorithms. Do the comparison for the optimal policy and for several random policies. For which do the utility estimates converge faster? What happens when the size of the environment is increased? (Try environments with and without obstacles.)
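
As a starting point for the TD part of the comparison, the following is a minimal sketch of the passive TD update U(s) ← U(s) + α(R(s) + γU(s') − U(s)) applied along one recorded trial. The function name `td_update_trial`, the trial format (a list of (state, reward) pairs ending in a terminal state), the fixed learning rate, and the example trial are all assumptions made for illustration; they are not part of the exercise.

```python
from collections import defaultdict


def td_update_trial(U, trial, alpha=0.1, gamma=1.0):
    """Apply passive TD updates along one trial (hypothetical helper).

    U:     dict mapping state -> utility estimate, modified in place
    trial: list of (state, reward) pairs in visiting order; the last
           pair is assumed to be a terminal state
    """
    # The utility of the terminal state is simply its reward.
    last_state, last_reward = trial[-1]
    U[last_state] = last_reward
    # Walk over consecutive (s, s') pairs and nudge U(s) toward
    # the one-step target R(s) + gamma * U(s').
    for (s, r), (s_next, _) in zip(trial, trial[1:]):
        U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U


# Illustrative trial in a 4 x 3 world: step reward -0.04, +1 at (4, 3).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
U = defaultdict(float)
for _ in range(500):
    td_update_trial(U, trial)
```

Replaying one deterministic trial only exercises the update rule. For the comparison the exercise asks for, the trials would come from running the policy in the stochastic environment, and the same trials would be fed to the direct utility estimation and ADP learners.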

Problem 2

Starting with the passive ADP agent, modify it to use an approximate ADP algorithm as discussed in the text. Do this in two steps:
a. Implement a priority queue for adjustments to the utility estimates. Whenever a state is adjusted, all of its predecessors also become candidates for adjustment and should be added to the queue. The queue is initialized with the state from which the most recent transition took place. Allow only a fixed number of adjustments.
b. Experiment with various heuristics for ordering the priority queue, examining their effect on learning rates and computation time.
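
The queue mechanics in part (a) can be sketched as follows, using Python's standard `heapq` module. The function name `prioritized_sweep`, the dictionary-based model representation, and the particular priority heuristic (size of the change times the transition probability) are assumptions chosen for illustration; part (b) asks you to try other orderings.

```python
import heapq


def prioritized_sweep(U, R, P, preds, start, gamma=0.9,
                      max_adjustments=10, eps=1e-6):
    """Adjust utility estimates outward from `start` (hypothetical helper).

    U:     dict state -> utility estimate, modified in place
    R:     dict state -> reward
    P:     dict state -> {successor: probability} under the fixed policy
    preds: dict state -> set of predecessor states
    Returns the number of adjustments actually made.
    States must be mutually comparable (e.g. strings or tuples), because
    heapq falls back to comparing them when priorities tie.
    """
    # heapq is a min-heap, so priorities are stored negated.
    queue = [(-float("inf"), start)]
    adjustments = 0
    while queue and adjustments < max_adjustments:
        _, s = heapq.heappop(queue)
        successors = P.get(s, {})
        new_u = R[s] + gamma * sum(p * U[s2] for s2, p in successors.items())
        change = abs(new_u - U[s])
        U[s] = new_u
        adjustments += 1
        if change > eps:
            # Every predecessor becomes a candidate for adjustment.
            for pred in preds.get(s, ()):
                heapq.heappush(queue, (-change * P[pred][s], pred))
    return adjustments


# Tiny chain A -> B -> C, where C is terminal with reward 1.
R = {"A": 0.0, "B": 0.0, "C": 1.0}
P = {"A": {"B": 1.0}, "B": {"C": 1.0}, "C": {}}
preds = {"B": {"A"}, "C": {"B"}}
U = {"A": 0.0, "B": 0.0, "C": 1.0}
n = prioritized_sweep(U, R, P, preds, start="B")
```

This sketch allows the same state to sit in the queue more than once. A fuller implementation might merge duplicates or keep only the highest priority per state.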

Problem 3

The direct utility estimation method in Section 21.2 uses distinguished terminal states to indicate the end of a trial. How could it be modified for environments with discounted rewards and no terminal states?
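
One possible modification, sketched here under stated assumptions, is to replace the complete trial with a fixed window: with a discount factor γ < 1, the reward-to-go from a state is approximated by the discounted sum of the next H rewards, and the ignored tail is bounded by γ^H · R_max / (1 − γ). The function name `direct_utility_estimates`, the stream format, and the horizon parameter are hypothetical and not taken from the text.

```python
from collections import defaultdict


def direct_utility_estimates(stream, gamma=0.9, horizon=50):
    """Average truncated discounted returns per state (hypothetical helper).

    stream: list of (state, reward) pairs from one long, non-terminating
            run (assumed format)
    Each position with at least `horizon` steps of data after it
    contributes one sample of reward-to-go for its state.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in range(len(stream) - horizon + 1):
        g = 0.0
        # Accumulate backwards so each step is one multiply-add.
        for _, r in reversed(stream[t:t + horizon]):
            g = r + gamma * g
        state = stream[t][0]
        totals[state] += g
        counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}


# Two-state loop: A (reward 1) -> B (reward 0) -> A -> ...
stream = [("A", 1.0), ("B", 0.0)] * 200
estimates = direct_utility_estimates(stream, gamma=0.9, horizon=200)
```

The double loop costs O(N · H) for a stream of length N. That is acceptable for a sketch; a single backward pass over the stream would be cheaper.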

Problem 4

Write out the parameter update equations for TD learning with
$$
\hat{U}(x, y)=\theta_0+\theta_1 x+\theta_2 y+\theta_3 \sqrt{\left(x-x_g\right)^2+\left(y-y_g\right)^2} .
$$
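
A sketch of how the updates might be written, assuming the usual TD rule for a parameterized utility function, in which each parameter moves along the gradient of $\hat{U}$ scaled by the TD error. The learning rate $\alpha$, discount factor $\gamma$, reward $R(x, y)$, and successor state $(x', y')$ are notation assumed here; the problem statement does not fix them.

$$
\delta=R(x, y)+\gamma \hat{U}\left(x^{\prime}, y^{\prime}\right)-\hat{U}(x, y)
$$

$$
\begin{aligned}
\theta_0 & \leftarrow \theta_0+\alpha\, \delta \\
\theta_1 & \leftarrow \theta_1+\alpha\, \delta\, x \\
\theta_2 & \leftarrow \theta_2+\alpha\, \delta\, y \\
\theta_3 & \leftarrow \theta_3+\alpha\, \delta \sqrt{\left(x-x_g\right)^2+\left(y-y_g\right)^2}
\end{aligned}
$$

Because $\hat{U}$ is linear in the parameters, each partial derivative $\partial \hat{U} / \partial \theta_i$ is simply the corresponding feature: $1$, $x$, $y$, and the distance to the goal.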

Babita Kumari

Numerade Educator

Problem 5

Adapt the vacuum world (Chapter 2) for reinforcement learning by including rewards for squares being clean. Make the world observable by providing suitable percepts. Now experiment with different reinforcement learning agents. Is function approximation necessary for success? What sort of approximator works for this application?

Problem 6

Implement an exploring reinforcement learning agent that uses direct utility estimation. Make two versions: one with a tabular representation and one using the function approximator in Equation (21.10). Compare their performance in three environments:
a. The $4 \times 3$ world described in the chapter.
b. A $10 \times 10$ world with no obstacles and a $+1$ reward at $(10,10)$.
c. A $10 \times 10$ world with no obstacles and a $+1$ reward at $(5,5)$.

Problem 7

Extend the standard game-playing environment (Chapter 5) to incorporate a reward signal. Put two reinforcement learning agents into the environment (they may, of course, share the agent program) and have them play against each other. Apply the generalized TD update rule (Equation (21.12)) to update the evaluation function. You might wish to start with a simple linear weighted evaluation function and a simple game, such as tic-tac-toe.

Tyler Moulton

Numerade Educator

Problem 8

Compute the true utility function and the best linear approximation in $x$ and $y$ (as in Equation (21.10)) for the following environments:
a. A $10 \times 10$ world with a single +1 terminal state at $(10,10)$.
b. As in (a), but add a -1 terminal state at $(10,1)$.
c. As in (b), but add obstacles in 10 randomly selected squares.
d. As in (b), but place a wall stretching from $(5,2)$ to $(5,9)$.
e. As in (a), but with the terminal state at $(5,5)$.

The actions are deterministic moves in the four directions. In each case, compare the results using three-dimensional plots. For each environment, propose additional features (besides $x$ and $y$) that would improve the approximation and show the results.
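
As a worked illustration for cases (a) and (e), assume no discounting and a constant reward of $-c$ (with $c>0$) in every nonterminal state; both are assumptions, since the problem statement does not specify them. With deterministic moves, the optimal policy follows a shortest path, so the true utility depends only on the Manhattan distance to the terminal state. For (a):

$$
U(x, y)=1-c\,[(10-x)+(10-y)]=(1-20 c)+c\, x+c\, y
$$

Under these assumptions the true utility in (a) is exactly linear in $x$ and $y$, with $\theta_0=1-20c$ and $\theta_1=\theta_2=c$. For (e), the same reasoning gives

$$
U(x, y)=1-c\,(|x-5|+|y-5|)
$$

which is not linear in $x$ and $y$ but is linear in the single feature $|x-5|+|y-5|$. That makes the Manhattan distance to the terminal state a natural candidate for an additional feature.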

Paul Choe

Numerade Educator

Problem 9

Implement the REINFORCE and PEGASUS algorithms and apply them to the $4 \times 3$ world, using a policy family of your own choosing. Comment on the results.

Problem 10

Investigate the application of reinforcement learning ideas to the modeling of human and animal behavior.

Sarah Howell

Numerade Educator