Chapter 17, Making Complex Decisions, Video Solutions, Artificial Intelligence: A Modern Approach [Global Edition]

Problem 1

For the $4 \times 3$ world shown in Figure 17.1, calculate which squares can be reached from $(1,1)$ by the action sequence [Right, Right, Right, Up, Up] and with what probabilities. Explain how this computation is related to the prediction task (see Section 15.2.1) for a hidden Markov model.
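The computation amounts to pushing a probability distribution over squares through the transition model once per action, with no observations in between. The following is a minimal sketch of that idea, not a reference solution. It assumes the usual layout of the $4 \times 3$ world (an obstacle at $(2,2)$, terminal squares at $(4,2)$ and $(4,3)$), the 0.8/0.1/0.1 motion model, and that terminal squares are absorbing; none of these details are reproduced on this page, and the names `predict` and `run` are hypothetical.

```python
from collections import defaultdict

# Assumed layout: columns 1..4, rows 1..3, one obstacle, two terminals.
WALLS = {(2, 2)}
TERMINALS = {(4, 2), (4, 3)}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def step(state, direction):
    """Deterministic move; bumping into a wall or the edge leaves the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt


def predict(dist, action):
    """One prediction step: new_dist(s') = sum_s P(s' | s, action) * dist(s)."""
    new = defaultdict(float)
    for s, p in dist.items():
        if s in TERMINALS:  # assumption: terminals are absorbing
            new[s] += p
            continue
        new[step(s, action)] += 0.8 * p
        for side in SIDES[action]:
            new[step(s, side)] += 0.1 * p
    return dict(new)


def run(start, actions):
    """Distribution over squares after executing the whole action sequence."""
    dist = {start: 1.0}
    for a in actions:
        dist = predict(dist, a)
    return dist


if __name__ == "__main__":
    final = run((1, 1), ["Right", "Right", "Right", "Up", "Up"])
    for square, prob in sorted(final.items()):
        print(square, round(prob, 5))
```

Each call to `predict` is the same sum over predecessor states that the HMM prediction step uses, with the chosen action selecting which transition matrix applies.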

Mahendra Kumar

Numerade Educator

Problem 2

Suppose that we define the utility of a state sequence to be the maximum reward obtained in any state in the sequence. Show that this utility function does not result in stationary preferences between state sequences. Is it still possible to define a utility function on states such that MEU decision making gives optimal behavior?


Problem 3

Can any finite search problem be translated exactly into a Markov decision problem such that an optimal solution of the latter is also an optimal solution of the former? If so, explain precisely how to translate the problem and how to translate the solution back; if not, explain precisely why not (i.e., give a counterexample).


Problem 4

Sometimes MDPs are formulated with a reward function $R(s, a)$ that depends on the action taken or with a reward function $R\left(s, a, s^{\prime}\right)$ that also depends on the outcome state.
a. Write the Bellman equations for these formulations.
b. Show how an MDP with reward function $R\left(s, a, s^{\prime}\right)$ can be transformed into a different MDP with reward function $R(s, a)$, such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
c. Now do the same to convert MDPs with $R(s, a)$ into MDPs with $R(s)$.


Problem 5

For the environment shown in Figure 17.1, find all the threshold values for $R(s)$ such that the optimal policy changes when the threshold is crossed. You will need a way to calculate the optimal policy and its value for fixed $R(s)$. (Hint: Prove that the value of any fixed policy varies linearly with $R(s)$.)

Thane Stiles

Numerade Educator

Problem 6

Equation (17.7) on page 654 states that the Bellman operator is a contraction.
a. Show that, for any functions $f$ and $g$,
$$
\left|\max _a f(a)-\max _a g(a)\right| \leq \max _a|f(a)-g(a)| .
$$
b. Write out an expression for $\left|\left(B U_i-B U_i^{\prime}\right)(s)\right|$ and then apply the result from (a) to complete the proof that the Bellman operator is a contraction.
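For reference, and assuming the $R(s)$ formulation with discount factor $\gamma$ and transition model $P(s^{\prime} \mid s, a)$, the Bellman operator and the contraction property referred to above can be written as follows, where $\|U\| = \max_s |U(s)|$ is the max norm:

$$
\left(B U_i\right)(s)=R(s)+\gamma \max _a \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) U_i\left(s^{\prime}\right), \qquad\left\|B U_i-B U_i^{\prime}\right\| \leq \gamma\left\|U_i-U_i^{\prime}\right\| .
$$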


Problem 7

This exercise considers two-player MDPs that correspond to zero-sum, turn-taking games like those in Chapter 5. Let the players be $A$ and $B$, and let $R(s)$ be the reward for player $A$ in state $s$. (The reward for $B$ is always equal and opposite.)
a. Let $U_A(s)$ be the utility of state $s$ when it is $A$'s turn to move in $s$, and let $U_B(s)$ be the utility of state $s$ when it is $B$'s turn to move in $s$. All rewards and utilities are calculated from $A$'s point of view (just as in a minimax game tree). Write down Bellman equations defining $U_A(s)$ and $U_B(s)$.
b. Explain how to do two-player value iteration with these equations, and define a suitable termination criterion.
c. Consider the game described in Figure 5.17 on page 197. Draw the state space (rather than the game tree), showing the moves by $A$ as solid lines and moves by $B$ as dashed lines. Mark each state with $R(s)$. You will find it helpful to arrange the states $\left(s_A, s_B\right)$ on a two-dimensional grid, using $s_A$ and $s_B$ as "coordinates."
d. Now apply two-player value iteration to solve this game, and derive the optimal policy.


Problem 8

Consider the $3 \times 3$ world shown in Figure 17.14(a). The transition model is the same as in the $4 \times 3$ world of Figure 17.1: $80 \%$ of the time the agent goes in the direction it selects; the rest of the time it moves at right angles to the intended direction.

Implement value iteration for this world for each value of $r$ below. Use discounted rewards with a discount factor of 0.99. Show the policy obtained in each case. Explain intuitively why the value of $r$ leads to each policy.
a. $r=100$
b. $r=-3$
c. $r=0$
d. $r=+3$
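A possible starting point for the implementation is sketched below. It is a generic grid-world value iteration under stated assumptions: the 0.8/0.1/0.1 motion model described above, a reward $R(s)$ per cell, and terminal cells whose utility equals their reward. The reward layout of Figure 17.14(a) is not reproduced on this page, so the usage example runs on a made-up two-cell corridor; the function names and the `rewards` dictionary format are hypothetical.

```python
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
SIDES = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def move(rewards, s, d):
    """Deterministic move; leaving the set of known cells keeps the agent in place."""
    nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
    return nxt if nxt in rewards else s


def q_value(rewards, U, s, a):
    """Expected utility of the successor state when action a is chosen in s."""
    total = 0.8 * U[move(rewards, s, a)]
    for side in SIDES[a]:
        total += 0.1 * U[move(rewards, s, side)]
    return total


def value_iteration(rewards, terminals, gamma=0.99, eps=1e-6):
    """rewards maps each cell (x, y) to R(s); returns (utilities, greedy policy)."""
    U = {s: 0.0 for s in rewards}
    while True:
        new_U = {}
        for s in rewards:
            if s in terminals:
                new_U[s] = rewards[s]
            else:
                best = max(q_value(rewards, U, s, a) for a in MOVES)
                new_U[s] = rewards[s] + gamma * best
        delta = max(abs(new_U[s] - U[s]) for s in rewards)
        U = new_U
        # Stop once the largest change is small enough for the requested accuracy.
        if delta <= eps * (1 - gamma) / gamma:
            break
    policy = {s: max(MOVES, key=lambda a: q_value(rewards, U, s, a))
              for s in rewards if s not in terminals}
    return U, policy


if __name__ == "__main__":
    # Made-up corridor: one ordinary cell next to one terminal cell.
    corridor = {(1, 1): -1.0, (2, 1): 10.0}
    U, policy = value_iteration(corridor, terminals={(2, 1)})
    print(U, policy)
```

To apply it to this problem, fill `rewards` with the nine cells of the figure for each value of $r$ and pass the terminal cell or cells in `terminals`.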


Problem 9

Consider the $101 \times 3$ world shown in Figure 17.14(b). In the start state the agent has a choice of two deterministic actions, Up or Down, but in the other states the agent has one deterministic action, Right. Assuming a discounted reward function, for what values of the discount $\gamma$ should the agent choose Up and for which Down? Compute the utility of each action as a function of $\gamma$. (Note that this simple example actually reflects many real-world situations in which one must weigh the value of an immediate action versus the potential continual long-term consequences, such as choosing to dump pollutants into a lake.)


Problem 10

Consider an undiscounted MDP having three states, $(1,2,3)$, with rewards $-1$, $-2$, and $0$, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: $a$ and $b$. The transition model is as follows:
$\bullet$ In state 1, action $a$ moves the agent to state 2 with probability 0.8 and makes the agent stay put with probability 0.2.
$\bullet$ In state 2, action $a$ moves the agent to state 1 with probability 0.8 and makes the agent stay put with probability 0.2.
$\bullet$ In either state 1 or state 2, action $b$ moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9.

Answer the following questions:
a. What can be determined qualitatively about the optimal policy in states 1 and 2 ?
b. Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and 2. Assume that the initial policy has action $b$ in both states.
c. What happens to policy iteration if the initial policy has action $a$ in both states? Does discounting help? Does the optimal policy depend on the discount factor?
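The mechanics of part (b) can be checked with a short script. The sketch below encodes the transition model exactly as stated above and alternates policy evaluation (solving the two linear equations for states 1 and 2 by Cramer's rule) with greedy policy improvement. It is an illustration under the assumption that the terminal state has utility 0; the function names are hypothetical, and the hand derivation asked for in the problem still has to be written out separately.

```python
R = {1: -1.0, 2: -2.0}  # rewards of the nonterminal states

# P[s][action] maps each successor state to its probability (state 3 is terminal).
P = {
    1: {"a": {2: 0.8, 1: 0.2}, "b": {3: 0.1, 1: 0.9}},
    2: {"a": {1: 0.8, 2: 0.2}, "b": {3: 0.1, 2: 0.9}},
}


def evaluate(policy):
    """Solve U = R + P_pi U for states 1 and 2 (undiscounted, U(3) = 0)."""
    a11 = 1.0 - P[1][policy[1]].get(1, 0.0)
    a12 = -P[1][policy[1]].get(2, 0.0)
    a21 = -P[2][policy[2]].get(1, 0.0)
    a22 = 1.0 - P[2][policy[2]].get(2, 0.0)
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        # The policy never reaches the terminal state, so utilities are unbounded.
        raise ValueError("policy evaluation has no finite solution")
    u1 = (R[1] * a22 - a12 * R[2]) / det
    u2 = (a11 * R[2] - a21 * R[1]) / det
    return {1: u1, 2: u2, 3: 0.0}


def q(U, s, a):
    """Expected successor utility; R(s) is the same for both actions, so it is omitted."""
    return sum(p * U[s2] for s2, p in P[s][a].items())


def policy_iteration(policy):
    policy = dict(policy)
    while True:
        U = evaluate(policy)
        changed = False
        for s in (1, 2):
            best = max(("a", "b"), key=lambda a: q(U, s, a))
            if q(U, s, best) > q(U, s, policy[s]) + 1e-9:
                policy[s] = best
                changed = True
        if not changed:
            return policy, U


if __name__ == "__main__":
    print(policy_iteration({1: "b", 2: "b"}))
```

Calling `evaluate({1: "a", 2: "a"})` raises the `ValueError`, which is the situation part (c) asks about: without discounting, the evaluation equations for that policy are singular.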

Robin Corrigan

Numerade Educator

Problem 11

Consider the $4 \times 3$ world shown in Figure 17.1.
a. Implement an environment simulator for this environment, such that the specific geography of the environment is easily altered. Some code for doing this is already in the online code repository.
b. Create an agent that uses policy iteration, and measure its performance in the environment simulator from various starting states. Perform several experiments from each starting state, and compare the average total reward received per run with the utility of the state, as determined by your algorithm.
c. Experiment with increasing the size of the environment. How does the run time for policy iteration vary with the size of the environment?


Problem 12

How can the value determination algorithm be used to calculate the expected loss experienced by an agent using a given set of utility estimates $U$ and an estimated model $P$, compared with an agent using correct values?

Ameer Said

Numerade Educator


Problem 13

Let the initial belief state $b_0$ for the $4 \times 3$ POMDP on page 658 be the uniform distribution over the nonterminal states, i.e., $\left\langle\frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, 0,0\right\rangle$. Calculate the exact belief state $b_1$ after the agent moves Left and its sensor reports 1 adjacent wall. Also calculate $b_2$ assuming that the same thing happens again.
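Both calculations use the standard POMDP belief update, stated here for reference. With action $a$, percept $e$, transition model $P(s^{\prime} \mid s, a)$, sensor model $P(e \mid s^{\prime})$, and a normalizing constant $\alpha$ chosen so that the new belief state sums to 1:

$$
b^{\prime}\left(s^{\prime}\right)=\alpha P\left(e \mid s^{\prime}\right) \sum_s P\left(s^{\prime} \mid s, a\right) b(s) .
$$

Applying it once with $b = b_0$ gives $b_1$, and applying it again with $b = b_1$ and the same action and percept gives $b_2$.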

Christopher Stanley

Numerade Educator


Problem 14

What is the time complexity of $d$ steps of POMDP value iteration for a sensorless environment?

James Kiss

Numerade Educator

Problem 15

Consider a version of the two-state POMDP on page 661 in which the sensor is $90 \%$ reliable in state 0 but provides no information in state 1 (that is, it reports 0 or 1 with equal probability). Analyze, either qualitatively or quantitatively, the utility function and the optimal policy for this problem.


Problem 16

Show that a dominant strategy equilibrium is a Nash equilibrium, but not vice versa.


Problem 17

Solve the game of three-finger Morra.


Problem 18

In the Prisoner's Dilemma, consider the case where after each round, Alice and Bob have probability $X$ of meeting again. Suppose both players choose the perpetual punishment strategy (where each will choose refuse unless the other player has ever played testify). Assume neither player has played testify thus far. What is the expected future total payoff for choosing to testify versus refuse when $X=0.2$? How about when $X=0.05$? For what value of $X$ is the expected future total payoff the same whether one chooses to testify or refuse in the current round?


Problem 19

A Dutch auction is similar to an English auction, but rather than starting the bidding at a low price and increasing, in a Dutch auction the seller starts at a high price and gradually lowers the price until some buyer is willing to accept that price. (If multiple bidders accept the price, one is arbitrarily chosen as the winner.) More formally, the seller begins with a price $p$ and gradually lowers $p$ by increments of $d$ until at least one buyer accepts the price. Assuming all bidders act rationally, is it true that for arbitrarily small $d$, a Dutch auction will always result in the bidder with the highest value for the item obtaining the item? If so, show mathematically why. If not, explain how it may be possible for the bidder with the highest value for the item not to obtain it.


Problem 20

Imagine an auction mechanism that is just like an ascending-bid auction, except that at the end, the winning bidder, the one who bid $b_{\max }$, pays only $b_{\max } / 2$ rather than $b_{\max }$. Assuming all agents are rational, what is the expected revenue to the auctioneer for this mechanism, compared with a standard ascending-bid auction?


Problem 21

Teams in the National Hockey League historically received 2 points for winning a game and 0 for losing. If the game is tied, an overtime period is played; if nobody wins in overtime, the game is a tie and each team gets 1 point. But league officials felt that teams were playing too conservatively in overtime (to avoid a loss), and it would be more exciting if overtime produced a winner. So in 1999 the officials experimented in mechanism design: the rules were changed, giving a team that loses in overtime 1 point, not 0. It is still 2 points for a win and 1 for a tie.
a. Was hockey a zero-sum game before the rule change? After?
b. Suppose that at a certain time $t$ in a game, the home team has probability $p$ of winning in regulation time, probability $0.78-p$ of losing, and probability 0.22 of going into overtime, where they have probability $q$ of winning, $0.9-q$ of losing, and 0.1 of tying. Give equations for the expected value for the home and visiting teams.
c. Imagine that it were legal and ethical for the two teams to enter into a pact where they agree that they will skate to a tie in regulation time, and then both try in earnest to win in overtime. Under what conditions, in terms of $p$ and $q$, would it be rational for both teams to agree to this pact?
d. Longley and Sankaran (2005) report that since the rule change, the percentage of games with a winner in overtime went up $18.2 \%$, as desired, but the percentage of overtime games also went up $3.6 \%$. What does that suggest about possible collusion or conservative play after the rule change?

Rashmi Sinha

Numerade Educator