Describe the exploration exploitation trade off in...

Name: describe the exploration exploitation trade off in reinforcement learning make sure you include what each term means why this trade off is important an example scenario where this trade off 72136
Uploaded: 2023-11-01T18:05:16-08:00
Duration: 4 min 25 s
Channel: Akash M
Description: describe the exploration exploitation trade off in reinforcement learning make sure you include what each term means why this trade off is important an example scenario where this trade off 72136

Recommended Videos

For this task, we want to use decision theory to design the robots. We assume that the resources can appear in the locations at time t = 0 randomly; if a resource appears in a location, it appears in a quantity of 50 units and does not appear later on. We also assume that 1 unit of resource r1 (resp. r2, r3) is worth 10 pounds (resp. 35 pounds, 40 pounds). Recall that the resources disappear over time if not collected (see introduction). We assume for this task that there is no cliff on the grid.

1. For the next two questions, take the last two digits of your student number. Say they are xy. We denote by p the probability 0.xy. Consider the following grid, where a robot is on the location at time 0 and where:
- On the orange locations, r1 has a probability of 0.7 to appear, r2 has a probability of p, and r3 has a probability of 0.1, at time 0 and in a quantity of 50 units.
- On the violet location, r1 has a probability of p to appear, r2 has a probability of 0.5, and r3 has a probability of 0.2, at time 0 and in a quantity of 50 units.
- On the other locations, resources have a probability of 0 to appear.
According to decision theory, which path should the robot follow?

2. Consider the following grid, where a robot is on the location at time 0 and where:
- On the violet location, r1 has a probability of p to appear, r2 has a probability of 0.4, and r3 has a probability of x, at time 0 and in a quantity of 50 units.
- On the orange location, r1 has a probability of 0.6 to appear, r2 has a probability of x, and r3 has a probability of p, at time 0 and in a quantity of 50 units.
According to decision theory, for which values of x would the robot start by collecting resources on the orange location before going to the violet one?

3. (a bit difficult) You are going to collect resources on the next location, but you are asked to choose which resource(s) you will collect before knowing the content of the next location. On the next location, r1 has a probability of 0.5 to appear in a quantity of 50 units when you reach it. Resources r2 and r3 have a probability of 0.75 to appear in a total quantity of 50 units, but you do not know how many units are of r2 and how many are of r3. You have the two following scenarios:
- Scenario 1: you are asked to choose between collecting resource r1 or collecting resource r3.
- Scenario 2: you are asked to choose between collecting resources r1 and r3 or collecting resources r2 and r3.
Explain why, if you prefer to collect resource r1 in Scenario 1 and prefer to collect r2 and r3 in Scenario 2, you would create a paradox.

Akash M.

Explain the trade-off between risk and return

Jennifer S.

How do $r$ and $K$ strategies relate to the predictability of the environment? In which kinds of environment is each strategist

Transcript

00:01 So, to determine the optimal path for the robot, we need to calculate the expected value of each path.

00:23 Let's denote the orange location as a and the violet location as b.

00:43 The expected value of path ab is E_ab = 0.7 × 10 × 50 + p × 35 × 50 + 0.1 × 40 × 50.

01:29 That is actually 1225p + 350.
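As a check, the expression quoted at 00:43 can be evaluated term by term. It uses only the orange-location probabilities from question 1 and ignores any decay of resources over time, since the introduction that defines the decay is not reproduced here.

```latex
\begin{align*}
E_{ab} &= 0.7 \cdot 10 \cdot 50 + p \cdot 35 \cdot 50 + 0.1 \cdot 40 \cdot 50 \\
       &= 350 + 1750p + 200 \\
       &= 1750p + 550
\end{align*}
```

This does not match the figure read out at 01:29, so the spoken value should be treated with caution.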

01:39 That is the expected value of path ab. Now, by a similar process, the expected value of path ba is E_ba = 875p.
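The per-location arithmetic can be checked with a short script. This is a minimal sketch, assuming that the contributions of the three resources at a location simply add and ignoring the decay over time (the rule for which is in the introduction, not on this page). The function names and the sample value p = 0.25 are illustrative and do not come from the source.

```python
# Sketch only: per-location expected value at time 0, ignoring decay over time
# and travel order (both are assumptions; the source does not give the decay rule).

UNIT_VALUE = {"r1": 10, "r2": 35, "r3": 40}  # pounds per unit, from the problem statement
QUANTITY = 50  # units present whenever a resource appears


def location_expected_value(appear_prob):
    """Expected value in pounds of one location, given each resource's appearance probability."""
    return QUANTITY * sum(prob * UNIT_VALUE[name] for name, prob in appear_prob.items())


def orange(p):
    """Orange location of question 1: r1 -> 0.7, r2 -> p, r3 -> 0.1."""
    return location_expected_value({"r1": 0.7, "r2": p, "r3": 0.1})


def violet(p):
    """Violet location of question 1: r1 -> p, r2 -> 0.5, r3 -> 0.2."""
    return location_expected_value({"r1": p, "r2": 0.5, "r3": 0.2})


if __name__ == "__main__":
    p = 0.25  # hypothetical value of 0.xy, for illustration only
    print(f"orange: {orange(p):.2f} pounds")
    print(f"violet: {violet(p):.2f} pounds")
```

Under these assumptions, the violet-location probabilities from question 1 give 500p + 1275, of which 875 is the r2 term alone (0.5 × 35 × 50). Ranking the paths ab and ba additionally requires the decay rule, which this sketch leaves out.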

02:22 I am doing it after the calculation...