The Monte Carlo method! A computational technique.pptx

ssuser7b7f4e 60 views 16 slides Jun 30, 2024

About This Presentation

The Monte Carlo method! A computational technique that uses random sampling to solve mathematical problems and estimate quantities. It's a powerful tool in many fields, including:

- Finance: risk analysis, option pricing
- Engineering: system design, optimization
- Computer Science: algorithms,...


Slide Content

Slide 1: Monte Carlo Method
Presented by Shajal Afaq, M.Tech CSE (2nd year)

Slide 2: Monte Carlo Methods
- Monte Carlo methods learn from complete sample returns
  - Only defined for episodic tasks
- Monte Carlo methods learn directly from experience
  - On-line: no model necessary, and still attains optimality
  - Simulated: no need for a full model

Slide 3: Monte Carlo Policy Evaluation
- Goal: learn V^π(s)
- Given: some number of episodes under π which contain s
- Idea: average the returns observed after visits to s
- Every-Visit MC: average returns for every time s is visited in an episode
- First-visit MC: average returns only for the first time s is visited in an episode
- Both converge asymptotically

Slide 4: First-visit Monte Carlo policy evaluation
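The slide names the first-visit algorithm without reproducing its steps here, so a minimal Python sketch is given below under stated assumptions: a hypothetical `generate_episode(policy)` helper returns one episode as a list of `(state, reward)` pairs, where the reward is the one received after leaving that state.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate V^pi(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)   # total return observed after first visits to s
    returns_count = defaultdict(int)   # number of episodes in which s was visited

    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(state, reward), ...]

        # Record the first time step at which each state appears.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Accumulate the return backwards from the end of the episode.
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:             # only the first visit contributes
                returns_sum[state] += G
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Every-visit MC would differ only in dropping the `first_visit[state] == t` check.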

Slide 5: Blackjack example
- Object: have your card sum be greater than the dealer's without exceeding 21
- States (200 of them):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
- Reward: +1 for winning, 0 for a draw, -1 for losing
- Actions: stick (stop receiving cards), hit (receive another card)
- Policy: stick if my sum is 20 or 21, else hit
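As a concrete illustration of this setup, here is a small simulation sketch of one episode under the stated policy. The card-drawing details are simplifying assumptions not spelled out on the slide (infinite deck, face cards count as 10, dealer hits until reaching 17, naturals not treated specially), and the function names are illustrative.

```python
import random

def draw_card():
    # Infinite-deck assumption; face cards count as 10, ace as 1 (possibly usable as 11).
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (total, usable_ace), counting one ace as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode():
    """One Blackjack episode under the slide's policy: stick on 20 or 21, else hit."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]

    # Player's turn.
    while True:
        total, _ = hand_value(player)
        if total > 21:
            return -1                      # player busts: reward -1
        if total >= 20:
            break                          # policy: stick on 20 or 21
        player.append(draw_card())         # policy: otherwise hit

    # Dealer's turn (assumed standard rule: hit until reaching 17 or more).
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())

    player_total, _ = hand_value(player)
    dealer_total, _ = hand_value(dealer)
    if dealer_total > 21 or player_total > dealer_total:
        return +1
    if player_total == dealer_total:
        return 0
    return -1
```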

Slide 6: Blackjack value functions
[Figure: state-value function for the stick-on-20-or-21 policy after 10,000 and after 500,000 episodes, shown separately for usable ace and no usable ace; axes: dealer showing (A-10), player sum (12-21), value from -1 to +1.]

Slide 7: Backup diagram for Monte Carlo
- The entire episode, down to the terminal state, is included
- Only one choice at each state (unlike DP)
- MC does not bootstrap
- Time required to estimate one state does not depend on the total number of states

Slide 8: Monte Carlo Estimation of Action Values (Q)
- Monte Carlo is most useful when a model is not available
  - We want to learn Q*
- Q^π(s,a): average return starting from state s and action a, then following π
- Also converges asymptotically if every state-action pair is visited
- Exploring starts: every state-action pair has a non-zero probability of being the starting pair
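The action-value case mirrors the state-value sketch above, with returns averaged per (state, action) pair. The `generate_episode_es` helper is a hypothetical stand-in for an episode generator that starts from a random (state, action) pair (exploring starts) and follows π afterwards, yielding `(state, action, reward)` triples.

```python
from collections import defaultdict

def mc_action_value_evaluation(generate_episode_es, policy, num_episodes, gamma=1.0):
    """Estimate Q^pi(s, a) as the average return after the first visit to (s, a)."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)       # [(state, action, reward), ...]
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1

    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```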

Slide 9: Monte Carlo Control
- MC policy iteration: policy evaluation using MC methods followed by policy improvement
- Policy improvement step: greedify with respect to the value (or action-value) function
- [Diagram: evaluation π → Q^π alternates with improvement π → greedy(Q)]

Slide 10: Convergence of MC Control
- The policy improvement theorem tells us:
  Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s, a))
                         = max_a Q^{π_k}(s, a)
                         ≥ Q^{π_k}(s, π_k(s))
                         = V^{π_k}(s)
- This assumes exploring starts and an infinite number of episodes for MC policy evaluation
- To solve the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode
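A minimal sketch along these lines of Monte Carlo control with exploring starts: after each episode the Q-estimates for the visited (state, action) pairs are updated by averaging, and the policy is immediately greedified at the visited states, so evaluation and improvement alternate per episode. Names such as `actions(s)` and `generate_episode_es` are illustrative assumptions, not slide code.

```python
from collections import defaultdict

def mc_control_exploring_starts(generate_episode_es, actions, num_episodes, gamma=1.0):
    """Monte Carlo ES control: evaluate Q and greedify the policy after every episode."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}   # state -> greedy action; generate_episode_es is assumed to
                  # act arbitrarily at states not yet present in this dict.

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)       # exploring start, then follow policy
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                # Evaluation step: incremental average of first-visit returns.
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # Improvement step: greedify at the visited state.
                policy[s] = max(actions(s), key=lambda b: Q[(s, b)])

    return policy, Q
```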

Slide 11: Blackjack example continued
- Exploring starts; initial policy as described before
[Figure: the optimal policy π* (HIT/STICK regions over player sum 11-21 and dealer showing A-10, for usable ace and no usable ace) and the optimal value function V* (values from -1 to +1 over player sum 12-21 and dealer showing A-10).]

Slide 12: On-policy Monte Carlo Control
- On-policy: learn about the policy currently being executed
- How do we get rid of exploring starts?
  - Need soft policies: π(s,a) > 0 for all s and a
  - e.g. ε-soft policy: π(s,a) = ε/|A(s)| for non-max actions, 1 - ε + ε/|A(s)| for the greedy action
- Similar to GPI: move the policy towards the greedy policy (i.e. ε-soft)
- Converges to the best ε-soft policy
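The ε-soft probabilities above translate directly into a small helper: every action gets at least ε/|A(s)| probability, and the greedy action gets the remaining 1 - ε on top of that. The functions below are an illustrative sketch, not code from the slides.

```python
import random

def epsilon_soft_probs(Q, state, actions, epsilon=0.1):
    """Return {action: probability} for an epsilon-soft (epsilon-greedy) policy."""
    greedy = max(actions, key=lambda a: Q.get((state, a), 0.0))
    base = epsilon / len(actions)            # every action gets at least eps / |A(s)|
    return {a: base + (1.0 - epsilon if a == greedy else 0.0) for a in actions}

def sample_action(Q, state, actions, epsilon=0.1):
    """Sample an action according to the epsilon-soft probabilities."""
    probs = epsilon_soft_probs(Q, state, actions, epsilon)
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```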

Slide 13: Off-policy Monte Carlo control
- Behavior policy generates behavior in the environment
- Estimation policy is the policy being learned about
- Returns from the behavior policy are weighted by the ratio of their probabilities under the estimation and behavior policies (importance sampling)
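A sketch of this weighting idea using weighted importance sampling with an incremental update: working backwards through each episode, the probability ratio of the estimation (target) policy to the behavior policy is multiplied into a running weight W, which scales how much that return moves the estimate. Function and variable names here are illustrative assumptions.

```python
from collections import defaultdict

def off_policy_mc_evaluation(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate Q for the estimation policy from episodes generated by the behavior policy.

    episodes: iterable of [(state, action, reward), ...] lists
    target_prob(s, a), behavior_prob(s, a): action probabilities under the two policies
    """
    Q = defaultdict(float)
    C = defaultdict(float)          # cumulative importance-sampling weights per (s, a)

    for episode in episodes:
        G = 0.0
        W = 1.0                     # running product of probability ratios
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:            # earlier steps can no longer contribute
                break

    return Q
```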

Slide 14: Incremental Implementation
- MC can be implemented incrementally
  - saves memory
- Compute the weighted average of the returns:
  - non-incremental: V_n = (Σ_{k=1}^{n} w_k R_k) / (Σ_{k=1}^{n} w_k)
  - incremental equivalent: V_{n+1} = V_n + (w_{n+1} / W_{n+1}) (R_{n+1} - V_n), where W_{n+1} = W_n + w_{n+1} and W_0 = 0
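In code, the incremental form keeps only the running estimate V and the cumulative weight W, so each new weighted return costs constant memory, which is the memory saving the slide refers to. A minimal sketch:

```python
class IncrementalWeightedAverage:
    """Maintain V_n = (sum_k w_k R_k) / (sum_k w_k) without storing past returns."""

    def __init__(self):
        self.V = 0.0   # current weighted-average estimate
        self.W = 0.0   # cumulative weight W_n = w_1 + ... + w_n

    def update(self, R, w):
        # V_{n+1} = V_n + (w_{n+1} / W_{n+1}) * (R_{n+1} - V_n), assuming w > 0
        self.W += w
        self.V += (w / self.W) * (R - self.V)
        return self.V
```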

Slide 15: Summary
- MC has several advantages over DP:
  - Can learn directly from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm from violations of the Markov property (later in the book)
- MC methods provide an alternate policy evaluation process
- One issue to watch for: maintaining sufficient exploration
  - exploring starts, soft policies
- No bootstrapping (as opposed to DP)

Slide 16: Thank you