5th Module_Machine Learning_Reinforc.pdf


About This Presentation

5th Module_Machine Learning_Reinforcement Learning


Slide Content

MACHINE LEARNING (INTEGRATED)
(21ISE62)
Module 5
Dr. Shivashankar
Professor
Department of Information Science & Engineering
GLOBAL ACADEMY OF TECHNOLOGY-Bengaluru
GLOBAL ACADEMY OF TECHNOLOGY
Ideal Homes Township, Rajarajeshwari Nagar, Bengaluru – 560 098
Department of Information Science & Engineering

Course Outcomes
After completion of the course, the student will be able to:
• Illustrate regression techniques and the decision tree learning algorithm.
• Apply SVM, ANN and KNN algorithms to solve appropriate problems.
• Apply Bayesian techniques and derive effective learning rules.
• Illustrate the performance of AI and ML algorithms using evaluation techniques.
• Understand reinforcement learning and its application to real-world problems.
Textbooks:
1. Tom M. Mitchell, Machine Learning, McGraw Hill Education, India Edition, 2013.
2. Ethem Alpaydın, Introduction to Machine Learning, MIT Press, Second Edition.
3. Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson, First Impression, 2014.

Module 5: Reinforcement Learning
• Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions that achieve the best possible results.
• It mimics the trial-and-error learning process that humans use to achieve their goals.
• It is a feedback-based ML approach: an agent learns which action to perform by looking at the environment and at the result of its actions.
• For each correct action the agent gets positive feedback (a reward), and for each incorrect action it gets negative feedback (a penalty).
Fig 5.1: Reinforcement learning: the agent takes an action in the environment and receives the resulting state and a reward in return.
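To make the feedback loop concrete, here is a minimal agent-environment interaction sketch in Python; the GridEnvironment toy and the random agent are illustrative assumptions, not part of the slides.

import random

class GridEnvironment:
    """Toy environment: states 0..4 on a line; reaching state 4 ends the episode with reward +1."""
    def __init__(self):
        self.state = 0

    def step(self, action):                      # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0     # positive feedback only at the goal
        done = self.state == 4
        return self.state, reward, done

env = GridEnvironment()
done = False
while not done:
    action = random.choice([-1, +1])             # the agent chooses an action
    state, reward, done = env.step(action)       # the environment returns the new state and reward
    print(f"action={action:+d} -> state={state}, reward={reward}")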

Learning to Optimize Rewards
• The agent interacts with the environment and identifies the possible actions it can perform.
• The primary goal of the agent in reinforcement learning is to choose actions, based on the observed environment, that collect the maximum positive reward.
• In reinforcement learning the agent learns automatically from feedback, without any labelled data, unlike supervised learning.
• Since there is no labelled data, the agent is bound to learn from its own experience only.
• There are two types of reinforcement: positive and negative.
• Positive reinforcement is the recurrence of a behaviour due to positive rewards.
• Rewards increase the strength and frequency of a specific behaviour.
• This encourages the agent to execute similar actions that yield maximum reward.
• Similarly, in negative reinforcement, negative rewards are used as a deterrent to weaken a behaviour and to avoid it.
• Such penalties decrease the strength and frequency of a specific behaviour.

Cont…
• The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
• The action performed by the agent is referred to as "a".
• The state reached by performing the action is "s".
• The reward/feedback obtained for each good or bad action is "R".
• The discount factor is gamma, "γ".
• The value of a state is V(s) = max[R(s,a) + γV(s')]

Cont…
• V(s) = the value calculated at a particular state.
• R(s,a) = the reward obtained at state s by performing action a.
• γ = the discount factor.
• V(s') = the value of the next state. (A small numeric sketch of this backup follows below.)
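For instance, a single backup of this formula with assumed numbers (γ = 0.9, two available actions) looks like this:

gamma = 0.9
# (reward, value of next state) for the two actions available in some state s -- assumed numbers
candidates = {
    "a1": (0, 10.0),   # R(s, a1) = 0,  V(s') = 10 after a1
    "a2": (1,  5.0),   # R(s, a2) = 1,  V(s') = 5  after a2
}
# V(s) = max over actions of R(s, a) + gamma * V(s')
v_s = max(r + gamma * v_next for r, v_next in candidates.values())
print(v_s)   # a1 gives 0 + 0.9*10 = 9.0, a2 gives 1 + 0.9*5 = 5.5, so V(s) = 9.0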
How to represent the agent state?
• We can represent the agent state using the Markov state, which contains all the required information from the history. The state St is a Markov state if it satisfies the condition:
P[St+1 | St] = P[St+1 | S1, ......, St]
• A Markov decision process (MDP) is a tuple of four elements (S, A, Pa, Ra):
• A set of finite states S.
• A set of finite actions A.
• The reward Ra received after transitioning from state S to state S', due to action a.
• The transition probability Pa.
• (A minimal representation of such a tuple is sketched below.)
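A minimal sketch of how such an (S, A, Pa, Ra) tuple can be held in code; the states, probabilities and rewards below are assumed values, not from the slides.

# A tiny MDP: states S, actions A, transition probabilities Pa and rewards Ra (all values assumed).
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each possible next state s' to its probability.
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}

# R[(s, a, s')] is the reward received for that transition; unlisted transitions give 0.
R = {("s0", "move", "s1"): 1.0}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * R.get((s, a, s_next), 0.0) for s_next, p in P[(s, a)].items())

print(expected_reward("s0", "move"))   # 0.8 * 1.0 = 0.8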

Reinforcement Learning Algorithms
• Reinforcement learning algorithms are mainly used in AI applications and gaming applications. The main algorithms are:
• Q-Learning:
• Q-learning is an off-policy RL algorithm used for temporal-difference learning. Temporal-difference learning methods are ways of comparing temporally successive predictions.
• It learns the value function Q(s, a), which expresses how good it is to take action "a" in a particular state "s".
• The flowchart explaining the working of Q-learning is omitted here.

Credit Assignment Problem
• The Credit Assignment Problem (CAP) is a fundamental challenge in reinforcement learning.
• It arises when an agent receives a reward for a particular action, but the agent must determine which of its previous actions led to that reward.
• In reinforcement learning, an agent applies a set of actions in an environment to maximize the overall reward.
• The agent updates its policy based on the feedback received from the environment.
• The CAP refers to the problem of measuring the influence and impact of an action taken by an agent on future rewards.
• The core aim is to guide the agent to take corrective actions that maximize the reward.
• The CAP can make it difficult for the agent to build an effective policy.
• Additionally, there are situations where the agent takes a sequence of actions and the reward signal is only received at the end of the sequence.
• In these cases, the agent must determine which of its previous actions positively contributed to the final reward.

Cont…
• Example: As the agent explores the maze, it receives a reward of +10 for reaching the goal state. Additionally, if it hits a stone, we penalize the action with a -10 reward.
Path 1: 1-5-9
Path 2: 1-4-8
Path 3: 1-2-3-6-9
Path 4: 1-2-5-9
and so on...

Temporal Difference Learning
• Temporal difference learning in reinforcement learning works as an unsupervised learning method.
• It helps predict the total expected future reward.
• At its core, temporal difference learning (TD learning) aims to predict a variable's future value in a state sequence. TD learning made a big leap in solving reward-prediction problems.
Figure 3: The temporal difference reinforcement learning algorithm (figure omitted).

Cont…
• More formally, according to the TD algorithm, the prediction error δ(t) is defined as the immediate reward R(t) plus the discounted predicted future value γ·V(t+1), minus the current value prediction V(t) (Eqn (1)):
δ(t) = R(t) + γ · V(t+1) − V(t)     ----- (1)
• The prediction error δ(t) is used to update the old value prediction (Eqn (2)):
V(t)new = V(t)old + α · δ(t)     ----- (2)
• Alpha (α): the learning rate.
It shows how much our estimates should be adjusted based on the error. This rate varies between 0 and 1.
• Gamma (γ): the discount rate.
It indicates how much future rewards are valued.
• The (exponential) discount factor 0 < γ < 1 in Eqn (1) accounts for the fact that humans (and other animals) tend to discount the value of future rewards.
• The learning rate 0 < α < 1 in Eqn (2) determines how much a specific event affects future value predictions. A learning rate close to 1 would suggest that the most recent outcome has a strong effect on the value prediction.
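A minimal TD(0) sketch of Eqns (1) and (2); the trajectory, α and γ below are assumed values chosen only for illustration.

# TD(0) value update: delta = R(t) + gamma * V(t+1) - V(t);  V(t) += alpha * delta
alpha, gamma = 0.1, 0.9          # learning rate and discount factor (assumed)
V = {"A": 0.0, "B": 0.0, "C": 0.0}

# one observed trajectory of (state, reward, next_state) transitions (assumed)
trajectory = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, None)]

for s, r, s_next in trajectory:
    v_next = V[s_next] if s_next is not None else 0.0   # terminal state has value 0
    delta = r + gamma * v_next - V[s]                    # Eqn (1): prediction error
    V[s] += alpha * delta                                # Eqn (2): value update
print(V)   # after one pass only V["C"] has moved: {'A': 0.0, 'B': 0.0, 'C': 0.1}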

Q-learning
• Q-learning is a reinforcement learning algorithm that finds an optimal action-selection policy for any finite Markov decision process (MDP).
• It helps an agent learn to maximize the total reward over time through repeated interactions with the environment, even when the model of that environment is not known.
How Does Q-Learning Work?
1. Learning and updating Q-values:
• The algorithm maintains a table of Q-values for each state-action pair.
• These Q-values represent the expected utility of taking a given action in a given state and following the optimal policy after that.
• The Q-values are initialized arbitrarily and are updated iteratively using the experiences gathered by the agent.

Q-learning
2. Q-value update rule:
The Q-values are updated using the formula:
Q(s,a) ← Q(s,a) + α [r + γ · max_a' Q(s',a') − Q(s,a)]
which, with a learning rate of 1, simplifies to:
Q(s,a) = r(s,a) + γ · max_a' Q(s',a')
Where:
s is the current state,
a is the action taken,
r is the reward received after taking action a in state s,
s' is the new state after the action,
a' is any possible action from the new state s',
α is the learning rate (0 < α ≤ 1),
γ is the discount factor (0 ≤ γ < 1).
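The update rule expressed as a single function, a sketch assuming a dictionary keyed by (state, action) pairs; the example states and actions are hypothetical.

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict mapping (state, action) pairs to values."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max over a' of Q(s', a')
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

# example with two assumed actions:
Q = {}
q_update(Q, s="s0", a="right", r=1.0, s_next="s1", actions=["left", "right"])
print(Q)  # {('s0', 'right'): 0.5}  since 0 + 0.5 * (1 + 0.9*0 - 0) = 0.5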

Cont…
3. Policy derivation: The policy determines what action to take in each state and can be derived from the Q-values. Typically, the policy chooses the action with the highest Q-value in each state.
4. Exploration vs. exploitation: Q-learning manages the trade-off between exploration (choosing random actions to discover new strategies) and exploitation (choosing actions based on accumulated knowledge); see the ε-greedy sketch below.
5. Convergence: Under certain conditions, such as ensuring all state-action pairs are visited an infinite number of times, Q-learning converges to the optimal policy and to Q-values that give the maximum expected reward for any state.
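A common way to manage this trade-off is an ε-greedy rule; the slides do not prescribe a specific strategy, so the following is only an illustrative sketch.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (best known action)."""
    if random.random() < epsilon:
        return random.choice(actions)                        # exploration
    return max(actions, key=lambda a: Q[state].get(a, 0.0))  # exploitation: argmax_a Q(s, a)

# usage with an assumed Q-table:
Q = {"s0": {"left": 0.2, "right": 0.7}}
print(epsilon_greedy(Q, "s0", ["left", "right"]))  # usually "right"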

Q Learning Algorithm
For each s, a initialize the table entry Q̂(s,a) to zero.
Observe the current state s.
Do forever:
  Select an action a and execute it.
  Receive the immediate reward r.
  Observe the new state s'.
  Update the table entry for Q̂(s,a) as follows:
    Q̂(s,a) ← r + γ · max_a' Q̂(s',a')
  s ← s'
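The same algorithm as a runnable sketch; the toy chain environment, the 200-episode cap and γ = 0.9 are assumptions made for illustration.

import random

# Toy deterministic chain: states 0..3, actions -1/+1, reaching state 3 gives reward 100 (assumed).
def step(s, a):
    s_next = max(0, min(3, s + a))
    return s_next, (100 if s_next == 3 else 0)

gamma, actions = 0.9, [-1, +1]
Q = [[0.0, 0.0] for _ in range(4)]                 # the table entry Q(s, a), initialized to zero

for episode in range(200):                         # "do forever", truncated for the sketch
    s = 0                                          # observe the current state s
    while s != 3:
        i = random.randrange(len(actions))         # select an action a and execute it
        s_next, r = step(s, actions[i])            # receive reward r, observe the new state s'
        Q[s][i] = r + gamma * max(Q[s_next])       # Q(s,a) <- r + gamma * max_a' Q(s',a')
        s = s_next                                 # s <- s'

print(Q)   # typically converges to [[72.9, 81.0], [72.9, 90.0], [81.0, 100.0], [0.0, 0.0]]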

Problem
Problem 1: Suppose there are 6 rooms as given in the table below; room 3 is the goal and has an instant reward of 100, while rooms not directly connected to the target room have zero reward. Each row contains the instant reward values for one state. Construct the reward matrix and the Q-learning matrix, and calculate the Q-value for each state transition. The learning rate is 0.9.
Solution: Reward matrix and Q-learning matrix:
Reward matrix R (rows = current state, columns = action / next room; -1 marks a missing connection):

        1    2    3    4    5    6
  1    -1    0   -1    0   -1   -1
  2     0   -1  100   -1    0   -1
  3    -1   -1    0   -1   -1   -1
  4     0   -1   -1   -1    0   -1
  5    -1    0   -1    0   -1    0
  6    -1   -1  100   -1    0   -1

Q-learning matrix Q (initialized to zero):

        1    2    3    4    5    6
  1     0    0    0    0    0    0
  2     0    0    0    0    0    0
  3     0    0    0    0    0    0
  4     0    0    0    0    0    0
  5     0    0    0    0    0    0
  6     0    0    0    0    0    0

Cont…
Q(state, action) = R(state, action) + 0.9 * max[Q(next state, all actions)]
Q(s,a) = r(s,a) + γ · max_a' Q(s',a')   (the 0.9 given in the problem plays the role of γ in this simplified update)
At state 3: Q(2,3) = R(2,3) + 0.9 * max [Q(3,3)]
= 100 + 0.9 * max [0]
= 100 + 0 = 100
Q(6,3) = R(6,3) + 0.9 * max [Q(3,3)]
= 100 + 0.9 * max [0]
= 100 + 0 = 100
At state 2: Q(1,2) = R(1,2) + 0.9 * max [Q(2,3), Q(2,5)]
= 0 + 0.9 * max [100,0]
= 0.9*100 = 90
At state 1: Q(2,1) = R(2,1) + 0.9 * max [Q(1,2), Q(1,4)]
= 0 + 0.9 * max [90,0]
= 0.9*90 = 81
Q(4,1) = R(4,1) + 0.9 * max [Q(1,2), Q(1,4)]
= 0 + 0.9 * max [90,0]
= 0.9*90=81
State transition diagram (figure omitted).

Cont…
At state 4: Q(1,4) = R(1,4) + 0.9 * max [Q(4,1), Q(4,5)]
= 0 + 0.9 * max [81, 0]
= 0.9*81 = 72.9
At state 2: Q(5,2) = R(5,2) + 0.9 * max [Q(2,1), Q(2,3), Q(2,5)]
= 0 + 0.9 * max [81,100,0]
= 0.9*100 = 90
At state 4: Q(5,4) = R(5,4) + 0.9 * max [Q(4,1), Q(4,5)]
= 0 + 0.9 * max [81, 0]
= 0.9*81 = 72.9
At state 5: Q(2,5) = R(2,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72.9, 0]
= 0.9*90 = 81
Q(4,5) = R(4,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72.9, 0]
= 0.9*90 = 81

Cont…
At state 5: Q(6,5) = R(6,5) + 0.9 * max [Q(5,2), Q(5,4), Q(5,6)]
= 0 + 0.9 * max [90, 72.9, 0]
= 0.9*90 = 81
At state 6: Q(5,6) = R(5,6) + 0.9 * max [Q(6,3), Q(6,5)]
= 0 + 0.9 * max [100,81]
= 0.9*100 = 90

Cont…
Q(s,a) values: updated Q-matrix

        1     2     3     4     5     6
  1     0    90     0   72.9    0     0
  2    81     0   100     0    81     0
  3     0     0     0     0     0     0
  4    81     0     0     0    81     0
  5     0    90     0   72.9    0    90
  6     0     0   100     0    81     0

V*(s) values and one optimal policy (figure omitted).
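The matrix above can be reproduced by sweeping the simplified update Q(s,a) = R(s,a) + 0.9 · max Q(s',·) over the reward matrix of Problem 1 until the values stop changing; this sweep-until-convergence loop is only an illustrative shortcut, whereas the slides compute the same entries one transition at a time.

# Reward matrix from Problem 1; -1 marks transitions that are not allowed.
R = [
    [-1,  0,  -1,  0, -1, -1],
    [ 0, -1, 100, -1,  0, -1],
    [-1, -1,   0, -1, -1, -1],
    [ 0, -1,  -1, -1,  0, -1],
    [-1,  0,  -1,  0, -1,  0],
    [-1, -1, 100, -1,  0, -1],
]
gamma = 0.9
n = len(R)
Q = [[0.0] * n for _ in range(n)]

changed = True
while changed:                                 # sweep all valid (state, action) pairs until convergence
    changed = False
    for s in range(n):
        for a in range(n):
            if R[s][a] == -1:                  # skip missing edges
                continue
            new = R[s][a] + gamma * max(Q[a])  # here the action index is also the next room
            if abs(new - Q[s][a]) > 1e-9:
                Q[s][a], changed = new, True

for row in Q:
    print([round(v, 1) for v in row])
# rows 1..6: [0, 90, 0, 72.9, 0, 0], [81, 0, 100, 0, 81, 0], [0]*6,
#            [81, 0, 0, 0, 81, 0], [0, 90, 0, 72.9, 0, 90], [0, 0, 100, 0, 81, 0]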

Cont…
Problem 2: Suppose we have 5 rooms in a building connected by doors as shown in the figure below (omitted). We number each room 0 through 4; the outside of the building can be thought of as one big room (5). Note that doors 1 and 4 lead into the building from room 5 (outside). The goal room is number 5. The doors that lead immediately to the goal have an instant reward of 100; other rooms not directly connected to the target room have 0 reward. Construct the reward matrix and the Q-learning matrix, and calculate the Q-value for each state transition. The learning rate is 0.8.

Cont…
Solution:
Reward matrix R (rows = state, columns = action; -1 marks a missing door):

        0    1    2    3    4    5
  0    -1   -1   -1   -1    0   -1
  1    -1   -1   -1    0   -1  100
  2    -1   -1   -1    0   -1   -1
  3    -1    0    0   -1    0   -1
  4     0   -1   -1    0   -1  100
  5    -1    0   -1   -1    0  100

Q-learning matrix Q (initialized to zero):

        0    1    2    3    4    5
  0     0    0    0    0    0    0
  1     0    0    0    0    0    0
  2     0    0    0    0    0    0
  3     0    0    0    0    0    0
  4     0    0    0    0    0    0
  5     0    0    0    0    0    0

Cont…
• Now let's imagine what would happen if our agent were in state 5 (the next state).
• Look at the 6th row of the reward matrix R (i.e., state 5).
• It has 3 possible actions: go to state 1, 4, or 5.
Q(s,a) = r(s,a) + γ · max_a' Q(s',a')
Q(state, action) = R(state, action) + 0.8 * max[Q(next state, all actions)]
At state 5:
Q(1,5) = R(1,5) + 0.8 * max [Q(5,1), Q(5,4)]
= 100 + 0.8 * max [0, 0]
= 100 + 0.8*0 = 100
Q(4,5) = R(4,5) + 0.8 * max [Q(5,1), Q(5,4)]
= 100 + 0.8 * max [0, 0] = 100
At state 1:
Now we imagine that we are in state 1 (the next state).
It has 2 possible actions: go to state 3 or state 5.
Then we compute the Q-value:
Q(3,1) = R(3,1) + 0.8 * max [Q(1,3), Q(1,5)]
= 0 + 0.8 * max [0, 100]
= 0.8*100 = 80

Cont…
Q(5,1) = R(5,1) + 0.8 * max [Q(1,3), Q(1,5)]
= 0 + 0.8 * max [64, 100]
= 0.8*100 = 80
At state 3:
Q(1,3) = R(1,3) + 0.8 * max [Q(3,1), Q(3,2), Q(3,4)]
= 0 + 0.8 * max [80, 0, 0]
= 0+ 0.8*80 = 64
Q(4,3) = R(4,3) + 0.8 * max [Q(3,4), Q(3,2), Q(3,1)]
= 0 + 0.8 * max [0, 0, 80]
= 0+ 0.8*80 = 64
Q(2,3) = R(2,3) + 0.8 * max [Q(3,2), Q(3,1), Q(3,4)]
= 0 + 0.8 * max [0, 80, 0]
= 0+ 0.8*80 = 64
At state 4:
Q(5, 4) = R(5, 4) + 0.8 * max [Q(4,5), Q(4,3), Q(4, 0)]
= 0 + 0.8 * max [100, 64, 0]
= 80

Cont…
Q(3, 4) = R(3, 4) + 0.8 * max [Q(4,3), Q(4,5), Q(4, 0)]
= 0 + 0.8 * max [64,100, 0]= 80
Q(0, 4) = R(0, 4) + 0.8 * max [Q(4,0), Q(4,3), Q(4,5)]
= 0 + 0.8 * max [0, 0, 100] = 80
At state 2:
Q(3,2) = R(3,2) + 0.8 * max [Q(2,3)]
= 0 + 0.8 * max [64] = 51.2
At state 0:
Q(4, 0) = R(4,0) + 0.8*max[Q(0,4)]
= 0 + 0.8 * 80 = 64
Final updated Q-learning matrix:

        0    1    2     3    4    5
  0     0    0    0     0   80    0
  1     0    0    0    64    0  100
  2     0    0    0    64    0    0
  3     0   80   51.2   0   80    0
  4    64    0    0    64    0  100
  5     0   80    0     0   80  100

Updated state diagram (figure omitted).
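Following step 3 above (policy derivation), the greedy policy simply picks the largest Q-value in each row of this matrix; a short sketch, where the path-printing helper is an illustration and not part of the slides.

# Final Q-matrix for Problem 2 (rows/columns are rooms 0..5, goal is room 5).
Q = [
    [ 0,  0,  0,    0, 80,   0],
    [ 0,  0,  0,   64,  0, 100],
    [ 0,  0,  0,   64,  0,   0],
    [ 0, 80, 51.2,  0, 80,   0],
    [64,  0,  0,   64,  0, 100],
    [ 0, 80,  0,    0, 80, 100],
]

def greedy_path(start, goal=5):
    """Follow the action with the highest Q-value until the goal room is reached."""
    path = [start]
    while path[-1] != goal:
        s = path[-1]
        path.append(Q[s].index(max(Q[s])))   # argmax over actions = next room (ties go to the first max)
    return path

print(greedy_path(2))   # e.g. [2, 3, 1, 5]: via room 3 and room 1 to the outside (room 5)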

Cont…
Case Study:
• Artificial Intelligence Powering Google Products
• Recent AI Tools Leveraged by Tesla
• AI for Facebook
• Robo-Banking: Artificial Intelligence at JPMorgan Chase
• Audio AI
• A Machine Learning Approach: Building a Hotel Recommendation Engine