Reinforcement Learning in machine learning

Slide Content

Reinforcement Learning (Muhumuza Ambroze & Seguya Ronald Chris)

What is Reinforcement Learning? RL is an area of machine learning that focuses on how you, or something else (known as an agent), might act in an environment in order to maximize some given reward. Reinforcement learning algorithms study the behavior of agents in such environments and learn to optimize that behavior. An agent learns by trial and error.

How is RL different from supervised and unsupervised learning? Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior. And while the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent.

Markov Decision Process (MDP) Reinforcement learning models real-world problems using the MDP (Markov Decision Process) formalism. A Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling sequential decision making in situations where outcomes are partly random and partly under the control of a decision maker.

Components of MDP In an MDP, we have a decision maker, called an agent, that interacts with the environment it's placed in. These interactions occur sequentially over time. At each time step, the agent gets some representation of the environment's state. Given this representation, the agent selects an action to take. The environment then transitions into a new state, and the agent is given a reward as a consequence of the previous action. Components of an MDP: agent, environment, state, action, reward.
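To make the loop concrete, here is a minimal Python sketch of the agent-environment interaction described above. The `env` and `agent` objects and their methods (`reset`, `step`, `select_action`, `learn`) are hypothetical stand-ins for illustration, not a specific library's API:

```python
# Minimal sketch of the MDP interaction loop: state -> action -> reward -> next state.
# `env` and `agent` are hypothetical objects used only for illustration.

def run_episode(env, agent):
    state = env.reset()                                   # agent receives the initial state
    done = False
    total_reward = 0.0
    while not done:
        action = agent.select_action(state)               # agent picks an action for this state
        next_state, reward, done = env.step(action)       # environment transitions and emits a reward
        agent.learn(state, action, reward, next_state)    # agent updates from (s, a, r, s')
        total_reward += reward
        state = next_state
    return total_reward                                   # cumulative reward the agent tries to maximize
```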

Throughout this process, the agent’s goal is to maximize the total amount of rewards that it receives from taking actions in given states. This means that the agent wants to maximize not just the immediate reward, but the cumulative rewards it receives over time.

MDP Notation
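The notation itself appears as an image in the original slides and is not part of this transcript; the standard MDP notation it presumably follows writes the interaction as a trajectory of states, actions, and rewards,

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots

with S_t \in \mathcal{S}, A_t \in \mathcal{A}(S_t), and R_t \in \mathcal{R} \subset \mathbb{R}.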

Expected Return (sum of rewards) The goal of an agent in an MDP is to maximize its cumulative rewards. This is what drives the decisions it takes.
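In standard form (a reconstruction, since the slide's own formula is not in the transcript), the return at time t is the sum of the rewards received after time t, up to a final time step T:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T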

Discounted return Rather than the agent's goal being to maximize the expected return of rewards, it will instead be to maximize the expected discounted return of rewards. The fact that the discount rate γ is between 0 and 1 is a mathematical trick that makes an infinite sum finite. This helps in proving the convergence of certain algorithms.
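The discounted return weights each future reward by an increasing power of γ (again written in its standard form, as the slide's formula is not in the transcript):

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}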

How good is a state or action? Policies and value functions First, we need a way to describe how an agent chooses which action to take in each state (the notion of policies). Secondly, we need a way to measure how good a given state or action is for the agent, since selecting one action over another in a given state may increase or decrease the agent's rewards. This helps our agent decide which actions to take in which states (the notion of value functions).

Policy A policy (denoted as π) is a function that maps a given state to probabilities of selecting each possible action from that state. Plainly, it tells us which action to take in each state. If an agent follows policy π at time t, then π(a|s) is the probability that At = a if St = s. This means that, at time t, under policy π, the probability of taking action a in state s is π(a|s).
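In symbols (standard notation; the slide's own rendering is not in the transcript):

\pi(a \mid s) \doteq \Pr(A_t = a \mid S_t = s)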

Value functions Value functions are functions of states, or of state-action pairs, that estimate how good it is for an agent to be in a given state, or how good it is for the agent to perform a given action in a given state. Here, "how good" is expressed in terms of expected return. Since the way an agent acts is influenced by the policy it's following, value functions are defined with respect to policies. Being functions of either states or state-action pairs, they come in two types: state-value functions and action-value functions, respectively.
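The state-value function for policy π, written in its standard form (the slide's formula is not in the transcript), is the expected return from starting in state s and following π thereafter:

v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]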

Action-value function Similarly, the action-value function for policy π, denoted as qπ, tells us how good it is for the agent to take any given action from a given state while following policy π. In other words, it gives us the value of an action under π.

Conventionally, the action-value function qπ is referred to as the Q-function, and the output from the function for any given state-action pair is called a Q-value. The letter "Q" is used to represent the quality of taking a given action in a given state.
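In standard notation (again a reconstruction of a formula that appears only as an image in the slides):

q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]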

Optimal Policy The goal of a reinforcement learning algorithm is to find an optimal policy that yields more return to the agent than all other policies. In terms of return, a policy π is considered to be better than or the same as policy π′ if the expected return of π is greater than or equal to the expected return of π′ for all states. Recall that vπ(s) gives the expected return for starting in state s and following π thereafter. A policy that is better than or at least the same as all other policies is called the optimal policy.
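Formally (a standard statement of this ordering):

\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s) \ \text{for all } s \in \mathcal{S}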

Optimal state-value function The optimal policy has an associated optimal state-value function, defined as follows.
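The slide's formula is shown as an image; in standard form, v∗ gives the largest expected return achievable by any policy from each state:

v_*(s) \doteq \max_\pi v_\pi(s)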

Optimal action-value function Similarly, the optimal policy has an optimal action-value function, or optimal Q-function, which we denote as q∗.
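In standard form (again reconstructing a formula shown only as an image), q∗ gives the largest expected return achievable by any policy from each state-action pair:

q_*(s, a) \doteq \max_\pi q_\pi(s, a)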

Bellman optimality equation One fundamental property of q∗ is that it must satisfy the following equation, called the Bellman optimality equation. It states that, for any state-action pair (s, a) at time t, the expected return from starting in state s, selecting action a and following the optimal policy thereafter (i.e., the Q-value of this pair) is the expected reward we get from taking action a in state s, which is Rt+1, plus the maximum expected discounted return that can be achieved from any possible next state-action pair (s′, a′) at time t + 1.
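Written out (a standard form of the equation; the slide's own rendering is not in the transcript):

q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \right]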

Q-Learning Q-learning is a model-free reinforcement learning algorithm used for learning the optimal policy in a Markov Decision Process. The objective of Q-learning is to find a policy that is optimal in the sense that the expected value of the total reward over all successive steps is the maximum achievable. In other words, the goal of Q-learning is to find the optimal policy by learning the optimal Q-values for each state-action pair.

How Q-Learning Works: Value iteration The Q-learning algorithm iteratively updates the Q-values for each state-action pair using the Bellman equation until the Q-function converges to the optimal Q-function, q∗. This approach is called value iteration.
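One common way to write this iterative update (a reconstruction, not the slide's own formula) is to repeatedly apply the Bellman optimality equation to the current estimate qk:

q_{k+1}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_k(S_{t+1}, a') \right]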

Example: The Lizard Game The agent (a lizard) needs to eat as many crickets as possible in the shortest time without running into the bird, which will itself eat the lizard. The lizard has no idea how good any given action is from any given state; it's not aware of anything besides the current state of the environment. The Q-values for each state-action pair will all be initialized to zero, since the lizard knows nothing about the environment at the start. Throughout the game, though, the Q-values will be iteratively updated using value iteration.

Example: The Lizard Game [Reward diagram: tile rewards of +1, +10 (game over at the five-cricket tile), -1, and -10 (game over at the bird tile).]

Q-Table The Q-table stores the Q-values for each state-action pair. The horizontal axis of the table represents the actions, and the vertical axis represents the states. As the Q-table gets updated, in later moves and later episodes, the lizard can look in the Q-table and base its next action on the highest Q-value for the current state.
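A minimal sketch of such a table in Python with NumPy, assuming a hypothetical grid of 9 tiles (states) and 4 moves (actions); rows are states and columns are actions, as on the slide:

```python
import numpy as np

n_states = 9    # assumed number of tiles in the lizard's grid (hypothetical)
n_actions = 4   # assumed moves: left, right, up, down

# All Q-values start at zero because the lizard knows nothing about the environment yet.
q_table = np.zeros((n_states, n_actions))
```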

How does the lizard move? Episodes Now, we’ll set some standard number of episodes that we want the lizard to play. Let’s say we want the lizard to play five episodes. It is during these episodes that the learning process will take place. In each episode, the lizard starts out by choosing an action from the starting state based on the current Q-values in the table. The lizard chooses the action based on which action has the highest Q-value in the Q-table for the current state.

But… the Q-table is initialized with zero Q-values for every state-action pair, so how does the lizard choose which action to take first?

Exploration and Exploitation The agent uses the concepts of exploration and exploitation to choose the starting action. Exploration is the act of exploring the environment to find out information about it. Exploitation is the act of exploiting the information that is already known about the environment in order to maximize the return. These concepts guide how the agent chooses actions not just at the starting point but also in general.

But wait… when and why does the agent choose between exploration and exploitation?

Exploitation alone If the agent uses exploitation alone, it may get stuck in an infinite loop of gaining and losing a reward. Besides, it may never get to the crickets.

Exploration alone If the agent uses exploration alone, then it would miss out on making use of known information that could help to maximize the return. So how do we find a balance between exploration and exploitation?

Epsilon greedy strategy Essentially, in each episode, the lizard starts out by choosing an action from the starting state based on the current Q-value estimates in the Q-table. But since all of the Q-values are initialized to zero, there's no way for the lizard to differentiate between them at the starting state of the first episode. And even for subsequent states, is it really as straightforward as just selecting the action with the highest Q-value for the given state? To strike the balance between exploitation and exploration, we use what is called an epsilon greedy strategy.

How the epsilon greedy strategy works We define an exploration rate ϵ that we initially set to 1. This exploration rate is the probability that our agent will explore the environment rather than exploit it. With ϵ = 1, it is 100% certain that the agent will start out by exploring the environment. As it explores, it updates the Q-table with Q-values. At the start of each new episode, ϵ decays by some rate that we set, so that exploration becomes less and less probable as the agent learns more and more about the environment. The agent becomes "greedy" in terms of exploiting the environment once it has had the opportunity to explore and learn more about it: it chooses the action with the highest Q-value for its current state from the Q-table.
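A minimal Python sketch of epsilon-greedy action selection with a decaying exploration rate, reusing the hypothetical q_table above; the decay schedule and its parameters are assumptions, not values taken from the slides:

```python
import random
import numpy as np

epsilon = 1.0         # initial exploration rate: start by exploring 100% of the time
min_epsilon = 0.01    # assumed floor for the exploration rate
decay_rate = 0.01     # assumed per-episode decay rate

def choose_action(state, q_table, epsilon):
    """Explore with probability epsilon, otherwise exploit the highest Q-value."""
    n_actions = q_table.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)     # explore: random action
    return int(np.argmax(q_table[state]))      # exploit: greedy action for this state

def decay_epsilon(episode):
    """Shrink epsilon toward min_epsilon as the agent completes more episodes."""
    return min_epsilon + (1.0 - min_epsilon) * np.exp(-decay_rate * episode)
```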

To update the Q-value for the action of moving right taken from the previous state, we use the Bellman equation that we highlighted previously.

The learning rate The learning rate is a number between 0 and 1, which can be thought of as how quickly the agent abandons the previous Q-value in the Q-table for a given state-action pair in favor of the new Q-value. For example, suppose we have a Q-value in the Q-table for some arbitrary state-action pair that the agent experienced in a previous time step. If the agent experiences that same state-action pair at a later time step, once it's learned more about the environment, the Q-value will need to be updated to reflect the change in expectations the agent now has for future returns. We don't just overwrite the old Q-value; rather, we use the learning rate as a tool to determine how much we keep of the previously computed Q-value for the given state-action pair versus the new Q-value calculated for the same state-action pair at a later time step. We'll denote the learning rate with the symbol α, and we'll arbitrarily set α = 0.7 for our lizard game example. The higher the learning rate, the more quickly the agent will adopt the new Q-value. For example, if the learning rate is 1, the estimate for the Q-value of a given state-action pair would simply be the newly calculated Q-value, ignoring all previous Q-values calculated for that pair at earlier time steps.

Calculating the new Q-value So, our new Q-value is equal to the sum of our old value and the learned value.
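Written out in the standard form of the Q-learning update (the slide's own formula appears only as an image), with the old value weighted by (1 − α) and the learned value weighted by α:

q^{\text{new}}(s, a) = (1 - \alpha)\, q(s, a) + \alpha \left( R_{t+1} + \gamma \max_{a'} q(s', a') \right)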

Max steps We can also specify a maximum number of steps that our agent can take before the episode auto-terminates. With the way the game is set up right now, termination will only occur if the lizard reaches the state with five crickets or the state with the bird. We could add a condition that says if the lizard hasn't reached termination via either of these two states after 100 steps, then terminate the game after the 100th step.

Let’s see some code now!
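The presentation's actual code is not included in the transcript. Below is a minimal, self-contained sketch of the Q-learning loop described in these slides, for a hypothetical 3×3 lizard grid; the grid layout, tile rewards, discount rate, epsilon schedule, and episode count are all assumptions made for illustration, while α = 0.7 and the 100-step cap come from the slides:

```python
import random
import numpy as np

# Hypothetical 3x3 lizard grid, flattened to states 0..8 (not the slide's exact layout).
# Assumed rewards: +1 for a one-cricket tile, +10 five crickets (terminal),
# -10 bird (terminal), -1 for every other tile.
n_rows, n_cols = 3, 3
n_states, n_actions = n_rows * n_cols, 4   # actions: 0=left, 1=right, 2=up, 3=down
rewards = {4: 1, 2: -10, 8: 10}            # assumed tile rewards; everything else gives -1
terminal_states = {2, 8}

def step(state, action):
    """Move on the grid, clipping at the walls, and return (next_state, reward, done)."""
    row, col = divmod(state, n_cols)
    if action == 0:   col = max(col - 1, 0)
    elif action == 1: col = min(col + 1, n_cols - 1)
    elif action == 2: row = max(row - 1, 0)
    else:             row = min(row + 1, n_rows - 1)
    next_state = row * n_cols + col
    return next_state, rewards.get(next_state, -1), next_state in terminal_states

alpha, gamma = 0.7, 0.99                # learning rate from the slides; discount rate is assumed
epsilon, min_epsilon, decay = 1.0, 0.01, 0.01
num_episodes, max_steps = 1000, 100     # more episodes than the slides' five, so values can converge

q_table = np.zeros((n_states, n_actions))

for episode in range(num_episodes):
    state = 0                                       # assumed starting tile
    for _ in range(max_steps):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: blend the old Q-value with the learned value using alpha.
        q_table[state, action] = (1 - alpha) * q_table[state, action] + \
            alpha * (reward + gamma * np.max(q_table[next_state]))
        state = next_state
        if done:
            break
    # Decay the exploration rate after each episode.
    epsilon = min_epsilon + (1.0 - min_epsilon) * np.exp(-decay * episode)

print(np.round(q_table, 2))   # learned Q-values: rows are states, columns are actions
```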