Reinforcement Learning - Learning from Experience like a Human

PAWDeutschland 413 views 46 slides Nov 29, 2018

About This Presentation

Alongside supervised and unsupervised learning, reinforcement learning is a major machine learning technology with huge potential in a broad field of applications such as robotics, autonomous driving, gaming and general control. This talk describes the major concepts, algorithms and software environmen...


Slide Content

Reinforcement Learning
Learning from experience like a human …
Nokia Bell Labs / Norbert Kraft

Introduction Nokia Bell Labs
Research Activities
• Access
• End to End Network & Service Automation
• Application Platforms & Software Systems
• Standardization
• Smart Network Fabric
• Algorithms, Analytics & Augmented Intelligence
• Emerging Materials, Components and Devices

Taxonomy of Machine Learning
From Analysis to Full Autonomous Control
• Analyze: Descriptive Analytics (What happened?)
• Predict: Predictive Analytics (What will happen?)
• Control: Prescriptive Analytics (Make it happen!)
Figure: Difficulty vs. Value, with the stages Control, Predict, Analyze, Monitoring, Sensing
11/12/183 Reinforcement Learning -learning from experience like a human

Don’t analyze why you failed, just do it right …

Machine Learning
Basic Ideas …
Unsupervised learning
• Find anomalies, similarities, groups: find groups with similar attributes, which are not necessarily self-explaining …
• Reduce complexity of high dimensional features: generate a limited set of new features with virtual meaning
Supervised learning
• Learn from labelled observations: train human knowledge by taking a labelled list (… not always existing)
Reinforcement learning
• Learn from experience: learn system behavior with experiments (learning by doing)

Machine Learning
Basic Ideas …
Unsupervised Learning (In)
• Find groups with similar attributes, which are not necessarily self-explaining …
• Generate a limited set of new features with virtual meaning
Supervised Learning (In, Out, Target, Error)
• Train human knowledge by taking a labelled list … not always existing
Reinforcement Learning (In, Out, Reward)
• Learning from experience with reward by trial & error

Machine Learning Concepts & Deep Learning
Machine Learning
• Unsupervised learning: find anomalies, similarities, groups; reduce complexity of high dimensional features
• Supervised learning: learn from labelled observations
• Reinforcement learning: learn from experience
Deep Learning

How does this map to a human brain?
• Basal Ganglia: Reinforcement Learning (Reward ≈ Dopamine)
• Cerebral Cortex: Unsupervised Learning
• Cerebellum: Supervised Learning

Machine Learning
One Way of Human Thinking/Learning
Observations, Actions, Reward
Find action to optimize reward …

The Human Way of Thinking/Learning
Learning is Trial and Error … again and again???

Differences between the Human Brain and Neural Networks
Characteristic | Human Brain | Neural Network
Feed forward | Yes | Yes
Feed backward | Yes | Yes (only RNNs …)
Complexity | 10^11 neurons | 10^9 transistors
Switching speed | 10^-3 secs. | 10^-9 secs.
Structure | Hierarchical | Flat & simplistic
Operation | Massively parallel | Still serial & parallel
How about ??
• Intuition
• Instinct
• Gut feeling
• Mind
• Intellect

Most people have an idea of dangerous animals without learning it …

Reinforcement Learning
Application Areas
• Robotics
• Control physical systems
• Games
• Optimization
• General Compute Problems
• Router/Radio Channel Assignment
• Power optimization
• Scheduling algorithms
• Admission Control
• Anomaly detection

Reinforcement Learning
Universal Self Learning with Autonomous Algorithms
Universal
Autonomous
Self-learning
Algorithms

Reinforcement Learning
Components & Interaction
• No pre-defined knowledge
• Starts with random action
• Trial & error learning
• Find solution with optimum reward
• Agent/environment states are hidden
• Controller receives observations and reward, and triggers action
• System receives action, goes into next state, generates observations and reward
Diagram: Controller (agent) sends actions to System (environment); System (environment) returns observations and reward
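
The controller/system interaction above can be sketched as a plain loop. This is a minimal illustration, not code from the talk: `CoinGuessEnv`, `RandomAgent` and `run_episode` are hypothetical names, and the toy task (guess a hidden coin side) is an assumption chosen only to keep the example small.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy system (environment): the agent guesses a hidden coin side."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.hidden = 0  # environment state, never shown to the agent
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps = 0
        self.hidden = self.rng.randint(0, 1)
        return self.steps  # observation: only the step counter

    def step(self, action):
        """Receive an action, move to the next state, emit observation and reward."""
        reward = 1.0 if action == self.hidden else 0.0
        self.hidden = self.rng.randint(0, 1)
        self.steps += 1
        done = self.steps >= self.episode_length
        return self.steps, reward, done


class RandomAgent:
    """Controller with no pre-defined knowledge: it only tries random actions."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation, reward):
        return self.rng.randint(0, 1)


def run_episode(env, agent):
    """Run one episode and return the accumulated reward."""
    observation, reward, done = env.reset(), 0.0, False
    total = 0.0
    while not done:
        action = agent.act(observation, reward)
        observation, reward, done = env.step(action)
        total += reward
    return total
```

A learning agent would replace `RandomAgent.act` with something that uses the received reward; the loop itself stays the same.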

How does this translate to real environments?
Examples: Robots, Games, Telecom
• Agent: Human, Neural Net, Decision Tree, Coded Algorithm
• Environment: Robot, Machine, Chess / Go game, Telecommunication network, Problem
• Actions: Go left/right, stop, move pawn to, set parameter value to
• Observations: Car position/speed, (chess) piece positions/value, temperature, performance KPI
• Reward: Power consumption, number/value of (chess) pieces, game score, call success rate

Reinforcement Learning
Examples: Telecom
• Controller (agent): Change Parameter
• System (environment): KPIs
• High Level KPIs: CEI, Power

Some theory …

Markov Properties
Fully observable process
• The current state completely characterizes the process
The future is independent of the past given the present
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
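
Written as a formula, "the future is independent of the past given the present" is commonly stated as below. The notation ($S_t$ for the state at step $t$) is the standard textbook one and is assumed here, since the slide gives the property only in words.

```latex
\mathbb{P}\left[ S_{t+1} \mid S_t \right] = \mathbb{P}\left[ S_{t+1} \mid S_1, \dots, S_t \right]
```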

Reinforcement Learning
Formalisms
Observability
• Full: $S^a_t = S^e_t$
• Partial: $S^a_t \neq S^e_t$
Agent functions
• Policy (predicted action based on state): $a = \pi(s)$
• Value (pred. of future reward): $v_\pi(s,a) = \mathbb{E}_\pi[R_{t+1}, R_{t+2}, \dots]$
• Model: build transition model of the system
Diagram: Controller (agent) with agent state $S^a_t$ and the functions Value, Policy, Model; System (environment) with environment state $S^e_t$; exchanged signals $O_t$, $R_t$, $A_t$

Agent Functions are Optional
Agent Types
• Value Based: no policy (implicit), value function
• Policy Based: policy function, no value function
• Actor Critic: policy function, value function
• Model Free: policy and/or value function, no model
• Model Based: policy and/or value function, model

Reinforcement Learning
Exploration vs. Exploitation
Exploration
• Find more information about the environment …
• Try random action
• Use action not used before in this state
• …
Exploitation
• Exploit already known information to maximize reward …
• Use action promising most direct reward
• Use action promising most future reward
• …
Exploration-Exploitation Dilemma
Figure: Exploration vs. Exploitation, with regions labelled Optimal solution, No convergence, Sub optimum

Reinforcement Learning
Exploration/Exploitation Strategies: (dynamic) $\epsilon$-Greedy
Use certain amount of random actions
• $1 - \mathrm{rand}(0,1) > \epsilon \rightarrow Q^*(s,a)$
• $1 - \mathrm{rand}(0,1) \le \epsilon \rightarrow \mathrm{rand}(a)$
Decrease $\epsilon$ over time
• $\epsilon_{t+1} = \beta \cdot \epsilon_t$, with decay factor $\beta$
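
A minimal sketch of the (dynamic) ε-greedy rule, assuming Q values for the current state are held in a plain list. The function names, the default decay factor `beta=0.99` and the lower bound `floor` are illustrative assumptions, not values from the talk. The code draws `rand < ε` for the random branch, which selects a random action with the same probability as the rule on the slide.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the best known."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random action
    # exploit: action with the highest Q value in this state
    return max(range(len(q_values)), key=q_values.__getitem__)


def decay(epsilon, beta=0.99, floor=0.01):
    """Shrink epsilon multiplicatively; `floor` (an assumption) keeps some exploration."""
    return max(floor, beta * epsilon)
```

Typical use: call `epsilon_greedy` at every step and `decay` once per episode, so the agent explores a lot early and exploits more later.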

Reinforcement Learning & Neural Nets
Limited vs. Unlimited spaces for Observations, Actions
Limited observation space
• $O_{t1} = f(x_{t1}, y_{t1}, z_{t1}, \dots)$ with $x, y, z$ from a discrete set
• $A_{t1} = \mathrm{Map}(x_{t1}, y_{t1}, z_{t1}, \dots)$
• Algorithms / tables, Policy / Mapping
• Supervised Learning: Decision Trees, Lin/log Regression
Unlimited observation space
• $O_{t1} = f(x_{t1}, y_{t1}, z_{t1}, \dots)$ with $x, y, z \in \mathbb{R}$
• $A_{t1}$ = model.predict($x_{t1}, y_{t1}, z_{t1}, \dots$)
• Supervised Learning: Neural Networks, Deep Learning
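
One way to see the difference: a limited observation space allows a plain lookup table as the policy, while a real-valued observation first has to be mapped into a limited set of bins (or handed to a trained model instead). The sketch below shows the table variant. The binning approach, the function names and the table contents are assumptions for illustration and are not taken from the talk.

```python
def discretize(value, low, high, bins):
    """Map a real value in [low, high] to a bin index in 0 .. bins-1."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper edge falls into the last bin


# Hypothetical policy table: bin of the cart position -> action
policy_table = {0: "right", 1: "right", 2: "left", 3: "left"}


def table_policy(position):
    """Policy as a mapping: look up the action for the discretized observation."""
    return policy_table[discretize(position, -2.4, 2.4, bins=4)]
```

With many real-valued inputs the table grows quickly, which is where a fitted model with a `predict` call takes over.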

Reinforcement Learning
Discounted Future Reward
• Discounted reward: $\gamma^T$
• Discount factor $\gamma \in [0, 1]$
- $\gamma \approx 1$: ‘far-sighted’ evaluation
- $\gamma \approx 0$: ‘myopic’ evaluation
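
A short sketch of how a discount factor turns a sequence of rewards into one number, assuming the usual sum in which the reward received k steps ahead is weighted by γ^k. The function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total  # fold from the last reward backwards
    return total
```

With `gamma=0.0` only the immediate reward counts (myopic); with `gamma=1.0` all rewards count equally (far-sighted).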

Reinforcement Learning
Temporal-Difference Learning: Q Learning
Result of Q-Function represents actual & future reward
• Based on current state (s) and action (a) applied
• Corrected by maximal achievable reward in state $s_{t+1}$
• $Q(s_t, a_t) = \max R_{t+1}$
Learning is done by continuous update of Q
• $\alpha$: learning rate (adoption rate for learned knowledge)
• $\gamma$: discount factor for future reward
• $Q(s_t, a_t) = (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right)$
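
The update rule can be sketched for the tabular case as follows. Storing Q in a dictionary keyed by `(state, action)` and the default values for `alpha` and `gamma` are assumptions made for this example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # maximal achievable Q value in the next state
    best_next = max(q[(next_state, a)] for a in actions)
    # blend old knowledge with the newly observed reward plus discounted future reward
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (
        reward + gamma * best_next
    )
    return q[(state, action)]


# Usage: unknown (state, action) pairs start at 0.0
q_table = defaultdict(float)
```

A call such as `q_update(q_table, "s0", "right", 1.0, "s1", ["left", "right"])` moves `Q("s0", "right")` a fraction `alpha` of the way towards the observed target.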

Reinforcement Learning
Deep Q networks
Neural networks for large observation & action spaces
• Can work with pixel based observations
• Large amount of setup values (actions)
Different variants
• One feed forward per (s, a) combination: $Q(s,a)$
• One feed forward per state: $Q(s)$
Diagram:
1. State, Action → Neural Net → Q Value
2. State → Neural Net → Q Value(a1) … Q Value(an)

Reinforcement Learning
Deep Q networks Update Rules
Given transition $\langle s, a, r, s' \rangle$
1. Do feed forward for all actions in state s
2. Get max. Q value for all actions in state s'
3. Set target value for $Q(s,a) = r + \gamma \max_{a'} Q(s', a')$
4. Update weights using back propagation
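
The four steps can be traced in a small pure-Python sketch. To stay self-contained it swaps the neural network for a linear function with one weight row per action (an assumption, standing in for the "one feed forward per state" variant), so step 4 becomes a single gradient step instead of full back propagation. All names and default values are illustrative.

```python
def q_values(weights, state):
    """Forward pass: one Q value per action (linear stand-in for a neural net)."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def dqn_step(weights, transition, gamma=0.9, lr=0.1):
    """Apply the four update steps to one transition <s, a, r, s'>."""
    state, action, reward, next_state = transition
    predicted = q_values(weights, state)         # 1. feed forward for state s
    best_next = max(q_values(weights, next_state))  # 2. max Q over actions in s'
    target = reward + gamma * best_next          # 3. target for Q(s, a)
    error = target - predicted[action]
    # 4. gradient step on 0.5 * error**2, only for the action that was taken
    weights[action] = [w + lr * error * x for w, x in zip(weights[action], state)]
    return target, error
```

Only the output of the chosen action is pulled towards the target; the Q values of the other actions are left unchanged by this transition.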

Reinforcement Learning
Q Learning Pre-requisites / Limitations
Small changes in Q-value could cause a totally different action selection.
No convergence guarantee.
Tries to find deterministic value function, some problems require a stochastic value function

Some examples …

OpenAI
An Agent Development Environment
Ready to use environments for agent & algorithm development
• Computing problems
• Games
• Robots
• 2D Problems
Find the optimal model/policy/value function for a problem
• Model based (unlimited action, observation, reward space)
• Value/Policy based (limited action, observation, reward space)
Functions:
• $O_{t0}$ = reset()
• $O_{t1}, R_{t1}, S^e_t$ = step($A_{t0}$)
Diagram: Controller (agent) with Value, Policy, Model and agent state $S^a_t$; OpenAI provides the System (environment) with state $S^e_t$; exchanged signals $O_t$, $R_t$, $A_t$

OpenAI
Environments for Advanced Algorithm Development
Acrobot, CartPole, Car over mountain, Pendulum, Humanoid stand up, Tennis, Car Race, Lunar Lander, Humanoid Robot, Computational alg.

Reinforcement Learning
Example: CartPole
Balance inverted pendulum
• Simplified for 1 dimension
State
• Cart position [-2.4, 2.4]
• Cart velocity [-inf, inf]
• Pole angle [-41°, 41°]
• Pole velocity at tip [-inf, inf]
Actions
• Impacts cart direction & velocity
• Push cart to left
• Push cart to right
Termination
• Cart position at boundary (fails)
• Angle outside [-12°, 12°] (fails)
• More than 200 steps (terminates successfully)
Reward
• +1 for every step not terminating
By using random actions pole returns to stable state
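
To make the state, termination and reward rules concrete, here is a pure-Python stand-in with the same `reset()` / `step()` shape. It is a sketch, not the OpenAI implementation: the physical constants (masses, pole length, force, time step) and the start range are assumed values typical for the classic cart-pole benchmark, and the class and function names are hypothetical.

```python
import math
import random


class CartPoleSketch:
    """Illustrative cart-pole; constants below are assumptions, not from the slides."""

    GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
    HALF_LENGTH, FORCE, DT = 0.5, 10.0, 0.02
    X_LIMIT = 2.4                    # cart position boundary
    ANGLE_LIMIT = math.radians(12)   # pole angle boundary
    MAX_STEPS = 200

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = [0.0, 0.0, 0.0, 0.0]
        self.steps = 0

    def reset(self):
        """Start near the upright position; returns (x, x_dot, theta, theta_dot)."""
        self.state = [self.rng.uniform(-0.05, 0.05) for _ in range(4)]
        self.steps = 0
        return tuple(self.state)

    def step(self, action):
        """action 0 pushes the cart left, action 1 pushes it right."""
        x, x_dot, theta, theta_dot = self.state
        force = self.FORCE if action == 1 else -self.FORCE
        total_mass = self.CART_MASS + self.POLE_MASS
        pole_ml = self.POLE_MASS * self.HALF_LENGTH
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
        theta_acc = (self.GRAVITY * sin_t - cos_t * temp) / (
            self.HALF_LENGTH * (4.0 / 3.0 - self.POLE_MASS * cos_t ** 2 / total_mass)
        )
        x_acc = temp - pole_ml * theta_acc * cos_t / total_mass
        # simple Euler integration over one time step
        x += self.DT * x_dot
        x_dot += self.DT * x_acc
        theta += self.DT * theta_dot
        theta_dot += self.DT * theta_acc
        self.state = [x, x_dot, theta, theta_dot]
        self.steps += 1
        failed = abs(x) > self.X_LIMIT or abs(theta) > self.ANGLE_LIMIT
        done = failed or self.steps >= self.MAX_STEPS
        reward = 0.0 if failed else 1.0  # +1 for every step not terminating in failure
        return tuple(self.state), reward, done


def random_episode(env, rng):
    """Run one episode with random pushes and return the collected reward."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done = env.step(rng.randint(0, 1))
        total += reward
    return total
```

Calling `random_episode(CartPoleSketch(), random.Random(1))` gives a baseline score for random actions that a learned policy should beat.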

Reinforcement Learning
Example: CartPole - Solving with model based algorithm (RandomForest)

Reinforcement Learning
Example: CartPole - Solving with model based algorithm (Neural Network)

Reinforcement Learning
Example: Copy (Algorithm environment)
Copy characters from observation tape to output tape
• Various character sets [A..[
• Different string length increasing during different runs
State
• Character observed at read head
Actions
• Move read head left or right
• Copy character to output tape or not
Termination
• Wrong character written (fails)
• Timeout after some amount of unsuccessful trials (fails)
• All characters written to output tape (terminates successfully)
Reward
• +1 for correct character written
• -0.5 for wrong character written
• 0 for plain head movements

Reinforcement Learning
Example: Copy (Algorithm environment) - Solved with discrete Q-Learning
Successfully learned to copy strings of random length and content

Reinforcement Learning
Example: Mountain Car
Underpowered car has to get across a hill
• You have to go backward to get enough swing
State
• Position on x axis
• Speed
Actions
• Push forward
• Push backward
• Do nothing
Termination
• Timeout after 200 steps
• Car reaches the flag on the hill
Reward
• -1 for every step
• +0.5 for right push & speed > 0
• +0.5 for left push & speed < 0
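
The reward rules listed above can be written as one small function. The action labels `"forward"`, `"backward"` and `"none"` are assumptions, with "forward" taken to mean a push to the right; the function name is illustrative.

```python
def shaped_reward(action, speed):
    """Step penalty of -1 plus a +0.5 bonus when the push matches the direction of travel."""
    reward = -1.0
    if action == "forward" and speed > 0:
        reward += 0.5   # right push while moving right
    elif action == "backward" and speed < 0:
        reward += 0.5   # left push while moving left
    return reward
```

The bonus rewards building up swing in either direction, which hints the agent towards going backward first instead of only pushing towards the flag.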

Reinforcement Learning
Mountain Car Videos
Random Walk, Training phase

Reinforcement Learning
Example: Lunar Lander
Land space ship on the moon
• Land in landing zone
• Surface and start condition change
State
• 8 real values (position, angle, speed…)
Actions
• Fire main engine
• Fire left/right engine
Termination
• Move to landing pad with zero speed
• Can also land outside landing pad
Reward
• Firing main engine -0.3 (unlimited fuel)
• Ground contact +10
• Landing in pad 100-140

Reinforcement Learning
Example: Lunar Lander
Random Walk, Training phase

Reinforcement Learning
Example: Lunar Lander

Some final words …

Reinforcement Learning
Problems & Research Areas
Delayed Rewards
• Direct reward gives insufficient feedback on success strategy
Continuous/Large observation states
• e.g. states using pictures & complex sensors
• Requires deep learning
Exploration/exploitation strategies
• Slows down solution convergence
• Find (sub-)optimal solution
Meta solution strategies
• Only a mix of policy, value, model and q-function solves most problems
• Standard supervised algorithms do not solve the problem

Reinforcement Learning
Takeaways …

Thank you
Questions & Answers
[email protected]