These slides are intended to show why we shouldn't use a Convolutional Neural Network (CNN) for Natural Language Processing (NLP) tasks and why we should use RNN- or LSTM-based networks instead, along with the mathematical derivation of the backpropagation algorithm in the Recurrent Neural Network (RNN) and the Long Short-Term Memory (LSTM) network.
Why Convolution?
Averaging each pixel with its neighboring values blurs an image.
Taking the difference between a pixel and its neighbors detects edges.
https://docs.gimp.org/en/plug-in-convmatrix.html
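A minimal sketch of these two kernels, assuming SciPy is available and using a random array as a stand-in for a grayscale image (img is a placeholder name):

import numpy as np
from scipy.signal import convolve2d

# Hypothetical grayscale image as a 2-D float array.
img = np.random.rand(64, 64)

# Averaging (box blur) kernel: each output pixel is the mean of its 3x3 neighborhood.
blur_kernel = np.ones((3, 3)) / 9.0

# Difference kernel: subtracts the average of the 8 neighbors from the center pixel,
# so flat regions go to ~0 and edges stand out.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]]) / 8.0

blurred = convolve2d(img, blur_kernel, mode='same', boundary='symm')
edges   = convolve2d(img, edge_kernel, mode='same', boundary='symm')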
What is a Convolutional Neural Network?
CNN
It is several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results (a minimal sketch follows below).
During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
Location Invariance: Let's say you want to classify whether or not there's an elephant in an image. Because you are sliding your filters over the whole image, you don't really care where the elephant occurs.
Compositionality: Each filter composes a local patch of lower-level features into a higher-level representation.
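A minimal NumPy sketch of one such layer in the sentence setting referenced on the next slide: the input is a matrix of word embeddings, and a single filter slides over windows of consecutive words before a ReLU is applied. All names and sizes here (sentence, filter_w, the 3-word window) are illustrative assumptions, not the cited authors' setup.

import numpy as np

np.random.seed(0)
sentence = np.random.randn(10, 50)   # 10 words x 50-dim embeddings (made-up sizes)
filter_w = np.random.randn(3, 50)    # one learned filter spanning windows of 3 words
bias = 0.0

# Slide the filter over every window of 3 consecutive words and apply ReLU,
# producing one feature map per filter (only one filter here, for brevity).
window = 3
feature_map = np.array([
    np.maximum(0.0, np.sum(filter_w * sentence[i:i + window]) + bias)
    for i in range(sentence.shape[0] - window + 1)
])  # shape (8,)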
What does a CNN for NLP look like?
Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
Problems with CNN
Location Invariance: You probably do care a lot where in the sentence a word appears, unlike with images.
Local Compositionality: Pixels close to each other are likely to be semantically related (part of the same object), but the same isn't always true for words. In many languages, parts of phrases can be separated by several other words.
The compositional aspect is intuitive in Computer Vision, i.e. edges form shapes and shapes form objects. Clearly, words compose in some ways, such as an adjective modifying a noun, but how exactly this works and what the higher-level representations actually "mean" isn't as obvious as in the Computer Vision case.
Why not a traditional neural network for sequential tasks?
Problems:
Inputs and outputs can be of different lengths in different examples.
A traditional NN doesn't share features learned across different positions of the text.
Recurrent Neural Network
An RNN solves the above two problems, along with the problems posed by CNNs.
An Unrolled RNN
NOTE: The hidden state $h_t$ gives a summary of the sequence up to time $t$.
Forward pass
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
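A minimal NumPy sketch of this forward pass; the dimensions and the random parameter values are made up purely for illustration:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

H, D, K, T = 16, 8, 4, 5            # hidden size, input size, output size, sequence length
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(H, H))   # recurrent weights (small random init)
U = 0.1 * rng.normal(size=(H, D))   # input weights
V = 0.1 * rng.normal(size=(K, H))   # output weights
b_h, b_z = np.zeros(H), np.zeros(K)

xs = rng.normal(size=(T, D))        # a dummy input sequence
h = np.zeros(H)                     # h_0
hs, zs = [], []
for x_t in xs:
    h = np.tanh(W @ h + U @ x_t + b_h)   # h_t = tanh(W h_{t-1} + U x_t + b_h)
    z = softmax(V @ h + b_z)             # z_t = softmax(V h_t + b_z)
    hs.append(h); zs.append(z)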
Backpropagation in RNN
Notation: $E(x, y) = -\sum_t y_t \log z_t$
$E$ : the above objective function (i.e. the sum of the errors at all timesteps)
$E^{(t)}$ : the error at time $t$
We have $h_t = \tanh(W h_{t-1} + U x_t + b_h)$ and
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. V
Let $\theta_t = V h_t + b_z$. Then
$\frac{\partial E}{\partial V} = \sum_t \frac{\partial E}{\partial \theta_t} \frac{\partial \theta_t}{\partial V}$
$\frac{\partial E}{\partial \theta_t}$ is the derivative of the cross-entropy loss w.r.t. the softmax input $\theta_t$:
$\frac{\partial E}{\partial \theta_t} = z_t - y_t$ and $\frac{\partial \theta_t}{\partial V} = h_t$, so
$\frac{\partial E}{\partial V} = \sum_t (z_t - y_t) \otimes h_t$
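Continuing the forward-pass sketch above, and assuming hypothetical one-hot targets ys, this gradient is a sum of outer products:

ys = np.eye(K)[rng.integers(0, K, size=T)]   # made-up one-hot targets, one per timestep

# dE/dV = sum_t (z_t - y_t) outer h_t, which has the same shape as V (K x H).
dV = np.zeros_like(V)
db_z = np.zeros_like(b_z)
for z_t, y_t, h_t in zip(zs, ys, hs):
    dV += np.outer(z_t - y_t, h_t)
    db_z += z_t - y_t                        # gradient w.r.t. the output bias b_z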
Backpropagation in RNN
We have
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. W
$\frac{\partial E^{(t)}}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial W}$
From the forward-pass equations, $h_t$ partially depends on $h_{t-1}$:
$\frac{\partial E^{(t)}}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W}$
If we keep substituting $h_{t-1}$ into the $h_t$ equation, we see that $h_t$ indirectly depends on $h_{t-2}, h_{t-3}, \ldots$
$\frac{\partial E^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$, and
$\frac{\partial E}{\partial W} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$
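Continuing the same sketch (reusing the ys targets defined above), one common way to organize this computation is a backpropagation-through-time loop in which a running gradient dh_next carries the inner sum over $k$ backwards through $\partial h_t / \partial h_{t-1}$:

# Backpropagation through time for dE/dW.
# dh_next accumulates the gradient arriving at h_t from later timesteps,
# which realizes the inner sum over k = 1..t from the slide.
dW = np.zeros_like(W)
db_h = np.zeros_like(b_h)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dtheta = zs[t] - ys[t]                   # dE(t) w.r.t. the softmax input V h_t + b_z
    dh = V.T @ dtheta + dh_next              # total gradient reaching h_t
    da = (1.0 - hs[t] ** 2) * dh             # back through tanh (h_t = tanh(...))
    h_prev = hs[t - 1] if t > 0 else np.zeros(H)
    dW += np.outer(da, h_prev)               # timestep t's contribution to dE/dW
    db_h += da
    dh_next = W.T @ da                       # gradient passed back to h_{t-1}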
Backpropagation in RNN
We have
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. U
We can't treat $h_{t-1}$ as a constant when taking the partial derivative of $h_t$ w.r.t. $U$, because $h_{t-1}$ itself depends on $U$, i.e.
$h_{t-1} = \tanh(W h_{t-2} + U x_{t-1} + b_h)$
Again, we get a similar form:
$\frac{\partial E}{\partial U} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial U}$
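Under the same assumptions as the previous sketch, $\partial E / \partial U$ uses the same backward recursion; only the final outer product changes, pairing the pre-activation gradient with $x_t$ instead of $h_{t-1}$:

# The same backward recursion, now accumulating dE/dU.
dU = np.zeros_like(U)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dtheta = zs[t] - ys[t]
    dh = V.T @ dtheta + dh_next
    da = (1.0 - hs[t] ** 2) * dh
    dU += np.outer(da, xs[t])                # x_t takes the place of h_{t-1}
    dh_next = W.T @ da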
Problem with RNN
Look closely at these equations:
$\frac{\partial E}{\partial W} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$
$\frac{\partial E}{\partial U} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial U}$
We find that $\frac{\partial h_t}{\partial h_k}$ is itself a chain rule:
$\frac{\partial h_t}{\partial h_k} = \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{k+1}}{\partial h_k}$
If the sequence is long, this product contains many factors, which results in the vanishing gradient problem or the exploding gradient problem, depending on whether each individual factor is less than or greater than 1.
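A quick numeric illustration of why the length of that product matters, using made-up per-step factors of 0.9 and 1.1:

import numpy as np

steps = np.arange(1, 51)
shrinking = 0.9 ** steps     # every factor |dh_t/dh_{t-1}| slightly below 1
growing = 1.1 ** steps       # every factor slightly above 1

print(shrinking[-1])         # ~0.005: the gradient has effectively vanished after 50 steps
print(growing[-1])           # ~117:   the gradient has exploded after 50 steps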
LSTM solves this problem to a large extent.
Long Short-Term Memory (LSTM) Network
Forward Pass
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$
$a_t = \tanh(W_a [h_{t-1}; x_t] + b_a)$
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
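A minimal NumPy sketch of one LSTM step following these equations; the weight shapes assume the concatenation $[h_{t-1}; x_t]$, and all parameter values are random placeholders:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

H, D = 16, 8                                     # hidden size, input size (made up)
rng = np.random.default_rng(0)
Wf, Wi, Wa, Wo = (0.1 * rng.normal(size=(H, H + D)) for _ in range(4))
bf, bi, ba, bo = (np.zeros(H) for _ in range(4))

def lstm_step(h_prev, C_prev, x_t):
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}; x_t]
    f_t = sigmoid(Wf @ concat + bf)              # forget gate
    i_t = sigmoid(Wi @ concat + bi)              # input gate
    a_t = np.tanh(Wa @ concat + ba)              # candidate values
    C_t = f_t * C_prev + i_t * a_t               # new cell state
    o_t = sigmoid(Wo @ concat + bo)              # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

h, C = np.zeros(H), np.zeros(H)
h, C = lstm_step(h, C, rng.normal(size=D))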
Cell state in LSTM
Vanishing Gradient Problem Addressed
The cell state runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged. (Mathematical proof on later slides.)
Gates in LSTM
Forget Gate
Decides what information should be thrown away from the cell state.
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$
Gates in LSTM
Input Gate
The input gate layer decides which values to update, and $a_t$ is a vector of new candidate values.
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$
$a_t = \tanh(W_a [h_{t-1}; x_t] + b_a)$
Gates in LSTM
Updating Memory Cell
Multiply the old state by $f_t$, forgetting the things we decided to forget earlier.
Then we add $i_t \odot a_t$: the new candidate values, scaled by how much we decided to update each state value.
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
Gates in LSTM
Output Gate
The output is based on our cell state, but is a filtered version of it.
The cell state is put through tanh to push the values between -1 and 1, and is then multiplied by the output gate.
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
Backpropagation in LSTM
NOTE: The gradient with respect to $C_t$ is used at the $(t-1)$-th timestep for further error propagation. If $f_t$ is close to 1, the gradient from the $t$-th timestep is propagated almost perfectly to the $(t-1)$-th timestep.
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
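As a short worked step (considering only the direct path through the cell state and ignoring the indirect dependence of $f_t$, $i_t$, $a_t$ on $C_{t-1}$ via $h_{t-1}$):
$\frac{\partial C_t}{\partial C_{t-1}} = \mathrm{diag}(f_t)$, so the direct term in $\frac{\partial E}{\partial C_{t-1}}$ is $f_t \odot \frac{\partial E}{\partial C_t}$.
There is no repeated multiplication by a recurrent weight matrix and no tanh squashing on this path, which is why gradients flowing through the cell state do not vanish as quickly as in a plain RNN.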