Backpropagation in RNN and LSTM

Himanshu Singh (HimanshuSingh1370) · 45 slides · Apr 22, 2021

About This Presentation

These slides are intended to show why we shouldn't use a Convolutional Neural Network (CNN) for Natural Language Processing (NLP) tasks and why we should use RNN- or LSTM-based networks instead, along with the mathematical derivation of the backpropagation algorithm in Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks.


Slide Content

1/37
CNN, RNN & LSTM
Himanshu Singh
IIT Bombay

2/37
Convolution
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

3/37
Why Convolution?
Averaging each pixel with its neighboring values blurs an image.
Taking the difference between a pixel and its neighbors detects edges.
https://docs.gimp.org/en/plug-in-convmatrix.html
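
To make this concrete, here is a small NumPy sketch (not from the slides; the image array, kernel values and the conv2d helper are illustrative assumptions) that blurs with an averaging kernel and detects edges with a pixel-minus-neighbours kernel, in the spirit of the GIMP convolution-matrix page linked above.

    import numpy as np

    def conv2d(image, kernel):
        """Valid 2-D convolution of a grayscale image with a small kernel (cross-correlation form)."""
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(8, 8)            # illustrative grayscale image
    blur_kernel = np.ones((3, 3)) / 9.0     # average of the neighbourhood -> blur
    edge_kernel = np.array([[ 0, -1,  0],
                            [-1,  4, -1],
                            [ 0, -1,  0]])  # pixel minus neighbours -> edges

    blurred = conv2d(image, blur_kernel)
    edges = conv2d(image, edge_kernel)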

4/37
What is a Convolutional Neural Network?
CNN
It is several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results.
During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
Location Invariance: Let's say you want to classify whether or not there's an elephant in an image. Because you are sliding your filters over the whole image, you don't really care where the elephant occurs.
Compositionality: Each filter composes a local patch of lower-level features into a higher-level representation.

5/37
What has CNN for NLP?
Zhang, Y., Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.

6/37
Problems with CNN
Location Invariance: You probably do care a lot where in the sentence a word appears, unlike in images.
Local Compositionality: Pixels close to each other are likely to be semantically related (part of the same object), but the same isn't always true for words. In many languages, parts of phrases could be separated by several other words.
The compositional aspect is intuitive in Computer Vision, i.e. edges form shapes and shapes form objects. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works and what higher-level representations actually "mean" isn't as obvious as in the Computer Vision case.

7/37
Why not a traditional neural network for sequential tasks?
Problems:
Inputs and outputs can be of different lengths in different examples.
A traditional NN doesn't share features learned across different positions of text.
Recurrent Neural Network
An RNN solves the above two problems along with the problems posed by CNNs.

8/37
An Unrolled RNN
NOTE: The hidden state h_t gives a summary of the sequence up to time t.
Forward pass
h_t = tanh(W h_{t-1} + U x_t + b_h)
z_t = softmax(V h_t + b_z)
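
A minimal NumPy sketch of this forward pass (not part of the original deck; dimensions, variable names and the softmax helper are my own assumptions):

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def rnn_forward(xs, W, U, V, b_h, b_z):
        """xs: list of input vectors x_1..x_T. Returns hidden states h_t and outputs z_t."""
        h = np.zeros(W.shape[0])
        hs, zs = [], []
        for x in xs:
            h = np.tanh(W @ h + U @ x + b_h)   # h_t = tanh(W h_{t-1} + U x_t + b_h)
            z = softmax(V @ h + b_z)           # z_t = softmax(V h_t + b_z)
            hs.append(h)
            zs.append(z)
        return hs, zs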

9/37
Backpropagation in RNN
Notation: E(x, y) = − Σ_t y_t log z_t
E : the above objective function (i.e. the sum of the errors at all time steps)
E(t) : the error at time step t
We have h_t = tanh(W h_{t-1} + U x_t + b_h) and z_t = softmax(V h_t + b_z).
Gradient of E w.r.t. V
Let θ_t = V h_t + b_z. Then
∂E/∂V = Σ_t (∂E/∂θ_t)(∂θ_t/∂V)
∂E/∂θ_t is the derivative of the softmax cross-entropy loss w.r.t. its input θ_t:
∂E/∂θ_t = z_t − y_t (cite), and ∂θ_t/∂V = h_t
∂E/∂V = Σ_t (z_t − y_t) h_t^T
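
Assuming one-hot targets y_t and the forward pass above, this gradient can be accumulated per time step as the outer product of (z_t − y_t) and h_t; a sketch with assumed variable names:

    import numpy as np

    def grad_V(zs, ys, hs, out_dim, hid_dim):
        """dE/dV = sum_t (z_t - y_t) h_t^T  (one outer product per time step)."""
        dV = np.zeros((out_dim, hid_dim))
        db_z = np.zeros(out_dim)
        for z_t, y_t, h_t in zip(zs, ys, hs):
            delta = z_t - y_t            # dE/d(theta_t) for softmax + cross-entropy
            dV += np.outer(delta, h_t)   # d(theta_t)/dV contributes h_t
            db_z += delta
        return dV, db_z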

10/37
Backpropagation in RNN
We have
h_t = tanh(W h_{t-1} + U x_t + b_h)
z_t = softmax(V h_t + b_z)
Gradient of E w.r.t. W
∂E(t)/∂W = (∂E(t)/∂z_t)(∂z_t/∂W) = (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂W)
From the forward-pass equations, h_t partially depends on h_{t-1}:
∂E(t)/∂W = (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_{t-1})(∂h_{t-1}/∂W)
If we keep substituting h_{t-1} into the h_t equation, we see that h_t indirectly depends on h_{t-2}, h_{t-3}, ...
∂E(t)/∂W = Σ_{k=1}^{t} (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W), and
∂E/∂W = Σ_t Σ_{k=1}^{t} (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W)

11/37
Backpropagation in RNN
We have
h_t = tanh(W h_{t-1} + U x_t + b_h)
z_t = softmax(V h_t + b_z)
Gradient of E w.r.t. U
We can't treat h_{t-1} as a constant when taking the partial derivative of h_t w.r.t. U, because h_{t-1} itself depends on U, i.e.
h_{t-1} = tanh(W h_{t-2} + U x_{t-1} + b_h)
Again, we get a similar form:
∂E/∂U = Σ_t Σ_{k=1}^{t} (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂U)
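
In practice the double sum over t and k is evaluated with a single backward sweep that carries a running gradient on h_k; the sketch below is my own reformulation of these equations (variable names assumed, reusing the forward-pass quantities x_t, h_t, z_t from earlier), accumulating ∂E/∂W and ∂E/∂U.

    import numpy as np

    def bptt_grads(xs, ys, hs, zs, W, V):
        """Backprop through time for dE/dW and dE/dU (dE/dV is handled as above)."""
        hid, inp = W.shape[0], len(xs[0])
        dW, dU, db_h = np.zeros((hid, hid)), np.zeros((hid, inp)), np.zeros(hid)
        h_prev = [np.zeros(hid)] + hs[:-1]          # h_0 = 0
        dh_next = np.zeros(hid)                     # gradient flowing in from step t+1
        for t in reversed(range(len(xs))):
            dh = V.T @ (zs[t] - ys[t]) + dh_next    # dE(t)/dh_t plus the carried gradient
            dpre = (1.0 - hs[t] ** 2) * dh          # through tanh: h_t = tanh(pre_t)
            dW += np.outer(dpre, h_prev[t])
            dU += np.outer(dpre, xs[t])
            db_h += dpre
            dh_next = W.T @ dpre                    # gradient passed back to h_{t-1}
        return dW, dU, db_h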

12/37
Problem with RNN
Look closely at these equations:
∂E/∂W = Σ_t Σ_{k=1}^{t} (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W)
∂E/∂U = Σ_t Σ_{k=1}^{t} (∂E(t)/∂z_t)(∂z_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂U)
We find that ∂h_t/∂h_k is itself a chain of factors:
∂h_t/∂h_k = (∂h_t/∂h_{t-1})(∂h_{t-1}/∂h_{t-2}) ... (∂h_{k+1}/∂h_k)
If the sequence length is large, this product contains many terms, which results in the vanishing gradient problem or the exploding gradient problem, depending on whether each individual factor is less than or greater than 1.
LSTM solves this problem to a large extent.
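
A toy numerical illustration of this point (not from the slides): with 100 factors each slightly below or slightly above 1, the product either collapses towards zero or blows up.

    import numpy as np

    T = 100                              # sequence length (illustrative)
    print(np.prod(np.full(T, 0.9)))      # ~2.7e-5  -> vanishing gradient
    print(np.prod(np.full(T, 1.1)))      # ~1.4e+4  -> exploding gradient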

13/37
Long Short Term Memory (LSTM) Network
Forward Pass
f_t = σ(W_f [h_{t-1}; x_t] + b_f)
i_t = σ(W_i [h_{t-1}; x_t] + b_i)
a_t = tanh(W_a [h_{t-1}; x_t] + b_a)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t
o_t = σ(W_o [h_{t-1}; x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
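
A minimal NumPy sketch of one forward step following these equations (my own variable names; σ is the logistic sigmoid and ⊙ is elementwise multiplication):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(x_t, h_prev, C_prev, Wf, Wi, Wa, Wo, bf, bi, ba, bo):
        z = np.concatenate([h_prev, x_t])       # [h_{t-1}; x_t]
        f_t = sigmoid(Wf @ z + bf)              # forget gate
        i_t = sigmoid(Wi @ z + bi)              # input gate
        a_t = np.tanh(Wa @ z + ba)              # candidate values
        C_t = f_t * C_prev + i_t * a_t          # new cell state
        o_t = sigmoid(Wo @ z + bo)              # output gate
        h_t = o_t * np.tanh(C_t)                # new hidden state
        return h_t, C_t, (f_t, i_t, a_t, o_t)   # cache the gates for backprop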

14/37
Cell state in LSTM
Vanishing Gradient Problem Addressed
It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged. (Mathematical proof on later slides.)

15/37
Gates in LSTM
Forget Gate
Decides what information should be thrown away from the cell state:
f_t = σ(W_f [h_{t-1}; x_t] + b_f)

16/37
Gates in LSTM
Input Gate
The input gate layer decides which values to update, and a_t is a vector of new candidate values:
i_t = σ(W_i [h_{t-1}; x_t] + b_i)
a_t = tanh(W_a [h_{t-1}; x_t] + b_a)

17/37
Gates in LSTM
Updating Memory Cell
Multiply the old state by f_t, forgetting the things we decided to forget earlier.
Then we add i_t ⊙ a_t: the new candidate values, scaled by how much we decided to update each state value.
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t

18/37
Gates in LSTM
Output Gate
Output will be based on our cell state, but will be a filtered version.
The cell state is put through tanh to push the output between -1 and 1:
o_t = σ(W_o [h_{t-1}; x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

19/37
Backpropagation in LSTM
Error propagation
Error propagation happens through C_t and h_t.

20/37
Backpropagation in LSTM
Error propagation
Error propagation through h_t

21/37
Backpropagation in LSTM
o_t = σ(W_o [h_{t-1}; x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

22/37
Backpropagation in LSTM
o_t = σ(W_o [h_{t-1}; x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

23/37
Backpropagation in LSTM
Error propagation
Error propagation through C_t

24/37
Backpropagation in LSTM
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t

25/37
Backpropagation in LSTM
NOTE: The cell-state gradient computed here will be used at the (t-1)-th time step for further error propagation. If f_t is close to 1, then the gradient from the t-th time step is propagated perfectly to the (t-1)-th time step.
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t
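
In code this path is a single elementwise product with f_t; a small sketch of the local gradients of C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t, assuming dC_t holds the incoming gradient ∂E/∂C_t (NumPy arrays, names assumed):

    def cell_state_backward(dC_t, f_t, i_t, a_t, C_prev):
        dC_prev = f_t * dC_t     # the "gradient highway": scaled only by the forget gate
        df_t = C_prev * dC_t
        di_t = a_t * dC_t
        da_t = i_t * dC_t
        return dC_prev, df_t, di_t, da_t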

26/37
Backpropagation in LSTM
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t

27/37
Backpropagation in LSTM
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t

28/37
Backpropagation in LSTM
C_t = f_t ⊙ C_{t-1} + i_t ⊙ a_t

29/37
Backpropagation in LSTM
Combined Error
Error propagation from both C_t and h_t.

30/37
Backpropagation in LSTM

31/37
Backpropagation in LSTM

32/37
Backpropagation in LSTM

33/37
Backpropagation in LSTM

34/37
Backpropagation in LSTM

35/37
Backpropagation in LSTM
NOTE: The gradient w.r.t. h_{t-1} calculated here will be used by the previous time step for further backpropagation.
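
Since the derivation on slides 30-35 lives in the figures, here is a hedged NumPy sketch of one full backward step that combines the C_t and h_t paths and returns the gradients handed to the previous time step (dh_{t-1}, dC_{t-1}) plus per-gate weight gradients; it assumes the cache from the lstm_step sketch earlier and that the incoming gradients dh_t and dC_t are given.

    import numpy as np

    def lstm_step_backward(dh_t, dC_t, x_t, h_prev, C_prev, C_t,
                           f_t, i_t, a_t, o_t, Wf, Wi, Wa, Wo):
        tC = np.tanh(C_t)
        # h_t = o_t * tanh(C_t)
        do = dh_t * tC
        dC = dC_t + dh_t * o_t * (1.0 - tC ** 2)        # combined error through h_t and C_t
        # C_t = f_t * C_{t-1} + i_t * a_t
        df, di, da = C_prev * dC, a_t * dC, i_t * dC
        dC_prev = f_t * dC
        # back through the gate nonlinearities (sigmoid for f, i, o; tanh for a)
        dzf = df * f_t * (1.0 - f_t)
        dzi = di * i_t * (1.0 - i_t)
        dza = da * (1.0 - a_t ** 2)
        dzo = do * o_t * (1.0 - o_t)
        z = np.concatenate([h_prev, x_t])
        dWf, dWi = np.outer(dzf, z), np.outer(dzi, z)
        dWa, dWo = np.outer(dza, z), np.outer(dzo, z)
        # gradient sent to [h_{t-1}; x_t]; the h part continues the backward pass
        dz = Wf.T @ dzf + Wi.T @ dzi + Wa.T @ dza + Wo.T @ dzo
        dh_prev = dz[: len(h_prev)]
        return dh_prev, dC_prev, (dWf, dWi, dWa, dWo), (dzf, dzi, dza, dzo)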

36/37
Parameters update
We have calculated the gradients ∇W_f, ∇W_i, ∇W_a and ∇W_o.
The next step is to do gradient descent:
W_α = W_α − η ∇W_α, where α ∈ {f, i, a, o} and η is the learning rate
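
A small sketch of this update rule, assuming the per-gate gradients have been summed over the sequence into a dict mirroring the parameter dict (names and learning rate are illustrative):

    def sgd_update(params, grads, learning_rate=0.01):
        """W_alpha <- W_alpha - eta * grad(W_alpha), for alpha in {f, i, a, o} (and the biases)."""
        for name, g in grads.items():
            params[name] -= learning_rate * g
        return params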

37/37
References
colah.github.io/posts/2015-08-Understanding-LSTMs
www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp
www.youtube.com/watch?v=KGOBB3wUbdc
Zhang, Y., Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
Chen, Gang. A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation.
www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/