These slides are intended to show why we shouldn't use a Convolutional Neural Network (CNN) for Natural Language Processing (NLP) tasks and why we should use RNN- or LSTM-based networks instead, along with the mathematical derivation of the backpropagation algorithm in the Recurrent Neural Network (RNN) and the Long Short-Term Memory (LSTM) network.
Why Convolution?
Averaging each pixel with its neighboring values blurs an image.
Taking the difference between a pixel and its neighbors detects edges.
https://docs.gimp.org/en/plug-in-convmatrix.html
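A minimal sketch of these two kernels, assuming SciPy is available and using a random array as a stand-in for a grayscale image (img is a placeholder name):

import numpy as np
from scipy.signal import convolve2d

# Hypothetical grayscale image as a 2-D float array.
img = np.random.rand(64, 64)

# Averaging (box blur) kernel: each output pixel is the mean of its 3x3 neighborhood.
blur_kernel = np.ones((3, 3)) / 9.0

# Difference kernel: subtracts the average of the 8 neighbors from the center pixel,
# so flat regions go to ~0 and edges stand out.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]]) / 8.0

blurred = convolve2d(img, blur_kernel, mode='same', boundary='symm')
edges   = convolve2d(img, edge_kernel, mode='same', boundary='symm')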
What is a Convolutional Neural Network?
CNN
It is several layers of convolutions with nonlinear activation functions like ReLU or tanh applied to the results (a minimal sketch follows below).
During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform.
Location Invariance: Let's say you want to classify whether or not there's an elephant in an image. Because you are sliding your filters over the whole image, you don't really care where the elephant occurs.
Compositionality: Each filter composes a local patch of lower-level features into a higher-level representation.
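A minimal NumPy sketch of one such layer in the sentence setting referenced on the next slide: the input is a matrix of word embeddings, and a single filter slides over windows of consecutive words before a ReLU is applied. All names and sizes here (sentence, filter_w, the 3-word window) are illustrative assumptions, not the cited authors' setup.

import numpy as np

np.random.seed(0)
sentence = np.random.randn(10, 50)   # 10 words x 50-dim embeddings (made-up sizes)
filter_w = np.random.randn(3, 50)    # one learned filter spanning windows of 3 words
bias = 0.0

# Slide the filter over every window of 3 consecutive words and apply ReLU,
# producing one feature map per filter (only one filter here, for brevity).
window = 3
feature_map = np.array([
    np.maximum(0.0, np.sum(filter_w * sentence[i:i + window]) + bias)
    for i in range(sentence.shape[0] - window + 1)
])  # shape (8,)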
What does a CNN for NLP look like?
Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.
Problems with CNN
Location Invariance: You probably do care a lot where in the sentence a word appears, unlike with images.
Local Compositionality: Pixels close to each other are likely to be semantically related (part of the same object), but the same isn't always true for words. In many languages, parts of phrases can be separated by several other words.
The compositional aspect is intuitive in Computer Vision, i.e. edges form shapes and shapes form objects. Clearly, words compose in some ways, such as an adjective modifying a noun, but how exactly this works and what the higher-level representations actually "mean" isn't as obvious as in the Computer Vision case.
Why not a traditional neural network for sequential tasks?
Problems:
Inputs and outputs can be of different lengths in different examples.
A traditional NN doesn't share features learned across different positions of the text.
Recurrent Neural Network
An RNN solves the above two problems, along with the problems posed by CNNs.
An Unrolled RNN
NOTE: The hidden state $h_t$ gives a summary of the sequence up to time $t$.
Forward pass
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
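A minimal NumPy sketch of this forward pass; the dimensions and the random parameter values are made up purely for illustration:

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

H, D, K, T = 16, 8, 4, 5            # hidden size, input size, output size, sequence length
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(H, H))   # recurrent weights (small random init)
U = 0.1 * rng.normal(size=(H, D))   # input weights
V = 0.1 * rng.normal(size=(K, H))   # output weights
b_h, b_z = np.zeros(H), np.zeros(K)

xs = rng.normal(size=(T, D))        # a dummy input sequence
h = np.zeros(H)                     # h_0
hs, zs = [], []
for x_t in xs:
    h = np.tanh(W @ h + U @ x_t + b_h)   # h_t = tanh(W h_{t-1} + U x_t + b_h)
    z = softmax(V @ h + b_z)             # z_t = softmax(V h_t + b_z)
    hs.append(h); zs.append(z)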
Backpropagation in RNN
Notation: $E(x, y) = -\sum_t y_t \log z_t$
$E$ : the above objective function (i.e. the sum of the errors at all timesteps)
$E^{(t)}$ : the error at time $t$
We have $h_t = \tanh(W h_{t-1} + U x_t + b_h)$ and
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. V
Let $\theta_t = V h_t + b_z$. Then
$\frac{\partial E}{\partial V} = \sum_t \frac{\partial E}{\partial \theta_t} \frac{\partial \theta_t}{\partial V}$
$\frac{\partial E}{\partial \theta_t}$ is the derivative of the cross-entropy loss w.r.t. the softmax input $\theta_t$:
$\frac{\partial E}{\partial \theta_t} = z_t - y_t$ and $\frac{\partial \theta_t}{\partial V} = h_t$, so
$\frac{\partial E}{\partial V} = \sum_t (z_t - y_t) \otimes h_t$
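Continuing the forward-pass sketch above, and assuming hypothetical one-hot targets ys, this gradient is a sum of outer products:

ys = np.eye(K)[rng.integers(0, K, size=T)]   # made-up one-hot targets, one per timestep

# dE/dV = sum_t (z_t - y_t) outer h_t, which has the same shape as V (K x H).
dV = np.zeros_like(V)
db_z = np.zeros_like(b_z)
for z_t, y_t, h_t in zip(zs, ys, hs):
    dV += np.outer(z_t - y_t, h_t)
    db_z += z_t - y_t                        # gradient w.r.t. the output bias b_z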
Backpropagation in RNN
We have
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. W
$\frac{\partial E^{(t)}}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial W}$
From the forward-pass equations, $h_t$ partially depends on $h_{t-1}$:
$\frac{\partial E^{(t)}}{\partial W} = \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial W}$
If we keep substituting $h_{t-1}$ into the $h_t$ equation, we see that $h_t$ indirectly depends on $h_{t-2}, h_{t-3}, \ldots$
$\frac{\partial E^{(t)}}{\partial W} = \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$, and
$\frac{\partial E}{\partial W} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$
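Continuing the same sketch (reusing the ys targets defined above), one common way to organize this computation is a backpropagation-through-time loop in which a running gradient dh_next carries the inner sum over $k$ backwards through $\partial h_t / \partial h_{t-1}$:

# Backpropagation through time for dE/dW.
# dh_next accumulates the gradient arriving at h_t from later timesteps,
# which realizes the inner sum over k = 1..t from the slide.
dW = np.zeros_like(W)
db_h = np.zeros_like(b_h)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dtheta = zs[t] - ys[t]                   # dE(t) w.r.t. the softmax input V h_t + b_z
    dh = V.T @ dtheta + dh_next              # total gradient reaching h_t
    da = (1.0 - hs[t] ** 2) * dh             # back through tanh (h_t = tanh(...))
    h_prev = hs[t - 1] if t > 0 else np.zeros(H)
    dW += np.outer(da, h_prev)               # timestep t's contribution to dE/dW
    db_h += da
    dh_next = W.T @ da                       # gradient passed back to h_{t-1}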
Backpropagation in RNN
We have
$h_t = \tanh(W h_{t-1} + U x_t + b_h)$
$z_t = \mathrm{softmax}(V h_t + b_z)$
Gradient of E w.r.t. U
We can't treat $h_{t-1}$ as a constant when taking the partial derivative of $h_t$ w.r.t. $U$, because $h_{t-1}$ itself depends on $U$, i.e.
$h_{t-1} = \tanh(W h_{t-2} + U x_{t-1} + b_h)$
Again, we get a similar form:
$\frac{\partial E}{\partial U} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial U}$
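Under the same assumptions as the previous sketch, $\partial E / \partial U$ uses the same backward recursion; only the final outer product changes, pairing the pre-activation gradient with $x_t$ instead of $h_{t-1}$:

# The same backward recursion, now accumulating dE/dU.
dU = np.zeros_like(U)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dtheta = zs[t] - ys[t]
    dh = V.T @ dtheta + dh_next
    da = (1.0 - hs[t] ** 2) * dh
    dU += np.outer(da, xs[t])                # x_t takes the place of h_{t-1}
    dh_next = W.T @ da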
Problem with RNN
Look closely at these equations:
$\frac{\partial E}{\partial W} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial W}$
$\frac{\partial E}{\partial U} = \sum_t \sum_{k=1}^{t} \frac{\partial E^{(t)}}{\partial z_t}\frac{\partial z_t}{\partial h_t}\frac{\partial h_t}{\partial h_k}\frac{\partial h_k}{\partial U}$
We find that $\frac{\partial h_t}{\partial h_k}$ is itself a chain rule:
$\frac{\partial h_t}{\partial h_k} = \frac{\partial h_t}{\partial h_{t-1}}\frac{\partial h_{t-1}}{\partial h_{t-2}} \cdots \frac{\partial h_{k+1}}{\partial h_k}$
If the sequence is long, this product contains many factors, which results in the vanishing gradient problem or the exploding gradient problem, depending on whether each individual factor is less than or greater than 1.
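A quick numeric illustration of why the length of that product matters, using made-up per-step factors of 0.9 and 1.1:

import numpy as np

steps = np.arange(1, 51)
shrinking = 0.9 ** steps     # every factor |dh_t/dh_{t-1}| slightly below 1
growing = 1.1 ** steps       # every factor slightly above 1

print(shrinking[-1])         # ~0.005: the gradient has effectively vanished after 50 steps
print(growing[-1])           # ~117:   the gradient has exploded after 50 steps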
LSTM solves this problem to a large extent.
Long Short-Term Memory (LSTM) Network
Forward Pass
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$
$a_t = \tanh(W_a [h_{t-1}; x_t] + b_a)$
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
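A minimal NumPy sketch of one LSTM step following these equations; the weight shapes assume the concatenation $[h_{t-1}; x_t]$, and all parameter values are random placeholders:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

H, D = 16, 8                                     # hidden size, input size (made up)
rng = np.random.default_rng(0)
Wf, Wi, Wa, Wo = (0.1 * rng.normal(size=(H, H + D)) for _ in range(4))
bf, bi, ba, bo = (np.zeros(H) for _ in range(4))

def lstm_step(h_prev, C_prev, x_t):
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}; x_t]
    f_t = sigmoid(Wf @ concat + bf)              # forget gate
    i_t = sigmoid(Wi @ concat + bi)              # input gate
    a_t = np.tanh(Wa @ concat + ba)              # candidate values
    C_t = f_t * C_prev + i_t * a_t               # new cell state
    o_t = sigmoid(Wo @ concat + bo)              # output gate
    h_t = o_t * np.tanh(C_t)                     # new hidden state
    return h_t, C_t

h, C = np.zeros(H), np.zeros(H)
h, C = lstm_step(h, C, rng.normal(size=D))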
Cell state in LSTM
Vanishing Gradient Problem Addressed
The cell state runs straight down the entire chain, with only some minor linear interactions. It is very easy for information to just flow along it unchanged. (Mathematical proof on later slides.)
Gates in LSTM
Forget Gate
Decides what information should be thrown away from the cell state.
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$
Gates in LSTM
Input Gate
The input gate layer decides which values to update, and $a_t$ is a vector of new candidate values.
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$
$a_t = \tanh(W_a [h_{t-1}; x_t] + b_a)$
Gates in LSTM
Updating Memory Cell
Multiply the old state by $f_t$, forgetting the things we decided to forget earlier.
Then we add $i_t \odot a_t$: the new candidate values, scaled by how much we decided to update each state value.
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
Gates in LSTM
Output Gate
The output is based on our cell state, but is a filtered version of it.
The cell state is put through tanh to push the values between -1 and 1, and is then multiplied by the output gate.
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
Backpropagation in LSTM
NOTE: The gradient with respect to $C_t$ is used at the $(t-1)$-th timestep for further error propagation. If $f_t$ is close to 1, the gradient from the $t$-th timestep is propagated almost perfectly to the $(t-1)$-th timestep.
$C_t = f_t \odot C_{t-1} + i_t \odot a_t$
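As a short worked step (considering only the direct path through the cell state and ignoring the indirect dependence of $f_t$, $i_t$, $a_t$ on $C_{t-1}$ via $h_{t-1}$):
$\frac{\partial C_t}{\partial C_{t-1}} = \mathrm{diag}(f_t)$, so the direct term in $\frac{\partial E}{\partial C_{t-1}}$ is $f_t \odot \frac{\partial E}{\partial C_t}$.
There is no repeated multiplication by a recurrent weight matrix and no tanh squashing on this path, which is why gradients flowing through the cell state do not vanish as quickly as in a plain RNN.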