Machine learning and deep learning lecture material prepared for the faculty members who are teaching this course
Slide Content
Fundamentals of Deep Learning
Dr. A.Kannan, Former Professor and Head,
Department of Information Science and
Technology, CEG Campus, Anna
University, Chennai-25.
Senior Professor, School of Computer
Science and Engineering,
VIT, Vellore-632014.
1
MACHINE LEARNING
•“Learning is making useful changes in our
minds.” -Marvin Minsky.
•“Machine Learning (ML) refers to a system
which has the capability of autonomous
knowledge acquisition and integration of
the acquired knowledge.”
2
MACHINE LEARNING
•Machine learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed.
•It focuses on the development of intelligent computer programs that can access data and use it to learn by themselves.
3
Applications of ML
•Image Processing – face recognition, handwritten character recognition, self-driving cars, traffic video analysis, ...
•Natural Language Processing – social network analysis, recommendation systems and sentiment analysis.
•Medical Diagnosis – disease identification, prediction of cancer, diabetes, etc. using past history and current data.
4
Machine Learning Paradigms
•Rote Learning
•Transfer of Learning
•Learning by Taking Advice
•Learning By Analogy
•Unsupervised Learning (Clustering)
•Supervised Learning (Classification)
•Deep Learning
5
AI -Knowledge Based Systems
•Facts
•Rules
•Knowledge base
•Knowledge Based Systems
•Knowledge Representation
•Reasoning and Inference
7
FACTS AND RULES
•Pat is a man = True
•Kumar is the father of Raja = True
•Kumar is the grandfather of Raja = False
•IF marks >= 60 THEN Class = FIRST CLASS
8
Rules
•If A then B (whenever A is true, B is also true)
•A = true (now A is TRUE)
•Inference: B is true now
•Using A→B and A, we can infer B.
•This rule is called Modus Ponens.
•A→B, A implies B.
9
Inference
•Pat is a man = p
•Pat is a woman = q
•Pat is a man or woman = p ∨ q = true
•Pat is not a woman = ¬q = true
•Inference: p is true
•Pat is a man or woman.
•Pat is not a woman.
•Inference: Pat is a man.
10
Knowledge Base Vs Database
Knowledge Base                | Database
More rules                    | Fewer rules
Fewer facts                   | More facts
Explicit rules and facts      | Explicit facts and implicit rules
Experts update                | Clerks update
Main-memory based             | Disk based
11
•AI Programs – exhibit intelligent behavior by skillful application of heuristics.
•KBS – make domain knowledge explicit.
•Expert Systems – apply expert knowledge to difficult, real-world problems.
12
Knowledge Representation
Techniques
•English or Natural Language
•Tables and Rules
•Logic (Propositional logic, Predicate logic)
•Semantic Networks
•Frames
•Conceptual Dependency
•Scripts
•Ontology
13
Searching
•Depth First (Missionaries and Cannibals)
•Breadth First (Water Jug Problem)
•Hill Climbing (8-Puzzle)
•Best First
•A* Algorithm
•AO* Algorithm
•Mini-Max Algorithm
15
REASONING METHODS
•Reasoning by Analogy – Frames.
•Temporal Reasoning – higher-order logic (Temporal Logic).
•Fuzzy Reasoning – higher-order logic (Fuzzy Logic).
•Non-monotonic Reasoning – higher-order logic (Non-monotonic Logic).
•Reasoning Agents –Epistemic Logic.
16
AI and ML
•Roughly speaking, AI and ML are good ways
to ask a computer to provide an answer to a
problem based on some past experience.
(Prediction, Learning, Explanation and Finding
Temporal Dependencies)
•It might be challenging to tell a computer what
a cat is, for instance. (Computers don't have common sense – General Problem Solver – Human Intelligence).
17
AI and ML
•Still, if you show a neural network enough images
of cats and tell it they are cats, then the computer
will be able to correctly identify other cats that it did
not see before.
•It appears that some of the most prominent and widely used AI and ML algorithms can be sped up significantly if they are run on quantum computers (examples: Bayesian networks, graph search algorithms, shortest-path algorithms, heuristic search algorithms, and swarm intelligence algorithms).
18
Learning Methods
•Learning from examples
•Winston’s Program
•Explanation based Learning
•Learning by Observation
•Knowledge Acquisition from experts
19
K-means Clustering
•An initial cluster seed represents the “mean
value” of its cluster.
•In the preceding example (figure not included here):
–Cluster seed 1 = 650
–Cluster seed 2 = 200
23
K-means Clustering
•Step 2: Calculate distance from each object
to each cluster seed.
•What type of distance should we use?
–Squared Euclidean distance
•Step 3: Assign each object to the closest
cluster
24
K-means Clustering
[Figure: objects assigned to Seed 1 and Seed 2]
25
K-means Clustering
•Iterate:
–Calculate distance from objects to cluster
centroids.
–Assign objects to closest cluster
–Recalculate new centroids
•Stop based on convergence criteria
–No change in clusters
–Max iterations
26
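A minimal Python sketch of the iteration described above; the data values are illustrative, and only the two seeds (650 and 200) come from the slides.

# k-means for 1-D data, following the steps above (assign, recompute, repeat)
def kmeans_1d(data, seeds, max_iters=100):
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iters):
        # Steps 2-3: assign each object to the closest centroid (squared Euclidean distance)
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: (x - centroids[k]) ** 2) for x in data]
        if new_assignment == assignment:   # convergence: no change in clusters
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of the objects assigned to it
        for k in range(len(centroids)):
            members = [x for x, a in zip(data, assignment) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, assignment

data = [150, 180, 220, 600, 650, 700]      # illustrative objects
print(kmeans_1d(data, seeds=[650, 200]))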
Regression
•The regression task belongs to supervised machine learning.
•It helps us predict continuous values and explain objects based on a given set of numerical and categorical data.
•For example, we can predict house prices based on house attributes such as the number of rooms, size, and location.
27
Regression
•In mathematical terms, the regression method gives us a straight line with the equation Y = mX + c to model a dataset.
•Here we take the X (independent variable) and Y (dependent variable) data points to train the linear regression model. The best-fit line is found by calculating the slope (m) and y-intercept (c) values.
28
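As a sketch of the slope and intercept calculation, the least-squares estimates can be computed directly; the (size, price) pairs below are made-up illustrative data.

# Least-squares fit of Y = m*X + c
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope m = covariance(X, Y) / variance(X); intercept c = mean_y - m * mean_x
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - m * mean_x
    return m, c

sizes  = [50, 70, 80, 100, 120]       # illustrative house sizes
prices = [150, 200, 230, 280, 330]    # illustrative prices (in thousands)
m, c = fit_line(sizes, prices)
print(f"price = {m:.2f} * size + {c:.2f}")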
Regression Analysis
•Regression analysis is performed with several effective algorithms, namely:
•Simple linear regression
•Multiple linear regression
•Decision trees
•Random forest
•Support Vector Machines (SVM)
30
Decision Trees
31
Name Debt Income Married? Risk
--------------------------------------------------------
Joe High High Yes Good
Sue Low High Yes Good
John Low High No Poor
Mary High Low Yes Poor
Fred Low Low Yes Poor
32
Decision Tree Classification
•Example
[Figure: a decision tree that splits the data on Income = High vs Income = Low, separating it into subsets D1 and D2]
33
Decision Trees Classification (cont.)
•Example
[Figure: the tree is grown further, splitting subset D1 into D1a and D1b; D2 is unchanged]
Random Forest
•Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. It combines "bagging" and the random selection of features.
•For many data sets, it produces more accurate classification results than a single decision tree.
34
Bagging
•Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance within a noisy dataset.
•In bagging, a random sample of data in a training set is selected with replacement, meaning that individual data points can be chosen more than once.
•Boosting, in contrast, is used to reduce bias.
35
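A short sketch of bagging with random feature selection via scikit-learn's RandomForestClassifier (assuming scikit-learn is available); the records are the credit-risk examples from the earlier decision-tree table, with High/Yes encoded as 1 and Low/No as 0.

from sklearn.ensemble import RandomForestClassifier

# Debt, Income, Married -> Risk, from the Name/Debt/Income/Married/Risk table above
X = [[1, 1, 1], [0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y = ["Good", "Good", "Poor", "Poor", "Poor"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 bagged trees
forest.fit(X, y)
print(forest.predict([[0, 1, 1]]))   # class = mode of the classes output by the trees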
Regression Vs Classification
•The most significant difference between regression and classification is that while regression helps predict a continuous quantity, classification predicts discrete class labels. There are also some overlaps between the two types of machine learning algorithms.
36
Support Vector Machines
•It is a linear classifier.
•It classifies the dataset into two groups for
binary classification problems.
•The multi-class SVM classifies the data set
into multiple groups.
•It is a supervised learning method.
37
Linear Classifiers
•f(x,w,b) = sign(w·x + b)
[Figure: two classes of points, denoted +1 and -1, in feature space; the classifier labels the region w·x + b > 0 as +1 and the region w·x + b < 0 as -1. How would you classify this data?]
38
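A minimal sketch of the classifier f(x, w, b) = sign(w·x + b); the weight vector and bias below are illustrative values, not learned ones.

def linear_classify(x, w, b):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if s >= 0 else -1     # w·x + b >= 0 -> +1 side, otherwise -1 side

w, b = [2.0, -1.0], -0.5
print(linear_classify([1.0, 0.5], w, b))   # w·x + b = 1.0  -> +1
print(linear_classify([0.0, 1.0], w, b))   # w·x + b = -1.5 -> -1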
Linear Classifiers
•f(x,w,b) = sign(w·x + b)
[Figure: the same two-class data. How will you classify this data?]
39
40
Maximum Margin
•f(x,w,b) = sign(w·x + b)
•The maximum margin linear classifier is the linear classifier with the maximum margin.
•This is the simplest kind of SVM (called an LSVM – Linear SVM).
•Support vectors are those datapoints that the margin pushes up against.
[Figure: +1 and -1 points separated by the maximum-margin line, with the support vectors lying on the margin]
Probabilistic Models
•Uncertainty
•ABC Murder Story
•Bayesian Classification
•Neural Networks
•Feature Selection and Classification
•Deep Learning
41
Probabilistic Models
•The probabilistic framework for machine learning is that learning can be thought of as inferring plausible models to explain observed data.
•A machine can use such models to make predictions about future data, and take decisions that are rational given these predictions.
42
BAYESIAN CLASSIFICATION
•CONDITIONAL PROBABILITY
•BAYES THEOREM
•NAÏVE BAYES CLASSIFIER
•BELIEF NETWORK
•APPLICATION OF BAYESIAN NETWORK -
CYBER CRIME DETECTION
43
BAYESIAN CLASSIFICATION
•Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
•Incremental:Each training example can
incrementally increase/decrease the probability that
a hypothesis is correct. Prior knowledge can be
combined with observed data.
44
BAYESIAN THEOREM
•A special case of Bayes' Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)
[Figure: Venn diagram of events A and B]
45
BAYESIAN THEOREM
•Example 1: A medical cancer diagnosis problem.
There are 2 possible outcomes of a diagnosis: +ve, -ve. We know 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.
If a patient's test returns +ve, should we diagnose the patient as having cancer?
46
BAYESIAN THEOREM
P(cancer) = 0.008        P(-cancer) = 0.992
P(+ve|cancer) = 0.98     P(-ve|cancer) = 0.02
P(+ve|-cancer) = 0.03    P(-ve|-cancer) = 0.97
Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer.
47
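The same calculation in a few lines of Python (the numbers are those given above; the common denominator P(+ve) is obtained by normalization):

p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer     = 0.98     # P(+ve | cancer)
p_pos_given_not_cancer = 0.03     # P(+ve | -cancer)

score_cancer     = p_pos_given_cancer * p_cancer             # 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer     # 0.0298
p_pos = score_cancer + score_not_cancer                      # P(+ve)

print("P(cancer | +ve)  =", score_cancer / p_pos)            # about 0.21
print("P(-cancer | +ve) =", score_not_cancer / p_pos)        # about 0.79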
NAÏVE BAYES CLASSIFIER
•A simplified assumption: attributes are
conditionally independent.
•Greatly reduces the computation cost, only
count the class distribution.
48
NAÏVE BAYES CLASSIFIER
The probabilistic model of the NBC is to find the probability of a certain class given multiple (assumed conditionally independent) feature events.
The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.
A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1,a2,…,an>. The learner is asked to predict the target value, or classification, for this new instance.
49
NAÏVE BAYES CLASSIFIER
Abstractly, the probability model for a classifier is a conditional model
P(C|F1,F2,…,Fn)
over a dependent class variable C with a small number of outcomes, or classes, conditioned on several feature variables F1,…,Fn.
Naïve Bayes classification rule:
class = argmax over classes c of [P(C=c) x P(F1|c) x P(F2|c) x … x P(Fn|c)] / P(F1,F2,…,Fn)
Since P(F1,F2,…,Fn) is common to all classes, we need not evaluate the denominator for comparisons.
50
NAÏVE BAYES CLASSIFIER
Tennis-Example
51
NAÏVE BAYES CLASSIFIER
•Problem:
Use training data from above to classify the
following instances:
a)<Outlook=sunny, Temperature=cool,
Humidity=high, Wind=strong>
b)<Outlook=overcast, Temperature=cool,
Humidity=high, Wind=strong>
52
NAÏVE BAYES CLASSIFIER
P(yes) x P(sunny|yes) x P(cool|yes) x P(high|yes) x P(strong|yes) = 0.0053
P(no) x P(sunny|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.0206
So the class for this instance is 'no'. We can normalize the probability by:
0.0206 / (0.0206 + 0.0053) = 0.795
54
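A sketch of this computation; the conditional probabilities below are the counts from the standard 14-day PlayTennis table (9 yes / 5 no), which is not reproduced in this handout but yields exactly the 0.0053 and 0.0206 values quoted above.

p_yes, p_no = 9/14, 5/14
likelihood_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
likelihood_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

instance = ["sunny", "cool", "high", "strong"]       # instance (a)
score_yes, score_no = p_yes, p_no
for value in instance:
    score_yes *= likelihood_yes[value]
    score_no  *= likelihood_no[value]

print(score_yes, score_no)                                # about 0.0053 and 0.0206
print("class:", "yes" if score_yes > score_no else "no")
print("P(no | x) =", score_no / (score_yes + score_no))   # about 0.795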
NAÏVE BAYES CLASSIFIER
Estimating probabilities:
In the previous example, P(overcast|no) = 0, which makes the product
P(no) x P(overcast|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.0
This causes problems in comparing classes because the other probabilities are not considered. We can avoid this difficulty by using the m-estimate.
56
NAÏVE BAYES CLASSIFIER
M-estimate formula:
(c + k) / (n + m), where c/n is the original probability used before, k = 1 and m = equivalent sample size.
Using this method, our new values of probability are given below.
57
NAÏVE BAYES CLASSIFIER
P(yes) x P(overcast|yes) x P(cool|yes) x P(high|yes) x P(strong|yes) = 0.011
P(no) x P(overcast|no) x P(cool|no) x P(high|no) x P(strong|no) = 0.00486
So the class of this instance is 'yes'.
59
NAÏVE BAYES CLASSIFIER
•The conditional probability values of all the
attributes with respect to the class are
pre-computed and stored on disk.
•This prevents the classifier from computing
the conditional probabilities every time it
runs.
•This stored data can be reused to reduce the
latency of the classifier.
60
Bayesian Belief Networks
•In Naïve Bayes Classifier we make the
assumption of class conditional
independence, that is given the class label
of a sample, the value of the attributes are
conditionally independent of one another.
•However, there can be dependencies between the values of attributes. To handle this, we use a Bayesian Belief Network, which provides the joint conditional probability distribution.
61
Bayesian Belief Networks
•A Bayesian network is a form of probabilistic
graphical model.
•Specifically, a Bayesian network is a directed acyclic
graph of nodes representing variables and arcs
representing dependence relations among the
variables.
•They provide a graphical method for getting the inferred
results through joint probabilities.
62
63
BAYESIAN BELIEF NETWORK
[Figure: the example network – Cloudy (C), Sprinkler (S), Rain (R), Wet Grass (W) – with its conditional probability tables]
64
BELIEF NETWORKS
•By the chain rule of probability, the joint probability of all the nodes in the graph above is:
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
where W = Wet Grass, C = Cloudy, R = Rain, S = Sprinkler.
Example: P(W∩-R∩S∩C) = P(W|S,-R) * P(-R|C) * P(S|C) * P(C) = 0.9 * 0.2 * 0.1 * 0.5 = 0.009
65
BAYESIAN BELIEF NETWORK
What is the probability of wet grass on a given
day -P(W)?
P(W) = P(W|SR) * P(S) * P(R) +
P(W|S-R) * P(S) * P(-R) +
P(W|-SR) * P(-S) * P(R) +
P(W|-S-R) * P(-S) * P(-R)
Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C)
P(R) = P(R|C) * P(C) + P(R|-C) * P(-C)
P(W)= 0.5985
66
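Both calculations can be sketched in a few lines; the conditional probability values below are the usual textbook numbers for this network (the tables themselves appear only in the figure), and they reproduce the 0.009 and 0.5985 results above.

P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}      # P(S | C), P(S | -C)
P_R_given_C = {True: 0.8, False: 0.2}      # P(R | C), P(R | -C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(W | S, R)

# Chain rule: P(W, -R, S, C) = P(W|S,-R) * P(-R|C) * P(S|C) * P(C)
joint = P_W_given_SR[(True, False)] * (1 - P_R_given_C[True]) * P_S_given_C[True] * P_C
print(joint)                               # 0.9 * 0.2 * 0.1 * 0.5 = 0.009

# P(W) using the slide's formula with the marginals P(S) and P(R)
P_S = P_S_given_C[True] * P_C + P_S_given_C[False] * (1 - P_C)   # 0.3
P_R = P_R_given_C[True] * P_C + P_R_given_C[False] * (1 - P_C)   # 0.5
P_W = sum(P_W_given_SR[(s, r)] * (P_S if s else 1 - P_S) * (P_R if r else 1 - P_R)
          for s in (True, False) for r in (True, False))
print(P_W)                                 # 0.5985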
Advantages of Bayesian Approach
•Bayesian networks can readily handle
incomplete data sets.
•Bayesian networks allow one to learn
about causal relationships
•Bayesian networks readily facilitate use of
prior knowledge.
68
ML Resources (Books)
1. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
2. Tom M. Mitchell, "Machine Learning", First Edition, McGraw Hill Education, 2013.
69
ML Resources (Books)
3. Nilsson, N. (2004). Introduction to Machine
Learning.
http://robotics.stanford.edu/people/nilsson/
mlbook.html.
4. Russell, S. (1997). Machine Learning.
Handbook of Perception and Cognition,
Vol. 14, Chap. 4.
5. Ethem Alpaydin, “Introduction to Machine
Learning”, (Adaptive Computation and
Machine Learning Series), Third Edition,
MIT Press, 2014.
70
Journals -IEEE
•IEEE Transactions on Neural Networks.
•IEEE Transactions on Pattern Analysis and Machine Intelligence.
•IEEE Transactions on Neural Networks and Learning Systems.
•IEEE Transactions on Artificial Intelligence.
•IEEE Transactions on Knowledge and Data Engineering.
71
ML Journals -Elsevier
•Machine Learning with Applications
•Expert Systems With Applications
•Applied Soft Computing
•Knowledge-based Systems
•Neural Networks
•Data & Knowledge Engineering
•Artificial Intelligence
72
Neural Networks
Similarity with biological networks
The fundamental processing element of a neural network is the neuron, which:
1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result
4. Outputs the final result
•Biologically motivated approach to machine learning
73
Similarity with Biological Network
•Fundamental processing element of a neural network is a neuron
•A human brain has 100 billion neurons
•An ant brain has 250,000 neurons
74
Neural Network
•A Neural Network is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it.
•Neural Network learning is also called CONNECTIONIST learning due to the connections between units.
•It is a case of SUPERVISED (CLASSIFICATION) learning.
75
Neural Network
•A Neural Network learns by adjusting the weights so as to be able to correctly classify the training data and hence, after the testing phase, to classify unknown data.
•A Neural Network needs a long time for training.
•A Neural Network has a high tolerance to noisy and incomplete data.
76
Neural Network Classifier
•Input: classification data; it contains the classification attribute.
•Data is divided, as in any classification problem, into training data and testing data.
•All data must be normalized, i.e. all attribute values in the database are changed to lie in the interval [0,1] or [-1,1].
•A Neural Network can work with data in the range (0,1) or (-1,1).
77
One Neuron as a Network
–The neuron receives the weighted sum as input and calculates the output as a function of the input as follows:
•y = f(x), where f(x) is defined as
•f(x) = 0 when x < 0.5, and f(x) = 1 when x >= 0.5
•For example, if x is 0.55, then y = 1 and the input values are classified into class 1.
•If x = 0.45, f(x) = 0 and the input values are classified into class 0.
78
Bias of a Neuron
•We need the bias value to be added to the weighted sum ∑wixi so that we can shift the decision boundary away from the origin.
v = ∑wixi + b, where b is the bias
[Figure: decision lines x1 - x2 = -1, x1 - x2 = 0, and x1 - x2 = 1 in the (x1, x2) plane]
79
Bias as extra input
•The bias can be treated as an extra input x0 = +1 with weight w0 = b.
v = ∑ (j = 0 to m) wj xj, where w0 = b and x0 = +1
[Figure: inputs x1, …, xm (attribute values) with weights w1, …, wm, a summing function producing v, and an activation function producing the output class y]
80
Neuron with Activation
•The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm
2. An adder function (linear combiner) computing the weighted sum of the inputs: u = ∑ (j = 1 to m) wj xj
3. An activation function for limiting the amplitude of the neuron output: y = φ(u + b)
81
A Multilayer Feed-Forward Neural Network
[Figure: input nodes receiving the input record xi, hidden nodes, and output nodes Ok producing the output class; wij are the weights, and the network is fully connected]
82
Neural Network Learning
•The inputs are fed simultaneously into the input
layer.
•The weighted outputs of these units are fed into the hidden layer.
•The weighted outputs of the last hidden layer are
inputs to units making up the output layer.
83
A Multilayer Feed Forward Network
•The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
•A network containing two hidden layers is called a three-layer neural network, and so on.
•The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
84
A Multilayer Feed-Forward Network
•INPUT: records without the class attribute, with normalized attribute values.
•INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of (non-class) attributes.
•INPUT LAYER – there are as many nodes as non-class attributes, i.e. as the length of the input vector.
•HIDDEN LAYER – the number of nodes in the hidden layer and the number of hidden layers depends on the implementation.
85
A Multilayer Feed-Forward Network
•OUTPUT LAYER – corresponds to the class attribute.
•There are as many output nodes Ok as classes (values of the class attribute), k = 1, 2, …, #classes.
•The network is fully connected, i.e. each unit provides input to each unit in the next forward layer.
86
Classification by Back propagation
•Back Propagation learns by iteratively processing a
set of training data (samples).
•For each sample, weights are modified to
minimize the error between network’s
classification and actual classification.
87
Steps in Back propagation
Algorithm
•STEP ONE: initialize the weights and biases.
•The weights in the network are initialized to random
numbers from the interval [-1,1].
•Each unit has a BIAS associated with it
•The biases are similarly initialized to random
numbers from the interval [-1,1].
•STEP TWO: feed the training sample.
88
Steps in Back propagation Algorithm
( cont..)
•STEP THREE: Propagate the inputs forward; we
compute the net input and output of each unit in
the hidden and output layers.
•STEP FOUR: back propagate the error.
•STEP FIVE: update weights and biases to reflect
the propagated errors.
•STEP SIX: terminating conditions.
89
Propagation through Hidden Layer (One Node)
•The inputs to unit j are outputs from the previous layer. These are multiplied by their corresponding weights in order to form a weighted sum, which is added to the bias θj associated with unit j.
•A nonlinear activation function f is applied to the net input.
[Figure: input vector x = (x0, x1, …, xn), weight vector w = (w0j, w1j, …, wnj), bias θj, the weighted sum fed to the activation function f, and the output y]
90
Propagate the inputs forward
•For unit j in the input layer, its output is equal to its input, that is, Oj = Ij for input unit j.
•The net input to each unit in the hidden and output layers is computed as follows. Given a unit j in a hidden or output layer, the net input is
Ij = ∑i wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit j, Oi is the output of unit i from the previous layer, and θj is the bias of the unit.
91
Propagate the inputs forward
•Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
•Given a net input Ij to unit j, the output of unit j is computed as
Oj = f(Ij) = 1 / (1 + e^(-Ij))
92
Back propagate the error
•When reaching the output layer, the error is computed and propagated backwards.
•For a unit k in the output layer the error is computed by the formula:
Errk = Ok (1 - Ok)(Tk - Ok)
where Ok is the actual output of unit k (computed by the activation function Ok = 1 / (1 + e^(-Ik))), Tk is the true output based on the known class label of the training sample, and Ok(1 - Ok) is the derivative (rate of change) of the activation function.
93
Back propagate the error
•The error is propagated backwards by updating the weights and biases to reflect the error of the network classification.
•For a unit j in the hidden layer the error is computed by the formula:
Errj = Oj (1 - Oj) ∑k Errk wjk
where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
94
Update weights and biases
•Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate; this learning rate is fixed for the implementation.
Δwij = (l) Errj Oi
wij = wij + Δwij
•Biases are updated by the following equations:
Δθj = (l) Errj
θj = θj + Δθj
95
Update weights and biases
•We update weights and biases after the presentation of each sample. This is called case updating.
•Epoch – one iteration through the training set is called an epoch.
•Epoch updating – alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
•Case updating is more accurate.
96
Terminating Conditions
•Training stops when:
•All Δwij in the previous epoch are below some threshold, or
•The percentage of samples misclassified in the previous epoch is below some threshold, or
•A pre-specified number of epochs has expired.
•In practice, several hundreds of thousands of epochs may be required before the weights converge.
97
Backpropagation Formulas
[Figure: input nodes (input vector xi), hidden nodes, and output nodes (output vector), connected by weights wij]
Ij = ∑i wij Oi + θj
Oj = 1 / (1 + e^(-Ij))
Errk = Ok (1 - Ok)(Tk - Ok)
Errj = Oj (1 - Oj) ∑k Errk wjk
wij = wij + (l) Errj Oi
θj = θj + (l) Errj
98
Example of Back propagation
Initialize the weights to random numbers from -1.0 to 1.0 (input units = 3, hidden neurons = 2, output unit = 1).
Initial input and weights:
x1 = 1, x2 = 0, x3 = 1
w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2, w46 = -0.3, w56 = -0.2
99
Example ( cont.. )
•Bias values are added to the hidden and output nodes and initialized to random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
100
Net Input and Output Calculation
Unit j | Net input Ij                                | Output Oj = 1 / (1 + e^(-Ij))
4      | 0.2 + 0 - 0.5 - 0.4 = -0.7                  | 1 / (1 + e^0.7) = 0.332
5      | -0.3 + 0 + 0.2 + 0.2 = 0.1                  | 1 / (1 + e^-0.1) = 0.525
6      | (-0.3)(0.332) - (0.2)(0.525) + 0.1 = -0.105 | 1 / (1 + e^0.105) = 0.475
101
Calculation of Error at Each Node
Unit j | Errj
6      | 0.475 x (1 - 0.475) x (1 - 0.475) = 0.1311   (we assume the target T6 = 1)
5      | 0.525 x (1 - 0.525) x 0.1311 x (-0.2) = -0.0065
4      | 0.332 x (1 - 0.332) x 0.1311 x (-0.3) = -0.0087
102
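The forward pass and error terms of this worked example can be reproduced with a short Python sketch:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x1, x2, x3 = 1, 0, 1
w14, w15, w24, w25, w34, w35, w46, w56 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2, -0.3, -0.2
theta4, theta5, theta6 = -0.4, 0.2, 0.1
T6 = 1                                            # target output assumed above

# Forward pass: Ij = sum_i wij * Oi + theta_j, Oj = sigmoid(Ij)
I4 = w14 * x1 + w24 * x2 + w34 * x3 + theta4      # -0.7
I5 = w15 * x1 + w25 * x2 + w35 * x3 + theta5      #  0.1
O4, O5 = sigmoid(I4), sigmoid(I5)                 # 0.332, 0.525
I6 = w46 * O4 + w56 * O5 + theta6                 # -0.105
O6 = sigmoid(I6)                                  # 0.475

# Backward pass: error terms
Err6 = O6 * (1 - O6) * (T6 - O6)                  # 0.1311
Err5 = O5 * (1 - O5) * Err6 * w56                 # -0.0065
Err4 = O4 * (1 - O4) * Err6 * w46                 # -0.0087
print(round(O4, 3), round(O5, 3), round(O6, 3))
print(round(Err6, 4), round(Err5, 4), round(Err4, 4))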
DEEP LEARNING
•Deep learning is a subset of ML in AI that has networks capable of learning, unsupervised, from data that is unstructured or unlabeled.
•Deep Learning is a subfield of Machine
Learning that involves the use of neural
networks to model and solve complex
problems.
104
DEEP LEARNING
•The key characteristic of Deep Learning is the use
of deep neural networks, which have multiple
layers of interconnected nodes.
•These networks can learn complex
representations of data by discovering hierarchical
patterns and features in the data.
•Deep Learning algorithms can automatically learn
and improve from data without the need for
manual feature engineering.
105
DEEP LEARNING
•Deep Learning has achieved significant success in
various fields, including image recognition,
natural language processing, speech recognition,
and recommendation systems.
•Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep
Belief Networks (DBNs).
106
DEEP LEARNING
•Training deep neural networks typically
requires a large amount of data and
computational resources.
•However, the availability of cloud computing and the development of specialized hardware, such as Graphics Processing Units (GPUs), have made it easier to train deep neural networks.
107
CNN
•There are certain steps/operations involved in a CNN. These can be categorized as follows:
•Convolution operation
•Pooling
•Flattening
•Fully connected layers
110
CNN
•To understand this operation, let us consider an image as input to our CNN.
•When an image is given as input, it is in the form of a matrix of pixels.
•If the image is grayscale, then the image is represented as a single matrix; each value in the matrix ranges from 0 to 255.
113
CNN
•We can even normalize these values, let's say into the range 0-1, where 0 represents white and 1 represents black.
•If the image is colored, then there are three matrices representing the RGB colors, with each value in the range 0-255. The same can be seen in the images below:
114
Fig 1: Colored image matrices
115
Fig 2: Grayscale image matrix
116
Convolution Operation
•MATHEMATICAL OPERATION
Coming to the convolution operation, let us consider an input image. For the convolution operation, filters or kernels are used.
117
Convolution Operation
•The following mathematical operation is performed:
Let the size of the image be NxN and the size of the filter be FxF.
Then (NxN) * (FxF) = (N-F+1)x(N-F+1), where * denotes the convolution operation.
All these kernels, input channels, etc. are hyperparameters. The result of each layer is passed on to the next one.
118
Example
119
Example
•Here, an input of size 6x6 is given and a kernel of size 3x3 is used. The feature map obtained is of size 4x4.
To increase non-linearity in the image, a rectifier function can be applied to the feature map.
Finally, after the convolution step is completed and the feature map is obtained, this map is given as input to the pooling layer, as in the sketch below.
120
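A plain-Python sketch of such a convolution ("valid" mode, no padding); the 6x6 image and 3x3 kernel values are illustrative, and the output is the 4x4 feature map.

def convolve2d(image, kernel):
    n, f = len(image), len(kernel)
    size = n - f + 1                                   # (N - F + 1) x (N - F + 1)
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(f) for b in range(f))
             for j in range(size)] for i in range(size)]

image  = [[(i + j) % 2 for j in range(6)] for i in range(6)]   # illustrative 6x6 input
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]                  # illustrative 3x3 filter
feature_map = convolve2d(image, kernel)
print(len(feature_map), "x", len(feature_map[0]))              # 4 x 4
relu_map = [[max(0, v) for v in row] for row in feature_map]   # rectifier for non-linearity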
CNN Architecture
•A common CNN model architecture is to
have a number of convolution and pooling
layers stacked one after the other.
121
CNN Architecture
122
Pooling Layers
•Pooling layers are used to reduce the
dimensions of the feature maps.
•Thus, it reduces the number of parameters
to learn and the amount of computation
performed in the network.
123
Pooling Layers
•The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer.
•So, further operations are performed on summarized features instead of precisely positioned features generated by the convolution layer.
•This makes the model more robust to variations in the position of the features in the input image.
124
Max Pooling
•Types of pooling layers: max, min and average pooling.
Max Pooling
•Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter.
•Thus, the output after a max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
125
Max Pooling
126
Average Pooling
•Average pooling computes the average of the elements present in the region of the feature map covered by the filter.
•Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
127
Average Pooling
128
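A sketch of 2x2 max pooling and average pooling with stride 2 over an illustrative 4x4 feature map:

def pool2d(fmap, size=2, stride=2, mode="max"):
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            patch = [fmap[i * stride + a][j * stride + b]
                     for a in range(size) for b in range(size)]
            row.append(max(patch) if mode == "max" else sum(patch) / len(patch))
        out.append(row)
    return out

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [3, 4, 1, 8]]
print(pool2d(fmap, mode="max"))   # [[6, 4], [7, 9]] - most prominent feature per patch
print(pool2d(fmap, mode="avg"))   # [[3.75, 2.25], [4.0, 4.5]] - average per patch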
Flattening
•Flattening is converting the data into a 1-dimensional array for inputting it to the next layer.
•We flatten the output of the convolutional layers to create a single long feature vector.
•It is connected to the final classification model, which is called a fully-connected layer.
129
CNN
•In other words, we put all the pixel data in
one line and make connections with the
final layer.
•And once again. What is the final layer for?
The classification of ‘the cats and dogs.’
130
Flattening
131
Recurrent Neural Network
•A Recurrent Neural Network (RNN) is a type of Neural Network where the output from the previous step is fed as input to the current step.
•In traditional neural networks, all the inputs and outputs are independent of each other.
•But in cases where it is required to predict the next word of a sentence (as in NLP), the previous words are required.
•Hence, there is a need to remember the previous words.
132
Recurrent Neural Network
•Thus RNNs came into existence, which solved this issue with the help of a hidden layer.
•The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
133
How RNN works
•The Recurrent Neural Network consists of
multiple fixed activation function units, one
for each time step.
•Each unit has an internal state which is
called the hidden state of the unit.
•This hidden state signifies the past
knowledge that the network currently holds
at a given time step.
136
RNN
•This hidden state is updated at every time
step to signify the change in the knowledge
of the network about the past.
•The hidden state is updated using the
following recurrence relation:-
137
RNN
•The formula for calculating the current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
RNN
•Training through RNN
•A single-time step of the input is provided
to the network.
•Then it calculates its current state using the current input and the previous state.
•The current ht becomes ht-1 for the next
time step.
139
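A minimal sketch of this loop for scalar states, assuming the common form ht = tanh(w_hh * ht-1 + w_xh * xt); the weights and inputs below are illustrative.

import math

def rnn(inputs, w_hh=0.5, w_xh=1.0, h0=0.0):
    h = h0
    states = []
    for x in inputs:                        # one fixed activation unit per time step
        h = math.tanh(w_hh * h + w_xh * x)  # the current ht becomes ht-1 for the next step
        states.append(h)
    return states                           # the final state is used to compute the output

print(rnn([1.0, 0.5, -0.2]))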
RNN
•One can go through as many time steps as the problem requires and join the information from all the previous states.
•Once all the time steps are completed, the final current state is used to calculate the output.
140
RNN
•The output is then compared to the actual output, i.e. the target output, and the error is generated.
•The error is then back-propagated to the network to update the weights, and hence the network (RNN) is trained using backpropagation through time.
141