Machine learning and deep learning algorithms


About This Presentation

Machine learning and deep learning algorithms prepared for giving a lecture to the faculty members who are teaching this course


Slide Content

Fundamentals of Deep Learning
Dr. A. Kannan, Former Professor and Head,
Department of Information Science and Technology, CEG Campus, Anna University, Chennai-25.
Senior Professor, School of Computer Science and Engineering, VIT, Vellore-632014.
1

MACHINE LEARNING
•“Learning is making useful changes in our
minds.” -Marvin Minsky.
•“Machine Learning (ML) refers to a system
which has the capability of autonomous
knowledge acquisition and integration of
the acquired knowledge.”
2

MACHINE LEARNING
•Machine learning is an application of Artificial Intelligence (AI) that provides systems with the ability to automatically learn and improve by themselves from the experience they gain, without being explicitly programmed.
•It focuses on the development of intelligent computer programs that can access data and use it to learn by themselves.
3

Applications of ML
•Image Processing – Face Recognition, Handwritten Character Recognition, Self-driving Cars, Traffic Video Analysis, ….
•Natural Language Processing – Social Network Analysis, Recommendation Systems and Sentiment Analysis.
•Medical Diagnosis – Disease Identification, Prediction of Cancer, Diabetes, etc. using past history and current data.
4

Machine Learning Paradigms
•Rote Learning
•Transfer of Learning
•Learning by Taking Advice
•Learning By Analogy
•Un-Supervised Learning (Clustering)
•Supervised Learning –Classification
•Deep Learning
5

Machine Learning Tasks
•Knowledge Representation and Reasoning
•Regression
•Classification
•Clustering
•Dimensionality reduction
•Reinforcement learning (Ranking)
6

AI -Knowledge Based Systems
•Facts
•Rules
•Knowledge base
•Knowledge Based Systems
•Knowledge Representation
•Reasoning and Inference
7

FACTS AND RULES
•Pat is a man = True
•Kumar is the father of Raja = True
•Kumar is the grandfather of Raja = False
•IF marks >= 60 THEN Class = FIRST CLASS
8

Rules
•If A then B (whenever A is true, then B is also true)
•A = true (now A is TRUE)
•Inference: B is true now
•Using A → B and A, we can infer B.
•This rule is called Modus Ponens.
•A → B together with A implies B.
9

Inference
•Pat is a man = p
•Pat is a woman = q
•Pat is a man or a woman = p ∨ q = true
•Pat is not a woman = ¬q = true
•Inference: p is true
•Pat is a man or a woman; Pat is not a woman.
•Inference: Pat is a man
10

Knowledge Base Vs Database
Knowledge Base: more rules, fewer facts; explicit rules and facts; experts update; main-memory based.
Database: fewer rules, more facts; explicit facts and implicit rules; clerks update; disk based.
11

AI Programs – exhibit intelligent behavior by skillful application of heuristics.
KBS – make domain knowledge explicit.
Expert Systems – apply expert knowledge to difficult, real-world problems.
12

Knowledge Representation
Techniques
•English or Natural Language
•Tables and Rules
•Logic (Propositional logic, Predicate logic)
•Semantic Networks
•Frames
•Conceptual Dependency
•Scripts
•Ontology
13

LOGIC
•First Order Logic
–Predicate Logic
–Propositional Logic
•Higher Order Logics
–Situational Logic
–Fuzzy Logic
–Temporal Logic
–Modal Logic
–Epistemic Logic
14

Searching
•Depth First (Missionaries and Cannibals)
•Breadth First (Water Jug Problem)
•Hill Climbing (8-Puzzle)
•Best First
•A* Algorithm
•AO* Algorithm
•Mini-Max Algorithm
15

REASONING METHODS
•Reasoning By Analogy –Frames.
•Temporal Reasoning – Higher order logic (Temporal Logic).
•Fuzzy Reasoning –Higher Order Logic
(Fuzzy Logic).
•Non-monotonic Reasoning –Higher Order
Logics (Non-monotonic Logic).
•Reasoning Agents –Epistemic Logic.
16

AI and ML
•Roughly speaking, AI and ML are good ways
to ask a computer to provide an answer to a
problem based on some past experience.
(Prediction, Learning, Explanation and Finding
Temporal Dependencies)
•It might be challenging to tell a computer what a cat is, for instance. (Computers don't have common sense – General Problem Solver – Human Intelligence.)
17

AI and ML
•Still, if you show a neural network enough images
of cats and tell it they are cats, then the computer
will be able to correctly identify other cats that it did
not see before.
•It appears that some of the most prominent and
widely used AI and ML algorithms can be speeded-
up significantly if they are run on quantum
computers. (Example: Bayesian Networks, Graph
Search Algorithms-Shortest path algorithms –
Heuristic Search Algorithms-and Swarm
Intelligence algorithms …).
18

Learning Methods
•Learning from examples
•Winston’s Program
•Explanation based Learning
•Learning by Observation
•Knowledge Acquisition from experts
19

Machine Learning Methods
•Un-Supervised Learning
•Clustering
•K-means clustering
•Supervised learning
•Classification Algorithms
20

K-means Clustering
•Strengths
–Simple iterative method
–User provides “K”
•Weaknesses
–Often too simple bad results
–Difficult to guess the correct “K”
21

K-means Clustering
Basic Algorithm:
•Step 0: Select K
•Step 1: Randomly select initial cluster seeds, e.g.
Seed 1 = 650
Seed 2 = 200
22

K-means Clustering
•An initial cluster seed represents the “mean
value” of its cluster.
•In the preceding figure:
–Cluster seed 1 = 650
–Cluster seed 2 = 200
23

K-means Clustering
•Step 2: Calculate distance from each object
to each cluster seed.
•What type of distance should we use?
–Squared Euclidean distance
•Step 3: Assign each object to the closest
cluster
24

K-means Clustering
[Figure: data points assigned to the clusters around Seed 1 and Seed 2]
25

K-means Clustering
•Iterate:
–Calculate distance from objects to cluster
centroids.
–Assign objects to closest cluster
–Recalculate new centroids
•Stop based on convergence criteria
–No change in clusters
–Max iterations
26
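A minimal one-dimensional sketch of the algorithm above, in Python. The seed values 650 and 200 follow the earlier example; the data points are made up for illustration.

```python
# Minimal 1-D K-means sketch (K = 2). The seeds 650 and 200 follow the
# earlier example; the data points below are made up for illustration.
def kmeans_1d(points, seeds, max_iters=100):
    centroids = list(seeds)
    for _ in range(max_iters):
        # Steps 2-3: assign each object to the closest centroid
        # (squared Euclidean distance)
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            clusters[idx].append(p)
        # Recalculate each centroid as the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # convergence: no change in centroids
            break
        centroids = new_centroids
    return centroids, clusters

data = [150, 180, 210, 250, 600, 640, 700, 720]
print(kmeans_1d(data, seeds=[650, 200]))
```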

Regression
•The regression task comes from supervised machine learning.
•It helps us to predict (estimate continuous values) and explain objects based on a given set of numerical and categorical data.
•For example, we can predict house prices based on house attributes such as number of rooms, size, and location.
27

Regression
•In mathematical terms, the regression method provides us a straight line with the equation Y = mX + c to model a dataset.
•Here we take the X (independent variable) and Y (dependent variable) data points to train the linear regression model. The best-fit line is found by calculating the slope (m) and y-intercept (c) values.
28
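A small Python sketch of fitting m and c by least squares; the house data below is made up for illustration.

```python
# Least-squares estimate of slope m and intercept c in Y = mX + c.
# X (independent) and Y (dependent) values are made-up training data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m = covariance(X, Y) / variance(X)
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - m * mean_x
    return m, c

rooms  = [2, 3, 3, 4, 5]           # house attribute (X)
prices = [60, 80, 85, 110, 130]    # house price (Y), arbitrary units
m, c = fit_line(rooms, prices)
print(f"price = {m:.1f} * rooms + {c:.1f}")
```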

Applications
•Risk assessment –Insurance, Banking
•Score prediction –Cricket, Elections
•Market forecasting –Share Market
•Weather forecasting
•Housing and product price prediction
•Analysing engine performance in
Automobiles.
29

Regression Analysis
•The regression analysis is performed with
various effective algorithms namely
•Simple linear regression
•Multiple linear regression
•Decision trees
•Random forest
•Support Vector Machines (SVM)
30

Decision Trees
31
Name Debt Income Married? Risk
--------------------------------------------------------
Joe High High Yes Good
Sue Low High Yes Good
John Low High No Poor
Mary High Low Yes Poor
Fred Low Low Yes Poor

32
Decision Tree Classification
•Example
[Figure: a decision tree that splits the data on Income = High vs. Income = Low into partitions D1 and D2]

33
Decision Trees Classification (cont.)
•Example
[Figure: the tree is grown further – after the Income split into D1 and D2, partition D1 is split again into D1a and D1b]

Random Forest
•Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. It combines "bagging" with random selection of features.
•For many data sets, it produces more accurate classification results than a single decision tree.
34
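A short sketch using scikit-learn (a library not named in the slides, assumed here for illustration) to fit a random forest on the small credit-risk table from the Decision Trees slide; the 0/1 encoding of the attributes is an assumption made for the example.

```python
# Random forest on the credit-risk table (Joe, Sue, John, Mary, Fred).
from sklearn.ensemble import RandomForestClassifier

# Encode Debt, Income, Married? as 1/0 (High/Yes = 1, Low/No = 0)
X = [[1, 1, 1],   # Joe
     [0, 1, 1],   # Sue
     [0, 1, 0],   # John
     [1, 0, 1],   # Mary
     [0, 0, 1]]   # Fred
y = ["Good", "Good", "Poor", "Poor", "Poor"]

# Each tree is grown on a bootstrap sample (bagging) with random feature selection;
# the forest predicts the mode of the individual trees' votes.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([[1, 1, 0]]))   # e.g. High debt, High income, not married
```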

Bagging
•Bagging, also known as bootstrap aggregation, is an ensemble learning method that is commonly used to reduce variance within a noisy dataset.
•In bagging, a random sample of data in a training set is selected with replacement, meaning that individual data points can be chosen more than once.
•Boosting is used to reduce the bias.
35

Regression Vs Classification
•The most significant difference between regression and classification is that while regression helps predict a continuous quantity, classification predicts discrete class labels. There are also some overlaps between the two types of machine learning algorithms.
36

Support Vector Machines
•It is a linear classifier.
•It classifies the dataset into two groups for
binary classification problems.
•The multi-class SVM classifies the data set
into multiple groups.
•It is a supervised learning method.
37

Linear Classifiers
•A linear classifier has the form f(x, w, b) = sign(w·x + b).
•[Figure: two classes of points, one denoted +1 and the other −1; the line w·x + b = 0 separates the regions w·x + b > 0 and w·x + b < 0]
•How would you classify this data?
38

Linear Classifiers
•f(x, w, b) = sign(w·x + b)
•[Figure: the same two-class data, denoted +1 and −1]
•How will you classify this data?
39

40
Maximum Margin
•f(x, w, b) = sign(w·x + b)
•The maximum margin linear classifier is the linear classifier with the maximum margin.
•This is the simplest kind of SVM (called an LSVM – Linear SVM).
•Support vectors are those datapoints that the margin pushes up against.
•[Figure: the maximum-margin separating line for the two classes (+1 and −1), with the support vectors lying on the margin]

Probabilistic Models
•Uncertainty
•ABC Murder Story
•Bayesian Classification
•Neural Networks
•Feature Selection and Classification
•Deep Learning
41

Probabilistic Models
•The probabilistic framework for machine learning is that learning can be thought of as inferring plausible models to explain observed data.
•A machine can use such models to make predictions about future data, and take decisions that are rational given these predictions.
42

BAYESIAN CLASSIFICATION
•CONDITIONAL PROBABILITY
•BAYES THEOREM
•NAÏVE BAYES CLASSIFIER
•BELIEF NETWORK
•APPLICATION OF BAYESIAN NETWORK -
CYBER CRIME DETECTION
43

BAYESIAN CLASSIFICATION
•Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
•Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
44

BAYESIAN THEOREM
•A special case of Bayesian
Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)
[Figure: Venn diagram of overlapping events A and B]
45

BAYESIAN THEOREM
•Example 1:A medical cancer diagnosis
problem
There are 2 possible outcomes of a diagnosis:
+ve, -ve. We know 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.
If a patient’s test returns +ve, should we
diagnose the patient as having cancer?
46

BAYESIAN THEOREM
P(cancer) = 0.008          P(-cancer) = 0.992
P(+ve|cancer) = 0.98       P(-ve|cancer) = 0.02
P(+ve|-cancer) = 0.03      P(-ve|-cancer) = 0.97
Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = 0.98 x 0.008 / P(+ve) = 0.00784 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = 0.03 x 0.992 / P(+ve) = 0.02976 / P(+ve)
Since 0.02976 > 0.00784, the patient most likely does not have cancer.
47
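The same arithmetic in Python, including normalization by P(+ve):

```python
# The cancer-diagnosis numbers from the slide, worked out in Python.
p_cancer, p_no_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_no_cancer = 0.98, 0.03

num_cancer    = p_pos_given_cancer * p_cancer          # 0.00784
num_no_cancer = p_pos_given_no_cancer * p_no_cancer    # 0.02976
p_pos = num_cancer + num_no_cancer                      # P(+ve)

print("P(cancer | +ve)  =", num_cancer / p_pos)     # about 0.21
print("P(-cancer | +ve) =", num_no_cancer / p_pos)  # about 0.79 -> most likely no cancer
```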

NAÏVE BAYES CLASSIFIER
•A simplified assumption: attributes are
conditionally independent.
•Greatly reduces the computation cost, only
count the class distribution.
48

NAÏVE BAYES CLASSIFIER
The probabilistic model of the NBC is to find the probability of a certain class given multiple (assumed conditionally independent) feature values.
The naïve Bayes classifier applies to learning tasks where
each instance x is described by a conjunction of attribute
values and where the target function f(x) can take on any
value from some finite set V.
A set of training examples of the target function is provided,
and a new instance is presented, described by the tuple
of attribute values <a1,a2,…,an>. The learner is asked to
predict the target value, or classification, for this new
instance.
49

NAÏVE BAYES CLASSIFIER
Abstractly, the probability model for a classifier is a conditional model
P(C|F1,F2,…,Fn)
over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1,…,Fn.
Naïve Bayes classification rule:
c* = argmax_c [P(C=c) x P(F1|C=c) x P(F2|C=c) x … x P(Fn|C=c)] / P(F1,F2,…,Fn)
Since P(F1,F2,…,Fn) is common to all classes, we need not evaluate the denominator for comparisons.
50

NAÏVE BAYES CLASSIFIER
Tennis Example
[Table: the PlayTennis training data (14 examples) with attributes Outlook, Temperature, Humidity and Wind, and class PlayTennis]
51

NAÏVE BAYES CLASSIFIER
•Problem:
Use training data from above to classify the
following instances:
a)<Outlook=sunny, Temperature=cool,
Humidity=high, Wind=strong>
b)<Outlook=overcast, Temperature=cool,
Humidity=high, Wind=strong>
52

NAÏVE BAYES CLASSIFIER
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
53

NAÏVE BAYES CLASSIFIER
P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes) x
P(strong|yes) = 0.0053
P(no)xP(sunny|no)xP(cool|no)xP(high|no) x
P(strong|no) = 0.0206
So the class for this instance is 'no'. We can normalize the probability by:
[0.0206]/[0.0206+0.0053] = 0.795
54
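A short Python sketch that reproduces the calculation for instance (a) from the conditional probabilities above:

```python
# Naive Bayes for instance (a) of the PlayTennis example,
# using the conditional probabilities listed on the previous slide.
p_yes, p_no = 9/14, 5/14
cond_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9}
cond_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5}

instance = ["sunny", "cool", "high", "strong"]
score_yes, score_no = p_yes, p_no
for value in instance:
    score_yes *= cond_yes[value]
    score_no  *= cond_no[value]

print(score_yes, score_no)   # about 0.0053 vs 0.0206 -> class 'no'
print("P(no | x) =", score_no / (score_yes + score_no))   # about 0.795
```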

NAÏVE BAYES CLASSIFIER
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
55

NAÏVE BAYES CLASSIFIER
Estimating Probabilities:
In the previous example, P(overcast|no) = 0 which
causes the formula-
P(no)xP(overcast|no)xP(cool|no)xP(high|no) x P(strong|no) = 0.0
This causes problems in comparing because the
other probabilities are not considered. We can
avoid this difficulty by using m-estimate.
56

NAÏVE BAYES CLASSIFIER
M-Estimate Formula:
[c + k] / [n + m], where c/n is the original probability used before, k = 1, and m is the equivalent sample size (here, the number of distinct values of the attribute or class).
Using this method our new values of probability are given below.
57
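A small helper implementing this m-estimate; the example call reproduces P(Outlook=overcast|PlayTennis=no) = 1/8 from the next slide (m = 3 is an assumption matching the three Outlook values).

```python
# m-estimate of probability: (c + k) / (n + m), with k = 1.
def m_estimate(c, n, m, k=1):
    """c = count of (attribute value, class), n = count of class,
    m = equivalent sample size."""
    return (c + k) / (n + m)

# P(Outlook=overcast | PlayTennis=no): c = 0, n = 5, m = 3 (Outlook has 3 values)
print(m_estimate(0, 5, 3))   # 0.125 instead of 0 -> no zero probabilities
```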

NAÏVE BAYES CLASSIFIER
New answer to (b):
P(PlayTennis=yes) = 10/16 = 0.63
P(PlayTennis=no) = 6/16 = 0.37
P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
P(Outlook=overcast|PlayTennis=no) = 1/8 = .13
P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
P(Temperature=cool|PlayTennis=no) = 2/8 = .25
P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
P(Wind=strong|PlayTennis=no) = 4/7 = 0.57
58

NAÏVE BAYES CLASSIFIER
P(yes)xP(overcast|yes)xP(cool|yes)xP(high|yes)xP(strong|yes) = 0.011
P(no)xP(overcast|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.00486
So the class of this instance is 'yes'.
59

NAÏVE BAYES CLASSIFIER
•The conditional probability values of all the
attributes with respect to the class are
pre-computed and stored on disk.
•This prevents the classifier from computing
the conditional probabilities every time it
runs.
•This stored data can be reused to reduce the
latency of the classifier.
60

Bayesian Belief Networks
•In Naïve Bayes Classifier we make the
assumption of class conditional
independence, that is given the class label
of a sample, the value of the attributes are
conditionally independent of one another.
•However, there can be dependencies between the values of attributes. To handle this, we use a Bayesian Belief Network, which provides the joint conditional probability distribution.
61

Bayesian Belief Networks
•A Bayesian network is a form of probabilistic
graphical model.
•Specifically, a Bayesian network is a directed acyclic
graph of nodes representing variables and arcs
representing dependence relations among the
variables.
•They provide a graphical method for getting the inferred
results through joint probabilities.
62

[Figure: example Bayesian belief network with nodes Cloudy (C), Sprinkler (S), Rain (R) and Wet Grass (W)]
63

BAYESIAN BELIEF NETWORK
[Figure: the same network annotated with its conditional probability tables for C, S|C, R|C and W|S,R]
64

BELIEF NETWORKS
•By the chaining rule of probability, the joint
probability of all the nodes in the graph above is:
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
W=Wet Grass, C=Cloudy, R=Rain, S=Sprinkler
Example: P(W∩-R∩S∩C)
= P(W|S,-R)*P(-R|C)*P(S|C)*P(C)
= 0.9*0.2*0.1*0.5 = 0.009
65
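A short Python check of this joint probability, using the CPT entries quoted above:

```python
# Joint probability P(W, -R, S, C) by the chain rule, using the values
# quoted on the slide: P(W|S,-R)=0.9, P(-R|C)=0.2, P(S|C)=0.1, P(C)=0.5.
p_c               = 0.5
p_s_given_c       = 0.1
p_not_r_given_c   = 0.2
p_w_given_s_not_r = 0.9

joint = p_w_given_s_not_r * p_not_r_given_c * p_s_given_c * p_c
print(joint)   # 0.009, as on the slide
```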

BAYESIAN BELIEF NETWORK
What is the probability of wet grass on a given
day -P(W)?
P(W) = P(W|SR) * P(S) * P(R) +
P(W|S-R) * P(S) * P(-R) +
P(W|-SR) * P(-S) * P(R) +
P(W|-S-R) * P(-S) * P(-R)
Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C)
P(R) = P(R|C) * P(C) + P(R|-C) * P(-C)
P(W)= 0.5985
66


Advantages of Bayesian Approach
•Bayesian networks can readily handle
incomplete data sets.
•Bayesian networks allow one to learn
about causal relationships
•Bayesian networks readily facilitate use of
prior knowledge.
68

ML Resources (Books)
1.Stephen Marsland, “Machine Learning –
An Algorithmic Perspective”, Second
Edition, Chapman and Hall/CRC Machine
Learning and Pattern Recognition Series,
2014.
2. Tom M. Mitchell, "Machine Learning", First Edition, McGraw Hill Education, 2013.
69

ML Resources (Books)
3. Nilsson, N. (2004). Introduction to Machine
Learning.
http://robotics.stanford.edu/people/nilsson/
mlbook.html.
4. Russell, S. (1997). Machine Learning.
Handbook of Perception and Cognition,
Vol. 14, Chap. 4.
5. Ethem Alpaydin, “Introduction to Machine
Learning”, (Adaptive Computation and
Machine Learning Series), Third Edition,
MIT Press, 2014.
70

Journals -IEEE
•IEEE Transactions on Neural Networks.
•IEEE Transactions on Pattern Analysis and Machine Intelligence.
•IEEE Transactions on Neural Networks and
Learning Systems.
•IEEE Transactions on Artificial Intelligence
•IEEE Transactions on Knowledge and Data
Engineering.
71

ML Journals -Elsevier
•Machine Learning with Applications
•Expert Systems With Applications
•Applied Soft Computing
•Knowledge-based Systems
•Neural Networks
•Data & Knowledge Engineering
•Artificial Intelligence
72

Neural Networks
Similarity with the biological network: the fundamental processing element of a neural network is the neuron, which
1. Receives inputs from other sources
2. Combines them in some way
3. Performs a generally nonlinear operation on the result
4. Outputs the final result
•Biologically motivated approach to
machine learning
73

Similarity with Biological Network
•The fundamental processing element of a neural network is the neuron.
•A human brain has 100 billion neurons.
•An ant brain has 250,000 neurons.
74

Neural Network
•A Neural Network is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it.
•Neural Network learning is also called CONNECTIONIST learning due to the connections between units.
•It is a case of SUPERVISED or CLASSIFICATION learning.
75

Neural Network
•A Neural Network learns by adjusting the weights so as to be able to correctly classify the training data and hence, after the testing phase, to classify unknown data.
•A Neural Network needs a long time for training.
•A Neural Network has a high tolerance to noisy and incomplete data.
76

Neural Network Classifier
•Input: Classification data
It contains classification attribute
•Data is divided, as in any classification problem.
[Training data and Testing data]
•All data must be normalized, i.e. all values of attributes in the database are changed to lie in the interval [0,1] or [-1,1].
•A Neural Network can work with data in the range (0,1) or (-1,1).
77

One Neuron as a Network
–The neuron receives the weighted sum as input and calculates the output as a function of the input as follows:
•y = f(x), where f(x) is defined as
•f(x) = 0 when x < 0.5, and f(x) = 1 when x >= 0.5
•For example, if x is 0.55, then y = 1: the input values are classified in class 1.
•If x = 0.45, f(x) = 0: the input values are classified to class 0.
78

Bias of a Neuron
•We need a bias value to be added to the weighted sum ∑wixi so that we can shift the decision boundary away from the origin:
v = ∑wixi + b, where b is the bias
[Figure: decision boundaries x1 − x2 = −1, x1 − x2 = 0 and x1 − x2 = 1, obtained for different bias values]
79

Bias as extra input
[Figure: a neuron with input attribute values x1, x2, …, xm, weights w1, w2, …, wm, a summing function, and an activation function φ(v) producing the output class y]
Treating the bias as an extra input x0 = +1 with weight w0 = b, the induced local field is
v = Σ_{j=0..m} wj xj
80

Neuron with Activation
•The neuron is the basic information processing unit of a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights W1, W2, …, Wm
2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers): u = Σ_{j=1..m} wj xj
3. An activation function φ for limiting the amplitude of the neuron output: y = φ(u + b), where b is the bias.
81

A Multilayer Feed-Forward Neural Network
[Figure: input nodes receiving the input record xi, hidden nodes, and output nodes Ok producing the output class; layers are connected by weights wij and the network is fully connected]
82

Neural Network Learning
•The inputs are fed simultaneously into the input
layer.
•The weighted outputs of these units are fed into
hidden layer.
•The weighted outputs of the last hidden layer are
inputs to units making up the output layer.
83

A Multilayer Feed Forward Network
•The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units.
•A network containing two hidden layers is called a three-layer neural network, and so on.
•The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer.
84

A Multilayer Feed-Forward Network
•INPUT: records without class attribute with normalized
attributes values.
•INPUT VECTOR: X = { x1, x2, …. xn}
where n is the number of (non class) attributes.
•INPUT LAYER–there are as many nodes as non-class
attributes i.e. as the length of the input vector.
•HIDDEN LAYER–the number of nodes in the hidden
layer and the number of hidden layers depends on
implementation.
85

A Multilayer Feed-Forward Network
•OUTPUT LAYER – corresponds to the class attribute.
•There are as many output nodes Ok as classes (values of the class attribute), k = 1, 2, …, #classes.
•The network is fully connected, i.e. each unit provides input to each unit in the next forward layer.
86

Classification by Back propagation
•Back Propagation learns by iteratively processing a
set of training data (samples).
•For each sample, weights are modified to
minimize the error between network’s
classification and actual classification.
87

Steps in Back propagation
Algorithm
•STEP ONE: initialize the weights and biases.
•The weights in the network are initialized to random
numbers from the interval [-1,1].
•Each unit has a BIAS associated with it
•The biases are similarly initialized to random
numbers from the interval [-1,1].
•STEP TWO: feed the training sample.
88

Steps in Back propagation Algorithm
( cont..)
•STEP THREE: Propagate the inputs forward; we
compute the net input and output of each unit in
the hidden and output layers.
•STEP FOUR: back propagate the error.
•STEP FIVE: update weights and biases to reflect
the propagated errors.
•STEP SIX: terminating conditions.
89

Propagation through Hidden Layer
( One Node )
•The inputs to unit j are outputs from the previous layer. These are
multiplied by their corresponding weights in order to form a weighted
sum, which is added to the bias associated with unit j.
•A nonlinear activation function f is applied to the net input.
[Figure: an input vector x0, x1, …, xn with weights w0j, w1j, …, wnj feeding the weighted sum of unit j; the bias θj is added and the activation function f produces the output y]
90

Propagate the inputs forward
•For unit j in the input layer, its output is equal to its input, that is, Oj = Ij for input unit j.
•The net input to each unit in the hidden and output layers is computed as follows. Given a unit j in a hidden or output layer, the net input is
Ij = Σi wij Oi + θj
where wij is the weight of the connection from unit i in the previous layer to unit j, Oi is the output of unit i from the previous layer, and θj is the bias of the unit.
91

Propagate the inputs forward
•Each unit in the hidden and output layers takes its net
input and then applies an activation function. The
function symbolizes the activation of the neuron
represented by the unit. It is also called a logistic,
sigmoid, or squashing function.
•Given the net input Ij to unit j, the output Oj of unit j is computed as
Oj = f(Ij) = 1 / (1 + e^(-Ij))
92

Back propagate the error
•When reaching the Output layer, the error is
computed and propagated backwards.
•For a unit k in the output layer the error is computed by the formula:
Errk = Ok (1 − Ok)(Tk − Ok)
where Ok is the actual output of unit k (computed by the activation function Ok = 1/(1 + e^(-Ik))), Tk is the true output based on the known class label of the training sample, and Ok(1 − Ok) is the derivative (rate of change) of the activation function.
93

Back propagate the error
•The error is propagated backwards by updating weights and biases to reflect the error of the network classification.
•For a unit j in the hidden layer the error is computed by the formula:
Errj = Oj (1 − Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
94

Update weights and biases
•Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate; this learning rate is fixed for the implementation:
Δwij = (l) Errj Oi
wij = wij + Δwij
•Biases are updated by the following equations:
Δθj = (l) Errj
θj = θj + Δθj
95

Update weights and biases
•We are updating weights and biases after the presentation of each sample. This is called case updating.
•Epoch – one iteration through the training set is called an epoch.
•Epoch updating – alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
•Case updating is more accurate.
96

Terminating Conditions
•Training stops when:
–All Δwij in the previous epoch are below some threshold, or
–The percentage of samples misclassified in the previous epoch is below some threshold, or
–A pre-specified number of epochs has expired.
•In practice, several hundreds of thousands of epochs may be required before the weights converge.
97

Backpropagation Formulas
[Figure: input nodes (input vector xi), hidden nodes and output nodes (output vector), connected by weights wij]
Ij = Σi wij Oi + θj
Oj = 1 / (1 + e^(-Ij))
Errk = Ok (1 − Ok)(Tk − Ok)
Errj = Oj (1 − Oj) Σk Errk wjk
wij = wij + (l) Errj Oi
θj = θj + (l) Errj
98

Example of Back propagation
Initial input and weights (input units = 3, hidden neurons = 2, output = 1; weights are random numbers from -1.0 to 1.0):
x1 = 1, x2 = 0, x3 = 1
w14 = 0.2, w15 = -0.3, w24 = 0.4, w25 = 0.1, w34 = -0.5, w35 = 0.2, w46 = -0.3, w56 = -0.2
99

Example ( cont.. )
•Bias added to the hidden and output nodes, initialized to random values from -1.0 to 1.0:
θ4 = -0.4, θ5 = 0.2, θ6 = 0.1
100

Net Input and Output Calculation
Unit j | Net input Ij | Output Oj
4 | 0.2 + 0 − 0.5 − 0.4 = −0.7 | O4 = 1/(1 + e^0.7) = 0.332
5 | −0.3 + 0 + 0.2 + 0.2 = 0.1 | O5 = 1/(1 + e^(-0.1)) = 0.525
6 | (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105 | O6 = 1/(1 + e^0.105) = 0.475
101

Calculation of Error at Each Node
Unit j | Errj
6 | 0.475 (1 − 0.475)(1 − 0.475) = 0.1311 (we assume the target T6 = 1)
5 | 0.525 (1 − 0.525) × 0.1311 × (−0.2) = −0.0065
4 | 0.332 (1 − 0.332) × 0.1311 × (−0.3) = −0.0087
102

Calculation of Weight and Bias Updates (learning rate l = 0.9)
Weight/Bias | New value
w46 | −0.3 + 0.9 × 0.1311 × 0.332 = −0.261
w56 | −0.2 + 0.9 × 0.1311 × 0.525 = −0.138
w14 | 0.2 + 0.9 × (−0.0087) × 1 = 0.192
w15 | −0.3 + 0.9 × (−0.0065) × 1 = −0.306
… similarly for the remaining weights
θ6 | 0.1 + 0.9 × 0.1311 = 0.218
… similarly for the remaining biases
103
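A compact Python sketch that reproduces this worked example end to end (forward pass, error back-propagation and case updating); the results match the tables above up to rounding.

```python
# One backpropagation pass for the slide's example (units 1-3 input,
# 4-5 hidden, 6 output; target T6 = 1, learning rate l = 0.9).
import math

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

x = {1: 1, 2: 0, 3: 1}
w = {(1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5, (1, 5): -0.3, (2, 5): 0.1,
     (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
l, T6 = 0.9, 1

# Forward pass: net input and sigmoid output for hidden and output units
O = dict(x)
for j in (4, 5):
    O[j] = sigmoid(sum(w[(i, j)] * O[i] for i in (1, 2, 3)) + theta[j])
O[6] = sigmoid(w[(4, 6)] * O[4] + w[(5, 6)] * O[5] + theta[6])

# Backward pass: errors at the output and hidden units
err = {6: O[6] * (1 - O[6]) * (T6 - O[6])}
for j in (4, 5):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[(j, 6)]

# Case updating of weights and biases
for (i, j) in w:
    w[(i, j)] += l * err[j] * O[i]
for j in theta:
    theta[j] += l * err[j]

print(round(O[6], 3), {k: round(v, 4) for k, v in err.items()})
print({k: round(v, 3) for k, v in w.items()}, round(theta[6], 3))
```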

DEEP LEARNING
•Deep learning is a subset of ML in AI that has networks capable of learning unsupervised from data that is unstructured or unlabeled.
•Deep Learning is a subfield of Machine
Learning that involves the use of neural
networks to model and solve complex
problems.
104

DEEP LEARNING
•The key characteristic of Deep Learning is the use
of deep neural networks, which have multiple
layers of interconnected nodes.
•These networks can learn complex
representations of data by discovering hierarchical
patterns and features in the data.
•Deep Learning algorithms can automatically learn
and improve from data without the need for
manual feature engineering.
105

DEEP LEARNING
•Deep Learning has achieved significant success in
various fields, including image recognition,
natural language processing, speech recognition,
and recommendation systems.
•Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and Deep
Belief Networks (DBNs).
106

DEEP LEARNING
•Training deep neural networks typically
requires a large amount of data and
computational resources.
•However, the availability of cloud
computing and the development of
specialized hardware, such as Graphics
Processing Units (GPUs), has made it easier
to train deep neural networks.
107

Convolutional Neural Networks
•A Convolutional Neural Network (CNN) is a type of deep learning algorithm that is particularly well-suited for image recognition and processing tasks.
•It is made up of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
108

Convolutional Neural Networks
•The convolutional layers are the key component of a CNN, where filters are applied to the input image to extract features such as edges, textures, and shapes.
•The output of the convolutional layers is then passed through pooling layers, which are used to down-sample the feature maps, reducing the spatial dimensions while retaining the most important information.
•The output of the pooling layers is then passed through one or more fully connected layers, which are used to make a prediction or classify the image.
109

CNN
•There are certain steps/operations involved in a CNN. These can be categorized as follows:
•Convolution operation
•Pooling
•Flattening
•Fully connected layers
110

CNN
111

CNN
•The convolution operation is the first and one of the most important steps in the functioning of a CNN. The convolution operation focuses on extracting/preserving important features from the input (an image, etc.).
112

CNN
•To understand this operation, let us consider an image as input to our CNN.
•When an image is given as input, it is in the form of a matrix of pixels.
•If the image is grayscale, then the image is considered as a single matrix, with each value in the matrix ranging from 0 to 255.
113

CNN
•We can even normalize these values, say to the range 0-1, where 0 represents black and 1 represents white.
•If the image is colored, then there are three matrices representing the RGB colors, with each value in the range 0-255. The same can be seen in the images below:
114

Fig 1: Colored image matrices
115

Fig 2: Grayscale image matrix
116

Convolution Operation
•MATHEMATICAL OPERATION
Coming to the convolution operation, let us consider an input image. For the convolution operation, filters or kernels are used.
117

Convolution Operation
•The following mathematical operation is performed:
Let the size of the image be N x N and the size of the filter be F x F.
Then (N x N) * (F x F) = (N - F + 1) x (N - F + 1), where * denotes the convolution operation.
All these kernels, input channels, etc. are hyperparameters. The result of each layer is passed on to the next one.
118

Example
[Figure: a 6x6 input convolved with a 3x3 kernel produces a 4x4 feature map]
119

Example
•Here, an input of size 6x6 is given and a kernel of size 3x3 is used. The feature map obtained is of size 4x4.
•To increase non-linearity in the image, a rectifier (ReLU) function can be applied to the feature map.
•Finally, after the convolution step is completed and the feature map is obtained, this map is given as input to the pooling layer.
120
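A plain NumPy sketch of the "valid" convolution described above; the 6x6 input values and the vertical-edge kernel are made up for illustration.

```python
# 'Valid' 2-D convolution: a 6x6 input and a 3x3 kernel give a
# (6-3+1) x (6-3+1) = 4x4 feature map.
import numpy as np

def conv2d(image, kernel):
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the kernel with the patch and sum
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image  = np.random.randint(0, 2, size=(6, 6))             # toy 6x6 binary image
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])   # vertical-edge filter
feature_map = conv2d(image, kernel)
print(feature_map.shape)            # (4, 4)
print(np.maximum(feature_map, 0))   # rectifier (ReLU) for non-linearity
```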

CNN Architecture
•A common CNN model architecture is to
have a number of convolution and pooling
layers stacked one after the other.
121

CNN Architecture
122

Pooling Layers
•Pooling layers are used to reduce the
dimensions of the feature maps.
•Thus, it reduces the number of parameters
to learn and the amount of computation
performed in the network.
123

Pooling Layers
•The pooling layer summarizes the features present in a region of the feature map generated by a convolution layer.
•So, further operations are performed on summarized features instead of precisely positioned features generated by the convolution layer.
•This makes the model more robust to variations in the position of the features in the input image.
124

Max Pooling
•Types of pooling layers: max, min and average pooling.
Max Pooling
•Max pooling is a pooling operation that selects the
maximum element from the region of the feature
map covered by the filter.
•Thus, the output after max-pooling layer would be
a feature map containing the most prominent
features of the previous feature map.
125
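A small NumPy sketch of 2x2 max pooling; the stride of 2 and the sample feature map are assumptions for illustration (the slides do not fix a stride). Swapping .max() for .mean() gives average pooling.

```python
# 2x2 max pooling (stride 2): keep the most prominent feature in each
# patch and halve the spatial dimensions of the feature map.
import numpy as np

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [1, 2, 9, 8],
                 [0, 3, 4, 7]])
print(max_pool(fmap))   # [[6. 5.], [3. 9.]]
```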

Max Pooling
126

Average Pooling
•Average pooling computes the average of the elements present in the region of the feature map covered by the filter.
•Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
127

Average Pooling
128

Flattening
•Flattening is converting the data into a 1-dimensional array for inputting it to the next layer.
•We flatten the output of the convolutional
layers to create a single long feature vector.
•And it is connected to the final
classification model, which is called a fully-
connected layer.
129
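A minimal NumPy sketch of flattening pooled feature maps into the vector that feeds a fully connected layer; the sizes and the two-class output are made-up assumptions.

```python
# Flattening: reshape the pooled feature maps into one long 1-D vector
# that feeds the fully connected (classification) layer.
import numpy as np

pooled = np.random.rand(8, 5, 5)   # 8 pooled feature maps of size 5x5
flat = pooled.reshape(-1)          # 1-D vector of length 8*5*5 = 200

# A fully connected layer is then just a weight matrix times this vector
W = np.random.rand(2, flat.size)   # 2 output classes (e.g. cat / dog)
b = np.random.rand(2)
scores = W @ flat + b
print(flat.shape, scores.shape)    # (200,) (2,)
```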

CNN
•In other words, we put all the pixel data in
one line and make connections with the
final layer.
•And once again. What is the final layer for?
The classification of ‘the cats and dogs.’
130

Flattening
131

Recurrent Neural Network
•A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step.
•In traditional neural networks, all the inputs and outputs are independent of each other.
•But in cases when it is required to predict the next word of a sentence (as in NLP), the previous words are required.
•Hence, there is a need to remember the previous words.
132

Recurrent Neural Network
•Thus the RNN came into existence, which solved this issue with the help of a hidden layer.
•The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
133

Recurrent Neural Network
•The state is also referred to as the Memory State since it remembers the previous input to the network.
•It uses the same parameters for each input as it performs the same task on all the inputs or hidden layers to produce the output.
•This reduces the complexity of parameters, unlike other neural networks.
134

RNN
135

How RNN works
•The Recurrent Neural Network consists of
multiple fixed activation function units, one
for each time step.
•Each unit has an internal state which is
called the hidden state of the unit.
•This hidden state signifies the past
knowledge that the network currently holds
at a given time step.
136

RNN
•This hidden state is updated at every time
step to signify the change in the knowledge
of the network about the past.
•The hidden state is updated using the following recurrence relation:
137

RNN
•The formula for calculating the current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
138
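A minimal sketch of this recurrence in Python; the tanh non-linearity and the weight shapes are common choices assumed here for illustration, not taken from the slides.

```python
# RNN recurrence: h_t = tanh(W_hh . h_{t-1} + W_xh . x_t)
import numpy as np

hidden_size, input_size = 4, 3
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
W_xh = np.random.randn(hidden_size, input_size) * 0.1

def step(h_prev, x_t):
    # the same parameters W_hh and W_xh are reused at every time step
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

h = np.zeros(hidden_size)                    # initial hidden (memory) state
for x_t in np.random.randn(5, input_size):   # a sequence of 5 inputs
    h = step(h, x_t)
print(h)   # final state, used to compute the output
```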

RNN
•Training through RNN
•A single-time step of the input is provided
to the network.
•Then it calculates its current state using a
set of current input and the previous state.
•The current ht becomes ht-1 for the next
time step.
139

RNN
•One can go through as many time steps as the problem requires and combine the information from all the previous states.
•Once all the time steps are completed the
final current state is used to calculate the
output.
140

RNN
•The output is then compared to the actual output, i.e. the target output, and the error is generated.
•The error is then back-propagated to the network to update the weights, and hence the network (RNN) is trained using Backpropagation Through Time.
141

Conclusions
•Machine Learning
•Supervised and Unsupervised Learning
•Neural Networks
•CNN
•RNN
142