UNIT 1:
OBJECTIVE : TO UNDERSTAND THE BASICS OF
DEEP NEURAL NETWORKS
• INTRODUCTION TO DEEP NETWORKS
Introduction to Neural Networks, Feed-forward Networks, Deep Feed-forward
Networks -Learning XOR,
Gradient Based learning, Hidden Units, Back-propagation and other
Differential Algorithms,
Regularization for Deep Learning,
Optimization for training Deep Models.
AD2512 DEEP LEARNING
•The word “neural” is the adjective form
of “neuron,” and “network” denotes a graph-like structure;
therefore, an “Artificial Neural Network” is a computation system
that attempts to mimic (or at least, is inspired by) the neural
connections in our nervous system.
•Artificial neural networks are also referred to as “neural
networks” or “artificial neural systems.” It is common to
abbreviate Artificial Neural Network and refer to them
as “ANN” or simply “NN” — I will be using both of the
abbreviations.
•For a system to be considered an NN, it must contain a labeled,
directed graph structure where each node in the graph performs
some simple computation. From graph theory, we know that a
directed graph consists of a set of nodes (i.e., vertices) and a set
of connections (i.e., edges) that link together pairs of nodes.
•Figure 1: A simple neural network architecture. Inputs are presented to the network. Each
connection carries a signal through the two hidden layers in the network. A final function
computes the output class label.
Structure of NN
Each node performs a simple computation.
Each connection then carries a signal (i.e., the output of the
computation) from one node to another, labeled by a weight
indicating the extent to which the signal is amplified or
diminished.
Some connections have large, positive weights that amplify the
signal, indicating that the signal is very important when making
a classification.
Others have negative weights, diminishing the strength of the
signal, thus specifying that the output of the node is less
important in the final classification.
We call such a system an Artificial Neural Network if it consists
of a graph structure (like in Figure 1) with connection weights
that are modifiable using a learning algorithm.
Relation to Biology
•Our brains are composed of roughly 86 billion
neurons, each connected to thousands (up to about 10,000)
of other neurons.
•The cell body of the neuron is called the soma, where
the inputs (dendrites) and outputs (axons) connect
soma to other soma (Figure 2).
Each neuron receives electrochemical
inputs from other neurons at its dendrites.
If these inputs are sufficiently
powerful to activate the neuron, then the
activated neuron transmits the signal along
its axon, passing it along to the dendrites of
other neurons.
These attached neurons may also fire, thus
continuing the process of passing the
message along.
The key takeaway here is that a neuron
firing is a binary operation — the neuron
either fires or it doesn’t fire. There are no
different “grades” of firing.
Simply put, a neuron will only fire if the total
signal received at the soma exceeds a
given threshold.
BNN vs ANN
Biological Neural Networks (BNNs):
- Found in living organisms
- Composed of biological neurons
- Highly adaptive and energy-efficient
- Process information through electrochemical signals
Artificial Neural Networks (ANNs):
- Man-made and inspired by BNNs
- Composed of artificial neurons (perceptrons)
- Trained using algorithms like backpropagation
- Process information through mathematical functions and
signals
What is a feed forward neural network?
•Feed forward neural networks are artificial neural networks in which
connections between nodes do not form loops. Because information is only ever
passed forward, networks of this type with hidden layers are also known as
multi-layer neural networks.
•During data flow, input nodes receive data, which travels through the hidden
layers and exits through the output nodes. No links exist in the network that
could send information back from the output nodes.
•A feed forward neural network approximates functions in the following
way:
•A classifier, for example, is described by the formula y = f*(x).
•Input x is therefore assigned to category y.
•The feed forward model defines a mapping y = f(x; θ) and learns the value of
the parameters θ that gives the closest approximation of f*.
•Feed forward neural networks serve as the basis for object detection in
photos, as in the Google Photos app.
What is the working principle of a feed
forward neural network?
In its simplest form, a feed forward neural network is a
single-layer perceptron.
This model multiplies inputs by weights as they enter the
layer. The weighted input values are then added
together to get a sum. If the sum rises above a certain
threshold, usually set at zero, the output is typically 1; if the
sum falls below the threshold, the output is typically -1.
As a feed forward neural network model, the single-layer
perceptron is often used for classification.
Machine learning can also be integrated into single-layer
perceptrons. Through training, neural networks can adjust
their weights based on a property called the delta rule, which
helps them compare their outputs with the intended values.
This training and learning process is a form of gradient descent.
Multi-layered perceptrons update their weights in a similar way,
but the process is known as back-propagation. In this
case, the network's hidden layers are adjusted
according to the output values produced by the final layer.
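The single-layer perceptron can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `predict` and `train_perceptron`, the learning rate, the epoch count, the +1/-1 output convention, and the choice of logical AND as training data are all made up for the example. The update used here is the perceptron-style form of the delta rule, where each weight moves in proportion to (target - output).

```python
def predict(weights, bias, x):
    """Threshold unit: weighted sum of the inputs, compared against zero."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else -1


def train_perceptron(samples, learning_rate=0.5, epochs=20):
    """Adjust weights in proportion to (target - output) for each sample."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # 0 when the output is correct
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Logical AND, with targets encoded as +1 (true) and -1 (false).
AND_SAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
print(weights, bias, predictions)
```

Because AND is linearly separable, the updates stop changing the weights once every sample is classified correctly.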
Function in feed forward neural
network
The cost function is essential in neural
networks as it:
•Quantifies the error in predictions.
•Guides the optimization process to adjust
weights and reduce errors.
•Helps in training by providing a target to
minimize.
For multiclass classification, the
cross-entropy loss is typically used:
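A standard form of this loss, assuming a one-hot target vector y and predicted class probabilities ŷ produced by a softmax, is L = −Σ_k y_k log(ŷ_k), which reduces to the negative log-probability of the correct class. The sketch below computes it for one example; the function names and the raw scores are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


scores = [2.0, 1.0, 0.1]                    # hypothetical raw outputs for three classes
probs = softmax(scores)
loss_if_class_0 = cross_entropy(probs, 0)   # correct class has the highest score
loss_if_class_2 = cross_entropy(probs, 2)   # correct class has the lowest score
print(probs, loss_if_class_0, loss_if_class_2)
```

The loss is small when the network puts high probability on the correct class and grows as that probability shrinks.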
Structure of NN: f(x) = f1(f2(f3(x)))
In a matrix format, it looks as follows:
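As one possible illustration, assume each layer computes activation(W x + b), where W is a weight matrix and b a bias vector. The weights, layer sizes, and helper names below (`matvec`, `layer`) are hypothetical; the point is only that a network is a chain of such matrix operations.

```python
def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def identity(vector):
    return vector


def layer(weights, biases, inputs, activation):
    """One layer in matrix form: activation(W x + b)."""
    pre_activation = [z + b for z, b in zip(matvec(weights, inputs), biases)]
    return activation(pre_activation)


# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output network.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, 0.1, -0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.2]

x = [1.0, 2.0]
hidden = layer(W1, b1, x, relu)            # first function applied to the input
output = layer(W2, b2, hidden, identity)   # second function applied to the result
print(hidden, output)
```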
Gradient Descent
•Basic Idea: Adjust weights to minimize the error by moving in the direction
opposite to the gradient of the error function.
•Learning Rate (η): A hyperparameter that determines the step size
for each update. A suitable learning rate ensures efficient and stable
convergence.
•Variants:
•Stochastic Gradient Descent (SGD): Updates weights using one sample at a time,
leading to noisy but frequent updates.
•Mini-Batch Gradient Descent: Uses a small batch of samples for each update,
balancing between full batch and stochastic methods.
•Batch Gradient Descent: Uses the entire training dataset for one update, leading to
smoother but computationally intensive updates.
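The three variants differ only in how many samples feed each update, which the following sketch makes explicit through a single `batch_size` argument. It is a toy under stated assumptions: the model is a line y = w * x + b, the data is noise-free and generated from y = 2x + 1, and the names `gradient` and `train` as well as all hyperparameter values are hypothetical.

```python
import random


def gradient(w, b, batch):
    """Gradient of the mean squared error for the model y = w * x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def train(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """batch_size=1 is SGD, len(data) is batch GD, anything between is mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, batch)
            w -= learning_rate * grad_w   # step against the gradient
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data drawn from y = 2x + 1 (an assumption for the example).
DATA = [(k / 4.0, 2 * (k / 4.0) + 1) for k in range(-8, 9)]

for name, size in [("SGD", 1), ("mini-batch", 4), ("batch", len(DATA))]:
    w, b = train(DATA, batch_size=size)
    print(f"{name:10s} w={w:.3f} b={b:.3f}")
```

With noisy real data the three variants would trace visibly different paths; here all of them should end close to w = 2 and b = 1.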
GRADIENT DESCENT VARIANTS:
COMPARISON
Gradient-Based Learning
•Forward Pass (Input Data -> [Input Layer] -> [Hidden Layers] -> [Output Layer] -> Predicted Output)
•Compute the Loss Function (Predicted Output, Actual Output -> [Loss Function] -> Loss Value)
•Backpropagation (Loss Value -> [Backpropagation] -> Gradients for Weights and Biases)
•Gradient Descent Optimization (Gradients -> [Update Parameters] -> New Weights and Biases)
•Iterative Process and Convergence (Repeat [Forward Pass -> Compute Loss -> Backpropagation -> Gradient Descent]
until Convergence)
•Purpose: Gradient-based learning is a foundational method in deep learning used to train models by minimizing the error or
"cost" function.
•Basic Idea: The goal is to adjust the model's parameters (weights and biases) iteratively to reduce the difference between the
predicted output and the actual output.
Step 1: Forward Pass:
•The input data passes through the neural network, layer by layer, until it produces an output. During this pass, the
network calculates the activations at each neuron based on the current weights and biases.
Step 2: Compute the Loss Function:
•The loss function measures how far the network's prediction is from the actual result. Common loss functions include Mean
Squared Error (MSE) for regression tasks and Cross-Entropy Loss for classification tasks.
Step 3: Backpropagation:
•Backpropagation calculates the gradient of the loss function with respect to each weight and bias in the network. It works
by applying the chain rule to propagate the error backwards through the network.
Step 4: Gradient Descent Optimization:
•The gradients calculated during backpropagation are used to update the weights and biases. This step involves moving the
parameters in the opposite direction of the gradient to reduce the loss.
•The update rule typically looks like this: New Parameter = Old Parameter − Learning Rate × Gradient
Step 5: Iterative Process and Convergence:
The forward pass, loss computation, backpropagation, and parameter update steps are repeated for multiple iterations
(epochs) until the loss stabilizes or converges to a minimum value.
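The five steps can be traced in a deliberately small setting: one sigmoid neuron learning logical OR with binary cross-entropy loss. Everything specific here is an assumption made for the sketch, including the function name `train_neuron`, the learning rate, the stopping tolerance, and the choice of OR as the task. For a sigmoid output with cross-entropy loss, the gradient with respect to the pre-activation simplifies to (prediction − target), which is what the code uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, learning_rate=0.5, max_epochs=5000, tolerance=1e-6):
    """Train one sigmoid neuron with full-batch gradient descent."""
    w1, w2, b = 0.0, 0.0, 0.0
    previous_loss = float("inf")
    n = len(samples)
    for _ in range(max_epochs):
        loss, g1, g2, gb = 0.0, 0.0, 0.0, 0.0
        for (x1, x2), y in samples:
            p = sigmoid(w1 * x1 + w2 * x2 + b)                      # Step 1: forward pass
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # Step 2: loss
            delta = p - y                                           # Step 3: gradient via chain rule
            g1 += delta * x1
            g2 += delta * x2
            gb += delta
        loss /= n
        w1 -= learning_rate * g1 / n                                # Step 4: parameter update
        w2 -= learning_rate * g2 / n
        b -= learning_rate * gb / n
        if abs(previous_loss - loss) < tolerance:                   # Step 5: stop when loss stabilizes
            break
        previous_loss = loss
    return (w1, w2, b), loss


OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
(w1, w2, b), final_loss = train_neuron(OR_SAMPLES)
predictions = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for (x1, x2), _ in OR_SAMPLES]
print(predictions, final_loss)
```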
•To apply gradient-based learning we must choose a cost function, and we must choose how to represent the
output of the model. We now revisit these design considerations with special emphasis on the neural networks
scenario.
•1. Cost Functions
• In neural networks, we use a cost function (or loss function) to measure how well the model's predictions
match the actual data. Often, we use the cross-entropy cost function, which is common in many models,
including linear ones. It is derived from the principle of maximum likelihood, where we want our model's predicted
probabilities to be as close as possible to the actual probabilities.
•Learning Conditional Distributions with Maximum Likelihood
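For reference, the maximum likelihood cost is usually written as the negative log-likelihood below. The notation is an assumption borrowed from the Deep Learning textbook by Goodfellow, Bengio and Courville, since these notes do not define symbols for it: p̂_data is the empirical data distribution, p_model the distribution defined by the network, and θ its parameters. The second line shows the well-known special case in which a Gaussian output model recovers mean squared error.

```latex
J(\theta) = -\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}}
            \log p_{\text{model}}(\mathbf{y}\mid\mathbf{x};\theta)

% If p_model(y | x) = N(y; f(x; \theta), I), the cost reduces to
J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}}
            \lVert \mathbf{y} - f(\mathbf{x};\theta) \rVert^{2} + \text{const}
```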
Learning Conditional Statistics
Challenges in Gradient-Based Learning
1. Non-Convex Loss Functions: Neural networks often have non-convex loss functions, leading to
multiple local minima. This makes it difficult to guarantee convergence to the global minimum.
2. Sensitive to Initial Parameters: The outcome of gradient-based learning can vary significantly
depending on the initial values of weights and biases.
3. Learning Rate Selection: Choosing the right learning rate is crucial. A high learning rate can cause
the model to overshoot the minimum, while a low learning rate can result in slow convergence.
4. Vanishing/Exploding Gradients: In deep networks, gradients can become very small (vanish) or
very large (explode), making training difficult and slow.
5. Overfitting: Neural networks are prone to overfitting, especially when trained on small datasets.
Regularization techniques like dropout are often needed to prevent this.
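The effect of the learning rate (challenge 3) is easy to see on a one-dimensional toy problem. The sketch below runs gradient descent on f(w) = w², a stand-in chosen only because its minimum (w = 0) and gradient (2w) are known; the function name and the three learning rates are hypothetical.

```python
def minimize_quadratic(learning_rate, steps=20, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


too_low = minimize_quadratic(0.001)    # barely moves away from the start
reasonable = minimize_quadratic(0.1)   # approaches the minimum at w = 0
too_high = minimize_quadratic(1.1)     # overshoots further on every step
print(too_low, reasonable, too_high)
```

With the largest rate, each step jumps across the minimum and lands farther away than it started, so the iterates grow instead of shrinking.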
Hidden Units
•Hidden units, also known as neurons in the hidden layers, are
fundamental components of neural networks. They allow the network
to learn complex representations by transforming inputs into outputs
through learned weights and activation functions.
•Key Concepts
1.Activation Functions
2.Role of Hidden Layers
3.Initialization
Hidden units
•Purpose: Hidden units process input features and transform them
into representations that are useful for the network’s task.
•Rectified Linear Units (ReLU):
Function: ReLU(x)=max (0,x)
•Properties:
•Advantages: Simple and efficient, helps avoid the vanishing gradient
problem.
•Disadvantages: Can suffer from "dying ReLU" issue where some neurons
might never activate.
•Logistic Sigmoid:
Function: σ(x) = 1 / (1 + e^(−x))
•Properties:
•Advantages: Outputs values between 0 and 1, useful for binary
classification.
•Disadvantages: Can suffer from vanishing gradients for very large
positive or very large negative values of x.
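Both activation functions and their derivatives fit in a few lines, which makes the advantages and disadvantages listed above concrete. The sketch assumes the standard definitions ReLU(x) = max(0, x) and σ(x) = 1 / (1 + e^(−x)); the function names are illustrative.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.6f}")
```

The sigmoid's derivative is close to zero at x = ±10, which is the vanishing gradient effect, while ReLU passes a gradient of 1 for any positive input and a gradient of 0 for any negative input (the source of the "dying ReLU" issue).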
Backpropagation
•Purpose: Calculate the gradient of the loss function with respect to
each weight by applying the chain rule.
•Steps:
•Forward Pass: Compute the output of the network for a given input.
•Loss Calculation: Compute the loss/error between the predicted output and
the actual target.
•Backward Pass: Compute the gradient of the loss with respect to each weight
by propagating the error backwards through the network.
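The three steps can be followed by hand on a very small network. The sketch below assumes a 1-1-1 network (one input, one sigmoid hidden unit, one linear output) with squared-error loss; the starting weights and all function names are hypothetical. A finite-difference estimate is included only as a sanity check that the chain-rule gradients are right.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(params, x, target):
    """Forward pass plus squared-error loss for the 1-1-1 network."""
    w1, b1, w2, b2 = params
    hidden = sigmoid(w1 * x + b1)
    output = w2 * hidden + b2
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Gradients of the loss with respect to (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    # Forward pass, keeping intermediate values for reuse.
    hidden = sigmoid(w1 * x + b1)
    output = w2 * hidden + b2
    # Backward pass: start at the loss and move toward the input.
    d_output = output - target                 # dL/d(output)
    d_w2 = d_output * hidden
    d_b2 = d_output
    d_hidden = d_output * w2                   # error sent back to the hidden unit
    d_z = d_hidden * hidden * (1.0 - hidden)   # through the sigmoid
    d_w1 = d_z * x
    d_b1 = d_z
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up, x, target) - loss_fn(down, x, target)) / (2 * eps))
    return grads


params = [0.5, -0.3, 1.2, 0.1]   # hypothetical starting weights
analytic = backprop(params, x=1.5, target=1.0)
numeric = numerical_gradient(params, x=1.5, target=1.0)
print(analytic)
print(numeric)
```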
•EX: learning XOR
Solving the XOR
problem
•Logical XOR
•The most familiar truth tables
are OR and AND. These tasks can be
solved by a simple
Perceptron. XOR stands for 'exclusive or'.
The XOR function outputs true only
if the two inputs are different. If
the two inputs are identical, the XOR
function returns false. The
following table shows the inputs and
outputs of the XOR function.
Inputs    Output
x1  x2    y
0   0     0
0   1     1
1   0     1
1   1     0
•So, what is special about the XOR function? In contrast to the OR
problem and the AND problem, the XOR problem cannot be linearly separated by a single
boundary line. Let's have a look at the following images:
•The image on the right shows what the separation for the XOR function should look like. We can see
that two boundary lines are needed to solve the problem. (Refer to the notes.)
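A quick experiment illustrates the difference. The sketch below tries every combination of two weights and a bias from a small grid and checks whether a single threshold unit reproduces each truth table. The grid range and the function names are arbitrary choices for the example. Failing to find a solution on a grid is not a proof, but it is consistent with the fact that XOR is not linearly separable.

```python
from itertools import product


def linear_unit(w1, w2, b, x1, x2):
    """Single threshold unit: outputs 1 on one side of the line w1*x1 + w2*x2 + b = 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def separable_on_grid(truth_table, grid):
    """Return True if some (w1, w2, b) on the grid reproduces the whole truth table."""
    for w1, w2, b in product(grid, repeat=3):
        if all(linear_unit(w1, w2, b, x1, x2) == y
               for (x1, x2), y in truth_table.items()):
            return True
    return False


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

GRID = [i / 2 for i in range(-8, 9)]  # candidate values from -4.0 to 4.0 in steps of 0.5

results = {name: separable_on_grid(table, GRID)
           for name, table in [("AND", AND), ("OR", OR), ("XOR", XOR)]}
print(results)
```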
•Learning the XOR (exclusive OR) problem using a neural network
means training the network to correctly map the input pairs to their
respective XOR outputs. The XOR function has inputs and outputs that
are not linearly separable, making it a classic problem to demonstrate
the power of neural networks to solve nonlinear problems.
•Learning XOR with a Neural Network
•Step-by-Step Process:
1. Network Architecture:
   a. Input Layer: 2 neurons (for the two inputs of XOR)
   b. Hidden Layer: 2 neurons (with activation functions)
   c. Output Layer: 1 neuron (with an activation function)
2. Initialization:
   a. Randomly initialize weights and biases for the connections between layers.
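A network with this 2-2-1 shape can represent XOR exactly. The sketch below does not train anything: instead of random initialization followed by learning, it plugs in one known hand-chosen set of weights (the ReLU solution presented in the Deep Learning textbook by Goodfellow, Bengio and Courville) so that the result is deterministic. The linear output unit and the name `xor_network` are assumptions of this example, not part of the steps above.

```python
def relu(z):
    return max(0.0, z)


# Hand-chosen parameters (not learned): 2 hidden units, 1 output unit.
HIDDEN_WEIGHTS = [[1.0, 1.0], [1.0, 1.0]]   # one row per hidden unit
HIDDEN_BIASES = [0.0, -1.0]
OUTPUT_WEIGHTS = [1.0, -2.0]
OUTPUT_BIAS = 0.0


def xor_network(x1, x2):
    """Forward pass: 2 inputs -> 2 ReLU hidden units -> 1 linear output."""
    hidden = [relu(w[0] * x1 + w[1] * x2 + b)
              for w, b in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)]
    return sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_network(x1, x2))
```

The hidden layer maps the inputs (0, 1) and (1, 0) to the same point, after which a single linear output is enough to separate the classes. Training from random weights aims to find parameters with the same effect.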
•For the implementation and mathematical representation of XOR, refer to the
class notes and lab record.
Regularization
•What are L1 regularization and L2 regularization?
•L1 regularization, also known as Lasso regularization, is a method in
deep learning that adds the sum of absolute values of the weights to
the loss function. It encourages sparsity by driving some weights to
zero, resulting in feature selection.
•L2 regularization, also called Ridge regularization, adds the sum of
squared weights to the loss function, promoting smaller but non-zero
weights and preventing extreme values.
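The two penalties and their gradients can be compared side by side. This is a minimal sketch: the weight values, the data loss, and the regularization strength are hypothetical numbers, and the gradient at exactly zero for L1 is set to 0 by convention.

```python
def l1_penalty(weights, strength):
    """L1 (Lasso) term: strength times the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 (Ridge) term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def l1_gradient(weights, strength):
    """Constant-size pull toward zero, however small the weight already is."""
    return [strength * ((w > 0) - (w < 0)) for w in weights]


def l2_gradient(weights, strength):
    """Pull toward zero proportional to the weight, so it fades as the weight shrinks."""
    return [2 * strength * w for w in weights]


weights = [3.0, -0.5, 0.01]
data_loss = 0.25      # hypothetical loss from the data term alone
strength = 0.1        # hypothetical regularization strength

print(data_loss + l1_penalty(weights, strength))
print(data_loss + l2_penalty(weights, strength))
print(l1_gradient(weights, strength))
print(l2_gradient(weights, strength))
```

For the smallest weight (0.01), the L1 gradient is still as large as for the biggest one, which is why L1 tends to push small weights all the way to zero, while the L2 gradient for that weight is almost nothing.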