UNIT-I
•Introduction
•Feed forward Neural networks
•Gradient descent and the back propagation algorithm
•Unit saturation
•The vanishing gradient problem and ways to mitigate it
•ReLU
•Heuristics for avoiding bad local minima
•Heuristics for faster training
•Nesterov's accelerated gradient descent
•Regularization
•Dropout
Introduction to Deep Learning
•Deep learning is a subfield of artificial intelligence (AI) and machine learning that focuses on training artificial neural networks to perform tasks that typically require human intelligence.
•It has gained widespread attention and made significant advancements in various applications, including image recognition, natural language processing, speech recognition, and more.
Here are some common types of deep learning models:
Feedforward Neural
Networks (FNNs):
•These are the fundamental building blocks of deep learning. FNNs consist of an input layer, one or more hidden layers, and an output layer.
•Each layer contains nodes (neurons) that process and transform the data.
•FNNs are used for various tasks, including regression and classification.
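The layer structure described above can be sketched as a forward pass in plain Python. This is a minimal illustration, not a reference implementation: the weights, the layer sizes, and the choice of ReLU for the hidden layer are all assumptions made for the example.

```python
def relu(values):
    """Apply max(0, v) to every element."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass `inputs` through every layer; ReLU on all but the output layer."""
    h = inputs
    for i, (weights, biases) in enumerate(layers):
        h = dense(h, weights, biases)
        if i < len(layers) - 1:  # hidden layers only
            h = relu(h)
    return h


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer (assumed weights)
    ([[1.0, 1.0]], [0.5]),                    # output layer (assumed weights)
]
output = forward([2.0, 1.0], layers)
print(output)
```

For a regression task the output would be used directly; for classification it would typically be passed through a sigmoid or softmax first.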
Convolutional Neural
Networks (CNNs):
•CNNs are designed for
processing grid-like data, such
as images and videos.
•They use convolutional layers
to automatically learn features
from local regions of the input,
making them highly effective in
tasks like image classification,
object detection, and image
segmentation.
Common types of deep learning
(contd..)
Recurrent Neural
Networks (RNNs):
•RNNs are designed for sequential data, such as time series, text, and speech. They have feedback connections, allowing them to maintain a memory of previous inputs.
•RNNs are suitable for tasks like natural language processing (NLP), machine translation, and speech recognition.
Long Short-Term Memory
(LSTM):
•LSTMs are a type of RNN architecture designed to capture long-range dependencies in sequential data more effectively.
•They use specialized memory cells to store and update information over longer sequences, making them suitable for tasks requiring understanding of context over time.
Gated Recurrent Unit
(GRU):
•GRUs are another variant of
RNNs that address the
vanishing gradient problem,
like LSTMs.
•They are computationally more
efficient and often used for
similar sequence-based tasks
in NLP and speech
recognition.
Autoencoders:
•Autoencoders are neural networks used for unsupervised learning and dimensionality reduction.
•They consist of an encoder that maps input data to a lower-dimensional representation (encoding) and a decoder that reconstructs the original data from this encoding.
•Autoencoders are used in applications like image denoising and anomaly detection.
Generative Adversarial
Networks (GANs):
•GANs consist of two neural networks, a generator and a discriminator, that compete against each other.
•The generator tries to create data that is indistinguishable from real data, while the discriminator tries to tell real from fake.
•GANs are used for tasks like image generation, style transfer, and data augmentation.
Transformer Models:
•Transformers have revolutionized natural language processing (NLP) and have been adapted to various other domains.
•They use a self-attention mechanism to process input data in parallel, making them highly scalable and effective for sequence-to-sequence tasks.
•Notable transformer-based models include BERT, GPT (Generative Pre-trained Transformer), and T5.
Siamese Networks:
•These networks are designed for tasks involving similarity or distance measurement between pairs of inputs.
•Siamese networks have two identical subnetworks that process each input and produce embeddings that can be compared to measure similarity or dissimilarity.
Capsule Networks
(CapsNets):
•CapsNets are designed to address the shortcomings of traditional CNNs, especially in handling pose variations and hierarchical features in images.
•They use capsules instead of neurons to represent different parts of an object.
We want our network to perform correctly on the four points X = {[0, 0], [0, 1], [1, 0], [1, 1]}.
We will train the network on all four of these points.
The only challenge is to fit the training set.
We use the MSE loss function, evaluated on our whole training set.
Suppose we choose a linear model, with θ consisting of w and b.
Our model is defined to be
f(x; w, b) = x^T w + b.
Before the activation is applied, all the examples lie along a line with slope 1 in the hidden layer's space. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function.
To finish computing the value of the hidden units h for each example, we apply the rectified linear transformation (ReLU), which replaces every negative value with 0.
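The passage above can be made concrete with a small sketch. The target values (XOR of the two inputs) and the parameters W, c, w, and b below are assumptions taken from the standard textbook treatment of this example; none of them are stated in the text above. The code shows the hidden-layer values before the rectified linear transformation lying on a line with slope 1, and the network output going 0, 1, 1, 0 once the transformation is applied.

```python
def relu(values):
    """Rectified linear transformation: replace negative values with 0."""
    return [max(0, v) for v in values]


X = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 0]   # assumed XOR targets

# Assumed parameters (standard textbook solution, not given in the text).
W = [[1, 1], [1, 1]]     # input -> hidden weights
c = [0, -1]              # hidden biases
w = [1, -2]              # hidden -> output weights
b = 0                    # output bias


def pre_activation(x):
    """Hidden-layer values before the rectified linear transformation."""
    return [x[0] * W[0][j] + x[1] * W[1][j] + c[j] for j in range(2)]


def predict(x):
    """Full network: affine map, ReLU, then a linear output layer."""
    h = relu(pre_activation(x))
    return h[0] * w[0] + h[1] * w[1] + b


pre = [pre_activation(x) for x in X]
outputs = [predict(x) for x in X]

for x, z, y in zip(X, pre, outputs):
    print(x, "pre-activation:", z, "output:", y)
```

Every pre-activation pair satisfies second = first - 1, which is the line with slope 1. After ReLU the points no longer lie on a single line, so the linear output layer can fit the targets.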
GRADIENT DESCENT & BACK
PROPAGATION
•Gradient descent and the backpropagation algorithm are fundamental techniques used in training artificial neural networks for various machine learning tasks, including image recognition, natural language processing, and more.
•Gradient Descent:
•Gradient descent is an optimization algorithm used to minimize a loss function by adjusting the parameters (weights and biases) of a machine learning model iteratively. The idea is to find the set of parameters that minimizes the error between the model's predictions and the actual target values.
Here's a simple example of gradient
descent with a linear regression model:
•Objective: Minimize the mean squared error (MSE) loss for a
linear regression model.
•Linear Regression Model: The model has two parameters, a weight (w) and a bias (b). It predicts an output (y_pred) given an input (x) as follows:
•y_pred = w * x + b
•Loss Function: The MSE loss for linear regression is defined as:
•MSE = (1/n) * Σ(y_i - y_pred_i)^2
•Where:
•n is the number of data points.
•y_i is the actual target for the i-th data point.
•y_pred_i is the predicted output for the i-th data point.
Gradient Descent Algorithm:
1.Initialize w and b with random values.
2.Choose a learning rate (α), which scales the magnitude of parameter updates during gradient descent.
3.Repeat until the loss converges to a minimum value:
   1.Calculate the gradient of the loss with respect to w and b.
   2.Update w and b using the gradient and learning rate:
     w = w - α * ∂(MSE)/∂w
     b = b - α * ∂(MSE)/∂b
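The algorithm above can be sketched in a few lines of Python. The gradients follow from differentiating the MSE: ∂(MSE)/∂w = (-2/n) * Σ x_i * (y_i - y_pred_i) and ∂(MSE)/∂b = (-2/n) * Σ (y_i - y_pred_i). The data set, the learning rate, the step count, and the zero initialization (used instead of random values so the run is reproducible) are assumptions for illustration.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y_pred = w * x + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def gradients(w, b, xs, ys):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    dw = (-2.0 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))
    db = (-2.0 / n) * sum(y - (w * x + b) for x, y in zip(xs, ys))
    return dw, db


def fit(xs, ys, lr=0.05, steps=2000, w=0.0, b=0.0):
    """Run a fixed number of gradient descent updates."""
    for _ in range(steps):
        dw, db = gradients(w, b, xs, ys)
        w -= lr * dw  # w = w - α * ∂(MSE)/∂w
        b -= lr * db  # b = b - α * ∂(MSE)/∂b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit(xs, ys)
print("w =", round(w, 4), "b =", round(b, 4), "MSE =", mse(w, b, xs, ys))
```

A fixed step count stands in for the convergence check here; a practical version would more likely stop once the change in loss falls below a tolerance.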
A simple example of gradient
descent using a one-dimensional
function.
•Suppose we want to minimize the
following quadratic function:
•f(x) = x^2
•The goal is to find the minimum value of
this function using gradient descent.
Gradient descent steps:
1.Choose a starting value for x and a learning rate (α).
2.Compute the gradient:
  ∂f/∂x = 2x
3.Update x using the gradient and the learning rate:
  x = x - α * ∂f/∂x
4.Repeat steps 2 and 3 for a specified number of iterations or until convergence.
•Let's perform a few iterations of gradient
descent:
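A few iterations can be traced with the short sketch below. The starting point x = 3.0 and the learning rate α = 0.1 are assumed values chosen for illustration. With these choices each update multiplies x by (1 - 2α) = 0.8, so x shrinks steadily toward the minimum at 0.

```python
def grad(x):
    """Gradient of f(x) = x^2."""
    return 2 * x


def gradient_descent(x0, lr, steps):
    """Return the sequence of x values visited, starting from x0."""
    x = x0
    path = [x]
    for _ in range(steps):
        x = x - lr * grad(x)  # x = x - α * ∂f/∂x
        path.append(x)
    return path


# Assumed starting point and learning rate.
path = gradient_descent(x0=3.0, lr=0.1, steps=5)
for i, x in enumerate(path):
    print(f"iteration {i}: x = {x:.5f}, f(x) = {x * x:.5f}")
```

With a learning rate above 1.0 the factor (1 - 2α) has magnitude greater than 1 for this function, and the iterates diverge instead of converging.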
Example of Backpropagation:
•Let's consider training a feedforward neural network for binary classification. The network has one hidden layer with two neurons and an output layer with a single neuron. We'll use a simple dataset of two-dimensional points (x1, x2) and binary labels (0 or 1) for the example. The network's architecture is as follows:
•Input layer: 2 neurons (corresponding to x1
and x2)
•Hidden layer: 2 neurons (with sigmoid
activation)
•Output layer: 1 neuron (with sigmoid
activation)
Steps in Backpropagation:
1.Forward Pass:
   •Input (x1, x2) is fed into the network.
   •Calculate the weighted sum and apply the sigmoid activation in the hidden layer.
   •Calculate the weighted sum and apply the sigmoid activation in the output layer.
2.Loss Calculation:
   •Compute the loss (e.g., cross-entropy) between the predicted output and the actual target label.
3.Backpropagation:
   •Calculate the gradient of the loss with respect to the output layer's weighted sum, and from it the gradients for the output layer's weights and bias.
   •Backpropagate this gradient to the hidden layer and compute gradients for its parameters.
   •Use these gradients to update the weights and biases in both layers using gradient descent.
4.Repeat:
   •Repeat the above steps for a batch of training examples (mini-batch) and iterate through the entire dataset for multiple epochs.
Here's a simplified example of a
single training iteration:
•Forward Pass:
•Input (x1, x2) = (1.0, 0.5)
•Hidden layer:
•Weighted sum: z1 = w1 * x1 + w2 * x2 + b1
•Activation: a1 = sigmoid(z1)
•Similar calculations for neuron 2 in the hidden layer give z2 and a2 = sigmoid(z2).
•Output layer:
•Weighted sum: z3 = w3 * a1 + w4 * a2 + b2
•Activation: a3 = sigmoid(z3)
•Loss Calculation:
•Calculate the cross-entropy loss between the predicted output a3 and the actual label (0 or 1).
•Backpropagation:
•Compute gradients for output layer parameters (e.g.,
w3, w4, b2).
•Propagate gradients backward to the hidden layer,
compute gradients for its parameters (e.g., w1, w2,
b1).
•Update all weights and biases using gradient
descent.
•This process is repeated for multiple training
iterations until the network's parameters
converge, and the loss reaches a satisfactory
minimum.
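One training iteration for this 2-2-1 network can be sketched as follows. All initial parameter values, the label y = 1, and the learning rate are assumptions for illustration. The names w5, w6, and b3 for the second hidden neuron are hypothetical, and the output neuron's values are called z3 and a3 so they do not collide with hidden neuron 2. The sketch relies on two standard identities: the sigmoid derivative is a * (1 - a), and for a sigmoid output with cross-entropy loss the gradient with respect to z3 simplifies to a3 - y.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(p, x1, x2):
    """Forward pass; returns hidden activations a1, a2 and output a3."""
    a1 = sigmoid(p["w1"] * x1 + p["w2"] * x2 + p["b1"])
    a2 = sigmoid(p["w5"] * x1 + p["w6"] * x2 + p["b3"])
    a3 = sigmoid(p["w3"] * a1 + p["w4"] * a2 + p["b2"])
    return a1, a2, a3


def cross_entropy(a3, y):
    """Binary cross-entropy between prediction a3 and label y."""
    return -(y * math.log(a3) + (1 - y) * math.log(1 - a3))


def train_step(p, x1, x2, y, lr):
    """One forward pass, backward pass, and gradient descent update."""
    a1, a2, a3 = forward(p, x1, x2)

    # Output layer: dL/dz3 = a3 - y for sigmoid + cross-entropy.
    d3 = a3 - y
    grads = {"w3": d3 * a1, "w4": d3 * a2, "b2": d3}

    # Hidden layer: chain rule through the output weights and the sigmoid.
    d1 = d3 * p["w3"] * a1 * (1 - a1)
    d2 = d3 * p["w4"] * a2 * (1 - a2)
    grads.update({"w1": d1 * x1, "w2": d1 * x2, "b1": d1,
                  "w5": d2 * x1, "w6": d2 * x2, "b3": d2})

    # Gradient descent update, applied only after all gradients are known.
    new_p = {name: p[name] - lr * grads[name] for name in p}
    return new_p, grads


# Assumed initial parameters (illustrative values only).
params = {"w1": 0.1, "w2": -0.2, "b1": 0.0,
          "w5": 0.4, "w6": 0.3, "b3": 0.0,
          "w3": 0.5, "w4": -0.5, "b2": 0.0}

x1, x2, y = 1.0, 0.5, 1  # input from the example, assumed label
loss_before = cross_entropy(forward(params, x1, x2)[2], y)
new_params, grads = train_step(params, x1, x2, y, lr=0.5)
loss_after = cross_entropy(forward(new_params, x1, x2)[2], y)
print("loss before:", loss_before, "loss after:", loss_after)
```

Computing every gradient before updating any parameter matters: the hidden-layer gradients must use the output weights w3 and w4 as they were during the forward pass.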
•Sigmoid Activation Function: The sigmoid function
is defined as follows:
•σ(x) = 1 / (1 + exp(-x))
•When x is large and positive, σ(x) approaches 1; when x is large and negative, σ(x) approaches 0.
•When x is close to 0, σ(x) is approximately 0.5.
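The behaviour described above, and why it matters for training, can be checked numerically. The sketch below evaluates the sigmoid and its derivative at a few inputs. The derivative formula σ'(x) = σ(x) * (1 - σ(x)) is a standard identity that is not stated above, and the function names and sample inputs are illustrative.

```python
import math


def sigmoid(x):
    """σ(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: σ(x) * (1 - σ(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.6f}  gradient = {sigmoid_grad(x):.6f}")
```

The gradient peaks at 0.25 when x = 0 and falls below 0.0001 at x = ±10. A unit operating in those flat regions is saturated: its output barely changes with its input, so very little gradient flows back through it.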
•Example of Unit Saturation:
•Consider a neural network with a sigmoid activation
function and a weight (w) connected to a neuron. Let's say
that during training, the network encounters an input value
(x) of 10 for this neuron:
•x = 10