Machine Learning Lectures - Optimization - part 2

Machine Learning
Optimization, part 2
Zahra Sadeghi, PhD

Gradient
•The gradient is the most important quantity in our model: it is the value that tells us how we need to modify our parameters.
•We usually look for points where the derivative of our loss function equals 0,
•because when the derivative is 0, we have reached a local optimum.
•The derivative of a function is its slope at a given point; if the slope at that point is 0, then our loss is at a relatively low point, which is good, because it means our model is more accurate.

Optimizer
•Optimizers are functions that are responsible for updating our weights.
•When we find our gradient after running through our network, we need to change our weights accordingly, so that we can try to get closer and closer to the optimal point.
•The simplest way to update the weights uses the gradient descent equation: w → w − η∇L(w), where η is the learning rate and ∇L(w) is the gradient of the loss.

Learning rate
•Once you have a random starting point w = (w₁, …, wᵣ), you update it, or move it to a new position in the direction of the negative gradient:
•w → w − η∇L, where η is a small positive value called the learning rate.
•The learning rate determines how large the update or moving step is.
•If η is too small, then the algorithm might converge very slowly.
•Large η values can also cause issues with convergence or make the algorithm divergent (a sketch of the update rule follows below).
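
A minimal sketch of this update rule in Python; the quadratic cost C(w) = w² and its gradient 2w are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad, start, learning_rate=0.1, n_iter=50):
    """Repeatedly apply w -> w - eta * grad(w)."""
    w = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        w = w - learning_rate * grad(w)  # step in the direction of the negative gradient
    return w

# Example: minimize C(w) = w**2, whose gradient is 2*w.
minimum = gradient_descent(grad=lambda w: 2 * w, start=10.0)
print(minimum)  # close to 0; too small a learning rate converges slowly, too large diverges
```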

Learning rate
•The learning rate, denoted by the symbol α or η, is a hyper-parameter used to govern the pace at which an algorithm updates or learns the values of a parameter estimate.
•It regulates how much of the error is used to update the model’s weights each time they are updated, such as at the end of each batch of training instances.
•Smaller learning rates necessitate more training epochs, because each update changes the weights less.
•Larger learning rates result in faster changes,
•but larger learning rates frequently result in a suboptimal final set of weights.

Epoch
•The epoch count of your model is the number of times your model will run through the training data.
•When training a model on anything requiring iterations, an epoch value is necessary for your model to know how many times it needs to go through and analyze your data in order to perform well.
•Iterations are the steps through partitioned packets (batches) of the training data needed to complete one epoch.
•The number of iterations is therefore equal to the number of batches needed to complete one epoch (a worked example follows below).
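
For instance, a minimal sketch of this arithmetic, with a hypothetical dataset size, batch size, and epoch count:

```python
import math

n_samples = 1000   # hypothetical dataset size
batch_size = 100   # hypothetical batch size
n_epochs = 5

iterations_per_epoch = math.ceil(n_samples / batch_size)  # 10 batches per epoch
total_updates = iterations_per_epoch * n_epochs           # 50 weight updates overall
print(iterations_per_epoch, total_updates)
```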

Activation Function
•Activation functions are functions that create non-linearity in our model.
•They are the main component that solves the problem of creating a model that can adapt to data requiring a non-linear decision function.

Sigmoid and softmax
•Activation functions for the output layer:
•Sigmoid is used for binary classification, where we only have 2 classes, while softmax applies to multiclass problems.
•Sigmoid represents the probability of belonging to class 1 (the probability of belonging to class 2 = 1 − P(class 1)).
•Sigmoid produces independent probabilities (not constrained to sum to one).
•Softmax produces a probability distribution over all predicted classes (the probabilities sum to 1).

Sigmoid vs softmax
•The sigmoid function outputs the probability between 0 and 1 for the positive class, i.e., the most relevant class for which we are predicting.
•The softmax function is a generalization of the sigmoid to additional classes, and it outputs a probability for each class.
•While it is possible to use softmax in binary classification, the sigmoid is most often used in practice because it is less expensive to calculate and simpler to work with than the softmax.

Sigmoid vs softmax
•Logits are the unnormalized scores of the model for P(Y = i | X).
•We apply these functions to the logits to get probabilities (a sketch follows below).
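
A minimal NumPy sketch of both functions applied to hypothetical logits; the shift by the maximum in softmax is a standard numerical-stability detail, not from the slides:

```python
import numpy as np

def sigmoid(z):
    # Maps a logit to a probability in (0, 1) for the positive class.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Maps a vector of logits to a probability distribution that sums to 1.
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])             # hypothetical unnormalized scores
print(sigmoid(logits))                          # independent probabilities, need not sum to 1
print(softmax(logits), softmax(logits).sum())   # a distribution; sums to 1.0
```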

advantages of Sigmoid
•It is differentiable at every point,
•so gradients can be calculated.
•The output values are bounded between 0 and 1.
•It produces continuous output.

Disadvantages of Sigmoid
•It saturates and kills gradients (vanishing gradient):
•for a very high or very low value of x, the derivative of the sigmoid is very low (problematic in the chain rule of backpropagation).
•At both the positive and negative ends, the value of the gradient saturates at 0. For those values, the gradient will be 0 or close to 0, which simply means no learning.
•It is not zero-centered:
•the output of this activation function always lies between 0 and 1, i.e., it is always positive.
•This obstructs the possible update directions of the gradients,
•so it takes a substantially longer time to converge, whereas a zero-centered function helps with fast convergence.
•It is computationally expensive because of the exponential term in it.

Hyperbolic Activation Functions
•Based on the desired output, a data scientist can decide which of these activation functions needs to be used in the Perceptron logic.

Hyperbolic Tangent
•Tanh is a mathematically shifted version of the sigmoid and works better than sigmoid in most cases.
•It has similar advantages to sigmoid but can be considered better due to being zero-centered.
•The output of tanh lies between −1 and 1, hence solving one of the issues with the sigmoid activation function.
•With a larger output space and symmetry around zero, the tanh function leads to more even handling of data, and it is easier to arrive at the global minimum of the loss function.

Hyperbolic Tangent
•It provides output between −1 and +1.
•It is an extension of the logistic sigmoid.
•The advantage of the hyperbolic tangent over the logistic function is that it has a broader output spectrum, ranging over the open interval (−1, 1), which can improve the convergence of the backpropagation algorithm.
•The gradient of the tanh function is much steeper than that of the sigmoid function, making the gradients stronger for tanh than for sigmoid.

Disadvantages of Tanh
•It also faces the problem of vanishing gradients (gradient saturation), similar to the sigmoid activation function,
•but its derivatives are steeper than those of the sigmoid, making the gradients stronger for tanh than for sigmoid.
•As it is very similar to sigmoid, tanh is also computationally expensive.

ReLU (Rectified Linear Unit)
•It gives an output of x if x is positive and 0 otherwise.
•It is differentiable everywhere except at 0.
•It suffers from the dying ReLU problem: when a unit's input is always negative, its output and gradient are 0, so it stops learning.
•ReLU is always going to discard the negative values.

Leaky ReLU
•α is a small constant (normally 0.01).
•Advantages:
•It enjoys similar advantages to ReLU.
•If the input is negative, the gradient will be α. As a result, there will be learning for these units as well.
•Disadvantage:
•The value of α is always a constant hyperparameter; Parametric ReLU instead makes α a learnable parameter.

ELU(Exponential Linear Unit)
•Advantages:
•ELU becomes smooth slowly until its output equals −α, whereas ReLU has a sharp kink at 0. This smoothness plays a beneficial role in optimization and generalization.
•Unlike ReLU, it is differentiable at 0.
•It avoids the dead ReLU problem by introducing an exponential curve for negative values of the input.
•Disadvantages:
•It is computationally expensive due to the presence of the exponential term.
•α is a hyperparameter (a sketch of ReLU, Leaky ReLU, and ELU follows below).
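
A minimal NumPy sketch of the three variants just discussed, with α as the small constant / hyperparameter mentioned on the slides:

```python
import numpy as np

def relu(x):
    # x if x > 0, else 0: negative inputs are discarded (dying ReLU risk).
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Negative inputs get a small slope alpha, so their gradient is alpha, not 0.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly approaches -alpha for large negative x (exponential, so costlier).
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```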

Regularization
•It is often convenient to explicitly introduce a regularization term Ω, which maps the model to a real number Ω ∈ R. This term is usually used for penalizing the complexity of the model in the optimization.
•In practice, the family of functions chosen for the optimization can be parameterized by a parameter vector Θ, which allows the minimization to be defined as an exploration in the parameter space:
•Θ* = argmin_Θ [ L(Θ) + λ Ω(Θ) ]

•Regularization is a technique used in machine learning to prevent
overfitting.
•Overfitting happens when a model learns the training data too well,
including the noise and outliers, which causes it to perform poorly on
new data.
•In simple terms, regularization adds a penalty to the model for being
too complex, encouraging it to stay simpler and more general. This
way, it’s less likely to make extreme predictions based on the noise in
the data.
•Regularization techniques help prevent overfitting by imposing constraints on the model’s parameters.

•This technique helps in several ways:
•Complexity Control: Regularization reduces model complexity, preventing overfitting and enhancing generalization to new data.
•Balancing Bias and Variance: Regularization helps manage the trade-off between model bias (underfitting) and variance (overfitting), leading to better performance.
•Feature Selection: Methods like L1 regularization (Lasso) encourage sparse solutions, automatically selecting important features while excluding irrelevant ones.
•Handling Multicollinearity: Regularization stabilizes models by reducing sensitivity to small changes when features are highly correlated.
•Generalization: Regularized models focus on underlying patterns in the data, ensuring better generalization rather than memorization.
•In simple terms, regularization works like a guide that keeps the model from getting distracted by
small, irrelevant details in the training data. By making the model simpler, regularization improves
its ability to perform well on new, unseen data, rather than just memorizing the training data.

•Regularization is a process that converts the answer of a problem to a simpler one.
•L1 regularization (also called LASSO) leads to sparse models by adding a penalty based on the absolute value of the coefficients.
•L2 regularization (also called ridge regression) encourages smaller, more evenly distributed weights by adding a penalty based on the square of the coefficients (a sketch of the two penalties follows below).
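
A minimal NumPy sketch of the difference, with a hypothetical mean-squared-error loss and λ as the regularization strength:

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, kind="l2"):
    """Mean-squared error plus an L1 (lasso) or L2 (ridge) penalty on the weights."""
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))   # absolute value of the coefficients
    else:
        penalty = lam * np.sum(w ** 2)      # square of the coefficients
    return mse + penalty

rng = np.random.default_rng(0)
X, w, y = rng.normal(size=(50, 3)), np.array([1.0, 0.0, -2.0]), rng.normal(size=50)
print(regularized_loss(w, X, y, kind="l1"), regularized_loss(w, X, y, kind="l2"))
```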

Regularization for regression
•1. Lasso Regularization – (L1 Regularization)
•2. Ridge Regularization – (L2 Regularization)
•3. Elastic Net Regularization – (L1 and L2 Regularization)

LASSO
•A regression model which uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression.
•Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).
•Lasso regression also helps us achieve feature selection by penalizing the weights toward approximately zero if a feature does not serve any purpose in the model.

Ridge
•A regression model that uses the L2 regularization technique is called Ridge regression.
•Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).

Elastic Net
•Elastic Net regression is a combination of both L1 and L2 regularization.
•That implies that we add the absolute norm of the weights as well as the squared measure of the weights,
•with the help of an extra hyperparameter α that controls the ratio of the L1 and L2 regularization:
•α = 1 corresponds to Lasso (L1) regularization.
•α = 0 corresponds to Ridge (L2) regularization.
•Values between 0 and 1 provide a balance of both L1 and L2 regularization (a scikit-learn sketch follows below).
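
A minimal scikit-learn sketch of the three estimators. The data and hyperparameter values are illustrative; note that in scikit-learn's ElasticNet the mixing ratio this slide calls α is named l1_ratio, while its own alpha argument is the overall penalty strength:

```python
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: sparse coefficients
ridge = Ridge(alpha=0.1).fit(X, y)                    # L2: small, even coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
print(lasso.coef_, ridge.coef_, enet.coef_, sep="\n")
```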

Regularization for Neural Networks
•Weight decay is a regularization technique that operates by subtracting a fraction of the previous weights when updating the weights during training, effectively making the weights smaller over time.
•Unlike L2 regularization, which adds a penalty term to the loss function, weight decay directly influences the weight update step itself:
•w ← w − η∇L(w) − λw,
•where λ is a value determining the strength of the penalty (encouraging smaller weights).
•This subtraction of a portion of the existing weights ensures that during each iteration of training, the model’s parameters are nudged towards smaller values.

Weight decay
•In the context of SGD, weight decay and L2 regularization are equivalent. This equivalence arises from the fact that the gradient of the L2 regularization term leads to the same parameter update as the one applied in weight decay. Therefore, in vanilla SGD, using either weight decay or L2 regularization will achieve the same result in terms of weight updates; this is exactly why the two terms have been used synonymously (a sketch follows below).
•The paper by Loshchilov and Hutter (link to PDF) conducted several experiments showing that weight decay performs significantly better than L2 regularization when used with Adam and other adaptive optimizers. Consequently, researchers and practitioners now commonly use a modified version of Adam called AdamW, which incorporates weight decay directly into the optimizer.
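
A minimal NumPy sketch of this equivalence in vanilla SGD, writing the L2 term as (λ/2)·‖w‖² so that its gradient is λw (the numbers are hypothetical):

```python
import numpy as np

def sgd_l2_step(w, grad_loss, lr=0.01, lam=0.001):
    # L2 regularization: the gradient of (lam/2) * ||w||^2 is lam * w,
    # so it enters through the gradient of the total loss.
    return w - lr * (grad_loss + lam * w)

def sgd_weight_decay_step(w, grad_loss, lr=0.01, lam=0.001):
    # Weight decay: subtract a fraction of the current weights directly.
    return w - lr * grad_loss - lr * lam * w

w = np.array([1.0, -2.0, 3.0])
g = np.array([0.1, 0.2, -0.3])  # hypothetical gradient of the data loss
print(np.allclose(sgd_l2_step(w, g), sgd_weight_decay_step(w, g)))  # True in vanilla SGD
```

With adaptive optimizers such as Adam the two updates no longer coincide, which is what AdamW addresses; in PyTorch, torch.optim.AdamW takes a weight_decay argument for this purpose.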

DropOut
•In the context of neural networks, the Dropout technique repeatedly ignores random subsets of neurons during training, which simulates the training of multiple neural network architectures at once to improve generalization (a sketch follows below).
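
A minimal sketch of (inverted) dropout during training; the rescaling by 1/(1 − p) keeps the expected activation unchanged and is a standard implementation detail not stated on the slide:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Randomly zero each unit with probability p during training."""
    if not training:
        return activations                  # at test time all units are kept
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)   # rescale so the expected value is unchanged

h = np.ones((2, 5))                         # hypothetical layer activations
print(dropout(h, p=0.5))
```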

Batch Normalization
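
A minimal NumPy sketch of the batch normalization forward pass, assuming the standard formulation (normalize each feature over the batch, then apply a learned scale γ and shift β; the toy batch and ε are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(32, 4)                   # hypothetical batch of 32 samples, 4 features
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```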

Best Practices for Fine-Tuning
•Gradual Unfreezing: Rather than fine-tuning all layers at once, gradually unfreeze the pre-trained layers starting from the top. This can lead to better performance and less risk of catastrophic forgetting (see the sketch after this list).
•Learning Rate Scheduling: Implement learning rate scheduling to decrease the learning rate gradually as training progresses. This approach can help in achieving better fine-tuning results.
•Data Augmentation: For tasks such as image classification, data augmentation techniques (e.g., cropping,
rotation) can effectively increase the size of your training dataset, helping the model generalize better.
•Regularization Techniques: Utilize regularization techniques like dropout and weight decay to prevent
overfitting, especially when your task-specific dataset is relatively small.
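
A minimal PyTorch sketch of two of these practices, freezing a "pre-trained" layer and scheduling the learning rate (the model and hyperparameter values are hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical model: a frozen pre-trained layer plus a new task head.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

for param in model[0].parameters():   # freeze the "pre-trained" layer first;
    param.requires_grad = False       # unfreeze it gradually later in training

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# Decrease the learning rate by 10x every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(10):
    # ... training loop over batches would go here ...
    scheduler.step()                  # update the learning rate each epoch
```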

PEFT
•Parameter-efficient fine-tuning (PEFT) is a language model fine-tuning technique specifically designed to update only a small portion of the model’s parameters instead of all of the parameters. It alleviates the computational problem and the catastrophic forgetting problem that full fine-tuning has.
•The most famous technique within PEFT is LoRA (Low-Rank Adaptation).
•It is a method for adapting a pre-trained model by injecting low-rank matrices into the model’s layers to modify certain parts’ behavior while keeping the original parameters frozen (a sketch follows below).
•This technique is valuable and has been proven to adapt the pre-trained model effectively.
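
A minimal PyTorch sketch of the LoRA idea: the pre-trained weight matrix stays frozen while a trainable low-rank product B·A adjusts the layer's output (the layer size, rank r, and scaling α are hypothetical choices):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (W + B @ A)."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():   # original parameters stay frozen
            p.requires_grad = False
        out_f, in_f = linear.weight.shape
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_f, r))        # B starts at zero: no change at init
        self.scale = alpha / r

    def forward(self, x):
        # Frozen layer output plus the low-rank adjustment x @ (B @ A)^T.
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=8)  # hypothetical 512-dim layer
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```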

Instruction Tuning
•Instruction tuning is a fine-tuning technique that trains the pre-trained model to follow natural language directions for various tasks.
•Instruction tuning usually does not focus on specific tasks; instead, it uses a dataset that includes diverse tasks formatted as instructions with the expected output (an example record follows below).
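
For example, a single record in such a dataset might look like the following (the fields and contents are hypothetical, sketching the instruction/expected-output format described above):

```python
# One hypothetical instruction-tuning record: a task phrased as a natural
# language instruction, an optional input, and the expected output.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Gradient descent updates parameters in the direction of the negative gradient...",
    "output": "Gradient descent iteratively moves the parameters downhill on the loss.",
}
```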