Gradient Descent (or Ascent) is used to find the optimal parameters that minimize (or maximize) a loss function.
Slide Content
Gradient Descent
Dr. M. Ramesh, Prof. & HOD, CSE - Cyber Security
The Idea Behind Gradient Descent
Purpose: Gradient Descent is used to minimize a function, typically the loss or cost function of a machine learning model. The goal is to find the optimal parameters (e.g., the weights of a neural network) that minimize the loss.
Key Insight: The direction of steepest descent (the negative gradient) tells us how to update the parameters to reduce the loss. By iteratively adjusting the parameters in small steps, we eventually reach a minimum of the loss function.
Example: Consider a simple linear regression problem with a loss function that measures the difference between predicted and actual values (Mean Squared Error). Gradient descent adjusts the line's slope and intercept until this loss is minimized, as sketched below.
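A minimal sketch of this linear regression example, assuming plain NumPy and a tiny synthetic dataset (the data, names, and constants here are illustrative, not taken from the slides):

import numpy as np

# Toy data: y is roughly 3*x + 2 plus noise (purely illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, size=50)

slope, intercept = 0.0, 0.0   # parameters to be learned
lr = 0.02                     # learning rate (step size)

for step in range(1000):
    pred = slope * x + intercept
    error = pred - y
    # Gradients of the Mean Squared Error with respect to each parameter.
    grad_slope = 2.0 * np.mean(error * x)
    grad_intercept = 2.0 * np.mean(error)
    # Move each parameter a small step in the direction of steepest descent.
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(slope, intercept)   # ends up close to the true values 3 and 2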
What Is the Gradient?
The gradient is the vector of partial derivatives of the loss with respect to the parameters. It points in the direction of the steepest ascent of the loss function; its negative points in the direction of the steepest descent.
Estimating the Gradient
In simple cases (e.g., linear regression), the gradient can be computed analytically in closed form. In more complex models (e.g., neural networks), backpropagation is used to compute the gradients. With large datasets, mini-batch or stochastic techniques estimate the gradient on small subsets of the data.
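As a sanity check, an analytic gradient can be compared against a finite-difference estimate; the sketch below assumes a linear model with an MSE loss, and the helper names are invented for illustration:

import numpy as np

def loss(w, X, y):
    # Mean squared error of the linear model X @ w (illustrative).
    return np.mean((X @ w - y) ** 2)

def numerical_gradient(f, w, eps=1e-6):
    # Central finite differences: perturb one parameter at a time.
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

analytic = 2.0 * X.T @ (X @ w - y) / len(y)               # closed-form MSE gradient
numeric = numerical_gradient(lambda v: loss(v, X, y), w)
print(np.allclose(analytic, numeric, atol=1e-4))          # the two estimates agree

In a neural network, backpropagation plays the role of the closed-form expression; the finite-difference version is far too slow for anything beyond a sanity check.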
Using the Gradient
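The core operation here is the update rule itself: move every parameter a small step against its gradient. A generic sketch on an illustrative quadratic objective (all names are made up):

def gradient_step(params, gradients, learning_rate=0.01):
    # One gradient-descent update: new_param = param - learning_rate * gradient.
    return [p - learning_rate * g for p, g in zip(params, gradients)]

# Example: minimize f(x, y) = x**2 + y**2, whose gradient is (2x, 2y).
point = [3.0, -4.0]
for _ in range(100):
    grads = [2 * point[0], 2 * point[1]]
    point = gradient_step(point, grads, learning_rate=0.1)
print(point)   # both coordinates approach 0, the minimizer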
Choosing the Right Step Size (Learning Rate)
Too Large: If the step size (learning rate) is too large, the algorithm may overshoot the minimum and diverge (the loss increases).
Too Small: If the step size is too small, convergence is slow and many iterations are needed to reach the minimum.
Optimal Step Size: Selecting the right learning rate is crucial for efficient training. Learning rate schedules (reducing the learning rate over time) or adaptive methods (e.g., Adam, RMSprop) can help. The toy run below illustrates both failure modes.
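A toy run that shows both failure modes on f(x) = x**2, whose gradient is 2x (the step sizes are chosen only for illustration):

def run(learning_rate, steps=20):
    # Gradient descent on f(x) = x**2 starting from x = 1.0.
    x = 1.0
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

print(run(0.1))     # reasonable step: x shrinks steadily toward 0
print(run(1.1))     # too large: |x| grows every step, the iteration diverges
print(run(0.001))   # too small: after 20 steps x has barely moved from 1.0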
Heuristics: Use cross-validation to experiment with different learning rates. Start with a higher learning rate and gradually reduce it (learning rate annealing).
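One simple annealing choice is inverse-time decay; the constants below are made up for illustration and would normally be tuned:

def annealed_learning_rate(initial_lr, step, decay_rate=0.01):
    # Inverse-time decay: start high, shrink as training progresses.
    return initial_lr / (1.0 + decay_rate * step)

for step in (0, 100, 1000, 10000):
    print(step, annealed_learning_rate(0.1, step))
# 0.1 at the start, 0.05 after 100 steps, roughly 0.001 after 10,000 steps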
Using Gradient Descent to Fit Models
Example 1: Linear Regression. Loss function: Mean Squared Error (MSE). Gradient descent adjusts the slope and intercept of the regression line until the error between predicted and actual values is minimized.
Example 2: Logistic Regression. Loss function: Binary Cross-Entropy. Gradient descent finds the optimal decision boundary by adjusting the weights to minimize classification error.
Example 3: Neural Networks. Loss functions: Cross-Entropy for classification, Mean Squared Error for regression tasks. Using backpropagation, the gradients are propagated backward through the network to update the weights in each layer.
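A sketch of Example 2: the same descent loop fits logistic regression once MSE is replaced by binary cross-entropy, whose gradient with a sigmoid output simplifies to X.T @ (p - y) / n. The toy data and labels are invented for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

w = np.zeros(2)
b = 0.0
lr = 0.5

for _ in range(500):
    p = sigmoid(X @ w + b)              # predicted probabilities
    # Gradient of the mean binary cross-entropy loss.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

preds = sigmoid(X @ w + b) > 0.5
print(w, b, np.mean(preds == y.astype(bool)))   # accuracy close to 1.0 on this toy set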
Mini-Batch Gradient Descent
Instead of using the entire dataset, mini-batch gradient descent computes the gradient on small batches (subsets) of the data.
Advantages: Computationally efficient; strikes a balance between the accuracy of batch gradient descent and the noisy updates of SGD. Common batch sizes: 32, 64, 128.
Use Case: Widely used in deep learning frameworks (e.g., TensorFlow, PyTorch) because it optimizes memory usage and allows faster computation on GPUs.
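A sketch of the mini-batch loop with plain NumPy and a batch size of 32, using a linear model so the gradient stays simple (the dataset and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + rng.normal(0.0, 0.1, size=1000)

w = np.zeros(5)
lr, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    order = rng.permutation(len(X))               # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)   # MSE gradient on this batch only
        w -= lr * grad

print(w)   # ends up close to true_w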
Stochastic Gradient Descent (SGD)
In each iteration, SGD computes the gradient using a single randomly chosen data point.
Advantages: Very fast, since it processes only one example per iteration; the noise in the gradient updates can help the algorithm escape local minima and saddle points.
Disadvantages: Noisy updates can cause oscillation around the minimum rather than exact convergence.
Use Case: Suitable for large-scale problems where using the entire dataset for each update is computationally expensive.
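The same kind of problem updated one example at a time; each step uses the gradient of a single data point, so the path is noisy (toy data again, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=1000)

w = np.zeros(3)
lr = 0.01

for epoch in range(5):
    for i in rng.permutation(len(X)):    # visit the examples in random order
        xi, yi = X[i], y[i]
        grad = 2.0 * (xi @ w - yi) * xi  # gradient from a single example
        w -= lr * grad

print(w)   # a noisy path, but it ends near [2, -1, 0.5]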
Comparison of Gradient Descent Methods
Batch Gradient Descent: Uses the entire dataset for each update; the most accurate gradient, but computationally intensive.
Mini-Batch Gradient Descent: Trades off stability against computational efficiency by using small batches; very popular in practice.
Stochastic Gradient Descent (SGD): Fast, noisy updates; useful for large datasets and online learning.
Takeaways
Gradient Descent is a versatile and widely used optimization algorithm for training machine learning models. Proper tuning of the learning rate and choosing the right variant (batch, mini-batch, or stochastic) are key to efficient and effective optimization.