Introduction

Gradient descent is an algorithm for finding the nearest local minimum of a function, and it presupposes that the gradient of the function can be computed. The method of steepest descent, also called the gradient descent method, starts at a point P(0) and, as many times as needed, moves from point P(i) to P(i+1) by minimizing along the line extending from P(i) in the direction of -∇f(P(i)), the local downhill gradient.
[diagram]
A drawback in the method

This method has the severe drawback of requiring a great many iterations for functions which have long, narrow valley structures. In such cases, a conjugate gradient method is preferable. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of an approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
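The update rule is short enough to state directly in code. The sketch below is only illustrative: it assumes a fixed step size rather than an exact minimization along the line, and the names gradient_descent, grad and step are placeholders, not part of the original text.

    # Minimal sketch of gradient descent with a fixed step size (an assumption;
    # the text describes minimizing along the line, which is not done here).

    def gradient_descent(grad, p0, step=0.1, iters=100):
        """Move from P(i) to P(i+1) by stepping along the negative gradient."""
        p = p0
        for _ in range(iters):
            p = p - step * grad(p)   # flip this sign to get gradient ascent
        return p

    # Example: minimize f(x) = x**2, whose gradient is 2*x.
    print(gradient_descent(lambda x: 2 * x, p0=10.0))   # close to 0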
A good and a bad example
A good example: In the above plot you can see the function to be minimized and the points at each iteration of the gradient descent. If you increase the step size too much, the method will not converge.
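A small sketch of what such a plot shows, assuming the example function is f(x) = x**2 as used later in the text; the step sizes 0.1 and 1.1 are illustrative values, not taken from the plot.

    # Iterates of gradient descent on f(x) = x**2 (gradient 2*x) for two
    # illustrative step sizes: one that converges and one that does not.

    def iterates(step, x0=10.0, n=10):
        xs = [x0]
        for _ in range(n):
            xs.append(xs[-1] - step * 2 * xs[-1])   # descent update
        return xs

    print(iterates(step=0.1))   # values shrink towards the minimum at 0
    print(iterates(step=1.1))   # values grow and flip sign: no convergence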
The bad example: There is a chronic problem with gradient descent for functions that have long, narrow valleys (in the case of descent) or saddle points (in the case of ascent). The gradient descent/ascent algorithm zig-zags, because the gradient is nearly orthogonal to the direction that leads to the minimum.
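A minimal sketch of that zig-zag, assuming an illustrative valley-shaped quadratic f(x, y) = x**2 + 10*y**2 (this function and the starting point are assumptions, not taken from the text): the gradient points mostly across the valley, so successive iterates bounce from one wall to the other while creeping slowly along the valley floor.

    # Zig-zag of gradient descent in a long, narrow valley.
    # f(x, y) = x**2 + 10*y**2 has its valley along the x axis.

    def grad(p):
        x, y = p
        return (2 * x, 20 * y)

    p = (10.0, 1.0)
    step = 0.09
    for i in range(10):
        gx, gy = grad(p)
        p = (p[0] - step * gx, p[1] - step * gy)
        print(i, p)   # the y coordinate flips sign each step: a zig-zag path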
The ugly: Imagine the ugliest example you can think of. Draw it in your notebook. Compare it to the guy next to you. Ugliest example wins.
Estimating step size

A wrong step size may prevent the method from converging, so careful selection of the step size is important: too large and it will diverge, too small and it will take a long time to converge. One option is to choose a fixed step size that will assure convergence wherever you start gradient descent. Another option is to choose a different step size at each iteration (an adaptive step size).
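One common adaptive strategy is a backtracking line search; the text does not name a specific scheme, so the following is only an assumed illustration: start from a generous step and halve it until the function value decreases enough.

    # Sketch of an adaptive step size via backtracking (Armijo) line search.
    # All names and constants here are illustrative assumptions.

    def backtracking_step(f, g, p, step=1.0, c=0.5):
        # Halve the step until the sufficient-decrease test passes.
        while f(p - step * g) > f(p) - c * step * g * g:
            step *= 0.5
            if step < 1e-12:
                break
        return step

    f = lambda x: x ** 2
    p = 10.0
    for _ in range(20):
        g = 2 * p                                   # gradient of x**2
        p = p - backtracking_step(f, g, p) * g
    print(p)   # near the minimum at 0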
Maximum step size for convergence

Any differentiable function has a maximum derivative value, the maximum of the (absolute) derivatives over all points. If this maximum is not infinite, that value is known as the Lipschitz constant and the function is Lipschitz continuous. This constant is important because it says that, for a given function, the derivative at any point is no larger in magnitude than the Lipschitz constant; equivalently,

    ||f(x) - f(y)|| <= L(f) ||x - y||,  for any x, y.
The same can be said for the gradient of the function: if the maximum second derivative is finite, the function has a Lipschitz continuous gradient, and that value is the Lipschitz constant of the gradient. For the f(x) = x^2 example, the derivative is df(x)/dx = 2x, which is unbounded, and therefore the function is not Lipschitz continuous. But the second derivative is d^2f(x)/dx^2 = 2, and the function has a Lipschitz continuous gradient with Lipschitz constant 2.
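The Lipschitz constant of the gradient is what limits the usable step size. As a hedged illustration of the standard rule (stated here as background, not quoted from the text): when the gradient is Lipschitz continuous with constant L, a fixed step size of 1/L is guaranteed to decrease the function at every iteration.

    # Sketch: choosing the fixed step size from the Lipschitz constant of the
    # gradient. For f(x) = x**2 the second derivative is 2, so L = 2.

    L = 2.0
    step = 1.0 / L          # the usual "safe" fixed step size

    x = 10.0
    for _ in range(5):
        x = x - step * 2 * x    # gradient of x**2 is 2*x
    print(x)   # 0.0: for this quadratic, a step of 1/L lands on the minimum at once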