Md. Mehedi Hasan Rasel
Mathematics Discipline, Khulna University, Bangladesh
GRADIENT DESCENT ALGORITHM
FEED FORWARD NEURAL NETWORK
BACK PROPAGATION ALGORITHM
Content
- Brief discussion of Machine Learning, Artificial Intelligence and Deep Learning
- Gradient Descent Algorithm
- Feed Forward Neural Network
- Back Propagation Algorithm
Neural Network
MACHINE LEARNING
MACHINE LEARNING (ML) Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions.
ARTIFICIAL INTELLIGENCE
ARTIFICIAL INTELLIGENCE Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. AI is also defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".
DEEP LEARNING
DEEP LEARNING Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, machine vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to, and in some cases surpassing, human expert performance.
GRADIENT DESCENT ALGORITHM Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. It is an optimization technique used to improve deep learning and neural-network-based models by minimizing the cost function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If we instead take steps proportional to the positive of the gradient, we approach a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is generally attributed to Cauchy, who first suggested it in 1847, but its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.
Local minima, global minima, local maxima, global maxima
An analogy for understanding gradient descent The basic intuition behind gradient descent can be illustrated by a hypothetical scenario. A person is stuck in the mountains and is trying to get down (i.e. trying to find the global minimum). There is heavy fog, so visibility is extremely low. The path down the mountain is therefore not visible, so they must use local information to find the minimum. They can use the method of gradient descent: looking at the steepness of the hill at their current position, then proceeding in the direction of steepest descent (i.e. downhill).
An analogy for understanding gradient descent In this analogy, the person represents the algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the slope of the error surface at that point. The instrument used to measure steepness is differentiation. The direction they choose to travel in aligns with the gradient of the error surface at that point. The amount of time they travel before taking another measurement is the learning rate of the algorithm.
ANOTHER ANALOGY An analogy can be drawn with a steep mountain whose base touches the sea. We assume a person's goal is to reach sea level. Ideally, the person would take one step at a time toward that goal. Each step has a gradient in the negative direction (note: the steps can be of different magnitude). The person continues hiking down until reaching the bottom, or a threshold point beyond which it is not possible to go further down.
GRADIENT DESCENT ALGORITHM For a differentiable function F, pick a starting point x_0 and repeat the update x_{n+1} = x_n - γ∇F(x_n), where the step size γ > 0 is the learning rate; each step moves against the gradient at the current point.
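A minimal sketch of this procedure in Python (the target function f(x) = (x - 3)^2, starting point, learning rate and step count are illustrative assumptions, not from the slides):

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Take steps proportional to the negative gradient at the current point."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # converges toward the local (here also global) minimum x = 3
```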
Illustration of gradient descent on an example Consider a nonlinear system of equations. The figure shows the first 80 iterations of gradient descent applied to this example; the arrows show the direction of descent. Due to a small and constant step size, the convergence is slow.
APPLICATION
- Gradient descent can be used to solve a system of linear equations (see the sketch below).
- It can also be used to solve a system of nonlinear equations.
- It works in spaces of any number of dimensions, even infinite-dimensional ones.
- It can be combined with a line search.
- Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques can be better alternatives.
- Gradient descent can be viewed as applying Euler's method for solving ordinary differential equations to a gradient flow.
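As an illustration of the first point, here is a sketch that solves a small linear system A x = b by minimizing the least-squares cost 0.5 ||A x - b||^2, whose gradient is A^T (A x - b). The particular A, b, step size and iteration count are assumptions made for the example:

```python
import numpy as np

# Solve the system 3x + y = 9, x + 2y = 8 (exact solution x = 2, y = 3).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.zeros(2)
learning_rate = 0.1          # must be small enough for convergence
for _ in range(500):
    x -= learning_rate * A.T @ (A @ x - b)   # step along the negative gradient

print(x)  # approaches the exact solution [2, 3]
```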
FEED FORWARD NEURAL NETWORK A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from its descendant: recurrent neural networks. The feedforward neural network was the first and simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward: from the input nodes, through the hidden nodes (if any), to the output nodes. Deep feedforward networks, also often called feedforward neural networks or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*.
FEED FORWARD NEURAL NETWORK These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.
FEED FORWARD NEURAL NETWORK The inspiration behind neural networks is our brains, so let's look at the biological aspect of neural networks.
FEED FORWARD NEURAL NETWORK Visualise the two images in Fig 1: the left image shows how a multilayer neural network identifies different objects by learning different characteristics of an object at each layer; for example, at the first hidden layer edges are detected, and at the second hidden layer corners and contours are identified. Similarly, our brain has different regions for the same purpose: as we can see, the region denoted by V1 identifies edges, corners, etc.
SINGLE LAYER PERCEPTRON The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node; if the value is above some threshold (typically 0), the neuron fires and takes the activated value (typically 1), otherwise it takes the deactivated value (typically -1). Neurons with this kind of activation function are also called artificial neurons or linear threshold units. A perceptron can be created using any values for the activated and deactivated states, as long as the threshold value lies between the two.
SINGLE LAYER PERCEPTRON Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between the calculated output and the sample output data, and uses this to create an adjustment to the weights, thus implementing a form of gradient descent (see the sketch below). Single-layer perceptrons are only capable of learning linearly separable patterns. In 1969, in a famous monograph entitled Perceptrons, Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function (nonetheless, it was known that multi-layer perceptrons are capable of producing any possible Boolean function).
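A minimal sketch of a linear threshold unit trained with the delta rule (the AND task, learning rate and number of passes are illustrative assumptions, not from the slides):

```python
import numpy as np

# A linear threshold unit: fires +1 if the weighted sum exceeds 0, else -1.
def predict(w, x):
    return 1 if w @ x > 0 else -1

# Learn the (linearly separable) logical AND; each input starts with a
# constant 1 so the threshold is learned as a bias weight.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
y = np.array([-1, -1, -1, 1])

w = np.zeros(3)
learning_rate = 0.1
for _ in range(20):                      # a few passes over the data suffice
    for x_i, y_i in zip(X, y):
        error = y_i - predict(w, x_i)    # delta rule: adjust along the error
        w += learning_rate * error * x_i

print([predict(w, x_i) for x_i in X])    # [-1, -1, -1, 1]
```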
SINGLE LAYER PERCEPTRON A single-layer neural network can compute a continuous output instead of a step function. A common choice is the so-called logistic function; with this choice, the single-layer network is identical to the logistic regression model, widely used in statistical modeling. If the single-layer network's activation function is modulo 1, then it can solve the XOR problem with exactly one neuron.
MULTILAYER PERCEPTRON This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function. The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds for a wide range of activation functions, e.g. for the sigmoidal functions. Multi-layer networks use a variety of learning techniques, the most popular being back-propagation.
OTHER FEED FORWARD NETWORKS More generally, any directed acyclic graph may be used for a feedforward network, with some nodes (with no parents) designated as inputs, and some nodes (with no children) designated as outputs. These can be viewed as multilayer networks where some edges skip layers, either counting layers backwards from the outputs or forwards from the inputs. Various activation functions can be used, and there can be relations between weights, as in convolutional neural networks. Examples of other feedforward networks include radial basis function networks, which use a different activation function. Sometimes multi-layer perceptron is used loosely to refer to any feedforward neural network, while in other cases it is restricted to specific ones (e.g., with specific activation functions, or with fully connected layers, or trained by the perceptron algorithm).
MULTILAYER PERCEPTRON A two-layer neural network is capable of calculating XOR. The numbers within the neurons represent each neuron's explicit threshold (which can be factored out so that all neurons have the same threshold, usually 1). The numbers that annotate arrows represent the weights of the inputs. This net assumes that if the threshold is not reached, zero (not -1) is output. Note that the bottom layer of inputs is not always considered a real neural network layer. A plain-Python version of one such construction follows below.
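The sketch below shows one standard textbook construction of a two-layer XOR net with threshold units; the particular weights and thresholds (weights 1 and 1 into each hidden unit, hidden thresholds 1 and 2, output weights 1 and -2 with threshold 1) are assumed here, since the slide's figure is not reproduced:

```python
def step(weighted_sum, threshold):
    # Threshold unit as described above: outputs 1 if the weighted sum
    # reaches the threshold, otherwise 0 (not -1).
    return 1 if weighted_sum >= threshold else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, threshold=1)        # fires if at least one input is 1 (OR)
    h2 = step(x1 + x2, threshold=2)        # fires only if both inputs are 1 (AND)
    return step(h1 - 2 * h2, threshold=1)  # OR and not AND, i.e. XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))       # prints 0, 1, 1, 0
```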
BACK PROPAGATION ALGORITHM The backpropagation algorithm is probably the most fundamental building block of a neural network. It was first introduced in the 1960s and, roughly 30 years later, popularized by Rumelhart, Hinton and Williams in a 1986 paper called "Learning representations by back-propagating errors". The algorithm is used to train a neural network effectively via the chain rule of calculus. In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model's parameters (weights and biases).
BACK PROPAGATION ALGORITHM The output values are compared with the correct answer to compute the value of some predefined error function. By various techniques, the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount. After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small. To adjust the weights properly, we can apply a general method for non-linear optimization called gradient descent.
BACK PROPAGATION For this, the network calculates the derivative of the error function with respect to the network weights and changes the weights such that the error decreases (thus going downhill on the surface of the error function). For this reason, backpropagation can only be applied to networks with differentiable activation functions.
Why We Need Backpropagation? The most prominent advantages of backpropagation are:
- It is fast, simple and easy to program.
- It has no parameters to tune apart from the number of inputs.
- It is a flexible method, as it does not require prior knowledge about the network.
- It is a standard method that generally works well.
- It does not need any special mention of the features of the function to be learned.
BACK PROPAGATION ALGORITHM EXAMPLE Define the neural network model: the 4-layer neural network consists of 4 neurons for the input layer, 4 neurons across the two hidden layers (2 in each, matching the shapes used below) and 1 neuron for the output layer.
EXAMPLE CONTINUED INPUT LAYER: The neurons, colored in purple, represent the input data. These can be as simple as scalars or more complex, like vectors or multidimensional matrices. The first set of activations (a) is equal to the input values. NB: "activation" is a neuron's value after applying an activation function.
EXAMPLE CONTINUED HIDDEN LAYER: The final values at the hidden neurons, colored in green, are computed using z^l, the weighted inputs in layer l, and a^l, the activations in layer l.
EXAMPLE CONTINUED For layers 2 and 3 the equations are: z² = W¹a¹ + b¹ and a² = f(z²); z³ = W²a² + b² and a³ = f(z³), where the first activations a¹ are the inputs x.
EXAMPLE CONTINUED Here W¹ and W² are the weights used to compute layers 2 and 3, while b¹ and b² are the corresponding biases. Activations a² and a³ are computed by applying an activation function f. Typically, f is non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex patterns in data. All parameter values are combined into matrices, grouped by layers. Let's pick layer 2 and its parameters as an example; the same operations can be applied to any layer in the network. W¹ is a weight matrix of shape (n, m), where n is the number of output neurons (neurons in the next layer) and m is the number of input neurons (neurons in the previous layer). For us, n = 2 and m = 4.
EXAMPLE CONTINUED NB: The first number in any weight's subscript matches the index of the neuron in the next layer (in our case this is the Hidden_1 layer) and the second number matches the index of the neuron in the previous layer (in our case this is the Input layer).
EXAMPLE CONTINUED x is the input vector of shape (m, 1), where m is the number of input neurons; for us, m = 4. b¹ is a bias vector of shape (n, 1), where n is the number of neurons in the current layer; for us, n = 2.
EXAMPLE CONTINUED Using the above definitions of W¹, x and b¹, we can derive the equation for z²: z² = W¹x + b¹, a (2, 1) vector whose j-th entry is (z_j)² = (w_j1)¹x_1 + (w_j2)¹x_2 + (w_j3)¹x_3 + (w_j4)¹x_4 + (b_j)¹.
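A small NumPy sketch of this computation with the stated shapes (the random values, and sigmoid as the choice of f, are placeholder assumptions):

```python
import numpy as np

# Shapes from the example: 4 input neurons (m = 4), 2 hidden neurons (n = 2).
W1 = np.random.randn(2, 4)   # weight matrix of shape (n, m)
x = np.random.randn(4, 1)    # input vector of shape (m, 1)
b1 = np.random.randn(2, 1)   # bias vector of shape (n, 1)

z2 = W1 @ x + b1             # weighted input of the first hidden layer
a2 = 1 / (1 + np.exp(-z2))   # activation, here using the sigmoid as f
print(z2.shape, a2.shape)    # (2, 1) (2, 1)
```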
EXAMPLE CONTINUED Now carefully observe the neural network illustration from above.
EXAMPLE CONTINUED OUTPUT LAYER The final part of a neural network is the output layer, which produces the predicted value. In our simple example, it is presented as a single neuron, colored in blue, and evaluated following the same pattern as the hidden layers: z⁴ = W³a³ + b³ and s = f(z⁴).
EXAMPLE CONTINUED Again, we are using the matrix representation to simplify the equation; the techniques above can be used to understand the underlying logic. Forward propagation and evaluation: the equations above form the network's forward propagation. The next slide gives a short overview:
EXAMPLE CONTINUED (overview of forward propagation)
EXAMPLE CONTINUED The final step in a forward pass is to evaluate the predicted output s against an expected output y. The output y is part of the training dataset (x, y), where x is the input (as we saw in the previous section). Evaluation between s and y happens through a cost function. This can be as simple as MSE (mean squared error) or more complex like cross-entropy. We name this cost function C and denote it as follows: C = cost(s, y).
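For concreteness, a minimal sketch of two such cost functions (the sample values of s and y are made up for illustration):

```python
import numpy as np

# Two common choices for the cost C(s, y); s is the network's prediction
# and y the expected output from the training set.
def mse(s, y):
    return np.mean((s - y) ** 2)

def binary_cross_entropy(s, y):
    # assumes s is a probability in (0, 1), e.g. a sigmoid output
    return -np.mean(y * np.log(s) + (1 - y) * np.log(1 - s))

print(mse(np.array([0.8]), np.array([1.0])))                   # ~0.04
print(binary_cross_entropy(np.array([0.8]), np.array([1.0])))  # ~0.223
```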
EXAMPLE CONTINUED where cost can be MSE, cross-entropy or any other cost function. Based on C's value, the model "knows" how much to adjust its parameters in order to get closer to the expected output y. This happens via the backpropagation algorithm. Backpropagation and computing gradients According to the 1986 paper, backpropagation "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector".
EXAMPLE CONTINUED And "the ability to create useful new features distinguishes back-propagation from earlier, simpler methods…" In other words, backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters. One question may arise: why compute gradients? To answer this, we first need to revisit some calculus terminology:
EXAMPLE CONTINUED The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at x: ∇C = (∂C/∂x_1, …, ∂C/∂x_m). The derivative of a function C measures the sensitivity of the function value (output) to a change in its argument x (input); in other words, the derivative tells us the direction in which C is changing. The gradient shows how much the parameter x needs to change (in the positive or negative direction) to minimize C. Computing these gradients is done using a technique called the chain rule.
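As a small numerical check of this terminology, the sketch below compares the analytic gradient of a toy cost C with a finite-difference approximation (the choice of C is an illustrative assumption):

```python
import numpy as np

# Illustrating "sensitivity to change": compare the analytic partial
# derivatives with a finite-difference approximation.
def C(x):
    return x[0] ** 2 + 3 * x[0] * x[1]                # a toy cost C(x_1, x_2)

def grad_C(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])  # exact gradient

x = np.array([1.0, 2.0])
eps = 1e-6
numeric = np.array([
    (C(x + eps * np.eye(2)[i]) - C(x)) / eps for i in range(2)
])
print(grad_C(x))   # [8. 3.]
print(numeric)     # approximately [8. 3.]
```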
EXAMPLE CONTINUED By the chain rule, the gradient of C with respect to a weight is ∂C/∂(w_jk)^l = (∂C/∂(z_j)^l) · (∂(z_j)^l/∂(w_jk)^l). A similar equation applies to (b_j)^l: ∂C/∂(b_j)^l = (∂C/∂(z_j)^l) · (∂(z_j)^l/∂(b_j)^l).
EXAMPLE CONTINUED The common part in both equations, ∂C/∂(z_j)^l, is often called the "local gradient" and is expressed as follows: (δ_j)^l = ∂C/∂(z_j)^l. The local gradient can easily be determined using the chain rule. The gradients allow us to optimize the model's parameters: w → w − ε(∂C/∂w) and b → b − ε(∂C/∂b).
EXAMPLE CONTINUED Initial values of w and b are randomly chosen. Epsilon (ε) is the learning rate; it determines the gradient's influence. w and b are matrix representations of the weights and biases. The derivative of C with respect to w or b can be calculated using the partial derivatives of C with respect to the individual weights or biases. The termination condition is met once the cost function is minimized.
EXAMPLE CONTINUED The final part of this section is a simple example in which we calculate the gradient of C with respect to a single weight, (w_22)². Let's zoom in on the bottom part of the above neural network:
EXAMPLE CONTINUED Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the chain rule through (z_2)³ and (a_2)³: ∂C/∂(w_22)² = (∂C/∂(a_2)³) · (∂(a_2)³/∂(z_2)³) · (∂(z_2)³/∂(w_22)²). Calculating the final value of the derivative of C with respect to (a_2)³ requires knowledge of the function C; since C depends on (a_2)³, calculating the derivative should be fairly straightforward. Knowing the nuts and bolts of this algorithm will fortify your neural-network knowledge and make you comfortable taking on more complex models.
Summary of Back Propagation Algorithm
Summary of Back Propagation Algorithm
- Inputs X arrive through the preconnected path.
- The input is modeled using real weights W; the weights are usually randomly selected.
- Calculate the output of every neuron, from the input layer, through the hidden layers, to the output layer.
- Calculate the error in the outputs: Error = Actual Output − Desired Output.
- Travel back from the output layer to the hidden layers, adjusting the weights so that the error decreases.
- Keep repeating the process until the desired output is achieved.
A minimal end-to-end sketch of these steps follows below.
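Finally, a self-contained sketch following these summary steps, using a small sigmoid network trained on XOR with a mean-squared-error cost (the architecture, random seed, learning rate and epoch count are illustrative assumptions; a different seed may be needed if training stalls in a poor local minimum):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Training data: the XOR task.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights are randomly selected; biases start at zero.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

learning_rate = 1.0
for epoch in range(5000):
    # Forward pass: input layer -> hidden layer -> output layer.
    a1 = sigmoid(X @ W1 + b1)
    s = sigmoid(a1 @ W2 + b2)

    # Error in the outputs (actual output minus desired output).
    error = s - y

    # Travel back, computing local gradients with the chain rule.
    delta2 = error * s * (1 - s)              # dC/dz at the output layer
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # dC/dz at the hidden layer

    # Adjust weights and biases so that the error decreases.
    W2 -= learning_rate * a1.T @ delta2
    b2 -= learning_rate * delta2.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ delta1
    b1 -= learning_rate * delta1.sum(axis=0, keepdims=True)

print(s.round(2))  # typically approaches [[0], [1], [1], [0]]
```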