ANN Architecture
[Diagram: a single neuron. The input features (feature vector x) are multiplied by the connection weights (weight vector w) and summed with the bias b to give the weighted sum z = wᵀx + b; the activation function then produces the output a = σ(z).]
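As a concrete illustration of the neuron above, here is a minimal NumPy sketch (not from the slides; the input, weight, and bias values are invented) that computes the weighted sum z = wᵀx + b and applies a sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single neuron: weighted sum of the inputs plus bias, then activation."""
    z = np.dot(w, x) + b      # weighted sum  z = w^T x + b
    a = sigmoid(z)            # activation    a = sigma(z)
    return a

# Arbitrary values, for illustration only
x = np.array([0.5, 0.2, 0.1])   # input features
w = np.array([0.4, 0.3, 0.9])   # connection weights
b = 0.1                         # bias
print(neuron(x, w, b))
```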
Simple Linear Regression
[Plot: data points (x, y) with the fitted regression line; y_a marks an actual value and y_p its predicted value on the line, and the vertical distances e1 ... e6 are the errors (residuals) between actual and predicted values.]
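A short NumPy sketch of what the figure depicts (the six data points are invented just to produce residuals like e1 ... e6): fit a least-squares line and report each residual e_i = y_i - y_pred_i.

```python
import numpy as np

# Invented sample data: six points, matching the e1..e6 residuals in the figure
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Least-squares fit of y = w*x + b
w, b = np.polyfit(x, y, deg=1)

y_pred = w * x + b          # predicted values on the fitted line
residuals = y - y_pred      # e_i: vertical distance from each point to the line

print("slope w =", w, "intercept b =", b)
print("residuals:", residuals)
```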
Logistic Regression
The model output is a = σ(z), where z = wᵀx + b, x is the feature vector, and w and b are the parameters (weight vector and bias).
The class prediction is ŷ = 0 if a < 0.5, and ŷ = 1 if a ≥ 0.5.
The objective of training is to set the parameters w, b so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0).
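A minimal NumPy sketch of this prediction rule (the parameter values below are placeholders, not trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Estimated probability a = sigma(w^T x + b) that the instance is positive."""
    return sigmoid(np.dot(w, x) + b)

def predict(x, w, b, threshold=0.5):
    """Class prediction: 1 if the estimated probability is >= 0.5, else 0."""
    return int(predict_proba(x, w, b) >= threshold)

# Placeholder parameters and one example instance
w = np.array([1.5, -2.0, 0.5])
b = -0.3
x = np.array([0.2, 0.1, 0.9])
print(predict_proba(x, w, b), predict(x, w, b))
```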
Logistic Regression
Loss function of a single training instance: L(a, y) = -[ y log(a) + (1 - y) log(1 - a) ]
-log(a) grows very large when a approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance.
It will also be very large if the model estimates a probability close to 1 for a negative instance.
The cost function over the whole training set is the average cost over all training instances: J(w, b) = (1/m) Σᵢ L(a⁽ⁱ⁾, y⁽ⁱ⁾).
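A small NumPy sketch of this cost (the labels and predicted probabilities are invented just to exercise the formula, and the eps clipping is my addition to keep log() finite):

```python
import numpy as np

def cross_entropy_cost(a, y, eps=1e-12):
    """Average logistic-regression cost over all training instances.

    a : predicted probabilities, y : true labels (0 or 1).
    eps keeps log() away from exactly 0 or 1.
    """
    a = np.clip(a, eps, 1.0 - eps)
    losses = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))  # per-instance loss L(a, y)
    return losses.mean()                                     # J = average over the training set

# Invented example: confident-and-wrong predictions make the cost blow up
y = np.array([1, 0, 1, 0])
a_good = np.array([0.90, 0.10, 0.8, 0.2])
a_bad  = np.array([0.01, 0.99, 0.8, 0.2])   # close to 0 for a positive, close to 1 for a negative
print(cross_entropy_cost(a_good, y))   # small cost
print(cross_entropy_cost(a_bad, y))    # much larger cost
```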
Gradient Descent
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
The general idea of Gradient Descent is to tweak the parameters iteratively in order to minimize a cost function.
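To make the idea concrete, here is a sketch of my own (not the slides' code) applying gradient descent to the logistic-regression cost above; the dataset, learning rate, and number of steps are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, learning_rate=0.1, n_steps=1000):
    """Minimise the average cross-entropy cost by repeatedly stepping
    the parameters opposite to the gradient of the cost."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_steps):
        a = sigmoid(X @ w + b)          # forward pass: predicted probabilities
        dz = a - y                      # gradient of the cost w.r.t. z
        dw = X.T @ dz / m               # gradient w.r.t. the weights
        db = dz.mean()                  # gradient w.r.t. the bias
        w -= learning_rate * dw         # tweak the parameters downhill
        b -= learning_rate * db
    return w, b

# Tiny invented dataset: label is 1 when the first feature is large
X = np.array([[0.1, 0.8], [0.4, 0.2], [0.8, 0.1], [0.9, 0.7]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(w, b, sigmoid(X @ w + b))
```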
Gradient Descent – Learning Rate
[Two figures: one where the learning rate is too small, one where the learning rate is too large.]
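A small numeric sketch of what those two figures illustrate, using my own one-parameter toy cost J(θ) = θ² (not from the slides): a too-small learning rate barely moves after many steps, while a too-large one overshoots and diverges.

```python
def run_gradient_descent(learning_rate, n_steps=50, start=5.0):
    """Gradient descent on the toy cost J(theta) = theta**2 (gradient is 2*theta)."""
    theta = start
    for _ in range(n_steps):
        theta -= learning_rate * 2.0 * theta
    return theta

print(run_gradient_descent(0.001))  # too small: after 50 steps theta is still far from 0
print(run_gradient_descent(0.4))    # reasonable: theta is essentially 0
print(run_gradient_descent(1.1))    # too large: each step overshoots and theta blows up
```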
ANN Architecture (recap)
[The single-neuron diagram again: input features x, connection weights w, bias b, weighted sum z = wᵀx + b, activation a = σ(z).]
Artificial Neural Network
[Diagram: a multi-layered perceptron built from single perceptrons. Input features x1, x2, x3 (layer [0]) feed a hidden layer of neurons (layer [1]), which feeds an output neuron (layer [2]); each layer has its own connection weights, weighted sums, and activation functions.]
Layer [1] computation: z[1] = W[1]x + b[1], a[1] = σ(z[1])
Layer [2] computation: z[2] = W[2]a[1] + b[2], a[2] = σ(z[2])
Output: a[2], with loss L(a[2], y)
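A compact NumPy sketch of these two layer computations (the layer sizes, random parameters, and final cross-entropy loss are my additions for illustration; the slide itself only defines the equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Two-layer perceptron: layer [1] then layer [2]."""
    z1 = W1 @ x + b1        # layer [1] weighted sums
    a1 = sigmoid(z1)        # layer [1] activations
    z2 = W2 @ a1 + b2       # layer [2] weighted sum
    a2 = sigmoid(z2)        # layer [2] activation (network output)
    return a2

def loss(a2, y):
    """Cross-entropy loss L(a[2], y) for a single instance."""
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

# Illustrative shapes: 3 inputs, 3 hidden neurons, 1 output (random parameters)
rng = np.random.default_rng(0)
x = rng.random(3)
W1, b1 = rng.random((3, 3)), rng.random(3)
W2, b2 = rng.random((1, 3)), rng.random(1)
a2 = forward(x, W1, b1, W2, b2)
print(a2, loss(a2, y=1))
```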
Forward Propagation
[Worked example: three inputs x1, x2, x3 feed a hidden layer of three neurons, which feeds a single output neuron; no biases are used in this example.]
x = [0.9, 0.1, 0.8]ᵀ
W[1] = [[0.9, 0.3, 0.4], [0.2, 0.8, 0.2], [0.1, 0.5, 0.6]]
z[1] = W[1]x = [1.16, 0.42, 0.62]ᵀ,  a[1] = σ(z[1]) = [0.761, 0.603, 0.650]ᵀ
W[2] = [0.3, 0.6, 0.8]
z[2] = W[2]a[1] = 1.11,  a[2] = σ(z[2]) = 0.75
ANN: Forward Propagation
[The same multi-layered perceptron, with the values of the worked example filled in step by step.]
x = [0.9, 0.1, 0.8]ᵀ
W[1] = [[0.9, 0.3, 0.4], [0.2, 0.8, 0.2], [0.1, 0.5, 0.6]]
z[1] = W[1]x = [1.16, 0.42, 0.62]ᵀ
a[1] = σ(z[1]) = [0.761, 0.603, 0.650]ᵀ
W[2] = [0.3, 0.6, 0.8]
z[2] = W[2]a[1] = 1.11
a[2] = σ(z[2]) = 0.75
Loss: L(a[2], y)
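The following NumPy snippet reproduces the worked example above (the input vector x = [0.9, 0.1, 0.8] is inferred from the products shown on the slide; no biases are used here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Values from the worked example (x inferred so that W1 @ x gives the slide's z[1])
x  = np.array([0.9, 0.1, 0.8])
W1 = np.array([[0.9, 0.3, 0.4],
               [0.2, 0.8, 0.2],
               [0.1, 0.5, 0.6]])
W2 = np.array([0.3, 0.6, 0.8])

z1 = W1 @ x          # [1.16, 0.42, 0.62]
a1 = sigmoid(z1)     # [0.761, 0.603, 0.650]
z2 = W2 @ a1         # 1.11
a2 = sigmoid(z2)     # 0.75

print(np.round(z1, 2), np.round(a1, 3), round(z2, 2), round(a2, 2))
```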
Backpropagation
[Computation graph for a single neuron: inputs x1, x2 are multiplied by weights w1, w2 and summed with the bias b to give z = w1x1 + w2x2 + b; the activation a = σ(z) feeds the loss L(a, y). Backpropagation works backwards through this graph to obtain the gradients of the loss with respect to w1, w2 and b.]
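A minimal sketch of the backward pass for this single neuron, derived with the chain rule and the cross-entropy loss from earlier (my own illustration; the input, label, and parameter values are invented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x1, x2, y, w1, w2, b):
    """One forward and one backward pass through the single-neuron graph."""
    # Forward pass
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # cross-entropy loss L(a, y)

    # Backward pass (chain rule); dL/dz simplifies to a - y for sigmoid + cross-entropy
    dz = a - y
    dw1 = dz * x1        # dL/dw1
    dw2 = dz * x2        # dL/dw2
    db = dz              # dL/db
    return loss, dw1, dw2, db

# Invented values for illustration
print(forward_backward(x1=0.5, x2=1.5, y=1, w1=0.2, w2=-0.4, b=0.1))
```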