Deep learning requirements and notes for novices

AmmarAhmedSiddiqui2 · 111 slides · Oct 15, 2024

About This Presentation

Deep learning notes


Slide Content

MOST IMPORTANT LINKS: Simple explanation of convolutional neural network | Deep Learning Tutorial 23 (TensorFlow & Python) - YouTube; Implementing a Neural Network from Scratch in Python · Denny's Blog (dennybritz.com)

pip install tensorflow -- to install TensorFlow on your system. Running a Jupyter server. Accessing a Jupyter notebook.

Most popular DL frameworks: PyTorch (by Facebook) and TensorFlow (by Google).

Keras is not a full-fledged framework but rather a nice wrapper around TensorFlow, CNTK (by Microsoft) and Theano; it just makes programming easier. Post TensorFlow 2.0, Keras is part of the TensorFlow library suite. The two most popular deep learning frameworks are (a) PyTorch and (b) TensorFlow.

The slope of a vertical line is always undefined; the slope of a horizontal line is always 0.

The beauty of a CNN is that there is no need to provide explicit filters: it detects filters automatically. Contrast this with classification through a deep fully connected network (all nodes connected to all others); a CNN is not fully connected in that way, meaning all nodes are not necessarily connected to all other nodes. We provide thousands of photos of koalas here, and the CNN uses backpropagation to automatically generate appropriate filters; that is part of learning. The only parameters we specify are how many filters we want and what the size of the filters will be. There is no need to provide the filter values.

CNN architecture components: Convolution, Padding, Stride, Pooling, Softmax, Fully Connected NN, Tensor.

The forward pass of a kernel: during the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region. This yields a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The sliding step size of the kernel is called the stride.
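A minimal numpy sketch of this sliding-window forward pass (the helper name conv2d_single and all values are illustrative, not from the slides; real frameworks use heavily optimized implementations):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the activation map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    activation = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Receptive field at this spatial position
            patch = image[i*stride:i*stride+kH, j*stride:j*stride+kW]
            activation[i, j] = np.sum(patch * kernel)
    return activation

image = np.random.rand(10, 10)
kernel = np.random.rand(3, 3)
print(conv2d_single(image, kernel).shape)  # (8, 8)
```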

Sample CNN

What is a local receptive field? A subset of a feature map or input image.

Parameters? The # of parameters in a given layer is the count of "learnable" elements. Parameters in general are weights that are learnt during training: weight matrices that contribute to the model's predictive power, changed during the backpropagation process.

# of parameters in an input layer: the input layer has nothing to learn; at its core, all it does is provide the input image's shape. So there are no learnable parameters here: number of parameters = 0.

# of parameters in a convolutional layer: consider a convolutional layer that takes "l" feature maps as input and produces "k" feature maps as output, with filter size "n*m". Example: here the input has l=3 feature maps, k=96 feature maps are output, and the filter size is n=11, m=11. It is important to understand that we don't simply have an 11*11 filter; we actually have an 11*11*3 filter, since the input has 3 channels. As the output of this first conv layer we learn 96 different filters, whose total weight count is "n*m*l*k". Then there is a bias term for each output feature map, so the total number of parameters is "(n*m*l+1)*k". # params = ((11 * 11 * 3) + 1) * 96
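As a quick check of the formula, a one-liner reproducing the slide's example in Python:

```python
# Parameter formula (n*m*l + 1) * k for 11x11 filters, 3 input channels, 96 outputs
n, m, l, k = 11, 11, 3, 96
params = (n * m * l + 1) * k   # the +1 is the bias per output feature map
print(params)                  # 34944
```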

Formula for output shape: N = input size, F = size of filter, P = # of padding, S = # of strides. Output_shape = ((N - F + 2P) / S) + 1. Example: Output_shape = ((10 - 3 + 2(0)) / 1) + 1 = 7 + 1 = 8.

Sample calculation: Output_shape & # params. # params in 1 filter = 3 x 3 + 1 = 10 (including 1 bias per filter); # params in 5 filters = 10 * 5 = 50. Output_shape = ((10 - 3 + 2(0)) / 1) + 1 = 7 + 1 = 8, i.e. (8 x 8 x 5).
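The same calculation in Python (conv_output_size is a hypothetical helper implementing the slide's formula):

```python
# Verifying the sample: 10x10x1 input, five 3x3 filters, p=0, s=1
def conv_output_size(N, F, P, S):
    return (N - F + 2 * P) // S + 1

print(conv_output_size(10, 3, 0, 1))   # 8 -> output shape (8, 8, 5)
print((3 * 3 * 1 + 1) * 5)             # 50 learnable parameters
```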

Main benefits of a pooling layer: reduced size; translation invariance; feature enhancement (max-pooling); no need for training (no learnable parameters). Source: Pooling Layer in CNN | MaxPooling in Convolutional Neural Network - YouTube, https://www.youtube.com/watch?v=DwmGefkowCU

Benefits of Pooling: Size Reduction

Pooling = sub-sampling. Sub-sampling handles translation invariance: in both figures, A and B, the digit '8' is slightly shifted from the origin, but after the subsampling/max-pooling filter is applied, both resulting images are centered at the origin, although some details are lost. Generally speaking, pooling (min, avg) focuses on higher-level features and ignores minute details, except for max-pooling, where the features are actually enhanced.

Benefits of pooling: feature enhancement. In the case of max-pooling, you take a small area of the input image and keep the most dominant value (the max). You are effectively selecting the brightest weight from the receptive field, which yields the most enhanced feature. Caution: this holds only for max-pooling; it is not applicable to other forms of pooling.

Benefits of pooling: no need for training. In a convolution layer, the weights in the filter are found by applying backpropagation. Pooling, however, is just an aggregate operation (min, max, avg), so no training is required. All you need to tell the model is: What is the local receptive field? What is the value of the stride? Which type of pooling (avg, min, max)? The pooling layer is faster for this reason: there is no backpropagation involved.

Types of pooling in Keras: max-pooling; avg-pooling; global pooling (global max-pooling and global avg-pooling, which is simply the average of the receptive field). In the majority of cases max-pooling is used, but sometimes avg-pooling is used as well.

Global max-pooling: you convert an entire feature map into a single 1x1 scalar value. For global average pooling, you take the average of all values of an input feature map; for global max pooling, you take the max of all values of an input feature map. Where to use it? At the end stage of a CNN, when you are flattening your data, you can use global max pooling as a replacement for flattening, to reduce over-fitting. For global max pooling of an input of 4x4x3 feature maps, you get a 1x3 output: one value per feature map.
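A small Keras sketch of this behaviour, assuming TensorFlow 2.x; the shapes match the 4x4x3 example above:

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 4, 4, 3).astype("float32")   # one 4x4x3 feature-map stack
gmp = tf.keras.layers.GlobalMaxPooling2D()(x)
gap = tf.keras.layers.GlobalAveragePooling2D()(x)
print(gmp.shape, gap.shape)  # (1, 3) (1, 3): one scalar per feature map
```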

Disadvantages of pooling: location. Translation invariance makes the location of a feature irrelevant to its detection. This is quite helpful in some classification tasks, where, for example, you need to identify whether the image contains a cat or not, regardless of its position in the input image. However, in some computer vision tasks the location of the feature is very important. In image segmentation tasks, for instance, location does matter, so pooling is not used there: the location of the car is important, and the features must all be in the same location where the car is present.

Disadvantages of pooling: information loss. A lot of information is lost: for example, pooling from 4x4=16 values down to 2x2=4 values discards 75% of the information. However, it all depends on the application and the information vs. computational complexity trade-off.

LeNet-5 architecture (diagram): input is a monochrome/greyscale image; each conv + pooling pair is considered as one layer; followed by a flatten layer and a fully connected ANN.

LeNet-5 / TensorFlow source code (output architecture shown on slide):

# Adding libraries
import tensorflow
from tensorflow import keras
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
from keras import Sequential
from keras.datasets import mnist

# Loading dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Generating model through the KERAS library
model_lenet5 = Sequential()
model_lenet5.add(Conv2D(6, kernel_size=(5, 5), padding='valid', activation='tanh', input_shape=(32, 32, 1)))
model_lenet5.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model_lenet5.add(Conv2D(16, kernel_size=(5, 5), padding='valid', activation='tanh'))
model_lenet5.add(MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'))
model_lenet5.add(Flatten())
model_lenet5.add(Dense(120, activation='tanh'))
model_lenet5.add(Dense(84, activation='tanh'))
model_lenet5.add(Dense(10, activation='softmax'))

# Generating model summary
model_lenet5.summary()
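A possible continuation of the slide's code, not shown in the deck: MNIST images are 28x28 while this model expects 32x32 inputs, so one common workaround (an assumption here, not the author's stated method) is to zero-pad the images before compiling and fitting:

```python
import numpy as np

# Zero-pad 28x28 MNIST digits to 32x32, add a channel axis, scale to [0, 1]
X_train = np.pad(X_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0
X_test  = np.pad(X_test,  ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0

model_lenet5.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',  # integer labels
                     metrics=['accuracy'])
model_lenet5.fit(X_train, y_train, epochs=5,
                 validation_data=(X_test, y_test))
```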

How to calculate the # of learnable parameters for a convolution layer (diagram): for 6 feature maps at the output, 6 filters are required, each of size m x n x l. The filter is also 3-dimensional, and the 3rd dimension comes from the number of input channels; an input RGB image is 3-dimensional.

LeNet-5 parameter estimation. Legend: fs = filter size (n x m), l = # of channels at input, p = padding, s = strides, #f = number of filters applied, k = output feature maps. Output size = ((a - n + 2p) / s) + 1 for input size a; conv parameters = (n x m x l + 1) * k.

- First layer, Conv2D, fs=(5x5), p=0, s=1, #f=6: input (32, 32, 1); output = ((32 - 5 + 2(0)) / 1) + 1 = 27 + 1 = 28, i.e. (28 x 28 x 6); params = (5 x 5 x 1 + 1) * 6 = 156
- First layer, Max-Pool, fs=(2x2), p=0, s=2: input (28 x 28 x 6); output = ((28 - 2 + 2(0)) / 2) + 1 = 13 + 1 = 14, i.e. (14 x 14 x 6); params = 0
- Second layer, Conv2D, fs=(5x5), p=0, s=1, #f=16: input (14 x 14 x 6); output = ((14 - 5 + 2(0)) / 1) + 1 = 9 + 1 = 10, i.e. (10 x 10 x 16); params = (5 x 5 x 6 + 1) * 16 = 2,416
- Second layer, Max-Pool, fs=(2x2), p=0, s=2: input (10 x 10 x 16); output = ((10 - 2 + 2(0)) / 2) + 1 = 4 + 1 = 5, i.e. (5 x 5 x 16); params = 0
- Flatten layer: (5 x 5 x 16) -> (1, 400) 1-D array
- First dense layer (neurons=120): (1, 400) -> (1, 120); params = (input * neurons) + biases = (400 * 120) + 120 = 48,120
- Second dense layer (neurons=84): (1, 120) -> (1, 84); params = (120 * 84) + 84 = 10,164
- Final output layer: (1, 84) -> (1, 10); params = (84 * 10) + 10 = 850
- Total learnable parameters: 61,706 (about 241 KB)

The same LeNet-5 table, annotated with an alternate way of writing the counts: conv params = fs² * input_channels * output_channels + biases (one bias per output channel), e.g. 5 * 5 * 1 * 6 + 6 = 156 for the first conv layer; dense params = (1 x 1) * inputs * nodes + nodes, e.g. (1 * 1) * 400 * 120 + 120 = 48,120 for the first dense layer. All shapes and totals are identical to the table above.

Parameter estimation for a smaller example (legend as above):

- First layer, Conv2D, fs=(9x9), p=0, s=1, #f=3: input (15, 15, 1); output = ((15 - 9 + 2(0)) / 1) + 1 = 6 + 1 = 7, i.e. (7 x 7 x 3); params = (9 x 9 x 1 + 1) * 3 = 82 * 3 = 246
- First layer, Max-Pool, fs=(2x2), p=0, s=2: input (7 x 7 x 3); output = ((7 - 2 + 2(0)) / 2) + 1 = 2 + 1 = 3 (using floor division), i.e. (3 x 3 x 3); params = 0
- Flatten layer: (3 x 3 x 3) -> (1, 27)
- First dense layer (nodes=27): (1, 27) -> (1, 27); params = (input * nodes) + biases = (27 * 27) + 27 = 756
- Final output layer (nodes=3): (1, 27) -> (1, 3); params = (27 * 3) + 3 = 84
- Total learnable parameters: 246 + 756 + 84 = 1,086
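Re-deriving the totals from this table in Python (a sketch):

```python
# Layer parameter counts for the small 15x15 example above
conv   = (9 * 9 * 1 + 1) * 3   # first conv layer: 246
dense1 = 27 * 27 + 27          # first dense layer: 756
out    = 27 * 3 + 3            # output layer: 84
print(conv + dense1 + out)     # 1086 total; pooling and flatten add nothing
```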

Object Detector types

Object detection method types. Single-shot object detection uses a single pass of the input image to make predictions about the presence and location of objects in the image. It processes the entire image in one pass, making it computationally efficient, but it is generally less accurate than other methods and less effective at detecting small objects. Such algorithms can be used to detect objects in real time in resource-constrained environments. YOLO is a single-shot detector that uses a fully convolutional neural network (CNN) to process an image. Two-shot object detection uses two passes of the input image: the first pass generates a set of proposals, or potential object locations, and the second pass refines these proposals and makes the final predictions. This approach is more accurate than single-shot detection but also more computationally expensive. Generally, single-shot object detection is better suited for real-time applications, while two-shot object detection is better for applications where accuracy matters more.

Metrics: Intersection over Union (IoU). IoU is a popular metric for measuring localization accuracy and calculating localization errors in object detection models. To calculate the IoU between the predicted and ground-truth bounding boxes for the same object, we take the area of overlap between the two boxes (the "Intersection") and the total area covered by the two boxes (the "Union"). The intersection divided by the union gives the ratio of the overlap to the total area, providing a good estimate of how close the predicted bounding box is to the original bounding box.
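A minimal IoU sketch in Python, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (a common but not universal convention):

```python
def iou(box_a, box_b):
    # Intersection rectangle corners
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```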

Metrics: Average Precision (AP). AP is calculated as the area under a precision vs. recall curve for a set of predictions. Recall is the ratio of true positive predictions for a class to the total number of ground-truth labels for that class; precision is the ratio of true positives to the total predictions made by the model. Recall and precision offer a trade-off that can be represented graphically as a curve by varying the classification threshold. The area under this precision vs. recall curve gives us the Average Precision per class for the model; the average of this value, taken over all classes, is called mean Average Precision (mAP). In object detection, precision and recall aren't used for class predictions; instead they measure the quality of the predicted bounding boxes: an IoU value > 0.5 is taken as a positive prediction, while an IoU value < 0.5 is a negative prediction.
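A sketch of the AP computation as the area under a precision-recall curve; the precision/recall pairs below are hypothetical values obtained by sweeping the threshold, purely to show the mechanics:

```python
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.45])
ap = np.trapz(precision, recall)   # area under the PR curve
print(round(ap, 3))
# mAP would then be the mean of this value over all classes.
```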

YOLO from Ultralytics. You Only Look Once (YOLO) proposes an end-to-end neural network that makes predictions of bounding boxes and class probabilities all at once. This differs from the approach taken by previous object detection algorithms, which repurposed classifiers to perform detection. YOLO performs all of its predictions with the help of a single fully connected layer. Ultralytics YOLOv8 | State-of-the-Art Vision AI

YOLO Vs. Others

YOLO Algorithm / Architecture YOLO Algorithm for Object Detection Explained [+Examples] (v7labs.com)

YOLO history: YOLO = You Only Look Once. YOLO has outperformed the previous R-CNN, Fast R-CNN and Faster R-CNN methods for object detection. It can make its predictions in one forward pass, hence the name You Only Look Once. What is YOLO algorithm? | Deep Learning Tutorial 31 (Tensorflow, Keras & Python) - YouTube

Object Localization

YOLO multi-object detection: multi-grid and center-of-object approach

Training YOLO on multi-grid vectors

YOLO prediction

First issue with YOLO: multiple objects overlap, but the centers of the objects are not in one cell. It may detect multiple bounding boxes for the same object (as shown). We don't know which box contains the person and which contains the dog, but every bounding box has its own probability.

Non-Max Suppression. We find the overlap area, which is the intersection over union, and apply the technique of Non-Max Suppression (NMS) to keep only the two distinct boxes, as shown on the right and sketched below.
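A minimal NMS sketch; it reuses the iou helper sketched in the IoU section, and the greedy keep-highest-score loop is the usual textbook approach (real detectors use optimized variants):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```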

Second issue with YOLO: multiple objects overlap and the centers of both objects are in one cell. When one cell contains the centers of two objects, we have a representation problem: should we generate two separate vectors of depth 7, or should we combine them into one anchor-box vector of 14 values? In real life it is rare for the centers of multiple objects to fall in the same small cell, so handling at most two objects centered in one cell is sufficient for most cases.

CNN with two anchor boxes: A solution

Neural network types and data (diagram matching ANN, CNN and RNN to their typical data types)

Details: shifting from the sigmoid to the ReLU activation function drastically improved the computation of the gradient descent algorithm, enabling the use of larger networks. ŷ (y-hat) = output, y = ground truth; the loss function finds the difference between them.

Logistic Regression cost function
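For reference, the standard logistic regression loss and cost this slide refers to (a standard form, reconstructed rather than copied from the slide image):

```latex
\mathcal{L}(a, y) = -\bigl(y \log a + (1 - y)\log(1 - a)\bigr),
\qquad
J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\bigl(a^{(i)}, y^{(i)}\bigr)
```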

Gradient descent. J(w, b) is the cost function; the plot of w, b and J(w, b) is a surface. J(w, b) is a convex function, and the target is to find the minimum of J(w, b). It is not a non-convex function, which would have lots of local minima; this convex nature is the reason we use this cost function for logistic regression. Due to convexity, it is not necessary to initialize w and b to 0: you can start from any point on the surface.

Initialization of w and b. To find a good value for the parameters, we initialize w and b to some initial value, denoted by the little red dot. For logistic regression, almost any initialization method works; usually you initialize the values to 0. Random initialization also works, but people don't usually do that for logistic regression: because the function is convex, no matter where you initialize, you should get to the same point, or roughly the same point. Gradient descent starts at that initial point and takes a step in the steepest downhill direction, descending as quickly as possible. That is one iteration of gradient descent; after two iterations, three iterations and so on, it converges to the global optimum, the absolute minimum.

Gradient descent, assuming one parameter 'w'. Suppose the initial w is at point 1: w := w - α(+1); because the slope is positive, the updated w will be lower, moving down the curve. Suppose the initial w is at point 2: w := w - α(-1); because the slope is negative, the updated w will be higher, again moving down the curve. α (alpha) = the update/learning rate. A toy sketch follows.
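A toy sketch of this update rule, assuming the convex cost J(w) = (w - 3)² purely for illustration:

```python
# One-parameter gradient descent: w := w - alpha * dw
alpha = 0.1
w = 10.0                      # start on the positive-slope side ("point 1")
for _ in range(100):
    dw = 2 * (w - 3)          # dJ/dw: positive right of the minimum, negative left
    w = w - alpha * dw
print(round(w, 4))            # converges toward 3.0, the minimum of J
```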

Gradient descent for both parameters 'w' and 'b'. If J is a function of one variable, we use the simple derivative "d"; if J is a function of two or more variables, we use the partial derivative "∂".

Coding convention: in code, dw and db denote the derivatives of the cost with respect to the parameters (dJ/dw, written with the simple derivative, and ∂J/∂b, written with the partial derivative).

The derivative of a straight line is constant.

The derivative of a (non-linear) curve is not constant.

Computation graph. Going in reverse order (backpropagation) is the easier way to calculate derivatives: dJ/dv, then dv/du, then du/db.

Computing dJ/da

Computing dJ/db and dJ/dc

Logistic Regression Derivatives

Derivation of dL/dz - Deep Learning Specialization / Neural Networks and Deep Learning - DeepLearning.AI

Derivation of dz = a - y. ŷ (y-hat) is denoted as 'a' here; 'dz' in ML notation means dL(a, y)/dz.

Derivation of dw1, dw2 and db:
dw1 = (-y/a + (1-y)/(1-a)) * a(1-a) * d/dw1(w1x1 + w2x2 + b) = (a - y) * x1 = x1 * dz
dw2 = (-y/a + (1-y)/(1-a)) * a(1-a) * d/dw2(w1x1 + w2x2 + b) = (a - y) * x2 = x2 * dz
db = (-y/a + (1-y)/(1-a)) * a(1-a) * d/db(w1x1 + w2x2 + b) = (a - y) * 1 = dz
dw1, dw2, db and dz are all ML notations, meaning the derivative of L(a, y) with respect to w1, w2, b and z respectively.

Computing over the entire dataset [m samples]: a major for-loop over the m samples, plus a minor for-loop (also required) over the n weights. We need vectorization to get rid of these for-loops and write efficient code; this is necessary when m is very large. A vectorized sketch follows.
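A vectorized sketch of the same computation, assuming X of shape (n_features, m) and Y of shape (1, m), in the column-vector convention used later in the deck (the helper name propagate is an assumption):

```python
import numpy as np

def propagate(w, b, X, Y):
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(np.dot(w.T, X) + b)))    # forward pass, all m samples at once
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                  # gradients with no for-loops
    db = np.sum(A - Y) / m
    return dw, db, cost
```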

What is vectorization? GPUs and CPUs can execute parallel instructions. If you use built-in functions such as np.dot, which don't require explicitly implementing a for-loop, numpy can exploit this parallelism, and your computation runs faster. Vectorization can significantly improve your code.

# Program to demonstrate how vectorization improves computational performance
# by comparing a vector dot product (parallel implementation) vs a for-loop
# implementation (sequential execution)
import time
import numpy as np

# Getting details about the underlying hardware (running on Google Colab)
import platform
print("Machine              :" + str(platform.machine()))
print("Platform version     :" + str(platform.version()))
print("Platform             :" + str(platform.platform()))

# Generating arrays of elements
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Vector implementation
tic    = time.time()
result = np.dot(a, b)
toc    = time.time()
t1     = (toc - tic) * 1000
print("Execution time of vectorized version      = " + str(t1) + " ms" + " Computed value: " + str(result))

# Non-vector / loop implementation
result = 0
tic    = time.time()
for i in range(1000000):
    result += a[i] * b[i]
toc    = time.time()
t2     = (toc - tic) * 1000
print("Execution time of sequential loop version = " + str(t2) + " ms" + " Computed value: " + str(result))
print("Time difference                           = " + str(t2 - t1) + " ms")

Vector implementation using Python: numpy

Logistic regression derivatives

Python broadcasting

Python broadcasting: cal is already the right shape for the division via broadcasting; the .reshape call just makes sure the shape is what you expect, and it can be omitted.

Python broadcasting

Python broadcasting: refer to the Python/numpy documentation for the more general principles of broadcasting. A sketch of the running example follows.
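A sketch of the broadcasting example these slides refer to (the matrix values are illustrative; cal holds the column totals):

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis=0)                         # column totals, shape (4,)
percentage = 100 * A / cal.reshape(1, 4)    # (3,4) / (1,4): cal is broadcast down the rows
print(percentage)
```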

Correct Answers

Correct Answers

KEY LEARNING! a * b = element-wise multiplication; np.dot(a, b) = matrix multiplication.
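A two-line demonstration (values are arbitrary):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a * b)         # element-wise: [[ 5 12] [21 32]]
print(np.dot(a, b))  # matrix product: [[19 22] [43 50]]
```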

Vectorization

Tips: "Don't use rank-1 arrays." Rank-1 arrays are neither row vectors nor column vectors, so matrix and vector operations are not consistent with them. Always initialize with a proper structure and size. Use a = a.reshape((5, 1)) to convert a rank-1 array into a column vector: a.shape == (5, 1) is a column vector, a.shape == (1, 5) is a row vector. Use assert(a.shape == (5, 1)) to check. SIMPLIFY YOUR CODE: ALWAYS USE COLUMN OR ROW VECTORS.
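A short demonstration of why rank-1 arrays are troublesome (a sketch):

```python
import numpy as np

a = np.random.randn(5)        # rank-1 array: shape (5,), avoid this
print(a.shape)                # (5,)
print((a == a.T).all())       # True: transpose does nothing on a rank-1 array

a = a.reshape((5, 1))         # proper column vector
assert a.shape == (5, 1)      # catch shape bugs early
print(np.dot(a.T, a).shape)   # (1, 1): vector operations now behave consistently
```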

Python code: wrong implementation. A rank-1 array is neither a row vector nor a column vector; not taking a proper transpose of the vector is wrong, and the dot product is not the proper one required.

Python code: correct implementation

Why do we use numpy and not Python's math library? We rarely use the "math" library in deep learning because its functions take real numbers as inputs, whereas in deep learning we mostly use matrices and vectors. This is why numpy is more useful. The numpy version of sigmoid is sketched below.
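A minimal sketch of that numpy sigmoid; np.exp works element-wise on arrays, which math.exp cannot do:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([1, 2, 3])))  # works on whole vectors at once
```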

Vectorization of an RGB image | Reshaping Arrays

image2vector: flattening/vectorizing an image matrix using .reshape, as sketched below.
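A sketch of image2vector, assuming a (height, width, channels) input:

```python
import numpy as np

def image2vector(image):
    # Flatten a (height, width, channels) image into a single column vector
    return image.reshape(image.shape[0] * image.shape[1] * image.shape[2], 1)

img = np.random.rand(3, 3, 2)
print(image2vector(img).shape)  # (18, 1)
```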

Normalizing Rows

Normalization - code. **Note**: in normalize_rows(), you can try printing the shapes of x_norm and x and then rerun the assessment. You'll find that they have different shapes. This is normal, given that x_norm takes the norm of each row of x, so x_norm has the same number of rows but only one column. So how did it work when you divided x by x_norm? This is called broadcasting, and we'll talk about it now!
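A sketch of normalize_rows consistent with the note above; np.linalg.norm with keepdims=True gives the (n, 1) shape described:

```python
import numpy as np

def normalize_rows(x):
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)  # one norm per row, shape (n, 1)
    return x / x_norm                                  # division works via broadcasting

x = np.array([[0.0, 3.0, 4.0], [1.0, 6.0, 4.0]])
print(normalize_rows(x))  # each row now has unit length
```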

Softmax – A normalizing function Softmax is a normalizing function used when the algorithm needs to classify two or more classes.

Softmax - Python code. If you print the shapes of x_exp, x_sum and s above and rerun the assessment cell, you will see that x_sum is of shape (2, 1) while x_exp and s are of shape (2, 5). **x_exp / x_sum** works due to Python broadcasting.
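A sketch of the softmax code those shapes describe:

```python
import numpy as np

def softmax(x):
    x_exp = np.exp(x)                             # shape (2, 5)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)  # shape (2, 1)
    return x_exp / x_sum                          # broadcasting: (2, 5) / (2, 1)

x = np.random.rand(2, 5)
print(softmax(x).sum(axis=1))  # each row sums to 1
```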

Key points to remember: np.exp(x) works for any np.array x and applies the exponential function to every coordinate; the sigmoid function and its gradient; image2vector is commonly used in deep learning; np.reshape is widely used; numpy has efficient built-in functions; broadcasting is extremely useful.

Implement L1 loss function

Implement L2 loss function
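Minimal sketches of both the L1 and L2 loss functions (the sample values are illustrative):

```python
import numpy as np

def L1(yhat, y):
    return np.sum(np.abs(y - yhat))    # sum of absolute errors

def L2(yhat, y):
    return np.sum((y - yhat) ** 2)     # sum of squared errors

yhat = np.array([0.9, 0.2, 0.1, 0.4, 0.9])
y = np.array([1, 0, 0, 1, 1])
print(L1(yhat, y), L2(yhat, y))  # 1.1 0.43
```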

Some interpreter directives:

import numpy as np
import copy
import matplotlib.pyplot as plt
import h5py
import scipy
from PIL import Image
from scipy import ndimage
from lr_utils import load_dataset
from public_tests import *

%matplotlib inline
%load_ext autoreload
%autoreload 2

Trick to learn

ANN as simple cat detector

What is deepcopy?
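A short sketch of the difference between a shallow copy and copy.deepcopy:

```python
import copy

w = {"weights": [1.0, 2.0], "bias": 0.5}
shallow = copy.copy(w)        # copies the dict, but shares the inner list
deep = copy.deepcopy(w)       # clones nested objects recursively

w["weights"][0] = 99.0
print(shallow["weights"][0])  # 99.0: shallow copy still shares the inner list
print(deep["weights"][0])     # 1.0: deepcopy is unaffected
```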

Required Functions to implement

Neural network representation: a two-layer network (we don't count the input layer). The superscript [n] represents the layer number, and the subscript n represents the node number within the layer.

Neural Network Representation

Neural network representation: implementing the above four sets of equations using loops would be very slow; we need to VECTORIZE them.

Vectorized representation: converting to stacked matrices, i.e. column-vector notation, as sketched below.
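A numpy sketch of the vectorized two-layer forward pass (layer sizes are arbitrary; the convention stacks the m training examples as columns of X):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_h, n_y, m = 3, 4, 1, 5
X  = np.random.randn(n_x, m)                     # m examples as columns
W1 = np.random.randn(n_h, n_x); b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h); b2 = np.zeros((n_y, 1))

Z1 = np.dot(W1, X) + b1   # (4, 5); b1 broadcasts across the m columns
A1 = np.tanh(Z1)
Z2 = np.dot(W2, A1) + b2  # (1, 5)
A2 = sigmoid(Z2)
print(A2.shape)           # one prediction per example, no loops needed
```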

Vectorize representation

Convolution animations: no padding, no strides; arbitrary padding, no strides; half padding, no strides; full padding, no strides; no padding, strides; padding, strides; padding, strides (odd). GitHub - vdumoulin/conv_arithmetic: A technical report on convolution arithmetic in the context of deep learning