Content:
Training Machine Learning Algorithms for Classification
Artificial neurons – a brief glimpse into the early history of machine learning
Implementing a perceptron learning algorithm in Python
Training a perceptron model on the Iris dataset
Adaptive linear neurons and the convergence of learning
Self-Learning Exercise: Minimizing cost functions with gradient descent
Training Machine Learning Algorithms for Classification
Building an understanding of ML algorithms
Using pandas, NumPy and Matplotlib to read in, process and visualize data
Implementing linear classification algorithms in Python
Artificial neurons – a brief glimpse into the early history of machine learning
Trying to understand how the biological brain works in order to design artificial intelligence, Warren McCulloch and Walter Pitts published the first concept of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, in 1943. Neurons are interconnected nerve cells in the brain that are involved in the processing and transmitting of chemical and electrical signals, as illustrated in the following figure:
A Biological Neuron: source: https://www.kdnuggets.com/2019/10/introduction-artificial-neural-networks.html
Cell body (Soma): The body of the neuron cell contains the nucleus and carries out the biochemical transformations necessary to the life of the neuron.
Dendrites: Each neuron has fine, hair-like tubular structures (extensions) around it. They branch out into a tree around the cell body and accept incoming signals.
Axon: A long, thin, tubular structure that works like a transmission line, carrying the outgoing signal away from the cell body.
Synapse: The junction through which one neuron passes its signal to another. Through synapses, neurons are connected to one another in a complex spatial arrangement.
The Perceptron
The following diagram represents the general model of an artificial neural network (ANN), which is inspired by the biological neuron. A single-layer neural network of this kind is called a Perceptron. It gives a single output.

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output
In the above figure, for one single observation, x1, x2, x3, ..., xn represent the inputs (independent variables) to the network. Each input is multiplied by a connection weight, the counterpart of a synapse. The weights are written w1, w2, w3, ..., wn; a weight expresses the strength of a particular connection. b is a bias value, which shifts the weighted sum and therefore the point at which the activation function switches on. In the simplest case, the products are summed, the bias is added, and the total is fed to a transfer function (activation function) whose result is sent as the output.
Mathematically, the net input is
z = x1.w1 + x2.w2 + x3.w3 + ... + xn.wn + b = ∑ xi.wi + b
and the activation function is then applied:
output = 𝜙(z) = 𝜙(∑ xi.wi + b)
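The computation of a single neuron can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the slides: the function names `neuron_output` and `unit_step` and all input values are assumptions chosen for the example.

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    z = np.dot(w, x) + b          # z = sum_i x_i * w_i + b
    return activation(z)

def unit_step(z):
    """Threshold activation: 1.0 if z >= 0, otherwise 0.0."""
    return 1.0 if z >= 0 else 0.0

# Illustrative values only (hypothetical, not taken from the slides)
x = np.array([1.0, 2.0, 3.0])     # inputs
w = np.array([0.5, -1.0, 0.25])   # connection weights
b = 0.5                           # bias

z = np.dot(w, x) + b              # 0.5 - 2.0 + 0.75 + 0.5 = -0.25
y = neuron_output(x, w, b, unit_step)
print(z, y)
```

Because the net input is negative here, the threshold activation does not fire; raising the bias to 0.75 or more would move z to 0 or above and switch the output to 1.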
Activation function
The activation function decides whether a neuron should be activated, based on the weighted sum of its inputs plus the bias. Its purpose is to introduce non-linearity into the output of a neuron. Without an activation function, the output would simply be a linear function (a degree-one polynomial) of the inputs, and stacking several such layers would still give a linear function.
Why do we need non-linearity? Non-linear functions are those whose graph is not a straight line (or flat plane); they have curvature or kinks. We want a neural network to learn and represent complex, arbitrary mappings from input to output. Neural networks are considered “universal function approximators”: with a non-linear activation and enough hidden units, they can approximate any continuous function on a bounded input range to any desired accuracy.
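A small numerical check makes the need for non-linearity concrete. The sketch below uses hand-picked, hypothetical weight matrices `W1` and `W2` to show that two linear layers without an activation collapse into one linear layer, while inserting a ReLU between them changes the result.

```python
import numpy as np

# Hypothetical weights and input, chosen by hand for the illustration
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])      # first layer (2 inputs -> 2 units)
W2 = np.array([[1.0, 1.0]])       # second layer (2 units -> 1 output)
x = np.array([1.0, 2.0])

# Two stacked linear layers, no activation in between
two_linear = W2 @ (W1 @ x)

# The same mapping written as a single linear layer
collapsed = (W2 @ W1) @ x

# With a non-linearity (ReLU) between the layers
relu = lambda z: np.maximum(0.0, z)
with_relu = W2 @ relu(W1 @ x)

print(two_linear, collapsed, with_relu)
```

The first two results agree for every input, because the product `W2 @ W1` is itself just one weight matrix. Only the version with the activation computes something a single linear layer cannot.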
Types of Activation Functions:
1. Threshold Activation Function (binary step function): outputs 1 when the input reaches the threshold and 0 otherwise.
2. Sigmoid Activation Function (logistic function): an S-shaped curve whose output ranges between 0 and 1.
3. Hyperbolic Tangent Function (tanh): similar in shape to the sigmoid but zero-centred, which often makes training easier. It is non-linear, so layers can be stacked. Its output ranges over (-1, 1).
4. Rectified Linear Unit (ReLU): ReLU(z) = max(0, z). It is the most widely used activation function in CNNs and ANNs, with output in [0, ∞).
ReLU is non-linear, and a combination of ReLUs is also non-linear. In fact, it is a good approximator: a wide range of functions can be approximated by combining ReLUs. Networks with ReLU have been reported to train several times faster than equivalent tanh networks (about six times in one widely cited image-classification experiment). ReLU should be applied only to the hidden layers of a neural network. For the output layer, use a softmax function for classification problems and a linear function for regression problems.
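The activation functions listed above can be written compactly with NumPy. This is a sketch under the usual textbook definitions; the function names are illustrative, and the softmax subtracts the maximum before exponentiating, a common numerical-stability step that the slides do not mention.

```python
import numpy as np

def binary_step(z):
    """Threshold activation: 1 for z >= 0, otherwise 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent, output in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit, output in [0, inf)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turns a score vector into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))     # shift by the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
for fn in (binary_step, sigmoid, tanh, relu, softmax):
    print(fn.__name__, fn(z))
```

Evaluating all five on the same small vector shows the different output ranges side by side.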
How does the Neural network work? The Algorithm
Since Perceptrons are binary classifiers (0/1), we can define their computation as follows. Recall that the dot product of two vectors w and x of length n is w . x = ∑ᵢ wᵢ . xᵢ (1 ≤ i ≤ n). The function f(x) = b + w . x is a linear combination of the weight and feature vectors, and the Perceptron predicts 1 if f(x) ≥ 0 and 0 otherwise. The Perceptron is, therefore, a linear classifier: an algorithm that predicts using a linear predictor function.
Linear Separability
Linearly separable Gates
Now what about the XOR?
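The model can be trained using the following algorithm, where ŷᵏ is the predicted label for xᵏ (1 if f(xᵏ) ≥ 0, otherwise 0):
1. Set b = 0 and w = 0.
2. For N iterations, or until the weights do not change:
   (a) For each training example xᵏ with label yᵏ:
       i. If yᵏ - ŷᵏ = 0, continue.
       ii. Else, update each weight wᵢ and the bias b:
           △wᵢ = (yᵏ - ŷᵏ) xᵢᵏ,  wᵢ ← wᵢ + △wᵢ
           △b = (yᵏ - ŷᵏ),  b ← b + △b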
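The training loop can be sketched as follows and tried on the logic gates discussed earlier. The helper names `predict` and `train_perceptron` are assumptions made for this example, and the bias is updated alongside the weights. AND is linearly separable, so training stops with zero errors; XOR is not, so no number of iterations removes all errors.

```python
import numpy as np

def predict(X, w, b):
    """Predicted labels: 1 where b + w . x >= 0, otherwise 0."""
    return np.where(X @ w + b >= 0, 1, 0)

def train_perceptron(X, y, n_iter=20):
    """Perceptron rule; returns weights, bias and the errors in the last pass."""
    w = np.zeros(X.shape[1])
    b = 0.0
    errors = 0
    for _ in range(n_iter):
        errors = 0
        for xk, yk in zip(X, y):
            delta = yk - predict(xk, w, b)   # 0 if correct, +1 or -1 if wrong
            if delta != 0:
                w += delta * xk              # w_i <- w_i + (y - y_hat) * x_i
                b += delta                   # b   <- b   + (y - y_hat)
                errors += 1
        if errors == 0:                      # weights did not change: stop
            break
    return w, b, errors

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

w_and, b_and, err_and = train_perceptron(X, y_and)
w_xor, b_xor, err_xor = train_perceptron(X, y_xor)
print("AND:", predict(X, w_and, b_and), "errors in last pass:", err_and)
print("XOR:", predict(X, w_xor, b_xor), "errors in last pass:", err_xor)
```

Increasing `n_iter` does not help for XOR: a single straight line cannot separate its two classes, which is why the error count never reaches zero.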
Implementation
The dataset we use for implementing the Perceptron is the Iris flower dataset. It contains 4 features that describe each flower and labels each flower as one of 3 classes. We drop the last 50 rows, which belong to the class ‘Iris-virginica’, and keep only the 2 classes ‘Iris-setosa’ and ‘Iris-versicolor’. These two classes are linearly separable, so the perceptron learning rule is guaranteed to converge: after a finite number of updates it finds weights that separate them without error.
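One possible way to set this up is sketched below. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`), in which the first 100 rows are setosa and versicolor. It also assumes, purely so the result could be plotted in 2D, that only sepal length and petal length are used. The epoch cap of 2000 is a generous safety limit, not a tuned value.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:100, [0, 2]]    # first 100 rows; sepal length and petal length
y = iris.target[:100]          # 0 = setosa, 1 = versicolor

w = np.zeros(X.shape[1])
b = 0.0
errors_per_epoch = []

for epoch in range(2000):      # upper limit; the loop stops early on convergence
    errors = 0
    for xk, yk in zip(X, y):
        y_hat = 1 if xk @ w + b >= 0 else 0
        delta = yk - y_hat
        if delta != 0:
            w += delta * xk
            b += delta
            errors += 1
    errors_per_epoch.append(errors)
    if errors == 0:            # a full pass without mistakes: converged
        break

accuracy = float(np.mean(np.where(X @ w + b >= 0, 1, 0) == y))
print("epochs run:", len(errors_per_epoch), "training accuracy:", accuracy)
```

Plotting `errors_per_epoch` against the epoch number is a simple way to watch the convergence described above.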
Gradient Descent
Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models: it minimizes the error between actual and expected results. It is also used to train neural networks.
In mathematical terms, optimization is the task of minimizing or maximizing an objective function f(x) with respect to x. Similarly, in machine learning, optimization is the task of minimizing a cost function parameterized by the model's parameters. Gradient descent does this through iterative parameter updates; when the cost function is convex, these updates lead to the global minimum. Once machine learning models are optimized in this way, they can be used as tools in artificial intelligence and other computer science applications.
In this part we look at gradient descent in detail: the role of the cost function as a barometer of model performance, types of gradient descent, learning rates, and so on.
What is Gradient Descent
Gradient Descent is an iterative optimization algorithm, one of the most commonly used for training machine learning and deep learning models. It helps in finding a local minimum of a function. The direction of movement determines what is found:
If we move in the direction of the negative gradient (away from the gradient) of the function at the current point, we approach a local minimum of that function. This is gradient descent.
If we move in the direction of the positive gradient (towards the gradient) of the function at the current point, we approach a local maximum of that function.
The second procedure, moving along the positive gradient, is known as Gradient Ascent. Gradient descent itself is also known as steepest descent. The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively:
1. Calculate the first-order derivative of the function to obtain the gradient (slope) at the current point.
2. Move in the direction opposite to the gradient, by a step of alpha times the gradient, where alpha is the Learning Rate, a tuning parameter of the optimization process that decides the length of the steps.
In symbols, for parameters theta and cost function J: theta ← theta - alpha · ∇J(theta).
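The two steps can be sketched on a one-dimensional function. The example function f(x) = (x - 3)², the starting point and the helper name `gradient_descent` are assumptions chosen for illustration; its derivative is f'(x) = 2(x - 3) and its minimum lies at x = 3.

```python
def gradient_descent(grad, x0, alpha=0.1, n_steps=100):
    """Repeatedly step against the gradient: x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)   # step 1: evaluate the slope; step 2: move downhill
    return x

# f(x) = (x - 3)^2  ->  f'(x) = 2 * (x - 3), minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Flipping the sign of the update (adding instead of subtracting) would turn this into gradient ascent and move away from the minimum.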
How does Gradient Descent work?
Before looking at the working principle of gradient descent, we should recall how the slope of a line appears in linear regression. The equation for simple linear regression is given as: Y = mX + c, where 'm' represents the slope of the line and 'c' represents the intercept on the y-axis.
The starting point (shown in the figure above) is just an arbitrary point at which we first evaluate performance. At this point we compute the first derivative (slope) of the cost function; the tangent line there shows how steep the slope is. This slope then informs the updates to the parameters (weights and bias). The slope is steep at the starting point, but as new parameters are generated the steepness gradually reduces until the curve reaches its lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, that is, the error between expected and actual values. To minimize the cost function, two things are required: a direction and a learning rate. Together they determine the partial-derivative calculations of later iterations and allow the algorithm to reach the point of convergence (a local or global minimum). Let's discuss the learning rate in brief:
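To connect this with the line Y = mX + c, here is a minimal sketch of gradient descent fitting m and c. It assumes a mean-squared-error cost and a small synthetic, noise-free dataset generated from m = 2 and c = 1; the data, the learning rate and the iteration count are all illustrative choices.

```python
import numpy as np

# Synthetic data from the line Y = 2X + 1 (assumed for the example)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

m, c = 0.0, 0.0        # arbitrary starting point
alpha = 0.05           # learning rate
costs = []

for _ in range(2000):
    error = (m * X + c) - Y              # predicted minus actual
    costs.append(np.mean(error ** 2))    # mean squared error at this step
    grad_m = 2 * np.mean(error * X)      # partial derivative of the cost w.r.t. m
    grad_c = 2 * np.mean(error)          # partial derivative of the cost w.r.t. c
    m -= alpha * grad_m                  # move against the slope
    c -= alpha * grad_c

print(m, c, costs[0], costs[-1])
```

The recorded `costs` fall steeply in the first iterations and flatten out near the point of convergence, matching the behaviour described above.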
Learning Rate: It is defined as the step size taken to reach the minimum or lowest point. It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum. A low learning rate gives small steps, which costs efficiency (more iterations are needed) but offers more precision.
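The effect of the step size can be seen on a toy cost. The sketch below assumes the cost f(x) = x², with gradient 2x and minimum at 0, a start at x = 1 and three hand-picked learning rates; the helper name `run` is hypothetical.

```python
def run(alpha, n_steps=20, x0=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(n_steps):
        x -= alpha * 2 * x
    return x

x_small = run(0.01)   # low rate: small steps, still far from 0 after 20 steps
x_good = run(0.3)     # moderate rate: very close to the minimum
x_large = run(1.1)    # high rate: every step overshoots and the error grows

print(x_small, x_good, x_large)
```

With the high rate, each update lands on the other side of the minimum and further away than before, which is the overshooting risk mentioned above.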
Implementing the Perceptron Algorithm in Python
The perceptron is the most basic single-layered neural network used for binary classification. First, we will look at the Unit Step Function and see how the Perceptron Algorithm classifies, and then have a look at the perceptron update rule. Finally, we will plot the decision boundary for our data. We will use data with only two features, and there will be two classes since the Perceptron is a binary classifier. We will implement all the code using Python NumPy, and visualize/plot using Matplotlib.
Perceptron
Let us try to understand the Perceptron algorithm using the following data as a motivating example.

    import matplotlib.pyplot as plt
    from sklearn import datasets

    # Two Gaussian blobs: 150 samples, 2 features, 2 classes
    X, y = datasets.make_blobs(n_samples=150, n_features=2, centers=2,
                               cluster_std=1.05, random_state=2)

    # Plotting: class 0 as red triangles, class 1 as blue squares
    fig = plt.figure(figsize=(10, 8))
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], 'r^')
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], 'bs')
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.title('Random Classification Data with 2 classes')
    plt.show()
There are two classes, red and blue, and we want to separate them by drawing a straight line between them. Or, more formally, we want to learn a set of parameters theta to find an optimal hyperplane (a straight line for our data) that separates the two classes. For Linear Regression our hypothesis (y_hat) was theta.X. Then, for binary classification in Logistic Regression, we needed to output probabilities between 0 and 1, so we modified the hypothesis to sigmoid(theta.X). We applied the sigmoid function over the dot product of input features and parameters because we needed to squash our output between 0 and 1. For the Perceptron algorithm, we apply a different function over theta.X, the Unit Step Function g, so the hypothesis is defined as
y_hat = g(theta.X)
where,
g(z) = 1 if z ≥ 0
g(z) = 0 if z < 0
Unlike Logistic Regression, which outputs a probability between 0 and 1, the Perceptron outputs values that are exactly 0 or 1. The function says that if the output (theta.X) is greater than or equal to zero, the model classifies the point as 1 (the blue squares in the plot above, for example), and if the output is less than zero, the model classifies it as 0 (the red triangles, for example). And that is how the perceptron algorithm classifies.
Let’s look at the Unit Step Function graphically. We can see that for z ≥ 0, g(z) = 1 and for z < 0, g(z) = 0.
Let’s code the step function.

    def step_func(z):
        # Unit step: 1.0 for z >= 0, otherwise 0.0
        return 1.0 if z >= 0 else 0.0