18ECE307J - APPLIED MACHINE LEARNING
Unit 1: Introduction to Machine Learning: Types of Machine Learning - Supervised Learning, Unsupervised Learning, Reinforcement Learning, The Curse of Dimensionality, Bias and Variance, Learning Curve, Classification Error and Noise, Linear Regression, Support Vector Machines, Basics of Neural Networks, Perceptrons, Linear Separability, Introduction to Multilayer Perceptrons
Prepared by Dr. P. Vijayakumar, Associate Professor, ECE, SRM IST
Introduction to Machine Learning
Machine Learning is the science (and art) of programming computers so they can learn from data.
Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. —Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. —Tom Mitchell, 1997
Example: a spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (non-spam, also called "ham") emails.
More on the spam filter
The examples that the spam filter system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy, and it is often used in classification tasks.
Traditional programming and machine learning
With the traditional approach you end up with a long list of complex rules, which is pretty hard to maintain. With a Machine Learning approach, the program is much shorter, easier to maintain, and most likely more accurate.
More on ML
Machine Learning systems can automatically adapt to change, and Machine Learning can also help humans learn. Example: inspecting the list of words and combinations of words that the spam filter believes are the best predictors of spam. Applying ML techniques to dig into large amounts of data can also help discover patterns that were not immediately apparent. This is called data mining.
When to use Machine Learning
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
Types of Machine Learning
Supervised learning: A training set of examples with the correct responses (targets) is provided and, based on this training set, the algorithm generalises to respond correctly to all possible inputs. This is also called learning from exemplars.
Unsupervised learning: Correct responses are not provided; instead the algorithm tries to identify similarities between the inputs so that inputs that have something in common are categorised together. The statistical approach to unsupervised learning is known as density estimation.
Reinforcement learning: This is somewhere between supervised and unsupervised learning. The algorithm gets told when the answer is wrong, but does not get told how to correct it. It has to explore and try out different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes called learning with a critic because of this monitor that scores the answer but does not suggest improvements.
Evolutionary learning: Biological evolution can be seen as a learning process: biological organisms adapt to improve their survival rates and chance of having offspring in their environment. We can model this in a computer using an idea of fitness, which corresponds to a score for how good the current solution is.
Supervised Learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels. A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails. Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression. To train the system, you need to give it many examples of cars, including both their predictors and their labels.
Supervised learning algorithms
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
Unsupervised Learning
In unsupervised learning, as you might guess, the training data is unlabeled.
Unsupervised learning algorithms:
• Clustering
—k-Means
—Hierarchical Cluster Analysis (HCA)
—Expectation Maximization
• Visualization and dimensionality reduction
—Principal Component Analysis (PCA)
—Kernel PCA
—Locally Linear Embedding (LLE)
—t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
—Apriori
—Eclat
Clustering: the algorithm tries to detect groups of similar instances (a minimal k-Means sketch follows below).
Visualization: the input is a lot of complex and unlabeled data; the output is a 2D or 3D representation of your data that can easily be plotted. These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.
Dimensionality reduction: the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one (feature extraction).
Association rule learning: the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
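A minimal clustering sketch using Scikit-Learn's KMeans; the tiny dataset and the choice of two clusters are illustrative assumptions, not taken from the slides:

import numpy as np
from sklearn.cluster import KMeans

# six unlabeled 2-D points; no targets are given to the algorithm
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # cluster index assigned to each instance
print(kmeans.cluster_centers_)  # learned cluster centroids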
Reinforcement Learning
The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
THE MACHINE LEARNING PROCESS
Data Collection and Preparation: Machine learning algorithms need significant amounts of data, preferably without too much noise, but with increased dataset size comes increased computational cost, and the sweet spot at which there is enough data without excessive computational overhead is generally impossible to predict.
Feature Selection: This consists of identifying the features that are most useful for the problem under examination. It invariably requires prior knowledge of the problem and the data; common sense is used to identify some potentially useful features and to exclude others.
Algorithm Choice: Given the dataset, choose an appropriate algorithm.
Parameter and Model Selection: For many of the algorithms there are parameters that have to be set manually, or that require experimentation to identify appropriate values.
Training: Training is the use of computational resources to build a model of the data.
Evaluation: Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it was not trained on. This can often include a comparison with human experts in the field, and the selection of appropriate metrics for this comparison.
The Curse of Dimensionality
The essence of the curse is the realisation that as the number of dimensions increases, the volume of the unit hypersphere does not increase with it. The curse of dimensionality applies to our machine learning algorithms because as the number of input dimensions gets larger, we need more data to enable the algorithm to generalize sufficiently well. ML algorithms try to separate data into classes based on the features; therefore, as the number of features increases, so does the number of datapoints we need. For this reason, we often have to be careful about what information we give to the algorithm, meaning that we need to understand something about the data in advance. The short computation below illustrates the point about the hypersphere volume.
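A small sketch that computes the volume of the unit hypersphere, V_d = π^(d/2) / Γ(d/2 + 1), for increasing dimension d; the volume peaks around d = 5 and then shrinks towards zero (the formula is standard and not taken from the slides):

import math

for d in (1, 2, 3, 5, 10, 20, 50):
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    print(f"d = {d:2d}   volume of unit hypersphere = {volume:.6f}")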
Bias and Variance - bulls-eye diagram
Bias is the difference between the average prediction of our model and the correct value that we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model; it always leads to high error on training and test data.
Variance is the variability of model prediction for a given data point, which tells us the spread of our model's predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
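For a squared-error loss, these two quantities appear in the standard decomposition of the expected test error (stated here for reference; it was not written out on the original slide):
Expected error = Bias² + Variance + irreducible error, i.e. E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ², where σ² is the noise in the data that no model can remove.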
Underfitting and Overfitting
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Such models are too simple to capture the complex patterns in the data (e.g., linear and logistic regression).
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model too much on a noisy dataset. These models have low bias and high variance. They tend to be very complex models, like decision trees, which are prone to overfitting.
Learning Curve
A learning curve is a graph that compares the performance of a model on training and testing data over a varying number of training instances. We should generally see performance improve as the number of training points increases. When we separate training and testing sets and graph them individually, we can get an idea of how well the model can generalize to new data.
A learning curve (or training curve) plots the optimal value of a model's loss function on the training set against the same loss function evaluated on a validation data set, using the same parameters that produced the optimal function. It is a tool to find out how much a machine learning model benefits from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low as the size of the training set increases, the model will not benefit much from more training data.
The curve is useful for many purposes, including comparing different algorithms, choosing model parameters during design, adjusting optimization to improve convergence, and determining the amount of data used for training.
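A minimal sketch of computing and plotting a learning curve with Scikit-Learn's learning_curve helper; the choice of dataset (iris), estimator (LogisticRegression) and 5-fold cross-validation are assumptions made for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, shuffle=True, random_state=42)

# average the scores over the cross-validation folds and plot both curves
plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, valid_scores.mean(axis=1), label="validation score")
plt.xlabel("number of training instances")
plt.ylabel("score")
plt.legend()
plt.show()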
Underfit Learning Curves
An underfit learning curve may show that the model does not have a suitable capacity for the complexity of the dataset, or that the model is capable of further learning and possible further improvement and that the training process was halted prematurely.
Overfit Learning Curves
The plot of training loss continues to decrease with experience. The plot of validation loss decreases to a point and begins increasing again. The inflection point in validation loss may be the point at which training could be halted, as experience after that point shows the dynamics of overfitting.
Good Fit Learning Curves
The plot of training loss decreases to a point of stability. The plot of validation loss decreases to a point of stability and has a small gap with the training loss. Continued training of a good fit will likely lead to an overfit.
Classification Error and Noise
For binary classification problems, there are two primary types of errors:
Type 1 errors (false positives): rejection of a true null hypothesis.
Type 2 errors (false negatives): non-rejection of a false null hypothesis.
A true positive is an observation correctly put into class 1, while a false positive is an observation incorrectly put into class 1.
The Confusion Matrix
The confusion matrix is a nice simple idea: make a square matrix that contains all the possible classes in both the horizontal and vertical directions. List the classes along the top of the table as the predicted outputs, and down the left-hand side as the targets.
Measures
The values below use an example confusion matrix with 165 instances (TP = 100, FN = 5, FP = 10, TN = 50); a short computation follows after the list.
True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95. Also known as "Sensitivity" or "Recall".
False Positive Rate: When it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17.
True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83. Equivalent to 1 minus the False Positive Rate; also known as "Specificity".
Precision: When it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91.
Prevalence: How often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64.
Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91.
Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09. Equivalent to 1 minus Accuracy; also known as "Error Rate".
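The same measures can be reproduced with a few lines of arithmetic from the confusion-matrix counts used above (TP = 100, FN = 5, FP = 10, TN = 50):

TP, FN, FP, TN = 100, 5, 10, 50
total = TP + FN + FP + TN                     # 165

accuracy            = (TP + TN) / total       # 0.91
misclassification   = (FP + FN) / total       # 0.09
true_positive_rate  = TP / (TP + FN)          # 0.95  (sensitivity / recall)
false_positive_rate = FP / (FP + TN)          # 0.17
true_negative_rate  = TN / (FP + TN)          # 0.83  (specificity)
precision           = TP / (TP + FP)          # 0.91
prevalence          = (TP + FN) / total       # 0.64

print(accuracy, precision, true_positive_rate, false_positive_rate)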
Linear Regression
In linear regression we fit a line to the data. This differs from classification problems, where we find a line that separates out the classes so that they can be distinguished. We can, however, turn classification problems into regression problems. This can be done in two ways: first, by introducing an indicator variable, which simply says which class each datapoint belongs to; the problem is now to use the data to predict the indicator variable, which is a regression problem. The second approach is to do repeated regression, once for each class, with the indicator value being 1 for examples in the class and 0 for all of the others.
Regression means making a prediction about an unknown value y (such as the indicator variable for classes or a future value of some data) by computing some function of known values xi. With a straight-line model, the output y is a sum of the xi values, each multiplied by a constant parameter βi: y = Σi βi xi.
Defining the Line
We try to minimize the distance between each datapoint and the line that we fit. We can measure the distance between a point and a line by defining another line that goes through the point and hits the line. We can then minimize an error function that measures the sum of all these distances. Minimizing the sum-of-squares of the errors is known as least-squares optimization: we choose the parameters β in order to minimize the squared difference between the prediction and the actual data value, summed over all of the datapoints. Writing the targets as a vector t and the inputs as a matrix X, the sum-of-squares error is (t − Xβ)ᵀ(t − Xβ); differentiating with respect to β and equating to 0 gives β = (XᵀX)⁻¹Xᵀt. Given an input vector Z, the prediction is Zβ.
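A minimal least-squares sketch in NumPy; the small dataset is made up purely for illustration, and a column of ones is appended so the fitted line can have an intercept:

import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # inputs
t = np.array([2.1, 3.9, 6.2, 8.1])           # targets
Xb = np.concatenate((X, np.ones((X.shape[0], 1))), axis=1)

# beta = (X^T X)^(-1) X^T t; pinv is used for numerical robustness
beta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ t
print(beta)                                  # slope and intercept of the fitted line

Z = np.array([[5.0, 1.0]])                   # a new input (with the bias column appended)
print(Z @ beta)                              # prediction Z beta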
Support Vector Machines
A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.
In the example figure, the two classes can clearly be separated with a straight line (they are linearly separable). The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the decision boundary of an SVM classifier not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification.
Sensitivity to feature scales and soft margin classification
SVMs are sensitive to the feature scales: in the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using Scikit-Learn's StandardScaler), the decision boundary looks much better (right plot).
If we strictly impose that all instances be off the street and on the right side, this is called hard margin classification. There are two main issues with hard margin classification: first, it only works if the data is linearly separable, and second, it is quite sensitive to outliers. To avoid these issues it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.
Decision Function and Predictions
The linear SVM classifier model predicts the class of a new instance x by computing the decision function wᵀ·x + b = w1x1 + ⋯ + wnxn + b: if the result is positive, the predicted class ŷ is the positive class (1); otherwise it is the negative class (0).
Here the decision function is a two-dimensional plane, since this dataset has two features (petal width and petal length). The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes, which is a straight line (represented by the thick solid line). The dashed lines represent the points where the decision function is equal to 1 or -1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
Training Objective
Consider the slope of the decision function: it is equal to the norm of the weight vector, ∥w∥. If we divide this slope by 2, the points where the decision function is equal to ±1 are going to be twice as far away from the decision boundary. In other words, dividing the slope by 2 will multiply the margin by 2. The smaller the weight vector w, the larger the margin. So we want to minimize ∥w∥ to get a large margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be greater than 1 for all positive training instances, and lower than -1 for negative training instances. If we define t(i) = -1 for negative instances (y(i) = 0) and t(i) = 1 for positive instances (y(i) = 1), then we can express this constraint as t(i)(wᵀ·x(i) + b) ≥ 1 for all instances. We can therefore express the hard margin linear SVM classifier objective as the constrained optimization problem below.
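In its standard form (written out here since the original slide showed it as a figure), the hard margin objective is:
minimize over w, b:   ½ wᵀ·w
subject to:           t(i)(wᵀ·x(i) + b) ≥ 1   for i = 1, 2, …, m.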
Soft margin objective and optimization problem
To get the soft margin objective, we need to introduce a slack variable ζ(i) ≥ 0 for each instance: ζ(i) measures how much the i-th instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wᵀ·w as small as possible to increase the margin. This is where the C hyperparameter comes in: it allows us to define the trade-off between these two objectives. This gives us the constrained optimization problem written out after the code example below.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)   # X, y: training features and labels (the pipeline scales X itself)
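The soft margin optimization problem referred to above, in its standard form (again written out here because the original slide showed it as a figure):
minimize over w, b, ζ:   ½ wᵀ·w + C Σi ζ(i)
subject to:              t(i)(wᵀ·x(i) + b) ≥ 1 − ζ(i)  and  ζ(i) ≥ 0   for i = 1, 2, …, m.
A smaller C widens the street but allows more margin violations; a larger C penalizes violations more heavily.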
Nonlinear SVM Classification
Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this can result in a linearly separable dataset. Consider the left plot in the figure: it represents a simple dataset with just one feature x1. This dataset is not linearly separable, as you can see. But if you add a second feature x2 = (x1)², the resulting 2D dataset is perfectly linearly separable.
Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.
Kernel trick
The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features, since you don't actually add any features. On the right is another SVM classifier using a 10th-degree polynomial kernel. If your model is overfitting, you might want to reduce the polynomial degree; conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials. A common approach to finding the right hyperparameter values is to use grid search. It is often faster to first do a very coarse grid search, then a finer grid search around the best values found. Having a good sense of what each hyperparameter actually does can also help you search in the right part of the hyperparameter space.
The code below implements the polynomial-features approach from the previous slide:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)
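For comparison, a sketch of an actual kernelized classifier in Scikit-Learn: SVC with a polynomial kernel obtains the effect of high-degree polynomial features without creating them. The degree, coef0 and C values here are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
# poly_kernel_svm_clf.fit(X, y)   # X, y: the training set used in the earlier examples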
Adding Similarity Features
Another technique to tackle nonlinear problems is to add features computed using a similarity function that measures how much each instance resembles a particular landmark. For example, let's take the one-dimensional dataset discussed earlier and add two landmarks to it at x1 = -2 and x1 = 1 (see the left plot in the figure). Let's define the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3. It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark). Now we are ready to compute the new features. For example, consider the instance x1 = -1: it is located at a distance of 1 from the first landmark and 2 from the second landmark. Therefore its new features are x2 = exp(-0.3 × 1²) ≈ 0.74 and x3 = exp(-0.3 × 2²) ≈ 0.30. The plot on the right of the figure shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.
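The two feature values above can be checked with a couple of lines (the landmark positions and γ = 0.3 come from the slide):

import numpy as np

def gaussian_rbf(x, landmark, gamma=0.3):
    return np.exp(-gamma * (x - landmark) ** 2)

x1 = -1.0
x2 = gaussian_rbf(x1, landmark=-2.0)   # exp(-0.3 * 1^2) ≈ 0.74
x3 = gaussian_rbf(x1, landmark=1.0)    # exp(-0.3 * 2^2) ≈ 0.30
print(x2, x3)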
Gaussian RBF Kernel
Just like the polynomial features method, the similarity features method can be useful with any Machine Learning algorithm, but it may be computationally expensive to compute all the additional features, especially on large training sets. However, once again the kernel trick does its SVM magic: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.
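A sketch of an SVM classifier with the Gaussian RBF kernel in Scikit-Learn; the gamma and C values are illustrative assumptions (increasing gamma makes the bell curve narrower and the decision boundary more irregular, so reduce it if the model overfits and increase it if it underfits):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
# rbf_kernel_svm_clf.fit(X, y)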
SVM Regression
As mentioned earlier, the SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter ϵ. The figure shows two linear SVM Regression models trained on some random linear data, one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5). You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression.
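A minimal linear SVM Regression sketch with Scikit-Learn's LinearSVR; the epsilon value matches the large-margin example mentioned above:

from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
# svm_reg.fit(X, y)   # X, y: some (scaled) training data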
Basics of neural networks
Perceptrons
LINEAR SEPARABILITY
Introduction to Multilayer Perceptrons