DECISION TREE AND PROBABILISTIC MODELS.pptx

About This Presentation

This presentation covers topics such as constructing decision trees and unsupervised learning.


Slide Content

Machine Learning Techniques Dr. M. Lilly Florence, Professor, Adhiyamaan College of Engineering (Autonomous), Hosur, Tamil Nadu

DECISION TREE AND PROBABILISTIC MODELS: Learning with Trees, Decision Trees, Constructing Decision Trees, Classification and Regression Trees, Ensemble Learning (Boosting, Bagging, Different Ways to Combine Classifiers), Probability and Learning (Data into Probabilities, Basic Statistics, Gaussian Mixture Models, Nearest Neighbor Methods), Unsupervised Learning (K-Means Algorithm, Vector Quantization, Self-Organizing Feature Map)

Learning with Trees The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision. Trees are very easy to understand, and can even be turned into a set of if-then rules suitable for use in a rule induction system. In terms of optimisation and search, decision trees use a greedy heuristic to perform search, evaluating the possible options at the current stage of learning and taking the one that seems optimal at that point.

Decision Trees One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that translate into program code very simply. The first part of the tree above can be turned into: if salary > 80000 then check the next condition; ------
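
A minimal sketch of how such a tree becomes nested if-then rules in Python; the salary threshold comes from the slide, while the second feature, its threshold, and the leaf labels are hypothetical:

def classify(salary, commission):
    # first split from the slide: salary > 80000
    if salary > 80000:
        # hypothetical second split and leaf labels, only to show the nesting
        if commission > 10000:
            return "good"
        return "average"
    return "poor"

print(classify(salary=95000, commission=15000))   # -> "good"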

Constructing Decision Trees There are a few different decision tree algorithms, but they all build the tree in a greedy manner starting at the root, choosing the most informative feature at each step. The best known are Quinlan's ID3 (Iterative Dichotomiser 3), its extension, known as C4.5, and another known as CART (Classification And Regression Trees). The idea is to quantify how much information is provided to you by knowing certain facts; encoding this mathematically is the task of information theory.

Constructing Decision Trees A decision tree consists of: Root node: the first node in the decision tree. Nodes: test the value of a certain attribute and split into further sub-nodes. Edges/branches: correspond to the outcome of a test and connect to the next node or leaf. Leaf nodes: terminal nodes that predict the outcome. Types of decision tree: Classification tree: the target variable is a categorical variable. Regression tree: the target variable is a continuous variable.

Constructing Decision Trees Entropy in information theory Information theory was 'born' in 1948 when Claude Shannon published a paper called "A Mathematical Theory of Communication." In that paper, he proposed the measure of information entropy, which describes the amount of impurity in a collection of examples. The entropy H of a set of probabilities pi is H(p) = −Σi pi log2(pi). For our decision tree, the best feature to pick as the one to classify on now is the one that gives us the most information, i.e., the one whose choice most reduces the entropy of the set (the highest information gain). After using that feature, we re-evaluate the remaining features and again pick the one with the highest information gain.
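
A minimal sketch of this entropy measure in Python (the function name is mine):

import math

def entropy(probabilities):
    # H(p) = -sum_i pi * log2(pi); terms with pi = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a maximally impure two-class set
print(entropy([1.0]))        # 0.0: a pure set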

Constructing Decision Trees While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve this, there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques: Information Gain and the Gini Index.

Constructing Decision Trees Entropy in information theory – ID3 The important idea is to work out how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step. This is known as the information gain, and it is defined as the entropy of the whole set minus the weighted entropy of the subsets produced when a particular feature is chosen: Gain(S, F) = Entropy(S) − Σf∈values(F) (|Sf| / |S|) Entropy(Sf). https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8
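
A hedged sketch of information gain for a categorical feature, assuming examples are given as rows of feature values with a parallel list of class labels (the helper names and toy data are mine):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    # Gain(S, F) = Entropy(S) - sum over feature values f of |Sf|/|S| * Entropy(Sf)
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature_index] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

rows = [["sunny"], ["sunny"], ["rain"], ["rain"]]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))   # 1.0: this feature separates the classes perfectly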

Example Entropy The mathematical formula for entropy is H = −Σi pi log2(pi), where pi is simply the frequentist probability of an element/class i in our data. For simplicity's sake, let's say we only have two classes, a positive class and a negative class, so i here could be either (+) or (−). If we had a total of 100 data points in our dataset, with 30 belonging to the positive class and 70 belonging to the negative class, then P+ would be 3/10 and P− would be 7/10. Pretty straightforward.
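
A quick numerical check of this example in Python:

import math

p_pos, p_neg = 0.3, 0.7    # 30 positive and 70 negative points out of 100
H = -(p_pos * math.log2(p_pos) + p_neg * math.log2(p_neg))
print(round(H, 3))         # ~0.881 bits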

Example https://sefiks.com/2017/11/20/a-step-by-step-id3-decision-tree-example/

CART – Classification and Regression Tree Gini Impurity The 'impurity' in the name suggests that the aim of the decision tree is to have each leaf node represent a set of data points that are in the same class, so that there are no mismatches. This is known as purity. If a leaf is pure, then all of the training data within it have just one class.
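
The slides do not show the formula, but the standard CART impurity measure is G = 1 − Σi pi²; a minimal sketch (the function name is mine):

from collections import Counter

def gini_impurity(labels):
    # G = 1 - sum_i pi^2; 0 for a pure leaf, larger for mixed leaves
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes"]))        # 0.0 -> pure leaf
print(gini_impurity(["yes", "no", "yes", "no"]))   # 0.5 -> maximally mixed (two classes)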

Classification Example – CART https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/

Ensemble Learning Ensemble learning is a machine learning technique that combines several base models in order to produce one optimal predictive model. Two main approaches are boosting and bagging.

Ensemble Learning Boosting The principal boosting algorithm is named AdaBoost, which is short for Adaptive Boosting. The algorithm was first described in the mid-1990s by Freund and Schapire, and while many variations have been derived from it, the principal algorithm is still one of the most widely used. It was proposed as an improvement on the original 1990 boosting algorithm, which was rather data hungry.

Ensemble Learning Boosting In that original algorithm, the training set was split into three. A classifier was trained on the first third and then tested on the second third. All of the data that was misclassified during that testing was used to form a new dataset, along with an equally sized random selection of the data that was correctly classified. A second classifier was trained on this new dataset, and then both classifiers were tested on the final third of the dataset. If they both produced the same output, that datapoint was ignored; otherwise the datapoint was added to yet another new dataset, which formed the training set for a third classifier.

Ensemble Learning AdaBoost

Ensemble Learning AdaBoost Algorithm – https://www.mygreatlearning.com/blog/adaboost-algorithm/
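
A hedged usage sketch with scikit-learn's AdaBoostClassifier, assuming scikit-learn is available; the toy dataset and parameters are arbitrary:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# AdaBoost fits a sequence of weak learners, reweighting the training points so that
# later learners concentrate on the examples that earlier ones misclassified.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))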

Ensemble Learning Bagging The simplest method of combining classifiers is known as bagging, which stands for bootstrap aggregating, the statistical description of the method. Having taken a set of bootstrap samples, the bagging method simply requires that we fit a model to each dataset, and then combine them by taking the output to be the majority vote of all the classifiers. Subbagging is a combination of 'subsample' and 'bagging', and is the fairly obvious idea that you don't need to produce samples that are the same size as the original data.
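
A minimal bagging sketch, assuming NumPy and scikit-learn decision trees; for subbagging the bootstrap samples would simply be drawn smaller than the original dataset:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.array([tree.predict(X) for tree in trees])   # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)        # majority vote for 0/1 labels
print("ensemble training accuracy:", (majority == y).mean())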

Ensemble Learning Different ways to combine classifiers: the Mixture of Experts algorithm

Ensemble Learning Different ways to combine classifiers: the Mixture of Experts algorithm

Probability and Learning Data into Probabilities Prior knowledge: the conditional probability of C1 given that x has value X is P(C1 | X). The conditional probability tells us how likely it is that the class is C1 given that the value of x is X. We can also compute P(Ci, Xj), the joint probability, which tells us how often a measurement of Ci fell into histogram bin Xj. Bayes' rule joins the conditional and joint probabilities: P(Ci | x) = P(x | Ci) P(Ci) / P(x), where x is a vector of feature values instead of just one feature. Choosing the class with the largest posterior probability is known as the maximum a posteriori (MAP) hypothesis, and it gives us a way to choose which class to output.
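
A minimal MAP sketch; the priors and likelihoods below are made-up numbers, only to show how Bayes' rule picks the most probable class:

# Hypothetical values: P(Ci) and P(X | Ci) for one observed feature value X
priors = {"C1": 0.6, "C2": 0.4}
likelihoods = {"C1": 0.2, "C2": 0.5}

evidence = sum(priors[c] * likelihoods[c] for c in priors)             # P(X)
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
map_class = max(posteriors, key=posteriors.get)
print(posteriors, "-> MAP hypothesis:", map_class)                     # C2 wins here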

Basic Statistics There are two types of statistics. Descriptive statistics: understanding, analyzing, and summarizing the data in the form of numbers and graphs. Inferential statistics: extracting a sample of data from the population data and inferring something about the population from that sample. Types of data: Numerical data. I) Discrete numerical variables: variables that take a countable set of values, for example rank in the classroom or the number of faculty members in the department. II) Continuous numerical variables: variables that can take any value within a range, for example the salary of an employee. Categorical data: categories, strings, or character-type data such as names and colors; these are also of two types. I) Ordinal variables: categorical variables whose values can be ranked, like the grade of a student (A, B, C) or high, medium, and low. II) Nominal variables: variables that cannot be ranked and simply contain names or categories, like color names, subjects, etc.

Basic Statistics Averages Mean: the average. Median: the middle value. Mode: found by counting how many times each element appears and picking the most frequent one. Variance and covariance Variance is a measure of how data points differ from the mean; the variance of a set of numbers measures how spread out the values are. It is computed as the average of the squared distances between each element in the set and the expected value of the set (the mean, μ): σ² = (1/N) Σi (xi − μ)².

Variance Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7. Solution: Step 1: Compute the mean of the 10 values given. Mean = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8. Step 2: Make a table with three columns: one for the X values, the second for the deviations (X − μ), and the third for the squared deviations. As the data is not given as sample data, we use the formula for population variance, so the mean is denoted by μ. Step 3: Sum the squared deviations and divide by the number of values: variance = 73.6 / 10 = 7.36.
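
The same computation checked with NumPy (population variance, dividing by N):

import numpy as np

x = np.array([3, 8, 6, 10, 12, 9, 11, 10, 12, 7])
mu = x.mean()                        # 8.8
variance = ((x - mu) ** 2).mean()    # same as np.var(x), which uses ddof=0 by default
print(mu, variance)                  # 8.8 7.36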

Basic Statistics Covariance is a measure of how dependent two variables are (in the statistical sense). It is computed by cov(x, y) = E[(x − μx)(y − μy)], i.e. the average of the products of each variable's deviations from its mean. The covariance can be used to look at the correlation between all pairs of variables within a set of data; we compute the covariance of each pair, and these are then put together into what is imaginatively known as the covariance matrix.

Basic Statistics Covariance Matrix

Covariance Example For example, to calculate the covariance between two stocks, assume you have the stock prices for a period of four days and use the formula:
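
A hedged sketch with hypothetical four-day closing prices (the numbers are not from the slides); note that np.cov divides by n − 1 (sample covariance) and returns the full covariance matrix:

import numpy as np

stock_a = np.array([10.0, 10.5, 10.2, 10.8])
stock_b = np.array([20.0, 20.8, 20.3, 21.1])

# covariance matrix: [[var(a), cov(a, b)], [cov(b, a), var(b)]]
cov_matrix = np.cov(stock_a, stock_b)
print(cov_matrix)
print("cov(a, b):", cov_matrix[0, 1])   # positive -> the two stocks tend to move together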

The Gaussian The probability distribution that is most well known (indeed, the only one that many people know, or even need to know) is the Gaussian or normal distribution.

Probability and Learning Gaussian Mixture Model The Gaussian distribution, also known as the normal distribution, is a continuous probability distribution that is widely used in statistical modeling and machine learning. It is a bell-shaped curve that is symmetrical around its mean and is characterized by its mean and standard deviation. A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Suppose that the different classes each come from their own Gaussian distribution. This is known as multi-modal data, since there is one distribution (mode) for each different class. If we know how many classes are in the data, then we can try to estimate the parameters for that many Gaussians, all at once.
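
A hedged usage sketch with scikit-learn's GaussianMixture (not part of the slides; assumes scikit-learn), fitting two Gaussians to bimodal 1-D data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("means:", gmm.means_.ravel())     # roughly 0 and 5
print("weights:", gmm.weights_)         # roughly 0.5 each
print("labels:", gmm.predict(data[:5]))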

https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/ https://www.mygreatlearning.com/blog/gaussian-mixture-model/

In a one-dimensional space, the probability density function of a Gaussian distribution is given by: f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)).
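
The same density written directly in Python, as a small check of the formula:

import math

def gaussian_pdf(x, mu, sigma):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.3989, the peak of the standard normal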

Expectation Maximization Algorithm Expectation-Maximization (EM) is a statistical algorithm for finding the right model parameters. We typically use EM when the data has missing values, or in other words, when the data is incomplete. Given a set of incomplete data, start with a set of initialized parameters. Expectation step (E-step): using the observed data in the dataset, estimate (guess) the values of the missing data; after this step we have complete data with no missing values. Maximization step (M-step): use the complete data prepared in the expectation step to update the parameters. Repeat the E-step and M-step until the parameters converge.
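
A minimal EM sketch for a 1-D mixture of two Gaussians; the toy data, initial values, and fixed iteration count are my own simplifications:

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(6, 1, 100)])

mu = np.array([-1.0, 1.0])          # initialized parameters
sigma = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities play the role of the "missing" cluster labels
    resp = np.stack([w * normal_pdf(x, m, s) for w, m, s in zip(weight, mu, sigma)])
    resp /= resp.sum(axis=0)
    # M-step: re-estimate the parameters from the completed data
    n_k = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / n_k
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / n_k)
    weight = n_k / len(x)

print("means:", mu, "stds:", sigma, "weights:", weight)   # means approach 0 and 6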

Expectation Maximization Algorithm Expectation-Maximization (EM) is a statistical algorithm for finding the right model parameters. We typically use EM when the data has missing values, or in other words, when the data is incomplete.

Broadly, the Expectation-Maximization algorithm has two steps: E-step: the available data is used to estimate (guess) the values of the missing variables. M-step: based on the estimated values generated in the E-step, the complete data is used to update the parameters.

Nearest Neighbor Method The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics: calculating the distance between points on a graph.

Nearest Neighbor Method The KNN Algorithm: 1. Load the data. 2. Initialize K to your chosen number of neighbors. 3. For each example in the data: 3.1 calculate the distance between the query example and the current example from the data; 3.2 add the distance and the index of the example to an ordered collection. 4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances. 5. Pick the first K entries from the sorted collection. 6. Get the labels of the selected K entries. 7. If regression, return the mean of the K labels; if classification, return the mode of the K labels.
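
A minimal sketch of these steps for the classification case; the helper name and the toy data are mine:

import math
from collections import Counter

def knn_classify(train_X, train_y, query, k):
    # steps 3.1-3.2: distance from the query to every training example, kept with its index
    distances = sorted((math.dist(query, x), i) for i, x in enumerate(train_X))  # step 4
    top_k_labels = [train_y[i] for _, i in distances[:k]]                        # steps 5-6
    return Counter(top_k_labels).most_common(1)[0][0]                            # step 7: mode

train_X = [(1, 1), (1, 2), (6, 6), (7, 7)]
train_y = ["red", "red", "blue", "blue"]
print(knn_classify(train_X, train_y, query=(2, 1), k=3))   # -> "red"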

Nearest Neighbor Method

Nearest Neighbor Method https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/

Nearest Neighbor Method https :// towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761 https:// people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html

Unsupervised Learning Unsupervised learning is the part of machine learning where no response variable is present to guide the learning process; the algorithms analyse the data themselves to identify trends. In contrast to supervised learning, where the existing data is already labelled and you know which behaviour you want to recognise in new datasets, unsupervised learning has no labelled dataset, and the algorithms are there to explore relationships and patterns in the data.

K means Algorithm The k-means algorithm searches for a predetermined number of clusters in an unlabelled multidimensional dataset, based on a simple interpretation of what an optimal cluster looks like. The concept rests on two ideas: first, the cluster centre is the arithmetic mean (AM) of all the data points associated with the cluster; second, each point is assigned to the cluster centre it is closest to, in comparison with the other cluster centres. These two interpretations are the foundation of the k-means clustering model.

K means Algorithm

K means Algorithm K-Means Clustering Algorithm Let's say we have x1, x2, x3, ..., x(n) as our inputs, and we want to split this into K clusters. The steps to form clusters are: Step 1: Choose K random points as cluster centers, called centroids. Step 2: Assign each x(i) to the closest cluster by computing the Euclidean distance to each centroid. Step 3: Identify new centroids by taking the average of the points assigned to each cluster. Step 4: Keep repeating steps 2 and 3 until convergence is achieved. Let's take a detailed look at each of these steps.
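
A minimal NumPy sketch of these four steps; the toy data and K are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
K = 2

centroids = X[rng.choice(len(X), size=K, replace=False)]        # step 1: K random centroids
for _ in range(100):
    # step 2: assign each point to its closest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 3: new centroids are the means of the assigned points (keep the old one if a cluster empties)
    new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    if np.allclose(new_centroids, centroids):                    # step 4: stop at convergence
        break
    centroids = new_centroids

print(centroids)   # roughly [0, 0] and [5, 5]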

Vector Quantization For the VQ process, we require a codebook which contains a number of codewords. Applying VQ to a data point (gray dots) means mapping it to the closest codeword (blue dots), i.e. replacing the value of the data point with the closest codeword value. Vector quantization has been used in a wide range of applications for speech, image, and video data, such as image generation [2], speech and audio coding [3], voice conversion [4,5], music generation [6], and text-to-speech synthesis [7,8]. The k-means algorithm can be used to solve the problem if we know how large we want our codebook to be. However, another algorithm turns out to be more useful: the Self-Organising Feature Map.
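
A minimal VQ sketch with a hypothetical codebook: each point is replaced by its nearest codeword:

import numpy as np

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])     # hypothetical codewords
points = np.array([[0.1, 0.2], [0.9, 1.2], [1.8, 0.4]])       # data points to quantize

# index of the closest codeword for every point (Euclidean distance)
indices = np.argmin(np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2), axis=1)
quantized = codebook[indices]
print(indices)     # [0 1 2]
print(quantized)   # each point replaced by the value of its nearest codeword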

Vector Quantization

Self-Organising Feature Map A self-organizing map (SOM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map, and is therefore a method of dimensionality reduction. Self-organizing maps differ from other artificial neural networks in that they apply competitive learning as opposed to error-correction learning (such as backpropagation with gradient descent), and in that they use a neighborhood function to preserve the topological properties of the input space. The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s and is therefore sometimes called a Kohonen map. To simplify complex problems for straightforward interpretation, the SOM is used for mapping and clustering (or dimensionality reduction), mapping multidimensional data onto a lower-dimensional space. The SOM is made up of two layers: the input layer and the output layer.

Self-Organising Feature Map Algorithm The steps involved are: 1. Weight initialization. 2. For 1 to N epochs: 3. select a training example; 4. compute the winning vector; 5. update the winning vector; 6. repeat steps 3, 4, and 5 for all training examples. 7. Cluster the test samples.

Self-Organising Feature Map The weight update phase is when the weight vector's values are adjusted slightly to match the feature values more closely. Not only is the winning neuron updated, but so are the neighboring units. The following equation determines exactly how to implement the modification: ∆wji = η(t) · Tj,I(x)(t) · (xi − wji)

Self-Organising Feature Map η(t) is the learning rate; it determines the speed at which the weights update. Tj,I(x)(t) is the topological neighborhood, which defines to what extent the units close to the winning neuron are modified. t is the epoch, i indexes the components of the input (and of each weight vector), j is a neuron on the map, and I(x) is the best matching unit, i.e. the winning neuron.

Self-Organising Feature Map The topological neighborhood variable, in turn, depends on the lateral distance between neurons and the neighborhood size (both are calculated by tuning the network's hyperparameters): Tj,I(x)(t) = exp(−S²j,I(x) / (2σ²)), where Sj,I(x) is essentially the Euclidean distance, but in this case between two neurons on the map rather than between a neuron and an input. https://medium.com/machine-learning-researcher/self-organizing-map-som-c296561e2117
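
A minimal sketch of a single SOM weight update using the two equations above; the map size, learning rate, and neighborhood width are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)])   # positions of a 5x5 map
weights = rng.random((25, 3))        # one 3-D weight vector per map unit
eta, sigma = 0.5, 1.0                # learning rate eta(t) and neighborhood width sigma

x = np.array([0.2, 0.7, 0.4])        # one training input
winner = np.argmin(np.linalg.norm(weights - x, axis=1))         # best matching unit I(x)

# T_j,I(x) = exp(-S^2 / (2 sigma^2)), with S the lateral distance on the map grid
lateral_sq = np.sum((grid - grid[winner]) ** 2, axis=1)
T = np.exp(-lateral_sq / (2 * sigma ** 2))

# delta w_ji = eta * T_j,I(x) * (x_i - w_ji): the winner and its neighbors move toward x
weights += eta * T[:, None] * (x - weights)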

Exercise 1.