This PowerPoint presentation explains ensemble methods in machine learning.
Slide Content
Ensemble methods
Why do we use ensemble methods? Ensemble learning helps improve machine learning results by combining several models, which generally gives better predictive performance than any single model. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Groups: Sequential ensemble methods, where the base learners are generated sequentially (e.g. AdaBoost), and parallel ensemble methods, where the base learners are generated in parallel (e.g. Random Forest).
Bagging: Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates.
…..Working with bagging: Consider a dataset D with many rows and columns and base learners M1, M2, …, Mn. Each model receives its own sample D', D'', etc.: if D has n records, we draw a sample of n records with replacement (row sampling with replacement), so the same record can appear in more than one sample. For example, model M1 may receive records (A, B) while model M2 receives (B, C), where B is repeated. After training, new test data is given to every model to predict. Here we consider a binary classifier. (Diagram: dataset D feeding samples D', D'', D''' into models M1, M2, …, Mn.)
…..Solving with test data in bagging: When the new test data is passed through, each model outputs 1 or 0 since we consider a binary classifier. The voting classifier takes the majority class (here 1) as the final output; drawing the samples is the bootstrapping step and combining the votes is the aggregation step. (Diagram: dataset D, samples D', D'', D''' into models M1, M2, …, Mn, each voting 1.)
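The workflow on these two slides can be sketched in a few lines of Python. This is a minimal illustration rather than the presentation's own code: it assumes scikit-learn's DecisionTreeClassifier as the base learner, draws the bootstrap samples with NumPy, and combines the model outputs by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification dataset D (rows and columns), purely illustrative
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
models = []

# Bootstrapping: each base learner Mi is trained on a row sample D' drawn
# with replacement from D, so a record can appear in several samples
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))          # row sampling with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: every model votes 0 or 1 on the new test data,
# and the majority vote is taken as the final output
X_test = X[:5]
votes = np.array([m.predict(X_test) for m in models])   # shape (n_models, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)        # majority of the 0/1 votes
print(y_pred)
```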
Random Forest: One of the techniques used for bagging is the random forest. Random Forest models decide where to split based on a random selection of features. Rather than splitting on similar features at every node, each tree splits on different features, which adds a level of differentiation between the trees.
…..Working with Random Forest: Consider a dataset D with many rows and columns and base learners that are decision trees DT1, DT2, …, DTn. Each tree receives its own sample D', D'', etc.: we draw records with replacement (row sampling, rs) together with a random sample of the features (feature sampling, fs). Each tree is trained on its own subset of rows and columns; similarly for DT2 and the rest. After training, new test data is passed to every tree to predict. Again we consider a binary classifier. (Diagram: dataset D feeding rs+fs samples D', D'', D''' into trees DT1, DT2, …, DTn.)
…..Working with Random Forest (test data): When the new test data is passed through, each tree outputs 1 or 0 since we consider a binary classifier. The voting classifier takes the majority class (here 1) as the final output. (Diagram: dataset D, rs+fs samples into trees DT1, DT2, …, DTn, each voting 1.)
Why do we use Random Forest instead of a single decision tree? A decision tree trained to its complete depth has low bias (it fits the training data well, so the training error is small) but high variance (it gives a large error on new test data). In a random forest we use multiple decision trees: each one still has high variance, but when we combine them all with a majority vote the high variance is reduced to low variance. Because each tree is given its own sample of rows (and features), it tends to become an expert on its portion of the data, and combining these experts converts high variance into low variance.
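As a quick library-level sketch of the same idea, scikit-learn's RandomForestClassifier performs the row sampling (bootstrap) and the per-split feature sampling (max_features) internally; the parameter names below come from scikit-learn, not from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap row sample (bootstrap=True)
# and considers only a random subset of features at every split (max_features),
# so the individual high-variance trees are decorrelated before the majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```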
Advantage with an example: Suppose we have 1000 records split across the decision trees. If 200 new records arrive, they do not create a great impact on any single tree, so the forest as a whole stays stable. (Diagram: same rs+fs structure as above.)
Regressor: If we use regression instead of a binary classifier, the same method is used but the forest takes the average of the tree outputs; for example, trees predicting 2000, 1000 and 3000 average to 2000. Important note: a classifier is solved by majority vote, a regressor by the average of the outputs.
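The note above can be shown directly with the slide's illustrative numbers: a voting classifier takes the most common class, while a regressor averages the tree outputs 2000, 1000 and 3000 to get 2000.

```python
import numpy as np

# Outputs of the individual trees for one test record
class_votes = np.array([1, 1, 0, 1])          # binary classifier: majority vote
reg_outputs = np.array([2000, 1000, 3000])    # regressor: average of the tree outputs

majority = np.bincount(class_votes).argmax()  # -> 1
average = reg_outputs.mean()                  # -> 2000.0
print(majority, average)
```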
Boosting: Boosting refers to a family of algorithms that convert weak learners into strong learners. The predictions are then combined through a weighted majority vote (classification) or a weighted sum (regression) to produce the final prediction.
Working with boosting: Consider a dataset with records and base learners M1, M2, …, Mn. A sample of the data is passed to the first base learner and the model is trained; after training we pass the records through it again to see how well that particular model performed. (Diagram: dataset records feeding M1, M2, …, Mn in sequence.)
.. cont: The records are passed to model M1; suppose two records (marked in red) are incorrectly classified. The next model M2 is created sequentially and those two records are passed on to it. If M2 also classifies some records wrongly, its errors are passed on to M3, and so on, until the specified number of learners is reached and the combination is strong enough. In this way the boosting technique turns weak learners into a strong learner. (Diagram: M1 → M2 → … → Mn.)
ADABOOST or ADAptive BOOSTing: AdaBoost is a little different: weights are assigned to the records. Suppose we have feature columns and an output column. There are 5 steps: Step 1 - calculate the sample weights; Step 2 - create the 1st base learner; Step 3 - find the performance of the stump; Step 4 - update the sample weights; Step 5 - calculate the normalized weights.
Step 1 - Calculate sample weight: sample weight w = 1/n; here n = 7 (since there are 7 records), so w = 1/7. Every record starts with the same sample weight of 1/7. (Table: 7 records with feature columns f1, f2, f3, an output column O/P, and a sample weight of 1/7 each.)
Step 2 - Create the 1st base learner: In AdaBoost the base learners are built with the help of decision stumps rather than the full decision trees used in a random forest (a stump is a decision tree of depth 1). We consider three stumps, one per feature: a stump on f1, then stumps on f2 and f3.
… From the candidate stumps we select the base learner model: compare the stumps on f1, f2 and f3 and pick the best one; the output is a binary classification (Yes/No). Suppose the stump on f1 is selected and it classifies 6 records correctly and 1 record incorrectly (marked in red). The total error is the sum of the sample weights of the misclassified records: since only one record is wrong, Total error TE = 1/7 + 0 = 1/7. (Table: O/P column Yes, No, Yes, Yes, No, No, Yes; each record has sample weight 1/7.)
Step 3 - Find the performance of the stump: it is calculated as performance of the stump = 1/2 * ln((1 - TE) / TE), where TE is the total error. Here performance = 1/2 * ln((1 - 1/7) / (1/7)) = 1/2 * ln(6) ≈ 0.896.
Step 4 - Update the sample weights: For an incorrectly classified record, new sample weight = weight * e^performance = (1/7) * e^0.896 ≈ 0.349. For a correctly classified record, new sample weight = weight * e^(-performance) = (1/7) * e^(-0.896) ≈ 0.058, rounded to 0.05 in the table. (Table: updated sample weight 0.05 for the six correctly classified records and 0.349 for the misclassified record 3.)
Step 5 - Calculate the normalized weights: The updated weights no longer sum to 1 (they sum to about 0.68 in the slide's table), so we normalize them: normalized weight = updated sample weight / Σ(updated sample weights) for each record. (Table: Σ of updated weights ≈ 0.68; normalized weight ≈ 0.07 for each correctly classified record and ≈ 0.513 for the misclassified record 3.)
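The five steps above can be reproduced in a few lines; this is only a worked check of the arithmetic, and the exact values differ slightly from the slide's tables because the slides round intermediate results (e.g. (1/7)·e^(-0.896) ≈ 0.058 is shown as 0.05).

```python
import numpy as np

n = 7
w = np.full(n, 1 / n)                 # Step 1: every record starts at 1/7
misclassified = np.array([False, False, True, False, False, False, False])

total_error = w[misclassified].sum()                    # TE = 1/7
alpha = 0.5 * np.log((1 - total_error) / total_error)   # Step 3: performance of the stump ≈ 0.896

# Step 4: up-weight the wrongly classified record, down-weight the correct ones
w_new = np.where(misclassified, w * np.exp(alpha), w * np.exp(-alpha))
# wrong record  -> (1/7) * e^ 0.896 ≈ 0.35
# right records -> (1/7) * e^-0.896 ≈ 0.058 (rounded to 0.05 on the slide)

# Step 5: normalize so the weights again sum to 1
w_norm = w_new / w_new.sum()
print(alpha, w_new.round(3), w_norm.round(3))
```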
… Based on the normalized weights we divide the 0-to-1 range into buckets whose widths equal the normalized weights (their cumulative sums), so the misclassified record gets by far the widest bucket: Bucket 1: 0 to 0.07; Bucket 2: 0.07 to 0.14; Bucket 3: 0.14 to 0.653 (the misclassified record, normalized weight 0.513); Bucket 4: 0.653 to 0.723; Bucket 5: 0.723 to 0.793; Bucket 6: 0.793 to 0.863; Bucket 7: 0.863 to 0.933. We then start a new, initially empty dataset with the same columns f1, f2, f3 and O/P.
….. The AdaBoost algorithm then runs one iteration per row of the new dataset, each time drawing a random value between 0 and 1 and checking which bucket it falls into. If the 1st iteration draws 0.43, it falls into Bucket 3 (0.14 to 0.653), the wide bucket of the record that was incorrectly classified by the decision stump, so that record is copied from the old dataset into the new dataset.
…. Now the new dataset starts to fill with the first selected record. The 2nd iteration is done like the 1st, another random value is drawn and the corresponding record is added to the new dataset, and so on until the new dataset is complete. The same process is repeated for dataset 2 and dataset 3, and it continues through all the sequential decision stumps; because the misclassified records own the widest buckets they are selected most often, the later stumps concentrate on them, and the error keeps decreasing.
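The bucket selection described above is simply weighted sampling: the cumulative normalized weights define the bucket edges, and each uniform random draw is looked up in them. A minimal NumPy sketch (not the presentation's code):

```python
import numpy as np

# Normalized weights from the previous step (record 3 was misclassified)
w_norm = np.array([0.07, 0.07, 0.513, 0.07, 0.07, 0.07, 0.07])
w_norm = w_norm / w_norm.sum()          # make the buckets cover exactly 0..1

edges = np.cumsum(w_norm)               # upper edge of each record's bucket
rng = np.random.default_rng(0)

# Draw one uniform value per row of the new dataset; each draw falls into
# some record's bucket. A draw like 0.43 lands in record 3's wide bucket,
# so the misclassified record tends to be copied into the new dataset
# several times.
draws = rng.random(len(w_norm))
new_dataset_rows = np.searchsorted(edges, draws)   # indices of the selected records
print(new_dataset_rows)
```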
AdaBoost with test data: A test record is passed through all the stumps (f1, f2, f3), and each stump outputs 1 or 0. The (performance-weighted) majority vote here is 1, so the output becomes 1. Thus we combine weak learners and make a strong learner.
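For completeness, this whole weak-learner-to-strong-learner loop is available off the shelf; the sketch below uses scikit-learn's AdaBoostClassifier, whose default base learner is exactly a depth-1 decision stump (the dataset and parameters are illustrative, not from the slides).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 stumps trained sequentially; each one focuses on the records the previous
# stumps got wrong, and the final prediction is a performance-weighted vote.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))
```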
Stacking: Stacking uses heterogeneous base learners (mixing strong and weak learners, e.g. logistic regression, SVM and a neural network), whereas the other methods use homogeneous base learners. Meta model: how does stacking work with a meta model? Suppose we have 100 records: typically 75% is used for training and 25% for testing. Within the training part an 80% / 20% split is made: the base models (logistic regression, SVM, neural network) are trained on the 80% portion and then used to predict on the remaining 20%, and those predictions are what we take forward to the next stage.
Working through an example: Take a record from the 20% portion, say X1 = 10, X2 = 12, X3 = 13 with its target. All the base models (logistic regression, SVM, neural network) predict this record, so we get three predictions per record. The table of base-model predictions together with the true target becomes the training set, i.e. the input to the meta model, and the meta model's prediction is the final output. Out of 100% of the training data, 80% is kept for the base models and 20% for the meta model; this approach is called blending.
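A minimal sketch of the blending workflow just described, assuming scikit-learn models for the three base learners; the split percentages follow the slide and everything else (dataset, model settings) is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# 75% / 25% train-test split, then 80% / 20% inside the training part
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_base, X_hold, y_base, y_hold = train_test_split(X_tr, y_tr, test_size=0.20, random_state=0)

# Heterogeneous base learners trained on the 80% portion
base_models = [LogisticRegression(max_iter=1000),
               SVC(),
               MLPClassifier(max_iter=2000, random_state=0)]
for m in base_models:
    m.fit(X_base, y_base)

# Their predictions on the held-out 20% become the meta model's training features
meta_X = np.column_stack([m.predict(X_hold) for m in base_models])
meta_model = LogisticRegression().fit(meta_X, y_hold)

# At test time the same two-stage pipeline is applied
meta_X_test = np.column_stack([m.predict(X_test) for m in base_models])
print("blended accuracy:", meta_model.score(meta_X_test, y_test))
```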
…Stacking: Instead of a single hold-out split, we can take a k-fold approach on the 75% training data: create k buckets, train the base models on k-1 buckets and generate the meta model's training data on the remaining bucket, rotating through all the buckets. This is stacking. (Diagram: 100 records, 75% train / 25% test, 80% / 20% split producing the meta-model output.)
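The k-fold variant is what scikit-learn's StackingClassifier implements: the meta model's training features are generated with cross-validated predictions instead of a single hold-out bucket. A brief sketch using the library's API (models and cv value are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("svm", SVC()),
                ("nn", MLPClassifier(max_iter=2000, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,   # k buckets: each base model predicts on the bucket it was not trained on
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```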