A brief presentation on the basics of Ensemble Methods, given as a 'Lightning Talk' during the 7th Cohort of General Assembly's Data Science Immersive Course
Size: 4.94 MB
Language: en
Added: Apr 16, 2018
Slides: 14
Slide Content
Ensemble Methods Chris Marker
Ensemble Methods In contrast to traditional modeling methods, which train a single model on a set of data, ensemble methods train multiple models and aggregate their results to achieve superior performance. In most cases a single base learning algorithm is used, which is called a homogeneous ensemble. In some cases it is useful to combine multiple algorithms in a heterogeneous ensemble.
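As a rough illustration (not part of the original slides), a heterogeneous ensemble can be built with scikit-learn's VotingClassifier, which aggregates several different base algorithms by plurality vote; the dataset and estimator choices below are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different base algorithms -> a heterogeneous ensemble
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=5000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # plurality vote over the predicted classes
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))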
The Statistical Issue The hypothesis space is often too large to explore with limited training data, and several different hypotheses may achieve the same accuracy on the training data. By "averaging" the predictions of multiple models, we can often cancel out their individual errors and get closer to the true function we are trying to learn.
The Computational Issue It might be impossible to develop one model that globally optimizes our objective function. For instance, classification and regression trees reach locally optimal solutions, and generalized linear models iterate toward a solution that is not guaranteed to be globally optimal. Starting "local searches" at different points and aggregating the resulting predictions can produce a better result.
The Representational Issue Even with vast data and computing power, it might still be impossible for one model to exactly represent our objective function. For example, a linear regression model can never capture a relationship in which a one-unit change in X produces a different change in y depending on the value of X. All models have their shortcomings, but by creating multiple models and aggregating their predictions it is often possible to get closer to the objective function.
Two Ensemble Paradigms - Parallel The base models are generated in parallel and their results are aggregated: Bagging, Random Forests
Bootstrap Aggregated Decision Trees (Bagging) From the original data of size N, bootstrap K samples, each of size N (with replacement!). Build a classification or regression decision tree on each bootstrapped sample. Make predictions by passing a test observation through all K trees and combining their outputs into one aggregate prediction for that observation. Discrete: we most commonly predict a discrete y by "plurality vote," where the most common class across the trees is the predicted value for a given observation. Continuous: we most commonly predict a continuous y by averaging the K predicted values into one final prediction.
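A hand-rolled sketch of this procedure for a continuous y (my own illustration, not from the talk), assuming NumPy arrays X_train, y_train, and X_test:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(X_train, y_train, X_test, K=50, random_state=0):
    rng = np.random.default_rng(random_state)
    N = len(X_train)
    predictions = []
    for _ in range(K):
        # Bootstrap: draw N row indices with replacement
        idx = rng.integers(0, N, size=N)
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))
    # Continuous y: average the K predictions into one final prediction
    return np.mean(predictions, axis=0)

For a discrete y you would instead take a plurality vote across the K trees; scikit-learn's BaggingClassifier and BaggingRegressor implement the same idea.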
Random Forests Bagging reduces the variance of individual decision trees, but the bagged trees are still highly correlated with each other. By "decorrelating" the trees from one another, we can drastically reduce the variance of our model. Random forests differ from bagged decision trees in only one way: they use a modified tree-learning algorithm that selects, at each split in the learning process, a random subset of the features. This process is sometimes called the random subspace method.
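In scikit-learn terms, that one difference corresponds roughly to the max_features argument; the settings below are illustrative, not the presenter's.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped trees
    max_features="sqrt",  # random subset of features considered at each split
    random_state=42,
)
# forest.fit(X_train, y_train); forest.predict(X_test)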
Two Ensemble Paradigms - Sequential The base models are generated sequentially, and each new model is fit based on the results of the previous ones: AdaBoost, Gradient Boosting
AdaBoost (Adaptive Boosting) Instead of using deep/full decision trees as in bagging, boosting uses shallow/high-bias base estimators. Iterative fitting is used to explain the error/misclassification left unexplained by the previous base models, reducing bias without increasing variance. The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as a single-split tree) on repeatedly modified versions of the data. After each fit, the importance weights on each observation are updated. The predictions are then combined through a weighted majority vote (or sum) to produce the final prediction. AdaBoost, like all boosting ensemble methods, focuses the next model's fit on the misclassifications/weaknesses of the prior models.
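A minimal scikit-learn sketch of this setup; the hyperparameter values are illustrative assumptions rather than the presenter's.

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(
    n_estimators=200,    # number of weak learners fit sequentially
    learning_rate=0.5,   # shrinks each learner's contribution to the weighted vote
    random_state=42,
)
# The default base estimator is a depth-1 decision tree (a "stump"),
# matching the weak learner described above.
# ada.fit(X_train, y_train); ada.predict(X_test)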
Gradient Boosting Gradient boosting is similar to AdaBoost but works by fitting the next model to the residuals of the previous model. For instance: suppose you start with a model F that gives a less-than-satisfactory result, say F(x1) = 0.8 while y1 = 0.9. Each successive model corrects for that gap between the predicted and true values. This process can be repeated until you have fit an effective model. In other words, the residuals are interpreted as negative gradients (of the squared-error loss).
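A toy version of this loop for regression with squared-error loss (my own sketch, assuming NumPy arrays X and y): each round fits a shallow tree to the current residuals and nudges the running prediction toward them.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    prediction = np.full(len(y), y.mean())  # F0: start from a constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction          # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)              # next model explains what is left over
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction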