Ensemble Models: Bagging and Random Forests
Presented by: Chandru Dawood Ifzal Bharathi
ensemble / ɑːnˈsɑːm.bəl / - a small group of musicians, dancers or actors who perform together. - a number of things considered as a group. - a group producing a single effect
Ensemble Models Ensemble learning is a machine learning technique that enhances accuracy and resilience in forecasting by merging predictions from multiple models. It aims to mitigate errors or biases that may exist in individual models by leveraging the collective intelligence of the ensemble. It is much like seeking advice from multiple sources before making a significant decision, such as purchasing a car. Just as you wouldn’t rely solely on one opinion, ensemble models combine predictions from multiple base models to enhance overall performance.
Ensemble Models Bias-Variance Trade-off Single models often have to compromise between bias and variance. For example, decision trees might have high variance but low bias, while linear models may have low variance but high bias. Complex models, such as deep neural networks or individual decision trees, can overfit the training data, meaning they perform well on the training set but poorly on unseen data.
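To make the variance side of this trade-off concrete, here is a minimal sketch (synthetic data, scikit-learn assumed available) that fits an unconstrained decision tree and compares its training and test accuracy; the exact numbers will vary, but the gap illustrates overfitting.

# Hedged sketch: an unlimited-depth decision tree memorizes the training
# set (near-perfect training accuracy) but generalizes worse to unseen data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # no depth limit -> high variance
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically close to 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # typically noticeably lower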
Ensemble Learning Let us ask one of our friends to rate the movie with a simple yes/no answer.
High Bias: Our friend (the model) oversimplifies the problem, failing to capture important patterns, which leads to poor predictions. It underfits the data.
High Variance: On the other hand, our expert is too complex, paying attention to tiny details that may not be relevant in the future, which leads to overfitting.
Ensemble Learning How about asking 50 people to rate the movie? The responses would be more generalized and diversified, since we now have people with different sets of skills. As it turns out, this is a better way to get honest ratings than the single-opinion cases we saw before.
Ensemble Learning An ensemble model combines several individual models to produce more accurate predictions than a single model alone.
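As a rough numerical illustration of why many opinions beat one, the sketch below (made-up numbers, NumPy assumed) models each rating as the true score plus independent noise and shows that the average of 50 ratings sits much closer to the truth than a single rating.

# Illustrative only: the true rating, noise level, and group size are
# arbitrary choices, not values from the slides.
import numpy as np

rng = np.random.default_rng(0)
true_rating = 7.0
one_friend = true_rating + rng.normal(0.0, 2.0)              # a single noisy opinion
fifty_people = true_rating + rng.normal(0.0, 2.0, size=50)   # 50 independent opinions

print("one friend's rating:  ", round(one_friend, 2))
print("average of 50 ratings:", round(fifty_people.mean(), 2))  # close to 7.0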
Bagging Bagging, also known as Bootstrap Aggregating, is a powerful ensemble learning strategy that has been shown to improve model performance. Bagging can be applied to a variety of machine learning problems, most commonly classification and regression.
Bagging Bagging combines the results of multiple models (for instance, several decision trees) to get a generalized result. Here is a question: if we build all the models on the same set of data and combine them, will that be useful? Most likely not; there is a high chance that these models will give the same result, since they receive the same input. We solve this problem with a technique called bootstrapping.
Bootstrapping A bootstrapped dataset is created by sampling with replacement from the original dataset until it contains the same number of values as the original.
We then calculate the mean of the new bootstrapped dataset. Because this dataset differs from the original, we get a different mean.
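A minimal NumPy sketch of this resampling step is shown below; the six values are made up purely for illustration.

# Draw a sample of the same size as the original data, with replacement,
# then compare the two means.
import numpy as np

rng = np.random.default_rng(42)
original = np.array([2.3, 4.1, 5.0, 3.3, 6.2, 4.8])

bootstrapped = rng.choice(original, size=len(original), replace=True)
print("bootstrapped dataset:", bootstrapped)      # some values repeat, some are missing
print("original mean:     ", original.mean())
print("bootstrapped mean: ", bootstrapped.mean()) # usually differs slightly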
Implementation Steps of Bagging
1. Dataset Preparation: Clean and preprocess your dataset, then split it into training and test sets.
2. Bootstrap Sampling: Randomly sample from the training data with replacement to create multiple bootstrap samples. Each sample typically has the same size as the original dataset.
3. Model Training: Train a base model (e.g., a decision tree or neural network) on each bootstrap sample. Each model is trained independently.
4. Prediction Generation: Use each trained model to predict on the test data.
5. Combining Predictions: Aggregate the predictions from all models, using majority voting for classification or averaging for regression.
6. Evaluation: Assess the ensemble's performance on the test data using metrics such as accuracy, F1 score, or mean squared error.
7. Hyperparameter Tuning: Adjust the hyperparameters of the base models or the ensemble as needed, using techniques like cross-validation.
8. Deployment: Once satisfied with the ensemble's performance, deploy it to make predictions on new data.
(A code sketch of these steps appears after this list.)
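Here is a hedged sketch of the steps above using scikit-learn's BaggingClassifier on a synthetic dataset; your own data, preprocessing, metrics, and tuning would replace the placeholders.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Dataset preparation (synthetic stand-in for a real, preprocessed dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2-5. Bootstrap sampling, independent training, prediction, and vote
#      aggregation are handled internally by BaggingClassifier.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model ("base_estimator" on scikit-learn < 1.2)
    n_estimators=50,                     # number of bootstrap samples / models
    bootstrap=True,                      # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)

# 6. Evaluation
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))

# 7. Hyperparameter tuning (e.g., cross-validation over n_estimators) and
# 8. deployment would follow once the scores look acceptable.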
Random Forests
Random Forest Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an extension of the bagging estimator, with decision trees as the base estimators. At each node of each decision tree, random forest selects a random subset of features and considers only those when choosing the best split.
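A minimal scikit-learn sketch of this idea follows; it uses synthetic data rather than the heart-disease columns from the slides, and the n_estimators and max_features values are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrapped decision trees
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))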
1. Create a “bootstrapped” dataset
2. Create a decision tree using the bootstrapped dataset, using only a random subset of variables (columns) at each step. Let us select two random variables, Good Blood Circ. and Blocked Arteries, as candidates for the root node.
Since we used Good Blood Circ., let us grey it out and focus on the remaining variables.
We build the rest of the tree as usual, but at each step we consider only a random subset of variables.
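The snippet below is a tiny sketch of that per-node feature sampling; the column names other than Good Blood Circ. and Blocked Arteries are hypothetical placeholders.

# At every node, draw a small random set of candidate columns and pick the
# best split only among them; a fresh subset is drawn at the next node.
import numpy as np

rng = np.random.default_rng(0)
columns = ["Chest Pain", "Good Blood Circ.", "Blocked Arteries", "Weight"]  # placeholders

candidates = rng.choice(columns, size=2, replace=False)
print("candidate variables for this node:", candidates)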
How do we use the Random Forest? We have data for a new patient, and we want to know whether he has heart disease. We take the data and run it down the first tree.
In this case, "Yes" received most votes. So, we conclude that this patient has heart disease. ?