Why Ensemble Learning?
- Combine multiple models to improve accuracy and robustness
- Reduce variance (averaging), sometimes bias (boosting)
- Stronger generalization on tabular data; a competitive baseline in practice
- Natural parallelization (bagging) & strong off-the-shelf performance (RF)
Bias–Variance Perspective (High-Level)
- Expected prediction error = Bias² + Variance + Irreducible noise
- Bagging ↓ variance by averaging unstable learners (e.g., trees)
- Boosting can ↓ bias by sequentially correcting residuals
- Diversity among base learners is key for gains
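The variance term is where averaging helps. A minimal, idealized sketch (pure NumPy, assuming independent identically distributed estimators rather than real trees) showing that the variance of an average of B independent estimates falls roughly as σ²/B:

```python
# Idealized illustration of bagging's variance reduction: the variance of an
# average of B independent estimators shrinks as sigma^2 / B.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # std. dev. of a single noisy estimator
n_trials = 10_000

for B in (1, 10, 100):
    # each row: B independent estimates of the same quantity (true value 0)
    estimates = rng.normal(loc=0.0, scale=sigma, size=(n_trials, B))
    ensemble = estimates.mean(axis=1)          # the "bagged" prediction
    print(f"B={B:>3}  empirical variance of average: {ensemble.var():.3f}  "
          f"(theory: {sigma**2 / B:.3f})")
```

Real bagged trees are correlated, so the reduction is smaller than σ²/B; that correlation is what Random Forests attack next.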
Ensemble Taxonomy
- Homogeneous vs. heterogeneous (same vs. different base models)
- Parallel (bagging, Random Forest) vs. sequential (boosting)
- Voting/averaging (hard vs. soft)
- Stacking/blending with a meta-learner
Bagging: Bootstrap Aggregating
- Train B models on bootstrap samples (sampling with replacement)
- Each model is high-variance (e.g., a deep tree) → averaging stabilizes predictions
- Out-of-Bag (OOB) estimation: on average ~37% of the data is not seen by a given model
- Key knobs: n_estimators (B), base learner complexity, max_samples
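A minimal bagging sketch, assuming scikit-learn ≥ 1.2 (where the constructor argument is estimator; older releases use base_estimator); the synthetic dataset and hyperparameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # high-variance base learner (deep tree)
    n_estimators=200,                    # B
    max_samples=1.0,                     # bootstrap sample size per model
    bootstrap=True,
    oob_score=True,                      # score each model on its ~37% unseen rows
    n_jobs=-1,
    random_state=0,
)
bag.fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)
```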
Random Forests (RF)
- Bagging + random subspace: each split considers a random subset of features
- De-correlates trees; often the best default for tabular data
- Classification: majority vote; regression: average of tree predictions
- Tune: n_estimators, max_features, max_depth, min_samples_leaf, class_weight
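A minimal Random Forest sketch with scikit-learn; the listed hyperparameter values are illustrative starting points, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # more trees: lower variance, higher cost
    max_features="sqrt",     # random feature subset per split (de-correlation)
    max_depth=None,          # fully grown trees are the usual starting point
    min_samples_leaf=1,
    class_weight=None,       # set to "balanced" for skewed classes
    n_jobs=-1,
    random_state=0,
)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```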
Extremely Randomized Trees (ExtraTrees)
- Randomized thresholds + random feature subsets at each split
- Even more de-correlation; often faster to train
- May increase bias slightly but reduce variance further
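The same sketch with ExtraTrees for comparison (scikit-learn; values again illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

et = ExtraTreesClassifier(
    n_estimators=500,
    max_features="sqrt",
    bootstrap=False,     # ExtraTrees default: full sample, rely on split randomness
    n_jobs=-1,
    random_state=0,
)
print("CV accuracy:", cross_val_score(et, X, y, cv=5).mean())
```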
Boosting: Core Idea
- Train weak learners sequentially; each focuses on the previous model's errors
- Stage-wise additive modeling: f_t(x) = f_{t-1}(x) + η · h_t(x)
- Requires careful regularization (learning rate, tree depth, subsampling)
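A hand-rolled sketch of the additive update for squared loss, where the negative gradient is simply the residual; real GBDT libraries add regularization and more general losses, so this is only meant to make the f_t = f_{t-1} + η·h_t step concrete:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

eta, n_rounds = 0.1, 100
pred = np.full_like(y, y.mean(), dtype=float)   # f_0(x): constant prediction
learners = []

for t in range(n_rounds):
    residuals = y - pred                        # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=3, random_state=t).fit(X, residuals)
    pred += eta * h.predict(X)                  # f_t = f_{t-1} + eta * h_t
    learners.append(h)

print("training MSE after boosting:", np.mean((y - pred) ** 2))
```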
AdaBoost (Binary Classification)
- Re-weights samples each round; harder examples get higher weight
- Weak learner: typically shallow decision trees (stumps)
- Final prediction: weighted vote of the weak learners
- Sensitive to noise/outliers; strong on clean, small-to-medium data
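A minimal AdaBoost-with-stumps sketch, again assuming scikit-learn ≥ 1.2 for the estimator argument; hyperparameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump as the weak learner
    n_estimators=300,
    learning_rate=0.5,
    random_state=0,
)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```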
Gradient Boosting (GBDT/GBM)
- Fit each new tree to the negative gradient of the loss (residuals for squared error)
- Key hyperparameters: n_estimators, learning_rate, max_depth (or max_leaves), subsample
- Use early stopping on a validation set to prevent overfitting
- Variants: XGBoost, LightGBM, CatBoost
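A minimal early-stopping sketch using scikit-learn's HistGradientBoostingClassifier as a stand-in for the dedicated libraries (which expose analogous options); the values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gbdt = HistGradientBoostingClassifier(
    max_iter=1000,            # upper bound on boosting rounds
    learning_rate=0.05,
    max_leaf_nodes=31,
    early_stopping=True,      # stop when the internal validation score stalls
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=0,
)
gbdt.fit(X_tr, y_tr)
print("rounds used:", gbdt.n_iter_, " test accuracy:", gbdt.score(X_te, y_te))
```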
XGBoost vs. LightGBM vs. CatBoost (At a Glance)
- XGBoost: robust regularization, shrinkage, column subsampling, wide ecosystem
- LightGBM: leaf-wise growth with depth limits; fast on large, sparse datasets
- CatBoost: native categorical handling, ordered boosting (reduces target leakage)
- Pick based on data size, sparsity, categorical richness, and latency needs
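Roughly comparable starting configurations, assuming the xgboost, lightgbm, and catboost packages are installed; the values and column names are purely illustrative (and cat_features by name requires a pandas DataFrame at fit time):

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

xgb = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6,
                    subsample=0.8, colsample_bytree=0.8)        # column subsampling

lgbm = LGBMClassifier(n_estimators=500, learning_rate=0.05,
                      num_leaves=63, max_depth=-1)              # leaf-wise growth

cat = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                         cat_features=["country", "product"],   # hypothetical columns
                         verbose=0)                             # native categoricals
```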
Stacking (Meta-Learning)
- Level-0: diverse base models; level-1: a meta-learner trained on out-of-fold predictions
- Cross-validation is crucial to avoid leakage into the meta-learner
- Use simple meta-learners first (logistic/linear) to avoid overfitting
- Blending: a holdout set provides the meta-features (simpler, but less data-efficient)
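A minimal stacking sketch with scikit-learn's StackingClassifier, which builds the out-of-fold meta-features internally; the base models and values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # simple meta-learner
    cv=5,                            # out-of-fold predictions for the meta-features
    stack_method="predict_proba",
    n_jobs=-1,
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```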
Voting & Averaging
- Hard voting: majority class label
- Soft voting: average predicted probabilities (works best with calibrated models)
- Weighted voting: weight models by validation performance or domain knowledge
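A minimal soft-voting sketch with scikit-learn; the weights here are illustrative and would normally come from validation performance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    voting="soft",        # average predicted probabilities
    weights=[2, 1, 1],    # e.g., trust the forest a bit more
    n_jobs=-1,
)
print("CV accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```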
Imbalanced Data Strategies
- Use class_weight='balanced' where supported (e.g., RF) or resampling strategies
- Optimize decision thresholds using precision–recall curves; evaluate with AUPRC
- Consider Balanced Random Forest, EasyEnsemble, or focal-loss boosting variants
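A minimal sketch combining class weighting, AUPRC, and threshold selection from the precision–recall curve; the synthetic 5%-positive dataset and the F1-maximizing threshold rule are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                            n_jobs=-1, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_va)[:, 1]

print("AUPRC:", average_precision_score(y_va, proba))

# pick the threshold that maximizes F1 on the validation PR curve
prec, rec, thr = precision_recall_curve(y_va, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
print("best threshold:", thr[np.argmax(f1[:-1])])
```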
Interpretability & Diagnostics
- Global: permutation importance, minimal depth, gain statistics
- Local: SHAP values, tree-path analysis, counterfactuals
- Check calibration (reliability curves) for probability outputs
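A minimal diagnostics sketch covering permutation importance and a reliability curve, using scikit-learn only (SHAP would need the separate shap package); data and model are illustrative:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

# global importance: drop in held-out score when a feature is shuffled
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0, n_jobs=-1)
print("top feature index:", imp.importances_mean.argmax())

# reliability curve: observed positive fraction vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_te, rf.predict_proba(X_te)[:, 1], n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```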
Practical Tips
- Start with RF as a baseline for tabular problems
- For boosting: tune learning_rate and the number of trees together, with early stopping
- Use OOB estimates (bagging/RF) for quick model iteration
- Cross-validate across time splits for temporal data (avoid leakage)
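A minimal time-aware CV sketch: TimeSeriesSplit keeps every training fold strictly earlier than its validation fold, avoiding look-ahead leakage (the synthetic data stands in for rows ordered by time):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# pretend rows are ordered chronologically
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

cv = TimeSeriesSplit(n_splits=5)
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
print("per-fold accuracy:", cross_val_score(rf, X, y, cv=cv))
```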
When to Use What
- RF: strong baseline for mixed/tabular data; low tuning cost
- GBM (XGBoost/LightGBM/CatBoost): when you need top accuracy and can tune carefully
- Bagging (generic): unstable base learner and small data → variance reduction
- Stacking: when diverse models each capture different structure; ensure robust CV
Common Pitfalls & How to Avoid Them
- Data leakage in stacking/blending → use out-of-fold predictions
- Overfitting with too-deep trees in boosting → use small max_depth plus regularization
- Poor probability calibration in RF/boosting → calibrate on a validation set
- Distribution shift → evaluate with time-aware or group-aware splits
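A minimal calibration sketch with CalibratedClassifierCV, which fits the calibration map on held-out folds; the isotonic-vs-sigmoid choice and the Brier score comparison are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
calibrated = CalibratedClassifierCV(raw, method="isotonic", cv=5).fit(X_tr, y_tr)
raw.fit(X_tr, y_tr)

print("Brier score (raw):       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("Brier score (calibrated):", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```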
Mini Case Sketch (Credit Risk)
- Goal: predict default; imbalanced (~5% positive)
- Baseline: RF with class_weight='balanced'; tune max_features
- Compare with LightGBM + early stopping; evaluate AUPRC
- Final model: stack RF + SVC with an LR meta-learner; check calibration
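An end-to-end sketch of the final stacking step on synthetic stand-in data (the real credit features are not available here): class-weighted level-0 models, a logistic-regression meta-learner, and AUPRC as the headline metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=10_000, n_features=30,
                           weights=[0.95, 0.05], random_state=0)  # ~5% "defaults"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=400, class_weight="balanced",
                                      n_jobs=-1, random_state=0)),
        ("svc", make_pipeline(StandardScaler(),
                              SVC(probability=True, class_weight="balanced",
                                  random_state=0))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5, stack_method="predict_proba", n_jobs=-1,
).fit(X_tr, y_tr)

print("AUPRC:", average_precision_score(y_te, stack.predict_proba(X_te)[:, 1]))
```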
References & Further Reading
- Breiman, L. (1996). Bagging Predictors.
- Breiman, L. (2001). Random Forests.
- Freund, Y. & Schapire, R. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting (AdaBoost).
- Friedman, J. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
- Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System.
- Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
- Dorogush, A. V. et al. (2018). CatBoost: Gradient Boosting with Categorical Features Support.