Random Forests: Bagging + Random Subspaces
Ensembles of Decision Trees for Robust, High-Performance Models
Core Intuition
• Train many de-correlated decision trees on bootstrap samples
• At inference: classification uses majority vote; regression uses the average
• Reduces variance while keeping the low bias of deep trees
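As a minimal sketch of the two aggregation rules, assuming per-tree predictions are already collected in NumPy arrays (the names tree_class_preds and tree_reg_preds are placeholders):

import numpy as np

# Hypothetical per-tree predictions for one sample, from 5 trees
tree_class_preds = np.array([1, 0, 1, 1, 0])            # class labels
tree_reg_preds = np.array([2.3, 1.9, 2.5, 2.1, 2.0])    # regression outputs

# Classification: majority vote across trees
values, counts = np.unique(tree_class_preds, return_counts=True)
majority_class = values[np.argmax(counts)]

# Regression: average across trees
avg_prediction = tree_reg_preds.mean()

print(majority_class, avg_prediction)   # -> 1  2.16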
Bagging (Bootstrap Aggregating)
• Sample with replacement from the training set to create diverse datasets
• Each tree sees a different subset with duplicates; ~63% unique examples per bootstrap
• Averaging votes smooths out individual tree noise
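The ~63% figure can be checked directly: draw n indices with replacement and measure the unique fraction (pure NumPy, no forest needed):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n draws with replacement from n training indices
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = np.unique(bootstrap_idx).size / n

# Expected value is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632
print(f"unique fraction: {unique_fraction:.3f}")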
Random Feature Selection at Splits
• At each split, consider only a random subset of features (mtry)
• Prevents dominant features from making trees too similar
• Typical defaults: sqrt(p) for classification, p/3 for regression (library-dependent)
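In scikit-learn the subset size is controlled by max_features; a short sketch of the common settings (the models are only constructed here, not fitted):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification default: mtry = sqrt(p)
clf = RandomForestClassifier(max_features='sqrt', random_state=0)

# Regression: roughly p/3 features per split (a float is read as a fraction of p)
reg = RandomForestRegressor(max_features=1/3, random_state=0)

# An explicit integer also works, e.g. consider 5 features at each split
clf_fixed = RandomForestClassifier(max_features=5, random_state=0)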
Training Algorithm (High-level)
For b = 1..B (n_estimators):
• Draw a bootstrap sample; grow a deep CART tree without pruning
• At each node: choose the best split among mtry randomly selected features
Aggregate predictions across all trees
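A compact from-scratch sketch of that loop, using scikit-learn's DecisionTreeClassifier as the base CART learner; this is illustrative only (it assumes NumPy arrays and non-negative integer class labels), not a replacement for RandomForestClassifier:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=100, mtry='sqrt', random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    n = X.shape[0]
    for b in range(B):
        # Bootstrap sample: n draws with replacement
        idx = rng.integers(0, n, size=n)
        # Deep, unpruned CART tree; mtry random features considered per split
        tree = DecisionTreeClassifier(max_features=mtry,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # Majority vote across trees (assumes non-negative integer labels)
    all_preds = np.stack([t.predict(X) for t in trees]).astype(int)  # (B, n_samples)
    vote = lambda col: np.bincount(col).argmax()
    return np.apply_along_axis(vote, 0, all_preds)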
Out-of-Bag (OOB) Error
• Each tree leaves out ~37% of samples → the 'out-of-bag' set
• Use OOB samples to estimate generalization error without a separate validation set
• Enable oob_score=True in scikit-learn
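A minimal sketch of OOB scoring in scikit-learn, using a toy dataset from sklearn.datasets so the snippet is self-contained:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True requires bootstrap=True (the default)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            n_jobs=-1, random_state=42)
rf.fit(X, y)

# Generalization estimate from the out-of-bag samples, no extra split needed
print('OOB accuracy:', rf.oob_score_)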
Feature Importance
• Impurity-based: average decrease in Gini/MSE across splits (fast, but biased)
• Permutation importance: drop in performance after shuffling a feature (slower, more reliable)
• Consider correlated features and interactions when interpreting
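A short sketch contrasting the two estimates on an already fitted forest (rf, X_valid, and y_valid are assumed to come from a prior train/validation split):

import numpy as np
from sklearn.inspection import permutation_importance

# Impurity-based importances come for free after fitting
impurity_imp = rf.feature_importances_

# Permutation importance: measured on held-out data, more reliable
perm = permutation_importance(rf, X_valid, y_valid,
                              n_repeats=10, random_state=0, n_jobs=-1)
perm_imp = perm.importances_mean

# Rank features by both methods and compare the top of each list
print('Impurity ranking:   ', np.argsort(impurity_imp)[::-1][:5])
print('Permutation ranking:', np.argsort(perm_imp)[::-1][:5])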
Hyperparameters to Tune
• n_estimators (B): more trees → better stability (diminishing returns)
• max_depth, min_samples_leaf: control overfitting and leaf purity
• max_features (mtry): controls tree de-correlation
• bootstrap, class_weight (for imbalance), ccp_alpha (post-pruning, if supported)
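One way to tune these knobs is RandomizedSearchCV; the grid below is a starting point, not a recommendation, and X_train, y_train are assumed to exist:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [200, 300, 500],
    'max_depth': [None, 10, 20, 40],
    'min_samples_leaf': [1, 2, 5, 10],
    'max_features': ['sqrt', 'log2', 0.3],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=20, cv=5, n_jobs=-1, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)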
Bias–Variance Characteristics
• Forests keep the low bias of deep trees but strongly reduce variance via averaging
• Performance improves with tree diversity; tune max_features to trade accuracy vs. diversity
• Too small mtry → underfit individual trees; too large mtry → highly correlated trees
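One simple way to see this trade-off is to sweep max_features and watch the OOB score (a rough sketch; X and y are assumed to be loaded already):

from sklearn.ensemble import RandomForestClassifier

# 1 feature per split → weak trees; all features (1.0) → highly correlated trees
for mtry in [1, 'sqrt', 0.5, 1.0]:
    rf = RandomForestClassifier(n_estimators=300, max_features=mtry,
                                oob_score=True, n_jobs=-1, random_state=0)
    rf.fit(X, y)
    print(f'max_features={mtry!r}  OOB accuracy={rf.oob_score_:.3f}')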
Handling Imbalanced Data
• Use class_weight='balanced' or custom weights
• Adjust the decision threshold using validation PR/ROC curves
• Try balanced subsampling per tree (if supported)
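A brief sketch of class weighting plus a threshold moved off the default 0.5; the 0.3 value is a placeholder to be chosen from validation PR/ROC curves, and X_train, y_train, X_valid are assumed available:

from sklearn.ensemble import RandomForestClassifier

# Reweight classes inversely to their frequency
rf = RandomForestClassifier(n_estimators=300, class_weight='balanced',
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# Tune the decision threshold on validation scores instead of using 0.5
proba = rf.predict_proba(X_valid)[:, 1]
threshold = 0.3            # placeholder: pick from the PR/ROC curve
y_pred = (proba >= threshold).astype(int)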
Practical Tips
• Use OOB error for quick model selection; still confirm with cross-validation
• Calibrate probabilities (Platt/isotonic) if well-calibrated confidence is needed
• For large p, consider feature selection or dimensionality reduction
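Calibration in scikit-learn is a thin wrapper; a sketch using CalibratedClassifierCV with isotonic regression (method='sigmoid' corresponds to Platt scaling), assuming X_train, y_train, X_valid exist:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)

# Wrap the forest; internal CV fits the calibrator on held-out folds
calibrated = CalibratedClassifierCV(rf, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)

# Calibrated probabilities for the positive class
proba = calibrated.predict_proba(X_valid)[:, 1]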
scikit-learn Example (Classification)

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Train a forest with OOB scoring enabled
rf = RandomForestClassifier(n_estimators=300, max_features='sqrt',
                            oob_score=True, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print('OOB accuracy:', rf.oob_score_)

# Permutation importance on a held-out validation set
result = permutation_importance(rf, X_valid, y_valid, n_repeats=5, random_state=42)
imp = result.importances_mean
print('Top features:', [feature_names[i] for i in imp.argsort()[::-1][:10]])
Interpretability in Practice
• Global: feature importance (impurity/permutation), minimal depth analysis
• Local: tree path inspection, surrogate models, or SHAP for per-instance contributions
• Remember: ensembles are less interpretable than single trees
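As one concrete example, a global surrogate: fit a shallow single tree on the forest's own predictions and read off its rules. This is a rough approximation of the ensemble, not an exact explanation, and it assumes a fitted rf plus X_train and feature_names from earlier:

from sklearn.tree import DecisionTreeClassifier, export_text

# Shallow tree trained to mimic the forest's predictions on the training data
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, rf.predict(X_train))

# Human-readable rules approximating the forest's behavior
print(export_text(surrogate, feature_names=list(feature_names)))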
Pros & Cons (Summary)
• Pros: strong accuracy out of the box, robust to noise/outliers, little tuning needed
• Cons: larger memory/latency, less interpretable, may struggle with very sparse high-dimensional data
• Widely used as a baseline for tabular problems