Why Ensemble Learning?
• Combine multiple models to improve accuracy and robustness
• Reduce variance (averaging) and sometimes bias (boosting)
• Stronger generalization on tabular data; a competitive baseline in practice
• Natural parallelization (bagging) and strong off-the-shelf performance (RF)
Bias–Variance Perspective (High-Level)
• Expected prediction error (squared loss) = Bias² + Variance + Irreducible noise
• Bagging ↓ variance by averaging unstable learners such as deep trees (see the sketch after this list)
• Boosting can ↓ bias by sequentially correcting residuals
• Diversity among base learners is key for gains
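A minimal sketch of the variance-reduction effect, assuming scikit-learn and a synthetic dataset (all values are illustrative): many deep trees are trained on bootstrap resamples, and the spread of their individual predictions at one query point is compared with the spread of small ensemble averages.

```python
# Illustrative sketch: averaging bootstrap-trained trees shrinks prediction variance.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=501, n_features=10, noise=10.0, random_state=0)
X, x_query = X[:-1], X[-1:]  # hold one point out as the query
y = y[:-1]

rng = np.random.default_rng(0)
preds = []
for _ in range(200):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # unstable deep tree
    preds.append(tree.predict(x_query)[0])
preds = np.array(preds)

print("variance of single-tree predictions:", preds.var())
# 20 ensembles of 10 trees each; variance drops by roughly the ensemble size,
# less in practice because bootstrap trees are correlated.
print("variance of 10-tree averages:", preds.reshape(20, 10).mean(axis=1).var())
```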
Ensemble Taxonomy
• Homogeneous vs. heterogeneous (same vs. different base models)
• Parallel (bagging/random forest) vs. sequential (boosting)
• Voting/averaging (hard vs. soft)
• Stacking/blending with a meta-learner
Bagging: Bootstrap Aggregating
• Train B models on bootstrap samples (sampling with replacement)
• Each model is high-variance (e.g., a deep tree); averaging stabilizes the ensemble
• Out-of-Bag (OOB) estimation: ~37% (≈ e⁻¹) of the data is unseen by a given tree, giving a free validation estimate (see the sketch below)
• Key knobs: n_estimators (B), base-learner complexity, max_samples
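A minimal sketch of bagging with an OOB score, assuming scikit-learn ≥ 1.2 (where the base model is passed as estimator) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # high-variance base learner
    n_estimators=200,                    # B
    max_samples=1.0,                     # bootstrap sample size
    oob_score=True,                      # score each point only on trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB accuracy:", bag.oob_score_)
```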
Random Forests (RF)
• Bagging + random subspace: each split considers a random subset of features
• De-correlates trees; often the best default for tabular data
• Classification: majority vote; regression: average
• Tune: n_estimators, max_features, max_depth, min_samples_leaf, class_weight (example below)
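A minimal sketch of an RF baseline with the knobs above, assuming scikit-learn; the hyperparameter values are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,         # more trees: lower variance, higher cost
    max_features="sqrt",      # random-subspace size per split
    min_samples_leaf=2,       # mild regularization
    class_weight="balanced",  # reweight classes if imbalanced
    n_jobs=-1,                # bagging parallelizes trivially
    random_state=0,
)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```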
Extremely Randomized Trees (ExtraTrees)
• Randomized split thresholds + random feature subsets at each split
• Even more de-correlation; often faster to train (drop-in swap shown below)
• May increase bias slightly but reduces variance further
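A minimal sketch, assuming scikit-learn: ExtraTrees is a drop-in replacement for the forest above, differing mainly in the randomized thresholds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Same interface as RandomForestClassifier; split thresholds are drawn at
# random per candidate feature instead of being optimized.
et = ExtraTreesClassifier(n_estimators=500, max_features="sqrt", n_jobs=-1, random_state=0)
print("CV accuracy:", cross_val_score(et, X, y, cv=5).mean())
```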
Boosting: Core Idea
• Train weak learners sequentially; each focuses on the previous errors
• Stage-wise additive modeling: f_t(x) = f_{t-1}(x) + η · h_t(x)
• Requires careful regularization (learning rate, tree depth, subsampling); a from-scratch sketch follows
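A minimal from-scratch sketch of the additive update above for squared loss (where the negative gradient is simply the residual), assuming scikit-learn for the weak learner and a synthetic 1-D dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

eta, T = 0.1, 100     # learning rate (eta) and number of stages
f = np.zeros_like(y)  # f_0(x) = 0
for _ in range(T):
    h = DecisionTreeRegressor(max_depth=2).fit(X, y - f)  # fit weak learner to residuals
    f += eta * h.predict(X)                               # f_t = f_{t-1} + eta * h_t

print("train MSE:", np.mean((y - f) ** 2))
```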
AdaBoost (Binary Classification)
• Re-weights samples: harder examples get higher weight
• Weak learner: typically shallow decision trees (stumps)
• Final prediction: weighted vote of the weak learners (sketch below)
• Sensitive to noise/outliers; strong on clean, small-to-medium data
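A minimal sketch of AdaBoost over stumps, assuming scikit-learn ≥ 1.2 (estimator keyword) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a stump
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```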
Gradient Boosting (GBDT/GBM)
• Fit each new tree to the negative gradient of the loss (the residuals, for squared error)
• Key hyperparameters: n_estimators, learning_rate, max_depth (or max_leaves), subsample
• Use early stopping on a validation set to prevent overfitting (see the sketch below)
• Variants: XGBoost, LightGBM, CatBoost
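A minimal sketch of a gradient-boosted ensemble with early stopping, using scikit-learn's histogram-based GBDT as a stand-in for the variants above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = HistGradientBoostingClassifier(
    learning_rate=0.1,
    max_iter=1000,             # upper bound on boosting rounds
    max_depth=3,               # shallow trees as weak learners
    early_stopping=True,       # stop when the validation score plateaus
    validation_fraction=0.15,  # internal validation split
    n_iter_no_change=20,
    random_state=0,
).fit(X_tr, y_tr)

print("rounds used:", gbm.n_iter_, "| test accuracy:", gbm.score(X_te, y_te))
```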
XGBoost vs. LightGBM vs. CatBoost (At a Glance)
• XGBoost: robust regularization, shrinkage, column subsampling, wide ecosystem
• LightGBM: leaf-wise growth with depth limits; fast on large, sparse datasets
• CatBoost: native categorical handling, ordered boosting (reduces target leakage)
• Pick based on data size, sparsity, categorical richness, and latency needs
Stacking (Meta-Learning)
• Level-0: diverse base models; level-1: a meta-learner fit on out-of-fold predictions
• Cross-validation is crucial to avoid leakage
• Use simple meta-learners first (logistic/linear) to avoid overfitting (see the sketch below)
• Blending: a holdout set supplies the meta-features (simpler, but less data-efficient)
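A minimal stacking sketch, assuming scikit-learn, whose StackingClassifier builds the meta-features from out-of-fold predictions internally:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[  # level-0: diverse base models
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True))),
    ],
    final_estimator=LogisticRegression(),  # level-1: simple meta-learner
    cv=5,  # out-of-fold predictions feed the meta-learner (no leakage)
)
print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```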
Voting & Averaging
• Hard voting: majority class label
• Soft voting: average predicted probabilities (requires reasonably calibrated models; see the sketch below)
• Weighted voting: weight by validation performance or domain knowledge
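A minimal sketch of soft, weighted voting over three heterogeneous models, assuming scikit-learn; the weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",      # average predict_proba; use "hard" for majority label
    weights=[1, 2, 1],  # e.g., weight by validation performance
)
print("CV accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```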
Imbalanced Data Strategies
• Use class_weight='balanced' (RF and other estimators that support it) or resampling strategies
• Tune the decision threshold on PR curves; evaluate with AUPRC (see the sketch below)
• Consider Balanced Random Forest, EasyEnsemble, or focal loss (boosting variants)
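A minimal sketch, assuming scikit-learn, of choosing an operating threshold from the precision-recall curve rather than the default 0.5, and reporting AUPRC; the best-F1 criterion is just one illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# ~5% positive class
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
proba = rf.fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

print("AUPRC:", average_precision_score(y_val, proba))
prec, rec, thr = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
print("best-F1 threshold:", thr[np.argmax(f1[:-1])])  # thr is one shorter than prec/rec
```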
Interpretability & Diagnostics
• Global: permutation importance, minimal depth, gain statistics (sketch below)
• Local: SHAP values, tree-path analysis, counterfactuals
• Check calibration (reliability curves) for probability outputs
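A minimal sketch, assuming scikit-learn: permutation importance measured on a held-out set, which avoids the optimistic bias of impurity-based importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

# Report the three most important features with their mean score drop +/- spread.
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```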
Practical Tips
• Start with RF as a baseline for tabular problems
• For boosting: tune learning_rate and the number of trees with early stopping
• Use OOB scores (bagging/RF) for quick model iteration
• Cross-validate across time splits for temporal data to avoid leakage (see the sketch below)
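A minimal sketch of time-aware cross-validation, assuming scikit-learn and rows already ordered by time: TimeSeriesSplit keeps every validation fold strictly after its training fold, so no future data leaks in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Stand-in data; in practice the rows must be sorted chronologically.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y,
    cv=TimeSeriesSplit(n_splits=5),  # expanding-window, time-ordered folds
)
print("time-aware CV accuracy:", scores.mean())
```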
When to Use What
• RF: strong baseline for mixed/tabular data; low tuning cost
• GBM (XGBoost/LightGBM/CatBoost): when you need top accuracy and can tune carefully
• Bagging (generic): unstable base learner and small data → variance reduction
• Stacking: when diverse models each capture different structure; ensure robust CV
Common Pitfalls & How to Avoid Them
• Data leakage in stacking/blending → use out-of-fold predictions
• Overfitting with too-deep trees in boosting → use a small max_depth plus regularization
• Poor probability calibration in RF/boosting → calibrate on held-out data (see the sketch below)
• Distribution shift → evaluate with time-aware or group-aware splits
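A minimal calibration sketch, assuming scikit-learn: CalibratedClassifierCV refits the forest inside cross-validation splits, which play the role of the validation set, and the Brier score compares raw vs. calibrated probabilities.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic",  # or "sigmoid" (Platt scaling) for smaller datasets
    cv=5,
).fit(X_tr, y_tr)

print("Brier (raw):       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("Brier (calibrated):", brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1]))
```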
Mini Case Sketch (Credit Risk)
• Goal: predict default; imbalanced (5% positive)
• Baseline: RF with class_weight='balanced' → tune max_features
• Compare with LightGBM + early stopping; evaluate AUPRC
• Stack RF + SVC with an LR meta-learner for the final model; check calibration (outline below)
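A minimal end-to-end outline of the case above on synthetic stand-in data, assuming scikit-learn; sklearn's HistGradientBoosting is swapped in for LightGBM here to stay dependency-free, and every value is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ~5% positives, mimicking the default rate in the sketch.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95], random_state=0)

models = {
    "RF (balanced)": RandomForestClassifier(
        n_estimators=300, class_weight="balanced", random_state=0),
    "GBDT (early stopping)": HistGradientBoostingClassifier(
        early_stopping=True, random_state=0),
    "Stack (RF + SVC -> LR)": StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)),
            ("svc", make_pipeline(StandardScaler(), SVC(probability=True, class_weight="balanced"))),
        ],
        final_estimator=LogisticRegression(),
        cv=5),
}
for name, model in models.items():
    auprc = cross_val_score(model, X, y, cv=3, scoring="average_precision").mean()
    print(f"{name}: AUPRC = {auprc:.3f}")
```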
References & Further Reading
• Breiman, L. (1996). Bagging Predictors; Breiman, L. (2001). Random Forests.
• Freund, Y., & Schapire, R. (1997). AdaBoost.
• Friedman, J. (2001). Greedy Function Approximation (GBM).
• Chen, T., & Guestrin, C. (2016). XGBoost; Ke, G., et al. (2017). LightGBM; Dorogush, A., et al. (2018). CatBoost.