Ensemble Learning: Bagging, Boosting & Stacking
A practical, teaching-ready deck (with code snippets)
Why Ensemble Learning?
• Combine multiple models to improve accuracy and robustness
• Reduce variance (averaging), sometimes bias (boosting)
• Stronger generalization on tabular data; a competitive baseline in practice
• Natural parallelization (bagging) & strong off-the-shelf performance (RF)
Bias–Variance Perspective (High-Level)
• Expected prediction error = Bias² + Variance + Irreducible noise
• Bagging ↓ variance by averaging unstable learners (e.g., trees); see the sketch below
• Boosting can ↓ bias by sequentially correcting residuals
• Diversity among base learners is key for gains
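A minimal sketch of the variance-reduction effect on a synthetic regression task; BaggingRegressor and DecisionTreeRegressor are standard scikit-learn classes, and the sizes, noise level, and seeds are illustrative:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression task (illustrative sizes and noise).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A single fully grown tree: low bias, high variance.
tree = DecisionTreeRegressor(random_state=0)
# Averaging 100 bootstrapped trees: similar bias, lower variance.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)

for name, model in [("single tree", tree), ("bagged trees", bagged)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")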
Ensemble Taxonomy
• Homogeneous vs. Heterogeneous (same vs. different base models)
• Parallel (Bagging/Random Forest) vs. Sequential (Boosting)
• Voting/Averaging (hard vs. soft)
• Stacking/Blending with a meta-learner
Bagging: Bootstrap Aggregating
• Train B models on bootstrap samples (sampling with replacement)
• Each model is high-variance (e.g., a deep tree) → averaging stabilizes
• Out-of-Bag (OOB) estimation: ~37% of the data is not seen by a given tree (see the check below)
• Key knobs: n_estimators (B), base-learner complexity, max_samples
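Where the ~37% figure comes from: a given row is missed by one bootstrap sample with probability (1 − 1/n)^n, which tends to 1/e ≈ 0.368. A quick NumPy check (the sample size is illustrative):

import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # number of training rows (illustrative)

# One bootstrap sample: draw n row indices with replacement.
boot = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(boot)) / n

print(f"empirical OOB fraction: {oob_fraction:.3f}")        # ~0.367
print(f"theoretical limit 1/e:  {np.exp(-1):.3f}")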
Random Forests (RF)
• Bagging + Random Subspace: each split considers a random subset of features
• De-correlates trees; often the best default for tabular data
• Classification: majority vote; Regression: average
• Tune: n_estimators, max_features, max_depth, min_samples_leaf, class_weight
Extremely Randomized Trees (ExtraTrees)
• Randomized thresholds + random feature subsets at each split
• Even more de-correlation; often faster to train
• May increase bias slightly but reduces variance further (see the sketch below)
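A minimal sketch comparing ExtraTrees with a Random Forest; both classes share the scikit-learn estimator interface, and the synthetic dataset and settings are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Same interface as RandomForestClassifier; splits use random thresholds.
et = ExtraTreesClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=0)
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", n_jobs=-1, random_state=0)

for name, model in [("ExtraTrees", et), ("RandomForest", rf)]:
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(name, round(scores.mean(), 3))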
Boosting: Core Idea
• Train weak learners sequentially; each focuses on the previous errors
• Stage-wise additive modeling: f_t(x) = f_{t-1}(x) + η · h_t(x) (see the from-scratch sketch below)
• Requires careful regularization (learning rate, depth, subsampling)
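A from-scratch sketch of stage-wise additive modeling for squared loss, where the negative gradient is simply the residual; the dataset, tree depth, and learning rate are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

eta, n_rounds = 0.1, 100                      # learning rate and number of stages
f = np.full_like(y, y.mean(), dtype=float)    # f_0: constant prediction
trees = []

for _ in range(n_rounds):
    residuals = y - f                          # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    f += eta * h.predict(X)                    # f_t = f_{t-1} + eta * h_t
    trees.append(h)                            # keep the weak learners for prediction

print("training MSE:", round(np.mean((y - f) ** 2), 3))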
AdaBoost (Binary Classification)
• Re-weights samples; harder examples get higher weight
• Weak learner: typically shallow decision trees (stumps)
• Final prediction: weighted vote of the weak learners
• Sensitive to noise/outliers; strong on clean, small-to-medium data
Gradient Boosting (GBDT/GBM)
• Fit each new tree to the negative gradient of the loss (the residuals, for squared loss)
• Key hyperparameters: n_estimators, learning_rate, max_depth (or max_leaves), subsample
• Use early stopping on a validation set to prevent overfitting (see the sketch below)
• Variants: XGBoost, LightGBM, CatBoost
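One way to wire up early stopping with plain scikit-learn: GradientBoostingClassifier accepts validation_fraction and n_iter_no_change; the specific values below are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Early stopping: hold out 10% internally and stop once the validation
# score has not improved for 20 consecutive iterations.
gb = GradientBoostingClassifier(
    n_estimators=2000,        # upper bound; early stopping picks the actual number
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=0,
).fit(X, y)

print("trees actually fitted:", gb.n_estimators_)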
XGBoost vs. LightGBM vs. CatBoost (At a Glance)
• XGBoost: robust regularization, shrinkage, column subsampling, wide ecosystem
• LightGBM: leaf-wise growth with depth limits; fast on large, sparse datasets
• CatBoost: native categorical handling, ordered boosting (reduces target leakage)
• Pick based on data size, sparsity, categorical richness, and latency needs (see the sketch below)
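A rough side-by-side sketch on synthetic data, assuming the xgboost, lightgbm, and catboost packages are installed; the parameter values are illustrative, not tuned:

# Assumes `pip install xgboost lightgbm catboost`; settings are illustrative.
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = [
    XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6,
                  subsample=0.8, colsample_bytree=0.8, random_state=42),
    LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=63,
                   subsample=0.8, colsample_bytree=0.8, random_state=42),
    CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                       verbose=0, random_seed=42),
]

# All three expose the scikit-learn fit/score interface.
for model in models:
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))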
Stacking (Meta-Learning)
• Level-0: diverse base models; Level-1: meta-learner uses out-of-fold predictions
• Cross-validation is crucial to avoid leakage
• Use simple meta-learners first (logistic/linear) to avoid overfitting
• Blending: holdout set for meta-features (simpler, less data-efficient)
Voting & Averaging
• Hard voting: majority class label
• Soft voting: average predicted probabilities (requires calibrated models)
• Weighted voting: weight by validation performance or domain knowledge (see the sketch below)
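A small sketch of weighted soft voting with scikit-learn's VotingClassifier; the 2:1 weights and the base models are illustrative, and a train split like the one on the code slides below is assumed:

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages predicted probabilities; `weights` lets you trust one
# model more than another (the 2:1 ratio here is purely illustrative).
vote = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=300, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='soft',
    weights=[2, 1],
)
# Fit/predict like any scikit-learn estimator, e.g. vote.fit(X_train, y_train).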
Imbalanced Data Strategies
• Use class_weight='balanced' (RF/AdaBoost/etc.) or sampling strategies
• Optimize thresholds using PR curves; use AUPRC for evaluation (see the sketch below)
• Consider Balanced Random Forest, EasyEnsemble, or focal loss (boosting variants)
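A minimal sketch of the class-weighting + PR-curve workflow on a synthetic ~5%-positive problem; all sizes and settings are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced toy problem (~5% positives, mirroring the credit-risk sketch later).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, class_weight='balanced',
                            n_jobs=-1, random_state=0).fit(X_tr, y_tr)

proba = rf.predict_proba(X_te)[:, 1]
print('AUPRC:', round(average_precision_score(y_te, proba), 3))

# Pick an operating threshold from the PR curve instead of the default 0.5.
precision, recall, thresholds = precision_recall_curve(y_te, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
print('best-F1 threshold:', thresholds[f1[:-1].argmax()])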
Interpretability & Diagnostics
• Global: permutation importance, minimal depth, gain statistics
• Local: SHAP values, tree path analysis, counterfactuals
• Check calibration (reliability curves) for probability outputs (see the sketch below)
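A short diagnostics sketch, reusing the fitted rf and held-out split (X_te, y_te) from the previous sketch; permutation_importance and calibration_curve are standard scikit-learn utilities:

from sklearn.inspection import permutation_importance
from sklearn.calibration import calibration_curve

# Global importance: permutation importance on held-out data.
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                random_state=0, n_jobs=-1)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")

# Reliability curve: how well predicted probabilities match observed frequencies.
prob_true, prob_pred = calibration_curve(y_te, rf.predict_proba(X_te)[:, 1], n_bins=10)

# For local explanations, the third-party `shap` package provides TreeExplainer.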
Practical Tips
• Start with RF as a baseline for tabular problems
• For boosting: tune learning_rate and the number of trees with early stopping
• Use OOB estimates (bagging/RF) for quick model iteration
• Cross-validate across time splits for temporal data to avoid leakage (see the sketch below)
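A sketch of time-aware cross-validation with TimeSeriesSplit, assuming a feature matrix X and labels y whose rows are ordered by time:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Each fold trains on the past and validates on the future, so no future
# information leaks into training (assumes rows of X are in time order).
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(RandomForestClassifier(n_estimators=300, random_state=0),
                         X, y, cv=tscv, n_jobs=-1)
print(scores)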
Key Hyperparameters (Cheat Sheet)
• RF: n_estimators ↑, max_features (sqrt/log2), min_samples_leaf (1–10)
• GBM: learning_rate (0.01–0.1), n_estimators (100–1000+), max_depth (3–10), subsample (0.6–0.9); see the random-search sketch below
• AdaBoost: n_estimators (50–500), learning_rate (0.01–1.0)
• Stacking: base diversity ↑, meta-learner regularization (ridge/logistic)
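One way to search over these ranges: a RandomizedSearchCV sketch for GBM; the distributions are illustrative, and the train split from the next slide's code is assumed:

from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random search over the cheat-sheet ranges (distributions are illustrative).
param_distributions = {
    'learning_rate': loguniform(0.01, 0.1),
    'n_estimators': randint(100, 1000),
    'max_depth': randint(3, 10),
    'subsample': [0.6, 0.7, 0.8, 0.9],
}
search = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                            param_distributions, n_iter=30, cv=5, n_jobs=-1,
                            random_state=0)
search.fit(X_train, y_train)   # assumes the train split from the code slides
print(search.best_params_)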
Code: Bagging & Random Forest (scikit-learn)
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Assumes a feature matrix X and label vector y are already loaded.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

bag = BaggingClassifier(n_estimators=200, max_samples=0.8, n_jobs=-1,
                        random_state=42)
rf = RandomForestClassifier(n_estimators=400, max_features='sqrt', oob_score=True,
                            n_jobs=-1, random_state=42)

# 5-fold CV accuracy for both ensembles.
for model in [bag, rf]:
    scores = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1)
    print(model.__class__.__name__, scores.mean(), scores.std())

# OOB score: a "free" validation estimate from the bootstrap samples.
rf.fit(X_train, y_train)
print('OOB:', rf.oob_score_)
Code: AdaBoost & GradientBoosting (scikit-learn)
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Reuses the train/test split from the previous slide.
ada = AdaBoostClassifier(n_estimators=300, learning_rate=0.1, random_state=42)
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=3,
                                subsample=0.8, random_state=42)

ada.fit(X_train, y_train)
gb.fit(X_train, y_train)
print('AdaBoost test acc:', ada.score(X_test, y_test))
print('GB test acc:', gb.score(X_test, y_test))
Code: Stacking & Voting (scikit-learn)
from sklearn.ensemble import StackingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

base_learners = [
    ('rf', RandomForestClassifier(n_estimators=300, random_state=42)),
    ('svc', SVC(probability=True, kernel='rbf', C=2.0, gamma='scale',
                random_state=42)),
]
meta = LogisticRegression(max_iter=1000)

# StackingClassifier trains the meta-learner on out-of-fold predictions (cv=5).
stack = StackingClassifier(estimators=base_learners, final_estimator=meta,
                           cv=5, n_jobs=-1)
vote = VotingClassifier(estimators=base_learners, voting='soft', n_jobs=-1)

stack.fit(X_train, y_train)
vote.fit(X_train, y_train)
print('Stack acc:', stack.score(X_test, y_test))
print('Vote acc:', vote.score(X_test, y_test))
When to Use What
• RF: strong baseline for mixed/tabular data; low tuning cost
• GBM (XGB/LGBM/CatBoost): when you need top accuracy and can tune carefully
• Bagging (generic): unstable base learner & small data → variance reduction
• Stacking: when diverse models each capture different structure; ensure robust CV
Common Pitfalls & How to Avoid Them
• Data leakage in stacking/blending → use out-of-fold predictions
• Overfitting with too-deep trees in boosting → use small max_depth + regularization
• Poor probability calibration in RF/boosting → calibrate on a validation set (see the sketch below)
• Distribution shift → evaluate with time-aware or group-aware splits
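A calibration sketch using scikit-learn's CalibratedClassifierCV; the wrapped model, method, and cv value are illustrative, and the train split from the code slides is assumed:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Recalibrate predicted probabilities via cross-validation; 'isotonic' needs a
# reasonable amount of data, 'sigmoid' is the safer choice on small datasets.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=42),
    method='isotonic',
    cv=5,
)
calibrated_rf.fit(X_train, y_train)   # assumes the train split from the code slides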
Mini Case Sketch (Credit Risk)
• Goal: predict default; imbalanced (5% positive)
• Baseline: RF with class_weight='balanced' → tune max_features
• Compare with LightGBM + early stopping; evaluate AUPRC
• Stack RF + SVC + LR-meta for the final model; check calibration
References & Further Reading
• Breiman, L. (1996) Bagging Predictors; (2001) Random Forests
• Freund, Y. & Schapire, R. (1997) AdaBoost
• Friedman, J. (2001) Greedy Function Approximation (GBM)
• Chen & Guestrin (2016) XGBoost; Ke et al. (2017) LightGBM; Dorogush et al. (2018) CatBoost
