2. Ensemble learning combines a set of learners (individual models) to improve the stability and predictive power of the model.
• Combining classifiers is an ensemble method that increases accuracy.
• To obtain an improved model M*, combine a series of n learned models M1, M2, M3, …, Mn (a small sketch of this idea appears below).
[Diagram: from the original training data, Step 1 creates multiple datasets (Data1 … Data m); Step 2 builds multiple classifiers (Learner1 … Learner m, producing Model1 … Model m); Step 3 combines the classifiers via a model combiner into the final model.]
[1]
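As an illustration of the three steps, here is a minimal Python sketch using scikit-learn; the synthetic dataset, decision-tree base learner, and n = 5 models are placeholder choices, not taken from the slides.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=101)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=101)

n_models = 5
rng = np.random.default_rng(101)
models = []
for _ in range(n_models):
    # Step 1: create a new dataset by sampling the training data with replacement.
    idx = rng.integers(0, len(x_train), size=len(x_train))
    # Step 2: build a classifier (here a decision tree) on that dataset.
    models.append(DecisionTreeClassifier(random_state=101).fit(x_train[idx], y_train[idx]))

# Step 3: combine the n models into M*; for binary labels a majority vote is
# simply "more than half of the models predicted class 1".
votes = np.array([m.predict(x_test) for m in models])
final_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (final_pred == y_test).mean())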
3. • Ensemble methods that minimize
variance
– Bagging
– Random Forests
• Ensemble methods that minimize bias
– Functional Gradient Descent
– Boosting
– Ensemble Selection
[6]
5. Reasons to use ensemble learning
1. The dataset is too large or too small: in either case we sample the data, train a model on each sample, and average the results.
2. Complex (non-linear) data: real-world datasets are mostly non-linear, so a single model may not define the class boundary clearly and will under-fit. In that case we train on different sub-samples and average the different models.
3. High confidence: when we train multiple models and their outputs are highly correlated, most of the models predict the same class, which gives the ensemble prediction high confidence.
4. Low bias: bias is a measure of how flexible the model is; if the model is very flexible or very powerful, the bias is low.
5. Low variance: variance is high if, given different subsets of the data as training sets, the models' outputs are very different; it is low in the opposite case (see the sketch after this list).
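A small sketch of points 4-5, assuming a synthetic dataset and unpruned decision trees (both illustrative choices): the same flexible, low-bias model trained on different subsets of the data produces noticeably different predictions (high variance), while their average is more stable.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=7)

rng = np.random.default_rng(7)
probas = []
for _ in range(10):
    # Fit an unpruned tree (very flexible, i.e. low bias) on a random half of
    # the training data; each subset yields a noticeably different model.
    idx = rng.choice(len(x_train), size=len(x_train) // 2, replace=False)
    tree = DecisionTreeClassifier(random_state=7).fit(x_train[idx], y_train[idx])
    probas.append(tree.predict_proba(x_test)[:, 1])

probas = np.array(probas)
# The spread of the individual models' predictions reflects the high variance.
print("average spread across models:", probas.std(axis=0).mean())
# Averaging the models gives a more stable combined prediction.
avg_pred = (probas.mean(axis=0) > 0.5).astype(int)
print("accuracy of the averaged model:", (avg_pred == y_test).mean())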
7. Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive and perhaps the simplest ensemble-based algorithms, with surprisingly good performance (Breiman 1996). Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data.
Bagging Steps:
1) Suppose there are N observations and M features in the training data set. A sample is taken from the training data set randomly, with replacement.
2) A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node; this is repeated at each node.
3) The tree is grown to its largest possible size.
4) The above steps are repeated n times, and the prediction is given by aggregating the predictions from the n trees (see the sketch below).
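The steps above describe bagging with an added random feature subset, which is essentially how a random forest is built (plain bagging resamples only the observations). A minimal sketch of this with scikit-learn's BaggingClassifier; the dataset and parameter values are illustrative, not taken from the slides.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=101)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=101)

bag = BaggingClassifier(
    n_estimators=50,    # step 4: repeat the procedure n times and aggregate
    bootstrap=True,     # step 1: each tree sees a sample drawn with replacement
    max_features=0.5,   # a random subset of features per tree (a random forest
                        # re-draws the feature subset at every split, as in step 2)
    random_state=101,
)
# The default base learner is a decision tree grown to full size (step 3).
bag.fit(x_train, y_train)
print("bagging accuracy:", bag.score(x_test, y_test))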
8. Advantages:
1) Reduces over-fitting of the model.
2) Handles higher-dimensional data very well.
3) Maintains accuracy even when data are missing.
Disadvantages:
1) Since the final prediction is based on the mean of the predictions from the subset trees, it does not give precise values for the classification and regression model.
Python Syntax:
from sklearn.ensemble import RandomForestClassifier
# 80 bagged trees, each split considering 50% of the features; out-of-bag samples are used for scoring.
rfm = RandomForestClassifier(n_estimators=80, oob_score=True, n_jobs=-1,
                             random_state=101, max_features=0.50)  # a min_samples_ setting is truncated in the source
rfm.fit(x_train, y_train)
predicted = rfm.predict_proba(x_test)
Objectives Achieved by Bagging:
9. Similar to bagging, boosting also creates an ensemble
of classifiers by resampling the data, which are then
combined by majority voting. However, in boosting,
resampling is strategically geared to provide the most
informative training data for each consecutive
classifier.
Boosting Steps:
1) Draw a random subset of training samples d1 without replacement from the training set D to train a weak learner C1.
2) Draw a second random training subset d2 without replacement from the training set and add 50 percent of the samples that C1 misclassified, to train a weak learner C2.
3) Find the training samples d3 in the training set D on which C1 and C2 disagree, to train a third weak learner C3.
4) Combine all the weak learners via majority voting (see the sketch below).
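The three-classifier procedure above is the original boosting-by-resampling scheme; in practice, libraries typically implement AdaBoost, which reweights misclassified samples instead of literally drawing d1, d2 and d3. A minimal sketch with scikit-learn, using an illustrative dataset and parameter values:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=101)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=101)

# Each new weak learner (by default a one-level decision tree) concentrates on
# the samples the previous learners got wrong; the learners are then combined
# by a weighted vote.
boost = AdaBoostClassifier(n_estimators=70, random_state=101)
boost.fit(x_train, y_train)
print("boosting accuracy:", boost.score(x_test, y_test))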
10. Advantages:
1) Supports different loss functions (we have used 'binary:logistic' for this example).
2) Works well with interactions.
Disadvantages:
1) Prone to over-fitting.
2) Requires careful tuning of different hyper-parameters.
Python Syntax:
from xgboost import XGBClassifier
# Gradient-boosted trees with a logistic loss for binary classification.
xgb = XGBClassifier(objective='binary:logistic',
                    n_estimators=70, seed=101)
xgb.fit(x_train, y_train)
predicted = xgb.predict_proba(x_test)
Objectives Achieved by Boosting:
11. Email spam vs. not-spam detection, e.g., the features used for Gmail.
[4]
12. Users rate movies (1, 2, 3, 4, 5 stars);
Netflix makes suggestions to users based on previously rated movies.
“The Netflix Prize seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie
based on their movie preferences. Improve it enough and you win one
(or more) Prizes. Winning the Netflix Prize improves our ability to
connect people to the movies they love.”
13. • No clear winner; usually depends on the data
• Bagging is computationally more efficient than boosting (note that bagging
can train the M models in parallel, boosting can't)
• Both reduce variance (and overfitting) by combining different models
• The resulting model has higher stability as compared to the individual ones
• Bagging usually can't reduce the bias, boosting can (note that in boosting,
the training error steadily decreases)
• Bagging usually performs better than boosting if we don't have a high bias
and only want to reduce variance (i.e., if we are overfitting)
14. 1. Baldi, P., Frasconi, P., Smyth, P. (2003). Modeling the Internet and the Web: Probabilistic Methods and Algorithms. New York: Wiley. A good introduction to machine learning approaches to text mining and related applications on the web.
2. Bishop, C. M. (1995). Neural Networks for Pattern Recognition. New York: Oxford University Press. This book offers good coverage of neural networks.
3. Chakrabarti, S. (2003). Mining the Web. Morgan Kaufmann.
4. Cohen, P.R. (1995). Empirical Methods in Artificial Intelligence. Cambridge, MA: MIT Press. An excellent reference on experiment design, hypothesis testing, and related topics that are essential for empirical machine learning research.
5. Cowell, R.G., Dawid, A.P., Lauritzen, S.L., and Spiegelhalter, D.J. (1999). Graphical Models and Expert Systems. Berlin: Springer. A very good introduction to probabilistic graphical models.
6. Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. London: Cambridge University Press. An excellent introduction to kernel methods for pattern classification.