Ensemble Learning with
Apache Spark MLlib 1.5
leoricklin@gmail.com
Reference
[1] http://www.csdn.net/article/2015-03-02/2824069
[2] http://www.csdn.net/article/2015-09-07/2825629
[3] http://www.scholarpedia.org/article/Ensemble_learning
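What is Ensemble Learning?
● Combines several learners (individual models) to improve a model's stability and predictive power
● Four main factors cause models to differ; combinations of these factors can also produce different models:
○ Different types
○ Different hypotheses
○ Different modeling techniques
○ Different initialization parameters
● Ensemble learning is a typical practice-driven research area: it first proved effective in practice, and only later did researchers analyze it theoretically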
A pinch of math
● There are 3 independent binary classifiers (A, B, C), each with 70% accuracy
● For a majority vote with 3 members we can expect 4 outcomes:
○ All three are correct: 0.7 * 0.7 * 0.7 = 0.343
○ Two are correct: 0.7 * 0.7 * 0.3 + 0.7 * 0.3 * 0.7 + 0.3 * 0.7 * 0.7 = 0.441
○ Two are wrong: 0.3 * 0.3 * 0.7 + 0.3 * 0.7 * 0.3 + 0.7 * 0.3 * 0.3 = 0.189
○ All three are wrong: 0.3 * 0.3 * 0.3 = 0.027
● The majority vote is correct with probability 0.343 + 0.441 = 0.784 > 0.7
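The arithmetic above can be checked by enumerating every outcome. This is a minimal sketch, assuming the classifiers are fully independent; the function name `majority_vote_accuracy` is illustrative and not from the slides.

```python
from itertools import product

def majority_vote_accuracy(p, n=3):
    """Probability that a majority of n independent classifiers,
    each correct with probability p, gives the right answer."""
    total = 0.0
    # Enumerate every correct/wrong pattern across the n classifiers
    for outcome in product([True, False], repeat=n):
        if sum(outcome) > n / 2:  # strictly more than half are correct
            prob = 1.0
            for ok in outcome:
                prob *= p if ok else (1 - p)
            total += prob
    return total

print(round(majority_vote_accuracy(0.7), 3))     # 0.784
print(round(majority_vote_accuracy(0.7, 5), 5))  # 0.83692
```

Adding members helps only under the independence assumption; correlated classifiers gain much less from voting.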
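Model Error
● The error of any model can be decomposed mathematically into three components:
○ Bias error measures how far the predicted values are from the actual values
○ Variance measures how much the predictions for the same observation differ from one another
○ Irreducible error is the noise that no model can remove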
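For squared-error loss, the decomposition is commonly written as below. The notation is an assumption introduced here, not taken from the slides: the target is y = f(x) + ε with noise variance σ², and f̂ is the fitted model.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```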
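Trade-off management of bias-variance errors
● As model complexity increases, the model eventually overfits and variance begins to grow
● A good model should keep these two kinds of error in balance
● Ensemble learning is one way to manage this trade-off
○ How should each learner be trained?
○ How should the learners be combined?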
EL techniques (1): Bagging
● Trains similar learners on small sampled subsets of the data, then averages their predictions
● Helps reduce variance
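A minimal sketch of the idea, assuming bootstrap sampling (drawing with replacement) and a 1-nearest-neighbour regressor as the base learner. The learner, the toy data, and all function names are illustrative choices, not something the slides specify.

```python
import random

def fit_1nn(sample):
    """Base learner: 1-nearest-neighbour regressor over (x, y) pairs."""
    def predict(x):
        return min(sample, key=lambda pt: abs(pt[0] - x))[1]
    return predict

def bagging_fit(data, n_models, rng):
    """Train one base learner per bootstrap sample of the data."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        models.append(fit_1nn(sample))
    return models

def bagging_predict(models, x):
    """Average the predictions of all base learners."""
    return sum(m(x) for m in models) / len(models)

rng = random.Random(0)
# Hypothetical toy data: y = 2x plus Gaussian noise
data = [(i / 10, 2 * i / 10 + rng.gauss(0, 0.5)) for i in range(50)]
models = bagging_fit(data, n_models=25, rng=rng)
print(bagging_predict(models, 2.5))
```

A single 1-nearest-neighbour model copies the noise of whichever point is closest; averaging over resampled models smooths that noise out, which is the variance reduction the slide refers to.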
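EL techniques (2): Boosting
● An iterative technique
● Adjusts the weight of each observation based on the previous round of classification: if an observation was misclassified, its weight is increased
● Reduces bias error, but can sometimes overfit the training data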
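The reweighting step can be sketched with the AdaBoost update rule. The slides do not name a specific boosting algorithm, so AdaBoost is an assumption here, as are the function and variable names.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round. `weights` must sum to 1;
    `correct[i]` says whether observation i was classified correctly."""
    # Weighted error of the current learner
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # Learner weight; assumes 0 < err < 0.5 (better than chance, not perfect)
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct observations, grow weights of misclassified ones
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.2] * 5
correct = [True, True, True, True, False]
new_weights, alpha = reweight(weights, correct)
print(new_weights)
```

After the update, the single misclassified observation carries half of the total weight, so the next learner is pushed to get it right.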
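EL techniques (3): Stacking
● Uses one learner to combine the outputs of several different learners
● Can reduce both bias error and variance
● Choosing the right learners to ensemble is more an art than a purely scientific problem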
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng
Stacking with Apache MLlib (1)
● Dataset: UCI Covtype (Ch. 4, Advanced Analytics with Spark)
● Baseline: RandomForest (best of 8 hyper-parameter settings with 3-fold C.V.)
○ precision = 0.956144
○ recall = 0.956144
● Flow (recovered from the slide diagram): RF(θ1) fits Training set X, then predicts Training set Y, giving h1(Y,θ1)
● #trees = 32
○ θ1: #bins=300, #depth=30, entropy
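The baseline precision and recall are identical. One plausible explanation, which the slides do not state, is that both are micro-averaged over all classes: for single-label multiclass predictions, micro-averaged precision and recall both equal accuracy. The sketch below shows this on made-up labels; the function name is illustrative.

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall for single-label multiclass data."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1  # correctly predicted as class c
            elif p == c:
                fp += 1  # predicted c, but the true class differs
            elif t == c:
                fn += 1  # true class is c, but something else was predicted
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels: 5 of 7 predictions are correct
y_true = [1, 2, 3, 1, 2, 3, 1]
y_pred = [1, 2, 3, 2, 2, 1, 1]
precision, recall = micro_precision_recall(y_true, y_pred)
print(precision, recall)
```

Every misclassified observation counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the two totals always match.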
Stacking with Apache MLlib (2)
● Using meta-features
● #trees = 32
○ θ1: #bins=300, #depth=30, entropy
○ θ2: #bins=40, #depth=30, entropy
○ θ3: #bins=300, #depth=30, gini
● Flow (recovered from the slide diagram):
○ Tier 1: RF(θ1), RF(θ2), RF(θ3) fit Training set X and predict it with 3-fold C.V., giving h1(X,θ1), h2(X,θ2), h3(X,θ3)
○ Tier 2: RF(θ1) fits h1(X,θ1), h2(X,θ2), h3(X,θ3) together with Label
○ Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from these
○ Diagram annotation: sort by precision
● Result:
            Baseline    Current
precision   0.956144    0.951056
recall      0.956144    0.951056
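The two-tier flow can be sketched in plain Python. Everything here is a simplified stand-in, not the author's Spark code: the tier-1 learners are nearest-centroid classifiers instead of random forests, the tier-2 learner is a lookup table, the data is synthetic, and the tier-1 models used for set Y are assumed to be refitted on all of X.

```python
import random
from collections import Counter, defaultdict
from functools import partial

def fit_centroid(rows, labels, feature):
    """Hypothetical tier-1 learner: nearest class centroid on one feature."""
    sums, counts = defaultdict(float), Counter()
    for row, y in zip(rows, labels):
        sums[y] += row[feature]
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in counts}
    return lambda row: min(centroids, key=lambda y: abs(row[feature] - centroids[y]))

def out_of_fold(rows, labels, fit, k=3):
    """Predict every row with a model that did not see it during training."""
    preds = [None] * len(rows)
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in range(fold, len(rows), k):
            preds[i] = model(rows[i])
    return preds

def fit_vote_table(meta_rows, labels):
    """Hypothetical tier-2 learner: most common label per meta-feature tuple."""
    table = defaultdict(Counter)
    for m, y in zip(meta_rows, labels):
        table[tuple(m)][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    def predict(m):
        votes = table.get(tuple(m))
        return votes.most_common(1)[0][0] if votes else default
    return predict

rng = random.Random(42)
labels_x = [i % 2 for i in range(60)]  # synthetic two-class labels
rows_x = [(y + rng.gauss(0, 0.6), y + rng.gauss(0, 0.6)) for y in labels_x]
fits = [partial(fit_centroid, feature=f) for f in (0, 1)]

# Tier 1: 3-fold out-of-fold predictions on X become the meta-features
meta_x = list(zip(*(out_of_fold(rows_x, labels_x, fit) for fit in fits)))
# Tier 2: fit on the meta-features plus the label
tier2 = fit_vote_table(meta_x, labels_x)

# Prediction on Y: tier-1 models (refitted on all of X) produce its meta-features
tier1 = [fit(rows_x, labels_x) for fit in fits]
rows_y = [(0.1, -0.2), (0.9, 1.3)]
meta_y = [tuple(model(row) for model in tier1) for row in rows_y]
print([tier2(m) for m in meta_y])
```

Using out-of-fold predictions keeps the tier-2 learner from training on meta-features that the tier-1 models produced for rows they had already memorized.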
Stacking with Apache MLlib (3)
● Using original features & meta-features
● #trees = 32
○ θ1: #bins=300, #depth=30, entropy
○ θ2: #bins=40, #depth=30, entropy
○ θ3: #bins=300, #depth=30, gini
● Flow (recovered from the slide diagram):
○ Tier 1: RF(θ1), RF(θ2), RF(θ3) fit Training set X and predict it with 3-fold C.V., giving h1(X,θ1), h2(X,θ2), h3(X,θ3)
○ Tier 2: RF(θ1) fits the original features f1...fn plus h1(X,θ1), h2(X,θ2), h3(X,θ3), together with Label
○ Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from f1...fn plus these
○ Diagram annotation: sort by precision
● Result:
            Baseline    Current
precision   0.956144    0.951094
recall      0.956144    0.951094
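Combining original features with meta-features amounts to widening each row. A minimal sketch, assuming rows are plain tuples; the helper name `augment` is hypothetical.

```python
def augment(rows, meta_rows):
    """Append each row's meta-features (h1..h3) to its original features (f1..fn)."""
    return [tuple(row) + tuple(meta) for row, meta in zip(rows, meta_rows)]

# Hypothetical example: two original features, three tier-1 predictions
rows = [(0.5, 1.2), (3.1, 0.4)]
meta = [(1, 1, 2), (2, 2, 2)]
print(augment(rows, meta))  # [(0.5, 1.2, 1, 1, 2), (3.1, 0.4, 2, 2, 2)]
```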
Stacking with Apache MLlib (4)
● Retrain tier-1 models and stack with all features
● #trees = 32
○ θ1: #bins=300, #depth=30, entropy
○ θ2: #bins=40, #depth=30, entropy
○ θ3: #bins=300, #depth=30, gini
● Flow (recovered from the slide diagram):
○ Tier 1: RF(θ1), RF(θ2), RF(θ3) fit Training set X and predict Training set X, giving h1(X,θ1), h2(X,θ2), h3(X,θ3)
○ Tier 2: RF(θ1) fits the original features f1...fn plus h1(X,θ1), h2(X,θ2), h3(X,θ3), together with Label
○ Prediction: RF(θ1), RF(θ2), RF(θ3) predict Training set Y, giving h1(Y,θ1), h2(Y,θ2), h3(Y,θ3); the tier-2 RF(θ1) predicts from f1...fn plus these
○ Diagram annotation: sort by precision
● Result:
            Baseline    Current
precision   0.956144    0.956836
recall      0.956144    0.956836
