1
PLANET
Massively Parallel Learning of Tree Ensembles with MapReduce
Joshua Herbach*
Google Inc., AdWords
*Joint work with Biswanath Panda, Sugato Basu, Roberto J. Bayardo
2
Outline
• PLANET – infrastructure for building trees
• Decision trees
• Usage and motivation
• MapReduce
• PLANET details
• Results
• Future Work
3
Tree Models
• Classic data mining model
• Interpretable
• Good when built with ensemble techniques like bagging and boosting
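[Figure: example decision tree. Node labels A, C, D, E, F, G, H, I; split predicates X1 < v1 and X2 ∈ {v2, v3}; node sizes |D| = 100, 10, 90, 45, 45, 20, 30, 15, 25; value .42.]
4
Construction
[Figure: the same example tree with nodes A to I, split predicates X1 < v1 and X2 ∈ {v2, v3}, and value .42, used to illustrate tree construction.]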
5
Find Best Split
6
Trees at Google
• Large Datasets
  ◦ Iterating through a large dataset (10s, 100s, or 1000s of GB) is slow
  ◦ Computing values based on the records in a large dataset is really slow
• Parallelism!
  ◦ Break up the dataset across many processing units and then combine the results
  ◦ Supercomputers with specialized parallel hardware to support high throughput are expensive
  ◦ Computers made from commodity hardware are cheap
• Enter MapReduce
7
MapReduce*
*http://labs.google.com/papers/mapreduce.html
[Figure: MapReduce data flow. Inputs 1 to 5 feed the Mappers, which emit key-value pairs with keys A, B, and C; the pairs are grouped by key and passed to the Reducers, which write Output 1 and Output 2.]
• Can use a secondary key to control the order in which reducers see key-value pairs
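The data flow above can be imitated in a single process. The following is a minimal sketch, not Google's MapReduce API: `run_mapreduce`, `event_mapper`, and `sequence_reducer` are hypothetical names, and the "shuffle" is just an in-memory dictionary. It shows mappers emitting key-value pairs, grouping by key, and a secondary key fixing the order in which each reducer sees its values.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce run (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Each mapper call may emit any number of (key, secondary key, value) triples.
        for key, secondary, value in mapper(record):
            groups[key].append((secondary, value))
    output = {}
    for key in sorted(groups):
        # The secondary key fixes the order in which the reducer sees its values.
        ordered = [value for _, value in sorted(groups[key], key=lambda pair: pair[0])]
        output[key] = reducer(key, ordered)
    return output

def event_mapper(event):
    # Key by user, order by timestamp, carry the action as the value.
    user, timestamp, action = event
    yield user, timestamp, action

def sequence_reducer(key, values):
    # Values arrive already ordered by the secondary key.
    return list(values)

events = [("A", 3, "z"), ("B", 1, "p"), ("A", 1, "x"), ("A", 2, "y"), ("B", 0, "q")]
result = run_mapreduce(events, event_mapper, sequence_reducer)
print(result)  # {'A': ['x', 'y', 'z'], 'B': ['q', 'p']}
```

In a real deployment the grouping and sorting are done by the framework across machines; only the mapper and reducer bodies are user code.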
8
PLANET
• Parallel Learner for Assembling Numerous Ensemble Trees
• PLANET is a learner for training decision trees that is built on MapReduce
  ◦ Regression models (or classification using logistic regression)
  ◦ Supports boosting, bagging and combinations thereof
  ◦ Scales to very large datasets
9
System Components
• Master
  ◦ Monitors and controls everything
• MapReduce Initialization Task
  ◦ Identifies all the attribute values that need to be considered for splits
• MapReduce FindBestSplit Task
  ◦ MapReduce job to find the best split when there is too much data to fit in memory
• MapReduce InMemoryGrow Task
  ◦ Task to grow an entire subtree once the data for it fits in memory
• Model File
  ◦ A file describing the state of the model
10
Architecture
[Figure: architecture diagram. Components shown: Master, Input Data, Initialization MapReduce, Attribute Metadata, Model (drawn as the example tree with nodes A to I), FindBestSplit MapReduce, InMemory MapReduce, and Intermediate Results.]
11
Master
• Controls the entire process
• Determines the state of the tree and grows it
  ◦ Decides if nodes should be leaves
  ◦ If relatively little data enters a node, launches an InMemory MapReduce job to grow the entire subtree
  ◦ For larger nodes, launches a MapReduce job to find candidate best splits
  ◦ Collects results from MapReduce jobs and chooses the best split for a node
  ◦ Updates the Model
• Periodically checkpoints the system
• Maintains a status page for monitoring
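The per-node decision described above can be sketched as a small routing function. This is an illustration under assumptions, not PLANET's actual controller: `NodeInfo`, `plan_node`, `in_memory_limit`, `min_records`, and the leaf criteria (too few records, or zero impurity) are all hypothetical, since the slides do not say how leaves are decided or where the in-memory threshold sits.

```python
from dataclasses import dataclass

@dataclass
class NodeInfo:
    node_id: str
    num_records: int   # number of training records reaching the node
    impurity: float    # e.g. variance of the target at the node

def plan_node(node, in_memory_limit, min_records, min_impurity=0.0):
    """Return the action a master could take for one frontier node."""
    # Assumed stopping rule: too little data or nothing left to separate.
    if node.num_records < min_records or node.impurity <= min_impurity:
        return "leaf"
    # Small enough to hand the whole subtree to one in-memory task.
    if node.num_records <= in_memory_limit:
        return "in_memory_grow"
    # Otherwise find a single split with a distributed job.
    return "find_best_split"

def plan_frontier(nodes, in_memory_limit, min_records):
    plan = {"leaf": [], "in_memory_grow": [], "find_best_split": []}
    for node in nodes:
        plan[plan_node(node, in_memory_limit, min_records)].append(node.node_id)
    return plan

frontier = [NodeInfo("B", 10, 0.0), NodeInfo("D", 45, 3.2), NodeInfo("C", 90, 5.1)]
plan = plan_frontier(frontier, in_memory_limit=50, min_records=5)
print(plan)  # {'leaf': ['B'], 'in_memory_grow': ['D'], 'find_best_split': ['C']}
```

Batching all nodes of the same kind into one job, as `plan_frontier` does, is one way to amortize the per-job overhead; whether PLANET batches exactly this way is not stated on the slide.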
12
Status page
13
Initialization MapReduce
• Identifies all the attribute values that need to be considered for splits
• Continuous attributes
  ◦ Compute an approximate equi-depth histogram*
  ◦ The boundary points of the histogram are used as potential splits
• Categorical attributes
  ◦ Identify the attribute's domain
• Generates an “attribute file” to be loaded in memory by other tasks
*G. S. Manku, S. Rajagopalan, and B. G. Lindsay, SIGMOD, 1999.
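To make the idea of equi-depth boundaries concrete, here is a minimal sketch that computes them exactly by sorting an in-memory list. It is not the approximate one-pass algorithm cited above, which is designed for data that does not fit in memory; `equi_depth_boundaries` and `categorical_domain` are hypothetical helper names.

```python
def equi_depth_boundaries(values, num_buckets):
    """Boundaries that split the sorted values into buckets of near-equal count."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = []
    for i in range(1, num_buckets):
        boundary = ordered[(i * n) // num_buckets]
        # Skip repeated boundaries caused by many equal values.
        if not boundaries or boundary != boundaries[-1]:
            boundaries.append(boundary)
    return boundaries

def categorical_domain(values):
    """Distinct values of a categorical attribute."""
    return sorted(set(values))

ages = [23, 35, 35, 41, 52, 18, 67, 29, 44, 30, 58, 39]
age_splits = equi_depth_boundaries(ages, num_buckets=4)
colors = categorical_domain(["red", "blue", "red", "green"])
print(age_splits)  # [30, 39, 52]
print(colors)      # ['blue', 'green', 'red']
```

Each boundary `s` becomes a candidate split of the form `attribute < s`, so the number of buckets bounds the number of splits later tasks have to evaluate per continuous attribute.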
14
FindBestSplit MapReduce
• MapReduce job to find the best split when there is too much data to fit in memory
• Mapper
  ◦ Initialize by loading the attribute file from the Initialization task and the current model file
  ◦ For each record, run the Map algorithm
  ◦ For each node, output to all reducers <Node.Id, <Sum Result, Sum Squared Result, Count>>
  ◦ For each split, output <Split.Id, <Sum Result, Sum Squared Result, Count>>
Map(data):
    Node = TraverseTree(data, Model)
    if Node to be grown:
        Node.stats.AddData(data)
        for feature in data:
            Split = FindSplitForValue(Node.Id, feature)
            Split.stats.AddData(data)
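A runnable sketch of the mapper side follows. It is an interpretation, not the original code: the model layout (a dict of `feature < threshold` splits), the `Stats` class, and `map_records` are assumptions. One deliberate simplification is labeled in the code: instead of mapping each value to a single split as `FindSplitForValue` does, the sketch updates every candidate threshold the value falls below, so each split's statistics directly describe its left branch.

```python
from collections import defaultdict

class Stats:
    """Running <sum, sum of squares, count> of the target values."""
    def __init__(self):
        self.total = 0.0
        self.total_sq = 0.0
        self.count = 0

    def add(self, y):
        self.total += y
        self.total_sq += y * y
        self.count += 1

    def as_tuple(self):
        return (self.total, self.total_sq, self.count)

def traverse_tree(model, x):
    """Follow split predicates from the root until reaching a node with no split."""
    node = model["root"]
    while node in model["splits"]:
        feature, threshold, left, right = model["splits"][node]
        node = left if x[feature] < threshold else right
    return node

def map_records(records, model, nodes_to_grow, candidate_splits):
    node_stats = defaultdict(Stats)
    split_stats = defaultdict(Stats)
    for x, y in records:
        node = traverse_tree(model, x)
        if node not in nodes_to_grow:
            continue
        node_stats[node].add(y)
        for feature, value in x.items():
            # Assumption: a record counts toward the left branch of every
            # candidate split "feature < s" whose threshold s exceeds its value.
            for s in candidate_splits[feature]:
                if value < s:
                    split_stats[(node, feature, s)].add(y)
    return node_stats, split_stats

model = {"root": "A", "splits": {"A": ("x1", 5.0, "B", "C")}}
records = [({"x1": 1.0, "x2": 10.0}, 0.5),
           ({"x1": 7.0, "x2": 2.0}, 1.0),
           ({"x1": 9.0, "x2": 8.0}, 3.0)]
node_stats, split_stats = map_records(
    records, model, nodes_to_grow={"C"}, candidate_splits={"x1": [8.0], "x2": [5.0]})
print(node_stats["C"].as_tuple())  # (4.0, 10.0, 2)
```

Because only sums, sums of squares, and counts are emitted, the mapper output size depends on the number of nodes and candidate splits, not on the number of records.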
15
FindBestSplit MapReduce
• MapReduce job to find the best split when there is too much data to fit in memory
• Reducer (Continuous Attributes)
  ◦ Load in all the <Node_Id, List<Sum Result, Sum Squared Result, Count>> pairs and aggregate the per-node statistics
  ◦ For each <Split_Id, List<Sum Result, Sum Squared Result, Count>>, run the Reduce algorithm
  ◦ For each Node_Id, output the best split found
Reduce(Split_Id, values):
    split = NewSplit(Split_Id)
    best = FindBestSplitSoFar(split.Node.Id)
    for stats in values:
        split.stats.AddStats(stats)
    left = ComputeImpurity(split.stats)
    right = ComputeImpurity(split.Node.stats - split.stats)
    split.impurity = left + right
    if split.impurity < best.impurity:
        UpdateBestSplit(split.Node.Id, split)
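The reducer logic can be sketched with plain tuples. The slide does not define `ComputeImpurity`; this sketch assumes the sum of squared errors around the mean, which is computable from <sum, sum of squares, count> alone. `sum_squared_error`, `merge`, and `best_split` are hypothetical names, and each split's statistics are assumed to describe its left branch.

```python
def sum_squared_error(total, total_sq, count):
    """Sum of squared deviations from the mean, from <sum, sum of squares, count>."""
    if count == 0:
        return 0.0
    return total_sq - total * total / count

def merge(stats_list):
    """Add up partial <sum, sum of squares, count> tuples from several mappers."""
    return tuple(sum(column) for column in zip(*stats_list))

def best_split(node_stats, split_partials):
    """split_partials maps split id -> non-empty list of left-branch partial stats."""
    best_id, best_impurity = None, float("inf")
    for split_id, partials in split_partials.items():
        left = merge(partials)
        # Right-branch statistics are the node totals minus the left branch.
        right = tuple(n - l for n, l in zip(node_stats, left))
        if left[2] == 0 or right[2] == 0:
            continue  # skip splits that leave one side empty
        impurity = sum_squared_error(*left) + sum_squared_error(*right)
        if impurity < best_impurity:
            best_id, best_impurity = split_id, impurity
    return best_id, best_impurity

# Targets 1, 1, 5, 5 for feature values 1, 2, 3, 4 at one node.
node_stats = (12.0, 52.0, 4)
split_partials = {
    "x1<2": [(1.0, 1.0, 1)],
    "x1<3": [(1.0, 1.0, 1), (1.0, 1.0, 1)],   # two mappers each saw one record
    "x1<4": [(7.0, 27.0, 3)],
}
best_id, best_impurity = best_split(node_stats, split_partials)
print(best_id, best_impurity)  # x1<3 0.0
```

Subtracting the left statistics from the node totals is what lets the mappers emit only one set of statistics per split.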
16
FindBestSplit MapReduce
• MapReduce job to find the best split when there is too much data to fit in memory
• Reducer (Categorical Attributes)
• Modification to the Reduce algorithm:
  ◦ Compute the aggregate stats for each individual value
  ◦ Sort the values by average target value
  ◦ Iterate through the list and find the optimal subsequence in the list*
*L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. 1984.
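The three steps above can be sketched as follows, reading "optimal subsequence" as the best prefix of the list sorted by mean target, which is how the cited result is usually applied to regression. `best_categorical_split` is a hypothetical name, impurity is again assumed to be the sum of squared errors, and every category is assumed to have a nonzero count.

```python
def best_categorical_split(value_stats):
    """value_stats maps category -> (sum, sum of squares, count) of the targets."""
    def sse(total, total_sq, count):
        return total_sq - total * total / count if count else 0.0

    # Step 2: order categories by their average target value.
    ordered = sorted(value_stats, key=lambda v: value_stats[v][0] / value_stats[v][2])
    node = tuple(sum(column) for column in zip(*value_stats.values()))
    left = (0.0, 0.0, 0)
    best_subset, best_impurity = None, float("inf")
    # Step 3: try each prefix of the ordered list as the left branch.
    for i, value in enumerate(ordered[:-1]):
        left = tuple(a + b for a, b in zip(left, value_stats[value]))
        right = tuple(n - l for n, l in zip(node, left))
        impurity = sse(*left) + sse(*right)
        if impurity < best_impurity:
            best_subset, best_impurity = set(ordered[: i + 1]), impurity
    return best_subset, best_impurity

# Step 1 is assumed done: per-category aggregates of the target.
value_stats = {
    "red": (10.0, 50.0, 2),    # mean 5.0
    "blue": (2.0, 2.0, 2),     # mean 1.0
    "green": (9.0, 41.0, 2),   # mean 4.5
}
subset, impurity = best_categorical_split(value_stats)
print(subset, impurity)  # {'blue'} 0.75
```

Sorting first means only k - 1 candidate partitions are examined for k categories, instead of every possible subset.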
17
InMemoryGrow MapReduce
• Task to grow an entire subtree once the data for it fits in memory
• Mapper
  ◦ Initialize by loading the current model file
  ◦ For each record, identify the node it falls under and, if that node is to be grown, output <Node_Id, Record>
• Reducer
  ◦ Initialize by loading the attribute file from the Initialization task
  ◦ For each <Node_Id, List<Record>>, run the basic tree growing algorithm on the records
  ◦ Output the best splits for each node in the subtree
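A compact sketch of this task: the "mapper" routes each record to its node, and the "reducer" grows a regression subtree recursively on each group. The names `grow` and `in_memory_grow` are hypothetical, and two simplifications are assumed: candidate thresholds come from the values observed in the group rather than from the attribute file, and impurity is the sum of squared errors.

```python
from collections import defaultdict

def sse(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def grow(records, min_records=2):
    """Recursively grow a regression subtree from in-memory (features, target) records."""
    ys = [y for _, y in records]
    best = None
    if len(records) >= min_records and sse(ys) > 0:
        for feature in records[0][0]:
            # Skipping the smallest value keeps both branches non-empty.
            for threshold in sorted({x[feature] for x, _ in records})[1:]:
                left = [r for r in records if r[0][feature] < threshold]
                right = [r for r in records if r[0][feature] >= threshold]
                cost = sse([y for _, y in left]) + sse([y for _, y in right])
                if best is None or cost < best[0]:
                    best = (cost, feature, threshold, left, right)
    if best is None:
        return {"predict": sum(ys) / len(ys)}
    _, feature, threshold, left, right = best
    return {"feature": feature, "threshold": threshold,
            "left": grow(left, min_records), "right": grow(right, min_records)}

def in_memory_grow(records, assign_node, nodes_to_grow):
    by_node = defaultdict(list)
    for record in records:                 # mapper: emit <node id, record>
        node = assign_node(record)
        if node in nodes_to_grow:
            by_node[node].append(record)
    # reducer: one complete subtree per node id
    return {node: grow(group) for node, group in by_node.items()}

records = [({"x": 1.0}, 1.0), ({"x": 2.0}, 1.0), ({"x": 3.0}, 5.0), ({"x": 4.0}, 5.0)]
subtrees = in_memory_grow(records, assign_node=lambda r: "D", nodes_to_grow={"D"})
print(subtrees["D"]["threshold"])  # 3.0
```

One job of this kind can finish many small subtrees at once, which avoids launching a separate distributed job for every level of every small node.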
18
Ensembles
• Bagging
  ◦ Construct multiple trees in parallel, each on a sample of the data
  ◦ Sampling without replacement is easy to implement on the Mapper side for both types of MapReduce tasks
    • Compute a hash of <Tree_Id, Record_Id> and, if it is below a threshold, include the record in the sample
  ◦ Get results by combining the output of the trees
• Boosting
  ◦ Construct multiple trees in series, each on a sample of the data*
  ◦ Modify the target of each record to be the residual between the target and the model's prediction for the record
    • For regression, the residual z is the target y minus the model prediction F(x)
    • For classification, z = y - 1 / (1 + exp(-F(x)))
  ◦ Get results by combining the output from each tree
*J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 2001.
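Both mapper-side tricks fit in a few lines. The sketch below is illustrative: the slide does not name a hash function, so SHA-256 is an assumption, and `in_sample`, `regression_residual`, and `classification_residual` are hypothetical names. The residual formulas are the ones given on the slide.

```python
import hashlib
import math

def in_sample(tree_id, record_id, rate):
    """Deterministically decide whether a record belongs to a tree's sample."""
    digest = hashlib.sha256(f"{tree_id}:{record_id}".encode()).digest()
    # Compare a 64-bit hash value against the threshold rate * 2**64.
    return int.from_bytes(digest[:8], "big") < rate * 2**64

def regression_residual(y, prediction):
    """z = y - F(x)"""
    return y - prediction

def classification_residual(y, score):
    """z = y - 1 / (1 + exp(-F(x))), with y in {0, 1}"""
    return y - 1.0 / (1.0 + math.exp(-score))

# Every mapper reaches the same decision for the same <tree, record> pair,
# so no coordination is needed to keep a tree's sample consistent across jobs.
kept = sum(in_sample(1, record_id, 0.3) for record_id in range(10000))
print(classification_residual(1, 0.0))  # 0.5
```

Because the decision depends only on the pair of ids, each record is either in or out of a given tree's sample, which is why this scheme gives sampling without replacement.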
19
Performance Issues
• Setup and teardown
  ◦ Per-MapReduce overhead is significant for large forests or deep trees
  ◦ Reduce teardown cost by polling for output instead of waiting for a task to return
  ◦ Reduce startup cost through forward scheduling
    • Maintain a set of live MapReduce jobs and assign them tasks instead of starting new jobs from scratch
• Categorical attributes
  ◦ The basic implementation stored and tracked these as strings
    • This made traversing the tree expensive
  ◦ Improved latency by instead using fingerprints of these values
• Very high-dimensional data
  ◦ If the number of splits is too large, the Mapper might run out of memory
  ◦ Instead of defining split tasks as a set of nodes to grow, define them as a set of nodes to grow and a set of attributes to explore
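The fingerprint idea for categorical attributes can be illustrated as below. The slide does not say which fingerprint function was used; this sketch assumes a 64-bit BLAKE2b digest from Python's standard library, and `fingerprint` and `encode_record` are hypothetical helpers. As with any hash, distinct strings can in principle collide.

```python
import hashlib

def fingerprint(value):
    """Map a categorical string to a 64-bit integer."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def encode_record(record, categorical_features):
    """Replace categorical string values by their fingerprints."""
    return {name: fingerprint(v) if name in categorical_features else v
            for name, v in record.items()}

# A split predicate such as "country in {US, CA}" stored as a set of integers.
left_values = {fingerprint("US"), fingerprint("CA")}
record = encode_record({"country": "CA", "clicks": 12}, {"country"})
goes_left = record["country"] in left_values
print(goes_left)  # True
```

Tree traversal then compares fixed-size integers instead of variable-length strings at every categorical split.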
20
Results
21
Conclusions
• Large-scale learning is increasingly important
• Computing infrastructures like MapReduce can be leveraged for large-scale learning
• PLANET scales efficiently with larger datasets and complex models
• Future work
  ◦ Adding support for sampling with replacement
  ◦ Categorical attributes with large domains
    • Might run out of memory
  ◦ Only support splitting on single values
  ◦ Area for future exploration
22
Thank You!
Q&A

More Related Content

What's hot

master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19Hyun Wong Choi
 
defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18Hyun Wong Choi
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19Hyun Wong Choi
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component AnalysisMason Ziemer
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
presentation 2019 04_09_rev1
presentation 2019 04_09_rev1presentation 2019 04_09_rev1
presentation 2019 04_09_rev1Hyun Wong Choi
 
Decision tree clustering a columnstores tuple reconstruction
Decision tree clustering  a columnstores tuple reconstructionDecision tree clustering  a columnstores tuple reconstruction
Decision tree clustering a columnstores tuple reconstructioncsandit
 
Size measurement and estimation
Size measurement and estimationSize measurement and estimation
Size measurement and estimationLouis A. Poulin
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnSarah Guido
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Hyun Wong Choi
 

What's hot (17)

master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19
 
defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18defense hyun-wong choi_2019_05_14_rev18
defense hyun-wong choi_2019_05_14_rev18
 
master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19master defense hyun-wong choi_2019_05_14_rev19
master defense hyun-wong choi_2019_05_14_rev19
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
presentation 2019 04_09_rev1
presentation 2019 04_09_rev1presentation 2019 04_09_rev1
presentation 2019 04_09_rev1
 
Decision tree clustering a columnstores tuple reconstruction
Decision tree clustering  a columnstores tuple reconstructionDecision tree clustering  a columnstores tuple reconstruction
Decision tree clustering a columnstores tuple reconstruction
 
Size measurement and estimation
Size measurement and estimationSize measurement and estimation
Size measurement and estimation
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
K-means Clustering with Scikit-Learn
K-means Clustering with Scikit-LearnK-means Clustering with Scikit-Learn
K-means Clustering with Scikit-Learn
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
SQL Server 2008 Spatial Data - Getting Started
SQL Server 2008 Spatial Data - Getting StartedSQL Server 2008 Spatial Data - Getting Started
SQL Server 2008 Spatial Data - Getting Started
 
T180304125129
T180304125129T180304125129
T180304125129
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2
 
Tractor Pulling on Data Warehouse
Tractor Pulling on Data WarehouseTractor Pulling on Data Warehouse
Tractor Pulling on Data Warehouse
 

Similar to Planet

Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLMLconf
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화NAVER Engineering
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkDB Tsai
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesIntuit Inc.
 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges finalRajesh M
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & HadoopAhmed Gamil
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce Sina Ebrahimi
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...Victor Giannakouris
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkDatio Big Data
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2yannabraham
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Sri Ambati
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 

Similar to Planet (20)

Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATLSandy Ryza – Software Engineer, Cloudera at MLconf ATL
Sandy Ryza – Software Engineer, Cloudera at MLconf ATL
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using EnsemblesStrata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles
 
Pandas data transformational data structure patterns and challenges final
Pandas   data transformational data structure patterns and challenges  finalPandas   data transformational data structure patterns and challenges  final
Pandas data transformational data structure patterns and challenges final
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and Map...
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Hadoop
HadoopHadoop
Hadoop
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 

Recently uploaded

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 

Recently uploaded (20)

Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

Planet

  • 1. 1 PLANET Massively Parallel Learning of Tree Ensembles with MapReduce Joshua Herbach* Google Inc., AdWords *Joint work with Biswanath Panda, Sugato Basu, Roberto J. Bayardo
  • 2. 2 Outline • PLANET – infrastructure for building trees • Decision trees • Usage and motivation • MapReduce • PLANET details • Results • Future Work
  • 3. .42 3 Tree Models • Classic data mining model • Interpretable • Good when built with ensemble techniques like bagging and boosting |D|=100 |D|=10 |D|=90 |D|=45|D|=45 |D|=20 |D|=30 |D|=15 |D|=25 A C D E F G H I X1 < v1 X2 є {v2, v3}
  • 4. |D|=100 |D|=10 |D|=90 |D|=45|D|=45 |D|=20 |D|=30 |D|=15 |D|=25 A B C D E F G H I Construction X1 < v1 X2 є {v2, v3} .42 4
  • 6. 6 Trees at Google • Large Datasets  Iterating through a large dataset (10s, 100s, or 1000s of GB) is slow  Computing values based on the records in a large dataset is really slow • Parallelism!  Break up dataset across many processing units and then combine results  Super computers with specialized parallel hardware to support high throughput are expensive  Computers made from commodity hardware are cheap • Enter MapReduce
  • 7. 7 MapReduce* *http://labs.google.com/papers/mapreduce.html Input 1 Mappers Input 2 Input 3 Input 4 Input 5 Key A Value Key B Value Key A Value Key B Value Key C Value Key B Value Key C Value Reducers Output 1 Output 2 Can use a secondary key to control ordering reducers see key-value pairs
  • 8. 8 PLANET • Parallel Learner for Assembling Numerous Ensemble Trees • PLANET is a learner for training decision trees that is built on MapReduce  Regression models (or classification using logistic regression)  Supports boosting, bagging and combinations thereof  Scales to very large datasets
  • 9. 9 System Components • Master  Monitors and controls everything • MapReduce Initialization Task  Identifies all the attribute values which need to be considered for splits • MapReduce FindBestSplit Task  MapReduce job to find best split when there is too much data to fit in memory • MapReduce InMemoryGrow Task  Task to grow an entire subtree once the data for it fits in memory • Model File  A file describing the state of the model
  • 10. 10 Architecture Master Input Data Initialization MapReduce Attribute Metadata Model FindBestSplit MapReduce Intermediate Results |D|=100 A |D|=10 |D|=90C X1 < v1 .42 |D|=45|D|=45 D E X2 є {v2, v3} |D|=20 |D|=15 |D|=25 F G H I InMemory MapReduce
  • 11. 11 Master • Controls the entire process • Determines the state of the tree and grows it  Decides if nodes should be leaves  If there is relatively little data entering a node; launch an InMemory MapReduce job to grow the entire subtree  For larger nodes, launches a MapReduce job to find candidate best splits  Collects results from MapReduce jobs and chooses the best split for a node  Updates Model • Periodically checkpoints system • Maintains status page for monitoring
  • 13. 13 Initialization MapReduce • Identifies all the attribute values which need to be considered for splits • Continuous attributes  Compute an approximate equi-depth histogram*  Boundary points of histogram used for potential splits • Categorical attributes  Identify attribute's domain • Generates an “attribute file” to be loaded in memory by other tasks *G. S. Manku, S. Rajagopalan, and B. G. Lindsay, SIGMOD, 1999.
  • 14. 14 FindBestSplit MapReduce • MapReduce job to find best split when there is too much data to fit in memory • Mapper  Initialize by loading attribute file from Initialization task and current model file  For each record run the Map algorithm  For each node output to all reducers <Node.Id, <Sum Result, Sum Squared Result, Count>>  For each split output <Split.Id, <Sum Result, Sum Squared Result, Count>> Map(data): Node = TraverseTree(data, Model) if Node to be grown: Node.stats.AddData(data) for feature in data: Split = FindSplitForValue(Node.Id, feature) Split.stats.AddData(data)
  • 15. FindBestSplit MapReduce
      • MapReduce job to find the best split when there is too much data to fit in memory
      • Reducer (Continuous Attributes)
          - Load all the <Node_Id, List<Sum Result, Sum Squared Result, Count>> pairs and aggregate the per-node statistics
          - For each <Split_Id, List<Sum Result, Sum Squared Result, Count>>, run the Reduce algorithm
          - For each Node_Id, output the best split found

      Reduce(Split_Id, values):
          split = NewSplit(Split_Id)
          best = FindBestSplitSoFar(split.node.id)
          for stats in values:
              split.stats.AddStats(stats)
          left = ComputeImpurity(split.stats)
          right = ComputeImpurity(split.node.stats - split.stats)
          split.impurity = left + right
          if split.impurity < best.impurity:
              UpdateBestSplit(split.node.id, split)
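The Reduce step works because an impurity can be computed from the three aggregated numbers alone. The sketch below assumes the impurity is the sum of squared deviations from the mean (the slide only names `ComputeImpurity` without defining it); `impurity` and `best_split` are hypothetical names.

```python
def impurity(total, total_sq, count):
    """Sum of squared deviations from the mean, from (sum, sum of squares, count)."""
    if count == 0:
        return 0.0
    return total_sq - (total * total) / count


def best_split(node_stats, split_stats):
    """Return (split_id, impurity) with the lowest left + right impurity.

    `node_stats` is the (sum, sum_sq, count) triple for the whole node;
    `split_stats` maps a split id to the same triple for its left branch.
    The right branch is obtained by subtraction, as in the slide's pseudocode.
    """
    best_id, best_value = None, float("inf")
    for split_id, (s, sq, c) in split_stats.items():
        right = (node_stats[0] - s, node_stats[1] - sq, node_stats[2] - c)
        if c == 0 or right[2] == 0:
            continue  # a split that sends every record one way is not useful
        value = impurity(s, sq, c) + impurity(*right)
        if value < best_value:
            best_id, best_value = split_id, value
    return best_id, best_value


# Example: targets [1, 1, 5, 5]; split "A" separates the 1s from the 5s,
# split "B" puts only the first record on the left.
winner, score = best_split((12.0, 52.0, 4), {"A": (2.0, 2.0, 2), "B": (1.0, 1.0, 1)})
```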
  • 16. FindBestSplit MapReduce
      • MapReduce job to find the best split when there is too much data to fit in memory
      • Reducer (Categorical Attributes)
          - Modification to the Reduce algorithm:
              - Compute the aggregate stats for each individual value
              - Sort values by average target value
              - Iterate through the list and find the optimal subsequence*
      *L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. 1984.
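A sketch of the categorical variant follows. It reads "optimal subsequence" as the best prefix of the values sorted by mean target, which is an interpretation of the slide rather than a confirmed detail; the function name, the input layout, and the squared-error impurity are assumptions, and every value is assumed to have a nonzero count.

```python
def best_categorical_subset(value_stats):
    """Find the best left-branch subset of categorical values for regression.

    `value_stats` maps each categorical value to the (sum, sum_sq, count) of
    the target over records with that value. Values are sorted by mean target
    and only prefixes of that order are evaluated.
    """
    def sse(s, sq, c):
        return sq - s * s / c if c else 0.0

    ordered = sorted(value_stats, key=lambda v: value_stats[v][0] / value_stats[v][2])
    total = [sum(value_stats[v][i] for v in ordered) for i in range(3)]
    left = [0.0, 0.0, 0]
    best_subset, best_value = None, float("inf")
    for k, v in enumerate(ordered[:-1]):  # keep at least one value on the right
        left = [left[i] + value_stats[v][i] for i in range(3)]
        right = [total[i] - left[i] for i in range(3)]
        value = sse(*left) + sse(*right)
        if value < best_value:
            best_subset, best_value = set(ordered[: k + 1]), value
    return best_subset, best_value


# Example: value "a" has targets [1, 1], "b" has [5, 5], "c" has [2].
subset, score = best_categorical_subset(
    {"a": (2.0, 2.0, 2), "b": (10.0, 50.0, 2), "c": (2.0, 4.0, 1)}
)
```

Scanning prefixes of the sorted list evaluates one candidate per value instead of one per subset of values, which is what makes the categorical case tractable in a single Reduce pass.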
  • 17. InMemoryGrow MapReduce
      • Task to grow an entire subtree once the data for it fits in memory
      • Mapper
          - Initialize by loading the current model file
          - For each record, identify the node it falls under and, if that node is to be grown, output <Node_Id, Record>
      • Reducer
          - Initialize by loading the attribute file from the Initialization task
          - For each <Node_Id, List<Record>>, run the basic tree-growing algorithm on the records
          - Output the best splits for each node in the subtree
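The Mapper and shuffle stage of this job amount to routing each record to the node it belongs to. A minimal sketch, assuming a caller-supplied `find_node` function that stands in for traversing the current model (both names are hypothetical):

```python
from collections import defaultdict


def route_records(records, find_node, nodes_to_grow):
    """Group records by the id of the node they reach, keeping only nodes to grow.

    `find_node` stands in for traversing the current model; the returned dict
    mirrors the <Node_Id, List<Record>> input that each Reducer receives.
    """
    grouped = defaultdict(list)
    for record in records:
        node_id = find_node(record)
        if node_id in nodes_to_grow:
            grouped[node_id].append(record)
    return dict(grouped)


# Example: a one-split model sends x < 3 to node "L" and the rest to "R";
# only "L" is small enough to be grown in memory.
grouped = route_records(
    [{"x": 1}, {"x": 2}, {"x": 7}],
    find_node=lambda r: "L" if r["x"] < 3 else "R",
    nodes_to_grow={"L"},
)
```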
  • 18. Ensembles
      • Bagging
          - Construct multiple trees in parallel, each on a sample of the data
          - Sampling without replacement is easy to implement on the Mapper side for both types of MapReduce tasks
              - Compute a hash of <Tree_Id, Record_Id>; if it is below a threshold, include the record in the sample
          - Get results by combining the output of the trees
      • Boosting
          - Construct multiple trees in series, each on a sample of the data*
          - Modify the target of each record to be the residual of the target and the model's prediction for the record
              - For regression, the residual z is the target y minus the model prediction F(x)
              - For classification, z = y - 1 / (1 + exp(-F(x)))
          - Get results by combining the output from each tree
      *J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 2001.
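Both ensemble mechanisms reduce to small per-record computations in the Mapper. The sketch below is illustrative only: the choice of MD5 as the hash, the key encoding, and the function names `in_sample` and `residual` are assumptions; the residual formulas are the ones given on the slide.

```python
import hashlib
import math


def in_sample(tree_id, record_id, rate):
    """Deterministically decide whether a record belongs to a tree's sample.

    Hashes (tree_id, record_id) to a number in [0, 1) and keeps the record
    when it falls below `rate`. MD5 is an arbitrary choice for this sketch.
    """
    digest = hashlib.md5(f"{tree_id}:{record_id}".encode("utf-8")).digest()
    # Use the top 53 bits so the fraction is an exact float strictly below 1.
    fraction = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return fraction < rate


def residual(y, prediction, classification=False):
    """Boosting target: y - F(x) for regression, y - 1 / (1 + exp(-F(x))) otherwise."""
    if classification:
        return y - 1.0 / (1.0 + math.exp(-prediction))
    return y - prediction


# Example: the same (tree, record) pair always gets the same decision,
# so every mapper agrees on the sample without any coordination.
decision = in_sample(tree_id=1, record_id=42, rate=0.5)
```

Hashing on the tree id as well as the record id gives each tree a different sample while keeping the decision reproducible across MapReduce jobs.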
  • 19. Performance Issues
      • Set-up and tear-down
          - Per-MapReduce overhead is significant for large forests or deep trees
          - Reduce tear-down cost by polling for output instead of waiting for a task to return
          - Reduce start-up cost through forward scheduling
              - Maintain a set of live MapReduce jobs and assign them tasks instead of starting new jobs from scratch
      • Categorical attributes
          - The basic implementation stored and tracked these as strings
              - This made traversing the tree expensive
          - Improved latency by considering fingerprints of these values instead
      • Very high-dimensional data
          - If the number of splits is too large, the Mapper might run out of memory
          - Instead of defining split tasks as a set of nodes to grow, define them as a set of nodes to grow and a set of attributes to explore
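The fingerprinting idea for categorical values can be illustrated as below. The slide does not say which fingerprint function PLANET uses; truncating a SHA-1 digest to 64 bits is an assumption made purely for this sketch, as is the name `fingerprint`.

```python
import hashlib


def fingerprint(value):
    """Map a categorical string to a 64-bit integer fingerprint.

    Truncated SHA-1 is an arbitrary stand-in for whatever fingerprint
    function the real system uses.
    """
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


# Example: store a split's value set as integers, so membership tests during
# tree traversal compare fixed-size integers instead of strings.
left_values = {fingerprint(v) for v in ("red", "green")}
goes_left = fingerprint("red") in left_values
```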
  • 21. Conclusions
      • Large-scale learning is increasingly important
      • Computing infrastructures like MapReduce can be leveraged for large-scale learning
      • PLANET scales efficiently with larger datasets and complex models
      • Future work
          - Adding support for sampling with replacement
          - Categorical attributes with large domains
              - Might run out of memory
          - Only support splitting on single values
          - Area for future exploration
  • 22. Thank You! Q&A