VHT: Vertical Hoeffding Tree
Nicolas Kourtellis
Telefonica I+D
Gianmarco De Francisci Morales
QCRI
Albert Bifet
Telecom ParisTech
Arinto Murdopo
LARC-SMU
Decision Trees (DT)
Easy to visualize and understand
Fast to predict new instances
Can model non-linear relationships
Constructed using data batches
Scans data multiple times
Optimal Tree? NP-complete…
Greedy heuristics to build them
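A minimal sketch of one common greedy heuristic, information gain, which scores each attribute by how much it reduces class entropy. The slides do not say which criterion is used; the function names and the toy data below are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): attribute "a" separates the classes, "b" does not.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["neg", "neg", "pos", "pos"]
best = max(["a", "b"], key=lambda name: information_gain(rows, labels, name))
```

A greedy builder would split on `best` and recurse on each partition, which is why a batch learner has to scan the data several times.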
Big data anyone?
Present of big data
Too big to handle
Future of big data
Drinking from a firehose
DT + Streaming
Data come one example at a time with speed
Tree must be modified incrementally
VFDT with Hoeffding bound for guarantees
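As a rough illustration of the guarantee, the sketch below evaluates the Hoeffding bound and inverts it to estimate how many instances a leaf must see before a given gap between the two best attributes becomes significant. The values R = 1 and δ = 1e-7 are illustrative assumptions, not settings taken from the slides.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def instances_needed(value_range, delta, gap):
    """Smallest n for which epsilon <= gap (the bound solved for n)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2))

# Illustrative numbers only: range R = 1, confidence 1 - 1e-7, gain gap 0.05.
n_min = instances_needed(1.0, 1e-7, 0.05)
eps_at_n_min = hoeffding_bound(1.0, 1e-7, n_min)
```

Because ε shrinks as 1/sqrt(n), halving the gap that must be resolved roughly quadruples the number of instances a leaf has to observe.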
DT + Streaming + Distributed
Tree construction & maintenance
distributed across machines
How?
Task parallelism
Horizontal parallelism
Vertical parallelism
Horizontal Parallelism
Independent instances processed in isolation
Instances distributed randomly to machines
Same attribute counters exist multiple times
Memory for model grows linearly with the parallelism
Split criterion centrally computed after partial counters aggregated
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,”
The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.
[Figure: horizontal parallelism. Labels: Stream, Instances, Stats (three processors), Histograms, Model, Model Updates]
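The horizontal scheme above can be sketched as follows, assuming plain (attribute, value, class) counters rather than the histograms of the cited algorithm. Function names and the toy stream are hypothetical.

```python
import random
from collections import Counter

def horizontal_counts(instances, parallelism, seed=0):
    """Send each whole instance to a random worker; every worker keeps
    its own copy of the (attribute, value, class) counters."""
    rng = random.Random(seed)
    workers = [Counter() for _ in range(parallelism)]
    for attributes, label in instances:
        worker = workers[rng.randrange(parallelism)]
        for name, value in attributes.items():
            worker[(name, value, label)] += 1
    return workers

def aggregate(workers):
    """Central step: merge partial counters before evaluating a split."""
    total = Counter()
    for partial in workers:
        total.update(partial)
    return total

# Toy stream (hypothetical): ten identical instances, four workers.
stream = [({"a": 1, "b": 0}, "pos")] * 10
partials = horizontal_counts(stream, parallelism=4)
merged = aggregate(partials)
```

The same counter key can appear in every worker, which is the memory growth the slide refers to, and no split can be evaluated until `aggregate` has run.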
Vertical Parallelism
• Independent attributes processed in isolation
• Instances must be transformed in column-format
• Attributes distributed consistently to same machine
• Attribute counters exist only once
• Memory for model same as sequential version
• Split criterion computed in parallel
[Figure: vertical parallelism. Labels: Stream, Model, Attributes, Stats (three processors), Splits]
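A minimal sketch of the vertical scheme, assuming attributes are routed by a deterministic hash of their name (a stand-in for key grouping). The `owner` function and the toy stream are hypothetical.

```python
import zlib
from collections import Counter

def owner(attribute_name, parallelism):
    """Deterministic routing: the same attribute always reaches the same worker."""
    return zlib.crc32(attribute_name.encode("utf-8")) % parallelism

def vertical_counts(instances, parallelism):
    """Decompose each instance into attributes and route each one by key,
    so every (attribute, value, class) counter lives on exactly one worker."""
    workers = [Counter() for _ in range(parallelism)]
    for attributes, label in instances:
        for name, value in attributes.items():
            workers[owner(name, parallelism)][(name, value, label)] += 1
    return workers

# Toy stream (hypothetical): ten identical instances with three attributes.
stream = [({"a": 1, "b": 0, "c": 1}, "pos")] * 10
workers = vertical_counts(stream, parallelism=2)
holders = [sum(1 for w in workers if (name, value, "pos") in w)
           for name, value in [("a", 1), ("b", 0), ("c", 1)]]
```

Each worker can rank its own attributes without seeing the others, so total counter memory matches the sequential tree regardless of the parallelism.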
Algorithm
[Figure: VHT topology. Processors: Source (n), Model (n), Stats (n), Evaluator (1); streams: Instance, Attribute, Control, Split, Result; groupings: Shuffle Grouping, Key Grouping, All Grouping]
Model Aggregator: Receive(local result, […])
[…] result is a local-result content event
[…] tree is the current state of the decision tree in model-[…]
[…] leaf l from the list of splitting leaves
[…] and X_b in the splitting leaf l with X_a^local and X_b^local […] result
if […] results from all local-statistic PIs received or time out […] then
    […] Hoeffding bound $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2 n_l}}$
    if […] and ($G_l(X_a) - G_l(X_b) > \epsilon$ or $\epsilon < \tau$) then
        […] l with a split-node on X_a
        for […] branches of the split do
            […] new leaf with derived sufficient statistic from split node
        end for
        Send drop content event with id of leaf l to all local-statistic PIs
    end if
end if
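The model aggregator's decision above can be sketched as below. The tuple layout of a local result, the default values for R, δ and τ, and the function names are assumptions for illustration; this is not the SAMOA implementation.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def decide_split(local_results, n_leaf, value_range=1.0, delta=1e-7, tau=0.05):
    """Merge the best and second-best attributes reported by each
    local-statistic processor, then apply the Hoeffding test.

    local_results: list of (best_attr, best_gain, second_attr, second_gain),
    a hypothetical layout. Returns the attribute to split on, or None.
    """
    gains = {}
    for best_attr, best_gain, second_attr, second_gain in local_results:
        gains[best_attr] = best_gain
        gains[second_attr] = second_gain
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    (attr_a, gain_a), (_, gain_b) = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n_leaf)
    # Split when the leader is significantly ahead, or when the bound is so
    # tight (eps < tau) that the top candidates are effectively tied.
    if gain_a - gain_b > eps or eps < tau:
        return attr_a
    return None

# Two local-statistic processors report their top-two attributes (toy values).
results = [("x1", 0.40, "x2", 0.10), ("x3", 0.25, "x4", 0.05)]
late = decide_split(results, n_leaf=5000)
early = decide_split(results, n_leaf=100)
```

With few instances at the leaf the bound is loose and the split is deferred; once enough instances arrive, the same gains are sufficient to split.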
VHT Optimizations
Optimistic split execution
Use instances during split decision (in case no split)
Instance buffering
Keep instances at model for replay (in case of split)
Timeout before model decides to split
Model replication
Remove bottleneck of aggregation in single model
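The instance-buffering optimization above could look roughly like the sketch below: while a split decision is pending, the model keeps recent instances for the leaf and, if the leaf is split, replays them to the new branches. The class name, buffer size and instance layout are hypothetical.

```python
from collections import deque

class PendingLeaf:
    """Buffers instances that reach a leaf while its split is being decided."""

    def __init__(self, max_buffer=1000):
        # Bounded buffer: the oldest instances are dropped first.
        self.buffer = deque(maxlen=max_buffer)

    def observe(self, attributes, label):
        self.buffer.append((attributes, label))

    def resolve(self, split_attribute):
        """Return {branch value: instances to replay}; empty if no split."""
        replay = {}
        if split_attribute is not None:
            for attributes, label in self.buffer:
                branch = attributes.get(split_attribute)
                replay.setdefault(branch, []).append((attributes, label))
        self.buffer.clear()
        return replay

leaf = PendingLeaf(max_buffer=3)
for value in [0, 1, 1, 0]:
    leaf.observe({"a": value}, "pos")
replayed = leaf.resolve("a")
```

The bound on the buffer is a design choice made here to cap memory at the model; the slides do not state how many instances are kept.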
SAMOA Architecture
[Figure: architecture layers: Machine Learning Algorithms; SAMOA; Distributed Stream Processing Engines (labels shown: Flink, Apex)]
Scalable Advanced Massive Online Analysis
• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
• No system downtime
• No complex backup/update process
• No need to select update frequency
Experimental Setup: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative (Gaussian random variable)
10 different seeded runs
Test every 100k instances
MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
More experiments on dense instances in paper!
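A minimal sketch of how sparse bag-of-words instances with the skew listed above might be generated. Only the word sampling is shown; the fixed length of 15 words, the helper names, and the omission of the class label (whose exact construction is not given in the slides) are assumptions.

```python
import random
from collections import Counter

def zipf_weights(vocabulary_size, skew=1.5):
    """Unnormalized Zipf weights: the word of rank r gets weight 1 / r^skew."""
    return [1.0 / (rank ** skew) for rank in range(1, vocabulary_size + 1)]

def sample_tweet(vocabulary_size, rng, skew=1.5, length=15):
    """Draw `length` word ids and return a sparse bag of words {word id: count}."""
    words = rng.choices(range(vocabulary_size),
                        weights=zipf_weights(vocabulary_size, skew),
                        k=length)
    return Counter(words)

rng = random.Random(42)
tweet = sample_tweet(vocabulary_size=1000, rng=rng)
```

Most of the 1000 attributes are absent from any one tweet, which is what makes these instances sparse compared with the dense setup.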
Local VHT vs. MOA HT
• Accuracy: Local VHT ≥ MOA HT
• Exec. time: extra overhead due to interfacing with the DSPE without scaling out
VHT vs. Horizontal HT
• Small drop in accuracy due to scaling and more attributes
• Always better than Hor. HT (more gains in dense instances)
VHT vs. Horizontal HT
• Up to 20x faster than MOA HT
• 5-10x faster than Hor. HT
• In dense instances, Hor. HT fails to run due to overhead
• Scaling out: not much impact
VHT Evolution
• Closely following MOA, better than Hor. HT
• Quickly captures best accuracy
Experimental Setup: Dense Instances
Random decision tree
Mixed categorical and numerical attributes: 10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
Local VHT vs. MOA HT
[Figure: % accuracy, local VHT vs. MOA HT. Panel "Dense attributes": x-axis nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panel "Sparse attributes": x-axis attributes (100, 1k, 10k). Series: local, moa]
VHT vs. Horizontal HT
[Figure: % accuracy vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panels: parallelism = 2, parallelism = 4. Series: sharding, wok, wk(0), wk(1k), wk(10k), local]
VHT: Vertical Hoeffding Tree
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com
Extra slides
What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream
mining algorithms
Framework for deploying algorithms on new distributed
stream processing engines
Taxonomy
Machine Learning
  Distributed
    Batch: Hadoop, Mahout
    Stream: S4, Storm, SAMOA
  Non Distributed
    Batch: R, WEKA, …
    Stream: MOA
Algorithms in SAMOA
Existing:
• Vertical Hoeffding Tree (classification)
• CluStream (clustering)
• Adaptive Model Rules (regression)
Pending:
• Distributed Naïve Bayes
• Stochastic Gradient Descent
• Adaptive + Boosting VHT
• Parallelized Gradient Boosted Decision Tree
• PARMA (frequent pattern mining)
• …
Check SAMOA Roadmap for more
Looking for contributors!
VHT Evolution

VHT: Stream Vertical Hoeffding Trees for Big Data
