VHT: Stream Vertical Hoeffding Trees for Big Data

VHT: Vertical Hoeffding Tree
Nicolas Kourtellis
Telefonica I+D
Gianmarco De Francisci Morales
QCRI
Albert Bifet
Telecom ParisTech
Arinto Murdopo
LARC-SMU

Decision Trees (DT)
Easy to visualize and understand
Fast to predict new instances
Can model non-linear relationships
Constructed using data batches
Scans data multiple times
Optimal Tree? NP-complete…
Greedy heuristics to build them

Big data anyone?
Present of big data: Too big to handle
Future of big data: Drinking from a firehose

DT + Streaming
Data arrive one example at a time, at high speed
Tree must be modified incrementally
VFDT with Hoeffding bound for guarantees
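
To make the Hoeffding bound concrete, here is a minimal Python sketch. It assumes the standard form of the bound, ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split criterion, δ is the allowed error probability and n is the number of instances seen at a leaf. The function names and the parameter values in the usage lines are illustrative assumptions, not taken from the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean lies
    within epsilon of the mean observed over n instances."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def instances_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound drops to epsilon or below
    (the bound solved for n)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Assumed values: a criterion with range 1.0 and delta = 1e-7.
    n = instances_needed(value_range=1.0, delta=1e-7, epsilon=0.05)
    print(n, hoeffding_bound(1.0, 1e-7, n))
```

The bound shrinks as 1/sqrt(n), so halving ε requires roughly four times as many instances at the leaf.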

DT + Streaming + Distributed
Tree construction & maintenance distributed across machines
How?
Task parallelism
Horizontal parallelism
Vertical parallelism

Task Parallelism

Horizontal Parallelism
Independent instances processed in isolation
Instances distributed randomly to machines
Same attribute counters exist multiple times
Memory for model grows linearly with the parallelism
Split criterion computed centrally after partial counters are aggregated
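
The points above can be sketched in Python. This is a simplified illustration, not the authors' implementation: instances are plain dicts mapping an attribute id to a value, each "machine" is a `Counter`, and the function names are hypothetical.

```python
import random
from collections import Counter


def update_counters(counters, instance, label):
    """Add one instance to a machine's attribute counters."""
    for attr, value in instance.items():
        counters[(attr, value, label)] += 1


def horizontal_counts(stream, parallelism, seed=0):
    """Send each instance to a random machine, then merge the partial counters."""
    rng = random.Random(seed)
    partials = [Counter() for _ in range(parallelism)]
    for instance, label in stream:
        update_counters(rng.choice(partials), instance, label)
    # Central aggregation: the split criterion would be computed on `merged`.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return partials, merged
```

The same counter key can appear on every machine, so the total number of stored counters grows with the parallelism even though the merged result matches the sequential one.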

Horizontal Parallelism
[Figure: horizontal parallelism. Labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates.]
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.

Vertical Parallelism
Independent attributes processed in isolation
Instances must be transformed into column format
Attributes distributed consistently to same machine
Attribute counters exist only once
Memory for model same as sequential version
Split criterion computed in parallel
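
A minimal Python sketch of the points above, under stated assumptions: instances are dicts keyed by integer attribute ids, an instance is broken into one event per attribute (the column format), and events are routed by `attr_id % parallelism`. The routing rule and all function names are illustrative choices, not the routing used by the authors.

```python
from collections import Counter


def to_attribute_events(leaf_id, instance, label):
    """Turn one instance into one (leaf, attribute, value, class) event per attribute."""
    return [(leaf_id, attr_id, value, label) for attr_id, value in instance.items()]


def worker_for(attr_id, parallelism):
    """Assumed routing rule: the same attribute always goes to the same worker."""
    return attr_id % parallelism


def vertical_counts(stream, parallelism, leaf_id=0):
    """Each worker keeps counters only for the attributes routed to it."""
    workers = [Counter() for _ in range(parallelism)]
    for instance, label in stream:
        for event in to_attribute_events(leaf_id, instance, label):
            workers[worker_for(event[1], parallelism)][event] += 1
    return workers
```

Because routing depends only on the attribute id, every counter lives on exactly one worker, and the total memory matches a sequential learner regardless of the parallelism.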

Vertical Parallelism
[Figure: vertical parallelism. Labels: Stream, Model, Attributes, Stats (x3), Splits.]

Algorithm
[Figure: VHT topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Attribute, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping.]

Model Aggregator: Receive(local_result, tree)
  Require: local_result is a local-result content event
  Require: tree is the current state of the decision tree in the model aggregator
  Get leaf l from the list of splitting leaves
  Update $X_a$ and $X_b$ in the splitting leaf l with $X_a^{local}$ and $X_b^{local}$ from local_result
  if local results from all local-statistic PIs received or time out then
    Compute Hoeffding bound $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2 n_l}}$
    if … and ($G_l(X_a) - G_l(X_b) > \epsilon$ or $\epsilon < \tau$) then
      Replace l with a split-node on $X_a$
      for all branches of the split do
        Add new leaf with derived sufficient statistic from split node
      end for
      Send drop content event with id of leaf l to all local-statistic PIs
    end if
  end if
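
The split decision in the Model Aggregator step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each local-statistic worker reports a list of (gain, attribute) pairs, the aggregator keeps the two best overall, and the part of the condition that is cut off in the slide is left out. The class, function and parameter names, and the default values, are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SplittingLeaf:
    n_seen: int                 # n_l: instances seen at this leaf
    workers_expected: int       # number of local-statistic workers
    results: list = field(default_factory=list)  # (gain, attribute) pairs
    workers_reported: int = 0


def receive(leaf, local_result, value_range=1.0, delta=1e-7, tau=0.05,
            timed_out=False):
    """Return the attribute to split on, or None if no split happens (yet)."""
    leaf.results.extend(local_result)
    leaf.workers_reported += 1
    if leaf.workers_reported < leaf.workers_expected and not timed_out:
        return None  # still waiting for other workers
    ranked = sorted(leaf.results, key=lambda r: r[0], reverse=True)
    if len(ranked) < 2:
        return None
    (gain_a, attr_a), (gain_b, _) = ranked[0], ranked[1]
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * leaf.n_seen))
    # Split if the best attribute clearly wins, or if the bound is small
    # enough (below tau) that the two candidates count as tied.
    if gain_a - gain_b > eps or eps < tau:
        return attr_a
    return None
```

On a split, the caller would replace the leaf with a split node on the returned attribute and tell the workers to drop that leaf's statistics.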

VHT Optimizations
Optimistic split execution
Use instances during split decision (in case no split)
Instance buffering
Keep instances at model for replay (in case of split)
Timeout before model decides to split
Model replication
Remove bottleneck of aggregation in single model
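
One possible reading of optimistic split execution and instance buffering, sketched in Python. The class and its fields are hypothetical: while a split decision is pending, each arriving instance is still forwarded to the statistics (so it is not lost if no split happens) and is also kept in a bounded buffer (so it can be replayed into the new leaves if a split does happen).

```python
from collections import deque


class PendingLeaf:
    """A leaf waiting for a split decision (illustrative sketch only)."""

    def __init__(self, buffer_size):
        self.buffer = deque(maxlen=buffer_size)  # oldest instances fall out
        self.sent_to_stats = []

    def on_instance(self, instance, label):
        # Optimistic: keep learning on this leaf as if no split will occur.
        self.sent_to_stats.append((instance, label))
        # Buffering: remember the instance in case the leaf does split.
        self.buffer.append((instance, label))

    def on_decision(self, split_attr):
        """On a split, group buffered instances by branch for replay."""
        if split_attr is None:
            return {}
        children = {}
        for instance, label in self.buffer:
            children.setdefault(instance[split_attr], []).append((instance, label))
        self.buffer.clear()
        return children
```

A buffer size of zero disables replay while keeping the optimistic path, which makes the two optimizations easy to compare in isolation.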

SAMOA Architecture
[Figure: SAMOA architecture. Layers: Machine Learning Algorithms, SAMOA, Distributed Stream Processing Engines (Flink and Apex shown).]
Scalable Advanced Massive Online Analysis
• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
• No system downtime
• No complex backup/update process
• No need to select update frequency

Experimental Setup: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
 Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
More experiments on dense instances in paper!
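
A rough Python sketch of how sparse, tweet-like instances with the parameters above could be generated. It is not the generator used in the experiments: the tweet length is fixed at 15 words for simplicity, no class label is produced, and the function name and seed handling are assumptions.

```python
import random


def make_tweet_sampler(vocab_size, skew=1.5, seed=0):
    """Return a function that draws bag-of-words tweets from a Zipf-like law."""
    rng = random.Random(seed)
    # Word at rank r gets weight 1 / r**skew (ranks start at 1).
    weights = [1.0 / rank ** skew for rank in range(1, vocab_size + 1)]

    def sample(n_words=15):
        words = rng.choices(range(vocab_size), weights=weights, k=n_words)
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        return counts  # sparse instance: word id -> occurrences

    return sample


if __name__ == "__main__":
    sample = make_tweet_sampler(vocab_size=1000)
    print(sample())
```

With a skew of 1.5 most of the probability mass falls on the first few word ids, so each instance touches only a handful of the attributes, which is what makes these instances sparse.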

Local VHT vs. MOA HT
• Accuracy: Local VHT ≥ MOA HT
• Exec. time: extra overhead due to interfacing with DSPE without scaling out

VHT vs. Horizontal HT
• Small drop in accuracy due to scaling and more attributes
• Always better than Hor. HT (more gains in dense instances)

• Up to 20x faster than MOA HT
• 5-10x faster than Hor. HT
• In dense instances, Hor. HT fails to run due to overhead
• Scaling out: not much impact

VHT Evolution
• Closely following MOA, better than Hor. HT
• Quickly captures best accuracy

Experimental Setup: Dense Instances
Random decision tree
Mixed categorical and numerical attributes
 10-10, 100-100, 1k-1k, 10k-10k
Instances: 1,000,000
2 balanced classes
10 different seeded runs
Test every 100k instances
MOA HT, Local VHT, Storm cluster VHT, Horizontal HT

Local VHT vs. MOA HT
[Figure: % accuracy of local VHT (local) and MOA HT (moa). Panel "Dense attributes": x-axis is nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panel "Sparse attributes": x-axis is attributes (100, 1k, 10k).]

[Figure: % accuracy vs. nominal attributes - numerical attributes (10-10, 100-100, 1k-1k, 10k-10k). Panels: parallelism = 2 and parallelism = 4. Series: sharding, wok, wk(0), wk(1k), wk(10k), local.]

VHT: Vertical Hoeffding Tree
@ApacheSAMOA
http://samoa.incubator.apache.org/
https://github.com/apache/incubator-samoa
Nicolas Kourtellis
@kourtellis
nicolas.kourtellis@telefonica.com

What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines

Taxonomy
Machine Learning
  Distributed
    Batch: Hadoop, Mahout
    Stream: S4, Storm, SAMOA
  Non-Distributed
    Batch: R, WEKA, …
    Stream: MOA

Algorithms in SAMOA
Existing:
 Vertical Hoeffding Tree (classification)
 CluStream (clustering)
 Adaptive Model Rules (regression)
Pending:
 Distributed Naïve Bayes
 Stochastic Gradient Descent
 Adaptive + Boosting VHT
 Parallelized Gradient Boosted Decision Tree
 PARMA (frequent pattern mining)
 …
Check Samoa Roadmap for more
Looking for contributors!
