VHT (Vertical Hoeffding Tree) is a distributed decision tree algorithm for streaming data. It builds an incremental decision tree from a data stream whose attributes and instances are distributed across machines, which lets it scale to large data volumes and high arrival rates. VHT uses the Hoeffding bound to decide when to split a leaf, applies optimizations such as optimistic split execution and instance buffering, and can replicate the model to remove the aggregation bottleneck. In the experiments reported here, local VHT matches or exceeds the accuracy of the MOA Hoeffding tree, and VHT runs up to 20x faster than MOA HT and 5-10x faster than a horizontally parallel Hoeffding tree, with larger gains on dense instances. It is implemented in SAMOA, a platform for developing distributed streaming machine learning algorithms.
1. VHT: Vertical Hoeffding Tree
Nicolas Kourtellis (Telefonica I+D)
Gianmarco De Francisci Morales (QCRI)
Albert Bifet (Telecom ParisTech)
Arinto Murdopo (LARC-SMU)
2. Decision Trees (DT)
Easy to visualize and understand
Fast to predict new instances
Can model non-linear relationships
Constructed using data batches
Scans data multiple times
Optimal Tree? NP-complete…
Greedy heuristics to build them
3. Big data anyone?
Present of big data
Too big to handle
Future of big data
Drinking from a firehose
4. DT + Streaming
Data arrive one example at a time, at high speed
Tree must be modified incrementally
VFDT with Hoeffding bound for guarantees
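As a rough illustration of the guarantee VFDT relies on, the sketch below computes the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and shows how it tightens as more instances reach a leaf. The function names and the values R = 1 and delta = 1e-7 are assumptions chosen for illustration, not settings taken from the slides.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def instances_needed(value_range: float, delta: float, gap: float) -> int:
    """Smallest n for which epsilon drops to `gap` or below."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2))


if __name__ == "__main__":
    # Illustrative values only (hypothetical R and delta).
    for n in (100, 1_000, 10_000):
        print(n, hoeffding_bound(1.0, 1e-7, n))
```

With these helpers, a leaf whose two best attributes differ in gain by `gap` can be split with confidence 1 - delta once it has seen about `instances_needed(R, delta, gap)` instances.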
6. Horizontal Parallelism
Independent instances processed in isolation
Instances distributed randomly to machines
Same attribute counters exist multiple times
Memory for model grows linearly with the parallelism
Split criterion centrally computed after partial counters aggregated
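The points above can be sketched in a few lines. This is a simplified single-process simulation, not the cited algorithm: each worker keeps its own copy of the (attribute, value, class) counters, instances go to a random worker, and the partial counters are merged centrally before a split criterion could be computed. The function names and the instance format (a tuple of attribute values plus a label) are assumptions.

```python
import random
from collections import Counter


def horizontal_counters(instances, num_workers, seed=0):
    """Send each instance to a random worker; every worker counts all attributes."""
    rng = random.Random(seed)
    workers = [Counter() for _ in range(num_workers)]
    for attributes, label in instances:
        w = rng.randrange(num_workers)  # random (shuffle-style) routing
        for attr_index, value in enumerate(attributes):
            workers[w][(attr_index, value, label)] += 1
    return workers


def aggregate(workers):
    """Merge the partial counters, as a central model would before splitting."""
    total = Counter()
    for partial in workers:
        total.update(partial)
    return total
```

Because the same counter key can appear on every worker, the total number of stored counters can grow up to `num_workers` times the sequential count.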
Horizontal Parallelism
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,”
The Journal of Machine Learning Research, vol. 11, pp. 849–872, Mar. 2010.
[Figure: horizontal parallelism. Labels: Stream, Instances, Stats (x3), Histograms, Model, Model Updates.]
7. Vertical Parallelism
Independent attributes processed in isolation
Instances must be transformed in column-format
Attributes distributed consistently to same machine
Attribute counters exist only once
Memory for model same as sequential version
Split criterion computed in parallel
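A minimal single-process sketch of this scheme follows. Each instance is broken into per-attribute pieces, and each attribute index is always routed to the same worker, so every counter lives in exactly one place. Routing by `attr_index % num_workers` is an assumption standing in for whatever key grouping the engine provides; the function name and instance format are also hypothetical.

```python
from collections import Counter


def vertical_counters(instances, num_workers):
    """Route each attribute to a fixed worker; counters are never duplicated."""
    workers = [Counter() for _ in range(num_workers)]
    for attributes, label in instances:
        # Column format: one (attribute index, value, label) piece per attribute.
        for attr_index, value in enumerate(attributes):
            w = attr_index % num_workers  # consistent, key-based routing
            workers[w][(attr_index, value, label)] += 1
    return workers
```

Since each worker owns a disjoint set of attributes, it can evaluate the split criterion for its own attributes without waiting for the others.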
Vertical Parallelism
[Figure: vertical parallelism. Labels: Stream, Attributes, Stats (x3), Splits, Model.]
8. Algorithm
[Figure: VHT topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Attribute, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping.]

Model Aggregator: Receive(local result, …)
local result is a local-result content event
…tree is the current state of the decision tree in model-…
Get leaf l from the list of splitting leaves
Update $X_a$ and $X_b$ in the splitting leaf l with $X_a^{local}$ and $X_b^{local}$ from local result
if local results from all local-statistic PIs received or time out then
    Compute Hoeffding bound $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2 n_l}}$
    if … and ($G_l(X_a) - G_l(X_b) > \epsilon$ or $\epsilon < \tau$) then
        Replace l with a split-node on $X_a$
        for all branches of the split do
            Add new leaf with derived sufficient statistic from split node
        end for
        Send drop content event with id of leaf l to all local-statistic PIs
    end if
end if
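The aggregation step can be sketched as follows. This is a simplified reading of the slide, not the SAMOA implementation: all class, function, and parameter names are hypothetical, the default values for R, delta, and the tie threshold are illustrative, and the first guard in the slide's split condition (cut off in the source) is omitted. Each local-statistic worker reports its best and second-best attribute with their gains; the aggregator keeps the global top two and decides once all results are in or a timeout fires.

```python
import math
from dataclasses import dataclass


@dataclass
class SplittingLeaf:
    expected_results: int                  # number of local-statistic workers
    n: int                                 # instances seen at this leaf (n_l)
    received: int = 0
    best: tuple = (None, float("-inf"))    # (attribute, gain), plays the role of X_a
    second: tuple = (None, float("-inf"))  # (attribute, gain), plays the role of X_b

    def update(self, local_best, local_second):
        """Fold one worker's top-two candidates into the global top two."""
        for cand in (local_best, local_second):
            if cand[1] > self.best[1]:
                self.best, self.second = cand, self.best
            elif cand[1] > self.second[1]:
                self.second = cand
        self.received += 1


def receive(leaf, local_best, local_second, value_range=1.0, delta=1e-7,
            tie_threshold=0.05, timed_out=False):
    """Return ("wait", None), ("no-split", None) or ("split", attribute)."""
    leaf.update(local_best, local_second)
    if leaf.received < leaf.expected_results and not timed_out:
        return ("wait", None)
    eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * leaf.n))
    if leaf.best[1] - leaf.second[1] > eps or eps < tie_threshold:
        return ("split", leaf.best[0])
    return ("no-split", None)
```

For example, with two workers and 1,000 instances at the leaf, the first call returns `("wait", None)`, and the second call splits on the globally best attribute if its gain lead exceeds the bound.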
9. VHT Optimizations
Optimistic split execution
Use instances during split decision (in case no split)
Instance buffering
Keep instances at model for replay (in case of split)
Timeout before model decides to split
Model replication
Remove bottleneck of aggregation in single model
10. SAMOA Architecture
[Figure: SAMOA architecture. Labels: SAMOA, Machine Learning Algorithms, Distributed Stream Processing Engines, Flink, Apex.]
Scalable Advanced Massive Online Analysis
• Program once, run everywhere
• Reuse existing infrastructure
• Avoid deploy cycles
• No system downtime
• No complex backup/update process
• No need to select update frequency
11. Experimental Setup: Artificial Tweets
Zipf skew: 1.5
Bag of words: 100, 1000, 10000 (attributes)
Size of tweet: ~15 words
Instances: 1,000,000
Class: positive or negative
Gaussian random variable
10 different seeded runs
Test every 100k instances
MOA HT, Local VHT, Storm cluster VHT, Horizontal HT
More experiments on dense instances in paper!
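A generator in the spirit of this setup might look like the sketch below. It is not the generator used in the experiments: the slide does not say what the Gaussian random variable controls or how the class is assigned, so the Gaussian tweet length (standard deviation 2) and the coin-flip label are assumptions, as are all names. Only the Zipf skew of 1.5, the vocabulary sizes, and the roughly 15 words per tweet come from the slide.

```python
import random


def zipf_weights(vocab_size, skew):
    """Unnormalized Zipf weights: the word at rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)]


def make_tweet_generator(vocab_size=1000, skew=1.5, mean_words=15, seed=1):
    """Return a function that yields (sorted word indices, label) pairs."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size, skew)
    vocab = range(vocab_size)

    def next_tweet():
        # Assumption: tweet length is Gaussian around mean_words.
        length = max(1, round(rng.gauss(mean_words, 2.0)))
        words = rng.choices(vocab, weights=weights, k=length)
        # Placeholder: the real generator's labeling rule is not given here.
        label = rng.choice(("positive", "negative"))
        return sorted(set(words)), label

    return next_tweet
```

Each returned instance is a sparse bag of words: only the indices of words present in the tweet are listed, which is why the number of attributes (100 to 10,000) can greatly exceed the tweet length.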
12. Local VHT vs. MOA HT
• Accuracy: Local VHT ≥ MOA HT
• Exec. time: extra overhead due to interfacing with the DSPE without scaling out
13. VHT vs. Horizontal HT
• Small drop in accuracy due to scaling and more attributes
• Always better than Hor. HT (more gains in dense instances)
14. VHT vs. Horizontal HT
• Up to 20x faster than MOA HT
• 5-10x faster than Hor. HT
• In dense instances, Hor. HT fails to run due to overhead
• Scaling out: not much impact
22. What is SAMOA?
Scalable Advanced Massive Online Analysis
A platform for mining big data streams
Framework for developing new distributed stream mining algorithms
Framework for deploying algorithms on new distributed stream processing engines