Distributed Decision Tree Learning for Mining Big Data Streams
1
Master Thesis presentation by:
Arinto Murdopo
EMDC
arinto@yahoo-inc.com
Supervisors:
Albert Bifet
Gianmarco de Francisci Morales
Ricard Gavaldà
Big Data
2
• 200 million users
• 400 million tweets/day
• 1+ TB/day to Hadoop
• 2.7 TB/day follower updates
• 4.5 billion likes/day
• 350 million photos/day
Volume, Velocity, Variety
(figures from March and May 2013)
Machine Learning (ML)
3
Make sense of the data, but how?
Machine Learning = learn & adapt based on data
Due to the 3Vs, we should:
1. Distribute, to scale
2. Stream, to be fast
3. Distribute and stream, to scale and be fast
Are We Satisfied?
4
We want machine learning frameworks that are scalable, fast, and loosely coupled.
SAMOA
Scalable Advanced Massive Online Analysis
Distributed Streaming Machine Learning Framework:
• Fast, using the streaming model
• Scalable, running on top of distributed SPEs (Storm and S4)
• Loose coupling between ML algorithms and SPEs
5
Contributions
SAMOA
• Architecture and Abstractions
• Stream Processing Engine Adapter
• Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
6
7
SAMOA Architecture
[Architecture diagram] SAMOA provides ML modules (Classification Methods, Clustering Methods, Frequent Pattern Mining) and runs on top of SPEs: Storm, S4, and other SPEs.
SAMOA Abstractions
To develop distributed ML algorithms
8
• Processing Item (PI) and Entrance Processing Item (EPI)
• Processor
• Stream and Content Events
• Grouping
• Parallelism Hint
• Topology
• External Event Source (connected to the topology through an EPI)
(a minimal illustrative sketch of these abstractions follows)
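To make these abstractions concrete, here is a minimal illustrative sketch in Java. The interface and method names are assumptions chosen for this summary; they are not the actual SAMOA API.

// Illustrative only -- hypothetical names, not the real SAMOA interfaces.

// The unit of data flowing through a Stream.
interface ContentEvent {
    String getKey();                      // used by key-based groupings
}

// A Processor holds the algorithm logic; a Processing Item (PI) wraps it
// and is replicated by the SPE according to the parallelism hint.
interface Processor {
    void onCreate(int id);                // called once per PI replica
    void process(ContentEvent event);     // called for every incoming event
}

// A Stream connects PIs and carries Content Events.
interface Stream {
    void put(ContentEvent event);
}

// A Grouping decides which PI replica receives each event.
enum Grouping { SHUFFLE, KEY, ALL }

A topology is then the graph of PIs and Streams; an Entrance PI (EPI) plays the same role as a PI but is fed by an external event source.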
SAMOA SPE-adapter
• Transforms the abstractions into SPE-specific runtime components
• Abstract factory pattern to decouple the API from the SPE (see the sketch below)
• Platform developers need to provide:
1. PI and EPI
2. Stream
3. Grouping
9
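Below is a minimal sketch of the abstract factory idea mentioned above, assuming hypothetical names (PlatformComponentFactory, ProcessingItem, MessageStream); it is not the actual SAMOA source, only an illustration of how algorithm code can stay independent of the SPE.

// Hypothetical abstract factory -- one implementation per supported SPE.
interface ProcessingItem { }
interface MessageStream { }

interface PlatformComponentFactory {
    ProcessingItem createPi(int parallelismHint);
    MessageStream createStream(ProcessingItem source);
}

// The Storm adapter returns Storm-backed runtime components; an S4 adapter
// would return S4-backed ones. Algorithm code only ever sees the interfaces.
class StormComponentFactory implements PlatformComponentFactory {
    public ProcessingItem createPi(int parallelismHint) {
        return new ProcessingItem() { };   // would wrap a Storm Bolt declaration
    }
    public MessageStream createStream(ProcessingItem source) {
        return new MessageStream() { };    // would wrap a Storm stream declaration
    }
}

This is why platform developers only need to provide the three items listed above: PI/EPI, Stream, and Grouping.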
SAMOA SPE-adapter
Examples of SPE-specific runtime components from the SPE-adapter
10
[Table] The Storm-specific components are the focus of this thesis.
Storm
• Distributed stream processing engine
• MapReduce-like programming model (a minimal wiring sketch follows the diagram summary below)
11
[Diagram] A Storm topology is a DAG: Spouts (S1, S2) emit streams of Tuples (stream A, stream B) that flow through Bolts (B1 through B5); a bolt can also store useful information in external data storage.
Key concepts: Stream, Spout, Bolt, DAG, Tuples
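As a reference for the MapReduce-like model, here is a minimal topology-wiring sketch. The TopologyBuilder, grouping, and LocalCluster calls are part of the public Storm API (shown with the org.apache.storm package names; Storm releases from the thesis period used backtype.storm instead), while SentenceSpout, SplitterBolt, and CounterBolt are hypothetical user-defined components.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // A spout emits tuples; bolts consume and transform them.
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("splitter", new SplitterBolt(), 3)
               .shuffleGrouping("sentences");                      // random routing
        builder.setBolt("counter", new CounterBolt(), 3)
               .fieldsGrouping("splitter", new Fields("word"));    // key-based routing
        // Run the DAG locally for testing.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("example", new Config(), builder.createTopology());
    }
}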
SAMOA-Storm Integration
Mapping between Storm and SAMOA:
1. Spout → Entrance Processing Item (EPI)
2. Bolt → Processing Item (PI)
• Composition is used for EPI and PI (see the sketch below)
3. Bolt Stream & Spout Stream → Stream
• Storm pull model
12
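A minimal sketch of the composition approach: a Storm Bolt that wraps a SAMOA-style Processor instead of inheriting from it, so the same Processor could equally be wrapped by an S4 runtime component. The Storm base class and callbacks are the public Storm API; Processor, ContentEvent, and the tuple field name are hypothetical placeholders.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class ProcessingItemBolt extends BaseRichBolt {
    private final Processor processor;    // hypothetical SAMOA-style processor
    private OutputCollector collector;

    public ProcessingItemBolt(Processor processor) {
        this.processor = processor;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        processor.onCreate(context.getThisTaskId());
    }

    @Override
    public void execute(Tuple input) {
        // Convert the Storm tuple back into a content event and delegate.
        ContentEvent event = (ContentEvent) input.getValueByField("content_event");
        processor.process(event);
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare output streams here if the wrapped processor emits events downstream.
    }
}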
Contributions so far ..
13
[Architecture diagram] The SAMOA layer (Algorithm and API) connects through an SPE-adapter to S4, Storm, and other SPEs (the samoa-SPE modules: samoa-S4, samoa-storm, samoa-other-SPEs) and through an ML-adapter to MOA and other ML frameworks.
Design goals: Flexibility, Scalability, Extensibility
Next Contribution…
Distributed Algorithm implementation:
Vertical Hoeffding Tree
Decision tree:
• Classification
• Divide and conquer
• Easy to interpret
14
Sample Dataset
ID Code | Outlook  | Temperature | Humidity | Windy | Play
a       | sunny    | hot         | high     | false | no
b       | sunny    | hot         | high     | true  | no
c       | overcast | hot         | high     | false | yes
d       | rainy    | mild        | high     | false | yes
…       | …        | …           | …        | …     | …
15
Outlook, Temperature, Humidity, and Windy are attributes; Play is the class.
Each row is a datum (an instance) used to build the tree.
Decision Tree
16
[Tree diagram] The root splits on outlook: sunny leads to a split node on humidity (high → N, normal → Y), overcast leads directly to the leaf Y, and rainy leads to a split node on windy (true → N, false → Y).
Node types: root, split node, leaf node.
Very Fast Decision Tree (VFDT)
• Pioneering decision tree algorithm for streaming data
• Information Gain + Gain Ratio + Hoeffding bound
• The Hoeffding bound decides whether the difference in information gain is large enough to split (a small numeric sketch follows below)
• Often called the Hoeffding Tree
17
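For reference, a minimal sketch of the Hoeffding bound used in the split decision. With probability 1 - delta, the true mean of a random variable with range R lies within epsilon of the sample mean after n observations; the leaf splits when the observed gain difference between the two best attributes exceeds epsilon. The class and method names below are illustrative only.

public final class HoeffdingBound {
    // epsilon = sqrt( R^2 * ln(1/delta) / (2 * n) )
    public static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // Example: information gain for a 2-class problem has range log2(2) = 1.
        double eps = epsilon(1.0, 1e-7, 1000);
        System.out.println("epsilon after 1000 instances: " + eps);
        // Split if bestGain - secondBestGain > eps (or eps falls below a tie-break threshold).
    }
}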
Distributed Decision Tree
Types of parallelism
• Horizontal: partition the data by instance
• Vertical: partition the data by attribute (see the routing sketch below)
• Task: tree leaf nodes grow in parallel
18
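A minimal sketch of how horizontal and vertical partitioning differ in terms of routing keys, assuming a hypothetical routeTo helper that maps a key to one of p parallel processors. This is illustrative, not the thesis code.

public final class PartitioningSketch {
    // Pick a target processor index in [0, parallelism) from a routing key.
    static int routeTo(int key, int parallelism) {
        return Math.floorMod(key, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int instanceId = 12345;

        // Horizontal parallelism: the whole instance goes to one processor,
        // chosen by the instance identifier.
        System.out.println("instance " + instanceId + " -> processor "
                + routeTo(instanceId, parallelism));

        // Vertical parallelism: each attribute of the same instance is routed
        // by its attribute index, so different processors keep statistics
        // for disjoint subsets of attributes.
        double[] attributes = {0.3, 1.7, 2.2, 0.0, 5.1};
        for (int attrIndex = 0; attrIndex < attributes.length; attrIndex++) {
            System.out.println("attribute " + attrIndex + " -> processor "
                    + routeTo(attrIndex, parallelism));
        }
    }
}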
MOA Hoeffding Tree Profiling
19
[Pie chart] CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%.
Vertical Hoeffding Tree
20
[Topology diagram] source PI → model-aggregator PI → local-statistic PIs (replicated in parallel) → model-aggregator PI → evaluator PI.
Streams: source, attribute, control, local-result, result.
(A sketch of the model-aggregator's vertical fan-out and split check follows below.)
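A minimal sketch of the vertical idea in the model-aggregator, under these assumptions: the aggregator fans each instance's attributes out to local-statistic processors keyed by attribute index, later collects their locally best splits, and applies the Hoeffding bound. Class and method names are hypothetical; this is not the thesis implementation.

import java.util.List;

// Hypothetical event and result types for illustration.
class AttributeEvent { long instanceId; int attributeIndex; double value; String classLabel; }
class LocalResult { double bestGain; double secondBestGain; }

class ModelAggregatorSketch {
    interface AttributeStream { void put(AttributeEvent event, int routingKey); }

    // Fan an instance out vertically: one event per attribute, keyed by attribute
    // index, so each local-statistic processor sees a subset of the attributes.
    void distribute(long instanceId, double[] attributes, String classLabel,
                    AttributeStream attributeStream) {
        for (int i = 0; i < attributes.length; i++) {
            AttributeEvent e = new AttributeEvent();
            e.instanceId = instanceId;
            e.attributeIndex = i;
            e.value = attributes[i];
            e.classLabel = classLabel;
            attributeStream.put(e, /* routing key = */ i);
        }
    }

    // Once all local results for a leaf have arrived, decide whether to split.
    boolean shouldSplit(List<LocalResult> localResults, double range, double delta, long n) {
        double best = Double.NEGATIVE_INFINITY, secondBest = Double.NEGATIVE_INFINITY;
        for (LocalResult r : localResults) {
            for (double g : new double[] { r.bestGain, r.secondBestGain }) {
                if (g > best) { secondBest = best; best = g; }
                else if (g > secondBest) { secondBest = g; }
            }
        }
        double epsilon = Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
        return best - secondBest > epsilon;
    }
}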
Evaluation
Metrics:
• Accuracy
• Throughput
Input data:
• Random Tree Generator
• Text Generator – resembles tweets
Cluster: 3 shared nodes with 48 GB of RAM, Intel Xeon E5620 CPU @ 2.4 GHz (16 processors), Linux kernel 2.6.18
21
VHT iteration 1 (VHT1)
• Goal: verify algorithm correctness (same accuracy as MOA)
• Used 2 internal queues: an instances queue and a local-result queue
• Achieved the same accuracy, but throughput was low; proceeded with VHT2
22
VHT Iteration 2 (VHT2)
Goal: improve VHT1 throughput
• Kryo serializer: 2.5x throughput improvement (registration sketch below)
• long identifiers instead of String
• Removed the 2 internal queues of VHT1 → instances arriving while a split is being attempted are discarded
23
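A minimal sketch of registering event classes for Kryo serialization in a Storm topology configuration. Config.registerSerialization is part of the public Storm API; the event class names here are hypothetical placeholders, not necessarily the classes used in the thesis.

import org.apache.storm.Config;

public class KryoConfigSketch {
    public static Config buildConfig() {
        Config conf = new Config();
        // Serialize tuples with Kryo instead of falling back to Java serialization;
        // registered classes get compact integer tags.
        conf.setFallBackOnJavaSerialization(false);
        conf.registerSerialization(AttributeContentEvent.class);   // hypothetical event class
        conf.registerSerialization(LocalResultContentEvent.class); // hypothetical event class
        return conf;
    }
}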
tree-10
24
Around 8.2% difference in accuracy
tree-100
25
Same trend as tree-10
(7.9% difference in accuracy)
Number of Leaf Nodes: VHT2 – tree-100
26
Very close and
very high accuracy
Accuracy VHT2 – text-1000
27
Accuracy is low when the number of attributes is increased
Throughput VHT2 – tree-generator
28
Throughput is not good for dense instances with a low number of attributes
Throughput VHT2 – text-generator
29
Higher throughput
than MHT
30
[Bar chart] Profiling results for text-1000 with 1,000,000 instances: execution time (seconds) per classifier (VHT2-par-3 vs. MHT), broken down into t_calc, t_comm, and t_serial.
Minimizing t_comm will increase throughput.
31
[Bar chart] Profiling results for text-10000 with 100,000 instances: execution time (seconds) per classifier (VHT2-par-3 vs. MHT), broken down into t_calc, t_comm, and t_serial.
Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec.
Future Work
• Open Source
• Evaluation layer in SAMOA architecture
• Online classification algorithms that are
based on horizontal parallelism
32
Conclusions
Mining big data streams is challenging
• Systems need to satisfy the 3Vs of big data.
SAMOA – Distributed Streaming ML Framework
• Architecture and Abstractions
• Stream Processing Engine (SPE) adapter
• SAMOA Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
33
