Distributed Decision Tree Learning for Mining Big Data Streams
1
Master Thesis presentation by:
Arinto Murdopo
EMDC
arinto@yahoo-inc.com
Supervisors:
Albert Bifet
Gianmarco de Francisci Morales
Ricard Gavaldà
Big Data
2
• 200 million users
• 400 million tweets/day
• 1+ TB/day to Hadoop
• 2.7 TB/day follower updates
• 4.5 billion likes/day
• 350 million photos/day
Volume, Velocity, Variety
(figures from March and May 2013)
Machine Learning (ML)
3
Make sense of the data, but how?
Machine Learning = learn & adapt based on data
Due to the 3Vs, we should:
1. Distribute, to scale
2. Stream, to be fast
3. Distribute and stream, to scale and be fast
Are We Satisfied?
4
We want machine learning frameworks that are scalable, fast, and loosely coupled.
SAMOA
Scalable Advanced Massive Online Analysis
Distributed Streaming Machine Learning Framework:
• Fast, using the streaming model
• Scalable, running on top of distributed SPEs (Storm and S4)
• Loose coupling between ML algorithms and SPEs
5
Contributions
SAMOA
• Architecture and Abstractions
• Stream Processing Engine Adapter
• Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
6
7
SAMOA Architecture
[Architecture diagram] SAMOA provides ML modules (Classification Methods, Clustering Methods, Frequent Pattern Mining) and runs on top of SPEs: Storm, S4, and other SPEs.
SAMOA Abstractions
To develop distributed ML algorithms
8
• Processing Item (PI) and Entrance Processing Item (EPI)
• Processor
• Stream and Content Events
• Grouping
• Parallelism Hint
• Topology
• External Event Source (connected to the topology through an EPI)
(a minimal illustrative sketch of these abstractions follows)
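To make these abstractions concrete, here is a minimal illustrative sketch in Java. The interface and method names are assumptions chosen for this summary; they are not the actual SAMOA API.

// Illustrative only -- hypothetical names, not the real SAMOA interfaces.

// The unit of data flowing through a Stream.
interface ContentEvent {
    String getKey();                      // used by key-based groupings
}

// A Processor holds the algorithm logic; a Processing Item (PI) wraps it
// and is replicated by the SPE according to the parallelism hint.
interface Processor {
    void onCreate(int id);                // called once per PI replica
    void process(ContentEvent event);     // called for every incoming event
}

// A Stream connects PIs and carries Content Events.
interface Stream {
    void put(ContentEvent event);
}

// A Grouping decides which PI replica receives each event.
enum Grouping { SHUFFLE, KEY, ALL }

A topology is then the graph of PIs and Streams; an Entrance PI (EPI) plays the same role as a PI but is fed by an external event source.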
SAMOA SPE-adapter
• Transforms the abstractions into SPE-specific runtime components
• Abstract factory pattern to decouple the API from the SPE (see the sketch below)
• Platform developers need to provide:
1. PI and EPI
2. Stream
3. Grouping
9
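Below is a minimal sketch of the abstract factory idea mentioned above, assuming hypothetical names (PlatformComponentFactory, ProcessingItem, MessageStream); it is not the actual SAMOA source, only an illustration of how algorithm code can stay independent of the SPE.

// Hypothetical abstract factory -- one implementation per supported SPE.
interface ProcessingItem { }
interface MessageStream { }

interface PlatformComponentFactory {
    ProcessingItem createPi(int parallelismHint);
    MessageStream createStream(ProcessingItem source);
}

// The Storm adapter returns Storm-backed runtime components; an S4 adapter
// would return S4-backed ones. Algorithm code only ever sees the interfaces.
class StormComponentFactory implements PlatformComponentFactory {
    public ProcessingItem createPi(int parallelismHint) {
        return new ProcessingItem() { };   // would wrap a Storm Bolt declaration
    }
    public MessageStream createStream(ProcessingItem source) {
        return new MessageStream() { };    // would wrap a Storm stream declaration
    }
}

This is why platform developers only need to provide the three items listed above: PI/EPI, Stream, and Grouping.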
SAMOA SPE-adapter
Examples of SPE-specific runtime components from the SPE-adapter
10
[Table] The Storm-specific components are the focus of this thesis.
Storm
• Distributed stream processing engine
• MapReduce-like programming model (a minimal wiring sketch follows the diagram summary below)
11
[Diagram] A Storm topology is a DAG: Spouts (S1, S2) emit streams of Tuples (stream A, stream B) that flow through Bolts (B1 through B5); a bolt can also store useful information in external data storage.
Key concepts: Stream, Spout, Bolt, DAG, Tuples
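As a reference for the MapReduce-like model, here is a minimal topology-wiring sketch. The TopologyBuilder, grouping, and LocalCluster calls are part of the public Storm API (shown with the org.apache.storm package names; Storm releases from the thesis period used backtype.storm instead), while SentenceSpout, SplitterBolt, and CounterBolt are hypothetical user-defined components.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class ExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // A spout emits tuples; bolts consume and transform them.
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("splitter", new SplitterBolt(), 3)
               .shuffleGrouping("sentences");                      // random routing
        builder.setBolt("counter", new CounterBolt(), 3)
               .fieldsGrouping("splitter", new Fields("word"));    // key-based routing
        // Run the DAG locally for testing.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("example", new Config(), builder.createTopology());
    }
}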
SAMOA-Storm Integration
Mapping between Storm and SAMOA:
1. Spout → Entrance Processing Item (EPI)
2. Bolt → Processing Item (PI)
• Composition is used for EPI and PI (see the sketch below)
3. Bolt Stream & Spout Stream → Stream
• Storm pull model
12
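A minimal sketch of the composition approach: a Storm Bolt that wraps a SAMOA-style Processor instead of inheriting from it, so the same Processor could equally be wrapped by an S4 runtime component. The Storm base class and callbacks are the public Storm API; Processor, ContentEvent, and the tuple field name are hypothetical placeholders.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class ProcessingItemBolt extends BaseRichBolt {
    private final Processor processor;    // hypothetical SAMOA-style processor
    private OutputCollector collector;

    public ProcessingItemBolt(Processor processor) {
        this.processor = processor;
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        processor.onCreate(context.getThisTaskId());
    }

    @Override
    public void execute(Tuple input) {
        // Convert the Storm tuple back into a content event and delegate.
        ContentEvent event = (ContentEvent) input.getValueByField("content_event");
        processor.process(event);
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare output streams here if the wrapped processor emits events downstream.
    }
}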
Contributions so far ..
13
[Architecture diagram] The SAMOA layer (Algorithm and API) connects through an SPE-adapter to S4, Storm, and other SPEs (the samoa-SPE modules: samoa-S4, samoa-storm, samoa-other-SPEs) and through an ML-adapter to MOA and other ML frameworks.
Design goals: Flexibility, Scalability, Extensibility
Next Contribution…
Distributed Algorithm implementation:
Vertical Hoeffding Tree
Decision tree:
• Classification
• Divide and conquer
• Easy to interpret
14
Sample Dataset
ID Code | Outlook  | Temperature | Humidity | Windy | Play
a       | sunny    | hot         | high     | false | no
b       | sunny    | hot         | high     | true  | no
c       | overcast | hot         | high     | false | yes
d       | rainy    | mild        | high     | false | yes
…       | …        | …           | …        | …     | …
15
Outlook, Temperature, Humidity, and Windy are attributes; Play is the class.
Each row is a datum (an instance) used to build the tree.
Decision Tree
16
[Tree diagram] The root splits on outlook: sunny leads to a split node on humidity (high → N, normal → Y), overcast leads directly to the leaf Y, and rainy leads to a split node on windy (true → N, false → Y).
Node types: root, split node, leaf node.
Very Fast Decision Tree (VFDT)
• Pioneering decision tree algorithm for streaming data
• Information Gain + Gain Ratio + Hoeffding bound
• The Hoeffding bound decides whether the difference in information gain is large enough to split (a small numeric sketch follows below)
• Often called the Hoeffding Tree
17
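For reference, a minimal sketch of the Hoeffding bound used in the split decision. With probability 1 - delta, the true mean of a random variable with range R lies within epsilon of the sample mean after n observations; the leaf splits when the observed gain difference between the two best attributes exceeds epsilon. The class and method names below are illustrative only.

public final class HoeffdingBound {
    // epsilon = sqrt( R^2 * ln(1/delta) / (2 * n) )
    public static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    public static void main(String[] args) {
        // Example: information gain for a 2-class problem has range log2(2) = 1.
        double eps = epsilon(1.0, 1e-7, 1000);
        System.out.println("epsilon after 1000 instances: " + eps);
        // Split if bestGain - secondBestGain > eps (or eps falls below a tie-break threshold).
    }
}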
Distributed Decision Tree
Types of parallelism
• Horizontal: partition the data by instance
• Vertical: partition the data by attribute (see the routing sketch below)
• Task: tree leaf nodes grow in parallel
18
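A minimal sketch of how horizontal and vertical partitioning differ in terms of routing keys, assuming a hypothetical routeTo helper that maps a key to one of p parallel processors. This is illustrative, not the thesis code.

public final class PartitioningSketch {
    // Pick a target processor index in [0, parallelism) from a routing key.
    static int routeTo(int key, int parallelism) {
        return Math.floorMod(key, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int instanceId = 12345;

        // Horizontal parallelism: the whole instance goes to one processor,
        // chosen by the instance identifier.
        System.out.println("instance " + instanceId + " -> processor "
                + routeTo(instanceId, parallelism));

        // Vertical parallelism: each attribute of the same instance is routed
        // by its attribute index, so different processors keep statistics
        // for disjoint subsets of attributes.
        double[] attributes = {0.3, 1.7, 2.2, 0.0, 5.1};
        for (int attrIndex = 0; attrIndex < attributes.length; attrIndex++) {
            System.out.println("attribute " + attrIndex + " -> processor "
                    + routeTo(attrIndex, parallelism));
        }
    }
}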
MOA Hoeffding Tree Profiling
19
[Pie chart] CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%.
Vertical Hoeffding Tree
20
[Topology diagram] source PI → model-aggregator PI → local-statistic PIs (replicated in parallel) → model-aggregator PI → evaluator PI.
Streams: source, attribute, control, local-result, result.
(A sketch of the model-aggregator's vertical fan-out and split check follows below.)
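A minimal sketch of the vertical idea in the model-aggregator, under these assumptions: the aggregator fans each instance's attributes out to local-statistic processors keyed by attribute index, later collects their locally best splits, and applies the Hoeffding bound. Class and method names are hypothetical; this is not the thesis implementation.

import java.util.List;

// Hypothetical event and result types for illustration.
class AttributeEvent { long instanceId; int attributeIndex; double value; String classLabel; }
class LocalResult { double bestGain; double secondBestGain; }

class ModelAggregatorSketch {
    interface AttributeStream { void put(AttributeEvent event, int routingKey); }

    // Fan an instance out vertically: one event per attribute, keyed by attribute
    // index, so each local-statistic processor sees a subset of the attributes.
    void distribute(long instanceId, double[] attributes, String classLabel,
                    AttributeStream attributeStream) {
        for (int i = 0; i < attributes.length; i++) {
            AttributeEvent e = new AttributeEvent();
            e.instanceId = instanceId;
            e.attributeIndex = i;
            e.value = attributes[i];
            e.classLabel = classLabel;
            attributeStream.put(e, /* routing key = */ i);
        }
    }

    // Once all local results for a leaf have arrived, decide whether to split.
    boolean shouldSplit(List<LocalResult> localResults, double range, double delta, long n) {
        double best = Double.NEGATIVE_INFINITY, secondBest = Double.NEGATIVE_INFINITY;
        for (LocalResult r : localResults) {
            for (double g : new double[] { r.bestGain, r.secondBestGain }) {
                if (g > best) { secondBest = best; best = g; }
                else if (g > secondBest) { secondBest = g; }
            }
        }
        double epsilon = Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
        return best - secondBest > epsilon;
    }
}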
Evaluation
Metrics:
• Accuracy
• Throughput
Input data:
• Random Tree Generator
• Text Generator – resembles tweets
Cluster: 3 shared nodes with 48 GB of RAM, Intel Xeon E5620 CPU @ 2.4 GHz (16 processors), Linux kernel 2.6.18
21
VHT iteration 1 (VHT1)
• Goal: verify algorithm correctness (same accuracy as MOA)
• Used 2 internal queues: an instances queue and a local-result queue
• Achieved the same accuracy, but throughput was low; proceeded with VHT2
22
VHT Iteration 2 (VHT2)
Goal: improve VHT1 throughput
• Kryo serializer: 2.5x throughput improvement (registration sketch below)
• long identifiers instead of String
• Removed the 2 internal queues of VHT1 → instances arriving while a split is being attempted are discarded
23
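A minimal sketch of registering event classes for Kryo serialization in a Storm topology configuration. Config.registerSerialization is part of the public Storm API; the event class names here are hypothetical placeholders, not necessarily the classes used in the thesis.

import org.apache.storm.Config;

public class KryoConfigSketch {
    public static Config buildConfig() {
        Config conf = new Config();
        // Serialize tuples with Kryo instead of falling back to Java serialization;
        // registered classes get compact integer tags.
        conf.setFallBackOnJavaSerialization(false);
        conf.registerSerialization(AttributeContentEvent.class);   // hypothetical event class
        conf.registerSerialization(LocalResultContentEvent.class); // hypothetical event class
        return conf;
    }
}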
tree-10
24
Around 8.2% difference in accuracy
tree-100
25
Same trend as tree-10
(7.9% difference in accuracy)
Number of Leaf Nodes: VHT2 – tree-100
26
Very close and
very high accuracy
Accuracy VHT2 – text-1000
27
Accuracy is low when the number of attributes is increased
Throughput VHT2 – tree-generator
28
Throughput is not good for dense instances with a low number of attributes
Throughput VHT2 – text-generator
29
Higher throughput
than MHT
30
[Bar chart] Profiling results for text-1000 with 1,000,000 instances: execution time (seconds) per classifier (VHT2-par-3 vs. MHT), broken down into t_calc, t_comm, and t_serial.
Minimizing t_comm will increase throughput.
31
[Bar chart] Profiling results for text-10000 with 100,000 instances: execution time (seconds) per classifier (VHT2-par-3 vs. MHT), broken down into t_calc, t_comm, and t_serial.
Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec.
Future Work
• Open Source
• Evaluation layer in SAMOA architecture
• Online classification algorithms that are
based on horizontal parallelism
32
Conclusions
Mining big data streams is challenging
• Systems need to satisfy the 3Vs of big data.
SAMOA – Distributed Streaming ML Framework
• Architecture and Abstractions
• Stream Processing Engine (SPE) adapter
• SAMOA Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
33
