Transcript of "Distributed Decision Tree Learning for Mining Big Data Streams"

  1. Distributed Decision Tree Learning for Mining Big Data Streams
     Master Thesis presentation by: Arinto Murdopo, EMDC (arinto@yahoo-inc.com)
     Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà
  2. Big Data (the 3Vs: Volume, Velocity, Variety)
     • 200 million users, 400 million tweets/day
     • 1+ TB/day to Hadoop, 2.7 TB/day of follower updates
     • 4.5 billion likes/day, 350 million photos/day
     (figures from March and May 2013)
  3. Machine Learning (ML)
     Make sense of the data, but how? Machine learning = learn and adapt based on data.
     Due to the 3Vs, we should:
     1. Distribute, to scale
     2. Stream, to be fast
     3. Distribute and stream, to scale and be fast
  4. Are We Satisfied?
     We want machine learning frameworks that are scalable, fast, and loosely coupled.
  5. SAMOA: Scalable Advanced Massive Online Analysis
     A distributed streaming machine learning framework:
     • Fast, using the streaming model
     • Scalable, on top of distributed SPEs (Storm and S4)
     • Loose coupling between ML algorithms and SPEs
  6. Contributions
     SAMOA
     • Architecture and abstractions
     • Stream Processing Engine (SPE) adapter
     • Integration with Storm
     Vertical Hoeffding Tree
     • Better than MOA for a high number of attributes
  7. SAMOA Architecture
     Algorithm modules (Classification Methods, Clustering Methods, Frequent Pattern Mining) sit on top of SAMOA, which runs on Storm, S4, and other SPEs.
  8. SAMOA Abstractions
     Used to develop distributed ML algorithms: Topology, Processing Item (PI), Entrance Processing Item (EPI, fed by an external event source), Processor, Stream, Content Event, Grouping, Parallelism Hint.
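The abstractions on this slide can be sketched in plain Java. This is a hypothetical, simplified sketch: the names mirror the slide's vocabulary (Processor, Stream, Content Event, parallelism hint) but are not the actual SAMOA API, and the key grouping shown is only one of the grouping types.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the SAMOA abstractions; NOT the real SAMOA API.
public class SamoaAbstractionsSketch {

    // A content event is the unit of data flowing between Processing Items.
    interface ContentEvent {
        String getKey(); // used by key groupings to route the event
    }

    // A Processor hosts the algorithm logic inside a Processing Item.
    interface Processor {
        void process(ContentEvent event);
    }

    // A Stream connects Processing Items; the grouping decides which of the
    // destination's parallel replicas (parallelism hint) receives each event.
    static class Stream {
        private final List<Processor> replicas; // size = parallelism hint

        Stream(List<Processor> replicas) {
            this.replicas = replicas;
        }

        // Key grouping: events with the same key reach the same replica.
        void put(ContentEvent event) {
            int idx = Math.abs(event.getKey().hashCode()) % replicas.size();
            replicas.get(idx).process(event);
        }
    }

    // Demo: route keyed events to `parallelism` replicas, count per replica.
    static int[] demoRouting(String[] keys, int parallelism) {
        int[] counts = new int[parallelism];
        List<Processor> replicas = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            final int id = i;
            replicas.add(event -> counts[id]++);
        }
        Stream stream = new Stream(replicas);
        for (String key : keys) {
            stream.put(() -> key);
        }
        return counts;
    }
}
```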
  9. SAMOA SPE-adapter
     • Transforms the abstractions into SPE-specific runtime components
     • Uses the abstract factory pattern to decouple the API from the SPE
     • Platform developers need to provide:
       1. PI and EPI
       2. Stream
       3. Grouping
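The abstract-factory idea can be shown in miniature. The class names below are illustrative, not SAMOA's real hierarchy: algorithm code asks a factory for platform pieces, and only the concrete factory knows which SPE it targets.

```java
// Miniature of the SPE-adapter's abstract factory pattern; class names
// are illustrative, not SAMOA's actual ones.
public class SpeAdapterSketch {

    interface ProcessingItem {
        String describe();
    }

    // Abstract factory: one implementation per supported SPE.
    interface ComponentFactory {
        ProcessingItem createPi(int parallelismHint);
    }

    // A Storm-backed factory would produce bolt-based components...
    static class StormComponentFactory implements ComponentFactory {
        public ProcessingItem createPi(int hint) {
            return () -> "storm-bolt(parallelism=" + hint + ")";
        }
    }

    // ...while an S4-backed factory would produce S4 processing elements.
    static class S4ComponentFactory implements ComponentFactory {
        public ProcessingItem createPi(int hint) {
            return () -> "s4-pe(parallelism=" + hint + ")";
        }
    }

    // Algorithm code is written once, against the factory interface only;
    // this is what decouples the API from any particular SPE.
    static String buildTopology(ComponentFactory factory) {
        return factory.createPi(3).describe();
    }
}
```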
 10. SAMOA SPE-adapter
     Examples of SPE-specific runtime components produced by the SPE-adapter; the Storm components are the focus of this thesis.
 11. Storm
     • Distributed streaming processing engine
     • MapReduce-like programming model
     Key concepts: Tuples (the unit of data), Streams, Spouts (stream sources), Bolts (stream processors, which can store useful information to data storage); a topology is a DAG of spouts and bolts.
 12. SAMOA-Storm Integration
     Mapping between Storm and SAMOA:
     1. Spout → Entrance Processing Item (EPI)
     2. Bolt → Processing Item (PI)
        • Uses composition for EPI and PI
     3. Bolt stream & spout stream → Stream
        • Uses Storm's pull model
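The composition in point 2 can be sketched as follows. The Bolt and Tuple types here are simplified stand-ins for Storm's interfaces so the example stays self-contained, and the class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Composition as in the Bolt -> Processing Item mapping: the bolt owns a
// SAMOA Processor and forwards each incoming tuple to it. Bolt and Tuple
// below are simplified stand-ins for Storm's real interfaces.
public class ProcessingItemBoltSketch {

    interface Tuple { Object getValue(int index); }
    interface Bolt { void execute(Tuple input); }
    interface Processor { void process(Object event); }

    // The PI is realized as a bolt by composition, not inheritance, so the
    // Storm-facing type and the SAMOA-facing type stay independent.
    static class ProcessingItemBolt implements Bolt {
        private final Processor processor;

        ProcessingItemBolt(Processor processor) {
            this.processor = processor;
        }

        @Override
        public void execute(Tuple input) {
            // Unwrap the event carried in the tuple and delegate to SAMOA.
            processor.process(input.getValue(0));
        }
    }

    // Demo: run events through a wrapped processor, return what it saw.
    static List<Object> demo(Object... events) {
        List<Object> seen = new ArrayList<>();
        Bolt bolt = new ProcessingItemBolt(seen::add);
        for (Object event : events) {
            bolt.execute(index -> event);
        }
        return seen;
    }
}
```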
 13. Contributions so far...
     The SAMOA stack: Algorithm and API at the top; the SPE-adapter binds to S4, Storm, and other SPEs (samoa-s4, samoa-storm, samoa-other-SPEs); the ML-adapter binds to MOA and other ML frameworks. Design goals: flexibility, scalability, extensibility.
 14. Next Contribution...
     Distributed algorithm implementation: Vertical Hoeffding Tree.
     Decision tree:
     • Classification
     • Divide and conquer
     • Easy to interpret
 15. Sample Dataset (used to build the tree)
     ID Code | Outlook  | Temperature | Humidity | Windy | Play
     a       | sunny    | hot         | high     | false | no
     b       | sunny    | hot         | high     | true  | no
     c       | overcast | hot         | high     | false | yes
     d       | rainy    | mild        | high     | false | yes
     ...     | ...      | ...         | ...      | ...   | ...
     Outlook through Windy are attributes, Play is the class, and each row is a datum (an instance).
 16. Decision Tree
     The root node tests outlook: the sunny branch leads to a split node on humidity (high → N, normal → Y), overcast leads directly to a Y leaf node, and the rainy branch leads to a split node on windy (true → N, false → Y).
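As a quick check, the tree on this slide can be written directly as code and run against rows a–d of the sample dataset. This sketch hard-codes the slide's tree rather than learning it from data.

```java
// The slide's decision tree, hard-coded: classify one instance by walking
// from the root (outlook) through a split node to a leaf.
public class DecisionTreeWalk {

    // Returns the predicted Play class ("yes"/"no") for one instance.
    static String classify(String outlook, String humidity, boolean windy) {
        switch (outlook) {                       // root node
            case "sunny":                        // split node: humidity
                return humidity.equals("high") ? "no" : "yes";
            case "overcast":                     // leaf node
                return "yes";
            case "rainy":                        // split node: windy
                return windy ? "no" : "yes";
            default:
                throw new IllegalArgumentException("unknown outlook: " + outlook);
        }
    }
}
```

Rows a–d of the sample dataset come out as no, no, yes, yes, matching the Play column.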
 17. Very Fast Decision Tree (VFDT)
     • The pioneering decision tree for streaming data
     • Uses information gain, gain ratio, and the Hoeffding bound
     • The Hoeffding bound decides whether the difference in information gain is enough to split or not
     • Often called the Hoeffding Tree
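The split rule can be made concrete. With value range R, confidence 1 − δ, and n observed instances, the Hoeffding bound is ε = sqrt(R² · ln(1/δ) / (2n)), and the tree splits when the observed gain difference between the two best attributes exceeds ε. A minimal sketch (VFDT also adds a tie-breaking threshold, omitted here):

```java
// Hoeffding bound: with probability 1 - delta, the true mean of a random
// variable with range R is within epsilon of its observed mean over n samples.
public class HoeffdingBoundSketch {

    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // VFDT-style split check: is the gap between the best and second-best
    // attribute's information gain large enough to be a real difference?
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return (bestGain - secondBestGain) > hoeffdingBound(range, delta, n);
    }
}
```

Note how the bound shrinks as n grows: the longer the stream runs, the smaller a gain difference is needed to justify a split.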
 18. Distributed Decision Tree
     Types of parallelism:
     • Horizontal: partition the data by instance
     • Vertical: partition the data by attribute
     • Task: tree leaf nodes grow in parallel
 19. MOA Hoeffding Tree Profiling
     CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%.
 20. Vertical Hoeffding Tree
     Topology: a source PI feeds the model-aggregator PI over the source stream; the model-aggregator sends attribute slices to n local-statistic PIs over the attribute stream; split computations flow back over the local-result stream, coordinated by a control stream; predictions reach the evaluator PI over the result stream.
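The "partition by attribute" routing from slide 18 is the heart of this topology. Below is a sketch of just that slicing step, assuming a simple modulo assignment of attribute index to local-statistic PI; the real VHT keying scheme may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Vertical partitioning sketch: the model-aggregator slices each instance's
// attributes across k local-statistic PIs. Assumes attribute i goes to
// PI (i mod k), an illustrative choice.
public class VerticalPartitionSketch {

    // Slice one dense instance into k per-PI maps of {attribute index -> value}.
    static List<Map<Integer, Double>> slice(double[] attributes, int k) {
        List<Map<Integer, Double>> slices = new ArrayList<>();
        for (int pi = 0; pi < k; pi++) {
            slices.add(new HashMap<>());
        }
        for (int i = 0; i < attributes.length; i++) {
            slices.get(i % k).put(i, attributes[i]);
        }
        return slices;
    }
}
```

Each local-statistic PI then maintains sufficient statistics only for its own slice of the attributes.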
 21. Evaluation
     Metrics:
     • Accuracy
     • Throughput
     Input data:
     • Random tree generator
     • Text generator (resembles tweets)
     Cluster: 3 shared nodes; 48 GB of RAM, Intel Xeon E5620 CPU @ 2.4 GHz (16 processors), Linux kernel 2.6.18.
 22. VHT Iteration 1 (VHT1)
     • Goal: verify algorithm correctness (same accuracy as MOA)
     • Utilized 2 internal queues: an instances queue and a local-result queue
     • Achieved the same accuracy, but throughput was low, so we proceeded to VHT2
 23. VHT Iteration 2 (VHT2)
     Goal: improve VHT1's throughput.
     • Kryo serializer: 2.5x throughput improvement
     • long identifiers instead of Strings
     • Removed the 2 internal queues from VHT1 → instances are discarded while the tree is attempting to split
 24. tree-10
     Around 8.2% difference in accuracy.
 25. tree-100
     Same trend as tree-10 (7.9% difference in accuracy).
 26. Number of leaf nodes, VHT2 on tree-100
     Very close and very high accuracy.
 27. Accuracy, VHT2 on text-1000
     Accuracy was low when the number of attributes increased.
 28. Throughput, VHT2 with the tree generator
     Not good for dense instances and a low number of attributes.
 29. Throughput, VHT2 with the text generator
     Higher throughput than MHT.
 30. Profiling results for text-1000 with 1,000,000 instances: execution time (seconds) of VHT2-par-3 vs. MHT, broken into t_calc, t_comm, and t_serial. Minimizing t_comm will increase throughput.
 31. Profiling results for text-10000 with 100,000 instances: execution time (seconds) of VHT2-par-3 vs. MHT, broken into t_calc, t_comm, and t_serial. Throughput: VHT2-par-3 at 2631 inst/sec vs. MHT at 507 inst/sec.
 32. Future Work
     • Open source the code
     • An evaluation layer in the SAMOA architecture
     • Online classification algorithms based on horizontal parallelism
 33. Conclusions
     Mining big data streams is challenging:
     • Systems need to satisfy the 3Vs of big data.
     SAMOA, a distributed streaming ML framework:
     • Architecture and abstractions
     • Stream Processing Engine (SPE) adapter
     • SAMOA integration with Storm
     Vertical Hoeffding Tree:
     • Better than MOA for a high number of attributes