Distributed Decision Tree Learning for Mining Big Data Streams
1. Distributed Decision Tree Learning for Mining Big Data Streams
Master Thesis presentation by:
Arinto Murdopo
EMDC
arinto@yahoo-inc.com
Supervisors:
Albert Bifet
Gianmarco de Francisci Morales
Ricard Gavaldà
2. Big Data
• 200 million users
• 400 million tweets/day
• 1+ TB/day to Hadoop
• 2.7 TB/day follower update
• 4.5 billion likes/day
• 350 million photos/day
The 3Vs: Volume, Velocity, Variety
(Figures dated March 2013 and May 2013)
3. Machine Learning (ML)
Make sense of the data, but how?
Machine Learning = learn & adapt based on data
Due to the 3Vs, we should:
1. Distribute, to scale
2. Stream, to be fast
3. Distribute and stream, to scale and be fast
4. Are We Satisfied?
[Figure labels: scale, fast, loose-coupling]
We want machine learning frameworks that scale, are fast, and are loosely coupled
5. SAMOA
Scalable Advanced Massive Online Analysis
Distributed Streaming Machine Learning Framework:
• Fast, using the streaming model
• Scalable, running on top of distributed stream processing engines (SPEs) such as Storm and S4
• Loose coupling between ML algorithms and SPEs
6. Contributions
SAMOA
• Architecture and Abstractions
• Stream Processing Engine Adapter
• Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
8. SAMOA Abstractions
To develop distributed ML algorithms
[Figure: a SAMOA Topology. Labels: External Event Source, Entrance Processing Item (EPI), Processing Item (PI), Processor, Stream, Content Events, Grouping, Parallelism Hint]
9. SAMOA SPE-adapter
• Transforms the abstractions into SPE-specific runtime components
• Uses the abstract factory pattern to decouple the API from the SPE
• Platform developers need to provide:
1. PI and EPI
2. Stream
3. Grouping
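The decoupling described above can be illustrated with a minimal sketch of the abstract factory pattern. This is not SAMOA's actual API: `ComponentFactory`, `StormLikeFactory`, `LocalFactory`, and `build_topology` are hypothetical names, and components are plain strings instead of real runtime objects.

```python
from abc import ABC, abstractmethod


class ComponentFactory(ABC):
    """Hypothetical factory interface that each platform adapter implements."""

    @abstractmethod
    def create_pi(self, name: str) -> str: ...

    @abstractmethod
    def create_epi(self, name: str) -> str: ...

    @abstractmethod
    def create_stream(self, source: str, grouping: str) -> str: ...


class StormLikeFactory(ComponentFactory):
    """Toy adapter that labels components with Storm terms (no real Storm calls)."""

    def create_pi(self, name):
        return f"bolt:{name}"

    def create_epi(self, name):
        return f"spout:{name}"

    def create_stream(self, source, grouping):
        return f"storm-stream:{source}:{grouping}"


class LocalFactory(ComponentFactory):
    """Toy adapter for a single-process run."""

    def create_pi(self, name):
        return f"local-pi:{name}"

    def create_epi(self, name):
        return f"local-epi:{name}"

    def create_stream(self, source, grouping):
        return f"local-stream:{source}:{grouping}"


def build_topology(factory: ComponentFactory) -> list:
    """Algorithm code: depends only on the abstract factory, never on an SPE."""
    source = factory.create_epi("source")
    stream = factory.create_stream("source", "shuffle")
    learner = factory.create_pi("learner")
    return [source, stream, learner]
```

Calling `build_topology(StormLikeFactory())` and `build_topology(LocalFactory())` runs the same algorithm code against two platforms; only the factory changes.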
11. Storm
• Distributed Streaming Processing Engine
• MapReduce-like programming model
[Figure: a Storm topology as a DAG. Spouts S1 and S2, bolts B1 to B5, streams A and B carrying tuples; a bolt stores useful information in data storage]
12. SAMOA-Storm Integration
Mapping between Storm and SAMOA
1. Spout → Entrance Processing Item (EPI)
2. Bolt → Processing Item (PI)
• Use composition for EPI and PI
3. Bolt Stream & Spout Stream → Stream
• Storm pull model
13. Contributions so far
[Figure: SAMOA architecture. The Algorithm and API layer connects through the SPE-adapter (samoa-S4, samoa-storm, samoa-other-SPEs) to S4, Storm, and other SPEs, and through the ML-adapter to MOA and other ML frameworks. Labels: Flexibility, Scalability, Extensibility]
15. Sample Dataset
ID Code  Outlook   Temperature  Humidity  Windy  Play
a        sunny     hot          high      false  no
b        sunny     hot          high      true   no
c        overcast  hot          high      false  yes
d        rainy     mild         high      false  yes
…        …         …            …         …      …
Outlook, Temperature, Humidity, and Windy are attributes; Play is the class.
Each row is a datum (an instance) used to build the tree.
17. Very Fast Decision Tree (VFDT)
• Pioneering decision tree algorithm for streaming data
• Uses information gain, gain ratio, and the Hoeffding bound
• The Hoeffding bound decides whether the difference in information gain between candidate attributes is large enough to split
• Often called the Hoeffding Tree
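The split decision can be sketched as follows. The bound used here is the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split metric, delta is the allowed error probability, and n is the number of instances seen at the leaf. The function names and the example numbers are illustrative assumptions, and the tie-breaking rule of the full algorithm is left out.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean lies within epsilon of the observed
    mean with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 n_classes: int, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon.

    For information gain the range R is log2(number of classes).
    """
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Illustrative numbers (assumed, not from the thesis): two classes,
# delta = 1e-7, gains of 0.30 and 0.25 for the two best attributes.
few = should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=1_000)
many = should_split(0.30, 0.25, n_classes=2, delta=1e-7, n=10_000)
```

With the same gain difference, the leaf does not split after 1,000 instances (`few` is `False`) but does after 10,000 (`many` is `True`), because epsilon shrinks as n grows.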
18. Distributed Decision Tree
Types of parallelism
• Horizontal
• Partition the data by the instance
• Vertical
• Partition the data by the attribute
• Task
• Tree leaf nodes grow in parallel
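The difference between horizontal and vertical parallelism can be shown on the four rows of the sample dataset. This is a single-process sketch with hypothetical helper names; round-robin assignment is an assumption made for illustration, and task parallelism is not covered.

```python
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# (attribute values, class label) for instances a to d of the sample dataset
INSTANCES = [
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("rainy", "mild", "high", "false"), "yes"),
]


def horizontal_partition(instances, n_workers):
    """Each worker receives whole instances (round-robin by instance)."""
    parts = [[] for _ in range(n_workers)]
    for i, instance in enumerate(instances):
        parts[i % n_workers].append(instance)
    return parts


def vertical_partition(instances, attributes, n_workers):
    """Each worker receives every instance, but only its own attributes,
    as (attribute, value, class label) triples."""
    parts = [[] for _ in range(n_workers)]
    for values, label in instances:
        for j, attribute in enumerate(attributes):
            parts[j % n_workers].append((attribute, values[j], label))
    return parts


by_instance = horizontal_partition(INSTANCES, 2)
by_attribute = vertical_partition(INSTANCES, ATTRIBUTES, 2)
```

With two workers, horizontal partitioning gives each worker two complete instances, while vertical partitioning gives the first worker the Outlook and Humidity values of all four instances and the second worker the Temperature and Windy values.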
19. MOA Hoeffding Tree Profiling
CPU time breakdown, 200 attributes:
• Learn: 70%
• Split: 24%
• Other: 6%
20. Vertical Hoeffding Tree
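[Figure: VHT topology. The source PI feeds the model-aggregator PI through the source stream; the model-aggregator PI sends the attribute and control streams to the local-statistic PIs, receives the local-result stream from them, and sends the result stream to the evaluator PI]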
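The interplay between the model-aggregator PI and the local-statistic PIs can be approximated by the single-process sketch below. It is a simplification under stated assumptions, not the thesis implementation: `LocalStatistic` and `find_best_split` are hypothetical names, attributes are assigned round-robin, only information gain is computed, and the Hoeffding bound check is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


class LocalStatistic:
    """Toy local-statistic worker: keeps counts for its own attributes only."""

    def __init__(self):
        # attribute -> attribute value -> Counter of class labels
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def update(self, attribute, value, label):
        self.counts[attribute][value][label] += 1

    def best_local(self):
        """Return (information gain, attribute) for this worker's best attribute."""
        results = []
        for attribute, by_value in self.counts.items():
            class_totals = Counter()
            for label_counts in by_value.values():
                class_totals.update(label_counts)
            n = sum(class_totals.values())
            remainder = sum(sum(c.values()) / n * entropy(c.values())
                            for c in by_value.values())
            results.append((entropy(class_totals.values()) - remainder, attribute))
        return max(results)


def find_best_split(instances, attributes, n_workers):
    """Toy model aggregator: route attribute values, then merge local results."""
    workers = [LocalStatistic() for _ in range(n_workers)]
    for values, label in instances:
        for j, attribute in enumerate(attributes):
            # Plays the role of the attribute stream.
            workers[j % n_workers].update(attribute, values[j], label)
    # Plays the role of the local-result stream.
    local_results = [w.best_local() for w in workers if w.counts]
    return max(local_results)


sample = [
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("rainy", "mild", "high", "false"), "yes"),
]
gain, attribute = find_best_split(
    sample, ["Outlook", "Temperature", "Humidity", "Windy"], n_workers=2)
```

On the four sample rows, Outlook separates the classes perfectly, so the aggregator selects it. Each worker only ever touches the counts of its own attributes, which is why this scheme suits instances with many attributes.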
21. Evaluation
Metrics:
• Accuracy
• Throughput
Input data:
• Random Tree Generator
• Text Generator – resembles tweets
Cluster: 3 shared nodes with 48 GB of RAM, Intel Xeon CPU E5620 @ 2.4 GHz (16 processors), and Linux kernel 2.6.18
22. VHT Iteration 1 (VHT1)
• Goal: verify algorithm correctness (same accuracy as MOA)
• Used 2 internal queues: an instances queue and a local-result queue
• Achieved the same accuracy, but throughput was low, so we proceeded to VHT2
23. VHT Iteration 2 (VHT2)
Goal: improve VHT1 throughput
• Kryo serializer: 2.5x throughput improvement
• long identifier instead of String
• Removed the 2 internal queues of VHT1: instances are discarded while a split is being attempted
32. Future Work
• Open Source
• Evaluation layer in SAMOA architecture
• Online classification algorithms based on horizontal parallelism
33. Conclusions
Mining big data streams is challenging
• Systems need to handle the 3Vs of big data.
SAMOA – Distributed Streaming ML Framework
• Architecture and Abstractions
• Stream Processing Engine (SPE) adapter
• SAMOA Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes