Distributed Decision Tree Learning for Mining Big Data Streams

Master Thesis Presentation

Published in: Education, Technology


  1. 1. Distributed Decision Tree Learning for Mining Big Data Streams. Master thesis presentation by Arinto Murdopo (EMDC, arinto@yahoo-inc.com). Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà
  2. 2. Big Data: Volume, Velocity, Variety. 200 million users; 400 million tweets/day; 1+ TB/day to Hadoop; 2.7 TB/day follower update; 4.5 billion likes/day; 350 million photos/day (figures dated March 2013 and May 2013)
  3. 3. Machine Learning (ML) Make sense of the data, but how? Machine learning = learn and adapt based on data. Due to the 3Vs, we should: 1. Distribute, to scale 2. Stream, to be fast 3. Distribute and stream, to scale and be fast
  4. 4. Are We Satisfied? We want machine learning frameworks that are scalable, fast, and loosely coupled
  5. 5. SAMOA Scalable Advanced Massive Online Analysis. Distributed Streaming Machine Learning Framework: • Fast, using a streaming model • Scalable, running on top of distributed stream processing engines (SPEs): Storm and S4 • Loose coupling between ML algorithms and SPEs
  6. 6. Contributions SAMOA: • Architecture and Abstractions • Stream Processing Engine Adapter • Integration with Storm. Vertical Hoeffding Tree: • Better than MOA for a high number of attributes
  7. 7. SAMOA Architecture [Figure with the labels: Classification Methods, Clustering Methods, Frequent Pattern Mining; SAMOA; Storm, S4, Other SPEs]
  8. 8. SAMOA Abstractions To develop distributed ML algorithms [Figure with the labels: Topology, Processor, Processing Item (PI), Entrance Processing Item (EPI), Stream, Content Events, Grouping, Parallelism Hint, External Event Source]
  9. 9. SAMOA SPE-adapter • Transforms the abstractions into SPE-specific runtime components • Abstract factory pattern to decouple API and SPE • Platform developers need to provide: 1. PI and EPI 2. Stream 3. Grouping
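
The abstract factory idea on this slide can be illustrated with a minimal sketch. Every class, method, and string below is hypothetical and chosen only for illustration; none of it is SAMOA's actual API. Plain strings stand in for the runtime components that a real adapter would create.

```python
from abc import ABC, abstractmethod
from typing import List


class ComponentFactory(ABC):
    """Hypothetical factory interface that algorithm code depends on."""

    @abstractmethod
    def create_processing_item(self, name: str) -> str:
        """Create the platform-specific stand-in for a processing item."""

    @abstractmethod
    def create_stream(self, name: str) -> str:
        """Create the platform-specific stand-in for a stream."""


class StormFactory(ComponentFactory):
    """Hypothetical factory producing Storm-flavoured components."""

    def create_processing_item(self, name: str) -> str:
        return f"storm-bolt:{name}"

    def create_stream(self, name: str) -> str:
        return f"storm-stream:{name}"


class S4Factory(ComponentFactory):
    """Hypothetical factory producing S4-flavoured components."""

    def create_processing_item(self, name: str) -> str:
        return f"s4-pe:{name}"

    def create_stream(self, name: str) -> str:
        return f"s4-stream:{name}"


def build_topology(factory: ComponentFactory) -> List[str]:
    """Algorithm-side code: it only sees the abstract factory."""
    return [
        factory.create_processing_item("learner"),
        factory.create_stream("instances"),
    ]
```

Swapping `StormFactory()` for `S4Factory()` changes the components that `build_topology` produces without touching the algorithm-side function, which is the decoupling the slide describes.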
  10. 10. SAMOA SPE-adapter Examples of SPE-specific runtime components from SPE-adapter 10 Focus of this thesis
  11. 11. Storm • Distributed Streaming Processing Engine • MapReduce-like programming model [Figure: a DAG with spouts S1 and S2, bolts B1 to B5, streams of tuples (stream A, stream B), and a data storage that stores useful information]
  12. 12. SAMOA-Storm Integration Mapping between Storm and SAMOA: 1. Spout → Entrance Processing Item (EPI) 2. Bolt → Processing Item (PI) • Use composition for EPI and PI 3. Bolt Stream & Spout Stream → Stream • Storm pull model
  13. 13. Contributions so far [Figure with the labels: SAMOA Algorithm and API; SPE-adapter (S4, Storm, other SPEs); ML-adapter (MOA, other ML frameworks); samoa-SPE (samoa-S4, samoa-storm, samoa-other-SPEs); Flexibility, Scalability, Extensibility]
  14. 14. Next Contribution… Distributed Algorithm implementation: Vertical Hoeffding Tree Decision tree: • Classification • Divide and conquer • Easy to interpret 14
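  15. 15. Sample Dataset (used to build the tree)
      ID Code | Outlook  | Temperature | Humidity | Windy | Play
      a       | sunny    | hot         | high     | false | no
      b       | sunny    | hot         | high     | true  | no
      c       | overcast | hot         | high     | false | yes
      d       | rainy    | mild        | high     | false | yes
      …       | …        | …           | …        | …     | …
      Each column from Outlook to Windy is an attribute, Play is the class, and each row is a datum (an instance).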
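
As a rough illustration of how an attribute can be scored when building a tree from such data, the sketch below computes information gain over only the four rows visible on this slide (a to d). The remaining rows are elided in the source, so the numbers describe this fragment and not the full dataset. The function and variable names are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List

# Only the four rows shown on the slide; the rest of the dataset is elided.
ROWS: List[Dict[str, str]] = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false", "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true", "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "false", "play": "yes"},
]


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows: List[Dict[str, str]], attribute: str, target: str = "play") -> float:
    """Entropy of the class minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups: Dict[str, List[str]] = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


if __name__ == "__main__":
    for name in ("outlook", "temperature", "humidity", "windy"):
        print(name, round(information_gain(ROWS, name), 4))
```

On these four rows, splitting on outlook separates the classes completely, while humidity has a single value and therefore yields no gain.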
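  16. 16. Decision Tree [Figure: a tree whose root tests outlook (sunny, overcast, rainy), with split nodes on humidity (high, normal) and windy (true, false), and leaf nodes labeled Y or N. Annotations: root, split node, leaf node]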
  17. 17. Very Fast Decision Tree (VFDT) • Pioneering decision tree learner for streaming data • Information Gain + Gain Ratio + Hoeffding bound • The Hoeffding bound decides whether the difference in information gain between candidate attributes is large enough to split • Often called Hoeffding Tree
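
The split test described on this slide can be sketched with the standard Hoeffding bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), where R is the range of the split criterion, delta the allowed error probability, and n the number of instances seen at the leaf. The function names and the parameter values in the usage note are assumptions for illustration, and the tie-breaking rule used by practical implementations is left out.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_best_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_best_gain) > epsilon
```

For example, with a range of 1.0, delta of 1e-7, and 1000 instances, epsilon is roughly 0.09, so a gain difference of 0.15 triggers a split while a difference of 0.05 does not. Because epsilon shrinks as n grows, close contests are resolved by waiting for more instances.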
  18. 18. Distributed Decision Tree Types of parallelism: • Horizontal: partition the data by instance • Vertical: partition the data by attribute • Task: tree leaf nodes grow in parallel
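
The difference between horizontal and vertical partitioning can be made concrete with a small sketch over the sample rows from the earlier dataset slide. The round-robin assignment and all function names here are assumptions chosen for simplicity; the slides do not say how instances or attributes are assigned to workers.

```python
from typing import Dict, List

Instance = Dict[str, str]

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# The four rows visible in the sample dataset slide.
ROWS: List[Instance] = [
    {"id": "a", "outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false", "play": "no"},
    {"id": "b", "outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true", "play": "no"},
    {"id": "c", "outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "false", "play": "yes"},
    {"id": "d", "outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "false", "play": "yes"},
]


def partition_horizontally(instances: List[Instance], workers: int) -> List[List[Instance]]:
    """Horizontal: each worker receives whole instances (round-robin here)."""
    parts: List[List[Instance]] = [[] for _ in range(workers)]
    for index, instance in enumerate(instances):
        parts[index % workers].append(instance)
    return parts


def partition_vertically(instance: Instance, attributes: List[str], workers: int) -> List[Dict[str, str]]:
    """Vertical: each worker receives a subset of one instance's attributes."""
    parts: List[Dict[str, str]] = [{} for _ in range(workers)]
    for index, name in enumerate(attributes):
        parts[index % workers][name] = instance[name]
    return parts
```

With two workers, horizontal partitioning sends rows a and c to the first worker and rows b and d to the second, whereas vertical partitioning of row a sends outlook and humidity to the first worker and temperature and windy to the second.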
  19. 19. MOA Hoeffding Tree Profiling. CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%
  20. 20. Vertical Hoeffding Tree [Figure: topology with a source PI, a model-aggregator PI, local-statistic PIs, and an evaluator PI, connected by the streams source, attribute, control, local-result, and result]
  21. 21. Evaluation Metrics: • Accuracy • Throughput. Input data: • Random Tree Generator • Text Generator, which resembles tweets. Cluster: 3 shared nodes; 48 GB of RAM; Intel Xeon CPU E5620 @ 2.4 GHz (16 processors); Linux kernel 2.6.18
  22. 22. VHT Iteration 1 (VHT1) • Goal: verify algorithm correctness (same accuracy as MOA) • Used 2 internal queues: an instances queue and a local-result queue • Achieved the same accuracy, but throughput was low, so we proceeded with VHT2
  23. 23. VHT Iteration 2 (VHT2) Goal: improve VHT1 throughput • Kryo serializer: 2.5x throughput improvement • long identifier instead of String • Remove the 2 internal queues of VHT1 → discard instances while attempting to split
  24. 24. tree-10: Around 8.2% difference in accuracy
  25. 25. tree-100: Same trend as tree-10 (7.9% difference in accuracy)
  26. 26. No. Leaf Nodes VHT2 – tree-100: Very close and very high accuracy
  27. 27. Accuracy VHT2 – text-1000: Low accuracy when the number of attributes increases
  28. 28. Throughput VHT2 – tree-generator: Not good for dense instances and a low number of attributes
  29. 29. Throughput VHT2 – text-generator: Higher throughput than MHT
  30. 30. Profiling Results for text-1000 with 1000000 instances [Chart: execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial] Minimizing t_comm will increase throughput
  31. 31. Profiling Results for text-10000 with 100000 instances [Chart: execution time (seconds) per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial] Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec
  32. 32. Future Work • Open Source • Evaluation layer in SAMOA architecture • Online classification algorithms that are based on horizontal parallelism 32
  33. 33. Conclusions Mining big data streams is challenging • Systems need to satisfy the 3Vs of big data. SAMOA, a Distributed Streaming ML Framework: • Architecture and Abstractions • Stream Processing Engine (SPE) adapter • SAMOA integration with Storm. Vertical Hoeffding Tree: • Better than MOA for a high number of attributes
