Distributed Decision Tree Learning for Mining Big Data Streams

Master Thesis Presentation

Published in: Education, Technology

  1. Distributed Decision Tree Learning for Mining Big Data Streams. Master thesis presentation by Arinto Murdopo (EMDC), arinto@yahoo-inc.com. Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà.
  2. Big Data: Volume, Velocity, Variety. Example figures: 200 million users and 400 million tweets/day; 1+ TB/day to Hadoop and 2.7 TB/day of follower updates; 4.5 billion likes/day and 350 million photos/day (figures dated May 2013, March 2013, and May 2013).
  3. Machine Learning (ML): Make sense of the data, but how? Machine learning = learn and adapt based on data. Because of the 3Vs, we should: (1) distribute, to scale; (2) stream, to be fast; (3) distribute and stream, to be both scalable and fast.
  4. Are We Satisfied? We want machine learning frameworks that are scalable, fast, and loosely coupled. [Diagram labels: scale, fast, loose-coupling.]
  5. SAMOA (Scalable Advanced Massive Online Analysis): a distributed streaming machine learning framework • Fast, using a streaming model • Scalable, running on top of distributed stream processing engines (SPEs) such as Storm and S4 • Loose coupling between ML algorithms and SPEs
  6. Contributions. SAMOA: • Architecture and abstractions • Stream Processing Engine adapter • Integration with Storm. Vertical Hoeffding Tree: • Better than MOA for a high number of attributes
  7. SAMOA Architecture. [Diagram labels: classification methods, clustering methods, frequent pattern mining; SAMOA; Storm, S4, other SPEs.]
  8. SAMOA Abstractions, used to develop distributed ML algorithms. [Diagram labels: Topology, External Event Source, EPI, PI, Processor, Stream, Content Events, Grouping, Parallelism Hint.]
  9. SAMOA SPE-adapter • Transforms the abstractions into SPE-specific runtime components • Uses the abstract factory pattern to decouple the API from the SPE • Platform developers need to provide: (1) PI and EPI, (2) Stream, (3) Grouping
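The abstract factory idea on the SPE-adapter slide can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not SAMOA's actual API: `ComponentFactory`, `LocalFactory`, `LocalPI`, `LocalStream`, and `build_topology` are hypothetical names, a toy in-process engine stands in for a real SPE such as Storm or S4, and only the PI and Stream components are covered (EPI and Grouping are left out for brevity).

```python
from abc import ABC, abstractmethod


class ComponentFactory(ABC):
    """Hypothetical interface that each SPE adapter would implement."""

    @abstractmethod
    def create_pi(self, processor, parallelism=1):
        """Create a processing item that wraps a processor callable."""

    @abstractmethod
    def create_stream(self, source_pi):
        """Create a stream whose events originate from source_pi."""


class LocalPI:
    """Toy processing item that runs its processor in the same process."""

    def __init__(self, processor, parallelism):
        self.processor = processor
        self.parallelism = parallelism  # a hint only; unused by this toy engine
        self.outputs = []

    def process(self, event):
        result = self.processor(event)
        for stream in self.outputs:
            stream.put(result)
        return result


class LocalStream:
    """Toy stream that forwards each event to every connected PI."""

    def __init__(self, source_pi):
        self.destinations = []
        source_pi.outputs.append(self)

    def connect(self, pi):
        self.destinations.append(pi)

    def put(self, event):
        for pi in self.destinations:
            pi.process(event)


class LocalFactory(ComponentFactory):
    """Concrete factory for the toy engine; another SPE would get its own."""

    def create_pi(self, processor, parallelism=1):
        return LocalPI(processor, parallelism)

    def create_stream(self, source_pi):
        return LocalStream(source_pi)


def build_topology(factory, sink):
    """Algorithm-side code: it only ever talks to the factory interface."""
    doubler = factory.create_pi(lambda x: x * 2)
    collector = factory.create_pi(sink.append)
    stream = factory.create_stream(doubler)
    stream.connect(collector)
    return doubler


received = []
entry = build_topology(LocalFactory(), received)
for value in (1, 2, 3):
    entry.process(value)
print(received)
```

Because `build_topology` never names a concrete engine class, swapping the engine means passing a different factory, which is the decoupling the slide describes.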
  10. SAMOA SPE-adapter: examples of SPE-specific runtime components from the SPE-adapter. [Figure annotation: focus of this thesis.]
  11. Storm • Distributed stream processing engine • MapReduce-like programming model. [Diagram: spouts S1 and S2 and bolts B1 to B5 form a DAG connected by streams of tuples (stream A, stream B), with useful information written to data storage.]
  12. SAMOA-Storm Integration. Mapping between Storm and SAMOA: (1) Spout → Entrance Processing Item (EPI) (2) Bolt → Processing Item (PI) • Use composition for EPI and PI (3) Bolt Stream and Spout Stream → Stream • Storm pull model
  13. Contributions so far. [Diagram labels: SAMOA algorithm and API; ML-adapter (MOA, other ML frameworks); SPE-adapter (S4, Storm, other SPEs); samoa-SPE (samoa-S4, samoa-storm, samoa-other-SPEs). Properties: flexibility, scalability, extensibility.]
  14. Next contribution: distributed algorithm implementation, the Vertical Hoeffding Tree. Decision tree: • Classification • Divide and conquer • Easy to interpret
  15. Sample Dataset, used to build the tree. Each row is a datum (an instance); columns such as Outlook are attributes, and Play is the class.
      ID code | Outlook  | Temperature | Humidity | Windy | Play
      a       | sunny    | hot         | high     | false | no
      b       | sunny    | hot         | high     | true  | no
      c       | overcast | hot         | high     | false | yes
      d       | rainy    | mild        | high     | false | yes
      …       | …        | …           | …        | …     | …
  16. Decision Tree. [Figure: the root node splits on outlook (sunny, overcast, rainy); split nodes test humidity (high, normal) and windy (true, false); leaf nodes are labelled Y or N.]
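To make the root, split-node, and leaf-node terminology concrete, here is a small sketch that encodes a tree like the one on the slide as nested dictionaries and classifies the four instances shown on the sample-dataset slide. The exact branch-to-leaf assignments are an assumption, since the transcript only preserves the node labels; they follow the familiar weather example and happen to agree with the four listed rows. The names `TREE`, `ROWS`, and `classify` are illustrative only.

```python
# Split nodes are dicts with an attribute to test; leaves are class labels.
# ASSUMPTION: which branch leads to which leaf is not recoverable from the
# slide text; this layout follows the well-known weather example.
TREE = {
    "attribute": "outlook",  # root
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",  # leaf
        "rainy": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}

# The four instances listed on the sample-dataset slide.
ROWS = [
    {"id": "a", "outlook": "sunny", "temperature": "hot",
     "humidity": "high", "windy": "false", "play": "no"},
    {"id": "b", "outlook": "sunny", "temperature": "hot",
     "humidity": "high", "windy": "true", "play": "no"},
    {"id": "c", "outlook": "overcast", "temperature": "hot",
     "humidity": "high", "windy": "false", "play": "yes"},
    {"id": "d", "outlook": "rainy", "temperature": "mild",
     "humidity": "high", "windy": "false", "play": "yes"},
]


def classify(node, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


predictions = [classify(TREE, row) for row in ROWS]
for row, predicted in zip(ROWS, predictions):
    print(row["id"], "predicted:", predicted, "actual:", row["play"])
```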
  17. Very Fast Decision Tree (VFDT) • Pioneering decision tree algorithm for streaming data • Information gain + gain ratio + Hoeffding bound • The Hoeffding bound decides whether the difference in information gain is large enough to split • Often called the Hoeffding Tree
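The split test can be illustrated with the standard form of the Hoeffding bound, which the slide does not spell out: after n observations of a variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability 1 − δ. The sketch below assumes information gain as the split criterion (so R = log2 of the number of classes) and omits the tie-breaking threshold that practical implementations usually add; the function names and the example numbers are illustrative, not taken from the presentation.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, num_classes, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    value_range = math.log2(num_classes)  # range of information gain
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


# Illustrative numbers: the same observed gain difference (0.10) is not
# enough evidence after 200 instances but is enough after 2000.
early = should_split(0.30, 0.20, num_classes=2, delta=1e-7, n=200)
later = should_split(0.30, 0.20, num_classes=2, delta=1e-7, n=2000)
print(early, later)
```

The bound shrinks as more instances arrive at a leaf, so a leaf with a genuinely better attribute will eventually split without the learner ever storing the instances themselves.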
  18. Distributed Decision Tree: types of parallelism • Horizontal: partition the data by instance • Vertical: partition the data by attribute • Task: tree leaf nodes grow in parallel
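The difference between horizontal and vertical parallelism can be shown on the sample instances. In this sketch, which is only an illustration, `horizontal_partition` hands whole instances to workers and `vertical_partition` hands each worker a subset of one instance's attributes together with the class label. The round-robin assignment and both function names are assumptions made for the example; task parallelism is not shown.

```python
def horizontal_partition(instances, num_workers):
    """Each worker receives whole instances (round-robin by position)."""
    parts = [[] for _ in range(num_workers)]
    for index, instance in enumerate(instances):
        parts[index % num_workers].append(instance)
    return parts


def vertical_partition(instance, num_workers, label_key="play"):
    """Each worker receives some attributes of the instance plus its class label."""
    attribute_names = sorted(k for k in instance if k != label_key)
    parts = [{label_key: instance[label_key]} for _ in range(num_workers)]
    for index, name in enumerate(attribute_names):
        parts[index % num_workers][name] = instance[name]
    return parts


instances = [
    {"outlook": "sunny", "temperature": "hot",
     "humidity": "high", "windy": "false", "play": "no"},
    {"outlook": "sunny", "temperature": "hot",
     "humidity": "high", "windy": "true", "play": "no"},
    {"outlook": "overcast", "temperature": "hot",
     "humidity": "high", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "temperature": "mild",
     "humidity": "high", "windy": "false", "play": "yes"},
]

by_instance = horizontal_partition(instances, num_workers=2)
by_attribute = vertical_partition(instances[0], num_workers=2)
print(by_instance)
print(by_attribute)
```

With vertical partitioning every worker sees every instance but only part of its attributes, which is why the approach suits data with a high number of attributes.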
  19. MOA Hoeffding Tree Profiling. CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%.
  20. Vertical Hoeffding Tree. [Diagram: the processing items source PI, model-aggregator PI, local-statistic PI, and evaluator PI, connected by the streams source, attribute, control, local-result, and result.]
  21. Evaluation. Metrics: • Accuracy • Throughput. Input data: • Random Tree Generator • Text Generator, which resembles tweets. Cluster: 3 shared nodes; 48 GB of RAM; Intel Xeon CPU E5620 @ 2.4 GHz, 16 processors; Linux kernel 2.6.18.
  22. VHT Iteration 1 (VHT1) • Goal: verify algorithm correctness (same accuracy as MOA) • Used 2 internal queues: an instances queue and a local-result queue • Achieved the same accuracy, but throughput was low, so work proceeded to VHT2
  23. VHT Iteration 2 (VHT2). Goal: improve VHT1 throughput • Kryo serializer: 2.5x throughput improvement • long identifiers instead of String • Removed the 2 internal queues of VHT1 → instances are discarded while a split is being attempted
  24. tree-10: around 8.2% difference in accuracy.
  25. tree-100: same trend as tree-10 (7.9% difference in accuracy).
  26. Number of leaf nodes, VHT2, tree-100: very close and very high accuracy.
  27. Accuracy, VHT2, text-1000: low accuracy when the number of attributes increases.
  28. Throughput, VHT2, tree generator: not good for dense instances and a low number of attributes.
  29. Throughput, VHT2, text generator: higher throughput than MHT.
  30. Profiling results for text-1000 with 1000000 instances. [Bar chart: execution time in seconds per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial.] Minimizing t_comm will increase throughput.
  31. Profiling results for text-10000 with 100000 instances. [Bar chart: execution time in seconds per classifier (VHT2-par-3, MHT), broken down into t_calc, t_comm, and t_serial.] Throughput: VHT2-par-3 2631 inst/sec; MHT 507 inst/sec.
  32. Future Work • Open source • Evaluation layer in the SAMOA architecture • Online classification algorithms based on horizontal parallelism
  33. Conclusions. Mining big data streams is challenging • Systems need to satisfy the 3Vs of big data. SAMOA, a distributed streaming ML framework • Architecture and abstractions • Stream Processing Engine (SPE) adapter • SAMOA integration with Storm. Vertical Hoeffding Tree • Better than MOA for a high number of attributes
