Distributed Decision Tree Learning for Mining Big Data Streams
Master Thesis Presentation

Usage Rights: CC Attribution License

Presentation Transcript

  • Distributed Decision Tree Learning for Mining Big Data Streams. Master Thesis presentation by Arinto Murdopo (EMDC, arinto@yahoo-inc.com). Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà.
  • Big Data: 200 million users and 400 million tweets/day (May 2013); 1+ TB/day into Hadoop and 2.7 TB/day of follower updates (March 2013); 4.5 billion likes/day and 350 million photos/day (May 2013). The three Vs: Volume, Velocity, Variety.
  • Machine Learning (ML): how do we make sense of the data? Machine learning means learning and adapting based on data. Due to the 3Vs, we should: 1. distribute, to scale; 2. stream, to be fast; 3. distribute and stream, to be both scalable and fast.
  • Are We Satisfied? We want machine learning frameworks that scale, are fast, and are loosely coupled with the underlying platform.
  • SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework: fast, using the streaming model; scalable, running on top of distributed SPEs (Storm and S4); loosely coupled between ML algorithms and SPEs.
  • Contributions. SAMOA: architecture and abstractions, a stream processing engine (SPE) adapter, and integration with Storm. Vertical Hoeffding Tree: better than MOA for a high number of attributes.
  • SAMOA Architecture: machine learning modules (classification methods, clustering methods, frequent pattern mining) sit on top of SAMOA, which runs on Storm, S4, or other SPEs.
  • SAMOA Abstractions, used to develop distributed ML algorithms: Processing Item (PI), Entrance Processing Item (EPI), Processor, Stream, Content Events, Grouping, Parallelism Hint, Topology, and External Event Source.
  • SAMOA SPE-adapter: transforms the abstractions into SPE-specific runtime components, using the abstract factory pattern to decouple the API from the SPE. Platform developers need to provide: 1. PI and EPI, 2. Stream, 3. Grouping.
  • SAMOA SPE-adapter: examples of SPE-specific runtime components produced by the SPE-adapter; the Storm adapter is the focus of this thesis.
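The abstract-factory decoupling mentioned above can be sketched as follows. All class and method names here are hypothetical stand-ins for illustration, not the actual SAMOA API: the algorithm code talks only to a factory interface, and each SPE ships its own concrete factory.

```java
// Illustrative sketch of the abstract-factory idea behind the SPE-adapter.
// Names are hypothetical, not the real SAMOA API.

interface Stream { }
interface ProcessingItem { void connectInput(Stream s); }

interface ComponentFactory {
    ProcessingItem createPi();
    Stream createStream();
}

// Storm-specific runtime components (simplified placeholders).
class StormStream implements Stream { }

class StormPi implements ProcessingItem {
    Stream input;
    public void connectInput(Stream s) { input = s; }
}

class StormComponentFactory implements ComponentFactory {
    public ProcessingItem createPi() { return new StormPi(); }
    public Stream createStream() { return new StormStream(); }
}

public class SpeAdapterSketch {
    // Algorithm code depends only on the factory interface, so porting the
    // same algorithm to another SPE means supplying a different factory.
    static ProcessingItem buildTopology(ComponentFactory factory) {
        ProcessingItem pi = factory.createPi();
        pi.connectInput(factory.createStream());
        return pi;
    }

    public static void main(String[] args) {
        ProcessingItem pi = buildTopology(new StormComponentFactory());
        System.out.println(pi.getClass().getSimpleName()); // prints "StormPi"
    }
}
```

The point of the pattern is that `buildTopology` never names a concrete SPE class; an S4 factory could be passed in without changing the algorithm code.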
  • Storm: a distributed stream processing engine with a MapReduce-like programming model. Key concepts: Streams of Tuples, Spouts (sources), and Bolts (processing nodes, which can also store useful information to data storage), composed into a DAG topology.
  • SAMOA-Storm Integration, mapping between Storm and SAMOA: 1. Spout → Entrance Processing Item (EPI); 2. Bolt → Processing Item (PI), using composition for EPI and PI; 3. Bolt stream and Spout stream → Stream, following Storm's pull model.
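The composition point above can be illustrated with a minimal sketch. Names here are hypothetical; in the real integration the wrapper would implement Storm's bolt interface (`backtype.storm.topology.IRichBolt` at the time) rather than this simplified stand-in.

```java
// Sketch of composition over inheritance for the Storm integration: the
// Storm-side wrapper *holds* the SAMOA processor and delegates to it,
// instead of the processor extending a Storm Bolt class.

interface Processor {
    String process(String event);
}

class UppercaseProcessor implements Processor {
    public String process(String event) { return event.toUpperCase(); }
}

// Stand-in for a Storm bolt wrapping a SAMOA processor by composition.
public class ProcessingItemBolt {
    private final Processor wrapped;   // composition, not inheritance

    public ProcessingItemBolt(Processor p) { wrapped = p; }

    // In Storm this would be execute(Tuple); simplified to a String here.
    public String execute(String tuple) {
        return wrapped.process(tuple);
    }

    public static void main(String[] args) {
        ProcessingItemBolt bolt = new ProcessingItemBolt(new UppercaseProcessor());
        System.out.println(bolt.execute("stream event")); // prints "STREAM EVENT"
    }
}
```

Composition keeps the SAMOA processor free of any Storm dependency, which is what allows the same PI to run under S4 through a different wrapper.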
  • Contributions so far: the samoa-SPE stack, consisting of the SAMOA algorithm and API, an SPE-adapter (samoa-storm, samoa-S4, samoa-other-SPEs) and an ML-adapter (MOA and other ML frameworks), providing flexibility, scalability, and extensibility.
  • Next contribution: a distributed algorithm implementation, the Vertical Hoeffding Tree. Decision trees: used for classification, divide and conquer, easy to interpret.
  • Sample Dataset used to build the tree (each row is a datum, i.e. an instance; Outlook, Temperature, Humidity, and Windy are attributes; Play is the class):

    ID | Outlook  | Temperature | Humidity | Windy | Play
    a  | sunny    | hot         | high     | false | no
    b  | sunny    | hot         | high     | true  | no
    c  | overcast | hot         | high     | false | yes
    d  | rainy    | mild        | high     | false | yes
    …  | …        | …           | …        | …     | …
  • Decision Tree built from the dataset: the root splits on outlook; sunny → split node on humidity (high → N, normal → Y); overcast → leaf node Y; rainy → split node on windy (true → N, false → Y).
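The tree above can be read directly as code. Below is a minimal hard-coded version of this particular example tree (illustrative only; a learned tree is represented as node objects, not a fixed switch):

```java
// Hard-coded version of the example decision tree above.
public class PlayTree {
    static String classify(String outlook, String humidity, boolean windy) {
        switch (outlook) {
            case "sunny":
                return humidity.equals("high") ? "no" : "yes"; // split on humidity
            case "overcast":
                return "yes";                                  // leaf node
            default: // "rainy"
                return windy ? "no" : "yes";                   // split on windy
        }
    }

    public static void main(String[] args) {
        // Rows a, c, and d of the sample dataset:
        System.out.println(classify("sunny", "high", false));    // prints "no"
        System.out.println(classify("overcast", "high", false)); // prints "yes"
        System.out.println(classify("rainy", "high", false));    // prints "yes"
    }
}
```

Note that the tree never tests Temperature: the splits on outlook, humidity, and windy already separate the classes, which is the divide-and-conquer aspect mentioned above.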
  • Very Fast Decision Tree (VFDT): the pioneering decision tree algorithm for streaming data, often called the Hoeffding Tree. It combines information gain (or gain ratio) with the Hoeffding bound, which decides whether the observed difference in information gain between attributes is enough to split.
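The Hoeffding bound has a closed form: after n observations of a statistic with range R, the true mean lies within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the observed mean with probability 1 - delta. A minimal sketch of the resulting split test follows (it omits VFDT's tie-breaking refinement):

```java
// Hoeffding bound: with probability 1 - delta, the true mean of a statistic
// with range R is within epsilon of its observed mean after n observations.
//   epsilon = sqrt( R^2 * ln(1/delta) / (2n) )
// For information gain over c classes, R = log2(c).
public class HoeffdingBound {
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the gain difference between the two best attributes
    // exceeds epsilon (tie-breaking refinement omitted in this sketch).
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        // The bound shrinks as more instances arrive, so a leaf eventually
        // accumulates enough evidence to commit to a split.
        System.out.println(epsilon(1.0, 1e-7, 1_000) > epsilon(1.0, 1e-7, 100_000)); // prints "true"
        System.out.println(shouldSplit(0.30, 0.05, 1.0, 1e-7, 1_000));               // prints "true"
    }
}
```

Because epsilon only depends on n, R, and delta, the leaf never needs to revisit past instances, which is what makes the algorithm suitable for streams.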
  • Distributed Decision Trees, types of parallelism: horizontal, partition the data by instance; vertical, partition the data by attribute; task, tree leaf nodes grow in parallel.
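The vertical approach, which the Vertical Hoeffding Tree uses, can be sketched as routing each attribute index of an incoming instance to a local-statistic worker. The modulo routing below is an illustrative assumption, not SAMOA's actual grouping code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of vertical parallelism: each instance is split by
// attribute, and attribute i goes to worker i % parallelism, so every
// worker keeps split statistics for only a subset of the attributes.
public class VerticalPartition {
    static int route(int attributeIndex, int parallelism) {
        return attributeIndex % parallelism;
    }

    // Returns, for each worker, the list of attribute indices it receives.
    static List<List<Integer>> partition(int numAttributes, int parallelism) {
        List<List<Integer>> byWorker = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) byWorker.add(new ArrayList<>());
        for (int i = 0; i < numAttributes; i++) {
            byWorker.get(route(i, parallelism)).add(i);
        }
        return byWorker;
    }

    public static void main(String[] args) {
        // 5 attributes spread over 2 local-statistic workers:
        System.out.println(partition(5, 2)); // prints "[[0, 2, 4], [1, 3]]"
    }
}
```

The same routing must be used for every instance so that statistics for a given attribute always accumulate on the same worker.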
  • MOA Hoeffding Tree profiling, CPU time breakdown with 200 attributes: learning 70%, splitting 24%, other 6%.
  • Vertical Hoeffding Tree topology: a source PI feeds the model-aggregator PI, which sends an attribute stream to n local-statistic PIs; local results flow back to the model-aggregator over a local-result stream (with a control stream coordinating split attempts), and results go to the evaluator PI.
  • Evaluation. Metrics: accuracy and throughput. Input data: a random tree generator and a text generator that resembles tweets. Cluster: 3 shared nodes, each with 48 GB of RAM and an Intel Xeon E5620 CPU @ 2.4 GHz (16 processors), running Linux kernel 2.6.18.
  • VHT iteration 1 (VHT1). Goal: verify algorithm correctness (same accuracy as MOA). It used 2 internal queues: an instances queue and a local-result queue. It achieved the same accuracy, but throughput was low, so work proceeded with VHT2.
  • VHT iteration 2 (VHT2). Goal: improve VHT1's throughput. Changes: the Kryo serializer (2.5× throughput improvement), long identifiers instead of Strings, and removal of VHT1's 2 internal queues → instances arriving while a split is being attempted are discarded.
  • tree-10: around 8.2% difference in accuracy.
  • tree-100: same trend as tree-10 (7.9% difference in accuracy).
  • Number of leaf nodes, VHT2 on tree-100: very close, with very high accuracy.
  • Accuracy, VHT2 on text-1000: accuracy is low when the number of attributes increases.
  • Throughput, VHT2 with the tree generator: not good for dense instances and a low number of attributes.
  • Throughput, VHT2 with the text generator: higher throughput than MHT.
  • Classifier profiling results for text-1000 with 1,000,000 instances (execution time in seconds for VHT2-par-3 vs. MHT, broken down into t_calc, t_comm, and t_serial): minimizing t_comm will increase throughput.
  • Classifier profiling results for text-10000 with 100,000 instances (execution time in seconds for VHT2-par-3 vs. MHT). Throughput: VHT2-par-3 at 2631 inst/sec vs. MHT at 507 inst/sec.
  • Future Work: open-sourcing the code; an evaluation layer in the SAMOA architecture; online classification algorithms based on horizontal parallelism.
  • Conclusions. Mining big data streams is challenging: systems need to satisfy the 3Vs of big data. SAMOA, a distributed streaming ML framework: architecture and abstractions, a stream processing engine (SPE) adapter, and SAMOA integration with Storm. Vertical Hoeffding Tree: better than MOA for a high number of attributes.