Distributed Online Machine Learning Framework for Big Data
Shohei Hido, "Distributed Online Machine Learning Framework for Big Data", Invited talk at XLDB Asia 2012, Beijing, 22nd June, 2012.

Distributed Online Machine Learning Framework for Big Data Presentation Transcript

  • 1. Distributed Online Machine Learning Framework for Big Data. Shohei Hido, Preferred Infrastructure, Inc., Japan. XLDB Asia, June 22nd, 2012
  • 2. Overview: Big Data analytics will go real-time and deeper
      • 1. Bigger data
      • 2. More in real-time
      • 3. Deep analysis
      • No storage
      • No data sharing
      • Only mix model
  • 3. Jubatus: OSS platform for Big Data analytics
      • Joint development with NTT laboratory in Japan
          • Project started April 2011
      • Released as open source software
          • Just released 0.3.0
      • You can download it from http://github.com/jubatus/
      • Waiting for your contribution and collaboration
  • 4. Agenda
      • What’s missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 5. Increasing demand in Big Data applications: Real-time deeper analysis
      • Current focus: aggregation and rule processing on bigger data
          • CEP (Complex Event Processing) for real-time processing
          • Hadoop/MapReduce for distributed computation
      • Future: deeper analysis for rapid decisions and actions
          • Ex. 1: Defect detection on NY power grid [Rudin+, TPAMI2012]
          • Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
      • [Figure: chart with axes "Data size" and "Deep analysis" positioning Hadoop and CEP, with the open question "What will come?"]
      • References:
          • http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
          • http://www.computerworlduk.com/news/networking/3302464/
  • 6. Key technology: Machine learning
      • Examples that need rapid decisions under uncertainty
          • Anomaly detection from M2M sensor data
          • Energy demand forecast / Smart grid optimization
          • Security monitoring on raw Internet traffic
      • What is missing for fast & deep analytics on Big Data?
          • Online/real-time machine learning platform
          • + Scale-out distributed machine learning platform
      • 1. Bigger data 2. More in real-time 3. Deep analysis
  • 7. Online machine learning in Jubatus
      • Batch learning
          • Scan all data before building a model
          • Data must be stored in memory or storage
      • Online learning
          • The model is updated by each data sample
          • In some cases, theory guarantees that the online model converges to the batch model
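The batch/online contrast on slide 7 can be illustrated with the simplest possible "model", a mean. This is a minimal sketch, not Jubatus code: the class name `OnlineMean` is hypothetical, and the mean is chosen only because it is a case where the online result matches the batch result exactly.

```python
class OnlineMean:
    """Hypothetical toy model: a running mean updated one sample at a time."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # current model state

    def update(self, x):
        # Online learning: fold one sample into the model, then forget the sample.
        self.n += 1
        self.mean += (x - self.mean) / self.n


def batch_mean(data):
    # Batch learning: the whole dataset must be available before the model is built.
    return sum(data) / len(data)


if __name__ == "__main__":
    stream = [2.0, 4.0, 6.0]
    model = OnlineMean()
    for sample in stream:
        model.update(sample)
    print(model.mean, batch_mean(stream))
```

The online version keeps only two numbers in memory regardless of how many samples arrive, which is the property the slide highlights.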
  • 8. Jubatus focuses on latest online algorithms
      • Advantage: fast and not memory-intensive
          • Low latency & high throughput
          • No need for storing large datasets
      • E.g. linear classification algorithms
          • Perceptron (1958)
          • Passive Aggressive (PA) (2003)
          • Confidence Weighted Learning (CW) (2008)
          • AROW (2009)
          • Normal HERD (NHERD) (2010)
      • Slide annotation next to the newer algorithms: "Very recent progress"
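As an illustration of one algorithm from the list on slide 8, here is a minimal sketch of the basic Passive Aggressive update for binary labels in {-1, +1}. It follows the textbook rule (no slack variants) and is not taken from the Jubatus source; the function names `dot` and `pa_update` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y):
    """Return the weights after one basic PA step on sample (x, y), y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss of the current model
    sq_norm = dot(x, x)
    if loss == 0.0 or sq_norm == 0.0:
        return w                            # passive: margin already satisfied
    tau = loss / sq_norm                    # aggressive: smallest step that fixes the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    for x, y in [([1.0, 0.0], 1), ([0.0, 2.0], -1)]:
        w = pa_update(w, x, y)
    print(w)
```

Each call needs only the current weights and one sample, so no dataset has to be stored, which matches the "fast and not memory-intensive" point on the slide.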
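  • 9. Online learning or distributed learning: No unified solution has been available
      • Jubatus combines them into a unified computation framework
      • [Figure: two-by-two map with axes "Batch" to "Real-time/Online" and "Small scale & Stand-alone" to "Large scale, Distributed/Parallel computing". Online and small scale: online ML algorithms, PA [2003], CW [2008]. Online and large scale: Jubatus (2011-). Batch and small scale: WEKA (1993-), SPSS (1988-). Batch and large scale: Mahout (2006-).]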
  • 10. What Jubatus currently supports
      • Classification (multi-class)
          • Perceptron / PA / CW / AROW
      • Regression
          • PA-based regression
      • Nearest neighbor
          • LSH / MinHash / Euclid LSH
      • Recommendation
          • Based on nearest neighbor
      • Anomaly detection*
          • LOF based on nearest neighbor
      • Graph analysis*
          • Shortest path / Centrality (PageRank)
      • Some simple statistics
  • 11. Agenda
      • What’s missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 12. Hadoop and Mahout: Not good for online learning
      • Hadoop
          • Advantage
              • Many extensions for a variety of applications
              • Good for distributed data storing and aggregation
          • Disadvantage
              • No direct support for machine learning and online processing
      • Mahout
          • Advantage
              • Popular machine learning algorithms are implemented
          • Disadvantage
              • Some implementations are less mature
              • Still not capable of online machine learning
  • 13. Jubatus vs. Hadoop, RDB-based, and Storm: Advantage in online AND distributed ML
      • Only Jubatus satisfies both of them at the same time

      Capability             Jubatus            Hadoop          RDB               Storm
      Storing Big Data       ✓ (External DB)    ✓✓ (HDFS)       ✓                 ✓ (Ext. DB)
      Batch learning         ✓                  ✓✓ (Mahout)     ✓ (SPSS, etc.)    ✕
      Stream processing      ✓                  ✕               ✕                 ✓✓
      Distributed learning   ✓                  ✓✓ (Mahout)     ✕                 ✕
      Online learning        ✓✓                 ✕               ✕                 ✕

      (The "Online learning" row is marked "High importance" on the slide.)
  • 14. Agenda
      • What’s missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 15. How to make online algorithms distributed? => Not trivial!
      • [Figure: timelines comparing batch learning (one "Learn" phase followed by a single "Model update"; the learning is easy to parallelize) with online learning (alternating "Learn" and "Model update" steps over time; the update is hard to parallelize due to frequent updates)]
      • Online learning requires frequent model updates
      • A naïve distributed architecture leads to too many synchronization operations
      • This causes performance problems in terms of network communication and accuracy
  • 16. Solution: Loose model sharing
      • Jubatus only shares the local models in a loose manner
          • Model size << Data size
      • Jubatus DOES NOT share datasets
          • Unique approach compared to existing frameworks
      • Local models can be different on the servers
          • Different models will be gradually merged
      • [Figure: three servers, each with its own "Model", each ending up with a "Mixed model"]
  • 17. Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
      • 1. UPDATE
          • Receive a sample, learn and update the local model
      • 2. ANALYZE
          • Receive a sample, apply the local model, return result
      • 3. MIX (called automatically in backend)
          • Exchange and merge the local models between servers
      • Cf. Map-Shuffle-Reduce operations on Hadoop
      • Algorithms can be implemented independently from
          • Distribution logic
          • Data sharing
          • Failover
  • 18. UPDATE
      • Each server starts from an initial model
      • Each data sample is sent to one (or two) servers
      • Local models are updated based on the sample
      • Data samples are NEVER shared
      • [Figure: samples are distributed randomly or consistently to two servers; on each server the initial model becomes a local model (local model 1, local model 2)]
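The figure on slide 18 says samples are distributed "randomly or consistently". A minimal sketch of both routing styles follows. It reads "consistently" as "the same key always reaches the same server", which is an assumption, and it uses plain hash-modulo routing as a stand-in rather than whatever scheme Jubatus actually uses. The function names and the string sample key are hypothetical.

```python
import hashlib
import random


def route_random(servers, rng):
    """Pick any server uniformly at random for this sample."""
    return rng.choice(servers)


def route_by_key(servers, key):
    """Always send samples with the same key to the same server (assumed meaning
    of "consistently"); a stable hash is used so the choice survives restarts."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    servers = ["server-1", "server-2"]   # hypothetical server names
    rng = random.Random(0)
    print(route_random(servers, rng))
    print(route_by_key(servers, "user-42"))
```

Either way, the sample itself goes to the chosen server only and is never forwarded to the others, in line with "Data samples are NEVER shared".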
  • 19. MIX
      • Each server sends its model diff
      • Model diffs are merged and distributed
      • Only model diffs are transmitted
      • [Figure, per server: Local model 1 - Initial model = Model diff 1; Local model 2 - Initial model = Model diff 2]
      • [Figure, merge step: Model diff 1 + Model diff 2 = Merged diff]
      • [Figure, per server: Initial model + Merged diff = Mixed model]
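The MIX arithmetic on slide 19 can be sketched for models that are plain weight vectors. The slide does not spell out how diffs are merged (the figure only draws a "+"), so the element-wise average used here is an assumption, and all function names are hypothetical.

```python
def model_diff(local, initial):
    """Model diff = local model - initial model, element by element."""
    return [l - i for l, i in zip(local, initial)]


def merge_diffs(diffs):
    """Merge the diffs from all servers (assumed rule: element-wise average)."""
    n = len(diffs)
    return [sum(column) / n for column in zip(*diffs)]


def mix(initial, local_models):
    """Mixed model = initial model + merged diff."""
    diffs = [model_diff(m, initial) for m in local_models]
    merged = merge_diffs(diffs)
    return [i + d for i, d in zip(initial, merged)]


if __name__ == "__main__":
    initial = [0.0, 0.0]
    local_1 = [2.0, 0.0]   # server 1 learned something about feature 0
    local_2 = [0.0, 4.0]   # server 2 learned something about feature 1
    print(mix(initial, [local_1, local_2]))
```

Only the diffs cross the network in this scheme, so the traffic scales with model size rather than with the amount of data each server has seen.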
  • 20. UPDATE (iteration)
      • Locally updated models after MIX are discarded
      • Each server starts updating from the mixed model
      • The mixed model improves gradually thanks to all of the servers
      • [Figure: samples are distributed randomly or consistently to two servers; on each server the mixed model becomes a local model (local model 1, local model 2)]
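The UPDATE, MIX, UPDATE cycle described on slides 18 to 20 can be simulated in a single process. This is a rough sketch under several assumptions: each server runs a plain perceptron, samples are routed at random, MIX averages the diffs, and MIX fires every `mix_every` samples. None of these names or choices come from Jubatus itself.

```python
import random


class Server:
    """One node holding a local linear model (perceptron, as an assumption)."""

    def __init__(self, dim):
        self.w = [0.0] * dim

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        # UPDATE: learn from one sample, touching only the local model.
        if y * self.score(x) <= 0:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]

    def analyze(self, x):
        # ANALYZE: apply the current model and return a label in {-1, +1}.
        return 1 if self.score(x) >= 0 else -1


def mix(servers, base):
    # MIX: average the diffs against the shared base model (assumed rule),
    # then discard every local model in favour of the mixed one.
    n = len(servers)
    mixed = [b + sum(s.w[j] - b for s in servers) / n for j, b in enumerate(base)]
    for s in servers:
        s.w = list(mixed)
    return mixed


def train(samples, n_servers=2, mix_every=10, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0][0])
    servers = [Server(dim) for _ in range(n_servers)]
    base = [0.0] * dim                        # shared initial model
    for i, (x, y) in enumerate(samples, start=1):
        rng.choice(servers).update(x, y)      # each sample reaches one server only
        if i % mix_every == 0:
            base = mix(servers, base)         # next round starts from the mixed model
    mix(servers, base)                        # final MIX so all servers agree
    return servers


if __name__ == "__main__":
    data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)] * 10
    nodes = train(data)
    print(nodes[0].w, nodes[0].analyze([1.0, 0.0]))
```

Between two MIX rounds the servers never talk to each other, so the local models may drift apart; the periodic MIX pulls them back to one shared model, which is the "loose" sharing the slides describe.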
  • 21. ANALYZE
      • For prediction, each sample randomly goes to a server
      • The server applies the current mixed model to the sample
      • The prediction is returned to the client
      • [Figure: samples are distributed randomly to two servers; each applies its mixed model and returns a prediction]
  • 22. Why can Jubatus work in real-time?
      • Focus on online machine learning
          • Make online machine learning algorithms distributed
      • Update locally
          • Online training without communication with others
      • Mix only models globally
          • Small communication cost, low latency, good performance
          • Advantage compared to costly Shuffle in MapReduce
      • Analyze locally
          • Each server has the mixed model
          • Low latency for making predictions
      • Everything in-memory
          • Process data on-the-fly
  • 23. Agenda
      • What’s missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 24. Demo: Twitter analysis using natural language processing and machine learning
      • Jubatus classifies each tweet from the Twitter data stream into pre-defined categories.
      • A single Jubatus server is enough to classify over 5,000 queries per second (QPS), which is close to the rate of the raw Twitter data.
      • We provide a browser-based GUI.
  • 25. Experiment: Estimation of power consumption
      • Jubatus learns the power usage and network data flow pattern of certain servers.
      • The power consumption of individual servers can be estimated in real-time by monitoring and analyzing packets, without having to install power measurement modules on all servers.
      • Consumption differs for different types of packets.
      • [Figure: data center / office setup in which packet data is captured via a TAP; servers with a power meter provide actual values, and consumption is estimated for servers with no power meter. Plot of predicted value (W) against actual value (W).]
  • 26. Agenda
      • What’s missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 27. Summary
      • Jubatus is the first OSS platform for online distributed machine learning on Big Data streams.
      • Download it from http://github.com/jubatus/
      • We welcome your contribution and collaboration
      • 1. Bigger data 2. More in real-time 3. Deep analysis
      • No storage, no data sharing, only mix model