Jubatus Invited Talk at XLDB Asia

"Distributed Online Machine Learning Framework for Big Data", an invited talk for Jubatus at XLDB Asia, Beijing, 2012

Transcript

  • 1. Distributed Online Machine Learning Framework for Big Data
      Shohei Hido, Preferred Infrastructure, Inc., Japan
      XLDB Asia, June 22nd, 2012
  • 2. Preferred Infrastructure (PFI): to bring cutting-edge research advances to products
      • Founded: March 2006, located in Tokyo, Japan
      • Employees: 28
          • Top university graduates including ICPC world finalists
          • Mid-career engineers from Sony, IBM, Yahoo!, Sun
      • Information retrieval, distributed computing, natural language processing, machine learning
  • 4. Overview: Big Data analytics will go real-time and deeper
      • 1. Bigger data / 2. More in real-time / 3. Deep analysis
      • No storage / No data sharing / Only mix model
  • 5. Jubatus: OSS platform for Big Data analytics
      • Joint development with NTT laboratory in Japan
          • Project started April 2011
      • Released as open source software
          • Just released 0.3.0
      • You can download it from http://github.com/jubatus/
      • Waiting for your contribution and collaboration
  • 6. Agenda
      • What's missing for Big Data analytics
      • Comparison with existing software
      • Inside Jubatus: Update, Analyze, and Mix
      • Jubatus demo
      • Summary
  • 7. Increasing demand in Big Data applications: Real-time deeper analysis
      • Current focus: aggregation and rule processing on bigger data
          • CEP (Complex Event Processing) for real-time processing
          • Hadoop/MapReduce for distributed computation
      • Future: deeper analysis for rapid decisions and actions
          • Ex. 1: Defect detection on NY power grid [Rudin+, TPAMI2012]
          • Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]
      • (Figure: Hadoop and CEP placed by data size and depth of analysis, with the open question "What will come?")
      • References:
          • http://web.mit.edu/rudin/www/TPAMIPreprint.pdf
          • http://www.computerworlduk.com/news/networking/3302464/
  • 8. Key technology: Machine learning
      • Examples that need rapid decisions under uncertainty
          • Anomaly detection from M2M sensor data
          • Energy demand forecast / Smart grid optimization
          • Security monitoring on raw Internet traffic
      • What is missing for fast & deep analytics on Big Data?
          • Online/real-time machine learning platform
          • + Scale-out distributed machine learning platform
      • 1. Bigger data / 2. More in real-time / 3. Deep analysis
  • 9. Online machine learning in Jubatus
      • Batch learning
          • Scan all data before building a model
          • Data must be stored in memory or storage
      • Online learning
          • The model is updated by each data sample
          • Sometimes comes with a theoretical guarantee that the online model converges to the batch model
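
The batch/online distinction on slide 9 can be illustrated with a deliberately tiny "model": a mean. This is a minimal sketch, not Jubatus code, and the function names `batch_mean` and `online_mean` are hypothetical. For this simple case the online estimate coincides with the batch one, while the online version only keeps the current estimate and a counter in memory.

```python
def batch_mean(samples):
    """Batch learning: the whole dataset is needed before the model is built."""
    data = list(samples)              # all data must be stored first
    return sum(data) / len(data)


def online_mean(stream):
    """Online learning: update the model once per sample, then drop the sample."""
    mean, count = 0.0, 0
    for x in stream:
        count += 1
        mean += (x - mean) / count    # incremental update of the running mean
    return mean


if __name__ == "__main__":
    values = [2.0, 4.0, 9.0]
    print(batch_mean(values), online_mean(iter(values)))
```

The online version accepts any iterator, so it would also work on a stream that never fits in memory.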
  • 10. Jubatus focuses on the latest online algorithms
      • Advantage: fast and not memory-intensive
          • Low latency & high throughput
          • No need for storing large datasets
      • E.g. linear classification algorithms
          • Perceptron (1958)
          • Passive Aggressive (PA) (2003)
          • Confidence Weighted Learning (CW) (2008)
          • AROW (2009)
          • Normal HERD (NHERD) (2010)
      • The later algorithms are labeled "Very recent progress" on the slide
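
To make one of the listed algorithms concrete, here is a minimal sketch of a single Passive Aggressive (PA) update for binary classification, assuming labels in {-1, +1}, the hinge loss, and the basic PA variant without an aggressiveness parameter. The function name `pa_update` and the list-based weight vector are illustrative choices, not the Jubatus implementation.

```python
def pa_update(weights, x, y):
    """Return the weight vector after one PA step on sample (x, y), y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss on this sample
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(weights)                # passive: the sample is already handled
    tau = loss / norm_sq                    # aggressive: smallest step that fixes the loss
    return [w + tau * y * xi for w, xi in zip(weights, x)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    w = pa_update(w, [1.0, 2.0], 1)
    print(w)
```

Each call touches only the current sample and the weight vector, which is what makes this family of algorithms fast and light on memory.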
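  • 11. Online learning or distributed learning: No unified solution has been available
      • Jubatus combines them into a unified computation framework
      • (Figure: quadrant chart, real-time/online vs. batch and small scale & stand-alone vs. large scale distributed/parallel computing)
          • Online, small scale & stand-alone: online ML algorithms, e.g. PA [2003], CW [2008]
          • Online, large scale distributed/parallel: Jubatus (2011-)
          • Batch, small scale & stand-alone: WEKA (1993-), SPSS (1988-)
          • Batch, large scale distributed/parallel: Mahout (2006-)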
  • 12. What Jubatus currently supports
      • Classification (multi-class)
          • Perceptron / PA / CW / AROW
      • Regression
          • PA-based regression
      • Nearest neighbor
          • LSH / MinHash / Euclid LSH
      • Recommendation
          • Based on nearest neighbor
      • Anomaly detection*
          • LOF based on nearest neighbor
      • Graph analysis*
          • Shortest path / Centrality (PageRank)
      • Some simple statistics
  • 14. Hadoop and Mahout: Not good for online learning
      • Hadoop
          • Advantages: many extensions for a variety of applications; good for distributed data storing and aggregation
          • Disadvantage: no direct support for machine learning and online processing
      • Mahout
          • Advantage: popular machine learning algorithms are implemented
          • Disadvantages: some implementations are less mature; still not capable of online machine learning
  • 15. Jubatus vs. Hadoop, RDB-based, and Storm: Advantage in online AND distributed ML
      • Only Jubatus satisfies both of them at the same time

      Capability              Jubatus            Hadoop         RDB              Storm
      Storing Big Data        ✓ (external DB)    ✓✓ (HDFS)      ✓                ✓ (ext. DB)
      Batch learning          ✓                  ✓✓ (Mahout)    ✓ (SPSS, etc.)   ✕
      Stream processing       ✓                  ✕              ✕                ✓✓
      Distributed learning    ✓                  ✓✓ (Mahout)    ✕                ✕
      Online learning         ✓✓                 ✕              ✕                ✕

      • The slide marks online learning as high importance
  • 17. How to make online algorithms distributed? => Not trivial!
      • (Figure: timelines comparing batch learning, where one learning phase is followed by a model update and is easy to parallelize, with online learning, where learning and model updates alternate over time and are hard to parallelize due to frequent updates)
      • Online learning requires frequent model updates
      • A naïve distributed architecture leads to too many synchronization operations
      • It causes performance problems in terms of network communications and accuracy
  • 18. Solution: Loose model sharing
      • Jubatus only shares the local models in a loose manner
          • Model size << Data size
      • Jubatus DOES NOT share datasets
          • Unique approach compared to existing frameworks
      • Local models can be different on the servers
          • Different models will be gradually merged
      • (Figure: the models on three servers each become a mixed model)
  • 19. Three fundamental operations on Jubatus: UPDATE, ANALYZE, and MIX
      • 1. UPDATE
          • Receive a sample, learn and update the local model
      • 2. ANALYZE
          • Receive a sample, apply the local model, return result
      • 3. MIX (called automatically in backend)
          • Exchange and merge the local models between servers
      • C.f. Map-Shuffle-Reduce operations on Hadoop
      • Algorithms can be implemented independently from
          • Distribution logic
          • Data sharing
          • Failover
  • 20. UPDATE
      • Each server starts from an initial model
      • Each data sample is sent to one (or two) servers
      • Local models are updated based on the sample
      • Data samples are NEVER shared
      • (Figure: samples are distributed randomly or consistently; on each server the initial model becomes local model 1 or local model 2)
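
Slide 20 says samples are distributed "randomly or consistently" without further detail. The sketch below assumes that "consistently" means hash-based routing in which the same key always reaches the same server, and illustrates it with a small consistent-hash ring. `route_randomly`, `HashRing`, and the `replicas` parameter are hypothetical names for illustration and are not taken from Jubatus.

```python
import bisect
import hashlib
import random


def _hash(key):
    """Stable integer hash, independent of Python's per-process hash seed."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def route_randomly(servers, rng=random):
    """Random routing: any server may receive the sample."""
    return rng.choice(servers)


class HashRing:
    """Consistent routing: a given key is always sent to the same server."""

    def __init__(self, servers, replicas=100):
        # Each server owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, key):
        # The first ring point after the key's hash owns the key (wrapping around).
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = HashRing(["server-1", "server-2"])
    print(ring.route("sample-42"), route_randomly(["server-1", "server-2"]))
```

With a ring like this, adding a server moves only the keys that the new server takes over, so most samples keep going to the server that already holds their local state.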
  • 21. MIX
      • Each server sends its model diff
      • Model diffs are merged and distributed
      • Only model diffs are transmitted
      • (Figure, as equations)
          • Local model 1 - Initial model = Model diff 1
          • Local model 2 - Initial model = Model diff 2
          • Model diff 1 + Model diff 2 = Merged diff
          • Initial model + Merged diff = Mixed model (on every server)
  • 22. UPDATE (iteration)
      • After MIX, the locally updated models are discarded
      • Each server starts updating from the mixed model
      • The mixed model improves gradually thanks to all of the servers
      • (Figure: samples are distributed randomly or consistently; on each server the mixed model becomes local model 1 or local model 2)
  • 23. ANALYZE
      • For prediction, each sample randomly goes to a server
      • The server applies the current mixed model to the sample
      • The prediction will be returned to the client
      • (Figure: samples are distributed randomly; each server applies its mixed model and returns a prediction)
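
The UPDATE, MIX, and ANALYZE slides can be tied together in a single-process sketch. It assumes a perceptron-style linear model over named features and labels in {-1, +1}, and it merges diffs by simple addition as drawn on the MIX slide. `LocalServer`, `get_diff`, `put_diff`, and `mix` are names chosen for illustration and should not be read as the Jubatus API.

```python
class LocalServer:
    """Stand-in for one server holding a local linear model (hypothetical)."""

    def __init__(self):
        self.base = {}    # model as of the last MIX (initially empty)
        self.local = {}   # base model plus local updates since the last MIX

    def _score(self, features):
        return sum(self.local.get(name, 0.0) * value
                   for name, value in features.items())

    def update(self, features, label):
        """UPDATE: learn from one sample; the sample itself is never shared."""
        if label * self._score(features) <= 0:   # mistake-driven perceptron step
            for name, value in features.items():
                self.local[name] = self.local.get(name, 0.0) + label * value

    def analyze(self, features):
        """ANALYZE: apply the current model and return a prediction."""
        return 1 if self._score(features) > 0 else -1

    def get_diff(self):
        """Model diff = local model - model as of the last MIX."""
        names = set(self.local) | set(self.base)
        return {n: self.local.get(n, 0.0) - self.base.get(n, 0.0) for n in names}

    def put_diff(self, merged_diff):
        """Mixed model = previous base + merged diff; local updates are replaced."""
        for name, delta in merged_diff.items():
            self.base[name] = self.base.get(name, 0.0) + delta
        self.local = dict(self.base)


def mix(servers):
    """MIX: only model diffs travel between servers, never the samples."""
    merged = {}
    for server in servers:
        for name, delta in server.get_diff().items():
            merged[name] = merged.get(name, 0.0) + delta
    for server in servers:
        server.put_diff(merged)


if __name__ == "__main__":
    a, b = LocalServer(), LocalServer()
    a.update({"spam": 1.0}, 1)     # only server a sees this sample
    b.update({"ham": 1.0}, -1)     # only server b sees this sample
    mix([a, b])
    print(a.analyze({"ham": 1.0}), b.analyze({"spam": 1.0}))
```

After `mix`, both servers hold the same model, so either one can answer a query about a feature it never observed locally.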
  • 24. Why can Jubatus work in real-time?
      • Focus on online machine learning
          • Make online machine learning algorithms distributed
      • Update locally
          • Online training without communication with others
      • Mix only models globally
          • Small communication cost, low latency, good performance
          • Advantage compared to costly Shuffle in MapReduce
      • Analyze locally
          • Each server has the mixed model
          • Low latency for making predictions
      • Everything in-memory
          • Process data on-the-fly
  • 26. Demo: Twitter analysis using natural language processing and machine learning
      Jubatus classifies each tweet from the Twitter data stream into pre-defined categories. A single Jubatus server is enough to classify over 5,000 queries per second (QPS), which is close to the rate of the raw Twitter data. We provide a browser-based GUI.
  • 27. Experiment: Estimation of power consumption
      Jubatus learns the power usage and network data flow pattern of certain servers. The power consumption of individual servers can be estimated in real-time by monitoring and analyzing packets, without having to install power measurement modules on all servers.
      • Consumption differs for different types of packets
      • (Figure: data center / office setup in which packet data from a TAP is used to estimate power for servers with no power meter, alongside servers with a power meter; plot of predicted value (W) against actual value (W))
  • 29. Summary
      • Jubatus is the first OSS platform for online distributed machine learning on Big Data streams.
      • Download it from http://github.com/jubatus/
      • We welcome your contribution and collaboration
      • 1. Bigger data / 2. More in real-time / 3. Deep analysis
      • No storage / No data sharing / Only mix model
