Jubatus talk at HadoopSummit 2013


"Jubatus: Real-time and highly-scalable machine learning platform" at HadoopSummit, 2013/06/27 (Revised version)

Published in: Technology, Business

  1. 1. Jubatus: Real-time and Highly-scalable Machine Learning Platform
      • Shohei Hido, Preferred Infrastructure, Inc., Japan
      • HadoopSummit 2013 @ San Jose, CA, 2013/06/27
  2. 2. Jubatus: OSS for real-time big data analytics
      • Joint development with NTT laboratory in Japan
      • Released Oct. 2011 (current version is v0.4.3)
      • You can download it from https://github.com/jubatus/
      [Figure labels: 1. Bigger data; 2. More in real-time; 3. Machine learning]
  3. 3. Bottom line: Just two words
  4. 4. l  Software company in Tokyo, Japan (founded in 2006) l  Focus on long-term technology innovation l  28 regular employees, many top-notch engineers l  Customers: media, e-commerce, research institutes Distributed computing Natural language processing Machine learning Information retrieval Preferred Infrastructure, Inc. (PFI) -To bring cutting-edge research advances to the real world- 4
  5. 5. l  What is Jubatus? : Motivation and applications l  How Jubatus works? : The architecture l  How to use it : Quick-start steps l  Summary and future Agenda
  6. 6. At HadoopSummit last year: Everyone talked about “real-time”
      [Figure labels: Real-time BI; Definition of real-time; Real-time analytics; Real-time SQL-like query; Real-time processing; Real-time ad-hoc query; Real-time visualization]
  7. 7. Real-time big data analytics: A trend
      From an O’Reilly article (2013):
      • “Real-time big data isn’t just a process for storing petabytes or exabytes of data in a data warehouse”
      • “It’s about the ability to make better decisions and take meaningful actions at the right time.”
      • “It’s about combining and analyzing data so you can take the right action, at the right time, and at the right place”
      Michael Minelli, co-author of “Big Data, Big Analytics”
  8. 8. Speed and depth of big data analytics: a whitespace
      [Figure: axes “Deeper analytics” and “Decision speed”; plotted systems: Hadoop ecosystem, Sedue, Jubatus; application areas: Surveillance camera, Security traffic, Automobile, Agriculture, Market Research, Education, Bio, Health Care]
  9. 9. Big data analytics will go real-time and deeper
      • Future: Deeper analytics for rapid decisions and actions
      • Twitter analysis for personalized advertisement optimization
      • Anomaly detection from M2M sensor data
      • Energy demand forecast / Smart grid optimization
      • Security monitoring on network traffic or financial fraud
      [Figure labels: 1. Bigger data; 2. More in real-time; 3. Machine learning]
  10. 10. Demo: real-time tweet categorization
      • Automatically learns “Apple + iPad => Apple” then “iPad => Apple” in real-time
  11. 11. Jubatus is part of the Twitter ecosystem in Japan
      • NTT Data: Exclusive tweet reseller in Japan
      • Firehose contract with Twitter
      • Jubatus is an official tool for analytics on Japanese tweets
      • Jubatus can classify 5,000+ tweets per second on a few servers
      • http://blog.jp.twitter.com/2012/09/twitter.html
      • http://www.nttdata.com/jp/ja/news/release/2012/092700.html
      [Figure caption: Our Twitter analysis modules]
  12. 12. Jubatus as a big data analytics platform for industry
      • Gov. fund for IT fusion: big-data new business creation
      • In collaboration with NEC and other research labs
      • Focus on performance improvement for larger M2M data
      [Figure: development plan, scaling up with data size: human-generated, then machine-generated, then severe real-time requirements; data sources: SNS data, Healthcare, Agriculture, Network traffic, Video surveillance]
  13. 13. Active development & growing business/community
      • 10+ active committers, plus pull requests from users
      • Monthly minor update: bug & usability fixes
      • Quarterly major update: new features & interfaces
      • PoC at user companies: real-time ad optimization, server monitoring, smart-house / smart-grid, intelligent camera
      • Deployment & experiment: Twitter analysis, social media monitoring, malicious attack detection, malware detection
      • 2 hands-on sessions: 90+ attendees in total
      • 1 meetup: 90+ attendees
  14. 14. l  What is Jubatus? : Motivation and applications l  How Jubatus works? : The architecture l  How to use it : Quick-start steps l  Summary and future Agenda
  15. 15. Online machine learning in Jubatus
      • Batch learning
        • Scans all data before building a model
        • Data must be stored in memory or storage
      • Online learning
        • The model is updated by each data sample
        • Sometimes backed by theory that the online model converges to the batch model
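
To make the batch/online contrast on slide 15 concrete, here is a minimal sketch of online learning with a binary perceptron. It is not Jubatus code; the function names and the toy feature names are hypothetical, and the tie-breaking rule (score 0 counts as +1) is an assumption made for the example.

```python
from typing import Dict, Iterable, Tuple

Features = Dict[str, float]


def predict(weights: Dict[str, float], x: Features) -> int:
    """Return +1 or -1 from the sign of the linear score (0 counts as +1)."""
    score = sum(weights.get(name, 0.0) * value for name, value in x.items())
    return 1 if score >= 0.0 else -1


def update(weights: Dict[str, float], x: Features, y: int) -> None:
    """Perceptron rule: change the model only when the sample is misclassified."""
    if predict(weights, x) != y:
        for name, value in x.items():
            weights[name] = weights.get(name, 0.0) + y * value


def train_online(stream: Iterable[Tuple[Features, int]]) -> Dict[str, float]:
    """Consume samples one at a time; no sample is kept after its update."""
    weights: Dict[str, float] = {}
    for x, y in stream:
        update(weights, x, y)
    return weights


# Hypothetical toy stream: label +1 for gadget tweets, -1 for weather tweets.
stream = [
    ({"apple": 1.0, "ipad": 1.0}, 1),
    ({"weather": 1.0, "rain": 1.0}, -1),
    ({"ipad": 1.0}, 1),
]
model = train_online(stream)
```

Unlike a batch learner, `train_online` never holds more than one sample at a time, which is the property the slide contrasts with "data must be stored in memory or storage".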
  16. 16. What Jubatus currently supports
      • Classification (multi-class): Perceptron / PA / CW / AROW
      • Regression: PA-based regression
      • Nearest neighbor: LSH / MinHash / Euclid LSH
      • Recommendation: Nearest neighbor based
      • Anomaly detection: LOF (Local Outlier Factor)
      • Graph analysis: Shortest path / Centrality (PageRank)
      • Some simple statistics
  17. 17. Online learning or distributed learning: No unified solution has been available
      • Jubatus combines them into a unified computation framework
      [Figure: axes “Real-time/Online vs. Batch” and “Small scale, stand-alone vs. Large scale & distributed/parallel computing”; plotted: WEKA (1993-), SPSS (1988-), Mahout (2006-), online ML algorithms PA [2003] and CW [2008], Jubatus (2011-)]
  18. 18. Q: How to make online algorithms distributed? A: It is not trivial, and some tricks are needed
      • Online learning requires frequent model updates
      • Naïve data distribution leads to too many synchronization operations
      • This causes problems in both network communication performance and accuracy
      [Figure: timeline of Servers A, B, and C, each interleaving local model updates with frequent Sync operations; caption: Data synchronization?]
  19. 19. Our approach: Loose model sharing
      • Jubatus only shares the local models in a loose manner
        • Model size << Data size
        • Jubatus DOES NOT share datasets
        • Unique approach compared to existing frameworks
      • Local models can be different on the servers
        • Different models will be gradually merged
      • We define three fundamental operations
        • UPDATE / MIX / ANALYZE
      • Algorithms can be implemented independently of
        • Distribution logic
        • Data sharing
        • Failover
  20. 20. UPDATE, MIX, and ANALYZE
      1. UPDATE (locally): Receive a sample, learn, and update the local model
      2. MIX (globally): Exchange and merge the local models between servers
      3. ANALYZE (locally): Receive a sample, apply the local model, return the result
      [Figure: UPDATE = distributed training on local models; MIX = share only models, producing a unified model on every server; ANALYZE = distributed prediction]
  21. 21. UPDATE
      • Each server starts from an initial model
      • Each data sample is sent to one (or two) servers
      • Local models are updated based on the sample
      • Data samples are NEVER shared
      [Figure: samples are distributed randomly or consistently; each server turns its initial model into local model 1 or local model 2]
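
Slide 21 says samples are distributed "randomly or consistently" and sent to "one (or two) servers". The deck does not say how consistent routing is done; the sketch below assumes a consistent-hashing ring, which is one common way to get that behavior. The class name `HashRing`, the `vnodes` parameter, and the server names are all hypothetical.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Assumed routing scheme: each server owns several points on a ring."""

    def __init__(self, servers: List[str], vnodes: int = 64) -> None:
        points: List[Tuple[int, str]] = []
        for server in servers:
            for i in range(vnodes):
                points.append((_hash(f"{server}#{i}"), server))
        points.sort()
        self._keys = [p[0] for p in points]
        self._owners = [p[1] for p in points]

    def route(self, sample_id: str, replicas: int = 1) -> List[str]:
        """Return up to `replicas` distinct servers, walking clockwise from the key."""
        start = bisect.bisect_right(self._keys, _hash(sample_id))
        chosen: List[str] = []
        for offset in range(len(self._owners)):
            if len(chosen) >= replicas:
                break
            owner = self._owners[(start + offset) % len(self._owners)]
            if owner not in chosen:
                chosen.append(owner)
        return chosen


ring = HashRing(["server-a", "server-b", "server-c"])
targets = ring.route("user-42", replicas=2)  # same id always maps to the same pair
```

With this kind of routing, the same sample id always reaches the same servers, so per-key state (for example a user's row in a recommender) stays in one place even though raw samples are never shared between servers.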
  22. 22. MIX
      • Each server sends its model diff
      • Model diffs are merged and distributed
      • Only model diffs are transmitted
      [Figure, per server i = 1, 2:
        Local model i - Initial model = Model diff i
        Model diff 1 + Model diff 2 = Merged diff
        Initial model + Merged diff = Mixed model]
  23. 23. UPDATE (iteration)
      • Locally updated models after MIX are discarded
      • Each server starts updating from the mixed model
      • The mixed model improves gradually thanks to all of the servers
      [Figure: samples are distributed randomly or consistently; each server turns the mixed model into local model 1 or local model 2]
  24. 24. ANALYZE
      • For prediction, each sample randomly goes to a server
      • The server applies the current mixed model to the sample
      • The prediction is returned to the client
      • You add servers for higher throughput
      [Figure: samples are distributed randomly; each server applies the mixed model and returns a prediction]
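
The UPDATE / MIX / ANALYZE cycle from slides 20 to 24 can be sketched in a few lines. This is an illustration only, not the Jubatus implementation: the `Server` class and `mix` function are hypothetical names, the learner is a plain perceptron, and the merge step is assumed to average the model diffs.

```python
from typing import Dict, List

Vector = Dict[str, float]


class Server:
    def __init__(self, base: Vector) -> None:
        self.base = dict(base)   # model agreed on at the last MIX
        self.local = dict(base)  # model changed by local UPDATE calls

    def analyze(self, x: Vector) -> int:
        """ANALYZE: apply the local model to one sample, return +1 or -1."""
        score = sum(self.local.get(n, 0.0) * v for n, v in x.items())
        return 1 if score >= 0.0 else -1

    def update(self, x: Vector, y: int) -> None:
        """UPDATE: perceptron step on the local model only; x is not stored."""
        if self.analyze(x) != y:
            for name, value in x.items():
                self.local[name] = self.local.get(name, 0.0) + y * value

    def diff(self) -> Vector:
        """Model diff = local model minus the model from the last MIX."""
        names = set(self.local) | set(self.base)
        return {n: self.local.get(n, 0.0) - self.base.get(n, 0.0) for n in names}


def mix(servers: List[Server]) -> None:
    """MIX: average the diffs (an assumption), then set every server to base + merged diff."""
    diffs = [s.diff() for s in servers]
    names = set().union(*diffs)
    merged = {n: sum(d.get(n, 0.0) for d in diffs) / len(servers) for n in names}
    for s in servers:
        mixed = {n: s.base.get(n, 0.0) + merged.get(n, 0.0)
                 for n in set(s.base) | set(merged)}
        s.base, s.local = dict(mixed), dict(mixed)


a, b = Server({}), Server({})
a.update({"rain": 1.0}, -1)   # only server a sees this sample
b.update({"snow": 1.0}, -1)   # only server b sees this sample
before = b.analyze({"rain": 1.0})
mix([a, b])                   # only model diffs cross the network
after = b.analyze({"rain": 1.0})
```

After `mix`, server b classifies a "rain" sample correctly even though it never received one, which is the point of sharing models instead of data.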
  25. 25. Model inside Jubatus (1): classification
      • Each server updates local linear models $w_1, w_2, \ldots, w_n$
      • MIX computes the averaged coefficients: $w = \frac{1}{n}\left(w_1 + \cdots + w_n\right)$
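
The averaged coefficients on slide 25 agree with the diff-based MIX on slide 22 if the merged diff is taken to be the mean of the per-server diffs. That reading is an assumption (the diagram only shows diffs being combined), and the symbols $w_0$ (the shared initial or previously mixed model), $d_i$, and $\bar{d}$ are introduced here for illustration:

```latex
\begin{aligned}
d_i &= w_i - w_0 && \text{(model diff of server } i\text{)} \\
\bar{d} &= \frac{1}{n}\sum_{i=1}^{n} d_i && \text{(merged diff, assumed to be the mean)} \\
w &= w_0 + \bar{d}
   = w_0 + \frac{1}{n}\sum_{i=1}^{n} w_i - w_0
   = \frac{1}{n}\left(w_1 + \cdots + w_n\right)
\end{aligned}
```

So under this assumption, transmitting only diffs gives the same mixed model as averaging the full local models.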
  26. 26. Model inside Jubatus (2): nearest neighbor
      • Samples are approximated by LSH, MinHash, etc.
      • Only bit-arrays are shared between servers
      [Figure: a table of sample ids 1 to 6 with bit-arrays such as 1: 011010010 and 6: 000010110; after MIX every server holds the same id-to-bit-array table]
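
As a rough illustration of how samples can be reduced to bit-arrays, here is a sketch of random-hyperplane LSH with Hamming-distance lookup. It is a generic textbook variant, not necessarily the scheme Jubatus uses; the function names, the 16-bit signature length, and the fixed seed are assumptions for the example.

```python
import random
from typing import Dict, List


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw one random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vector: List[float], planes: List[List[float]]) -> int:
    """Pack one sign bit per hyperplane into an integer bit-array."""
    sig = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vector))
        sig = (sig << 1) | (1 if dot >= 0.0 else 0)
    return sig


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def nearest(query: int, table: Dict[int, int]) -> int:
    """Return the id whose stored signature is closest in Hamming distance."""
    return min(table, key=lambda row_id: hamming(query, table[row_id]))


planes = make_hyperplanes(dim=3, bits=16)
# Only these integers (not the raw vectors) would need to be shared.
table = {
    1: signature([1.0, 0.0, 0.0], planes),
    2: signature([0.0, 1.0, 0.0], planes),
    3: signature([-1.0, 0.0, 0.0], planes),
}
query = signature([2.0, 0.0, 0.0], planes)  # same direction as id 1
best = nearest(query, table)
```

Because each bit depends only on the sign of a dot product, vectors pointing in the same direction get identical signatures, and the table a server shares is far smaller than the raw samples.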
  27. 27. Jubatus architecture: Standard client-server system
      • ZooKeeper and RPC handle connections between clients and servers
      • We have clients for C++/Java/Ruby/Python (all under MIT license)
      [Figure: clients, optionally co-located with a JubaKeeper, talk over RPC to JubaKeeper, which forwards to multi-threaded JubaServer processes on Linux servers; each JubaServer contains fv_converter, the algorithm, and the model]
  28. 28. Best QPS performances (evaluated on old ver.)
      • Experimental settings
        • Standalone vs. multiple servers
        • Client processes: 1 to 4
        • Server processes: 1 to 6
        • Server threads: 1 to 6
      • Results
        • Classification scales linearly with #server-processes & threads
        • Recommendation performance highly depends on collected #samples

      Task            Operation  Max QPS
      Classification  UPDATE     3,000
      Classification  ANALYZE    6,500
      Recommendation  UPDATE     400
      Recommendation  ANALYZE    2,500
  29. 29. l  What is Jubatus? : Motivation and applications l  How Jubatus works? : The architecture l  How to use it : Quick-start steps l  Summary and future Agenda
  30. 30. Step (0): Visit Jubatus website (http://jubat.us/)
      • Overview
      • Installation
      • Tutorials
      • API documents
      • Reference
  31. 31. Step (1): VM images and tutorial (http://download.jubat.us/event/handson_01/en/)
      • Hands-on tutorial: Intro to ML, how to start, examples, configurations
      • VM images running on any OS: VirtualBox / VMware
  32. 32. Step (2): Download from GitHub
      • https://github.com/jubatus/jubatus/
  33. 33. Step (3): Play with Jubatus examples
      • https://github.com/jubatus/jubatus-example/
  34. 34. Step (4): Build your own apps
      • Examples
        • Tweet categorization
        • User segmentation
        • Power consumption estimation
        • Stock price prediction
        • Real-time recommendation
        • Advertisement optimization
        • Online fraud prevention
        • Early-stage defect detection
        • Proactive network monitoring
        • Online malware detection
  35. 35. l  What is Jubatus? : Motivation and applications l  How Jubatus works? : The architecture l  How to use it : Quick-start steps l  Summary and future Agenda
  36. 36. Summary
      • Jubatus is an OSS for online distributed machine learning
      • UPDATE-MIX-ANALYZE for abstracting ML algorithms
      • Most of the tasks
      • Future plans
        • Clustering
        • P2P-like MIX method
        • Time-series preprocessing in fv_converter
        • Unlearning
      [Figure labels: 1. Bigger data; 2. More in real-time; 3. Machine learning]
  37. 37. Current: As a meta-data predictor
      • Apply Jubatus models before storing
      • Adaptive and memory-efficient

      User  Time  Bet  Act.  Gain  Class  Est.  Cluster  Outlier
      A33   5:34  40   ↑C    +20   Good   +18   C1       0.07
      A33   5:34  10   ←B    +80   Good   -10   C3       0.92
      A33   5:35  20   ↑B    -16   Bad    -15   C1       0.11
      …     …     …    …     …

      [Figure: input data passes through online learning and real-time prediction, which append predicted columns; the enriched data is stored in RDB, NoSQL, HDFS, or search systems for aggregation, reporting, and analytics]
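
The "apply models before storing" idea on slide 37 amounts to appending predicted columns to each record as it arrives. The sketch below shows only that data flow; the `enrich` function is a hypothetical name, the rule-based `class` predictor is a stand-in for a real online model, and treating `Class` as one of the predicted columns is an assumption about the slide's table.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]
Predictor = Callable[[Record], object]


def enrich(records: Iterable[Record],
           predictors: Dict[str, Predictor]) -> Iterator[Record]:
    """Yield each record with one extra column per predictor, computed on arrival."""
    for record in records:
        enriched = dict(record)  # leave the input record untouched
        for column, predict in predictors.items():
            enriched[column] = predict(record)
        yield enriched


# Input columns taken from the slide's example rows.
rows = [
    {"user": "A33", "time": "5:34", "bet": 40, "gain": 20},
    {"user": "A33", "time": "5:34", "bet": 10, "gain": 80},
    {"user": "A33", "time": "5:35", "bet": 20, "gain": -16},
]

# Stand-in predictor; a real deployment would call a trained model here.
predictors = {"class": lambda r: "Good" if r["gain"] > 0 else "Bad"}

enriched_rows = list(enrich(rows, predictors))  # ready to write to a store
```

Because `enrich` is a generator, records can be written to the downstream store one at a time without buffering the whole stream.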
  38. 38. Future: For edge-heavy data
      • Emerging apps that can’t collect data into one place
        • Due to data intensity: video streams from millions of devices
        • Due to latency: real-time decision within <100 msec
        • Due to privacy: sensitive raw data cannot be shared
      [Figure labels: Smartphones; Intelligent cars; Intelligent cameras; Healthcare monitoring; Bio-medical]
  39. 39. One more thing… How can we help you?
  40. 40. We opened a subsidiary in San Jose
      • Preferred Infrastructure America, Inc.
      • Established in March, office opened in April
      • Next to the SJC airport
      • Starting to do business in the U.S.
  41. 41. Thank you
      • Follow us
        • github.com/jubatus
        • jubatus@googlegroups.com
        • Twitter: @JubatusOfficial
      • We welcome your contribution and collaboration