Distributed Online Machine Learning
        Framework for Big Data




                 Shohei Hido
     Preferred Infrastructure, Inc. Japan.
        XLDB Asia, June 22nd, 2012
Overview:
Big Data analytics will go real-time and deeper

        1. Bigger data

     2. More in real-time

      3. Deep analysis

                                No storage
                                No data sharing
                                Only mix model
Jubatus: OSS platform for Big Data analytics




l    Joint development with NTT laboratory in Japan
      l    Project started April 2011
l    Released as an open source software
      l    Just released 0.3.0
l    You can download it from
l    http://github.com/jubatus/
l    Waiting for your contribution and collaboration

                                         3
Agenda

l    What’s missing for Big Data analytics


l    Comparison with existing software


l    Inside Jubatus: Update, Analyze, and Mix


l    Jubatus demo


l    Summary




                                    4
Increasing demand in Big Data applications:
    Real-time deeper analysis
    l  Current focus: aggregation and rule processing on bigger data
         l  CEP (Complex Event Processing) for real-time processing

         l  Hadoop/MapReduce for distributed computation

    l  Future: deeper analysis for rapid decisions and actions
         l  Ex. 1: Defect detection on NY power grid [Rubin+,TPAMI2012]

         l  Ex. 2: Proactive algorithmic trading [ComputerWorldUK, 2011]


Data size	

                                                               What will
                                        Hadoop                  come?
                  CEP
                                                                        Deep
    Reference:http://web.mit.edu/rudin/www/TPAMIPreprint.pdf

                                             5	
                        analysis	
        
    
http://www.computerworlduk.com/news/networking/3302464/
Key technology: Machine learning

l    Examples need rapid decisions under uncertainty
      l    Anomaly detection from M2M sensor data
      l    Energy demand forecast / Smart grid optimization
      l    Security monitoring on raw Internet traffic
l    What is missing for fast & deep analytics on Big Data?
      l    Online/real-time machine learning platform
      l    + Scale-out distributed machine learning platform



            1. Bigger data

      2. More in real-time

       3. Deep analysis
Online machine learning in Jubatus
l    Batch learning
       l  Scan all data before building a model
       l  Data must be stored in memory or storage


                                          Model


l    Online learning
       l  Model will be updated by each data sample
       l  Sometimes with theory that the online model
           converges to the batch model


                                              Model


                                7
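To illustrate the difference between batch and online learning, here is a minimal sketch (not Jubatus code) in which the "model" is just a running mean. The class name `OnlineMean`, the function `batch_mean`, and the sample values are assumptions made for this example. The online version updates the model one sample at a time without storing the data, and in this simple case it ends up identical to the batch result.

```python
class OnlineMean:
    """Keeps a running mean without storing past samples."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Online learning: move the current estimate toward the new sample.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def batch_mean(samples):
    # Batch learning: every sample must be available at once.
    return sum(samples) / len(samples)


samples = [2.0, 4.0, 6.0, 8.0]

model = OnlineMean()
for x in samples:
    model.update(x)  # each sample could be discarded right after this call

print(model.mean, batch_mean(samples))
```

Real online learners rarely match the batch model exactly; the point here is only that the update needs the current model and one sample, nothing else.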
Jubatus focuses on latest online algorithms

l    Advantage: fast and not memory-intensive
       l  Low latency & high throughput
       l  No need for storing large datasets


l    Eg. Linear classification algorithms
      l    Perceptron (1958)
      l    Passive Aggressive (PA) (2003)             Very recent
                                                        progress
      l    Confidence Weighted Learning (CW) (2008)
      l    AROW (2009)
      l    Normal HERD (NHERD) (2010)




                                    8
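As a rough illustration of what one of these online updates looks like, the sketch below implements the basic binary Passive Aggressive step: compute the hinge loss on the incoming sample and, if it is non-zero, take the smallest step that fixes the margin. This is a textbook-style sketch with labels in {-1, +1}; the function names and the toy stream are assumptions, and it is not meant to mirror the multi-class implementation in Jubatus.

```python
def pa_update(weights, features, label):
    """One Passive Aggressive step for a binary label in {-1, +1}."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)  # hinge loss on this sample
    norm_sq = sum(x * x for x in features)
    if loss == 0.0 or norm_sq == 0.0:
        return weights  # passive: the margin is already large enough
    tau = loss / norm_sq  # aggressive: smallest step that reaches margin 1
    return [w + tau * label * x for w, x in zip(weights, features)]


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else -1


# Hypothetical stream of (features, label) pairs, processed one at a time.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]

weights = [0.0, 0.0]
for features, label in stream:
    weights = pa_update(weights, features, label)

print(weights)
```

Each step touches only the current weight vector and one sample, which is why such algorithms are fast and need no stored dataset.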
Online learning or distributed learning:
   No unified solution has been available
   l    Jubatus combines them into a unified computation framework
                                  Real-time/
                                    Online
                Online ML alg.:                Jubatus
                  PA [2003]                    2011-
                  CW[2008]

                                                                  Large scale
Small scale                                                             &
Stand-alone                                                       Distributed/
                                                                    Parallel
                WEKA                                     Mahout    computing
                   1993-                                  2006-
                SPSS
                   1988-
                                    Batch
                                      9
What Jubatus currently supports

l    Classification (multi-class)
       l  Perceptron / PA / CW / AROW

l    Regression
       l  PA-based regression

l    Nearest neighbor
       l  LSH / MinHash / Euclid LSH

l    Recommendation
       l  Based on nearest neighbor

l    Anomaly detection*
       l  LOF based on nearest neighbor

l  Graph analysis*
     l  Shortest path / Centrality (PageRank)

l  Some simple statistics
                                    10
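To give a feel for the MinHash technique named in the nearest-neighbor entry above, here is a small sketch that estimates the Jaccard similarity of two sets from short signatures. The function names, the use of MD5 as a convenient deterministic hash, and the example sets are all assumptions for illustration; this is not how Jubatus implements it.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return the MinHash signature of a non-empty set of strings."""
    signature = []
    for seed in range(num_hashes):
        # Prefixing the seed makes each round act as a different hash function.
        smallest = min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    # The fraction of positions with equal minima estimates |A & B| / |A | B|.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = {"big", "data", "online", "learning"}
doc_b = {"online", "learning", "stream", "mining"}
doc_c = {"power", "grid"}

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

print(estimate_jaccard(sig_a, sig_b))  # rough estimate of the true value 2/6
print(estimate_jaccard(sig_a, sig_c))
```

Because only the fixed-length signatures are compared, the cost of a similarity query does not depend on the size of the original sets.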
Hadoop and Mahout: Not good for online learning

l    Hadoop
       l  Advantage

              l    Many extensions for a variety of applications
              l    Good for distributed data storing and aggregation
       l    Disadvantage
              l    No direct support for machine learning and online processing
l    Mahout
       l  Advantage

              l    Popular machine learning algorithms are implemented
       l    Disadvantage
              l    Some implementation are less mature
              l    Still not capable of online machine learning

                                              12
Jubatus vs. Hadoop, RDB-based, and Storm:
    Advantage in online AND distributed ML
    l    Only Jubatus satisfies both of them at the same time

                            Jubatus       Hadoop           RDB        Storm
                Storing          ✓               ✓✓                     ✓
                                                             ✓
                Big Data    External DB          HDFS                 Ext. DB
                 Batch                             ✓        ✓✓
                                ✓                                       ✕
                learning                         Mahout   SPSS, etc
                 Stream
                                ✓                  ✕         ✕         ✓✓
               processing
             Distributed                           ✓
                               ✓✓                            ✕          ✕
              learning                           Mahout
   High
         Online
importance	
                   ✓✓                  ✕         ✕          ✕
                learning
                                          13
How to make online algorithms distributed?
=> Not trivial!

[Figure: timelines of batch learning and online learning. Batch learning: a few "Learn the update" steps, each followed by a model update; easy to parallelize. Online learning: many "Learn" steps, each followed by a model update; hard to parallelize due to frequent updates.]

• Online learning requires frequent model updates
• A naïve distributed architecture leads to too many
  synchronization operations
• This causes performance problems in terms of network
  communication and accuracy
Solution: Loose model sharing

l  Jubatus only shares the local models in a loose manner
     l  Model size << Data size

l  Jubatus DOES NOT share datasets
     l  Unique approach compared to existing framework

l  Local models can be different on the servers
     l  Different models will be gradually merged




                  Model      Model       Model




                  Mixed      Mixed       Mixed
                  model      model       model
Three fundamental operations on Jubatus:
UPDATE, ANALYZE, and MIX
1. UPDATE
    • Receive a sample, learn, and update the local model
2. ANALYZE
    • Receive a sample, apply the local model, and return the result
3. MIX (called automatically in the backend)
    • Exchange and merge the local models between servers

• Cf. Map-Shuffle-Reduce operations on Hadoop
• Algorithms can be implemented independently of
    • Distribution logic
    • Data sharing
    • Failover
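The separation between algorithm code and distribution logic described above can be sketched as follows. This is a toy, single-process illustration under stated assumptions: the class `WordCountClassifier`, its `get_diff`/`put_diff` methods, and the `mix` function are hypothetical names, not the Jubatus API. The learner implements only UPDATE, ANALYZE, and a way to export and import model diffs; the `mix` function plays the role of the framework and knows nothing about the algorithm.

```python
from collections import Counter


class WordCountClassifier:
    """Toy learner: scores a label by how often its words were seen with it."""

    def __init__(self):
        self.base = Counter()  # model as of the last MIX
        self.diff = Counter()  # local changes since the last MIX

    def update(self, words, label):
        # UPDATE: learn from one sample, touching only local state.
        for word in words:
            self.diff[(word, label)] += 1

    def analyze(self, words):
        # ANALYZE: apply the local model to a sample and return a label.
        model = self.base + self.diff
        scores = Counter()
        for (word, label), count in model.items():
            if word in words:
                scores[label] += count
        return scores.most_common(1)[0][0] if scores else None

    def get_diff(self):
        return Counter(self.diff)

    def put_diff(self, merged):
        # Continue from the mixed model and start a fresh local diff.
        self.base = self.base + merged
        self.diff = Counter()


def mix(servers):
    """MIX: framework-side code that only moves and merges model diffs."""
    merged = Counter()
    for server in servers:
        merged += server.get_diff()
    for server in servers:
        server.put_diff(merged)


server_a, server_b = WordCountClassifier(), WordCountClassifier()
server_a.update(["cheap", "pills"], "spam")    # this sample is seen only by A
server_b.update(["meeting", "notes"], "ham")   # this sample is seen only by B

before = (server_a.analyze(["meeting"]), server_b.analyze(["cheap"]))
mix([server_a, server_b])
after = (server_a.analyze(["meeting"]), server_b.analyze(["cheap"]))
print(before, after)
```

After the mix, each server can classify words it never saw in its own samples, even though only model diffs were exchanged.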
UPDATE

   l  Each server starts from an initial model
   l  Each data sample are sent to one (or two) servers
   l  Local models updated based on the sample
   l  Data samples are NEVER shared




Distributed

randomly
                                            Local
or consistently 	
                                           Initial
                                                     model
                                                             model
                                                       1

                                                     Local
                                                     model   Initial
                                                             model
                                                       2
                                    18
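The two ways of distributing samples mentioned above can be sketched as follows, assuming that "consistently" means the same key always reaches the same server(s). The sketch uses a small consistent-hashing ring for that case; the names `HashRing`, `stable_hash`, and `route_randomly`, the server names, and the key are hypothetical, and the slides do not say which scheme Jubatus actually uses.

```python
import bisect
import hashlib
import random


def stable_hash(text):
    """Hash a string to an integer that is stable across runs."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Small consistent-hashing ring with virtual nodes per server."""

    def __init__(self, servers, replicas=16):
        ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in ring]
        self._owners = [server for _, server in ring]

    def lookup(self, key, count=1):
        """Return up to `count` distinct servers responsible for `key`."""
        start = bisect.bisect(self._points, stable_hash(key))
        found = []
        for offset in range(len(self._owners)):
            server = self._owners[(start + offset) % len(self._owners)]
            if server not in found:
                found.append(server)
            if len(found) == count:
                break
        return found


def route_randomly(servers, rng):
    # Any server will do: each one keeps its own local model.
    return rng.choice(servers)


servers = ["server-1", "server-2", "server-3"]
ring = HashRing(servers)

first = ring.lookup("sensor-42", count=2)   # "one (or two) servers"
second = ring.lookup("sensor-42", count=2)  # same key, same servers
random_target = route_randomly(servers, random.Random(0))
print(first, second, random_target)
```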
MIX

l  Each server sends its model diff
l  Model diffs are merged and distributed
l  Only model diffs are transmitted




            Local     Model    Model
Initial                                         Merged Initial     Mixed
model     -	
            model   =	
 diff    diff
                                                  diff +	
                                                         model   =	
                                                                   model
              1          1       1    Merged
                                 +	
 =	
 diff
        Local         Model    Model
Initial                                         Merged Initial     Mixed
model -	
 2
        model       =	
 diff    diff
                                                  diff +	
                                                        model    =	
                                                                   model
                         2       2


                                       19
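The MIX arithmetic on this slide (diff, merge, apply) can be traced with a short sketch. It assumes the model is a plain weight vector and that merging is the element-wise sum shown in the slide figure; the function names and the numbers are made up for illustration. A real system might merge differently, for example by averaging.

```python
def model_diff(local, initial):
    # Model diff i = Local model i - Initial model
    return [l - i for l, i in zip(local, initial)]


def merge_diffs(diffs):
    # Merged diff = Model diff 1 + Model diff 2 (+ ...)
    return [sum(values) for values in zip(*diffs)]


def apply_diff(initial, merged):
    # Mixed model = Initial model + Merged diff
    return [i + m for i, m in zip(initial, merged)]


initial = [0.0, 0.0, 0.0]
local_1 = [1.0, 0.0, -0.5]  # server 1 after some local updates
local_2 = [0.0, 2.0, -0.5]  # server 2 after some local updates

diffs = [model_diff(local_1, initial), model_diff(local_2, initial)]
merged = merge_diffs(diffs)
mixed = apply_diff(initial, merged)
print(mixed)
```

Only `diffs` and `merged` would cross the network; the samples that produced `local_1` and `local_2` never leave their servers.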
UPDATE (iteration)

   l  Locally updated models after MIX are discarded
   l  Each server starts updating from the mixed model
   l  The mixed model improves gradually thanks to all of the servers




Distributed

randomly
                                            Local
or consistently 	
                                             Mixed
                                                     model
                                                               model
                                                       1

                                                     Local
                                                     model     Mixed
                                                               model
                                                       2
                                   20
ANALYZE

   l  For prediction, each sample randomly goes to a server
   l  Server applies the current mixed model to the sample
   l  The prediction will be returned to the client




Distributed

randomly
                                                      Mixed
                                                               model

                                Return prediction
                                                               Mixed
                                                               model
                                Return prediction
                                   21
Why can Jubatus work in real time?

• Focus on online machine learning
    • Make online machine learning algorithms distributed
• Update locally
    • Online training without communication with other servers
• Mix only models globally
    • Small communication cost, low latency, good performance
    • Advantage compared to the costly Shuffle in MapReduce
• Analyze locally
    • Each server has the mixed model
    • Low latency for making predictions
• Everything in-memory
    • Process data on-the-fly
Demo: Twitter analysis using natural language
processing and machine learning
Jubatus classifies each tweet from the Twitter data stream into pre-defined
categories. A single Jubatus server is enough to classify over 5,000 queries
per second (QPS), which is close to the rate of the raw Twitter data stream.
We provide a browser-based GUI.
Experiment: Estimation of power consumption
Jubatus learns the power usage and network data flow patterns of
certain servers. The power consumption of individual servers can then be
estimated in real time by monitoring and analyzing packets, without
having to install power measurement modules on all servers.

[Figure: data center / office setup. Packet data is captured via a TAP; some servers have a power meter and others have none, and consumption for the latter is estimated. Consumption differs for different types of packets. Plot: predicted value (W) vs. actual value (W).]
Summary

l    Jubatus is the first OSS platform for online
      distributed machine learning on Big Data streams.
l    Download it from http://github.com/jubatus/
l    We welcome your contribution and collaboration
               1. Bigger data

            2. More in real-time

              3. Deep analysis
                                      No storage
                                      No data sharing
                                      Only mix model
