Machine Learning with
   Apache Hama
    Tommaso Teofili
    tommaso [at] apache [dot] org




                                    1
About me

    ASF member having fun with:

    Lucene / Solr

    Hama

    UIMA

    Stanbol

    … some others

    SW engineer @ Adobe R&D




                                  2
Agenda

    Apache Hama and BSP

    Why machine learning on BSP

    Some examples

    Benchmarks




                                  3
Apache Hama

    Bulk Synchronous Parallel computing
    framework on top of HDFS for massive
    scientific computations

    TLP since May 2012

    0.6.0 release out soon

    Growing community




                                       4
BSP supersteps

    A BSP algorithm is composed of a sequence of “supersteps”




                                                       5
BSP supersteps

    Each task

    Superstep 1
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize

    Superstep 2
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize

    …

    …

    …

    Superstep N
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize
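
    A minimal sketch of this pattern with the Hama BSP API (N and
    computeSomething() are placeholders, not Hama code):

        // inside bsp(): repeat compute / communicate / synchronize
        for (int superstep = 0; superstep < N; superstep++) {
          double partial = computeSomething();             // local computation
          for (String other : peer.getAllPeerNames()) {
            peer.send(other, new DoubleWritable(partial)); // communicate
          }
          peer.sync();                                     // synchronization barrier
        }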




                                        6
Why BSP

    Simple programming model

    Superstep semantics are easy to understand

    Preserves data locality

    Improves performance

    Well suited for iterative algorithms




                                           7
Apache Hama architecture
  
      BSP Program execution flow




                                   8
Apache Hama architecture




                           9
Apache Hama

    Features
    
        BSP API
    
        M/R like I/O API
    
        Graph API
    
        Job management / monitoring
    
        Checkpoint recovery
    
        Local & (Pseudo) Distributed run modes
    
        Pluggable message transfer architecture
    
        YARN supported
    
        Running in Apache Whirr



                                              10
Apache Hama BSP API

    public abstract class BSP<K1, V1, K2, V2,
    M extends Writable> …
    
        K1, V1 are the key/value types for input
    
        K2, V2 are the key/value types for output
    
        M is the type of the messages used for task
        communication
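
    For example, a concrete subclass could be parameterized like this
    (the class name and types are illustrative assumptions):

        public class UserSimilarityBSP
            extends BSP<LongWritable, Text, Text, DoubleWritable, DoubleWritable> {
          // K1, V1 = LongWritable, Text   (input:  line offset / line text)
          // K2, V2 = Text, DoubleWritable (output: label / score)
          // M      = DoubleWritable       (messages exchanged between tasks)
          // bsp(...) still has to be implemented, see the next slide
        }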




                                              11
Apache Hama BSP API

    public void bsp(BSPPeer<K1, V1, K2, V2,
    M> peer) throws ..

    public void setup(BSPPeer<K1, V1, K2,
    V2, M> peer) throws ..

    public void cleanup(BSPPeer<K1, V1, K2,
    V2, M> peer) throws ..
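
    A minimal, hedged sketch of how the three methods fit together; the
    counting logic is an illustrative assumption, not Hama example code:

        public class CountingBSP
            extends BSP<LongWritable, Text, Text, DoubleWritable, DoubleWritable> {

          private String master;

          @Override
          public void setup(BSPPeer<LongWritable, Text, Text, DoubleWritable, DoubleWritable> peer) {
            master = peer.getPeerName(0);               // elect one task as "master"
          }

          @Override
          public void bsp(BSPPeer<LongWritable, Text, Text, DoubleWritable, DoubleWritable> peer)
              throws IOException, SyncException, InterruptedException {
            peer.send(master, new DoubleWritable(1d));  // one superstep: communicate
            peer.sync();                                // barrier
            if (peer.getPeerName().equals(master)) {
              double sum = 0;
              DoubleWritable msg;
              while ((msg = peer.getCurrentMessage()) != null) {
                sum += msg.get();                       // aggregate received messages
              }
              peer.write(new Text("tasks"), new DoubleWritable(sum));
            }
          }

          @Override
          public void cleanup(BSPPeer<LongWritable, Text, Text, DoubleWritable, DoubleWritable> peer) {
            // release per-task resources if needed
          }
        }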



                                       12
Machine learning on BSP

    Lots (most?) of ML algorithms are
    inherently iterative

    The Hama ML module currently includes
    
        Collaborative filtering
    
        Clustering
    
        Gradient descent




                                        13
Benchmarking architecture



    [architecture diagram: a rack of nodes running Hama and Mahout on top
    of HDFS, with Solr / Lucene and a DBMS as additional back ends]

                                       14
Collaborative filtering

    Given user preferences on movies

    We want to find users “near” to some
    specific user

    So that the user can “follow” them

    And/or see what they like (which he/she might
    like too)




                                             15
Collaborative filtering BSP

    Given a specific user

    Iteratively (for each task)

    Superstep 1*i
     
         Read a new user preference row
     
         Find how near that user is to the current user
         
             That is, finding how similar their preferences are
              
                Since preferences are given as vectors, we can use
                distance measures such as Euclidean or cosine
                distance (see the sketch after this list)
     
         Broadcast the measure output to other peers

    Superstep 2*i
     
         Aggregate measure outputs
     
         Update most relevant users

         
             Still to be committed (HAMA-612)
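
    A hedged sketch of what one pair of supersteps could look like
    (helpers like readNextUserPreferences() and the k-nearest bookkeeping
    are assumptions, since HAMA-612 is not committed yet):

        // superstep 2i-1: distance of one candidate user from the current user
        double[] candidate = readNextUserPreferences(peer);  // assumed helper: one preference row
        double dist = 0;
        for (int j = 0; j < candidate.length; j++) {
          double d = candidate[j] - currentUser[j];
          dist += d * d;                                      // squared Euclidean distance
        }
        for (String other : peer.getAllPeerNames()) {
          peer.send(other, new DoubleWritable(Math.sqrt(dist)));
        }
        peer.sync();

        // superstep 2i: aggregate the received distances, keep the k nearest users
        DoubleWritable msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          nearest.offer(msg.get());                           // assumed bounded structure of size k
        }
        peer.sync();

    A real implementation would send the user id along with the distance
    (e.g. in a composite Writable) so the nearest users can be identified.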
                                                               16
Collaborative filtering BSP

    Given user ratings about movies

    "john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8

    "paula" -> 7, 3, 8, 2, 8.5, 0, 0

    "jim” -> 4, 5, 0, 5, 8, 0, 1.5

    "tom" -> 9, 4, 9, 1, 5, 0, 8

    "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0


    We ask for the 2 nearest users to “paula” and
    we get “timothy” and “tom”
        
            user recommendation

    We can extract movies highly rated by
    “timothy” and “tom” that “paula” didn’t see
        
            Item recommendation
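
    For instance, assuming Euclidean distance over the rating vectors above:

        d(paula, timothy) = sqrt(0² + 0² + 2.5² + 2² + 1² + 6.5² + 0²) ≈ 7.3
        d(paula, tom)     = sqrt(2² + 1² + 1² + 1² + 3.5² + 0² + 8²)   ≈ 9.1
        d(paula, jim)     ≈ 9.4,   d(paula, john) ≈ 18.7

    so “timothy” and “tom” are indeed the two nearest users to “paula”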
                                             17
Benchmarks

    Fairly simple algorithm

    Highly iterative

    Compared to Apache Mahout

    Performs better than ALS-WR

    Performs similarly to RecommenderJob and
    ItemSimilarityJob




                                         18
K-Means clustering

    We have a bunch of data (e.g. documents)

    We want to group those docs into k
    homogeneous clusters

    Iteratively, for each cluster
    
        Calculate the new cluster center
    
        Assign each doc to the cluster with the nearest center




                                                19
K-Means clustering




                     20
K-Means clustering BSP

    Iteratively

    Superstep 1*i (assignment phase)
     
         Read vector splits
     
         Sum up temporary centers with the assigned vectors
     
         Broadcast the sum and the count of ingested vectors

    Superstep 2*i (update phase)
     
         Calculate the total sum over all received
         messages and average it
     
         Replace old centers with the new centers and check
         for convergence (see the sketch below)
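
    A hedged sketch of one iteration (I/O, message encoding and the
    convergence test are simplified; the helpers are assumptions):

        // assignment phase: accumulate every input vector into its nearest center
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        double[] v;
        while ((v = readNextVector(peer)) != null) {          // assumed helper over the input split
          int c = nearestCenterIndex(v, centers);             // assumed: argmin of Euclidean distance
          for (int j = 0; j < dim; j++) sums[c][j] += v[j];
          counts[c]++;
        }
        broadcastSumsAndCounts(peer, sums, counts);           // assumed: peer.send(...) to all peers
        peer.sync();

        // update phase: average the received sums and replace the old centers
        double[][] newCenters = averageReceivedSums(peer);    // assumed: reads peer messages
        boolean converged = maxCenterShift(centers, newCenters) < epsilon;
        centers = newCenters;
        peer.sync();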
                                            21
Benchmarks

    One rack cluster (16 nodes, 256 cores)

    10G network

    On average faster than Mahout’s implementation




                                        22
Gradient descent

    Optimization algorithm

    Find a (local) minimum of some function

    Used for
    
        solving linear systems
    
        solving non-linear systems
    
        in machine learning tasks
        
            linear regression
        
            logistic regression
        
            neural networks backpropagation
        
            …




                                              23
Gradient descent

    Minimize a given (cost) function

    Give the function a starting point (set of parameters)

    Iteratively change parameters in order to minimize the
    function

    Stop at the (local)
    minimum





    There’s some math but, intuitively:
     
       evaluate the derivatives at the current point in order to
       choose where to “go” next (see the update rule below)
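
    In formulas, each iteration updates every parameter θj with the
    standard rule (α is the learning rate, J the cost function):

        θj := θj − α · ∂J(θ) / ∂θj        (all parameters updated simultaneously)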
                                                       24
Gradient descent BSP

    Iteratively
    
        Superstep 1*i
        
            each task calculates and broadcasts portions of the
            cost function with the current parameters
    
        Superstep 2*i
        
            aggregate and update cost function
        
            check the aggregated cost and the iteration count
            
                cost should always decrease
    
        Superstep 3*i
        
            each task calculates and broadcasts portions of
            (partial) derivatives
    
        Superstep 4*i
        
            aggregate and update parameters
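
    A hedged sketch of supersteps 1*i and 2*i for a squared-error cost
    (readNextExample(), predict() and label() are assumed helpers):

        // superstep 1*i: each task evaluates its portion of the cost on its split
        double localCost = 0;
        double[] example;
        while ((example = readNextExample(peer)) != null) {
          double error = predict(theta, example) - label(example);
          localCost += error * error;
        }
        for (String other : peer.getAllPeerNames()) {
          peer.send(other, new DoubleWritable(localCost));
        }
        peer.sync();

        // superstep 2*i: aggregate the cost and check that it keeps decreasing
        double cost = 0;
        DoubleWritable msg;
        while ((msg = peer.getCurrentMessage()) != null) {
          cost += msg.get();
        }
        // if the cost grew w.r.t. the previous iteration, the learning rate is likely too large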

                                                       25
Gradient descent BSP

    Simplistic example
    
        Linear regression
    
        Given a real estate market dataset
    
        Estimate new house prices given known
        houses’ sizes, geographic regions and prices
    
        Expected output: actual parameters for the
        (linear) prediction function
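
    For a single region this means fitting a straight line, e.g.

        price ≈ θ0 + θ1 · size

    where θ0 and θ1 are the parameters gradient descent has to find.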




                                               26
Gradient descent BSP

    Generate a different model for each region

    House item vectors
    
        price -> size
    
        150k -> 80

    2-dimensional space

    Dataset of ~1.3M vectors




                                         27
Gradient descent BSP

    Dataset and model fit




                            28
Gradient descent BSP

    Cost checking




                       29
Gradient descent BSP

    Classification

    Logistic regression with gradient descent

    Real estate market dataset

    We want to find which estate listings belong to agencies
     
         To avoid buying from them


    Same algorithm

    With different cost function and features

    Existing items are tagged (or not) as “belonging to an agency”

    Create vectors from items’ text

    Sample vector
     
         1 -> 1 3 0 0 5 3 4 1
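
    The logistic prediction function is the sigmoid of the linear
    combination of the features; the 0.5 threshold below is the usual
    convention, assumed here:

        h(x) = 1 / (1 + e^(−θᵀ·x))        predict “belonging to agency” if h(x) ≥ 0.5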




                                                                30
Gradient descent BSP

    Classification




                       31
Benchmarks

    Not directly comparable to Mahout’s
    regression algorithms

    Both SGD and CGD are inherently better than
    plain GD

    But Hama GD had on average the same
    performance as Mahout’s SGD / CGD

    Next step is implementing SGD / CGD on top of
    Hama




                                            32
Wrap up

    Even though

    the ML module is still “young” / a work in progress

    and tools like Apache Mahout have better
    “coverage”


    Apache Hama can be particularly useful in
    certain “highly iterative” use cases

    Interesting benchmarks



                                               33
Thanks!




          34
