Machine Learning with
   Apache Hama
    Tommaso Teofili
    tommaso [at] apache [dot] org




About me

    ASF member having fun with:

    Lucene / Solr

    Hama

    UIMA

    Stanbol

    … some others

    SW engineer @ Adobe R&D




Agenda

    Apache Hama and BSP

    Why machine learning on BSP

    Some examples

    Benchmarks




Apache Hama

    Bulk Synchronous Parallel computing
    framework on top of HDFS for massive
    scientific computations

    TLP since May 2012

    0.6.0 release out soon

    Growing community




BSP supersteps

    A BSP algorithm is composed of a sequence of “supersteps”




BSP supersteps

    Each task

    Superstep 1
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize

    Superstep 2
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize

    …

    …

    …

    Superstep N
     
         Do some computation
     
         Communicate with other tasks
     
         Synchronize
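
The compute / communicate / synchronize cycle above can be sketched in plain Java. This is a single-threaded simulation, not Hama code: the class name `SuperstepSketch`, the initial values and the "send my value to everyone" rule are all assumptions made for illustration. The point it shows is that messages sent during a superstep only become readable after the synchronization point.

```java
import java.util.ArrayList;
import java.util.List;

// Single-threaded simulation of BSP supersteps (illustrative, not Hama code).
final class SuperstepSketch {

    // Runs the given number of supersteps over simulated tasks
    // and returns each task's final local value.
    static long[] run(int tasks, int supersteps) {
        long[] value = new long[tasks];
        for (int t = 0; t < tasks; t++) {
            value[t] = t + 1; // arbitrary initial local state
        }
        List<List<Long>> inbox = emptyInboxes(tasks);
        for (int s = 0; s < supersteps; s++) {
            List<List<Long>> outgoing = emptyInboxes(tasks);
            for (int t = 0; t < tasks; t++) {
                // 1. Do some computation: fold in messages delivered at the last sync.
                for (long message : inbox.get(t)) {
                    value[t] += message;
                }
                // 2. Communicate: send the local value to every other task.
                for (int other = 0; other < tasks; other++) {
                    if (other != t) {
                        outgoing.get(other).add(value[t]);
                    }
                }
            }
            // 3. Synchronize: messages only become visible after the barrier.
            inbox = outgoing;
        }
        return value;
    }

    private static List<List<Long>> emptyInboxes(int tasks) {
        List<List<Long>> boxes = new ArrayList<List<Long>>();
        for (int t = 0; t < tasks; t++) {
            boxes.add(new ArrayList<Long>());
        }
        return boxes;
    }
}
```

With 3 tasks and 1 superstep nothing has been delivered yet, so the values stay at their initial state; the effect of the first round of messages appears only in superstep 2.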




Why BSP

    Simple programming model

    Superstep semantics are easy

    Preserve data locality

    Improve performance

    Well suited for iterative algorithms




Apache Hama architecture
  
      BSP Program execution flow




Apache Hama architecture




Apache Hama

    Features
    
        BSP API
    
        M/R like I/O API
    
        Graph API
    
        Job management / monitoring
    
        Checkpoint recovery
    
        Local & (Pseudo) Distributed run modes
    
        Pluggable message transfer architecture
    
        YARN supported
    
        Running in Apache Whirr



Apache Hama BSP API

    public abstract class BSP<K1, V1, K2, V2,
    M extends Writable> …
    
        K1, V1 are the key and value types for inputs
    
        K2, V2 are the key and value types for outputs
    
        M is the type of messages used for task
        communication




Apache Hama BSP API

    public void bsp(BSPPeer<K1, V1, K2, V2,
    M> peer) throws ..

    public void setup(BSPPeer<K1, V1, K2,
    V2, M> peer) throws ..

    public void cleanup(BSPPeer<K1, V1, K2,
    V2, M> peer) throws ..
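
How the three hooks relate can be sketched without the Hama library. `MiniBsp` and `LifecycleSketch` below are hypothetical stand-ins, not Hama classes, and the call order (setup, then bsp, then cleanup even on failure) is an assumption suggested by the method names rather than something stated on the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a BSP task; NOT Hama's BSP class.
abstract class MiniBsp {
    void setup(List<String> log) { }      // optional hook before the computation
    abstract void bsp(List<String> log);  // the supersteps would live here
    void cleanup(List<String> log) { }    // optional hook after the computation
}

final class LifecycleSketch {

    // Assumed call order for one task: setup, then bsp, then cleanup.
    static List<String> runTask(MiniBsp task) {
        List<String> log = new ArrayList<String>();
        task.setup(log);
        try {
            task.bsp(log);
        } finally {
            task.cleanup(log); // assumption: cleanup runs even if bsp fails
        }
        return log;
    }

    // Small demo task that records which hook was called.
    static List<String> demo() {
        return runTask(new MiniBsp() {
            @Override void setup(List<String> log) { log.add("setup"); }
            @Override void bsp(List<String> log) { log.add("bsp"); }
            @Override void cleanup(List<String> log) { log.add("cleanup"); }
        });
    }
}
```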



Machine learning on BSP

    Lots (most?) of ML algorithms are
    inherently iterative

    Hama ML module currently includes
    
        Collaborative filtering
    
        Clustering
    
        Gradient descent




Benchmarking architecture



    [Diagram: eight cluster nodes; components shown are Hama, Mahout, HDFS, Solr, Lucene and a DBMS]

Collaborative filtering

    Given user preferences on movies

    We want to find users “near” to some
    specific user

    So that that user can “follow” them

    And/or see what they like (which he/she could
    like too)




Collaborative filtering BSP

    Given a specific user

    Iteratively (for each task)

    Superstep 1*i
     
         Read a new user preference row
     
         Find how near that user is to the current user
         
             That is, find how near their preferences are
              
                Since they are given as vectors, we may use vector
                distance measures such as Euclidean distance, cosine
                distance, etc.
     
         Broadcast the measure output to other peers

    Superstep 2*i
     
         Aggregate measure outputs
     
         Update most relevant users

         
             Still to be committed (HAMA-612)
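
The two distance measures named above can be sketched as follows. The class name `DistanceMeasures` is an assumption, and this is plain Java rather than the code tracked in HAMA-612; cosine distance is taken here as one minus the cosine similarity.

```java
// Illustrative vector distance measures (not the HAMA-612 implementation).
final class DistanceMeasures {

    // Euclidean distance between two equal-length vectors.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity; assumes neither vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Euclidean distance is sensitive to how large the ratings are, while cosine distance only compares the direction of the two preference vectors.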
Collaborative filtering BSP

    Given user ratings about movies

    "john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8

    "paula" -> 7, 3, 8, 2, 8.5, 0, 0

    "jim" -> 4, 5, 0, 5, 8, 0, 1.5

    "tom" -> 9, 4, 9, 1, 5, 0, 8

    "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0


    We ask for the 2 nearest users to “paula” and
    we get “timothy” and “tom”
        
            user recommendation

    We can extract movies highly rated by
    “timothy” and “tom” that “paula” didn’t see
        
            Item recommendation
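
The example above can be reproduced with a few lines of plain Java. This is a sketch under assumptions: Euclidean distance is used as the measure, a rating of 0 is read as "not seen", and the class name `NearestUsersSketch` and the `minRating` threshold are made up for illustration. Movies are identified by their column index since the slide gives no titles.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NearestUsersSketch {

    // Ratings copied from the slide; one column per movie.
    static Map<String, double[]> ratings() {
        Map<String, double[]> r = new LinkedHashMap<String, double[]>();
        r.put("john", new double[] {0, 0, 0, 9.5, 4.5, 9.5, 8});
        r.put("paula", new double[] {7, 3, 8, 2, 8.5, 0, 0});
        r.put("jim", new double[] {4, 5, 0, 5, 8, 0, 1.5});
        r.put("tom", new double[] {9, 4, 9, 1, 5, 0, 8});
        r.put("timothy", new double[] {7, 3, 5.5, 0, 9.5, 6.5, 0});
        return r;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Names of the k users closest to the target, nearest first.
    static List<String> nearest(Map<String, double[]> ratings, String target, int k) {
        final double[] mine = ratings.get(target);
        List<String> others = new ArrayList<String>(ratings.keySet());
        others.remove(target);
        others.sort(Comparator.comparingDouble(name -> euclidean(mine, ratings.get(name))));
        return new ArrayList<String>(others.subList(0, Math.min(k, others.size())));
    }

    // Movie indexes the target rated 0 (assumed "not seen")
    // that at least one neighbour rated minRating or higher.
    static List<Integer> recommend(Map<String, double[]> ratings, String target,
                                   List<String> neighbours, double minRating) {
        double[] mine = ratings.get(target);
        List<Integer> movies = new ArrayList<Integer>();
        for (int m = 0; m < mine.length; m++) {
            if (mine[m] != 0) {
                continue; // already seen
            }
            for (String n : neighbours) {
                if (ratings.get(n)[m] >= minRating) {
                    movies.add(m);
                    break;
                }
            }
        }
        return movies;
    }
}
```

Calling `nearest(ratings(), "paula", 2)` gives the user recommendation; passing that list to `recommend` with a threshold such as 6.0 gives the item recommendation.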
Benchmarks

    Fairly simple algorithm

    Highly iterative

    Comparing to Apache Mahout

    Behaves better than ALS-WR

    Behaves similarly to RecommenderJob and
    ItemSimilarityJob




K-Means clustering

    We have a bunch of data (e.g. documents)

    We want to group those docs into k
    homogeneous clusters

    Iteratively for each cluster
    
        Calculate new cluster center
    
        Add doc nearest to new center to the cluster




K-Means clustering




K-Means clustering BSP

    Iteratively

    Superstep 1*i

    Assignment phase

    Read vector splits

    Sum up temporary centers with assigned vectors

    Broadcast sum and ingested vectors count

    Superstep 2*i

    Update phase

    Calculate the total sum over all received
    messages and average

    Replace old centers with new centers and check
    for convergence
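
The two phases above can be sketched as a single-process simulation in which each "task" owns one split of the vectors. Everything named here (`KMeansBspSketch`, `assign`, `update`, the trailing count column) is an assumption for illustration and not Hama's K-Means code; convergence is checked as "centers did not change", which is only one possible criterion.

```java
import java.util.Arrays;

final class KMeansBspSketch {

    // Assignment phase for one task: per-center vector sums,
    // with the number of assigned vectors in the last column.
    static double[][] assign(double[][] split, double[][] centers) {
        int k = centers.length, d = centers[0].length;
        double[][] partial = new double[k][d + 1];
        for (double[] v : split) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (sqDist(v, centers[c]) < sqDist(v, centers[best])) {
                    best = c;
                }
            }
            for (int j = 0; j < d; j++) {
                partial[best][j] += v[j];
            }
            partial[best][d] += 1;
        }
        return partial; // this is what a task would broadcast
    }

    // Update phase: total the partial sums from all tasks and average.
    static double[][] update(double[][][] partials, double[][] oldCenters) {
        int k = oldCenters.length, d = oldCenters[0].length;
        double[][] next = new double[k][d];
        for (int c = 0; c < k; c++) {
            double count = 0;
            for (double[][] p : partials) {
                for (int j = 0; j < d; j++) {
                    next[c][j] += p[c][j];
                }
                count += p[c][d];
            }
            if (count == 0) {
                next[c] = oldCenters[c].clone(); // empty cluster keeps its center
            } else {
                for (int j = 0; j < d; j++) {
                    next[c][j] /= count;
                }
            }
        }
        return next;
    }

    // Alternates the two phases until the centers stop changing.
    static double[][] run(double[][][] splits, double[][] centers, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            double[][][] partials = new double[splits.length][][];
            for (int t = 0; t < splits.length; t++) {
                partials[t] = assign(splits[t], centers);
            }
            double[][] next = update(partials, centers);
            boolean converged = Arrays.deepEquals(next, centers);
            centers = next;
            if (converged) {
                break;
            }
        }
        return centers;
    }

    static double sqDist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }
}
```

Only the per-center sums and counts cross task boundaries, so the amount of communication per iteration depends on k and the vector dimension, not on the number of vectors.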
Benchmarks

    One-rack cluster (16 nodes, 256 cores)

    10G network

    On average faster than Mahout’s implementation




Gradient descent

    Optimization algorithm

    Find a (local) minimum of some function

    Used for
    
        solving linear systems
    
        solving nonlinear systems
    
        in machine learning tasks
        
            linear regression
        
            logistic regression
        
            neural network backpropagation
        
            …




Gradient descent

    Minimize a given (cost) function

    Give the function a starting point (set of parameters)

    Iteratively change parameters in order to minimize the
    function

    Stop at the (local)
    minimum





    There’s some math but intuitively:
     
       evaluate derivatives at a given point in order to choose
       where to “go” next
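
The intuition above fits in a few lines for a function of one variable. This is a minimal sketch: the class name `GradientDescent1D`, the fixed learning rate and the stopping tolerance are assumptions, not something taken from the slides.

```java
import java.util.function.DoubleUnaryOperator;

final class GradientDescent1D {

    // Follows the negative derivative from the starting point
    // until the step becomes smaller than the tolerance.
    static double minimize(DoubleUnaryOperator derivative, double start,
                           double learningRate, int maxIterations, double tolerance) {
        double x = start;
        for (int i = 0; i < maxIterations; i++) {
            // The sign of the derivative tells us where to "go" next.
            double step = learningRate * derivative.applyAsDouble(x);
            x -= step;
            if (Math.abs(step) < tolerance) {
                break; // close enough to a (local) minimum
            }
        }
        return x;
    }
}
```

For f(x) = (x - 3)^2 the derivative is 2 (x - 3), so `minimize(v -> 2 * (v - 3), 0.0, 0.1, 1000, 1e-9)` moves towards x = 3.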
Gradient descent BSP

    Iteratively
    
        Superstep 1*i
        
            each task calculates and broadcasts portions of the
            cost function with the current parameters
    
        Superstep 2*i
        
            aggregate and update cost function
        
            check the aggregated cost and iterations count
            
                cost should always decrease
    
        Superstep 3*i
        
            each task calculates and broadcasts portions of
            (partial) derivatives
    
        Superstep 4*i
        
            aggregate and update parameters
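
The four supersteps above can be simulated in one process, with each "task" owning one split of the data. The sketch assumes a one-feature linear model with a squared-error cost; `GdBspSketch` and its method names are made up for illustration and are not Hama's gradient descent code.

```java
final class GdBspSketch {

    // Superstep 1 (per task): partial sum of squared errors; each row is {x, y}.
    static double partialCost(double[][] split, double[] theta) {
        double sum = 0;
        for (double[] row : split) {
            double err = theta[0] + theta[1] * row[0] - row[1];
            sum += err * err;
        }
        return sum;
    }

    // Superstep 3 (per task): partial sums of the derivatives.
    static double[] partialGradient(double[][] split, double[] theta) {
        double g0 = 0, g1 = 0;
        for (double[] row : split) {
            double err = theta[0] + theta[1] * row[0] - row[1];
            g0 += err;
            g1 += err * row[0];
        }
        return new double[] {g0, g1};
    }

    static double[] fit(double[][][] splits, double alpha, int maxIterations, double tolerance) {
        int m = 0;
        for (double[][] split : splits) {
            m += split.length;
        }
        double[] theta = {0, 0};
        double previousCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < maxIterations; i++) {
            // Superstep 2: aggregate the cost and check it.
            double cost = 0;
            for (double[][] split : splits) {
                cost += partialCost(split, theta);
            }
            cost /= 2.0 * m;
            if (cost > previousCost + tolerance) {
                throw new IllegalStateException("cost increased; learning rate may be too high");
            }
            if (previousCost - cost < tolerance) {
                break; // converged
            }
            previousCost = cost;
            // Superstep 4: aggregate the derivatives and update the parameters.
            double g0 = 0, g1 = 0;
            for (double[][] split : splits) {
                double[] g = partialGradient(split, theta);
                g0 += g[0];
                g1 += g[1];
            }
            theta[0] -= alpha * g0 / m;
            theta[1] -= alpha * g1 / m;
        }
        return theta;
    }
}
```

Because both the cost and the derivatives are sums over data points, every task can work on its own split and only the partial sums need to be exchanged.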

Gradient descent BSP

    Simplistic example
    
        Linear regression
    
        Given real estate market dataset
    
        Estimate new houses’ prices given known
        houses’ sizes, geographic regions and prices
    
        Expected output: actual parameters for the
        (linear) prediction function
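
For reference, a common way to write this down, assuming the usual least-squares formulation (the slides do not state the exact cost function used). Here x is the house size, y the price, m the number of houses in a region and α the learning rate; all symbols are introduced for this sketch.

```latex
% prediction function with parameters \theta_0, \theta_1
h_\theta(x) = \theta_0 + \theta_1 x

% cost function to minimize
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

% gradient descent update, applied to both parameters at each iteration
\theta_0 \leftarrow \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
\qquad
\theta_1 \leftarrow \theta_1 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
```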




Gradient descent BSP

    Generate a different model for each region

    House item vectors
    
        price -> size
    
        150k -> 80

    2 dimensional space

    ~1.3M vectors dataset




Gradient descent BSP

    Dataset and model fit




Gradient descent BSP

    Cost checking




Gradient descent BSP

    Classification

    Logistic regression with gradient descent

    Real estate market dataset

    We want to find which estate listings belong to agencies
     
         To avoid buying from them 


    Same algorithm

    With different cost function and features

    Existing items are tagged as “belonging to agency” or not

    Create vectors from items’ text

    Sample vector
     
         1 -> 1 3 0 0 5 3 4 1
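
What changes for classification is the hypothesis and the cost. The sketch below assumes the standard sigmoid hypothesis with log-loss, and reads the sample vector as "label -> features"; both are assumptions, as are the class name `LogisticSketch` and the use of `theta[0]` as a bias term.

```java
final class LogisticSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Predicted probability that the item belongs to an agency (label 1).
    // theta[0] is a bias term; theta[j + 1] weights features[j].
    static double predict(double[] theta, double[] features) {
        double z = theta[0];
        for (int j = 0; j < features.length; j++) {
            z += theta[j + 1] * features[j];
        }
        return sigmoid(z);
    }

    // Log-loss contribution of one labelled item.
    static double cost(double[] theta, double[] features, int label) {
        double h = predict(theta, features);
        return label == 1 ? -Math.log(h) : -Math.log(1.0 - h);
    }

    // Gradient contribution of one labelled item: (prediction - label) * feature.
    static double[] gradient(double[] theta, double[] features, int label) {
        double err = predict(theta, features) - label;
        double[] g = new double[theta.length];
        g[0] = err;
        for (int j = 0; j < features.length; j++) {
            g[j + 1] = err * features[j];
        }
        return g;
    }
}
```

The per-item cost and gradient are summed over all items, so they can be split across tasks and aggregated in the same four supersteps used for linear regression.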




Gradient descent BSP

    Classification




Benchmarks

    Not directly comparable to Mahout’s
    regression algorithms

    Both SGD and CGD are inherently better than
    plain GD

    But Hama GD had, on average, the same
    performance as Mahout’s SGD / CGD

    Next step is implementing SGD / CGD on top of
    Hama 




Wrap up

    Even if

    ML module is still “young” / work in progress

    and tools like Apache Mahout have better
    “coverage”


    Apache Hama can be particularly useful in
    certain “highly iterative” use cases

    Interesting benchmarks



Thanks!





More Related Content

What's hot

The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Indic threads pune12-apache-crunch
Indic threads pune12-apache-crunchIndic threads pune12-apache-crunch
Indic threads pune12-apache-crunchIndicThreads
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2Tianwei Liu
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveJoydeep Sen Sarma
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 

What's hot (20)

The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Apache Crunch
Apache CrunchApache Crunch
Apache Crunch
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Indic threads pune12-apache-crunch
Indic threads pune12-apache-crunchIndic threads pune12-apache-crunch
Indic threads pune12-apache-crunch
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Hadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspectiveHadoop Scheduling - a 7 year perspective
Hadoop Scheduling - a 7 year perspective
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 

Viewers also liked

Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopApache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopEdward Yoon
 
07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descentSubhas Kumar Ghosh
 
Pregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingPregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingChris Bunch
 
Using Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningUsing Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningDr. Volkan OBAN
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent methodSanghyuk Chun
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

Viewers also liked (8)

Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on HadoopApache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
Apache HAMA: An Introduction toBulk Synchronization Parallel on Hadoop
 
07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
Pregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph ProcessingPregel: A System for Large-Scale Graph Processing
Pregel: A System for Large-Scale Graph Processing
 
Using Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningUsing Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and Learning
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 

Similar to Machine learning with Apache Hama

Stream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStreamlio
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
Introduction of Apache Hama - 2011
Introduction of Apache Hama - 2011Introduction of Apache Hama - 2011
Introduction of Apache Hama - 2011Edward Yoon
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based programRalf Gommers
 
Hkube
HkubeHkube
Hkubehkube
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNJosh Patterson
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerNopparat Nopkuat
 
Java 8 - A step closer to Parallelism
Java 8 - A step closer to ParallelismJava 8 - A step closer to Parallelism
Java 8 - A step closer to Parallelismjbugkorea
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafkaNitin Kumar
 
Systems building-systems-a-puppet-story-19133
Systems building-systems-a-puppet-story-19133Systems building-systems-a-puppet-story-19133
Systems building-systems-a-puppet-story-19133guestd90cb0
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...Flink Forward
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...netvis
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationTravis Oliphant
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learningjie cao
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoopishan0019
 

Similar to Machine learning with Apache Hama (20)

Stream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar FunctionsStream-Native Processing with Pulsar Functions
Stream-Native Processing with Pulsar Functions
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
Introduction of Apache Hama - 2011
Introduction of Apache Hama - 2011Introduction of Apache Hama - 2011
Introduction of Apache Hama - 2011
 
Parallelism in a NumPy-based program
Parallelism in a NumPy-based programParallelism in a NumPy-based program
Parallelism in a NumPy-based program
 
Hkube
HkubeHkube
Hkube
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Introduction to LAVA Workload Scheduler
Introduction to LAVA Workload SchedulerIntroduction to LAVA Workload Scheduler
Introduction to LAVA Workload Scheduler
 
Plreuse
PlreusePlreuse
Plreuse
 
Java 8 - A step closer to Parallelism
Java 8 - A step closer to ParallelismJava 8 - A step closer to Parallelism
Java 8 - A step closer to Parallelism
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
 
Systems building-systems-a-puppet-story-19133
Systems building-systems-a-puppet-story-19133Systems building-systems-a-puppet-story-19133
Systems building-systems-a-puppet-story-19133
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...Flink Forward SF 2017: Malo Deniélou -  No shard left behind: Dynamic work re...
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re...
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
The sFlow Standard: Scalable, Unified Monitoring of Networks, Systems and App...
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Standardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft PresentationStandardizing arrays -- Microsoft Presentation
Standardizing arrays -- Microsoft Presentation
 
Challenges on Distributed Machine Learning
Challenges on Distributed Machine LearningChallenges on Distributed Machine Learning
Challenges on Distributed Machine Learning
 
Map reduce in Hadoop
Map reduce in HadoopMap reduce in Hadoop
Map reduce in Hadoop
 

More from Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on codeTommaso Teofili
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA IntroductionTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesTommaso Teofili
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationTommaso Teofili
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic SearchTommaso Teofili
 

More from Tommaso Teofili (19)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Apache UIMA - Hands on code
Apache UIMA - Hands on codeApache UIMA - Hands on code
Apache UIMA - Hands on code
 
Apache UIMA Introduction
Apache UIMA IntroductionApache UIMA Introduction
Apache UIMA Introduction
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Information Extraction with UIMA - Usecases
Information Extraction with UIMA - UsecasesInformation Extraction with UIMA - Usecases
Information Extraction with UIMA - Usecases
 
Apache UIMA and Metadata Generation
Apache UIMA and Metadata GenerationApache UIMA and Metadata Generation
Apache UIMA and Metadata Generation
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Apache UIMA and Semantic Search
Apache UIMA and Semantic SearchApache UIMA and Semantic Search
Apache UIMA and Semantic Search
 

Machine learning with Apache Hama

  • 1. Machine Learning with Apache Hama
      • Tommaso Teofili, tommaso [at] apache [dot] org
  • 2. About me
      • ASF member having fun with:
          • Lucene / Solr
          • Hama
          • UIMA
          • Stanbol
          • … some others
      • SW engineer @ Adobe R&D
  • 3. Agenda
      • Apache Hama and BSP
      • Why machine learning on BSP
      • Some examples
      • Benchmarks
  • 4. Apache Hama
      • Bulk Synchronous Parallel computing framework on top of HDFS for massive scientific computations
      • TLP since May 2012
      • 0.6.0 release out soon
      • Growing community
  • 5. BSP supersteps
      • A BSP algorithm is composed of a sequence of “supersteps”
  • 6. BSP supersteps
      • Each task
      • Superstep 1
          • Do some computation
          • Communicate with other tasks
          • Synchronize
      • Superstep 2
          • Do some computation
          • Communicate with other tasks
          • Synchronize
      • …
      • Superstep N
          • Do some computation
          • Communicate with other tasks
          • Synchronize
  • 7. Why BSP
      • Simple programming model
      • Superstep semantics are easy
      • Preserves data locality
      • Improves performance
      • Well suited for iterative algorithms
  • 8. Apache Hama architecture
      • BSP program execution flow
  • 10. Apache Hama
      • Features
          • BSP API
          • M/R like I/O API
          • Graph API
          • Job management / monitoring
          • Checkpoint recovery
          • Local & (Pseudo) Distributed run modes
          • Pluggable message transfer architecture
          • YARN supported
          • Running in Apache Whirr
  • 11. Apache Hama BSP API
      • public abstract class BSP<K1, V1, K2, V2, M extends Writable> …
      • K1, V1 are the key and value types for inputs
      • K2, V2 are the key and value types for outputs
      • M is the type of the messages used for task communication
  • 12. Apache Hama BSP API
      • public void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
      • public void setup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
      • public void cleanup(BSPPeer<K1, V1, K2, V2, M> peer) throws ..
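The compute / communicate / synchronize cycle that a bsp() method follows can be illustrated without Hama at all. The sketch below is a plain-Java analogue, not Hama code: the class SuperstepSketch, the method parallelSum and the shared masterInbox queue are hypothetical names, threads stand in for BSP tasks, and a CyclicBarrier stands in for the synchronization step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Plain-Java analogue of BSP supersteps; none of these names are Hama classes.
final class SuperstepSketch {

    // Runs `peers` tasks over `data` and returns the global sum aggregated by peer 0.
    static long parallelSum(int[] data, int peers) {
        Queue<Long> masterInbox = new ConcurrentLinkedQueue<>(); // messages addressed to peer 0
        CyclicBarrier barrier = new CyclicBarrier(peers);        // stands in for the sync step
        long[] result = new long[1];
        List<Thread> threads = new ArrayList<>();

        for (int p = 0; p < peers; p++) {
            final int id = p;
            Thread t = new Thread(() -> {
                try {
                    // Superstep 1, computation: sum this peer's slice of the input.
                    long local = 0;
                    for (int i = id; i < data.length; i += peers) {
                        local += data[i];
                    }
                    // Superstep 1, communication: send the partial sum to peer 0.
                    masterInbox.add(local);
                    // Superstep 1, synchronization: wait until every peer has sent.
                    barrier.await();

                    // Superstep 2: peer 0 aggregates the messages it received.
                    if (id == 0) {
                        long total = 0;
                        for (long partial : masterInbox) {
                            total += partial;
                        }
                        result[0] = total;
                    }
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return result[0];
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(parallelSum(data, 4)); // prints 55
    }
}
```

In a real Hama job the framework starts the tasks and delivers the messages; the barrier here only mimics the guarantee that messages sent in one superstep are visible in the next.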
  • 13. Machine learning on BSP
      • Lots (most?) of ML algorithms are inherently iterative
      • Hama ML module currently includes
          • Collaborative filtering
          • Clustering
          • Gradient descent
  • 14. Benchmarking architecture
      • (Diagram: cluster nodes running Hama and Mahout on HDFS, alongside Solr, Lucene and a DBMS)
  • 15. Collaborative filtering
      • Given user preferences on movies
      • We want to find users “near” to some specific user
      • So that this user can “follow” them
      • And/or see what they like (which he/she could like too)
  • 16. Collaborative filtering BSP
      • Given a specific user
      • Iteratively (for each task)
          • Superstep 1*i
              • Read a new user preference row
              • Find how near that user is to the current user
              • That is, find how near their preferences are
              • Since preferences are given as vectors, we may use vector distance measures such as Euclidean or cosine distance
              • Broadcast the measure output to other peers
          • Superstep 2*i
              • Aggregate measure outputs
              • Update most relevant users
      • Still to be committed (HAMA-612)
  • 17. Collaborative filtering BSP
      • Given user ratings about movies
          • "john" -> 0, 0, 0, 9.5, 4.5, 9.5, 8
          • "paula" -> 7, 3, 8, 2, 8.5, 0, 0
          • "jim" -> 4, 5, 0, 5, 8, 0, 1.5
          • "tom" -> 9, 4, 9, 1, 5, 0, 8
          • "timothy" -> 7, 3, 5.5, 0, 9.5, 6.5, 0
      • We ask for the 2 nearest users to “paula” and we get “timothy” and “tom”
          • User recommendation
      • We can extract movies highly rated by “timothy” and “tom” that “paula” didn’t see
          • Item recommendation
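The nearest-user lookup on the ratings above can be reproduced in a few lines. This is a single-process sketch, not the HAMA-612 implementation: the class NearestUsersSketch and its methods are hypothetical names, and Euclidean distance is assumed as the measure (the slides list it only as one option).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-process sketch of the user-similarity step; names are illustrative only.
final class NearestUsersSketch {

    // Rating vectors copied from the slide (one entry per movie, 0 = not rated).
    static final Map<String, double[]> RATINGS = new LinkedHashMap<>();
    static {
        RATINGS.put("john", new double[] {0, 0, 0, 9.5, 4.5, 9.5, 8});
        RATINGS.put("paula", new double[] {7, 3, 8, 2, 8.5, 0, 0});
        RATINGS.put("jim", new double[] {4, 5, 0, 5, 8, 0, 1.5});
        RATINGS.put("tom", new double[] {9, 4, 9, 1, 5, 0, 8});
        RATINGS.put("timothy", new double[] {7, 3, 5.5, 0, 9.5, 6.5, 0});
    }

    // Euclidean distance between two rating vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the k users whose rating vectors are closest to the given user's.
    static List<String> nearest(String user, int k) {
        double[] target = RATINGS.get(user);
        List<String> others = new ArrayList<>(RATINGS.keySet());
        others.remove(user);
        others.sort(Comparator.comparingDouble(
                (String name) -> euclidean(target, RATINGS.get(name))));
        return new ArrayList<>(others.subList(0, Math.min(k, others.size())));
    }

    public static void main(String[] args) {
        System.out.println(nearest("paula", 2)); // prints [timothy, tom]
    }
}
```

In the BSP version each task would compute these distances for its own rows and broadcast them, with the sort-and-truncate step happening in the aggregation superstep.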
  • 18. Benchmarks
      • Fairly simple algorithm
      • Highly iterative
      • Compared to Apache Mahout
          • Behaves better than ALS-WR
          • Behaves similarly to RecommenderJob and ItemSimilarityJob
  • 19. K-Means clustering
      • We have a bunch of data (e.g. documents)
      • We want to group those docs in k homogeneous clusters
      • Iteratively for each cluster
          • Calculate new cluster center
          • Add the docs nearest to the new center to the cluster
  • 21. K-Means clustering BSP
      • Iteratively
          • Superstep 1*i
              • Assignment phase
              • Read vector splits
              • Sum up temporary centers with assigned vectors
              • Broadcast sum and ingested vectors count
          • Superstep 2*i
              • Update phase
              • Calculate the total sum over all received messages and average
              • Replace old centers with new centers and check for convergence
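The two phases above can be sketched sequentially, with one method call standing in for each task. This is an illustration under assumptions, not the Hama ML module's code: KMeansBspSketch, assign, update and cluster are hypothetical names, and each "message" is a per-center sum with the vector count stored in an extra last column.

```java
import java.util.Arrays;

// Sequential sketch of the two k-means supersteps; names are illustrative only.
final class KMeansBspSketch {

    // Index of the center closest to v (squared Euclidean distance).
    static int nearestCenter(double[] v, double[][] centers) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double distance = 0;
            for (int d = 0; d < v.length; d++) {
                double diff = v[d] - centers[c][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Assignment phase for one task: per-center vector sums, with the count in the last column.
    static double[][] assign(double[][] split, double[][] centers) {
        int dim = centers[0].length;
        double[][] partial = new double[centers.length][dim + 1];
        for (double[] v : split) {
            int c = nearestCenter(v, centers);
            for (int d = 0; d < dim; d++) {
                partial[c][d] += v[d];
            }
            partial[c][dim] += 1;
        }
        return partial;
    }

    // Update phase: total the received messages and average them into new centers.
    static double[][] update(double[][][] messages, double[][] oldCenters) {
        int dim = oldCenters[0].length;
        double[][] next = new double[oldCenters.length][dim];
        for (int c = 0; c < oldCenters.length; c++) {
            double[] sum = new double[dim];
            double count = 0;
            for (double[][] message : messages) {
                for (int d = 0; d < dim; d++) {
                    sum[d] += message[c][d];
                }
                count += message[c][dim];
            }
            for (int d = 0; d < dim; d++) {
                // An empty cluster keeps its old center.
                next[c][d] = count > 0 ? sum[d] / count : oldCenters[c][d];
            }
        }
        return next;
    }

    // Repeats both phases until the centers stop changing or maxIterations is reached.
    static double[][] cluster(double[][][] splits, double[][] initial, int maxIterations) {
        double[][] centers = initial;
        for (int i = 0; i < maxIterations; i++) {
            double[][][] messages = new double[splits.length][][];
            for (int t = 0; t < splits.length; t++) {
                messages[t] = assign(splits[t], centers); // done in parallel by each BSP task
            }
            double[][] next = update(messages, centers);
            if (Arrays.deepEquals(next, centers)) {
                break; // converged
            }
            centers = next;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][][] splits = {
            {{1, 1}, {2, 2}, {9, 9}},
            {{3, 3}, {10, 10}, {11, 11}}
        };
        double[][] centers = cluster(splits, new double[][] {{0, 0}, {10, 10}}, 20);
        System.out.println(Arrays.deepToString(centers)); // prints [[2.0, 2.0], [10.0, 10.0]]
    }
}
```

Sending sums and counts rather than the assigned vectors keeps each message proportional to k times the dimension, independent of the split size.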
  • 22. Benchmarks
      • One-rack cluster (16 nodes, 256 cores)
      • 10G network
      • On average faster than Mahout’s implementation
  • 23. Gradient descent
      • Optimization algorithm
      • Find a (local) minimum of some function
      • Used for
          • solving linear systems
          • solving nonlinear systems
          • in machine learning tasks
              • linear regression
              • logistic regression
              • neural network backpropagation
              • …
  • 24. Gradient descent
      • Minimize a given (cost) function
      • Give the function a starting point (set of parameters)
      • Iteratively change parameters in order to minimize the function
      • Stop at the (local) minimum
      • There’s some math but intuitively:
          • evaluate derivatives at a given point in order to choose where to “go” next
  • 25. Gradient descent BSP
      • Iteratively
          • Superstep 1*i
              • each task calculates and broadcasts portions of the cost function with the current parameters
          • Superstep 2*i
              • aggregate and update cost function
              • check the aggregated cost and iterations count
              • cost should always decrease
          • Superstep 3*i
              • each task calculates and broadcasts portions of (partial) derivatives
          • Superstep 4*i
              • aggregate and update parameters
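The four supersteps can be sketched sequentially for a one-feature linear model. Everything here is an assumption made for illustration rather than the Hama ML module's code: the class and method names, the squared-error cost, the learning rate alpha and the 1e-12 stopping tolerance are all hypothetical choices.

```java
// Sequential sketch of the four gradient descent supersteps for y = theta0 + theta1 * x.
// Names, cost function and stopping rule are illustrative assumptions.
final class GradientDescentBspSketch {

    // Superstep 1 for one task: partial sum of squared errors over its split (rows are {x, y}).
    static double partialCost(double[][] split, double[] theta) {
        double sum = 0;
        for (double[] row : split) {
            double error = theta[0] + theta[1] * row[0] - row[1];
            sum += error * error;
        }
        return sum;
    }

    // Superstep 3 for one task: partial derivatives of the cost over its split.
    static double[] partialGradient(double[][] split, double[] theta) {
        double[] gradient = new double[2];
        for (double[] row : split) {
            double error = theta[0] + theta[1] * row[0] - row[1];
            gradient[0] += error;
            gradient[1] += error * row[0];
        }
        return gradient;
    }

    static double[] fit(double[][][] splits, double alpha, int maxIterations) {
        int m = 0;
        for (double[][] split : splits) {
            m += split.length;
        }
        double[] theta = {0, 0};
        double previousCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < maxIterations; i++) {
            // Superstep 2: aggregate the partial costs and check that the cost still decreases.
            double cost = 0;
            for (double[][] split : splits) {
                cost += partialCost(split, theta);
            }
            cost /= 2.0 * m;
            if (previousCost - cost < 1e-12) {
                break; // no meaningful improvement any more
            }
            previousCost = cost;

            // Superstep 4: aggregate the partial derivatives and update the parameters.
            double g0 = 0;
            double g1 = 0;
            for (double[][] split : splits) {
                double[] gradient = partialGradient(split, theta);
                g0 += gradient[0];
                g1 += gradient[1];
            }
            theta[0] -= alpha * g0 / m;
            theta[1] -= alpha * g1 / m;
        }
        return theta;
    }

    public static void main(String[] args) {
        // Two splits of points lying exactly on y = 1 + 2x.
        double[][][] splits = {
            {{0, 1}, {1, 3}},
            {{2, 5}, {3, 7}}
        };
        double[] theta = fit(splits, 0.1, 10000);
        System.out.printf("theta0=%.3f theta1=%.3f%n", theta[0], theta[1]);
    }
}
```

If alpha is too large the aggregated cost grows instead of shrinking, which is what the "cost should always decrease" check is meant to catch; here such a step simply ends the loop.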
  • 26. Gradient descent BSP
      • Simplistic example
      • Linear regression
      • Given a real estate market dataset
      • Estimate new houses’ prices given known houses’ size, geographic region and prices
      • Expected output: actual parameters for the (linear) prediction function
  • 27. Gradient descent BSP
      • Generate a different model for each region
      • House item vectors
          • price -> size
          • 150k -> 80
      • 2-dimensional space
      • Dataset of ~1.3M vectors
  • 28. Gradient descent BSP
      • Dataset and model fit
  • 29. Gradient descent BSP
      • Cost checking
  • 30. Gradient descent BSP
      • Classification
      • Logistic regression with gradient descent
      • Real estate market dataset
      • We want to find which estate listings belong to agencies
          • To avoid buying from them
      • Same algorithm
          • With different cost function and features
      • Existing items are tagged or not as “belonging to agency”
      • Create vectors from items’ text
      • Sample vector
          • 1 -> 1 3 0 0 5 3 4 1
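"Same algorithm with a different cost function" can be made concrete with a small single-process sketch of logistic regression trained by batch gradient descent. The class LogisticRegressionSketch, the toy one-feature dataset and the learning rate are hypothetical; the deck does not show how its text features were built.

```java
// Single-process sketch of logistic regression trained with batch gradient descent.
// Names, data and learning rate are illustrative assumptions.
final class LogisticRegressionSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability that the item belongs to class 1; theta[0] is the bias term.
    static double predict(double[] theta, double[] features) {
        double z = theta[0];
        for (int j = 0; j < features.length; j++) {
            z += theta[j + 1] * features[j];
        }
        return sigmoid(z);
    }

    // Gradient of the cross-entropy cost is the mean of (prediction - label) * feature.
    static double[] train(double[][] x, int[] y, double alpha, int iterations) {
        int m = x.length;
        int n = x[0].length;
        double[] theta = new double[n + 1];
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[n + 1];
            for (int i = 0; i < m; i++) {
                double error = predict(theta, x[i]) - y[i];
                gradient[0] += error;
                for (int j = 0; j < n; j++) {
                    gradient[j + 1] += error * x[i][j];
                }
            }
            for (int j = 0; j <= n; j++) {
                theta[j] -= alpha * gradient[j] / m;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        double[][] x = {{-2}, {-1}, {1}, {2}};
        int[] y = {0, 0, 1, 1}; // 1 would mean "belonging to agency" in the deck's use case
        double[] theta = train(x, y, 0.5, 1000);
        System.out.println(predict(theta, new double[] {1.5}) > 0.5);  // prints true
        System.out.println(predict(theta, new double[] {-1.5}) > 0.5); // prints false
    }
}
```

Only the hypothesis (a sigmoid over the linear combination) differs from the linear regression case; the per-split partial sums and the aggregation supersteps would stay the same.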
  • 31. Gradient descent BSP
      • Classification
  • 32. Benchmarks
      • Not directly comparable to Mahout’s regression algorithms
      • Both SGD and CGD are inherently better than plain GD
      • But Hama GD had on average the same performance as Mahout’s SGD / CGD
      • Next step is implementing SGD / CGD on top of Hama
  • 33. Wrap up
      • Even if
          • the ML module is still “young” / work in progress
          • and tools like Apache Mahout have better “coverage”
      • Apache Hama can be particularly useful in certain “highly iterative” use cases
      • Interesting benchmarks
  • 34. Thanks!