Parallel auto-tuning of
machine learning
algorithms
Gianmario Spacagna
gm.spacagna@gmail.com

16 October 2012




AgilOne, Inc.                 (877) 769-3047
1091 N Shoreline Blvd. #250   (408) 404-0152 fax
Mountain View, CA 94043       sales@agilone.com
Motivation
• Increase revenue of cloud service providers
    → Keep cost curve linear w.r.t. the expected
    exponential income growth.
    [Figure: income and cost curves]
• Technically achievable through Scalability:
    • Scalability in terms of resources → Distributed Parallel
      Computing (Hadoop).
    • Scalability in terms of multi-tenancy → Same system
      running for several customers.
    • Scalability in terms of auto-configuration →
      Avoiding manual tuning up operations.




2
Good Work Flow


    Good         ML                     Good
    data      Algorithm                results!



                Tuning
           (Adjusting configuration)
3
General Tuning diagram
         Test Data



       Run algorithm
        with conf. X



             Are       no     Change
           results          configuration
           good?                  X

                yes
           Tuned
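
The loop in the diagram can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `tune`, `run_algorithm`, `is_good` and `propose` are hypothetical and do not come from the slides, and the toy algorithm simply scores one numeric parameter.

```python
import random

def tune(run_algorithm, is_good, propose, initial_conf, max_iterations=100):
    """Generic tuning loop: run, check the results, change the configuration, repeat."""
    conf = initial_conf
    for _ in range(max_iterations):
        results = run_algorithm(conf)          # "Run algorithm with conf. X"
        if is_good(results):                   # "Are results good?"
            return conf, results               # "Tuned"
        conf = propose(conf, results)          # "Change configuration X"
    raise RuntimeError("no good configuration found within the iteration budget")

# Toy stand-in for an ML algorithm (assumption): the score is best at x = 3.
def toy_algorithm(conf):
    return {"score": -(conf["x"] - 3.0) ** 2}

rng = random.Random(0)
best_conf, best_results = tune(
    run_algorithm=toy_algorithm,
    is_good=lambda results: results["score"] > -0.01,
    propose=lambda conf, results: {"x": rng.uniform(0.0, 6.0)},  # blind random search
    initial_conf={"x": 0.0},
    max_iterations=10_000,
)
print(best_conf, best_results)
```

Here `propose` ignores the previous results, which makes it a plain random search; a smarter proposal strategy would use them.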


4
Tuning of Machine Learning
Algorithms
• We need tuning when:
    • New algorithm or version is released.
    • We want to improve accuracy and/or performance.
    • New customer comes and the system must be customized for the
      new dataset and requirements.



     We need to make it smart, automatic
                and scalable!



5
Vision

Request:
•  Data set
•  Application
         (prediction, clustering, classification…)
          •  Algorithm
               (ANN, LR, K-means…)
•  Fitness metrics
         (Std. dev, Prob. of false true,
         clustering coeff., randomness…)
•  Goal constraints
         (x > 0.9 & 0.3 < y < 0.5)

        → Magic Box →

Response:
•  Best algorithm
•  Optimal configuration
•  Metrics evaluation
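
One possible way to represent the goal constraints of a request, such as x > 0.9 & 0.3 < y < 0.5, is as open intervals per metric. The sketch below is an assumption, not the author's design; `GOAL` and `meets_goal` are hypothetical names.

```python
import math

# Open intervals (lower, upper) per metric, encoding x > 0.9 and 0.3 < y < 0.5.
GOAL = {"x": (0.9, math.inf), "y": (0.3, 0.5)}

def meets_goal(metrics, goal=GOAL):
    """True when every constrained metric lies strictly inside its interval."""
    return all(low < metrics[name] < high for name, (low, high) in goal.items())

print(meets_goal({"x": 0.95, "y": 0.4}))
print(meets_goal({"x": 0.95, "y": 0.6}))
```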




     6
Architecture Design
                   Upper Applications API

              Initializer

                            Controller

                            Scheduler

      Executor              Executor        Executor
        ANN                   LR            K-Means
       Evaluator             Evaluator       Evaluator

         Data                  Data           Data
        Sampler               Sampler        Sampler


       Local                Hadoop          Cloud Service

7
Upper Applications API
Tasks:
• Interfaces the communication between the system and the
    upper applications layer.
• Parses requests and results and generates the related output
    domain object.
Possible data format:
• JSON
• STDIN/OUT
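
As an illustration of the JSON option, the sketch below parses a request into a domain object and renders a result back to JSON. The field names (`dataset`, `algorithms`, `metrics`, `configuration`) and the classes `Request` and `Result` are assumptions made for this example; the slides do not define a schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List

@dataclass
class Request:                      # hypothetical request domain object
    dataset: str
    algorithms: List[str]
    metrics: List[str]

@dataclass
class Result:                       # hypothetical result domain object
    algorithm: str
    configuration: Dict[str, float]
    metrics: Dict[str, float]

def parse_request(text: str) -> Request:
    """Turn a JSON request (e.g. read from STDIN) into a domain object."""
    raw = json.loads(text)
    return Request(
        dataset=raw["dataset"],
        algorithms=list(raw["algorithms"]),
        metrics=list(raw["metrics"]),
    )

def render_result(result: Result) -> str:
    """Serialize a result for the upper applications layer (e.g. written to STDOUT)."""
    return json.dumps(asdict(result), sort_keys=True)

req = parse_request('{"dataset": "sample.csv", "algorithms": ["K-means"], "metrics": ["std_dev"]}')
out = render_result(Result("K-means", {"k": 8}, {"std_dev": 0.12}))
print(req)
print(out)
```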




8
Initializer
Tasks:
• Generates the initial set of configurations.
Possible implementations:
• Random points
• Latin Hyper Cube
• Dataset similarity
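
Two of the listed initializers can be sketched as follows. This is a minimal illustration assuming continuous parameters with known bounds; the function names and the example parameters `k` and `tolerance` are hypothetical.

```python
import random

def random_points(bounds, n, rng):
    """n configurations drawn independently and uniformly inside the bounds."""
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
            for _ in range(n)]

def latin_hypercube(bounds, n, rng):
    """n configurations; each parameter range is cut into n equal strata
    and every stratum is used exactly once per parameter."""
    columns = {}
    for name, (lo, hi) in bounds.items():
        width = (hi - lo) / n
        samples = [lo + (i + rng.random()) * width for i in range(n)]
        rng.shuffle(samples)        # decouple the strata across parameters
        columns[name] = samples
    return [{name: columns[name][i] for name in bounds} for i in range(n)]

rng = random.Random(42)
bounds = {"k": (2.0, 20.0), "tolerance": (0.0, 1.0)}   # assumed parameter ranges
points = latin_hypercube(bounds, 5, rng)
baseline = random_points(bounds, 5, rng)
for point in points:
    print(point)
```

Compared with independent random points, the Latin hypercube guarantees that each parameter is covered evenly across its range even with few samples.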




9
Controller
Tasks:
• Compares and generates configurations.
• Decides the convergence of the tuning.
• Adapts the data sampling request.
Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
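
A genetic-algorithm controller could look roughly like the sketch below: it compares configurations by fitness, generates new ones by crossover and mutation, and decides convergence from the history of best scores. All names, the truncation selection scheme and the toy fitness function are assumptions for illustration, not the design from the slides.

```python
import random

def next_generation(population, fitness, rng, mutation_scale=0.1):
    """One GA step: keep the fitter half, refill with mutated crossovers of survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(2, len(ranked) // 2)]
    children = []
    while len(survivors) + len(children) < len(population):
        mother, father = rng.sample(survivors, 2)
        # Uniform crossover: each parameter comes from one parent, plus Gaussian mutation.
        child = {name: rng.choice((mother[name], father[name])) + rng.gauss(0.0, mutation_scale)
                 for name in mother}
        children.append(child)
    return survivors + children

def has_converged(history, tolerance=1e-3, patience=5):
    """Stop when the best fitness improved by less than `tolerance` over `patience` generations."""
    return len(history) > patience and history[-1] - history[-1 - patience] < tolerance

def toy_fitness(conf):
    # Toy objective (assumption): the maximum is at x = 3, y = -1.
    return -(conf["x"] - 3.0) ** 2 - (conf["y"] + 1.0) ** 2

rng = random.Random(1)
population = [{"x": rng.uniform(-5, 5), "y": rng.uniform(-5, 5)} for _ in range(20)]
history = []
for _ in range(200):
    population = next_generation(population, toy_fitness, rng)
    history.append(max(toy_fitness(conf) for conf in population))
    if has_converged(history):
        break
print(len(history), history[-1])
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next.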




10
Scheduler
Tasks:
• Checks if the requests are covered by the available services.
• Schedules and parallelizes requests executions.
• Optimizes resources.
• Collects evaluated results.
Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
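
As a local stand-in for such a scheduler, the sketch below uses a thread pool from the Python standard library: each pending request goes to whichever worker becomes free, and the evaluated results are collected as they complete. Setting `max_workers=1` approximates the serialized, single-node variant. The names `schedule` and `toy_execute` are hypothetical, and a real deployment on Hadoop or a cloud service would need a different back end.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def schedule(configurations, execute, max_workers=4):
    """Run execute(conf) for every configuration in parallel and collect (conf, result) pairs."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(execute, conf): conf for conf in configurations}
        for future in as_completed(futures):          # completion order, not submission order
            results.append((futures[future], future.result()))
    return results

def toy_execute(conf):
    # Toy executor (assumption): pretends the result of a run is k squared.
    return conf["k"] ** 2

evaluated = schedule([{"k": k} for k in range(1, 6)], toy_execute)
serialized = schedule([{"k": k} for k in range(1, 6)], toy_execute, max_workers=1)
print(sorted(result for _, result in evaluated))
```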




11
Executor
Tasks:
• Executes the provided algorithm with the specified configuration.
Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service
Sub components:
•  Evaluator: Evaluates results according to the specified fitness
     metrics.
•  Data Sampler: Down- and up-sampling of data.
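
The two sub components can be illustrated with a short sketch. It assumes that down-sampling means drawing a random subset without replacement and that up-sampling means adding rows drawn with replacement; the slides do not specify either, and all function names are hypothetical.

```python
import random
import statistics

def down_sample(rows, fraction, rng):
    """Keep a random subset (without replacement) of roughly `fraction` of the rows."""
    size = max(1, round(len(rows) * fraction))
    return rng.sample(rows, size)

def up_sample(rows, target_size, rng):
    """Grow the data to `target_size` rows by drawing extra rows with replacement."""
    extra = max(0, target_size - len(rows))
    return list(rows) + rng.choices(rows, k=extra)

def evaluate(values, metrics):
    """Apply each named fitness metric to the output of an algorithm run."""
    return {name: metric(values) for name, metric in metrics.items()}

rng = random.Random(7)
data = list(range(100))
small = down_sample(data, 0.1, rng)
large = up_sample(data, 150, rng)
scores = evaluate(small, {"std_dev": statistics.pstdev, "mean": statistics.fmean})
print(len(small), len(large), scores)
```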



12
Tuning diagram
                        Test Data
     Test execution
                                                            Test control

        Scheduler,    Run algorithm
        Executor       with conf. X                   Initializer,
                                                      Controller

                            Are       no     Change
                          results          configuration
                          good?                  X

                               yes
                          Tuned


13
SUNS: Simple, Unclever and Not
Scalable
                     STDIN/OUT

          Random Points

             Random Search – Grid Search

                      Serialized

                      Executor
                      K-Means
                          Evaluator




                          Local




14
SNS: Smart but Not Scalable
                   STDIN/OUT or JSON

          Latin Hyper Cube

           Genetic Algorithm / Stochastic Kriging

                         Serialized

                         Executor
                         K-Means
                          Evaluator




                             Local




15
VSNS: Very Smart but Not Scalable
                   STDIN/OUT or JSON

          Dataset Similarity

           Genetic Algorithm / Stochastic Kriging

                         Serialized

                          Executor
                          K-Means
                           Evaluator




                               Local




16
VSS: Very Smart and Scalable
                  STDIN/OUT or JSON

         Dataset Similarity

          Genetic Algorithm or Stochastic Kriging

                      First Available

                         Executor
                         K-Means
                          Evaluator




                         Hadoop




17
VSVSO: Very Smart, Very Scalable and
Optimized
                  STDIN/OUT or JSON

         Dataset Similarity

          Genetic Algorithm or Stochastic Kriging

                      Load Balanced

                          Executor
                          K-Means
                     Evaluator   Data Sampler




                           Hadoop




18
Thesis
It is possible to build an intelligent system based on Genetic
Algorithm/Stochastic Kriging that automatically selects and tunes
machine learning algorithms, such as K-Means and LR, parallelizing
the work on a Hadoop cluster to scale in a cost-efficient manner.


19
Project Plan
Order of priorities:

1.  Design the entire application in Scala in a testable and expandable
     way.
2.  Implement the Genetic Algorithm or the Stochastic Kriging controller.
3.  Implement the Latin Hyper Cube initializer.
4.  Test with local instance algorithms (K-Means and/or LR).
5.  Develop and test at least one algorithm in MapReduce fashion using
     Hadoop.
6.  Test with a real AgilOne cluster of servers.
7.  Implement the Dataset Similarity initializer.
8.  Implement the Dataset Sampler.


20
Questions, feedback, suggestions?




21
Thank you!




22

Parallel Tuning of Machine Learning Algorithms, Thesis Proposal

  • 1. Parallel auto-tuning of machine learning algorithms. Gianmario Spacagna, gm.spacagna@gmail.com, 16 October 2012. AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043; (877) 769-3047; (408) 404-0152 fax; sales@agilone.com
  • 2. Motivation • Increase revenue of cloud service providers → Keep cost curve linear w.r.t. the expected exponential income growth. (Figure: income and cost curves.) • Technically achievable through Scalability: • Scalability in terms of resources → Distributed Parallel Computing (Hadoop). • Scalability in terms of multi-tenancy → Same system running for several customers. • Scalability in terms of auto-configuration → Avoiding manual tuning up operations.
  • 3. Good Work Flow. Good data → ML Algorithm → Good results! Tuning (adjusting configuration) is attached to the ML Algorithm.
  • 4. General Tuning diagram. Test Data → Run algorithm with conf. X → Are results good? If no: change configuration X and run again. If yes: Tuned.
  • 5. Tuning of Machine Learning Algorithms • We need tuning when: • New algorithm or version is released. • We want to improve accuracy and/or performance. • New customer comes and the system must be customized for the new dataset and requirements. We need to make it smart, automatic and scalable!
  • 6. Vision. Request: • Data set • Application (prediction, clustering, classification…) • Algorithm (ANN, LR, K-means…) • Metrics • Fitness metrics evaluation (Std. dev, Prob. of false true, clustering coeff., randomness…) • Goal constraints (x > 0.9 & 0.3 < y < 0.5). The request goes into the Magic Box. Response: • Best algorithm • Optimal configuration
  • 7. Architecture Design. Layers, top to bottom: Upper Applications API; Initializer; Controller; Scheduler; three Executors (ANN, LR, K-Means), each with its own Evaluator and Data Sampler; execution backends: Cloud Service, Local, Hadoop.
  • 8. Upper Applications API. Tasks: • Interfaces the communication between the system and the upper applications layer. • Parses requests and results and generates the related output domain object. Possible data formats: • JSON • STDIN/OUT
  • 9. Initializer. Tasks: • Generates the initial set of configurations. Possible implementations: • Random points • Latin Hyper Cube • Dataset similarity
  • 10. Controller. Tasks: • Compares and generates configurations. • Decides the convergence of the tuning. • Adapts the data sampling request. Possible implementations: • Random search • Grid search • Stochastic Kriging • Genetic Algorithms
  • 11. Scheduler. Tasks: • Checks if the requests are covered by the available services. • Schedules and parallelizes request executions. • Optimizes resources. • Collects evaluated results. Possible implementations: • First available • Oldest idle • Load balanced • Serialized (single node)
  • 12. Executor. Tasks: • Executes the provided algorithm with the specified configuration. Possible implementations: • Local execution • Hadoop cluster • Cloud service. Sub-components: • Evaluator: evaluates results according to the specified fitness metrics. • Data Sampler: down- and up-sampling of data.
  • 13. Tuning diagram. Test Data → Run algorithm with conf. X (test execution: Scheduler, Executor) → Are results good? If no: change configuration X (test control: Initializer, Controller) and run again. If yes: Tuned.
  • 14. SUNS: Simple, Unclever and Not Scalable. API: STDIN/OUT. Initializer: Random Points. Controller: Random Search – Grid Search. Scheduler: Serialized. Executor: K-Means with Evaluator. Backend: Local.
  • 15. SNS: Smart but Not Scalable. API: STDIN/OUT or JSON. Initializer: Latin Hyper Cube. Controller: Genetic Algorithm / Stochastic Kriging. Scheduler: Serialized. Executor: K-Means with Evaluator. Backend: Local.
  • 16. VSNS: Very Smart but Not Scalable. API: STDIN/OUT or JSON. Initializer: Dataset Similarity. Controller: Genetic Algorithm / Stochastic Kriging. Scheduler: Serialized. Executor: K-Means with Evaluator. Backend: Local.
  • 17. VSS: Very Smart and Scalable. API: STDIN/OUT or JSON. Initializer: Dataset Similarity. Controller: Genetic Algorithm or Stochastic Kriging. Scheduler: First Available. Executor: K-Means with Evaluator. Backend: Hadoop.
  • 18. VSVSO: Very Smart, Very Scalable and Optimized. API: STDIN/OUT or JSON. Initializer: Dataset Similarity. Controller: Genetic Algorithm or Stochastic Kriging. Scheduler: Load Balanced. Executor: K-Means with Evaluator and Data Sampler. Backend: Hadoop.
  • 19. Thesis. It is possible to build an intelligent system based on Genetic Algorithm/Stochastic Kriging that automatically selects and tunes machine learning algorithms, such as K-Means and LR, parallelizing the work on a Hadoop cluster to scale in a cost-efficient manner.
  • 20. Project Plan. Order of priorities: 1. Design the entire application in Scala in a testable and expandable way. 2. Implement the Genetic Algorithm or the Stochastic Kriging controller. 3. Implement the Latin Hyper Cube initializer. 4. Test with local instance algorithms (K-Means and/or LR). 5. Develop and test at least one algorithm in MapReduce fashion using Hadoop. 6. Test with a real AgilOne cluster of servers. 7. Implement the Dataset Similarity initializer. 8. Implement the Dataset Sampler.
  • 21. Questions, feedback, suggestions?
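
The tuning loop described in the slides (Initializer, Controller, Scheduler, Executor with Evaluator) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposal's implementation: it is written in Python for brevity although the project plan names Scala, it pairs a Latin hypercube initializer with a random search controller and a serialized scheduler, and every name in it (`latin_hypercube`, `random_search`, `serialized_scheduler`, `tune`, `toy_executor`) as well as the toy fitness function and the 0.95 goal are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def latin_hypercube(bounds: Bounds, n: int, rng: random.Random) -> List[Config]:
    """Initializer: n starting points with one sample per stratum on each axis."""
    columns: Dict[str, List[float]] = {}
    for name, (lo, hi) in bounds.items():
        width = (hi - lo) / n
        # One draw inside each of the n equal-width strata, then shuffled so
        # strata are paired at random across parameters.
        samples = [lo + (i + rng.random()) * width for i in range(n)]
        rng.shuffle(samples)
        columns[name] = samples
    return [{name: columns[name][i] for name in bounds} for i in range(n)]


def random_search(bounds: Bounds, n: int, rng: random.Random) -> List[Config]:
    """Controller: propose n new configurations uniformly at random."""
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        for _ in range(n)
    ]


def serialized_scheduler(
    configs: List[Config], executor: Callable[[Config], float]
) -> List[Tuple[Config, float]]:
    """Scheduler: run every request one after another on a single node."""
    return [(config, executor(config)) for config in configs]


def tune(bounds, executor, goal, batch=8, max_rounds=50, seed=0):
    """Loop from the tuning diagram: run, check the goal, change configuration."""
    rng = random.Random(seed)
    configs = latin_hypercube(bounds, batch, rng)
    best_config, best_fitness = None, float("-inf")
    for _ in range(max_rounds):
        for config, fitness in serialized_scheduler(configs, executor):
            if fitness > best_fitness:
                best_config, best_fitness = config, fitness
        if goal(best_fitness):  # "Are results good?" -> yes -> tuned
            break
        configs = random_search(bounds, batch, rng)  # no -> change configuration
    return best_config, best_fitness


def toy_executor(config: Config) -> float:
    """Stand-in for Executor + Evaluator: fitness peaks at x=0.3, y=0.7."""
    return 1.0 - (config["x"] - 0.3) ** 2 - (config["y"] - 0.7) ** 2


if __name__ == "__main__":
    bounds = {"x": (0.0, 1.0), "y": (0.0, 1.0)}
    config, fitness = tune(bounds, toy_executor, goal=lambda f: f > 0.95)
    print(config, fitness)
```

In this sketch, moving toward the "smart" and "scalable" configurations would mean replacing `random_search` with a genetic algorithm or stochastic kriging step and replacing `serialized_scheduler` with one that dispatches requests in parallel, while `tune` itself stays unchanged.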