Parallel Tuning of Machine Learning Algorithms, Thesis Proposal
1. Parallel auto-tuning of machine learning algorithms
Gianmario Spacagna
gm.spacagna@gmail.com
16 October 2012
AgilOne, Inc.
1091 N Shoreline Blvd. #250, Mountain View, CA 94043
2. Motivation
• Increase the revenue of cloud service providers → keep the cost curve linear w.r.t. the expected exponential income growth.
• Technically achievable through scalability:
  • Scalability in terms of resources → distributed parallel computing (Hadoop).
  • Scalability in terms of multi-tenancy → the same system running for several customers.
  • Scalability in terms of auto-configuration → avoiding manual tuning operations.
3. Good Work Flow
Good data → ML algorithm → good results!
Tuning (adjusting the configuration) is what is applied to the algorithm.
4. General Tuning Diagram
• Run the algorithm on the test data with configuration X.
• Are the results good?
  • No → change configuration X and run again.
  • Yes → the algorithm is tuned.
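The generic tuning loop above can be sketched in Scala. The names `run`, `nextConf` and `goodEnough` are hypothetical stand-ins for the Executor, Controller and convergence check described later; this is a minimal sketch, not the actual implementation.

```scala
// Minimal sketch of the generic tuning loop: run the algorithm with a
// configuration, check the result, and adjust until it is good enough.
object TuningLoop {
  type Conf = Map[String, Double]

  def tune(initial: Conf,
           run: Conf => Double,              // executes the algorithm, returns a fitness score
           nextConf: (Conf, Double) => Conf, // proposes a new configuration from the last score
           goodEnough: Double => Boolean,    // convergence criterion
           maxIter: Int = 100): Conf = {
    var conf = initial
    var score = run(conf)
    var iter = 0
    while (!goodEnough(score) && iter < maxIter) {
      conf = nextConf(conf, score)
      score = run(conf)
      iter += 1
    }
    conf
  }
}
```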
5. Tuning of Machine Learning Algorithms
• We need tuning when:
  • A new algorithm or a new version is released.
  • We want to improve accuracy and/or performance.
  • A new customer arrives and the system must be customized for the new dataset and requirements.
We need to make it smart, automatic and scalable!
7. Architecture Design
From top to bottom:
• Upper Applications API
• Initializer
• Controller
• Scheduler
• Executors (e.g. ANN, LR, K-Means), each with its own Evaluator and Data Sampler
• Execution back-ends: Local, Hadoop, Cloud Service
8. Upper Applications API
Tasks:
• Interfaces the communication between the system and the upper applications layer.
• Parses requests and results and generates the related output domain objects.
Possible data formats:
• JSON
• STDIN/OUT
9. Initializer
Tasks:
• Generates the initial set of configurations.
Possible implementations:
• Random points
• Latin Hypercube
• Dataset similarity
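The Latin Hypercube option above can be illustrated with a short sketch: for each of `dims` parameters, partition the unit interval into `n` equal strata, draw one point per stratum, and shuffle the strata independently per dimension so the `n` initial configurations cover every parameter's range evenly. The object and method names are hypothetical, and parameters are assumed to be normalized to [0, 1).

```scala
import scala.util.Random

// Sketch of a Latin Hypercube initializer over the unit hypercube.
object LatinHypercube {
  def sample(n: Int, dims: Int, rng: Random = new Random()): Seq[Seq[Double]] = {
    // For each dimension: one random point per stratum, strata in shuffled order.
    val perDim: Seq[Seq[Double]] = (0 until dims).map { _ =>
      rng.shuffle((0 until n).toList).map { stratum =>
        (stratum + rng.nextDouble()) / n // uniform draw inside the stratum
      }
    }
    // The i-th configuration takes the i-th value of every dimension.
    (0 until n).map(i => perDim.map(col => col(i)))
  }
}
```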
10. Controller
Tasks:
• Compares and generates configurations.
• Decides when the tuning has converged.
• Adapts the data sampling requests.
Possible implementations:
• Random search
• Grid search
• Stochastic Kriging
• Genetic Algorithms
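The two simplest controller strategies listed above can be sketched as follows: grid search enumerates the cross product of a fixed set of values per parameter, while random search draws configurations uniformly from each parameter's range. Object and type names are hypothetical placeholders, not the real interface.

```scala
// Sketch of grid-search and random-search configuration generation.
object SimpleControllers {
  type Conf = Map[String, Double]

  // Cross product of the candidate values of every parameter.
  def gridSearch(grid: Map[String, Seq[Double]]): Seq[Conf] =
    grid.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (conf <- acc; v <- values) yield conf + (name -> v)
    }

  // n configurations drawn uniformly from the given [lo, hi) ranges.
  def randomSearch(ranges: Map[String, (Double, Double)], n: Int,
                   rng: scala.util.Random = new scala.util.Random()): Seq[Conf] =
    Seq.fill(n) {
      ranges.map { case (name, (lo, hi)) => name -> (lo + rng.nextDouble() * (hi - lo)) }
    }
}
```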
11. Scheduler
Tasks:
• Checks whether the requests are covered by the available services.
• Schedules and parallelizes the execution of requests.
• Optimizes resources.
• Collects evaluated results.
Possible implementations:
• First available
• Oldest idle
• Load balanced
• Serialized (single node)
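The "first available" strategy above can be sketched with Scala futures: requests are submitted to a shared thread pool, each runs on whichever worker thread becomes free first, and the results are collected once all evaluations finish. This is a minimal local sketch with a hypothetical API, not the scheduler's real interface.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Sketch of a "first available" scheduler on a shared thread pool.
object FirstAvailableScheduler {
  def schedule[A, B](requests: Seq[A])(execute: A => B): Seq[B] = {
    val futures = requests.map(r => Future(execute(r))) // fan out
    Await.result(Future.sequence(futures), 1.minute)    // collect in order
  }
}
```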
12. Executor
Tasks:
• Executes the provided algorithm with the specified configuration.
Sub-components:
• Evaluator: evaluates results according to the specified fitness metrics.
• Data Sampler: down- and up-sampling of data.
Possible implementations:
• Local execution
• Hadoop cluster
• Cloud service
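As an example of the Evaluator's fitness metric for K-Means, a clustering can be scored by its within-cluster sum of squared distances (WCSS, lower is better): each point is assigned to its nearest centroid and the squared distances are summed. This is an illustrative sketch with hypothetical names; the actual metric is configurable.

```scala
// Sketch of a K-Means Evaluator scoring a clustering by WCSS.
object KMeansEvaluator {
  type Point = Seq[Double]

  private def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the nearest centroid.
  def wcss(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => sqDist(p, c)).min).sum
}
```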
13. Tuning Diagram
The same loop as before, annotated with the responsible components:
• Test execution (Scheduler, Executor): run the algorithm on the test data with configuration X.
• Test control (Initializer, Controller): are the results good? If no, change configuration X and run again; if yes, the algorithm is tuned.
14. SUNS: Simple, Unclever and Not Scalable
• API: STDIN/OUT
• Initializer: Random points
• Controller: Random search / Grid search
• Scheduler: Serialized
• Executor: K-Means with Evaluator
• Back-end: Local
15. SNS: Smart but Not Scalable
• API: STDIN/OUT or JSON
• Initializer: Latin Hypercube
• Controller: Genetic Algorithm / Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means with Evaluator
• Back-end: Local
16. VSNS: Very Smart but Not Scalable
• API: STDIN/OUT or JSON
• Initializer: Dataset similarity
• Controller: Genetic Algorithm / Stochastic Kriging
• Scheduler: Serialized
• Executor: K-Means with Evaluator
• Back-end: Local
17. VSS: Very Smart and Scalable
• API: STDIN/OUT or JSON
• Initializer: Dataset similarity
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: First available
• Executor: K-Means with Evaluator
• Back-end: Hadoop
18. VSVSO: Very Smart, Very Scalable and Optimized
• API: STDIN/OUT or JSON
• Initializer: Dataset similarity
• Controller: Genetic Algorithm or Stochastic Kriging
• Scheduler: Load balanced
• Executor: K-Means with Evaluator and Data Sampler
• Back-end: Hadoop
19. Thesis
It is possible to build an intelligent system based on Genetic Algorithms/Stochastic Kriging that automatically selects and tunes machine learning algorithms, such as K-Means and LR, parallelizing the work on a Hadoop cluster to scale in a cost-efficient manner.
20. Project Plan
Order of priorities:
1. Design the entire application in Scala in a testable and expandable way.
2. Implement the Genetic Algorithm or the Stochastic Kriging controller.
3. Implement the Latin Hypercube initializer.
4. Test with local-instance algorithms (K-Means and/or LR).
5. Develop and test at least one algorithm in MapReduce fashion using Hadoop.
6. Test on the real AgilOne server cluster.
7. Implement the Dataset Similarity initializer.
8. Implement the Data Sampler.