Abstract:
Apache Spark’s MLlib is a terrific library for fitting large-scale machine learning models. However, translating high-level problem statements like “learn a classifier” into a working model presently requires significant manual effort (via ad hoc parameter tuning) and computational resources (to fit several models). We present our work on the MLbase optimizer – a system designed on top of Spark to quickly and automatically search through a hyperparameter space and find a good model. By leveraging performance enhancements, better search algorithms, and statistical heuristics, our system offers an order of magnitude speedup over standard methods.
2. Problem:"Scalable"implementa.ons"
difficult"for"ML"Developers…
Meta-Statistics
ML Contract +
Code
ML Key Features
t )JHIMFWFMMBOHVBHFGPSOVNFSJDBM
DPNQVUBUJPO
WJTVBMJ[BUJPO
BOEBQ-
QMJDBUJPOEFWFMPQNFOU
t *OUFSBDUJWFFOWJSPONFOUGPSJUFSBUJWF
FYQMPSBUJPO
EFTJHO
BOEQSPCMFN
TPMWJOH
t .BUIFNBUJDBMGVODUJPOTGPSMJOFBS
BMHFCSB
TUBUJTUJDT
'PVSJFSBOBMZTJT
GJMUFSJOH
PQUJNJ[BUJPO
OVNFSJDBM
JOUFHSBUJPO
BOETPMWJOHPSEJOBSZ
EJGGFSFOUJBMFRVBUJPOT
t #VJMUJOHSBQIJDTGPSWJTVBMJ[JOHEBUB
BOEUPPMTGPSDSFBUJOHDVTUPNQMPUT
t %FWFMPQNFOUUPPMTGPSJNQSPWJOH
DPEFRVBMJUZBOENBJOUBJOBCJMJUZ
BOENBYJNJ[JOHQFSGPSNBODF
t 5PPMTGPSCVJMEJOHBQQMJDBUJPOTXJUI
DVTUPNHSBQIJDBMJOUFSGBDFT
t 'VODUJPOTGPSJOUFHSBUJOH.5-#
CBTFEBMHPSJUINTXJUIFYUFSOBMBQ-
QMJDBUJPOTBOEMBOHVBHFTTVDIBT$
+BWB
/5
BOE.JDSPTPGU®YDFM®
The Language of Technical Computing
MATLAB® is a high-level language and
interactive environment for numerical com-putation,
visualization, and programming.
Using MATLAB, you can analyze data,
develop algorithms, and create models and
applications. The language, tools, and built-in
math functions enable you to explore
multiple approaches and reach a solution
faster than with spreadsheets or traditional
programming languages, such as C/C++ or
Java™.
You can use MATLAB for a range of appli-cations,
including signal processing and
communications, image and video process-ing,
control systems, test and measurement,
computational finance, and computational
biology. More than a million engineers and
scientists in industry and academia use
MATLAB, the language of technical
computing.
MATLAB Overview 2:04
Analyzing and visualizing data using the MATLAB
desktop. The MATLAB environment also lets you write
programs and develop algorithms and applications.
14. SQL Result
PAQ Model
Planner for Large-scale Predictive
Analytic Queries
ML
✦ User*declaraBvely*specifies*task
✦ PAQ*=*PredicBve*AnalyBc*Query
✦ Search*through*MLlib*to*find*the*
best*model/pipeline
the develop-ment
enabled a wide
recommen-dations,
speech driven in-terfaces.
supported by
SELECT e.sender, e.subject, e.message
FROM Emails e
WHERE e.user = ’Bob’
AND PREDICT(e.spam, e.message) = false GIVEN
LabeledData
15. A*Standard*ML*Pipeline
!
Data
Feature*
ExtracBon
Model*
Training
Final**
Model
✦ In*pracBce,*model*building*is*an*iteraBve*process*of*conBnuous*
refinement
✦ Our*grand*vision*is*to*automate*the*construcBon*of*these*
pipelines
31. Pu^ng*It*All*Together
●●●●●●●●●●●●●●●●
●●●●
●●●●
●●●●
●●●●
●●●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●
●●●●●
●●
●●●●●●●●●●●●
●●●●●● ●●●●●●●
●●●
●
●
●●
●●●●●●●●●●●●●●●●● ●●●●●●●●●
0.75
0.50
0.25
0 200 400 600 800
Time elapsed (m)
Best Validation Error Seen So Far
Search Method
●
●
●
Grid − Unoptimized
Random − Optimized
TPE − Optimized
Model Convergence Over Time
✦ First*version*of*MLbase*opBmizer
✦ 30GB*dense*images*(240K*x*16K)
✦ 2*model*families,*5*hyperparams
✦ Baseline:*grid*search
✦ Our*method:*combinaBon*of*
✦ Batching*
✦ Early*stopping*/*Bandit*
✦ Random*or*TPE*
20xspeedupcompared'to'grid'search'
15'minutes'vs'5'hours!'
32. Does*It*Scale?
✦ 1.5TB*dataset*(1.2M*x*160K)
✦ 128*nodes,*thousands*of*passes*
over*data
✦ Tried*32*models*in*15*hours*
✦ Good*answer*aler*11*hours
●●●●● ●●● ●●● ●●
●●
● ● ●
0.75
0.50
0.25
5 10
Time elapsed (h)
Best Validation Error Seen So Far
Convergence of Model Accuracy on 1.5TB Dataset
33. Future*Work
!
Data
Feature*
ExtracBon
Model*
Training
Final**
Model
AutomatedMLPipelineConstruc.on
34. Data Image!
Parser Normalizer Convolver
Pooler
sqrt,mean
Zipper
Linear
Solver
Symmetric!
Rectifier ident,abs
ident,mean
Global
Patch!
Extractor
Patch
Whitener
KMeans!
Clusterer
Feature Extractor
Label!
Extractor
Test!
Feature
Data
Linear!
Extractor
Mapper Model Label!
Extractor
Test
Error
Error!
Computer
No Hyperparameters!
A few Hyperparameters!
Lotsa Hyperparameters
Slide courtesy of Evan Sparks