Ameet Talwalkar, assistant professor of Computer Science, UCLA at MLconf SF

Abstract:
Apache Spark’s MLlib is a terrific library for fitting large-scale machine learning models. However, translating high-level problem statements like “learn a classifier” into a working model presently requires significant manual effort (via ad hoc parameter tuning) and computational resources (to fit several models). We present our work on the MLbase optimizer – a system designed on top of Spark to quickly and automatically search through a hyperparameter space and find a good model. By leveraging performance enhancements, better search algorithms, and statistical heuristics, our system offers an order of magnitude speedup over standard methods.

Published in: Technology

  1. 1. Towards an Optimizer for MLbase. Collaborators: Evan Sparks, Michael Franklin, Michael I. Jordan, Tim Kraska. UC Berkeley. Ameet Talwalkar
  2. 2. Problem: Scalable implementations difficult for ML Developers… Meta-Statistics ML Contract + Code ML [Figure: excerpt from the MATLAB product overview, "The Language of Technical Computing"]
  3. 3. Problem: Scalable implementations difficult for ML Developers… Meta-Statistics ML Contract + Code ML
  4. 4. Problem: Scalable implementations difficult for ML Developers… Meta-Statistics ML Contract + Code ML CHALLENGE: Can we simplify distributed ML development?
  5. 5. Problem: ML is difficult for End Users… Too many ways to preprocess… Too many knobs… Difficult to debug… Doesn’t scale… Too many algorithms…
  6. 6. Problem: ML is difficult for End Users… Too many ways to preprocess… Too many knobs… Difficult to debug… Doesn’t scale… Too many algorithms… CHALLENGE: Can we automate ML pipeline construction?
  7. 7. MLbase. Stack: MLOpt / MLI / MLlib / Apache Spark. MLbase aims to simplify development and deployment of scalable ML pipelines. Spark: Cluster computing system designed for iterative computation. MLlib: Spark’s core ML library. MLI: API to simplify ML development. MLOpt: Declarative layer to automate hyperparameter tuning
  8. 8. MLbase. Stack: MLOpt / MLI / MLlib / Apache Spark. MLbase aims to simplify development and deployment of scalable ML pipelines. MLOpt and MLI are experimental testbeds. Spark: Cluster computing system designed for iterative computation. MLlib: Spark’s core ML library. MLI: API to simplify ML development. MLOpt: Declarative layer to automate hyperparameter tuning
  9. 9. MLlib + Scalable and fast + Simple development environment + Part of Spark’s robust ecosystem. Components on Apache Spark: SparkSQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
  10. 10. Active Development. Initial Release • Developed by MLbase team in AMPLab (11 contributors) • Scala, Java • Shipped with Spark v0.8 (Sep 2013). 15 months later… • 60+ contributors from various organizations • Scala, Java, Python • Many more algorithms and ML utilities • Improved documentation / code examples, API stability • Latest release part of Spark v1.1 (Sept 2014)
  11. 11. MLlib, MLI and Roadmap • MLI: Shield ML Developers from low-level details • Provide familiar mathematical operators in distributed setting (tables, matrices, optimization primitives) • Standard APIs defining ML algorithms and feature extractors • MLlib v1.2 will include ML API inspired by MLI • Longer term for MLlib • Scalable implementations of standard ML algorithms and underlying optimization primitives • Further support for ML pipeline development (including hyperparameter tuning using ideas from MLOpt)
  12. 12. MLlib, MLI and Roadmap • MLI: Shield ML Developers from low-level details • Provide familiar mathematical operators in distributed setting (tables, matrices, optimization primitives) • Standard APIs defining ML algorithms and feature extractors • MLlib v1.2 will include ML API inspired by MLI • Longer term for MLlib • Scalable implementations of standard ML algorithms and underlying optimization primitives • Further support for ML pipeline development (including hyperparameter tuning using ideas from MLOpt). Feedback and Contributions Encouraged!
  13. 13. Vision MLOpt
  14. 14. Planner for Large-scale Predictive Analytic Queries. Diagram labels: SQL, Result, PAQ, Model. ✦ User declaratively specifies task ✦ PAQ = Predictive Analytic Query ✦ Search through MLlib to find the best model/pipeline. Example query: SELECT e.sender, e.subject, e.message FROM Emails e WHERE e.user = 'Bob' AND PREDICT(e.spam, e.message) = false GIVEN LabeledData
  15. 15. A Standard ML Pipeline: Data → Feature Extraction → Model Training → Final Model ✦ In practice, model building is an iterative process of continuous refinement ✦ Our grand vision is to automate the construction of these pipelines
  16. 16. Training A Model ✦ For each point in dataset ✦ compute gradient ✦ update model ✦ repeat until converged ✦ Requires multiple passes ✦ Common access pattern ✦ Naive Bayes, Trees, etc. ✦ Minutes to train an SVM on 200GB of data on a 16-node cluster
  17. 17. The Tricky Part ✦ Algorithms ✦ Logistic Regression, SVM, Tree-based, etc. ✦ Algorithm hyper-parameters ✦ Learning Rate, Regularization, etc. ✦ Featurization ✦ Text: n-grams, TF-IDF ✦ Images: Gabor filters, random convolutions ✦ Random projection? Scaling? [Diagram labels: Algorithms, Hyperparameters, Featurization]
  18. 18. A Standard ML Pipeline: Data → Feature Extraction → Model Training → Final Model ✦ In practice, model building is an iterative process of continuous refinement ✦ Our grand vision is to automate the construction of these pipelines ✦ Start with one aspect of the pipeline - model selection. Automated Model Selection
  19. 19. One Approach ✦ Try it all! ✦ Search over all hyperparameters, algorithms, features, etc. ✦ Drawbacks ✦ Expensive to compute models ✦ Hyperparameter space is large ✦ Some version of this still often done in practice! [Plot: Learning Rate vs. Regularization, best answer marked]
  20. 20. A Better Approach ✦ Better resource utilization ✦ through batching ✦ Algorithmic Speedups ✦ via early stopping / bandits ✦ Improved Search ✦ e.g., via randomization [Plot: Learning Rate vs. Regularization, best answer marked]
  21. 21. A Tale Of 3 Optimizations: Better Resource Utilization, Algorithmic Speedups, Improved Search
  22. 22. Better Resource Utilization ✦ Typical model update requires 2-4 flops/double ✦ But modern memory much slower than processors ✦ We can do 25 flops / double read! ✦ This equates to 6-8 model updates per double we read, assuming models fit in cache ✦ Train multiple models simultaneously
  23. 23. What Do We See In Spark? ✦ 2x and 5x increase in models trained/sec with batching ✦ Overhead from virtualization, network, etc.
  24. 24. What Do We See In Spark? ✦ These numbers are with vector-matrix multiplies ✦ Can do better when rewriting in terms of matrix-matrix multiplies
  25. 25. A Tale Of 3 Optimizations: Better Resource Utilization, Algorithmic Speedups, Improved Search
  26. 26. Bandit Search ✦ Each point is a trained model ✦ Some models look bad early ✦ So we give up early! ✦ Can frame this as a multi-armed bandit problem ✦ Each model is an arm [Plot: Learning Rate vs. Regularization, best answer marked]
  27. 27. Bandit Search ✦ Each point is a trained model ✦ Some models look bad early ✦ So we give up early! ✦ Can frame this as a multi-armed bandit problem ✦ Each model is an arm
  28. 28. A Tale Of 3 Optimizations: Better Resource Utilization, Algorithmic Speedups, Improved Search
  29. 29. What Search Method? [Figure: Comparison of Search Methods Across Learning Problems. Columns: method (GRID, NELDER_MEAD, POWELL, RANDOM, SMAC, SPEARMINT, TPE) and maximum calls (16, 81, 256, 625); rows: dataset (australian, breast, diabetes, fourclass, splice); y-axis: validation error, 0.0 to 0.5] ✦ Various derivative-free optimization techniques ✦ Simple ones (Grid, Random) ✦ Classic Derivative-Free (Nelder-Mead, Powell’s method) ✦ Bayesian (SMAC, TPE, Spearmint) ✦ What should we do? ✦ Tried on 5 datasets, optimized over 4 hyperparameters!
  30. 30. What Search Method? [Figure: Comparison of Search Methods Across Learning Problems. Columns: method (GRID, NELDER_MEAD, POWELL, RANDOM, SMAC, SPEARMINT, TPE) and maximum calls (16, 81, 256, 625); rows: dataset (australian, breast, diabetes, fourclass, splice); y-axis: validation error, 0.0 to 0.5]
  31. 31. Putting It All Together [Figure: Model Convergence Over Time. x-axis: time elapsed (m), 0 to 800; y-axis: best validation error seen so far; series: Grid - Unoptimized, Random - Optimized, TPE - Optimized] ✦ First version of MLbase optimizer ✦ 30GB dense images (240K x 16K) ✦ 2 model families, 5 hyperparams ✦ Baseline: grid search ✦ Our method: combination of ✦ Batching ✦ Early stopping / Bandit ✦ Random or TPE. 20x speedup compared to grid search: 15 minutes vs 5 hours!
  32. 32. Does It Scale? ✦ 1.5TB dataset (1.2M x 160K) ✦ 128 nodes, thousands of passes over data ✦ Tried 32 models in 15 hours ✦ Good answer after 11 hours [Figure: Convergence of Model Accuracy on 1.5TB Dataset. x-axis: time elapsed (h); y-axis: best validation error seen so far]
  33. 33. Future Work: Data → Feature Extraction → Model Training → Final Model. Automated ML Pipeline Construction
  34. 34. [Figure: pipeline diagram. Nodes: Data, Image Parser, Normalizer, Convolver, Symmetric Rectifier, Pooler, Zipper, Linear Solver, Global Patch Extractor, Patch Whitener, KMeans Clusterer, Feature Extractor, Label Extractor, Test Data, Linear Mapper, Model, Error Computer, Test Error. Parameter labels: sqrt,mean; ident,abs; ident,mean. Legend: No Hyperparameters, A few Hyperparameters, Lotsa Hyperparameters] Slide courtesy of Evan Sparks
  35. 35. Other Future Work ✦ Ensembling ✦ Leverage sampling ✦ Better parallelism for smaller datasets ✦ Multiple hypothesis testing issues
  36. 36. MLOpt: Declarative layer that aims to automate ML pipeline construction. MLI: API to simplify ML development. MLlib: Spark’s core ML library. Spark: Cluster computing system designed for iterative computation. Stack: MLOpt / MLI / MLlib / Apache Spark. THANKS! QUESTIONS? www.mlbase.org
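
Slide 16 describes training as a loop: for each point, compute a gradient, update the model, and repeat until converged, which takes multiple passes over the data. A minimal sketch of that loop follows, assuming a one-parameter linear model with squared loss on a tiny noise-free dataset. The function name `train`, the tolerance, and the data are illustrative assumptions, not part of MLlib or MLbase.

```python
def train(data, learning_rate=0.1, tolerance=1e-6, max_passes=1000):
    """Stochastic gradient descent for y ~ w * x (illustrative only)."""
    w = 0.0
    for passes in range(1, max_passes + 1):
        previous = w
        for x, y in data:                 # for each point in the dataset
            gradient = (w * x - y) * x    # compute gradient of squared loss
            w -= learning_rate * gradient # update model
        if abs(w - previous) < tolerance: # repeat until converged
            break
    return w, passes


# Hypothetical noise-free data generated from y = 2x.
data = [(k / 10.0, 2.0 * (k / 10.0)) for k in range(-10, 11)]
w, passes = train(data)
print(f"w = {w:.4f} after {passes} passes")
```

Even on this toy problem the loop needs more than one pass to converge, which is why the cost of each pass over the data matters at scale.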
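
Slides 19 and 29 contrast exhaustive grid search with alternatives such as random search. The call budgets in the slide 29 figure (16, 81, 256, 625) equal 2^4, 3^4, 4^4 and 5^4, which would be consistent with a grid of 2 to 5 values per axis over 4 hyperparameters, though the slides do not state this. The sketch below compares the two simplest methods at equal budget over two hyperparameters. `validation_error` is a hypothetical stand-in for "train a model and measure its validation error", and the search ranges are assumptions.

```python
import itertools
import math
import random


def validation_error(learning_rate, regularization):
    """Hypothetical stand-in for training a model and scoring it."""
    return ((math.log10(learning_rate) + 1.2) ** 2
            + 0.1 * (math.log10(regularization) + 3.4) ** 2)


def grid_search(lr_values, reg_values):
    """Evaluate every combination on the grid and keep the best one."""
    candidates = itertools.product(lr_values, reg_values)
    return min(candidates, key=lambda c: validation_error(*c))


def random_search(budget, seed=0):
    """Sample each hyperparameter log-uniformly and keep the best draw."""
    rng = random.Random(seed)
    candidates = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-6, 0))
                  for _ in range(budget)]
    return min(candidates, key=lambda c: validation_error(*c))


lr_grid = [1e-4, 1e-3, 1e-2, 1e-1]
reg_grid = [1e-6, 1e-4, 1e-2, 1.0]
best_grid = grid_search(lr_grid, reg_grid)   # 4 x 4 = 16 evaluations
best_random = random_search(budget=16)       # same budget of 16 evaluations
print("grid:", best_grid, "random:", best_random)
```

Grid search can only return points that lie on the grid, while random search covers the same ranges without tying the budget to a fixed number of values per axis.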
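
Slide 22 argues that memory reads, not arithmetic, limit training speed, so several models should be trained on each value read. The sketch below illustrates only the access pattern, in plain Python on a single machine: each data point is read once per pass and used to update every model in the batch. It does not reproduce the Spark or matrix-multiply implementation mentioned on slides 23 and 24, and the names `make_data` and `train_batched` are hypothetical.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic 1-D regression data, y = 3x + noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


def train_batched(data, learning_rates, passes=20):
    """Train one linear model per learning rate, sharing each data read."""
    weights = [0.0] * len(learning_rates)
    reads = 0
    for _ in range(passes):
        for x, y in data:
            reads += 1  # one read serves every model in the batch
            for k, lr in enumerate(learning_rates):
                grad = (weights[k] * x - y) * x  # squared-loss gradient
                weights[k] -= lr * grad
    return weights, reads


data = make_data()
weights, reads = train_batched(data, learning_rates=[0.01, 0.1, 0.5])
print(weights, reads)
```

Here three models are trained with 200 x 20 = 4,000 data reads; training them one after another would read the data 12,000 times.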
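
Slides 26 and 27 frame early stopping as a multi-armed bandit problem: each candidate model is an arm, and arms that look bad early are abandoned. The slides do not give the elimination rule, so the sketch below uses a simple assumed heuristic: after each training pass, drop any configuration whose validation error exceeds `slack` times the best error seen in that round. All names, the `slack` value, and the data are illustrative assumptions.

```python
import random


def make_data(n, seed):
    """Synthetic 1-D regression data, y = 3x + noise (illustrative only)."""
    rng = random.Random(seed)
    return [(x, 3.0 * x + rng.gauss(0.0, 0.1))
            for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]


def sgd_pass(w, data, lr):
    """One pass of stochastic gradient descent for y ~ w * x."""
    for x, y in data:
        w -= lr * (w * x - y) * x
    return w


def val_error(w, data):
    """Mean squared error of the model on held-out data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def bandit_search(learning_rates, train, val, rounds=6, slack=1.5):
    """Train all arms in rounds, dropping arms that lag the current best."""
    arms = {lr: 0.0 for lr in learning_rates}  # learning rate -> weight
    passes_used = 0
    for _ in range(rounds):
        for lr in arms:
            arms[lr] = sgd_pass(arms[lr], train, lr)
            passes_used += 1
        errors = {lr: val_error(w, val) for lr, w in arms.items()}
        best = min(errors.values())
        # Assumed rule: keep only arms within `slack` times the best error.
        arms = {lr: w for lr, w in arms.items() if errors[lr] <= slack * best}
    best_lr = min(arms, key=lambda lr: val_error(arms[lr], val))
    return best_lr, passes_used


train_set = make_data(200, seed=1)
val_set = make_data(100, seed=2)
best_lr, passes_used = bandit_search([0.0001, 0.001, 0.1, 0.5],
                                     train_set, val_set)
print(best_lr, passes_used)
```

Training every configuration for all six rounds would cost 24 passes; dropping the slow learners after the first round spends the remaining passes only on configurations that still look promising.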