MLlib - Spark Meetup (8/6/13)

Transcript

  • 1. MLbase: Evan Sparks and Ameet Talwalkar, UC Berkeley
  • 2. Three Converging Trends
  • 3. Three Converging Trends: Big Data
  • 4. Three Converging Trends: Big Data, Distributed Computing
  • 5. Three Converging Trends: Big Data, Distributed Computing, Machine Learning
  • 6. Three Converging Trends: Big Data, Distributed Computing, Machine Learning -> MLbase
  • 7. Vision, MLlib, MLI, ML Optimizer, Release Plan
  • 8. Problem: Scalable implementations are difficult for ML Developers… [diagram: ML Developer supplies ML Contract + Code to the system]
  • 9. Problem: Scalable implementations are difficult for ML Developers…
  • 10. Problem: Scalable implementations are difficult for ML Developers…
  • 11. Problem: ML is difficult for End Users… Too many algorithms…
  • 12. Problem: ML is difficult for End Users… Too many algorithms… Too many knobs…
  • 13. Problem: ML is difficult for End Users… Too many algorithms… Too many knobs… Difficult to debug…
  • 14. Problem: ML is difficult for End Users… Too many algorithms… Too many knobs… Difficult to debug… Doesn't scale…
  • 15. Problem: ML is difficult for End Users… Too many algorithms… Too many knobs… Difficult to debug… Doesn't scale… Reliable, Fast, Accurate, Provable
  • 16. MLbase: ML Experts + Systems Experts
  • 17. MLbase: ML Experts + Systems Experts. 1. Easy scalable ML development (ML Developers) 2. User-friendly ML at scale (End Users)
  • 18. MLbase: ML Experts + Systems Experts. 1. Easy scalable ML development (ML Developers) 2. User-friendly ML at scale (End Users). Along the way, we gain insight into data intensive computing.
  • 19. Matlab Stack
  • 20. Matlab Stack (Single Machine)
  • 21. Matlab Stack (Single Machine) ✦ Lapack: low-level Fortran linear algebra library
  • 22. Matlab Stack (Single Machine) ✦ Lapack: low-level Fortran linear algebra library ✦ Matlab Interface: higher-level abstractions for data access / processing; more extensive functionality than Lapack; leverages Lapack whenever possible
  • 23. Matlab Stack (Single Machine) ✦ Lapack: low-level Fortran linear algebra library ✦ Matlab Interface: higher-level abstractions for data access / processing; more extensive functionality than Lapack; leverages Lapack whenever possible ✦ Similar stories for R and Python
  • 24. MLbase Stack
  • 25. MLbase Stack: Runtime(s)
  • 26. MLbase Stack ✦ Spark: cluster computing system designed for iterative computation
  • 27. MLbase Stack ✦ Spark: cluster computing system designed for iterative computation ✦ MLlib: low-level ML library in Spark; callable from Scala, Java
  • 28. MLbase Stack ✦ Spark: cluster computing system designed for iterative computation ✦ MLlib: low-level ML library in Spark; callable from Scala, Java ✦ MLI: API / platform for feature extraction and algorithm development; includes higher-level functionality with a faster dev cycle than MLlib
  • 29. MLbase Stack ✦ Spark: cluster computing system designed for iterative computation ✦ MLlib: low-level ML library in Spark; callable from Scala, Java ✦ MLI: API / platform for feature extraction and algorithm development; includes higher-level functionality with a faster dev cycle than MLlib ✦ ML Optimizer: automates model selection; solves a search problem over feature extractors and algorithms in MLI
  • 30. MLbase Stack Status [architecture diagram: the End User issues a declarative ML task to the master server (parser, optimizer, executor/monitoring, ML library) and receives a result (e.g., fn-model & summary); ML Developers contribute ML contracts + code; plans execute on master and slave runtimes]
  • 31. MLbase Stack Status: Spark, MLlib, MLI, ML Optimizer
  • 32. MLbase Stack Status. Goal 1: Summer Release
  • 33. MLbase Stack Status. Goal 1: Summer Release; Goal 2: Winter Release
  • 34. Example: MLlib
  • 35. Example: MLlib ✦ Goal: classification of a text file
  • 36. Example: MLlib ✦ Goal: classification of a text file ✦ Featurize data manually
  • 37. Example: MLlib ✦ Goal: classification of a text file ✦ Featurize data manually ✦ Calls MLlib's LR function. Code on slide:

    def main(args: Array[String]) {
      val sc = new SparkContext("local", "SparkLR")

      // Load data from HDFS
      val data = sc.textFile(args(0)) // RDD[String]

      // User is responsible for formatting/featurizing/normalizing their RDD!
      val featurizedData: RDD[(Double, Array[Double])] = processData(data)

      // Train the model using MLlib.
      val model = new LogisticRegressionLocalRandomSGD()
        .setStepSize(0.1)
        .setNumIterations(50)
        .train(featurizedData)
    }
  • 38. Example: MLI
  • 39. Example: MLI ✦ Use built-in feature extraction functionality
  • 40. Example: MLI ✦ Use built-in feature extraction functionality ✦ MLI Logistic Regression leverages MLlib
  • 41. Example: MLI ✦ Use built-in feature extraction functionality ✦ MLI Logistic Regression leverages MLlib ✦ Extensions: embed in a cross-validation routine (a sketch follows after the code); use different feature extractors / algorithms or write new ones. Code on slide:

    def main(args: Array[String]) {
      val mc = new MLContext("local", "MLILR")

      // Read in file from HDFS
      val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

      // Run feature extraction
      val classes = rawTextTable(??, "class")
      val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
      val featurizedTable = classes.zip(ngrams)

      // Classify the data using Logistic Regression.
      val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
    }
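
To illustrate the "embed in a cross-validation routine" extension, here is a minimal k-fold cross-validation sketch in plain Scala; trainAndScore is a hypothetical stand-in for training a model on one split and scoring it on the held-out fold, not an MLI API.

    object CrossValidationSketch {
      // Generic k-fold cross validation over an in-memory dataset.
      // trainAndScore(train, test) returns a score (e.g., accuracy) for one fold.
      def kFold[T](data: Seq[T], k: Int)(trainAndScore: (Seq[T], Seq[T]) => Double): Double = {
        require(k > 1 && data.nonEmpty, "need at least 2 folds and a non-empty dataset")
        // Assign each example to a fold by index modulo k.
        val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
        val scores = folds.indices.map { i =>
          val test  = folds(i)
          val train = folds.indices.filter(_ != i).flatMap(folds)
          trainAndScore(train, test)
        }
        scores.sum / scores.length // mean held-out score across folds
      }
    }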
  • 42. Example: ML Optimizer ✦ User declaratively specifies the task ✦ ML Optimizer searches through MLI. Code on slide:

    var X = load("text_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)
  • 43. Vision, MLlib, MLI, ML Optimizer, Release Plan
  • 44. Lay of the Land: Ease of use vs. Performance, Scalability
  • 45. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R
  • 46. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R; Mahout
  • 47. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R; Mahout; GraphLab, VW
  • 48. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R; Mahout; GraphLab, VW; MLlib
  • 49. MLlib ✦ Classification: Logistic Regression, Linear SVM (+L1, L2) ✦ Regression: Linear Regression (+Lasso, Ridge) ✦ Collaborative Filtering: Alternating Least Squares ✦ Clustering: K-Means ✦ Optimization Primitives: SGD, Parallel Gradient
  • 50. MLlib ✦ Classification: Logistic Regression, Linear SVM (+L1, L2) ✦ Regression: Linear Regression (+Lasso, Ridge) ✦ Collaborative Filtering: Alternating Least Squares ✦ Clustering: K-Means ✦ Optimization Primitives: SGD, Parallel Gradient ✦ Included within the Spark codebase (unlike Mahout/Hadoop); part of the Spark 0.8 release; continued support via the Spark project (a usage sketch follows below)
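
As a usage sketch, here is a minimal, self-contained program in the style of the MLlib example slide earlier in this deck. The synthetic data is made up, and the import paths are assumptions: the class name LogisticRegressionLocalRandomSGD and its builder methods are taken from the example slide and were renamed in later Spark releases, so treat the package locations as placeholders rather than a definitive API.

    import org.apache.spark.SparkContext                                            // package assumed for the 0.8-era codebase
    import org.apache.spark.mllib.classification.LogisticRegressionLocalRandomSGD  // class name from the slide; package assumed

    object MLlibLRSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "MLlibLRSketch")

        // Tiny synthetic dataset in the shape the example slide uses:
        // RDD[(Double, Array[Double])] = (label, features).
        val featurizedData = sc.parallelize(Seq(
          (1.0, Array(1.2, 0.7)),
          (0.0, Array(-0.4, 0.1)),
          (1.0, Array(0.9, 1.5)),
          (0.0, Array(-1.1, -0.3))))

        // Train with the builder-style API shown on the example slide.
        val model = new LogisticRegressionLocalRandomSGD()
          .setStepSize(0.1)
          .setNumIterations(50)
          .train(featurizedData)

        sc.stop()
      }
    }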
  • 51. MLlib Performance
  • 52. MLlib Performance ✦ Walltime: elapsed time to execute a task
  • 53. MLlib Performance ✦ Walltime: elapsed time to execute a task ✦ Weak scaling: fix the problem size per processor; ideally, constant walltime as we grow the cluster
  • 54. MLlib Performance ✦ Walltime: elapsed time to execute a task ✦ Weak scaling: fix the problem size per processor; ideally, constant walltime as we grow the cluster ✦ Strong scaling: fix the total problem size; ideally, linear speedup as we grow the cluster
  • 55. MLlib Performance ✦ Walltime: elapsed time to execute a task ✦ Weak scaling: fix the problem size per processor; ideally, constant walltime as we grow the cluster ✦ Strong scaling: fix the total problem size; ideally, linear speedup as we grow the cluster ✦ EC2 experiments: m2.4xlarge instances, up to 32-machine clusters (a small example of these two metrics follows below)
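
As a quick illustration of these two metrics, the sketch below computes weak-scaling efficiency and strong-scaling speedup from hypothetical walltime measurements; all numbers are made up for illustration.

    object ScalingMetrics {
      // Weak scaling: problem size grows with the cluster, so ideal walltime is flat.
      // Efficiency = t(1 machine) / t(n machines); 1.0 is ideal.
      def weakScalingEfficiency(t1: Double, tN: Double): Double = t1 / tN

      // Strong scaling: problem size is fixed, so ideal speedup on n machines is n.
      // Speedup = t(1 machine) / t(n machines).
      def strongScalingSpeedup(t1: Double, tN: Double): Double = t1 / tN

      def main(args: Array[String]): Unit = {
        // Hypothetical walltimes (seconds) on 1 and 16 machines.
        val weak   = weakScalingEfficiency(t1 = 300.0, tN = 360.0) // ~0.83: close to flat
        val strong = strongScalingSpeedup(t1 = 300.0, tN = 25.0)   // 12x on 16 machines
        println(f"weak-scaling efficiency = $weak%.2f")
        println(f"strong-scaling speedup  = $strong%.1f (ideal: 16.0)")
      }
    }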
  • 56. Logistic Regression - Weak Scaling
  • 57. Logistic Regression - Weak Scaling ✦ Full dataset: 200K images, 160K dense features
  • 58. Logistic Regression - Weak Scaling ✦ Full dataset: 200K images, 160K dense features ✦ Similar weak scaling [figure: relative walltime vs. # machines for MLlib, VW, and Ideal (weak scaling for logistic regression)]
  • 59. Logistic Regression - Weak Scaling ✦ Full dataset: 200K images, 160K dense features ✦ Similar weak scaling ✦ MLlib within a factor of 2 of VW's walltime [figures: weak-scaling relative walltime vs. # machines for MLlib, VW, Ideal; absolute walltime for MLlib, VW, Matlab at n = 6K to 200K, d = 160K]
  • 60. Logistic Regression - Strong Scaling
  • 61. Logistic Regression - Strong Scaling ✦ Fixed dataset: 50K images, 160K dense features
  • 62. Logistic Regression - Strong Scaling ✦ Fixed dataset: 50K images, 160K dense features ✦ MLlib exhibits better scaling properties [figure: speedup vs. # machines for MLlib, VW, and Ideal (strong scaling for logistic regression)]
  • 63. Logistic Regression - Strong Scaling ✦ Fixed dataset: 50K images, 160K dense features ✦ MLlib exhibits better scaling properties ✦ MLlib faster than VW with 16 and 32 machines [figures: strong-scaling speedup vs. # machines for MLlib, VW, Ideal; walltime for MLlib, VW, Matlab on 1 to 32 machines]
  • 64. ALS - Walltime
  • 65. ALS - Walltime ✦ Dataset: scaled version of Netflix data (9X in size) ✦ Cluster: 9 machines
  • 66. ALS - Walltime ✦ Dataset: scaled version of Netflix data (9X in size) ✦ Cluster: 9 machines. Walltime (seconds): Matlab 15443, Mahout 4206, GraphLab 291, MLlib 481
  • 67. ALS - Walltime ✦ Dataset: scaled version of Netflix data (9X in size) ✦ Cluster: 9 machines. Walltime (seconds): Matlab 15443, Mahout 4206, GraphLab 291, MLlib 481
  • 68. ALS - Walltime ✦ Dataset: scaled version of Netflix data (9X in size) ✦ Cluster: 9 machines ✦ MLlib an order of magnitude faster than Mahout ✦ MLlib within a factor of 2 of GraphLab. Walltime (seconds): Matlab 15443, Mahout 4206, GraphLab 291, MLlib 481
  • 69. Deployment Considerations
  • 70. Deployment Considerations ✦ Vowpal Wabbit, GraphLab: data preparation specific to each program; non-trivial setup on a cluster; no fault tolerance
  • 71. Deployment Considerations ✦ Vowpal Wabbit, GraphLab: data preparation specific to each program; non-trivial setup on a cluster; no fault tolerance ✦ MLlib: reads files from HDFS; launch/compile/run on a cluster with a few commands; RDDs are fault tolerant
  • 72. Vision, MLlib, MLI, ML Optimizer, Release Plan
  • 73. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R; Mahout; GraphLab, VW; MLlib
  • 74. Lay of the Land: Ease of use vs. Performance, Scalability. Matlab, R; Mahout; GraphLab, VW; MLlib; MLI
  • 75. Current Options
  • 76. Current Options: + Easy (resembles math, limited / no setup cost); + Sufficient for prototyping / writing papers; - Ad-hoc, non-scalable scripts; - Loss of translation upon re-implementation
  • 77. Current Options: + Easy (resembles math, limited / no setup cost); + Sufficient for prototyping / writing papers; - Ad-hoc, non-scalable scripts; - Loss of translation upon re-implementation
  • 78. Current Options: + Easy (resembles math, limited / no setup cost); + Sufficient for prototyping / writing papers; - Ad-hoc, non-scalable scripts; - Loss of translation upon re-implementation; + Scalable and (sometimes) fast; + Existing open-source library of ML algorithms; - Difficult to set up, extend
  • 79. Examples: ML Developer Code
  • 80. Examples: ML Developer Code ✦ 'Distributed' Divide-Factor-Combine (DFC): initial studies in MATLAB (not distributed); distributed prototype involving compiled MATLAB
  • 81. Examples: ML Developer Code ✦ 'Distributed' Divide-Factor-Combine (DFC): initial studies in MATLAB (not distributed); distributed prototype involving compiled MATLAB ✦ Mahout ALS with Early Stopping: in theory, a simple if-statement (3 lines of code)
  • 82. Examples: ML Developer Code ✦ 'Distributed' Divide-Factor-Combine (DFC): initial studies in MATLAB (not distributed); distributed prototype involving compiled MATLAB ✦ Mahout ALS with Early Stopping: in theory, a simple if-statement (3 lines of code); in practice, sift through 7 files, nearly 1K lines of code
  • 83. Insight: Programming Abstractions
  • 84. Insight: Programming Abstractions ✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting ✦ ML Developer API (MLI)
  • 85. Insight: Programming Abstractions ✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting ✦ ML Developer API (MLI) ✦ Table computation: MLTable ✦ Linear algebra: MLSubMatrix ✦ Optimization primitives: MLSolve
  • 86. Insight: Programming Abstractions ✦ Shield ML Developers from low-level details ✦ ML Developer API (MLI): MLTable, MLSubMatrix, MLSolve ✦ MLI examples: DFC in ~50 lines of code
  • 87. Insight: Programming Abstractions ✦ Shield ML Developers from low-level details ✦ ML Developer API (MLI): MLTable, MLSubMatrix, MLSolve ✦ MLI examples: DFC in ~50 lines of code; ALS with early stopping in 3 lines, < 40 lines total (a sketch of such an early-stopping check follows below)
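
To make the "early stopping in 3 lines" point concrete, here is a minimal sketch of what such a check looks like inside an iterative ALS-style loop. updateFactors, trainingError, and the tolerance are hypothetical stand-ins for one factor-update sweep and an error computation, not MLI APIs.

    object EarlyStoppingSketch {
      // updateFactors: one ALS sweep over user and item factors (hypothetical).
      // trainingError: reconstruction error on the observed ratings (hypothetical).
      def fit(maxIter: Int, tol: Double,
              updateFactors: () => Unit,
              trainingError: () => Double): Int = {
        var prevErr = Double.MaxValue
        var iter = 0
        var converged = false
        while (iter < maxIter && !converged) {
          updateFactors()
          val err = trainingError()
          // The "3 lines" of early stopping: stop once the error stops improving.
          if (prevErr - err < tol) converged = true
          prevErr = err
          iter += 1
        }
        iter // number of sweeps actually run
      }
    }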
  • 88. Lines of Code
  • 89. Lines of Code. Logistic Regression: Matlab 11, Vowpal Wabbit 721, MLI 55. Alternating Least Squares: Matlab 20, Mahout 865, GraphLab 383, MLI 32
  • 90. Lines of Code. Logistic Regression: Matlab 11, Vowpal Wabbit 721, MLI 55. Alternating Least Squares: Matlab 20, Mahout 865, GraphLab 383, MLI 32
  • 91. Lines of Code. Logistic Regression: Matlab 11, Vowpal Wabbit 721, MLI 55. Alternating Least Squares: Matlab 20, Mahout 865, GraphLab 383, MLI 32
  • 92. MLI Details
  • 93. MLI Details. OLD: val x: RDD[Array[Double]]
  • 94. MLI Details. OLD: val x: RDD[Array[Double]]; val x: RDD[spark.util.Vector]
  • 95. MLI Details. OLD: val x: RDD[Array[Double]]; val x: RDD[spark.util.Vector]; val x: RDD[breeze.linalg.Vector]
  • 96. MLI Details. OLD: val x: RDD[Array[Double]]; val x: RDD[spark.util.Vector]; val x: RDD[breeze.linalg.Vector]; val x: RDD[BIDMat.SMat]
  • 97. MLI Details. OLD: val x: RDD[Array[Double]] / RDD[spark.util.Vector] / RDD[breeze.linalg.Vector] / RDD[BIDMat.SMat]
  • 98. MLI Details. OLD: val x: RDD[Array[Double]] / RDD[spark.util.Vector] / RDD[breeze.linalg.Vector] / RDD[BIDMat.SMat]. NEW: val x: MLTable
  • 99. MLI Details. OLD: val x: RDD[Array[Double]] / RDD[spark.util.Vector] / RDD[breeze.linalg.Vector] / RDD[BIDMat.SMat]. NEW: val x: MLTable ✦ Generic interface for feature extraction ✦ Common interface to support an optimizer ✦ Abstract interface for arbitrary backends
  • 100. MLTable ✦ Flexibility when loading data (e.g., CSV, JSON, XML) ✦ Heterogeneous data across columns ✦ Missing data ✦ Feature extraction ✦ Common interface ✦ Supports MapReduce and relational operators ✦ Inspired by DataFrames (R) and Pandas (Python)
  • 101. Feature Extraction [MLTable API table (Fig. 2): operations such as matrixBatchMap (MLSubMatrix => MLSubMatrix, executed against the table's data), numRows, numCols]. Loading, featurizing, and learning clusters on a corpus (Fig. 3), code on slide:

    def main(args: Array[String]) {
      val mc = new MLContext("local")

      // Read in table from file on HDFS.
      val rawTextTable = mc.textFile(args(0))

      // Run feature extraction on the raw text - get the top 30000 bigrams.
      val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

      // Cluster the data using K-Means.
      val kMeansModel = KMeans(featurizedTable, k=50)
    }
  • 102. MLSubMatrix ✦ Linear algebra on local partitions ✦ E.g., matrix-vector operations for mini-batch logistic regression ✦ E.g., solving a linear system of equations for Alternating Least Squares ✦ Sparse and dense matrix support
  • 103. Alternating Least Squares (MLI implementation; the slide also shows the corresponding MATLAB parfor loop). Code on slide:

    object BroadcastALS extends Algorithm {
      def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
          m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
        val lambI = MLSubMatrix.eye(k).mul(lambda)
        var U = MLSubMatrix.rand(m, k)
        var V = MLSubMatrix.rand(n, k)
        var U_b = trainData.context.broadcast(U)
        var V_b = trainData.context.broadcast(V)
        for (iter <- 0 until maxIter) {
          U = trainData.matrixBatchMap(localALS(_, U_b.value, lambI, k))
          U_b = trainData.context.broadcast(U)
          V = trainDataTrans.matrixBatchMap(localALS(_, V_b.value, lambI, k))
          V_b = trainData.context.broadcast(V)
        }
        new ALSModel(U, V)
      }

      def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int) = {
        var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
        for (i <- 0 until trainDataPart.numRows) {
          val q = trainDataPart.rowID(i)
          val nz_inds = trainDataPart.nzCols(q)
          val Yq = Y(trainDataPart.nzCols(q), ??)
          localX(i, ??) = ((Yq.transpose times Yq) + lambI)
            .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
        }
        localX
      }
    }
  • 104. MLSolve ✦ Distributed implementations of common optimization patterns ✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses ✦ E.g., L-BFGS: an approximate 2nd-order optimization method ✦ E.g., ADMM: a decomposition / coordination procedure
  • 105. Logistic Regression (MLI implementation; the slide also shows the corresponding MATLAB gradient-descent loop with its sigmoid helper). Code on slide:

    object LogisticRegression extends Algorithm {
      def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

      def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
        x.transpose * (sigmoid(x dot w) - y)
      }

      def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
        val d = data.numCols
        val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
          maxIterations = p.maxIter, learningRate = p.learningRate,
          gradientFunction = gradientFunction)
        val weights = SGD(data, params)
        new LogRegModel(weights)
      }
    }

    object StochasticGradientDescent extends Optimizer {
      def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
          gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
        var localWeights = weights
        for (i <- 0 to x.numRows) {
          // … (listing truncated on the slide)
  • 106. MLI Functionality ✦ Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees] ✦ Regression: Linear Regression (+Lasso, Ridge) ✦ Collaborative Filtering: Alternating Least Squares, [DFC] ✦ Clustering: K-Means, [DP-Means] ✦ Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad] ✦ Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization ✦ ML Tools: Cross Validation, Evaluation Metrics
  • 107. MLI Functionality ✦ Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees] ✦ Regression: Linear Regression (+Lasso, Ridge) ✦ Collaborative Filtering: Alternating Least Squares, [DFC] ✦ Clustering: K-Means, [DP-Means] ✦ Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad] ✦ Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization ✦ ML Tools: Cross Validation, Evaluation Metrics
  • 108. Vision, MLlib, MLI, ML Optimizer, Release Plan
  • 109. What you want to do: Build a Classifier for X
  • 110. What you want to do: Build a Classifier for X. What you have to do: ✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, … ✦ Potentially learn Spark/Hadoop/… ✦ Implement 3-4 algorithms ✦ Implement grid-search to find the right algorithm parameters ✦ Implement validation algorithms ✦ Experiment with different sampling sizes, algorithms, features ✦ …
  • 111. What you want to do: Build a Classifier for X. What you have to do: learn ML internals, learn Spark/Hadoop, implement 3-4 algorithms, grid-search over parameters, validation, experiments with sampling sizes / algorithms / features, … and in the end: Ask For Help
  • 112. Insight: A Declarative Approach ✦ End Users tell the system what they want, not how to get it. SQL -> Result
  • 113. Insight: A Declarative Approach ✦ End Users tell the system what they want, not how to get it. SQL -> Result; MQL -> Model
  • 114. Insight: A Declarative Approach ✦ End Users tell the system what they want, not how to get it. Example, Supervised Classification: var X = load("als_clinical", 2 to 10); var y = load("als_clinical", 1); var (fn-model, summary) = doClassify(X, y)
  • 115. Insight: A Declarative Approach ✦ End Users tell the system what they want, not how to get it. Example, Supervised Classification (Algorithm Independent): var X = load("als_clinical", 2 to 10); var y = load("als_clinical", 1); var (fn-model, summary) = doClassify(X, y)
  • 116. ML Optimizer: A Search Problem ✦ The system is responsible for searching through the model space (e.g., SVM, Boosting, within a 5-minute budget) ✦ Opportunities for physical optimization (a toy search sketch follows below)
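
A toy sketch of that search loop: enumerate candidate (algorithm, hyperparameter) configurations, train and score each, and return the best. The Candidate type and the trainAndScore function are hypothetical placeholders, not MLbase APIs.

    object ModelSearchSketch {
      // A hypothetical candidate configuration: which algorithm and which hyperparameters.
      final case class Candidate(algorithm: String, params: Map[String, Double])

      // Enumerate a small grid of configurations to try.
      val candidates: Seq[Candidate] = for {
        algo <- Seq("logisticRegression", "linearSVM")
        step <- Seq(0.01, 0.1, 1.0)
        reg  <- Seq(0.0, 0.1)
      } yield Candidate(algo, Map("stepSize" -> step, "reg" -> reg))

      // Pick the configuration with the best validation score.
      // trainAndScore stands in for "train on the training set, evaluate on held-out data".
      def search(trainAndScore: Candidate => Double): (Candidate, Double) =
        candidates.map(c => (c, trainAndScore(c))).maxBy(_._2)
    }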
  • 117. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. [illustration: a small data table with columns A, B, C]
  • 118. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. ✦ Idea from databases: a shared cursor!
  • 119. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. ✦ Idea from databases: a shared cursor! [illustration: QueryA scanning the shared table]
  • 120. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. ✦ Idea from databases: a shared cursor! [illustration: QueryA and QueryB sharing one scan of the table]
  • 121. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. ✦ Idea from databases: a shared cursor! ✦ Single pass over the data, many models trained
  • 122. Systems Optimization of Model Search. Observation: we tend to be I/O bound during model training. ✦ Idea from databases: a shared cursor! ✦ Single pass over the data, many models trained ✦ Example: Logistic Regression via SGD (a sketch follows below)
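
A minimal local sketch of "one pass, many models": a single scan over the examples updates several logistic-regression weight vectors, one per candidate step size. The data and step sizes are made up for illustration; this is plain Scala, not the MLbase implementation.

    object SharedScanSGDSketch {
      def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

      def main(args: Array[String]): Unit = {
        // Toy dataset: (label, features).
        val data: Seq[(Double, Array[Double])] = Seq(
          (1.0, Array(1.0, 2.0)), (0.0, Array(-1.5, 0.3)),
          (1.0, Array(0.8, 1.1)), (0.0, Array(-0.7, -2.0)))

        // One model (weight vector) per candidate step size: the "many models".
        val stepSizes = Array(0.01, 0.1, 1.0)
        val weights = Array.fill(stepSizes.length)(Array(0.0, 0.0))

        // The "single pass": each example is read once and used to update every model.
        for ((y, x) <- data; m <- stepSizes.indices) {
          val w = weights(m)
          val err = sigmoid(w.zip(x).map { case (wi, xi) => wi * xi }.sum) - y
          for (j <- w.indices) w(j) -= stepSizes(m) * err * x(j)
        }

        weights.zip(stepSizes).foreach { case (w, s) =>
          println(s"stepSize=$s -> weights=${w.mkString("[", ", ", "]")}")
        }
      }
    }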
  • 123. Relationship with MLI: MQL requests from the End User enter the same master-server architecture (parser, optimizer, executor/monitoring, ML library) over Spark, MLlib, MLI, and the ML Optimizer
  • 124. Relationship with MLI ✦ MLI provides a common interface for all algorithms
  • 125. Relationship with MLI ✦ MLI provides a common interface for all algorithms ✦ Contracts: meta-data for algorithms written against MLI
  • 126. Relationship with MLI ✦ MLI provides a common interface for all algorithms ✦ Contracts: meta-data for algorithms written against MLI ✦ Type (e.g., classification) ✦ Parameters ✦ Runtime (e.g., O(n)) ✦ Input specification ✦ Output specification ✦ … (a sketch of such contract meta-data follows below)
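
As an illustration of what such contract meta-data might look like, here is a hypothetical Scala case class; the field names and the example values are assumptions for illustration, not the MLbase contract format.

    object ContractSketch {
      // Hypothetical contract meta-data for an algorithm written against MLI.
      final case class Contract(
        taskType: String,                 // e.g., "classification"
        parameters: Map[String, Double],  // tunable knobs and their defaults
        runtime: String,                  // e.g., "O(n)" in the number of examples
        inputSpec: String,                // what the algorithm expects as input
        outputSpec: String                // what it produces
      )

      // Example: a contract a developer might register for an MLI logistic regression.
      val logisticRegressionContract = Contract(
        taskType   = "classification",
        parameters = Map("stepSize" -> 0.1, "numIterations" -> 50.0),
        runtime    = "O(n)",
        inputSpec  = "MLTable with a numeric label column and numeric feature columns",
        outputSpec = "fn-model (weight vector) and training summary"
      )
    }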
  • 127. Vision, MLlib, MLI, ML Optimizer, Release Plan
  • 128. Contributors ✦ John Duchi ✦ Michael Franklin ✦ Joseph Gonzalez ✦ Rean Griffith ✦ Michael Jordan ✦ Tim Kraska ✦ Xinghao Pan ✦ Virginia Smith ✦ Shivaram Venkataraman ✦ Matei Zaharia
  • 129. Contributors (same list, with several names starred on the slide)
  • 130. First Release (Summer) [stack diagram: Spark, MLlib, MLI, ML Optimizer within the End User / ML Developer architecture]
  • 131. First Release (Summer) ✦ MLlib: low-level ML library and underlying kernels; callable from Scala, Java; included as part of Spark ✦ MLI: API for feature extraction and ML algorithms; platform for ML development; includes a more extensive library with a faster dev cycle than MLlib
  • 132. Second Release (Winter) [stack diagram as on slide 130]
  • 133. Second Release (Winter) ✦ ML Optimizer: automated model selection; a search problem over feature extractors and algorithms in MLI; contracts; restricted query language (MQL)
  • 134. Second Release (Winter) ✦ ML Optimizer: automated model selection; a search problem over feature extractors and algorithms in MLI; contracts; restricted query language (MQL) ✦ Feature extraction for image data
  • 135. Future Directions
  • 136. Future Directions ✦ Identify a minimal set of ML operators: expose internals of ML algorithms to the optimizer
  • 137. Future Directions ✦ Identify a minimal set of ML operators ✦ Unified language for End Users and ML Developers
  • 138. Future Directions ✦ Identify a minimal set of ML operators ✦ Unified language for End Users and ML Developers ✦ Plug-ins to Python, R
  • 139. Future Directions ✦ Identify a minimal set of ML operators ✦ Unified language for End Users and ML Developers ✦ Plug-ins to Python, R ✦ Visualization for unsupervised learning and exploration
  • 140. Future Directions ✦ Identify a minimal set of ML operators ✦ Unified language for End Users and ML Developers ✦ Plug-ins to Python, R ✦ Visualization for unsupervised learning and exploration ✦ Advanced ML capabilities: time-series algorithms, graphical models, advanced optimization (e.g., asynchronous computation), online updates, sampling for efficiency
  • 141. Contributions encouraged! www.mlbase.org (Berkeley, CA, August 29-30)
