
MLlib sparkmeetup_8_6_13_final_reduced

Published in: Technology, Education

  1. 1. Evan Sparks and Ameet Talwalkar, UC Berkeley (MLbase)
  6. 6. Three Converging Trends: Big Data, Distributed Computing, Machine Learning (diagram label: MLbase)
  7. 7. Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
  8. 8. Problem: Scalable implementations are difficult for ML Developers… (diagram: ML Developer, ML Contract + Code)
  15. 15. Problem: ML is difficult for End Users…
      ✦ Too many algorithms…
      ✦ Too many knobs…
      ✦ Difficult to debug…
      ✦ Doesn't scale…
      (slide labels: Reliable, Fast, Accurate, Provable)
  18. 18. MLbase (ML Experts, Systems Experts)
      1. Easy scalable ML development (ML Developers)
      2. User-friendly ML at scale (End Users)
      Along the way, we gain insight into data-intensive computing
  23. 23. Matlab Stack (Single Machine)
      ✦ Lapack: low-level Fortran linear algebra library
      ✦ Matlab Interface
        ✦ Higher-level abstractions for data access / processing
        ✦ More extensive functionality than Lapack
        ✦ Leverages Lapack whenever possible
      ✦ Similar stories for R and Python
  29. 29. MLbase Stack
      ✦ Spark: cluster computing system designed for iterative computation
      ✦ MLlib: low-level ML library in Spark
        ✦ Callable from Scala, Java
      ✦ MLI: API / platform for feature extraction and algorithm development
        ✦ Includes higher-level functionality with faster dev cycle than MLlib
      ✦ ML Optimizer: automates model selection
        ✦ Solves a search problem over feature extractors and algorithms in MLI
      (stack diagram: Runtime(s) / Spark, MLlib, MLI, ML Optimizer; shown beside the single-machine stack of Lapack and Matlab Interface)
  33. 33. MLbase Stack Status
      ✦ Goal 1: Summer Release
      ✦ Goal 2: Winter Release
      (stack diagram: Spark, MLlib, MLI, ML Optimizer)
      (architecture diagram: the End User submits a declarative ML task and the ML Developer submits ML Contract + Code to a Master Server containing Parser, Optimizer (LLP, PLP), Executor/Monitoring, ML Library, Meta-Data, and Statistics; Slaves each run a DMX Runtime; the result, e.g., fn-model & summary, goes back to the user)
  37. 37. Example: MLlib
      ✦ Goal: Classification of text file
      ✦ Featurize data manually
      ✦ Calls MLlib's LR function

          def main(args: Array[String]) {
            val sc = new SparkContext("local", "SparkLR")

            //Load data from HDFS
            val data = sc.textFile(args(0)) //RDD[String]

            //User is responsible for formatting/featurizing/normalizing their RDD!
            val featurizedData: RDD[(Double,Array[Double])] = processData(data)

            //Train the model using MLlib.
            val model = new LogisticRegressionLocalRandomSGD()
              .setStepSize(0.1)
              .setNumIterations(50)
              .train(featurizedData)
          }
  41. 41. Example: MLI
      ✦ Use built-in feature extraction functionality
      ✦ MLI Logistic Regression leverages MLlib
      ✦ Extensions:
        ✦ Embed in cross-validation routine
        ✦ Use different feature extractors / algorithms or write new ones

          def main(args: Array[String]) {
            val mc = new MLContext("local", "MLILR")

            //Read in file from HDFS
            val rawTextTable = mc.csvFile(args(0), Seq("class","text"))

            //Run feature extraction
            val classes = rawTextTable(??, "class")
            val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
            val featurizedTable = classes.zip(ngrams)

            //Classify the data using Logistic Regression.
            val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
          }
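The first extension on this slide, embedding the classifier in a cross-validation routine, might look roughly like the sketch below. It is plain Scala with no Spark or MLI dependency; `CrossValidation`, `kFold`, and `fitAndScore` are hypothetical names chosen for illustration and are not MLI functions.

```scala
// Hypothetical helper, not part of MLI or MLlib.
object CrossValidation {
  /** Splits data into k folds by index (i % k) and returns one (train, test) pair per fold. */
  def kFold[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }

  /** Mean score over the k folds; fitAndScore trains on its first argument and scores on its second. */
  def crossValidate[A](data: Seq[A], k: Int)(fitAndScore: (Seq[A], Seq[A]) => Double): Double =
    kFold(data, k).map { case (train, test) => fitAndScore(train, test) }.sum / k
}
```

In the setting of the slide, `fitAndScore` would wrap the featurize-then-train steps and return a validation metric for the held-out fold.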
  42. 42. Example: ML Optimizer

          var X = load("text_file", 2 to 10)
          var y = load("text_file", 1)
          var (fn-model, summary) = doClassify(X, y)

      ✦ User declaratively specifies task
      ✦ ML Optimizer searches through MLI
  43. 43. Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
  48. 48. Lay of the Land (plot of Ease of use against Performance, Scalability; points: Matlab, R; Mahout; GraphLab, VW; MLlib)
  50. 50. MLlib
      ✦ Classification: Logistic Regression, Linear SVM (+L1, L2)
      ✦ Regression: Linear Regression (+Lasso, Ridge)
      ✦ Collaborative Filtering: Alternating Least Squares
      ✦ Clustering: K-Means
      ✦ Optimization Primitives: SGD, Parallel Gradient
      Included within Spark codebase
      ✦ Unlike Mahout/Hadoop
      ✦ Part of Spark 0.8 release
      ✦ Continued support via Spark project
  55. 55. MLlib Performance
      ✦ Walltime: elapsed time to execute task
      ✦ Weak scaling
        ✦ fix problem size per processor
        ✦ ideally: constant walltime as we grow cluster
      ✦ Strong scaling
        ✦ fix total problem size
        ✦ ideally: linear speedup as we grow cluster
      ✦ EC2 Experiments
        ✦ m2.4xlarge instances, up to 32 machine clusters
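The two scaling notions above reduce to simple ratios of walltimes. The sketch below spells them out; `ScalingMetrics` and its method names are hypothetical, and any walltimes passed in are the caller's own measurements rather than numbers from this talk.

```scala
// Hypothetical helper names; inputs are walltimes in seconds.
object ScalingMetrics {
  /** Strong scaling (total problem size fixed): speedup on p machines is T(1) / T(p); ideal is p. */
  def strongSpeedup(t1: Double, tp: Double): Double = t1 / tp

  /** Strong-scaling efficiency: speedup divided by machine count; ideal is 1.0. */
  def strongEfficiency(t1: Double, tp: Double, p: Int): Double = strongSpeedup(t1, tp) / p

  /** Weak scaling (problem size per machine fixed): relative walltime T(p) / T(1); ideal is 1.0. */
  def weakRelativeWalltime(t1: Double, tp: Double): Double = tp / t1
}
```

For example, a job that takes 100 s on one machine and 25 s on eight has a speedup of 4 and an efficiency of 0.5.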
  59. 59. Logistic Regression - Weak Scaling
      ✦ Full dataset: 200K images, 160K dense features
      ✦ Similar weak scaling
      ✦ MLlib within a factor of 2 of VW's walltime
      [Figure: walltime (s) for MLbase/MLlib, VW, and Matlab at n=6K, 12.5K, 25K, 50K, 100K, 200K with d=160K]
      [Fig. 6: Weak scaling for logistic regression. Relative walltime vs. # machines for MLbase/MLlib, VW, Ideal]
  63. 63. Logistic Regression - Strong Scaling
      ✦ Fixed Dataset: 50K images, 160K dense features
      ✦ MLlib exhibits better scaling properties
      ✦ MLlib faster than VW with 16 and 32 machines
      [Fig. 7: Walltime for strong scaling for logistic regression. Walltime (s) for MLbase/MLlib, VW, and Matlab on 1, 2, 4, 8, 16, and 32 machines]
      [Fig. 8: Strong scaling for logistic regression. Speedup vs. # machines for MLbase/MLlib, VW, Ideal]
  68. 68. ALS - Walltime
      ✦ Dataset: Scaled version of Netflix data (9X in size)
      ✦ Cluster: 9 machines
      ✦ MLlib an order of magnitude faster than Mahout
      ✦ MLlib within factor of 2 of GraphLab

      System      Walltime (seconds)
      Matlab      15443
      Mahout      4206
      GraphLab    291
      MLlib       481
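The two comparative claims on this slide can be checked directly against the table. The snippet below is only arithmetic on the four walltimes listed above; `AlsWalltime` and `ratio` are hypothetical names.

```scala
object AlsWalltime {
  // Walltimes in seconds, copied from the table on the slide.
  val walltime: Map[String, Double] = Map(
    "Matlab" -> 15443.0,
    "Mahout" -> 4206.0,
    "GraphLab" -> 291.0,
    "MLlib" -> 481.0)

  /** How many times slower system `a` is than system `b`. */
  def ratio(a: String, b: String): Double = walltime(a) / walltime(b)
}
```

`AlsWalltime.ratio("Mahout", "MLlib")` comes to roughly 8.7, which the slide rounds to an order of magnitude, and `AlsWalltime.ratio("MLlib", "GraphLab")` comes to roughly 1.65, inside the stated factor of 2.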
  71. 71. Deployment Considerations
      Vowpal Wabbit, GraphLab
      ✦ Data preparation specific to each program
      ✦ Non-trivial setup on cluster
      ✦ No fault tolerance
      MLlib
      ✦ Reads files from HDFS
      ✦ Launch/compile/run on cluster with a few commands
      ✦ RDDs are fault tolerant
  72. 72. Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
  74. 74. Lay of the Land (plot of Ease of use against Performance, Scalability; points: Matlab, R; Mahout; GraphLab, VW; MLlib; MLI)
  78. 78. Current Options
      + Easy (resembles math, limited / no setup cost)
      + Sufficient for prototyping / writing papers
      - Ad-hoc, non-scalable scripts
      - Loss in translation upon re-implementation

      + Scalable and (sometimes) fast
      + Existing open-source library of ML algorithms
      - Difficult to set up, extend
  82. 82. Examples (ML Developer code)
      'Distributed' Divide-Factor-Combine (DFC)
      ✦ Initial studies in MATLAB (not distributed)
      ✦ Distributed prototype involving compiled MATLAB
      Mahout ALS with Early Stopping
      ✦ Theory: simple if-statement (3 lines of code)
      ✦ Practice: sift through 7 files, nearly 1K lines of code
  87. 87. Insight: Programming Abstractions
      ✦ Shield ML Developers from low-level details: provide familiar mathematical operators in distributed setting
      ✦ ML Developer API (MLI)
        ✦ Table Computation: MLTable
        ✦ Linear Algebra: MLSubMatrix
        ✦ Optimization Primitives: MLSolve
      ✦ MLI Examples:
        ✦ DFC: ~50 lines of code
        ✦ ALS: early stopping in 3 lines; < 40 lines total
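To make the "early stopping in 3 lines" point concrete, here is a minimal sketch of an iterative loop with a convergence check. It is generic Scala rather than the MLI ALS code; `EarlyStopping`, `iterate`, `step`, `error`, and `tol` are all hypothetical names.

```scala
// Hypothetical helper, not part of MLI.
object EarlyStopping {
  /**
   * Applies `step` up to maxIter times and stops early once the error changes
   * by less than `tol` between consecutive iterations.
   * Returns the final state and the number of iterations actually run.
   */
  def iterate[S](init: S, maxIter: Int, tol: Double)(step: S => S)(error: S => Double): (S, Int) = {
    var state = init
    var prevErr = error(state)
    var iter = 0
    var converged = false
    while (iter < maxIter && !converged) {
      state = step(state)
      iter += 1
      val err = error(state)
      // The early-stopping check itself: the "simple if-statement".
      if (math.abs(prevErr - err) < tol) converged = true
      prevErr = err
    }
    (state, iter)
  }
}
```

With an abstraction like this, the stopping rule lives next to the loop, which is the contrast the slide draws with sifting through several files.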
  89. 89. Lines of Code

      Logistic Regression
      System           Lines of Code
      Matlab           11
      Vowpal Wabbit    721
      MLI              55

      Alternating Least Squares
      System           Lines of Code
      Matlab           20
      Mahout           865
      GraphLab         383
      MLI              32
  99. 99. MLI Details
      OLD
          val x: RDD[Array[Double]]
          val x: RDD[spark.util.Vector]
          val x: RDD[breeze.linalg.Vector]
          val x: RDD[BIDMat.SMat]
      NEW
          val x: MLTable
      ✦ Generic interface for feature extraction
      ✦ Common interface to support an optimizer
      ✦ Abstract interface for arbitrary backends
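The idea of one table type in front of several storage backends can be illustrated with a small trait. This is a sketch of the design idea only: `NumericTable`, `ArrayTable`, and `FlatTable` are hypothetical names, and the real MLTable API is not shown on this slide.

```scala
// Hypothetical names throughout; this is not MLI's actual MLTable interface.
trait NumericTable {
  def numRows: Int
  def numCols: Int
  def row(i: Int): Vector[Double]

  /** Column means, written once against the interface and shared by every backend. */
  def colMeans: Vector[Double] =
    Vector.tabulate(numCols)(j => (0 until numRows).map(i => row(i)(j)).sum / numRows)
}

/** Backend 1: rows stored as nested arrays. */
final class ArrayTable(data: Array[Array[Double]]) extends NumericTable {
  def numRows: Int = data.length
  def numCols: Int = if (data.isEmpty) 0 else data(0).length
  def row(i: Int): Vector[Double] = data(i).toVector
}

/** Backend 2: a single flat row-major buffer. */
final class FlatTable(values: Vector[Double], cols: Int) extends NumericTable {
  def numRows: Int = values.length / cols
  def numCols: Int = cols
  def row(i: Int): Vector[Double] = values.slice(i * cols, (i + 1) * cols)
}
```

Code such as `colMeans` never mentions the storage layout, so swapping the backend does not touch the algorithm.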
  100. 100. MLTable
      ✦ Flexibility when loading data
        ✦ e.g., CSV, JSON, XML
      ✦ Heterogeneous data across columns
      ✦ Missing data
      ✦ Feature extraction
      ✦ Common interface
      ✦ Supports MapReduce and relational operators
      ✦ Inspired by DataFrames (R) and Pandas (Python)
  101. 101. Feature Extraction
      [Cropped paper excerpt. Fig. 2: MLTable API Illustration; visible rows: matrixBatchMap, numRows, numCols]

          def main(args: Array[String]) {
            val mc = new MLContext("local")

            //Read in table from file on HDFS.
            val rawTextTable = mc.textFile(args(0))

            //Run feature extraction on the raw text - get the top 30000 bigrams.
            val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

            //Cluster the data using K-Means.
            val kMeansModel = KMeans(featurizedTable, k=50)
          }

      Fig. 3: Loading, featurizing, and learning clusters on a corpus…
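For readers who want to see what the `tfIdf(nGrams(...))` step computes, the sketch below does the same kind of featurization on in-memory strings. It assumes whitespace tokenization, raw counts for term frequency, and idf = ln(N / document frequency); `TextFeatures` and its methods are hypothetical stand-ins, not the MLI functions of the same name, whose exact definitions the slide does not give.

```scala
// Hypothetical stand-ins for illustration; not the MLI implementations.
object TextFeatures {
  /** All word n-grams of a document, in order of appearance. */
  def nGrams(doc: String, n: Int): Seq[String] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty)
      .sliding(n).filter(_.length == n).map(_.mkString(" ")).toList

  /** The `top` most frequent n-grams across the corpus; ties are broken alphabetically. */
  def vocabulary(docs: Seq[Seq[String]], top: Int): Seq[String] =
    docs.flatten.groupBy(identity).map { case (g, occ) => (g, occ.size) }.toSeq
      .sortBy { case (g, c) => (-c, g) }.take(top).map(_._1)

  /**
   * One tf-idf vector per document over `vocab`.
   * Assumes every vocabulary entry occurs in at least one document.
   */
  def tfIdf(docs: Seq[Seq[String]], vocab: Seq[String]): Seq[Vector[Double]] = {
    val n = docs.size.toDouble
    val idf = vocab.map(g => math.log(n / docs.count(_.contains(g))))
    docs.map(d => vocab.zip(idf).map { case (g, w) => d.count(_ == g) * w }.toVector)
  }
}
```

A bigram that appears in every document gets weight ln(1) = 0, so only the distinguishing bigrams carry signal into the clustering step.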
  102. 102. MLSubMatrix
      ✦ Linear algebra on local partitions
        ✦ E.g., matrix-vector operations for mini-batch logistic regression
        ✦ E.g., solving linear system of equations for Alternating Least Squares
      ✦ Sparse and Dense Matrix Support
  103. 103. Alternating Least Squares
      MATLAB (fragment; the slide crops the start of the function):

              parfor q=1:n
                Uq = U(Uinds{q},:);
                V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
              end
            end
          end

      MLI:

          object BroadcastALS extends Algorithm {
            def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
                m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
              val lambI = MLSubMatrix.eye(k).mul(lambda)
              var U = MLSubMatrix.rand(m, k)
              var V = MLSubMatrix.rand(n, k)
              var U_b = trainData.context.broadcast(U)
              var V_b = trainData.context.broadcast(V)
              for (iter <- 0 until maxIter) {
                // Solve for U with V held fixed, then for V with the new U held fixed.
                U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
                U_b = trainData.context.broadcast(U)
                V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
                V_b = trainData.context.broadcast(V)
              }
              new ALSModel(U, V)
            }

            // Explicit result type: with procedure syntax the method would return Unit.
            def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix,
                lambI: MLSubMatrix, k: Int): MLSubMatrix = {
              var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
              for (i <- 0 until trainDataPart.numRows) {
                val q = trainDataPart.rowID(i)
                val nz_inds = trainDataPart.nzCols(q)
                val Yq = Y(nz_inds, ??)
                localX(i, ??) = ((Yq.transpose times Yq) + lambI)
                  .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
              }
              localX
            }
          }
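Both listings solve the same small regularized least-squares problem for each row of a factor matrix. Written out for the item factors, with notation that is assumed here rather than taken from the slide (U_q for the rows of U belonging to users who rated item q, m_q for those observed ratings, k for the rank):

```latex
% U_q : rows of U for the users who rated item q   (U(Uinds{q},:) in the MATLAB code)
% m_q : the observed ratings for item q            (M(Uinds{q},q) in the MATLAB code)
% \lambda I_k corresponds to lambI in both listings
v_q = \left( U_q^{\top} U_q + \lambda I_k \right)^{-1} U_q^{\top} m_q
```

The update for each user factor has the same form with the roles of U and V exchanged, which is why a single `localALS` routine can serve both half-steps.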
  104. 104. MLSolve
      ✦ Distributed implementations of common optimization patterns
        ✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses
        ✦ E.g., LBFGS: an approximate 2nd-order optimization method
        ✦ E.g., ADMM: decomposition / coordination procedure
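"Summable" here can be read as: the training objective is a sum of per-example losses, so its gradient is a sum of per-example gradients and a single example gives a cheap estimate of the descent direction. In symbols (the loss name, step size, and sampling index are notation assumed for this note):

```latex
% \ell : per-example loss, \eta : step size, i_t : index of the example used at step t
f(w) = \sum_{i=1}^{n} \ell(w; x_i, y_i), \qquad
\nabla f(w) = \sum_{i=1}^{n} \nabla \ell(w; x_i, y_i), \qquad
w_{t+1} = w_t - \eta \, \nabla \ell(w_t; x_{i_t}, y_{i_t})
```

Because the sum splits across examples, partial gradients can be computed independently on each data partition and then combined.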
  105. 105. Logistic Regression
      MATLAB (fragment; the slide crops the start of the function):

              grad = X' * (sigmoid(X * w) - y);
              w = w - learning_rate * grad;
            end
          end

          % applies sigmoid function component-wise on the vector x
          function s = sigmoid(x)
            s = 1 ./ (1 + exp(-1 .* x));
          end

      MLI (the second object is cut off on the slide):

          object LogisticRegression extends Algorithm {
            def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

            def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
              x.transpose * (sigmoid(x dot w) - y)
            }

            def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
              val d = data.numCols
              val params = SGDParams(initweights = MLSubMatrix.zeros(d, 1),
                maxIterations = p.maxIter, learningRate = p.learningRate,
                gradientFunction = gradientFunction)
              val weights = SGD(data, params)
              new LogRegModel(weights)
            }
          }

          object StochasticGradientDescent extends Optimizer {

            def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
                gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatr…
              var localWeights = weights
              for (i <- 0 to x.numRows) {
              …
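Since the SGD listing is cut off on the slide, the following self-contained sketch shows the same gradient, x * (sigmoid(x . w) - y), driving a plain sequential SGD loop on local arrays. It is not the MLI implementation: `LogRegSgd`, `dot`, and `train` are hypothetical names, there is no regularization, and nothing here is distributed.

```scala
// Hypothetical single-machine sketch; not MLI's StochasticGradientDescent.
object LogRegSgd {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  /** Gradient of the logistic loss for one example: x * (sigmoid(x . w) - y). */
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = sigmoid(dot(x, w)) - y
    x.map(_ * err)
  }

  /** Sequential SGD: one weight update per example, `epochs` passes over the data. */
  def train(data: Seq[(Double, Array[Double])], learningRate: Double, epochs: Int): Array[Double] = {
    var w = Array.fill(data.head._2.length)(0.0)
    for (_ <- 0 until epochs; (y, x) <- data) {
      val g = gradient(w, x, y)
      w = w.indices.map(i => w(i) - learningRate * g(i)).toArray
    }
    w
  }
}
```

Labels are 0.0 or 1.0, and a constant first feature of 1.0 can serve as the bias term.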
  106. 106. MLI Functionality
      ✦ Regression: Linear Regression (+Lasso, Ridge)
      ✦ Collaborative Filtering: Alternating Least Squares, [DFC]
      ✦ Clustering: K-Means, [DP-Means]
      ✦ Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
      ✦ Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
      ✦ Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
      ✦ ML Tools: Cross Validation, Evaluation Metrics
  108. 108. Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
  111. 111. What you want to do
      Build a Classifier for X
      What you have to do
      ✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, …
      ✦ Potentially learn Spark/Hadoop/…
      ✦ Implement 3-4 algorithms
      ✦ Implement grid-search to find the right algorithm parameters
      ✦ Implement validation algorithms
      ✦ Experiment with different sampling sizes, algorithms, features
      ✦ …
      and in the end
      Ask For Help
  113. 113. Insight: A Declarative Approach
      ✦ End Users tell the system what they want, not how to get it
      (diagram: SQL → Result, MQL → Model)
  115. 115. Insight: A Declarative Approach
      ✦ End Users tell the system what they want, not how to get it
      Example: Supervised Classification (Algorithm Independent)

          var X = load("als_clinical", 2 to 10)
          var y = load("als_clinical", 1)
          var (fn-model, summary) = doClassify(X, y)
  116. 116. ML Optimizer: A Search Problem
      ✦ System is responsible for searching through model space
      ✦ Opportunities for physical optimization
      (diagram labels: 5min, Boosting, SVM)
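In its simplest form, "searching through model space" means training each candidate and keeping the one that validates best. The sketch below shows that exhaustive baseline on a toy one-feature classifier; `ModelSearch`, `Trainer`, and `search` are hypothetical names, and the actual ML Optimizer search strategy is not described on this slide.

```scala
// Hypothetical exhaustive baseline; not the ML Optimizer's actual strategy.
object ModelSearch {
  type Model = Double => Boolean                    // classifies one numeric feature
  type Trainer = Seq[(Double, Boolean)] => Model    // fits a model to (feature, label) pairs

  def accuracy(m: Model, data: Seq[(Double, Boolean)]): Double =
    data.count { case (x, y) => m(x) == y }.toDouble / data.size

  /** Trains every candidate and returns the name, model, and validation accuracy of the best one. */
  def search(candidates: Map[String, Trainer],
             train: Seq[(Double, Boolean)],
             valid: Seq[(Double, Boolean)]): (String, Model, Double) =
    candidates.toSeq.map { case (name, trainer) =>
      val model = trainer(train)
      (name, model, accuracy(model, valid))
    }.maxBy(_._3)
}
```

The caller only supplies data and a set of candidates, which mirrors the declarative `doClassify(X, y)` call: the choice of algorithm is made by the system.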
  122. 122. Systems Optimization of Model Search
      Observation: We tend to be I/O bound during model training.
      ✦ Idea from databases: shared cursor!
      ✦ Single pass over the data, many models trained
      ✦ Example: Logistic Regression via SGD
      (diagram: QueryA and QueryB scanning the same table)

      A    B    C
      1    a    Dog
      1    b    Cat
      2    c    Cat
      2    d    Cat
      3    e    Dog
      3    f    Horse
      4    g    Monkey
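The shared-cursor idea can be sketched as a single loop over the data in which each record updates every model before the next record is read. Below, several logistic-regression models that differ only in learning rate are trained in one scan; `SharedScan`, `step`, and `trainAll` are hypothetical names, and the code is a local illustration rather than the MLbase implementation.

```scala
// Hypothetical local illustration of "one pass, many models".
object SharedScan {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** One SGD step for logistic regression on a single (label, features) example. */
  def step(w: Vector[Double], y: Double, x: Vector[Double], lr: Double): Vector[Double] = {
    val err = sigmoid(w.zip(x).map { case (a, b) => a * b }.sum) - y
    w.zip(x).map { case (wi, xi) => wi - lr * err * xi }
  }

  /**
   * Reads the data exactly once. Each example is applied to all models before
   * the iterator advances, so k models cost one scan instead of k scans.
   */
  def trainAll(data: Iterator[(Double, Vector[Double])], dim: Int,
               learningRates: Seq[Double]): Seq[Vector[Double]] = {
    var models = learningRates.map(_ => Vector.fill(dim)(0.0))
    for ((y, x) <- data)
      models = models.zip(learningRates).map { case (w, lr) => step(w, y, x, lr) }
    models
  }
}
```

Taking an `Iterator` makes the single-pass constraint explicit: once consumed, the data cannot be rescanned for a second model.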
  126. 126. Relationship with MLI
      ✦ MLI provides common interface for all algorithms
      ✦ Contracts: Meta-data for algorithms written against MLI
        ✦ Type (e.g., classification)
        ✦ Parameters
        ✦ Runtime (e.g., O(n))
        ✦ Input-Specification
        ✦ Output-Specification
        ✦ …
      (same architecture diagram as slide 33, with an added MQL label)
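One way to picture a contract is as a small metadata record attached to each algorithm. The slide lists the categories but no concrete schema, so every field name below is an assumption made for illustration; `Contract` and `Contracts.forTask` are hypothetical.

```scala
// Hypothetical schema: the slide names the categories, not the fields.
final case class Contract(
  name: String,
  taskType: String,                 // e.g. "classification"
  parameters: Map[String, String],  // parameter name -> description or allowed range
  runtime: String,                  // e.g. "O(n)"
  inputSpec: String,
  outputSpec: String)

object Contracts {
  /** Candidate algorithms for a task, selected from metadata alone without running anything. */
  def forTask(all: Seq[Contract], taskType: String): Seq[Contract] =
    all.filter(_.taskType == taskType)
}
```

An optimizer could use records like these to narrow the search space before any model is trained.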
  127. 127. Vision ✦ MLlib ✦ MLI ✦ ML Optimizer ✦ Release Plan
  128. 128. Contributors ✦ John Duchi ✦ Michael Franklin ✦ Joseph Gonzalez ✦ Rean Griffith ✦ Michael Jordan ✦ Tim Kraska ✦ Xinghao Pan ✦ Virginia Smith ✦ Shivaram Venkataraman ✦ Matei Zaharia
  131. 131. First Release (Summer) ✦ MLlib: low-level ML library and underlying kernels ✦ Callable from Scala, Java ✦ Included as part of Spark ✦ MLI: API for feature extraction and ML algorithms ✦ Platform for ML development ✦ Includes a more extensive library with a faster dev cycle than MLlib [Figure: MLbase architecture diagram]
  134. 134. Second Release (Winter) ✦ ML Optimizer: automated model selection ✦ Search problem over feature extractors and algorithms in MLI ✦ Contracts ✦ Restricted query language (MQL) ✦ Feature extraction for image data [Figure: MLbase architecture diagram]
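The "search problem over feature extractors and algorithms" can be sketched as an exhaustive search over every pairing, scored on held-out data. This is a toy illustration under stated assumptions: the function `search`, the two extractors, and the two learners are hypothetical stand-ins, and exhaustive enumeration is only the simplest possible strategy, not necessarily what the ML Optimizer does.

```python
from itertools import product


def fit_majority(data):
    # Baseline learner: always predict the most common training label.
    labels = [y for _, y in data]
    majority = max(set(labels), key=labels.count)
    return lambda feature: majority


def fit_nearest_mean(data):
    # Predict the label whose mean training feature value is closest.
    means = {}
    for label in {y for _, y in data}:
        values = [f for f, y in data if y == label]
        means[label] = sum(values) / len(values)
    return lambda feature: min(means, key=lambda lab: abs(feature - means[lab]))


def search(extractors, algorithms, train, validate):
    """Try every (extractor, algorithm) pair; return the best by accuracy.

    Returns a tuple (accuracy, extractor_name, algorithm_name).
    """
    best = None
    for (fname, extract), (aname, fit) in product(extractors.items(),
                                                  algorithms.items()):
        model = fit([(extract(x), y) for x, y in train])
        hits = sum(model(extract(x)) == y for x, y in validate)
        accuracy = hits / len(validate)
        if best is None or accuracy > best[0]:
            best = (accuracy, fname, aname)
    return best


extractors = {"identity": lambda x: x, "abs": abs}
algorithms = {"majority": fit_majority, "nearest_mean": fit_nearest_mean}
```

On data whose label depends on the magnitude of the input, the search should prefer the `abs` extractor paired with the nearest-mean learner, because neither component finds the pattern without the other.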
  140. 140. Future Directions ✦ Identify minimal set of ML operators ✦ Expose internals of ML algorithms to optimizer ✦ Unified language for End Users and ML Developers ✦ Plug-ins to Python, R ✦ Visualization for unsupervised learning and exploration ✦ Advanced ML capabilities ✦ Time-series algorithms ✦ Graphical models ✦ Advanced optimization (e.g., asynchronous computation) ✦ Online updates ✦ Sampling for efficiency
  141. 141. Contributions encouraged! Berkeley, CA, August 29-30. www.mlbase.org
