Machine Learning and Hadoop: Present and Future

Josh Wills Data Science Director @Cloudera talk at Data Science London 06/09/12

Published in: Technology, Education

  1. Machine Learning and Hadoop: Present and Future. Josh Wills, Cloudera Data Science Team, September 6th, 2012
  2. About Me. Copyright 2012 Cloudera Inc. All rights reserved
  3. Outline
      • Part 1: Industrial Machine Learning
      • Part 2: ML and Hadoop: The State of the World
      • Part 3: ML and Hadoop: Where Things are Headed
  4. (Academic) ML vs. (Academic) Statistics: “Machine learning is statistics minus any checking of models and assumptions.” -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
  5. Industrial Machine Learning: Truth #1. The thing that we are trying to predict is rarely the thing that we are trying to optimize.
  6. Industrial Machine Learning: Truth #2. Systems precede algorithms.
  7. Industrial Machine Learning: Truth #3. Practice Over Theory Blog
  8. Implication: Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
  9. Outline
      • Part 1: Industrial Machine Learning
      • Part 2: ML and Hadoop: The State of the World
      • Part 3: ML and Hadoop: Where Things are Headed
  10. “Hadoop. It’s Where The Data Is.”
  11. Hadoop Platform: Substrate
      • Commodity servers
      • Open source operating system
      • “” Configuration Management
      • “” Coordination Service
      • “” File System API
      • “” Efficient and Extensible File Formats
      • “” Efficient and Extensible RPC Libraries
  12. Hadoop Platform: MapReduce Frameworks
      • Languages/Environments
          • Pig Latin (Apache)
          • HiveQL (Apache)
          • Jaql (IBM)
      • Java/Scala APIs
          • Crunch (Apache Incubator)
          • Scoobi (NICTA)
          • Cascading (Concurrent)
          • Pangool
  13. ML and Hadoop: The State of the World
  14. MapReduce
      • Great for:
          • Data Preparation
          • Feature Engineering
          • Model Validation/Evaluation
      • Works Well For Certain Model Fitting Problems
          • Collaborative Filtering Algorithms
          • Expectation Maximization
          • Decision Trees (PLANET; Gradient Boosted Decision Trees)
      • Not A Practical Option for Many Kinds of Problems
      • Way More Detail in the KDD 2011 Talk
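As a rough illustration of why feature engineering fits MapReduce well, here is a minimal single-process sketch of the map, shuffle, and reduce steps computing a per-feature mean. It is not Hadoop code: `map_record`, `reduce_feature`, `run_job`, and the sample records are hypothetical names and data invented for this example.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (feature_name, (value, 1)) for each feature in one record."""
    for name, value in record.items():
        yield name, (value, 1)


def reduce_feature(name, pairs):
    """Reduce step: fold the (value, count) pairs of one feature into its mean."""
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return name, total / count


def run_job(records):
    """Run map, then group by key (the shuffle), then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, pair in map_record(record):
            groups[key].append(pair)
    return dict(reduce_feature(key, pairs) for key, pairs in groups.items())


# Hypothetical input: each record is a sparse dict of numeric features.
records = [
    {"clicks": 2.0, "age": 30.0},
    {"clicks": 4.0},
    {"clicks": 6.0, "age": 50.0},
]
means = run_job(records)
```

Each record is mapped independently and each feature is reduced independently, which is the property that lets this kind of job spread across many machines.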
  15. Apache Mahout
      • The starting place for MapReduce-based machine learning algorithms
          • Not machine-learning-in-a-box
          • Custom tweaks/modifications are the rule
      • A disparate collection of algorithms for:
          • Recommendations
          • Clustering
          • Classification
          • Frequent Itemset Mining
  16. Apache Mahout (cont.)
      • Best Library: Taste Recommender
          • Oldest project, most widely-deployed in production
          • SVD implementation is particularly active
      • Good Libraries: Online SGD
          • Does not use MapReduce
          • Vowpal Wabbit is faster, has L-BFGS option
      • Roll Your Own Instead: Naïve Bayes
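To make the "Online SGD does not use MapReduce" point concrete, here is a minimal sketch of online stochastic gradient descent for logistic regression. It is not Mahout's or Vowpal Wabbit's implementation; `sgd_logistic`, `predict`, the learning rate, and the toy data are assumptions made for illustration.

```python
import math


def predict(weights, features):
    """Return the predicted probability of label 1 under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(examples, n_features, learning_rate=0.1, epochs=1):
    """Update the weights after every single example (online SGD)."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, features)
            for j, x in enumerate(features):
                weights[j] += learning_rate * error * x
    return weights


# Hypothetical toy data: first feature is a constant bias term,
# and the label is 1 when the second feature is positive.
examples = [
    ([1.0, -2.0], 0),
    ([1.0, -1.0], 0),
    ([1.0, 1.0], 1),
    ([1.0, 2.0], 1),
]
weights = sgd_logistic(examples, n_features=2, epochs=20)
```

Every update depends on the weights left by the previous example, so the loop is inherently sequential. That is one reason this style of learner sits awkwardly inside a batch map-and-reduce job.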
  17. The Ominous Challenges
  18. 1. The Secret Sauce Effect
  19. 2. Delta Between Mahout and the Cutting Edge
  20. ML and Hadoop: Where Things are Headed
  21. Moving Beyond MapReduce
  22. The Contenders
  23. AllReduce
      • Developed at Yahoo! Research
      • Defines the allreduce operation
          • N machines each have a number => each machine has the sum of the numbers
      • At the heart of Vowpal Wabbit’s performance
      • Implemented in C++
      • Can be patched into Apache Hadoop and used today
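The allreduce operation described on this slide can be sketched in a single process. The version below assumes the N machines are arranged as a binary tree stored in a list, which is one common way to organize the communication; it is an illustration only, not the Yahoo! Research or Vowpal Wabbit code, and `allreduce_sum` is a hypothetical name.

```python
def allreduce_sum(values):
    """Simulate allreduce: every 'machine' ends up holding the sum of all values.

    values[i] is the number held by machine i. Machine i's parent in the
    (assumed) binary tree is machine (i - 1) // 2, and machine 0 is the root.
    """
    n = len(values)
    if n == 0:
        return []

    # Reduce phase: walk from the leaves toward the root, so each machine has
    # already absorbed its children's partial sums before it reports upward.
    partial = list(values)
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]

    # Broadcast phase: the root holds the total; each machine copies it
    # from its parent, top to bottom.
    result = [None] * n
    result[0] = partial[0]
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


totals = allreduce_sum([3, 1, 4, 1, 5])  # every entry is 14
```

In a learning setting the "number" is typically a vector such as a gradient, summed element by element, so every machine can take the same update step without a central coordinator.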
  24. Spark
      • Developed at Berkeley’s AMP Lab
      • Defines operations on distributed in-memory collections
      • Written in Scala
      • Supports reading from and writing to HDFS
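The idea of "operations on distributed in-memory collections" can be illustrated with a toy, single-process stand-in. This is deliberately not Spark's API: `PartitionedCollection` and its methods are hypothetical, and the partitions here are ordinary Python lists rather than data spread over a cluster.

```python
from functools import reduce


class PartitionedCollection:
    """A toy collection split into partitions that all stay in memory."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions):
        """Deal the items round-robin into num_partitions lists."""
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return PartitionedCollection([[fn(x) for x in p] for p in self.partitions])

    def reduce(self, fn):
        """Combine elements within each partition, then across partitions.

        fn should be associative and commutative, because elements are not
        combined in their original order. An empty collection raises TypeError.
        """
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)


numbers = PartitionedCollection.from_items(range(1, 11), num_partitions=3)

# Two different computations reuse the same in-memory partitions.
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)  # 385
largest = numbers.reduce(max)  # 10
```

The point of the sketch is the reuse: an iterative algorithm can run many passes over `numbers` without re-reading its input from disk between passes.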
  25. GraphLab
      • Developed at CMU
      • Lower-level primitives
      • (but higher than MPI)
      • Map/Reduce => Update/Sort
      • Flexible, allows for asynchronous computations
      • Reads from HDFS
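A vertex update function with dynamic scheduling, in the spirit of the "update" side of this slide, might look like the following sketch. It uses a PageRank-style update as the example computation and is not GraphLab code; `run_updates`, its parameters, and the sample graph are assumptions. It also assumes every vertex appears as a key in `out_edges`.

```python
from collections import deque


def run_updates(out_edges, damping=0.85, tolerance=1e-6, max_updates=100000):
    """Apply a per-vertex update until no vertex changes by more than tolerance.

    out_edges maps each vertex to the list of vertices it links to.
    """
    in_edges = {v: [] for v in out_edges}
    for src, targets in out_edges.items():
        for dst in targets:
            in_edges[dst].append(src)

    rank = {v: 1.0 for v in out_edges}
    queue = deque(out_edges)  # vertices waiting for an update
    queued = set(out_edges)
    updates = 0

    while queue and updates < max_updates:
        v = queue.popleft()
        queued.discard(v)
        updates += 1

        # Update function: reads only the neighbors of v, writes only v.
        incoming = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
        new_rank = (1.0 - damping) + damping * incoming
        changed = abs(new_rank - rank[v]) > tolerance
        rank[v] = new_rank

        # Reschedule only the vertices affected by a meaningful change.
        if changed:
            for w in out_edges[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = run_updates(graph)
```

Unlike a MapReduce pass, which recomputes every vertex each round, the queue lets work concentrate on the parts of the graph that are still changing, and updates see the freshest neighbor values rather than those from the previous round.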
  26. How Things Measure Up
  27. Speed vs. Reliability
  28. Memory vs. Disk
  29. C++ vs. JVM
  30. Questions? (Ask Anything. Anything At All.) jwills@cloudera.com
