Machine Learning and Hadoop: Present and Future


Josh Wills, Data Science Director at Cloudera. Talk at Data Science London, 06/09/12.

Presentation Transcript

  • Machine Learning and Hadoop: Present and Future. Josh Wills, Cloudera Data Science Team. September 6th, 2012
  • About Me. Copyright 2012 Cloudera Inc. All rights reserved
  • Outline
    • Part 1: Industrial Machine Learning
    • Part 2: ML and Hadoop: The State of the World
    • Part 3: ML and Hadoop: Where Things are Headed
  • (Academic) ML vs. (Academic) Statistics: “Machine learning is statistics minus any checking of models and assumptions.” -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
  • Industrial Machine Learning: Truth #1: The thing that we are trying to predict is rarely the thing that we are trying to optimize.
  • Industrial Machine Learning: Truth #2: Systems precede algorithms.
  • Industrial Machine Learning: Truth #3: Practice Over Theory Blog
  • Implication: Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
  • Outline
    • Part 1: Industrial Machine Learning
    • Part 2: ML and Hadoop: The State of the World
    • Part 3: ML and Hadoop: Where Things are Headed
  • “Hadoop. It’s Where The Data Is.”
  • Hadoop Platform: Substrate
    • Commodity servers
    • Open source operating system
    • Open source configuration management
    • Open source coordination service
    • Open source file system API
    • Open source efficient and extensible file formats
    • Open source efficient and extensible RPC libraries
  • Hadoop Platform: MapReduce Frameworks
    • Languages/Environments
      • Pig Latin (Apache)
      • HiveQL (Apache)
      • Jaql (IBM)
    • Java/Scala APIs
      • Crunch (Apache Incubator)
      • Scoobi (NICTA)
      • Cascading (Concurrent)
      • Pangool
  • ML and Hadoop: The State of the World
  • MapReduce
    • Great for:
      • Data Preparation
      • Feature Engineering
      • Model Validation/Evaluation
    • Works Well For Certain Model Fitting Problems
      • Collaborative Filtering Algorithms
      • Expectation Maximization
      • Decision Trees (PLANET; Gradient Boosted Decision Trees)
    • Not A Practical Option for Many Kinds of Problems
    • Way More Detail in the KDD 2011 Talk
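To make the "feature engineering" use case concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that computes a per-user click count. It is only an illustration: the `user_id,url` record format and the function names `map_clicks`, `shuffle`, `reduce_counts`, and `run_job` are assumptions made up for this example, not part of Hadoop's API or of the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_clicks(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (user_id, 1) for each well-formed 'user_id,url' record."""
    parts = record.split(",")
    if len(parts) == 2 and parts[0]:
        yield parts[0], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the values for one key."""
    return key, sum(values)


def run_job(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (hypothetical driver)."""
    mapped = (pair for record in records for pair in map_clicks(record))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Toy input; the malformed record "bad" is skipped by the map step.
    print(run_job(["u1,/a", "u2,/b", "u1,/c", "bad"]))
```

In a real cluster the framework performs the shuffle across machines; the map and reduce functions are the only parts a job author writes, which is why record-at-a-time work like this fits the model well.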
  • Apache Mahout
    • The starting place for MapReduce-based machine learning algorithms
      • Not machine-learning-in-a-box
      • Custom tweaks/modifications are the rule
    • A disparate collection of algorithms for:
      • Recommendations
      • Clustering
      • Classification
      • Frequent Itemset Mining
  • Apache Mahout (cont.)
    • Best Library: Taste Recommender
      • Oldest project, most widely-deployed in production
      • SVD implementation is particularly active
    • Good Libraries: Online SGD
      • Does not use MapReduce
      • Vowpal Wabbit is faster, has L-BFGS option
    • Roll Your Own Instead: Naïve Bayes
  • The Ominous Challenges
  • 1. The Secret Sauce Effect
  • 2. Delta Between Mahout and the Cutting Edge
  • ML and Hadoop: Where Things are Headed
  • Moving Beyond MapReduce
  • The Contenders
  • AllReduce
    • Developed at Yahoo! Research
    • Defines the allreduce operation
      • N machines each have a number => each machine has the sum of the numbers
    • At the heart of Vowpal Wabbit’s performance
    • Implemented in C++
    • Can be patched into Apache Hadoop and used today
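The allreduce contract described on this slide (every machine starts with one number and ends with the sum of all of them) can be sketched as a single-process simulation. The binary-tree layout, in which worker `i` has children `2*i + 1` and `2*i + 2`, and the function name `tree_allreduce` are assumptions chosen for illustration; this is not Vowpal Wabbit's actual C++ implementation.

```python
from typing import List


def tree_allreduce(values: List[float]) -> List[float]:
    """Simulate allreduce(sum) over len(values) workers arranged as a binary tree.

    Worker i's parent is worker (i - 1) // 2 (a hypothetical layout for this
    sketch). Returns the value each worker holds after the operation.
    """
    n = len(values)
    partial = list(values)  # partial[i] becomes the sum over worker i's subtree

    # Reduce phase: walk from the highest-numbered worker up to the root, so
    # each child has its full subtree sum before "sending" it to its parent.
    for i in range(n - 1, 0, -1):
        parent = (i - 1) // 2
        partial[parent] += partial[i]

    # Broadcast phase: the root holds the global sum; each worker copies the
    # value its parent received.
    result = list(partial)
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


if __name__ == "__main__":
    print(tree_allreduce([1.0, 2.0, 3.0, 4.0]))
```

With a tree layout, each worker exchanges messages only with its parent and children, so the number of communication rounds grows with the depth of the tree rather than with N.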
  • Spark
    • Developed at Berkeley’s AMP Lab
    • Defines operations on distributed in-memory collections
    • Written in Scala
    • Supports reading from and writing to HDFS
  • GraphLab
    • Developed at CMU
    • Lower-level primitives (but higher than MPI)
      • Map/Reduce => Update/Sort
    • Flexible, allows for asynchronous computations
    • Reads from HDFS
  • How Things Measure Up
  • Speed vs. Reliability
  • Memory vs. Disk
  • C++ vs. JVM
  • Questions? (Ask Anything. Anything At All.) jwills@cloudera.com