Machine Learning and Hadoop: Present and future

Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
High Availability for Data Scientists




NIPS

Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Industrial Machine Learning




Delta One: Model Evaluation

• ML Systems Are One Piece of a Complex System
• Well-defined objective functions are the exception
   • Multiple, often conflicting goals
   • Weights are fuzzy and shift with business priorities
   • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
• Examples
   • Computational advertising
   • Friend recommendations on social networks
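The Pareto point above can be made concrete with a small sketch. It assumes each candidate model is scored on several objectives where higher is better; the model names and numbers are invented for illustration and are not from the talk. The Pareto front is the set of candidates that no other candidate beats on every objective.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one value per objective, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the sorted names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


# Hypothetical (revenue lift, user satisfaction) scores for three ad-ranking models.
models = {
    "model_a": (0.30, 0.60),
    "model_b": (0.25, 0.80),
    "model_c": (0.20, 0.55),  # worse than model_a on both objectives
}
front = pareto_front(models)
```

Choosing among the models left on the front is still a business decision; the filter only removes options that are worse on every axis.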


Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
   • (and don’t usually launch)
• Industrial Computational Infrastructure
   • General-purpose
   • Cheap
   • Shared
• Constraints Drive Innovation
   • Vowpal Wabbit Hashing Trick
   • SETI @ Google
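A minimal sketch of the hashing trick mentioned above, assuming string-valued features and a fixed memory budget. This is not Vowpal Wabbit's implementation: `zlib.crc32` stands in for its hash function, and the signed update is one common formulation of the technique. The function name and bucket count are illustrative.

```python
import zlib
from typing import Iterable, List


def hash_features(tokens: Iterable[str], num_buckets: int = 16) -> List[float]:
    """Map string features into a fixed-length vector without storing a vocabulary."""
    vector = [0.0] * num_buckets
    for token in tokens:
        h = zlib.crc32(token.encode("utf-8"))  # deterministic unsigned 32-bit hash
        index = h % num_buckets
        # The top hash bit picks a sign, so colliding features tend to cancel
        # rather than pile up in the same direction.
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vector[index] += sign
    return vector


example = hash_features(["user=42", "query=cheap flights", "country=us"])
```

The vector length is fixed up front, so memory use stays constant no matter how many distinct feature strings show up in the data.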


Delta Three: Workflow




[Figure not reproduced. Source: Practice Over Theory Blog]
Delta Three: Workflow

• Optimize the Overall Process
   • Model fitting is a small piece of the overall flow time
   • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
   • Common Feature Extraction Logic
   • Servable Models
• Validation as Sanity Checking
   • Deploy to a small subset of real data and evaluate
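One way to read "Common Feature Extraction Logic" is that the offline training path and the online scoring path call the same function. The sketch below assumes a click-prediction setting; the event fields, feature names, and weights are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Event = Dict[str, object]
Features = Dict[str, float]


def extract_features(event: Event) -> Features:
    """Single source of truth for features, shared by training and serving."""
    query = str(event.get("query", ""))
    return {
        "bias": 1.0,
        "query_length": float(len(query.split())),
        "is_mobile": 1.0 if event.get("device") == "mobile" else 0.0,
    }


def build_training_rows(log: Iterable[Event]) -> List[Tuple[Features, int]]:
    """Offline path: turn logged events into (features, label) rows."""
    return [(extract_features(e), int(e["clicked"])) for e in log]


def score(weights: Dict[str, float], event: Event) -> float:
    """Online path: score one live event with a linear model and a sigmoid."""
    feats = extract_features(event)
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logged event and hand-picked weights, for illustration only.
log = [{"query": "cheap flights", "device": "mobile", "clicked": 1}]
rows = build_training_rows(log)
probability = score({"bias": -1.0, "is_mobile": 0.5}, log[0])
```

Because both paths go through `extract_features`, a feature change cannot silently differ between what the model was trained on and what it sees when served.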


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Hadoop: It’s Where The Data Is




Hadoop Platform: Substrate

• Commodity servers
   • Open Compute
• Open source operating system
   • Linux
• Open source configuration management
   • Puppet
   • Chef
• Coordination service
   • ZooKeeper


Hadoop Platform: Storage

• Distributed schema-less storage
   • HDFS
   • Ceph
• Append-only storage formats and metadata
   • Avro
   • RCFile
   • HCatalog
• Mutable key-value storage and metadata
   • HBase


Hadoop Platform: Integration

• Tool Access
   • FUSE
   • JDBC
   • ODBC
• Data Ingestion
   • Flume
   • Sqoop




ML and Hadoop: The State of the World




Computation: Plain Old MapReduce

• Great for:
   • Data Preparation
   • Feature Engineering
   • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
   • Recommendation Systems
   • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
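To illustrate why model evaluation fits plain MapReduce, here is an in-process simulation of a map, shuffle, and reduce pass that counts confusion-matrix outcomes per model. It does not use the Hadoop API; the record layout, function names, and threshold are assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float, int]  # (model_id, predicted score, true label in {0, 1})
Key = Tuple[str, str]            # (model_id, outcome)

OUTCOMES = {(1, 1): "tp", (1, 0): "fp", (0, 1): "fn", (0, 0): "tn"}


def map_record(record: Record, threshold: float = 0.5) -> Iterator[Tuple[Key, int]]:
    """Map step: emit one count keyed by (model, confusion-matrix cell)."""
    model_id, predicted_score, label = record
    predicted = 1 if predicted_score >= threshold else 0
    yield (model_id, OUTCOMES[(predicted, label)]), 1


def reduce_counts(key: Key, values: Iterable[int]) -> Tuple[Key, int]:
    """Reduce step: sum the counts for one key."""
    return key, sum(values)


def run_job(records: Iterable[Record]) -> Dict[Key, int]:
    """Simulate the shuffle by grouping mapper output by key, then reduce each group."""
    shuffled: Dict[Key, List[int]] = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            shuffled[key].append(value)
    return dict(reduce_counts(key, values) for key, values in shuffled.items())


# Hypothetical scored records for a single model.
records = [("m1", 0.9, 1), ("m1", 0.2, 1), ("m1", 0.7, 0), ("m1", 0.1, 0)]
results = run_job(records)
```

Each record is processed independently and the reduce is a plain sum, which is the shape of problem the slide lists under "Great for".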


Tools for Data Preparation/Feature Engineering

• Languages/Environments
   • Pig Latin
   • HiveQL
   • Need to deal with mismatch between offline/online feature
     generation
• Java/Scala APIs
   •   Crunch (Cloudera)
   •   Scoobi (NICTA)
   •   Cascading (Concurrent)
   •   Jaql (IBM)

Apache Mahout

• The starting place for MapReduce-based machine
  learning algorithms
   • Not machine-learning-in-a-box
   • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
   •   Recommendations
   •   Clustering
   •   Classification
   •   Frequent Itemset Mining



Apache Mahout (cont.)

• Best Library: Taste Recommender
   • Oldest project, most widely deployed in production
   • SVD implementation is particularly active
• Good Libraries: Online SGD
   • Does not use MapReduce
   • Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
   • “Secret sauce” effect
   • Delta between Mahout + the cutting edge in ML
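For readers unfamiliar with online SGD, the sketch below shows the general idea for logistic regression: the model is updated one example at a time from a stream, with no MapReduce pass. It is not Mahout's or Vowpal Wabbit's code; the function names, learning rate, and toy data are assumptions.

```python
import math
from typing import Dict, Iterable, Tuple

Features = Dict[str, float]
Example = Tuple[Features, int]  # (sparse features, label in {0, 1})


def predict(weights: Dict[str, float], features: Features) -> float:
    """Logistic prediction for one sparse example."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def sgd_update(weights: Dict[str, float], features: Features, label: int,
               learning_rate: float = 0.1) -> None:
    """One stochastic gradient step on the log loss for a single example."""
    error = predict(weights, features) - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value


def train_online(stream: Iterable[Example], learning_rate: float = 0.1) -> Dict[str, float]:
    """Consume examples one at a time; the model is usable after every update."""
    weights: Dict[str, float] = {}
    for features, label in stream:
        sgd_update(weights, features, label, learning_rate)
    return weights


# Toy stream: the label is 1 exactly when x is positive.
stream = [({"bias": 1.0, "x": 1.0}, 1), ({"bias": 1.0, "x": -1.0}, 0)] * 200
weights = train_online(stream)
```

Only the weight dictionary is kept in memory, so the same loop works on a stream that never ends.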

More Machine Learning Interfaces for Hadoop

• Based on MapReduce
  • SystemML (IBM)
  • AllReduce (Vowpal Wabbit)
• No MapReduce
  • Spark
• R-Based Systems (Augment MapReduce with R)
  •   Segue
  •   RHIPE
  •   RHadoop
  •   Ricardo (IBM)
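The AllReduce pattern named above can be sketched in a single process: every worker contributes a local vector and every worker receives the same element-wise sum. Real implementations exchange data between machines; this simulation only shows the contract. The linear model, shards, and step size are invented for illustration.

```python
from typing import List, Sequence, Tuple

Shard = List[Tuple[List[float], float]]  # [(feature vector, target), ...]


def allreduce_sum(local_vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Each worker passes in one vector; each worker gets back the element-wise sum."""
    total = [sum(column) for column in zip(*local_vectors)]
    return [list(total) for _ in local_vectors]


def local_gradient(weights: List[float], shard: Shard) -> List[float]:
    """Squared-error gradient of a linear model on one worker's shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += error * xi
    return grad


def distributed_step(weights: List[float], shards: List[Shard],
                     learning_rate: float) -> List[float]:
    """One synchronous gradient step across all shards."""
    summed = allreduce_sum([local_gradient(weights, shard) for shard in shards])
    n = sum(len(shard) for shard in shards)
    # Every worker holds the same summed gradient, so each would apply the
    # same update; worker 0's copy is used here.
    return [w - learning_rate * g / n for w, g in zip(weights, summed[0])]


# Two hypothetical workers whose data follows y = 2 * x.
shards = [[([1.0], 2.0)], [([2.0], 4.0)]]
weights = [0.0]
for _ in range(200):
    weights = distributed_step(weights, shards, learning_rate=0.1)
```

After the reduction the workers stay in sync without a central parameter store, which is what makes the pattern attractive for iterative fitting.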

ML and Hadoop: Where Things Are Headed




MRv2 and YARN

• Eliminates JobTracker bottleneck
   • Separate Resource Manager/Scheduler
   • Individual jobs have their own task masters
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
   •   MPI (Hamster; MAPREDUCE-2911)
   •   Native BSP (Giraph)
   •   Spark
   •   AllReduce, GraphLab


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Machine Learning on Multivariate Time Series

 • 1e5 writes/sec
 • Positive events are relatively rare
 • Feature extraction challenge
 • May not be clear what the right time horizon is
 • Tight SLAs
 • Very high stakes
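When the right time horizon is unclear, one hedge is to compute the same summary features over several trailing windows and let the model decide which matter. The sketch below assumes a single series with sorted timestamps in seconds; the horizons and feature names are illustrative.

```python
from bisect import bisect_right
from typing import Dict, Sequence


def window_features(timestamps: Sequence[float], values: Sequence[float],
                    now: float, horizons: Sequence[int]) -> Dict[str, float]:
    """Summarise values in the window (now - h, now] for each horizon h.

    Assumes timestamps are sorted ascending and aligned with values.
    """
    features: Dict[str, float] = {}
    end = bisect_right(timestamps, now)  # first index with timestamp > now
    for h in horizons:
        start = bisect_right(timestamps, now - h)  # first index inside the window
        window = values[start:end]
        features[f"count_{h}s"] = float(len(window))
        features[f"mean_{h}s"] = sum(window) / len(window) if window else 0.0
    return features


# Hypothetical readings from one sensor.
timestamps = [1, 5, 8, 9, 10]
values = [2.0, 4.0, 6.0, 8.0, 10.0]
features = window_features(timestamps, values, now=10, horizons=[2, 10])
```

The binary searches keep each lookup logarithmic in the series length, which matters at the write rates quoted on the slide.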

An Academic Language For Feature Engineering

• Feature extraction/selection is as important as model
  fitting
   • e.g., hierarchical feature representation, impact on training
     time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and
  dispersed across multiple fields
   • NIPS 2003
   • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these
  problems across disciplines

A Broader Ontology For Model Selection

• Practical factors that enter into the “best” choice of
  model…
   •   Data arrival rate
   •   Data volume
   •   Scoring latency
   •   Model refresh time
   •   Robustness/reliability
• …in addition to the standard predictive power/simplicity
  tradeoffs
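A small sketch of model selection when operational factors are treated as hard constraints: filter out candidates that miss the latency or refresh budget, then pick the most accurate of what remains. The candidate names, metrics, and budgets are invented, and treating the factors as hard limits is only one possible policy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    auc: float                 # predictive power on held-out data
    scoring_latency_ms: float  # time to score one request
    refresh_minutes: float     # time to retrain and redeploy


def select_model(candidates: Iterable[Candidate], max_latency_ms: float,
                 max_refresh_minutes: float) -> Optional[Candidate]:
    """Return the most accurate candidate that meets both operational budgets."""
    feasible = [
        c for c in candidates
        if c.scoring_latency_ms <= max_latency_ms
        and c.refresh_minutes <= max_refresh_minutes
    ]
    return max(feasible, key=lambda c: c.auc, default=None)


# Hypothetical candidates: the most accurate one is too slow to serve.
candidates = [
    Candidate("large_ensemble", auc=0.92, scoring_latency_ms=80.0, refresh_minutes=600.0),
    Candidate("boosted_trees", auc=0.90, scoring_latency_ms=12.0, refresh_minutes=120.0),
    Candidate("linear", auc=0.88, scoring_latency_ms=2.0, refresh_minutes=15.0),
]
chosen = select_model(candidates, max_latency_ms=20.0, max_refresh_minutes=60.0)
```

With these budgets the simplest model wins even though it has the lowest accuracy, which is the situation the slide describes.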


Questions?
Want A Job?
  @josh_wills