SlideShare a Scribd company logo
1 of 26
Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
High Availability for Data Scientists




                 NIPS




                Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




                Copyright 2011 Cloudera Inc. All rights reserved
Industrial Machine Learning




   Copyright 2011 Cloudera Inc. All rights reserved
Delta One: Model Evaluation

• ML Systems Are One Piece of a Complex System
• Well-defined objective functions are the exception
   • Multiple, often conflicting goals
   • Weights are fuzzy and shift with business priorities
   • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
• Examples
   • Computational advertising
   • Friend recommendations on social networks


                    Copyright 2011 Cloudera Inc. All rights reserved
Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
   • (and don’t usually launch)
• Industrial Computational Infrastructure
   • General-purpose
   • Cheap
   • Shared
• Constraints Drive Innovation
   • Vowpal Wabbit Hashing Trick
   • SETI @ Google


                   Copyright 2011 Cloudera Inc. All rights reserved
Delta Three: Workflow




                                                                 Practice Over Theory Blog



              Copyright 2011 Cloudera Inc. All rights reserved
Delta Three: Workflow

• Optimize the Overall Process
   • Model fitting is a small piece of the overall flow time
   • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
   • Common Feature Extraction Logic
   • Servable Models
• Validation as Sanity Checking
   • Deploy to a small subset of real data and evaluate


                    Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




                Copyright 2011 Cloudera Inc. All rights reserved
Hadoop: It’s Where The Data Is




    Copyright 2011 Cloudera Inc. All rights reserved
Hadoop Platform: Substrate

• Commodity servers
   • Open Compute
• Open source operating system
   • Linux
• Open source configuration management
   • Puppet
   • Chef
• Coordination service
   • ZooKeeper


                 Copyright 2011 Cloudera Inc. All rights reserved
Hadoop Platform: Storage

• Distributed schema-less storage
   • HDFS
   • Ceph
• Append-only storage formats and metadata
   • Avro
   • RCFile
   • HCatalog
• Mutable key-value storage and metadata
   • HBase


                 Copyright 2011 Cloudera Inc. All rights reserved
Hadoop Platform: Integration

• Tool Access
   • FUSE
   • JDBC
   • ODBC
• Data Ingestion
   • Flume
   • Sqoop




                   Copyright 2011 Cloudera Inc. All rights reserved
ML and Hadoop: The State of the World




        Copyright 2011 Cloudera Inc. All rights reserved
Computation: Plain Old MapReduce

• Great for:
   • Data Preparation
   • Feature Engineering
   • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
   • Recommendation Systems
   • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk


                   Copyright 2011 Cloudera Inc. All rights reserved
Tools for Data Preparation/Feature Engineering

• Languages/Environments
   • PigLatin
   • HiveQL
   • Need to deal with mismatch between offline/online feature
     generation
• Java/Scala APIs
   •   Crunch (Cloudera)
   •   Scoobi (NICTA)
   •   Cascading (Concurrent)
   •   Jaql (IBM)

                    Copyright 2011 Cloudera Inc. All rights reserved
Apache Mahout

• The starting place for MapReduce-based machine
  learning algorithms
   • Not machine-learning-in-a-box
   • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
   •   Recommendations
   •   Clustering
   •   Classification
   •   Frequent Itemset Mining



                    Copyright 2011 Cloudera Inc. All rights reserved
Apache Mahout (cont.)

• Best Library: Taste Recommender
   • Oldest project, most widely-deployed in production
   • SVD implementation is particularly active
• Good Libraries: Online SGD
   • Does not use MapReduce
   • Vowpal Rabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
   • “Secret sauce” effect
   • Delta between Mahout + the cutting edge in ML

                   Copyright 2011 Cloudera Inc. All rights reserved
More Machine Learning Interfaces for Hadoop

• Based on MapReduce
  • SystemML (IBM)
  • AllReduce (Vowpal Wabbit)
• No MapReduce
  • Spark
• R-Based Systems (Augment MapReduce with R)
  •   Segue
  •   RHIPE
  •   RHadoop
  •   Ricardo (IBM)

                      Copyright 2011 Cloudera Inc. All rights reserved
ML and Hadoop: Where Things are Headed




          Copyright 2011 Cloudera Inc. All rights reserved
MRv2 and YARN

• Eliminates JobTracker bottleneck
   • Separate Resource Manager/Scheduler
   • Individual jobs have their own task masters
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
   •   MPI (Hamster; MAPREDUCE-2911)
   •   Native BSP (Giraph)
   •   Spark
   •   AllReduce, GraphLab


                   Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




                Copyright 2011 Cloudera Inc. All rights reserved
Machine Learning on Multivariate Time Series

 • 1e5 writes/sec
 • Positive events are
   relatively rare
 • Feature extraction
   challenge
 • May not be clear what
   the right time horizon is
 • Tight SLAs
 • Very high stakes

                Copyright 2011 Cloudera Inc. All rights reserved
An Academic Language For Feature Engineering

• Feature extraction/selection is as important as model
  fitting
   • e.g., hierarchical feature representation, impact on training
     time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and
  dispersed across multiple fields
   • NIPS 2003
   • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these
  problems across disciplines

                    Copyright 2011 Cloudera Inc. All rights reserved
A Broader Ontology For Model Selection

• Practical factors that enter into the “best” choice of
  model…
   •   Data arrival rate
   •   Data volume
   •   Scoring latency
   •   Model refresh time
   •   Robustness/reliability
• …in addition to the standard predictive power/simplicity
  tradeoffs


                     Copyright 2011 Cloudera Inc. All rights reserved
Questions?
Want A Job?
  @josh_wills

More Related Content

What's hot

machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...Armando Vieira
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big dataPoo Kuan Hoong
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Greg Makowski
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learningPruet Boonma
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine LearningCorey Chivers
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data ScientistNarong Intiruk
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentationDavid Raj Kanthi
 
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...inside-BigData.com
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013MLconf
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Srinath Perera
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data ScienceSean Taylor
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningMakine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningAli Alkan
 

What's hot (20)

machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...machine learning in the age of big data: new approaches and business applicat...
machine learning in the age of big data: new approaches and business applicat...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Managing machine learning
Managing machine learningManaging machine learning
Managing machine learning
 
Machine learning and big data
Machine learning and big dataMachine learning and big data
Machine learning and big data
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 
Intro to Machine Learning
Intro to Machine LearningIntro to Machine Learning
Intro to Machine Learning
 
Begin with Data Scientist
Begin with Data ScientistBegin with Data Scientist
Begin with Data Scientist
 
Machine learning with Big Data power point presentation
Machine learning with Big Data power point presentationMachine learning with Big Data power point presentation
Machine learning with Big Data power point presentation
 
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...
Machine & Deep Learning: Practical Deployments and Best Practices for the Nex...
 
Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013Ted Willke, Intel Labs MLconf 2013
Ted Willke, Intel Labs MLconf 2013
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference Data Science in the Real World: Making a Difference
Data Science in the Real World: Making a Difference
 
Putting the Magic in Data Science
Putting the Magic in Data SciencePutting the Magic in Data Science
Putting the Magic in Data Science
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine LearningMakine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
Makine Öğrenmesi ile Görüntü Tanıma | Image Recognition using Machine Learning
 

Viewers also liked

A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The PeopleDaniel Tunkelang
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in PythonImry Kissos
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013Philip Zheng
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDevashish Shanker
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesPier Luca Lanzi
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsNhatHai Phan
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitionsOwen Zhang
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkDEEPASHRI HK
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle CompetitionsDataRobot
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentationlpaviglianiti
 
The Business Analytics Value Proposition
The Business Analytics Value PropositionThe Business Analytics Value Proposition
The Business Analytics Value PropositionEric Stephens
 

Viewers also liked (20)

A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
Artificial Intelligence Presentation
Artificial Intelligence PresentationArtificial Intelligence Presentation
Artificial Intelligence Presentation
 
The Business Analytics Value Proposition
The Business Analytics Value PropositionThe Business Analytics Value Proposition
The Business Analytics Value Proposition
 
Business intelligence
Business intelligenceBusiness intelligence
Business intelligence
 

Similar to Machine Learning and Hadoop: Present and Future

Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureData Science London
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011Abe Taha
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Cloudera, Inc.
 
Oracle SQL Developer Data Modeler - for SQL Server
Oracle SQL Developer Data Modeler - for SQL ServerOracle SQL Developer Data Modeler - for SQL Server
Oracle SQL Developer Data Modeler - for SQL ServerJeff Smith
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopCloudera, Inc.
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Productionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedProductionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedCloudera, Inc.
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)VMware Tanzu
 

Similar to Machine Learning and Hadoop: Present and Future (20)

Machine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and FutureMachine Learning and Hadoop: Present and Future
Machine Learning and Hadoop: Present and Future
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011Chicago HUG Presentation Oct 2011
Chicago HUG Presentation Oct 2011
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208Webinar: Productionizing Hadoop: Lessons Learned - 20101208
Webinar: Productionizing Hadoop: Lessons Learned - 20101208
 
Oracle SQL Developer Data Modeler - for SQL Server
Oracle SQL Developer Data Modeler - for SQL ServerOracle SQL Developer Data Modeler - for SQL Server
Oracle SQL Developer Data Modeler - for SQL Server
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Productionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedProductionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons Learned
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
YARN
YARNYARN
YARN
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)Hack for Good and Profit (Cloud Foundry Summit 2014)
Hack for Good and Profit (Cloud Foundry Summit 2014)
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Machine Learning and Hadoop: Present and Future

  • 1. Machine Learning and Hadoop Present and Future Josh Wills, Tom Pierce, and Jeff Hammerbacher Cloudera Data Science Team December 17th, 2011
  • 2. High Availability for Data Scientists NIPS Copyright 2011 Cloudera Inc. All rights reserved
  • 3. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia Copyright 2011 Cloudera Inc. All rights reserved
  • 4. Industrial Machine Learning Copyright 2011 Cloudera Inc. All rights reserved
  • 5. Delta One: Model Evaluation • ML Systems Are One Piece of a Complex System • Well-defined objective functions are the exception • Multiple, often conflicting goals • Weights are fuzzy and shift with business priorities • Pareto optimization is the safest play • Predictive Accuracy Is Only Useful Up to a Point • Examples • Computational advertising • Friend recommendations on social networks Copyright 2011 Cloudera Inc. All rights reserved
  • 6. Delta Two: Systems Precede Algorithms • Greenfield Projects Hardly Ever Happen • (and don’t usually launch) • Industrial Computational Infrastructure • General-purpose • Cheap • Shared • Constraints Drive Innovation • Vowpal Wabbit Hashing Trick • SETI @ Google Copyright 2011 Cloudera Inc. All rights reserved
  • 7. Delta Three: Workflow Practice Over Theory Blog Copyright 2011 Cloudera Inc. All rights reserved
  • 8. Delta Three: Workflow • Optimize the Overall Process • Model fitting is a small piece of the overall flow time • Parallelize everything • Better Features > Better Models • Fast Model Deployment • Common Feature Extraction Logic • Servable Models • Validation as Sanity Checking • Deploy to a small subset of real data and evaluate Copyright 2011 Cloudera Inc. All rights reserved
  • 9. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia Copyright 2011 Cloudera Inc. All rights reserved
  • 10. Hadoop: It’s Where The Data Is Copyright 2011 Cloudera Inc. All rights reserved
  • 11. Hadoop Platform: Substrate • Commodity servers • Open Compute • Open source operating system • Linux • Open source configuration management • Puppet • Chef • Coordination service • ZooKeeper Copyright 2011 Cloudera Inc. All rights reserved
  • 12. Hadoop Platform: Storage • Distributed schema-less storage • HDFS • Ceph • Append-only storage formats and metadata • Avro • RCFile • HCatalog • Mutable key-value storage and metadata • HBase Copyright 2011 Cloudera Inc. All rights reserved
  • 13. Hadoop Platform: Integration • Tool Access • FUSE • JDBC • ODBC • Data Ingestion • Flume • Sqoop Copyright 2011 Cloudera Inc. All rights reserved
  • 14. ML and Hadoop: The State of the World Copyright 2011 Cloudera Inc. All rights reserved
  • 15. Computation: Plain Old MapReduce • Great for: • Data Preparation • Feature Engineering • Model Validation/Evaluation • Works For Certain Model Fitting Problems • Recommendation Systems • Decision Trees (PLANET; Gradient Boosted Decision Trees) • Not A Practical Option for Online Learning • Way More Detail from the KDD 2011 Talk Copyright 2011 Cloudera Inc. All rights reserved
  • 16. Tools for Data Preparation/Feature Engineering • Languages/Environments • PigLatin • HiveQL • Need to deal with mismatch between offline/online feature generation • Java/Scala APIs • Crunch (Cloudera) • Scoobi (NICTA) • Cascading (Concurrent) • Jaql (IBM) Copyright 2011 Cloudera Inc. All rights reserved
  • 17. Apache Mahout • The starting place for MapReduce-based machine learning algorithms • Not machine-learning-in-a-box • Custom tweaks/modifications are the rule • A disparate collection of algorithms for: • Recommendations • Clustering • Classification • Frequent Itemset Mining Copyright 2011 Cloudera Inc. All rights reserved
  • 18. Apache Mahout (cont.) • Best Library: Taste Recommender • Oldest project, most widely-deployed in production • SVD implementation is particularly active • Good Libraries: Online SGD • Does not use MapReduce • Vowpal Rabbit + AllReduce is faster, has L-BFGS option • Roll Your Own Instead: Naïve Bayes • Challenges • “Secret sauce” effect • Delta between Mahout + the cutting edge in ML Copyright 2011 Cloudera Inc. All rights reserved
  • 19. More Machine Learning Interfaces for Hadoop • Based on MapReduce • SystemML (IBM) • AllReduce (Vowpal Wabbit) • No MapReduce • Spark • R-Based Systems (Augment MapReduce with R) • Segue • RHIPE • RHadoop • Ricardo (IBM) Copyright 2011 Cloudera Inc. All rights reserved
  • 20. ML and Hadoop: Where Things are Headed Copyright 2011 Cloudera Inc. All rights reserved
  • 21. MRv2 and YARN • Eliminates JobTracker bottleneck • Separate Resource Manager/Scheduler • Individual jobs have their own task masters • Moves MapReduce into user-land • Enables Hadoop clusters to run all sorts of jobs • MPI (Hamster; MAPREDUCE-2911) • Native BSP (Giraph) • Spark • AllReduce, GraphLab Copyright 2011 Cloudera Inc. All rights reserved
  • 22. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia Copyright 2011 Cloudera Inc. All rights reserved
  • 23. Machine Learning on Multivariate Time Series • 1e5 writes/sec • Positive events are relatively rare • Feature extraction challenge • May not be clear what the right time horizon is • Tight SLAs • Very high stakes Copyright 2011 Cloudera Inc. All rights reserved
  • 24. An Academic Language For Feature Engineering • Feature extraction/selection is as important as model fitting • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc. • Academic literature on this problem is sparse and dispersed across multiple fields • NIPS 2003 • HCI, NLP, Information Retrieval, etc. • We need a common language for talking about these problems across disciplines Copyright 2011 Cloudera Inc. All rights reserved
  • 25. A Broader Ontology For Model Selection • Practical factors that enter into the “best” choice of model… • Data arrival rate • Data volume • Scoring latency • Model refresh time • Robustness/reliability • …in addition to the standard predictive power/simplicity tradeoffs Copyright 2011 Cloudera Inc. All rights reserved
  • 26. Questions? Want A Job? @josh_wills