Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011
High Availability for Data Scientists




                 NIPS




                Copyright 2011 Cloudera Inc. All rights reserved
Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Industrial Machine Learning




Delta One: Model Evaluation

• ML Systems Are One Piece of a Complex System
• Well-defined objective functions are the exception
   • Multiple, often conflicting goals
   • Weights are fuzzy and shift with business priorities
   • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
• Examples
   • Computational advertising
   • Friend recommendations on social networks
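
The Pareto point above can be made concrete with a minimal sketch. It assumes every goal is a "higher is better" metric; the model names, metric names, and numbers are invented for illustration and are not from the talk.

```python
from typing import Dict, List

# Hypothetical candidates scored on two conflicting goals (higher is better).
CANDIDATES: Dict[str, Dict[str, float]] = {
    "model_a": {"click_rate": 0.10, "revenue": 1.0},
    "model_b": {"click_rate": 0.12, "revenue": 0.8},
    "model_c": {"click_rate": 0.09, "revenue": 0.9},
    "model_d": {"click_rate": 0.11, "revenue": 1.1},
}


def dominates(p: Dict[str, float], q: Dict[str, float]) -> bool:
    """True if p is at least as good as q on every metric and better on one."""
    return all(p[k] >= q[k] for k in q) and any(p[k] > q[k] for k in q)


def pareto_front(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, score in candidates.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


if __name__ == "__main__":
    print(pareto_front(CANDIDATES))
```

Choosing among the surviving candidates is still a business decision, but the choice no longer depends on fixing exact weights in advance.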


Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
   • (and don’t usually launch)
• Industrial Computational Infrastructure
   • General-purpose
   • Cheap
   • Shared
• Constraints Drive Innovation
   • Vowpal Wabbit Hashing Trick
   • SETI @ Google
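
As an illustration of the hashing trick named above, here is a minimal sketch of the general idea: feature names are hashed straight into a fixed-size vector, so no feature dictionary has to be kept in memory. It is not Vowpal Wabbit's implementation; the MD5 hash is a stand-in chosen only because it is deterministic in the standard library, and the function name and bucket count are assumptions.

```python
import hashlib
from typing import Iterable, List


def hashed_features(tokens: Iterable[str], num_buckets: int = 16) -> List[float]:
    """Map arbitrary string features into a fixed-length count vector."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as an integer hash value.
        h = int.from_bytes(digest[:8], "big")
        # Distinct features may collide in a bucket; that is the accepted cost.
        vec[h % num_buckets] += 1.0
    return vec


if __name__ == "__main__":
    print(hashed_features(["user=42", "ad=7", "user=42"], num_buckets=8))
```

Memory use is bounded by `num_buckets` regardless of how many distinct features appear, which is the constraint-driven property the slide refers to.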


Delta Three: Workflow




                                                                 Practice Over Theory Blog



Delta Three: Workflow

• Optimize the Overall Process
   • Model fitting is a small piece of the overall flow time
   • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
   • Common Feature Extraction Logic
   • Servable Models
• Validation as Sanity Checking
   • Deploy to a small subset of real data and evaluate
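
One way to read "Common Feature Extraction Logic" is that the offline training pipeline and the online scoring path call the same function. The sketch below shows that arrangement under assumed names; the event fields, feature names, and linear scorer are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def extract_features(event: Dict[str, object]) -> Dict[str, float]:
    """Single source of truth for features, used both offline and online."""
    query = str(event.get("query", ""))
    price = float(event.get("price", 0.0))
    return {
        "query_length": float(len(query.split())),
        "log_price": math.log1p(price),
    }


def build_training_rows(events: Iterable[Dict[str, object]]) -> List[Dict[str, float]]:
    """Offline path: batch feature extraction over logged events."""
    return [extract_features(e) for e in events]


def score_online(event: Dict[str, object], weights: Dict[str, float]) -> float:
    """Online path: same extraction, then a linear score with given weights."""
    feats = extract_features(event)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


if __name__ == "__main__":
    event = {"query": "red shoes", "price": 0.0}
    print(build_training_rows([event]))
    print(score_online(event, {"query_length": 0.5}))
```

Because both paths share `extract_features`, a change to a feature definition cannot silently diverge between training and serving.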


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Hadoop: It’s Where The Data Is




Hadoop Platform: Substrate

• Commodity servers
   • Open Compute
• Open source operating system
   • Linux
• Open source configuration management
   • Puppet
   • Chef
• Coordination service
   • ZooKeeper


Hadoop Platform: Storage

• Distributed schema-less storage
   • HDFS
   • Ceph
• Append-only storage formats and metadata
   • Avro
   • RCFile
   • HCatalog
• Mutable key-value storage and metadata
   • HBase


Hadoop Platform: Integration

• Tool Access
   • FUSE
   • JDBC
   • ODBC
• Data Ingestion
   • Flume
   • Sqoop




ML and Hadoop: The State of the World




Computation: Plain Old MapReduce

• Great for:
   • Data Preparation
   • Feature Engineering
   • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
   • Recommendation Systems
   • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
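
To show why model evaluation fits plain MapReduce, the sketch below expresses a confusion-matrix count as a map step and a reduce step and runs both in memory. It does not use the Hadoop API; the record layout (label, score), the 0.5 threshold, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, float]  # (true label, model score), hypothetical layout


def mapper(record: Record) -> Iterator[Tuple[str, int]]:
    """Emit one confusion-matrix cell per scored record."""
    label, score = record
    predicted = 1 if score >= 0.5 else 0
    if predicted == 1:
        yield ("tp" if label == 1 else "fp"), 1
    else:
        yield ("fn" if label == 1 else "tn"), 1


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records: Iterable[Record]) -> Dict[str, int]:
    """In-memory stand-in for the shuffle: group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    data = [(1, 0.9), (0, 0.8), (1, 0.2), (0, 0.1), (1, 0.7)]
    counts = run_job(data)
    print(counts, counts["tp"] / (counts["tp"] + counts["fp"]))
```

Each record is processed independently and the reduce is a simple sum, so the job parallelizes without coordination between mappers.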


Tools for Data Preparation/Feature Engineering

• Languages/Environments
   • Pig Latin
   • HiveQL
   • Need to deal with mismatch between offline/online feature
     generation
• Java/Scala APIs
   •   Crunch (Cloudera)
   •   Scoobi (NICTA)
   •   Cascading (Concurrent)
   •   Jaql (IBM)

Apache Mahout

• The starting place for MapReduce-based machine
  learning algorithms
   • Not machine-learning-in-a-box
   • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
   •   Recommendations
   •   Clustering
   •   Classification
   •   Frequent Itemset Mining



Apache Mahout (cont.)

• Best Library: Taste Recommender
   • Oldest project, most widely deployed in production
   • SVD implementation is particularly active
• Good Libraries: Online SGD
   • Does not use MapReduce
   • Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
   • “Secret sauce” effect
   • Delta between Mahout and the cutting edge in ML

More Machine Learning Interfaces for Hadoop

• Based on MapReduce
  • SystemML (IBM)
  • AllReduce (Vowpal Wabbit)
• No MapReduce
  • Spark
• R-Based Systems (Augment MapReduce with R)
  •   Segue
  •   RHIPE
  •   RHadoop
  •   Ricardo (IBM)

ML and Hadoop: Where Things are Headed




MRv2 and YARN

• Eliminates JobTracker bottleneck
   • Separate Resource Manager/Scheduler
   • Individual jobs have their own task masters
• Moves MapReduce into user-land
• Enables Hadoop clusters to run all sorts of jobs
   •   MPI (Hamster; MAPREDUCE-2911)
   •   Native BSP (Giraph)
   •   Spark
   •   AllReduce, GraphLab


Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Things Industry Needs From Academia




Machine Learning on Multivariate Time Series

 • 1e5 writes/sec
 • Positive events are
   relatively rare
 • Feature extraction
   challenge
 • May not be clear what
   the right time horizon is
 • Tight SLAs
 • Very high stakes
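
The time-horizon question above can be sketched as computing the same aggregates over several trailing windows and letting later feature selection decide which horizon matters. The event layout (timestamp in seconds, value), the choice of count and mean, and the feature naming are all assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, observed value)


def window_features(
    events: Sequence[Event], now: float, horizons: Sequence[float]
) -> Dict[str, float]:
    """Compute count and mean over each trailing window (now - h, now]."""
    feats: Dict[str, float] = {}
    for h in horizons:
        values = [v for t, v in events if now - h < t <= now]
        feats[f"count_{h:g}s"] = float(len(values))
        # An empty window yields a mean of 0.0 in this sketch.
        feats[f"mean_{h:g}s"] = sum(values) / len(values) if values else 0.0
    return feats


if __name__ == "__main__":
    events = [(0.0, 1.0), (50.0, 3.0), (58.0, 5.0)]
    print(window_features(events, now=60.0, horizons=[5, 60]))
```

At write rates like those on the slide, a production version would maintain these aggregates incrementally rather than rescanning the events; this sketch only shows the feature definitions.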

An Academic Language For Feature Engineering

• Feature extraction/selection is as important as model
  fitting
   • e.g., hierarchical feature representation, impact on training
     time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and
  dispersed across multiple fields
   • NIPS 2003
   • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these
  problems across disciplines

A Broader Ontology For Model Selection

• Practical factors that enter into the “best” choice of
  model…
   •   Data arrival rate
   •   Data volume
   •   Scoring latency
   •   Model refresh time
   •   Robustness/reliability
• …in addition to the standard predictive power/simplicity
  tradeoffs
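
As a rough sketch of how such practical factors might enter model selection, the example below treats scoring latency and refresh time as hard constraints and only then compares predictive power. The candidate names, numbers, and thresholds are invented, and treating the factors as hard limits is one assumption among many possible formulations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    accuracy: float            # offline predictive power
    scoring_latency_ms: float  # time to score one example
    refresh_minutes: float     # time to retrain on fresh data


def choose_model(
    candidates: List[Candidate], max_latency_ms: float, max_refresh_minutes: float
) -> Optional[Candidate]:
    """Pick the most accurate candidate that meets the operational limits."""
    feasible = [
        c for c in candidates
        if c.scoring_latency_ms <= max_latency_ms
        and c.refresh_minutes <= max_refresh_minutes
    ]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None


if __name__ == "__main__":
    pool = [
        Candidate("deep", 0.95, 200.0, 600.0),
        Candidate("linear", 0.90, 2.0, 10.0),
        Candidate("trees", 0.93, 20.0, 120.0),
    ]
    print(choose_model(pool, max_latency_ms=50.0, max_refresh_minutes=180.0))
```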


Questions?
Want A Job?
  @josh_wills

Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Machine Learning and Hadoop: Present and future

  • 1. Machine Learning and Hadoop Present and Future Josh Wills, Tom Pierce, and Jeff Hammerbacher Cloudera Data Science Team December 17th, 2011
  • 2. High Availability for Data Scientists NIPS Copyright 2011 Cloudera Inc. All rights reserved
  • 3. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 4. Industrial Machine Learning
  • 5. Delta One: Model Evaluation • ML Systems Are One Piece of a Complex System • Well-defined objective functions are the exception • Multiple, often conflicting goals • Weights are fuzzy and shift with business priorities • Pareto optimization is the safest play • Predictive Accuracy Is Only Useful Up to a Point • Examples • Computational advertising • Friend recommendations on social networks
  • 6. Delta Two: Systems Precede Algorithms • Greenfield Projects Hardly Ever Happen • (and don’t usually launch) • Industrial Computational Infrastructure • General-purpose • Cheap • Shared • Constraints Drive Innovation • Vowpal Wabbit Hashing Trick • SETI @ Google
  • 7. Delta Three: Workflow Practice Over Theory Blog
  • 8. Delta Three: Workflow • Optimize the Overall Process • Model fitting is a small piece of the overall flow time • Parallelize everything • Better Features > Better Models • Fast Model Deployment • Common Feature Extraction Logic • Servable Models • Validation as Sanity Checking • Deploy to a small subset of real data and evaluate
  • 9. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 10. Hadoop: It’s Where The Data Is
  • 11. Hadoop Platform: Substrate • Commodity servers • Open Compute • Open source operating system • Linux • Open source configuration management • Puppet • Chef • Coordination service • ZooKeeper
  • 12. Hadoop Platform: Storage • Distributed schema-less storage • HDFS • Ceph • Append-only storage formats and metadata • Avro • RCFile • HCatalog • Mutable key-value storage and metadata • HBase
  • 13. Hadoop Platform: Integration • Tool Access • FUSE • JDBC • ODBC • Data Ingestion • Flume • Sqoop
  • 14. ML and Hadoop: The State of the World
  • 15. Computation: Plain Old MapReduce • Great for: • Data Preparation • Feature Engineering • Model Validation/Evaluation • Works For Certain Model Fitting Problems • Recommendation Systems • Decision Trees (PLANET; Gradient Boosted Decision Trees) • Not A Practical Option for Online Learning • Way More Detail from the KDD 2011 Talk
  • 16. Tools for Data Preparation/Feature Engineering • Languages/Environments • PigLatin • HiveQL • Need to deal with mismatch between offline/online feature generation • Java/Scala APIs • Crunch (Cloudera) • Scoobi (NICTA) • Cascading (Concurrent) • Jaql (IBM)
  • 17. Apache Mahout • The starting place for MapReduce-based machine learning algorithms • Not machine-learning-in-a-box • Custom tweaks/modifications are the rule • A disparate collection of algorithms for: • Recommendations • Clustering • Classification • Frequent Itemset Mining
  • 18. Apache Mahout (cont.) • Best Library: Taste Recommender • Oldest project, most widely-deployed in production • SVD implementation is particularly active • Good Libraries: Online SGD • Does not use MapReduce • Vowpal Wabbit + AllReduce is faster, has L-BFGS option • Roll Your Own Instead: Naïve Bayes • Challenges • “Secret sauce” effect • Delta between Mahout + the cutting edge in ML
  • 19. More Machine Learning Interfaces for Hadoop • Based on MapReduce • SystemML (IBM) • AllReduce (Vowpal Wabbit) • No MapReduce • Spark • R-Based Systems (Augment MapReduce with R) • Segue • RHIPE • RHadoop • Ricardo (IBM)
  • 20. ML and Hadoop: Where Things are Headed
  • 21. MRv2 and YARN • Eliminates JobTracker bottleneck • Separate Resource Manager/Scheduler • Individual jobs have their own task masters • Moves MapReduce into user-land • Enables Hadoop clusters to run all sorts of jobs • MPI (Hamster; MAPREDUCE-2911) • Native BSP (Giraph) • Spark • AllReduce, GraphLab
  • 22. Agenda • Part 1: Industrial Machine Learning • Part 2: Machine Learning and Hadoop • State of the World • Where Things Are Headed • Part 3: Things Industry Needs From Academia
  • 23. Machine Learning on Multivariate Time Series • 1e5 writes/sec • Positive events are relatively rare • Feature extraction challenge • May not be clear what the right time horizon is • Tight SLAs • Very high stakes
  • 24. An Academic Language For Feature Engineering • Feature extraction/selection is as important as model fitting • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc. • Academic literature on this problem is sparse and dispersed across multiple fields • NIPS 2003 • HCI, NLP, Information Retrieval, etc. • We need a common language for talking about these problems across disciplines
  • 25. A Broader Ontology For Model Selection • Practical factors that enter into the “best” choice of model… • Data arrival rate • Data volume • Scoring latency • Model refresh time • Robustness/reliability • …in addition to the standard predictive power/simplicity tradeoffs
  • 26. Questions? Want A Job? @josh_wills
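
Slide 6 names the "Vowpal Wabbit Hashing Trick" as an example of constraints driving innovation but does not spell it out. The sketch below illustrates the general idea of feature hashing: string features are mapped into a fixed number of buckets so that memory stays bounded no matter how many distinct features appear. It is not Vowpal Wabbit's implementation; the function name `hash_features`, the `num_buckets` parameter, the use of MD5, and the signed update are all assumptions chosen for illustration.

```python
import hashlib


def hash_features(tokens, num_buckets=2 ** 18):
    """Map string features to a sparse vector of fixed dimension.

    Returns a dict {bucket_index: value}. Distinct tokens may collide
    in the same bucket; that is the accepted cost of bounded memory.
    """
    vec = {}
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # The first 8 bytes of the digest pick the bucket.
        idx = int.from_bytes(digest[:8], "little") % num_buckets
        # A separate byte picks a sign, so collisions tend to cancel
        # rather than always add up (an illustrative choice).
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec


if __name__ == "__main__":
    # Hypothetical feature strings, for demonstration only.
    print(hash_features(["user=42", "ad=7", "query=hadoop"], num_buckets=16))
```

Because the mapping depends only on the token and the bucket count, the same function can run offline and at serving time without a shared feature dictionary, which fits the "Common Feature Extraction Logic" point on slide 8.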