Machine Learning and Hadoop
Present and Future

Josh Wills
Cloudera Data Science Team
September 6th, 2012
  
About Me

Copyright 2012 Cloudera Inc. All rights reserved
  
Outline

•  Part 1: Industrial Machine Learning
•  Part 2: ML and Hadoop: The State of the World
•  Part 3: ML and Hadoop: Where Things are Headed
  
(Academic) ML vs. (Academic) Statistics

“Machine learning is statistics minus any checking of models and assumptions.”
    -- Brian Ripley, UseR! 2004 (provocatively paraphrased)
  
Industrial Machine Learning: Truth #1

The thing that we are trying to predict is rarely the thing that we are trying to optimize.
  
Industrial Machine Learning: Truth #2

Systems precede algorithms.
  
Industrial Machine Learning: Truth #3

Practice Over Theory Blog
  
Implication

Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
  
Outline

•  Part 1: Industrial Machine Learning
•  Part 2: ML and Hadoop: The State of the World
•  Part 3: ML and Hadoop: Where Things are Headed
  
“Hadoop. It’s Where The Data Is.”
  
Hadoop Platform: Substrate

•  Commodity servers
•  Open source operating system
•  Open source configuration management
•  Open source coordination service
•  Open source file system API
•  Efficient and extensible open source file formats
•  Efficient and extensible open source RPC libraries
  
Hadoop Platform: MapReduce Frameworks

•  Languages/Environments
    •  Pig Latin (Apache)
    •  HiveQL (Apache)
    •  Jaql (IBM)
•  Java/Scala APIs
    •  Crunch (Apache Incubator)
    •  Scoobi (NICTA)
    •  Cascading (Concurrent)
    •  Pangool
  
ML and Hadoop: The State of the World
  
MapReduce

•  Great for:
    •  Data Preparation
    •  Feature Engineering
    •  Model Validation/Evaluation
•  Works Well For Certain Model Fitting Problems
    •  Collaborative Filtering Algorithms
    •  Expectation Maximization
    •  Decision Trees (PLANET; Gradient Boosted Decision Trees)
•  Not A Practical Option for Many Kinds of Problems
•  Way More Detail in the KDD 2011 Talk
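
As an illustration of why MapReduce suits feature engineering, here is a minimal single-process sketch of the map, shuffle, and reduce phases computing per-user click-through features. The click-log layout and the names `click_mapper`, `ctr_reducer`, and `user_features` are assumptions made for this example; they are not taken from Hadoop or from the talk.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, which a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def click_mapper(record):
    # Hypothetical record layout: (user_id, clicked).
    user_id, clicked = record
    yield user_id, (1, 1 if clicked else 0)


def ctr_reducer(user_id, values):
    # Sum impressions and clicks, then derive the click-through rate feature.
    impressions = sum(v[0] for v in values)
    clicks = sum(v[1] for v in values)
    return {"impressions": impressions, "clicks": clicks, "ctr": clicks / impressions}


def user_features(records):
    """Run the three phases in order over an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records, click_mapper)), ctr_reducer)
```

Each user's features depend only on that user's records, so the work splits cleanly by key. Iterative model fitting, which needs many passes over the same data, is where this shape tends to fit less well.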
  
  
Apache Mahout

•  The starting place for MapReduce-based machine learning algorithms
    •  Not machine-learning-in-a-box
    •  Custom tweaks/modifications are the rule
•  A disparate collection of algorithms for:
    •  Recommendations
    •  Clustering
    •  Classification
    •  Frequent Itemset Mining
  
Apache Mahout (cont.)

•  Best Library: Taste Recommender
    •  Oldest project, most widely deployed in production
    •  SVD implementation is particularly active
•  Good Libraries: Online SGD
    •  Does not use MapReduce
    •  Vowpal Wabbit is faster, has L-BFGS option
•  Roll Your Own Instead: Naïve Bayes
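
To make the "Online SGD" bullet concrete, here is a minimal sketch of online stochastic gradient descent for logistic regression. It updates the weights one example at a time, which is why this style of learner does not need MapReduce. This is not Mahout's or Vowpal Wabbit's implementation; the function names and the default learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """Probability that the label is 1 under the current weights."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def sgd_logistic(examples, n_features, learning_rate=0.1, epochs=1):
    """Stream over (features, label) pairs, updating the weights after each one."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log loss for one example is (prediction - label) * x.
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                weights[i] -= learning_rate * error * x
    return weights
```

Because only the weight vector is kept in memory, the examples can be read from disk or a socket in a single pass.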
  
	
  
  
The Ominous Challenges
  
1. The Secret Sauce Effect
  
2. Delta Between Mahout and the Cutting Edge
  
ML and Hadoop: Where Things are Headed
  
Moving Beyond MapReduce
  
The Contenders
  
AllReduce

•  Developed at Yahoo! Research
•  Defines the allreduce operation
    •  N machines each have a number => each machine has the sum of the numbers
•  At the heart of Vowpal Wabbit’s performance
•  Implemented in C++
•  Can be patched into Apache Hadoop and used today
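
The allreduce operation described above can be sketched in a single process as a tree reduce followed by a broadcast. This is only a simulation of the semantics (every node ends up holding the sum), not the C++ implementation the slide refers to; the binary-tree layout and the function names are assumptions made for this example.

```python
def allreduce_sum(local_values):
    """Return the value each simulated machine holds after an allreduce (sum)."""
    n = len(local_values)
    if n == 0:
        return []
    partial = list(local_values)
    # Reduce phase: in each round, node i absorbs the partial sum held by node i + stride.
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            if i + stride < n:
                partial[i] += partial[i + stride]
        stride *= 2
    # Broadcast phase: node 0 now holds the total and sends it to every node.
    total = partial[0]
    return [total] * n


def allreduce_vectors(local_vectors):
    """Element-wise allreduce, e.g. for summing per-machine gradient vectors."""
    summed = [allreduce_sum(list(column))[0] for column in zip(*local_vectors)]
    return [list(summed) for _ in local_vectors]
```

For example, `allreduce_sum([1, 2, 3, 4, 5])` leaves every one of the five simulated machines holding 15. Summing gradient vectors this way lets each machine apply the same update without a separate reduce job per iteration.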
  




  
Spark

•  Developed at Berkeley’s AMP Lab
•  Defines operations on distributed in-memory collections
•  Written in Scala
•  Supports reading from and writing to HDFS
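
To give a feel for "operations on distributed in-memory collections", here is a toy single-process stand-in: a collection split into partitions, with map, filter, and reduce applied partition by partition. This is not Spark's API or implementation; the class `ToyCollection` and its methods are hypothetical names used only for illustration.

```python
from functools import reduce as fold


class ToyCollection:
    """A toy, single-process stand-in for a partitioned in-memory collection."""

    def __init__(self, partitions):
        # Each partition is a plain list held in memory.
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        """Split items round-robin into num_partitions partitions."""
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return ToyCollection([fn(x) for x in part] for part in self.partitions)

    def filter(self, pred):
        """Keep only the elements for which pred is true."""
        return ToyCollection([x for x in part if pred(x)] for part in self.partitions)

    def reduce(self, fn):
        """Combine within each partition, then combine the partial results.

        fn should be associative and commutative; an empty collection raises TypeError.
        """
        partials = [fold(fn, part) for part in self.partitions if part]
        return fold(fn, partials)

    def collect(self):
        """Gather all elements into one list (partition order, not input order)."""
        return [x for part in self.partitions for x in part]
```

The point of keeping partitions in memory is that an iterative algorithm can call `map` and `reduce` on the same data many times without rereading it from disk between passes.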
  


  
GraphLab

•  Developed at CMU
•  Lower-level primitives
    •  (but higher than MPI)
•  Map/Reduce => Update/Sync
•  Flexible, allows for asynchronous computations
•  Reads from HDFS
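
The idea of update functions applied asynchronously can be sketched with a work queue: a vertex is recomputed from its neighbours' current values, and its out-neighbours are rescheduled only if its value changed. The sketch below applies this to an unnormalized PageRank variant in a single process. It is not GraphLab's API; the function name, the queue-based scheduler, and the tolerance are assumptions made for this example, and every link target is assumed to appear as a key of `out_links`.

```python
from collections import deque


def pagerank_async(out_links, damping=0.85, tolerance=1e-6):
    """Asynchronous, queue-driven PageRank over a dict of vertex -> list of targets."""
    vertices = list(out_links)
    # Build the reverse adjacency so an update can read its in-neighbours.
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    rank = {v: 1.0 for v in vertices}
    queue = deque(vertices)
    scheduled = set(vertices)

    while queue:
        v = queue.popleft()
        scheduled.discard(v)
        # Update function: recompute v from the *current* ranks of its in-neighbours,
        # without waiting for a global iteration barrier.
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1 - damping) + damping * incoming
        changed = abs(new_rank - rank[v]) > tolerance
        rank[v] = new_rank
        if changed:
            # Only vertices that read v need to be revisited.
            for w in out_links[v]:
                if w not in scheduled:
                    scheduled.add(w)
                    queue.append(w)
    return rank
```

Compared with a MapReduce formulation, work concentrates on the parts of the graph that are still changing instead of touching every vertex on every pass.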
  

  
How Things Measure Up
  
Speed vs. Reliability
  
Memory vs. Disk
  
C++ vs. JVM
  
Questions?

(Ask Anything. Anything At All.)

jwills@cloudera.com
  
