Tuesday, June 8, 2010
Evolving a New Analytical Platform
         What Works and What’s Missing


         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         June 8, 2010



My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Presentation Outline
         ▪   BI: Science for Profit
             ▪   Need tools for whole research cycle
             ▪   SQL Server 2008 R2: defining the platform
         ▪   State of the Platform Ecosystem
         ▪   New Foundations: Hadoop
             ▪   Boiling the Frog
             ▪   Future developments
         ▪   Questions and Discussion




BI is looking more like science (for profit)




Jim Gray: Science entering Fourth Paradigm
            “We have to do better at producing tools to
                 support the whole research cycle”




RDBMS only a small part of this tool set




Example: SQL Server 2008 R2




Collaboration: SharePoint
                             MDM: Master Data Services
                                 CEP: StreamInsight
                        ETL: SQL Server Integration Services
                               RDBMS: SQL Server
                 Reporting: SQL Server Reporting Services
                  Analysis: SQL Server Analysis Services
                         Search: Full-Text Search
                             OLAP: PowerPivot


What do we call this unified suite?




For today: Analytical Data Platform




Who makes up the platform ecosystem?




Content Providers
                        Infrastructure Providers
                          Platform Providers
                        Application Developers
                              End Users




What is new about the ecosystem today?




Content Providers
            1. More than 95% of enterprise data is unstructured
                  2. Data volumes growing rapidly




Infrastructure Providers
                                  1. Cloud
                        2. Warehouse-Scale Computers




Platform Providers
                                   1. Open source
                        2. Driven by consumer web properties




Application Developers
                           1. Data Scientists
                        2. Diversity of languages




End Users
                        1. Move beyond reporting to analytics
                         2. Make use of all enterprise data




New foundations: HDFS and MapReduce




(This is what boiling a frog feels like)




2005: Doug Cutting and Mike Cafarella start the project inside Nutch




2006: Doug Cutting joins Yahoo!




Randy Bryant’s “DISC” lecture
                        Jim Gray’s “Fourth Paradigm” lecture
                            2007: Make Hadoop scale
                           Yahoo! makes Pig open source
                         Powerset makes HBase open source




“MapReduce: A Major Step Backwards”
                          Facebook makes Hive open source
                                First Hadoop Summit
                             2008: Make Hadoop fast
            Yahoo! wins Daytona terabyte sort benchmark
            Yahoo! builds production webmap with Hadoop




“The Unreasonable Effectiveness of Data”
                   Yahoo! sorts a petabyte with Hadoop
                          First Hadoop World NYC
                  2009: Insert Hadoop into the enterprise
                         Cloudera releases CDH
               Cloudera adds training, support, services




Hive adds JDBC and ODBC
               Teradata, Pentaho, and others integrate
              Yahoo! completes enterprise-class security
             2010: Integrate Hadoop into the enterprise
                        IBM announces InfoSphere BigInsights
                          Datameer and Karmasphere funded




Hadoop will be an Analytical Data Platform




What’s Next?




Capture: Log collection and CEP




Curate: Workflow and Scheduling




Curate: Secondary and Full-Text Indexing




Curate: Learn Structure from Data




Analyze: Mesos-enabled frameworks




Analyze: Link local and global data




All behind a single pane of glass




Cloudera Desktop
                        Making Many Computers Feel Like One




(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0



