
Experiences Evolving a New Analytical Platform: What Works and What's Missing

Presentation from the Industry track of the 2010 ACM SIGMOD conference.

  1. Saturday, June 12, 2010
  2. Evolving a New Analytical Platform: What Works and What’s Missing. Jeff Hammerbacher, Chief Scientist, Cloudera. June 8, 2010
  3. My Background (Thanks for Asking) ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Conceived, built, and led Data team at Facebook ▪ Nearly 30 amazing engineers and data scientists ▪ Several open source projects and research papers ▪ Founder of Cloudera ▪ Chief Scientist ▪ Also, check out the book “Beautiful Data”
  4. Presentation Outline ▪ BI: Science for Profit ▪ Need tools for whole research cycle ▪ SQL Server 2008 R2: defining the platform ▪ State of the Platform Ecosystem ▪ New Foundations: Hadoop ▪ Boiling the Frog ▪ Future developments ▪ Questions and Discussion
  5. BI is looking more like science (for profit)
  6. Jim Gray: Science entering Fourth Paradigm. “We have to do better at producing tools to support the whole research cycle”
  7. RDBMS only a small part of this tool set
  8. Example: SQL Server 2008 R2
  9. RDBMS: SQL Server
  10. ETL: SQL Server Integration Services; RDBMS: SQL Server
  11. ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services
  12. ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services
  13. ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services; Search: Full-Text Search
  14. CEP: StreamInsight; ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services; Search: Full-Text Search
  15. CEP: StreamInsight; ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services; Search: Full-Text Search; OLAP: PowerPivot
  16. MDM: Master Data Services; CEP: StreamInsight; ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services; Search: Full-Text Search; OLAP: PowerPivot
  17. Collaboration: SharePoint; MDM: Master Data Services; CEP: StreamInsight; ETL: SQL Server Integration Services; RDBMS: SQL Server; Reporting: SQL Server Reporting Services; Analysis: SQL Server Analysis Services; Search: Full-Text Search; OLAP: PowerPivot
  18. What do we call this unified suite?
  19. For today: Analytical Data Platform
  20. Who makes up the platform ecosystem?
  21. Platform Providers
  22. Infrastructure Providers; Platform Providers
  23. Infrastructure Providers; Platform Providers; Application Developers
  24. Content Providers; Infrastructure Providers; Platform Providers; Application Developers
  25. Content Providers; Infrastructure Providers; Platform Providers; Application Developers; End Users
  26. What is new about the ecosystem today?
  27. Content Providers: (1) > 95% of enterprise data is unstructured; (2) Data volumes growing rapidly
  28. Infrastructure Providers: (1) Cloud; (2) Warehouse-Scale Computers
  29. Platform Providers: (1) Open source; (2) Driven by consumer web properties
  30. Application Developers: (1) Data Scientists; (2) Diversity of languages
  31. End Users: (1) Move beyond reporting to analytics; (2) Make use of all enterprise data
  32. New foundations: HDFS and MapReduce
  33. (This is what boiling a frog feels like)
  34. 2005: Doug/Mike start project inside Nutch
  35. 2006: Doug joins Yahoo!
  36. 2007: Make Hadoop scale
  37. 2007: Make Hadoop scale; Yahoo! makes Pig open source
  38. Jim Gray’s “Fourth Paradigm” lecture; 2007: Make Hadoop scale; Yahoo! makes Pig open source
  39. Randy Bryant’s “DISC” lecture; Jim Gray’s “Fourth Paradigm” lecture; 2007: Make Hadoop scale; Yahoo! makes Pig open source
  40. Randy Bryant’s “DISC” lecture; Jim Gray’s “Fourth Paradigm” lecture; 2007: Make Hadoop scale; Yahoo! makes Pig open source; Powerset makes HBase open source
  41. 2008: Make Hadoop fast
  42. 2008: Make Hadoop fast; Yahoo! wins Daytona terabyte sort benchmark
  43. First Hadoop Summit; 2008: Make Hadoop fast; Yahoo! wins Daytona terabyte sort benchmark
  44. First Hadoop Summit; 2008: Make Hadoop fast; Yahoo! wins Daytona terabyte sort benchmark; Yahoo! builds production webmap with Hadoop
  45. Facebook makes Hive open source; First Hadoop Summit; 2008: Make Hadoop fast; Yahoo! wins Daytona terabyte sort benchmark; Yahoo! builds production webmap with Hadoop
  46. “MapReduce: A Major Step Backwards”; Facebook makes Hive open source; First Hadoop Summit; 2008: Make Hadoop fast; Yahoo! wins Daytona terabyte sort benchmark; Yahoo! builds production webmap with Hadoop
  47. 2009: Insert Hadoop into the enterprise
  48. 2009: Insert Hadoop into the enterprise; Cloudera releases CDH
  49. First Hadoop World NYC; 2009: Insert Hadoop into the enterprise; Cloudera releases CDH
  50. Yahoo! sorts a petabyte with Hadoop; First Hadoop World NYC; 2009: Insert Hadoop into the enterprise; Cloudera releases CDH
  51. Yahoo! sorts a petabyte with Hadoop; First Hadoop World NYC; 2009: Insert Hadoop into the enterprise; Cloudera releases CDH; Cloudera adds training, support, services
  52. “The Unreasonable Effectiveness of Data”; Yahoo! sorts a petabyte with Hadoop; First Hadoop World NYC; 2009: Insert Hadoop into the enterprise; Cloudera releases CDH; Cloudera adds training, support, services
  53. 2010: Integrate Hadoop into the enterprise
  54. 2010: Integrate Hadoop into the enterprise; IBM announces InfoSphere BigInsights
  55. Yahoo! completes enterprise-class security; 2010: Integrate Hadoop into the enterprise; IBM announces InfoSphere BigInsights
  56. Yahoo! completes enterprise-class security; 2010: Integrate Hadoop into the enterprise; IBM announces InfoSphere BigInsights; Datameer and Karmasphere funded
  57. Teradata, Pentaho, and others integrate; Yahoo! completes enterprise-class security; 2010: Integrate Hadoop into the enterprise; IBM announces InfoSphere BigInsights; Datameer and Karmasphere funded
  58. Hive adds JDBC and ODBC; Teradata, Pentaho, and others integrate; Yahoo! completes enterprise-class security; 2010: Integrate Hadoop into the enterprise; IBM announces InfoSphere BigInsights; Datameer and Karmasphere funded
  59. Hadoop will be an Analytical Data Platform
  60. What’s Next?
  61. Capture: Log collection and CEP
  62. Curate: Workflow and Scheduling
  63. Curate: Secondary and Full-Text Indexing
  64. Curate: Learn Structure from Data
  65. Analyze: Mesos-enabled frameworks
  66. Analyze: Link local and global data
  67. All behind a single pane of glass
  68. Cloudera Desktop: Making Many Computers Feel Like One
  69. (c) 2010 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
