Thursday, May 13, 2010
Evolving a New Analytical Platform
What Works and What’s Missing

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
  1. Thursday, May 13, 2010
  2. Evolving a New Analytical Platform
     What Works and What’s Missing
     Jeff Hammerbacher
     Chief Scientist and Vice President of Products, Cloudera
     May 13, 2010
  3. My Background
     Thanks for Asking
     ▪ hammer@cloudera.com
     ▪ Studied Mathematics at Harvard
     ▪ Worked as a Quant on Wall Street
     ▪ Conceived, built, and led Data team at Facebook
     ▪ Nearly 30 amazing engineers and data scientists
     ▪ Several open source projects and research papers
     ▪ Founder of Cloudera
     ▪ Vice President of Products and Chief Scientist
     ▪ Also, check out the book “Beautiful Data”
  4. Presentation Outline
     ▪ Architectures for large scale data analysis
     ▪ Reference architecture: ETL, DW, BI, Analytics
     ▪ New foundations: HDFS and MapReduce
     ▪ SQL Server 2008 R2
     ▪ The new platform emerges
     ▪ Building a new platform
     ▪ Motivations
     ▪ Implementation
     ▪ Questions and Discussion
  5. Summary of the Presentation
     (I have a short attention span, too)
     ▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.
     ▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.
     ▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.
  6. Experiences at Facebook
     Early 2006: The First Research Scientist
     ▪ Source data living on horizontally partitioned MySQL tier
     ▪ Intensive historical analysis difficult
     ▪ No way to assess impact of changes to the site
     ▪ First try: Python scripts pull data into MySQL
     ▪ Second try: Python scripts pull data into Oracle
     ▪ ...and then we turned on impression logging
  7. Facebook Data Infrastructure
     2007
     [Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
     ▪ “Data Warehousing”
     ▪ Began with Oracle database
     ▪ Schedule data collection via cron
     ▪ Collect data every 24 hours
     ▪ “ETL” scripts: hand-coded Python
     ▪ Data volumes quickly grew
     ▪ Started at tens of GB in early 2006
     ▪ Up to about 1 TB per day by mid-2007
     ▪ Log files largest source of data growth
  8. Facebook Data Infrastructure
     2008
     [Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
  9. SQL Server 2008 R2
     Old Features
     ▪ ETL: SQL Server Integration Services
     ▪ DW: SQL Server
     ▪ Reporting: SQL Server Reporting Services
     ▪ Analytics: SQL Server Analysis Services
     ▪ Search: Full-Text Search
  10. SQL Server 2008 R2
      New Features
      ▪ Stream management: StreamInsight
      ▪ OLAP: PowerPivot
      ▪ Collaboration: SharePoint
      ▪ MDM: Master Data Services
      ▪ Scale-out: Parallel Data Warehouse
  11. A New Foundation
      Motivations and Implementation
      ▪ Orders of magnitude growth in data volumes and complexity
      ▪ Often from machine-generated logs
      ▪ Complex data is vast majority of data
      ▪ Built by consumer web teams and not enterprise software firms
      ▪ Open source
      ▪ Modular collection of tools, not an opaque abstraction
      ▪ Applications, not just analysis
      ▪ Solve user needs, don’t implement a spec
  12. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0