The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research

Join 451 Research analyst Matt Aslett, Cloudera CEO Mike Olson, and Cloudera customers RIM and YP (formerly AT&T Interactive) to learn:
» Why Cloudera customers have chosen CDH to get started with Hadoop
» The business value resulting from analyzing new data sources in new ways
» How Hadoop will change these customers’ businesses and industries over the next 3-5 years

Published in: Business, Technology
Notes
  • Hadoop typically solves two types of problems. Data processing is the first step after collection: data is combined and prepared, and features are extracted and curated. Advanced analytics is where science is applied: extracting and understanding models of how the business operates. The results are then integrated back into business operations. These go by different terms in different industries. The applicability of these solutions is broad. We’ve successfully deployed Hadoop and helped solve a diverse set of business problems.
  • Speak to the size and scope of the problem. Problems with handling ~100PB of data using traditional methods.
  • Lose data as pipelines progress. Going back for information after the fact is hard, if not impossible.
  • This is where Hadoop fit for us
  • But changing to Hadoop has bigger, more massive impacts overall. Things we couldn’t even consider doing are now feasible.
  • Transcript of "The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research"

    1. 1. THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD. Matt Aslett, Research Manager, 451 Research; Mike Olson, CEO, Cloudera; Bill Theisinger, Executive Director, Platform Data Services, YP; Aaron Wiebe, BlackBerry Infrastructure Architect, Research In Motion
    2. 2. Introducing our Speakers: Matt Aslett, Mike Olson, Bill Theisinger, Aaron Wiebe
    3. 3. Big Data, Total Data… Hadoop. Matt Aslett (@maslett): Research manager, data management and analytics. Total Data: • Assesses data management approaches in an era of ‘big data’ • Explores the drivers behind new approaches to data management and analytics • Explains the new and existing technologies used to store and process and deliver value from data. © 2012 by The 451 Group. All rights reserved
    4. 4. ‘Big Data’: “Big data” describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety. Volume: The volume of data is too large for traditional database software tools to cope with. Velocity: The data is being produced at a rate that is beyond the performance limits of traditional systems. Variety: The data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses.
    5. 5. ‘Total Data’: The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements. Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results. Exploration: The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query. Frequency: The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence. Dependency: The reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.
    6. 6. A virtuous circle? • Increased use of interactive applications and data-generating machines • New commercial opportunities for analyzing previously ignored data • Increased desire to store and process all available data • More economically feasible to store and process previously ignored data • New infrastructure investments to support new data processing software
    7. 7. What is Apache Hadoop? Distributed data storage (HDFS) and processing (MapReduce). Multiple associated data management projects. • Open source • Vendor-supported • Clusters of commodity servers • Storage of large data volumes • Structured, unstructured and semi-structured data • Flexible, schema-on-read processing • Complex data sets • Connectors to existing databases, data integration and business intelligence tools. Projects shown: Chukwa, Sqoop, ZooKeeper, Pig, HBase, Avro, Mahout, Flume, MapReduce, Whirr, Hama, HDFS, Hive, Hadoop Common.
    8. 8. What is Apache Hadoop for? Big-data storage: Hadoop as a platform for storing data that could not previously be efficiently stored. Big-data integration: Hadoop as a large-scale data ingestion/ETL layer that complements existing databases. Big-data analytics: Hadoop as a platform for new exploratory analytic applications.
    9. 9. THE EVOLUTION OF HADOOP: And how it’s used in the real world today. Mike Olson, CEO & Co-Founder, Cloudera
    10. 10. Fastest sort of a TB: 62 secs over 1,460 nodes. Sorted a PB in 16.25 hours over 3,658 nodes.
    11. 11. CORE HADOOP COMPONENTS: Apache Hadoop is a platform for data storage and processing that is… Scalable, Fault tolerant, Open source. Hadoop Distributed File System (HDFS): File Sharing & Data Protection Across Physical Servers. MapReduce: Distributed Computing Across Physical Servers. Has the Flexibility to Store and Mine Any Type of Data: Ask questions across structured and unstructured data that were previously impossible to ask or solve; Not bound by a single schema. Excels at Processing Complex Data: Scale-out architecture divides workloads across multiple nodes; Flexible file system eliminates ETL bottlenecks. Scales Economically: Can be deployed on commodity hardware; Open source platform guards against vendor lock. ©2011 Cloudera, Inc. All Rights Reserved.
    12. 12. 2008: CLOUDERA FOUNDED BY MIKE OLSON, AMR AWADALLAH & JEFF HAMMERBACHER. 2009: CDH: FIRST COMMERCIAL APACHE HADOOP DISTRIBUTION. 2009: HADOOP CREATOR DOUG CUTTING JOINS CLOUDERA. 2010: CLOUDERA MANAGER: FIRST MANAGEMENT APPLICATION FOR HADOOP. 2011: CLOUDERA REACHES 100 PRODUCTION CUSTOMERS. 2011: CLOUDERA UNIVERSITY EXPANDS TO 140 COUNTRIES. 2012: CLOUDERA ENTERPRISE 4: THE STANDARD FOR HADOOP IN THE ENTERPRISE. 2012: CLOUDERA CONNECT REACHES 300 PARTNERS. BEYOND…: TRANSFORMING HOW COMPANIES THINK ABOUT DATA. CHANGING THE WORLD ONE PETABYTE AT A TIME.
    13. 13. CLOUDERA ENTERPRISE: CLOUDERA SUPPORT: OUR TEAM OF EXPERTS ON CALL TO HELP YOU MEET YOUR SERVICE LEVEL AGREEMENTS (SLAS). CLOUDERA MANAGER: END-TO-END MANAGEMENT APPLICATION FOR THE DEPLOYMENT & OPERATION OF CDH. CDH: BIG DATA STORAGE, PROCESSING & ANALYTICS PLATFORM BASED ON APACHE HADOOP (100% OPEN SOURCE). EDUCATION: DEVELOPERS, ADMINISTRATORS, DATA SCIENTISTS, CERTIFICATION PROGRAMS. PROFESSIONAL SERVICES: USE CASE DISCOVERY, NEW HADOOP DEPLOYMENT, PROOF OF CONCEPT, PRODUCTION PILOTS, PROCESS & TEAM DEVELOPMENT, DEPLOYMENT CERTIFICATION.
    14. 14. • Cloudera’s software is never installed all by itself • It’s always deployed alongside mission-critical systems that represent enormous investment • Extracting value from data requires sharing it across boundaries and among systems. Goal: The right storage and the right processing in the right place at the right time. ©2012 Cloudera, Inc. All Rights Reserved.
    15. 15. ✛ Disparate data sources ✛ Disparate systems for transforming, processing and analyzing data ✛ Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements. All need to be connected for usability and to unlock the unique value of each.
    16. 16. Consulting Services; Cloudera University. Users and their tools: OPERATORS (Management Tools), ENGINEERS (IDE’s), ANALYSTS (BI / Analytics), BUSINESS USERS (Enterprise Reporting), CUSTOMERS (Web Application). Platform: Enterprise Data Warehouse; Cloudera Enterprise (• CDH • Cloudera Manager • Technical Support); Operational Rules Engines. Data sources: Logs, Files, Web Data, Relational Databases.
    17. 17. INDUSTRY: DATA PROCESSING / ADVANCED ANALYTICS. Web: Clickstream Sessionization / Social Network Analysis. Media: Engagement / Content Optimization. Telecom: Mediation / Network Analytics. Retail: Data Factory / Loyalty & Promotions. Financial: Trade Reconciliation / Fraud Analysis. Government: Signal Intelligence (SIGINT) / Entity Analysis. Biotech / Pharma: Genome Mapping / Sequencing Analysis.
    18. 18. 18
    19. 19. Hadoop@YP, Sept 26, 2012. William Theisinger, Executive Director, Platform Computing. © 2012 YP Holdings LLC Intellectual Property. All rights reserved. YP Holdings LLC, the YP Holdings LLC logo and all other YP Holdings LLC marks contained herein are trademarks of YP Holdings LLC Intellectual Property and/or YP Holdings LLC affiliated companies. All other marks contained herein are the property of their respective owners. (INTERNAL USE ONLY)
    20. 20. Challenges
    21. 21. What we were facing • Increasing volume of traffic data through our distribution network • Need for a system to support changing data complexity and detail • Adhere to tighter SLAs • Provide intra-day reporting • Benefit from the intelligence trapped in our data
    22. 22. Legacy processing flow [flow diagram; labels: Application, Log, Data Layer, ETL, Data processing, Data Load, Data Warehouse] • Drop reportable events on the floor • Loading multiple DBs • Processing time was significant • Reporting lag was in days, not hours • High maintainability required
    23. 23. Hadoop Platform
    24. 24. Hadoop processing flow [flow diagram; labels: Applications, LWES, Data Collection Layer, Hadoop Platform, Data Warehouse] • All ETL processing in Hadoop • Several systems integrate to Hadoop platform • All Java MapReduce with some Hive for end user and dependent systems • Reporting lag in hours, not days • Actual reduction in maintainability needs
    25. 25. Next Generation
    26. 26. Hadoop processing flow [flow diagram; labels: Applications, LWES, Data Collection Layer, Hadoop Platform, HBase Platform, Data Warehouse] • Migrating some reporting to HBase • Exposing core business KPIs via APIs • Replacing various data marts with HBase tables/schemas • Reducing TCO • Alignment of core skill sets
    27. 27. Hadoop @ Research In Motion. Aaron Wiebe, BlackBerry Infrastructure Architect
    28. 28. Internal Use Only. The Problem: 1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly). 2. Traditional systems unable to cope with both growth and access requests. 3. Total global dataset of ~100PB. Confidential and Proprietary
    29. 29. The Old Way [flow diagram; labels: Services, Filter and Split, Event Monitoring, Alerting, Streaming ETL, Complex Correlation, Streaming ETL, Data Warehouse, Archive Storage] • Focus on reducing data to required data set • Pipeline data flows to avoid hitting disk • Scalability issues at most stages • Going back to the Archive was really time consuming
    30. 30. The Hadoop Way [flow diagram; labels: Services, Filter and Split, Event Monitoring, Alerting, Hadoop, Archive Storage, ETL, Data Warehouse, Stage 1, Correlation, DWH] • Archive storage moved to HDFS • ETL processes converted to Hadoop (Pig+Hive) • Some data warehouse functions migrating to Hadoop
    31. 31. Real Results • 90% code base reduction for ETL tools • Example performance: a previous ad-hoc query would take around 4 days; it now takes 53 minutes • Significant capital cost reductions over previous system
    32. 32. Introducing our Speakers: Matt Aslett, Mike Olson, Bill Theisinger, Aaron Wiebe
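
As a rough check on the ad-hoc query figures quoted on slide 31 (around 4 days before, 53 minutes after), the ratio can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `speedup` is illustrative, and "around 4 days" is assumed to mean exactly 4 days, so the result is approximate.

```python
# Back-of-the-envelope check of the slide 31 figures.
# Assumption: "around 4 days" is treated as exactly 4 days.
MINUTES_PER_DAY = 24 * 60


def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times shorter the new duration is than the old one."""
    if before_minutes <= 0 or after_minutes <= 0:
        raise ValueError("durations must be positive")
    return before_minutes / after_minutes


before = 4 * MINUTES_PER_DAY  # previous ad-hoc query time, in minutes
after = 53                    # query time on Hadoop, in minutes
ratio = speedup(before, after)

print(f"Roughly {ratio:.0f}x faster")
```

Under that assumption the query runs a little over one hundred times faster, which is consistent with the speaker note that things previously not worth considering became feasible.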