The Business Advantage of Hadoop: Lessons from the Field – Cloudera Summer Webinar Series: 451 Research
 

Join 451 analyst Matt Aslett, Cloudera CEO Mike Olson, and Cloudera customers RIM and YP (formerly AT&T Interactive) to learn:
» Why Cloudera customers have chosen CDH to get started with Hadoop
» The business value resulting from analyzing new data sources in new ways
» How Hadoop will change these customers’ business and industry over the next 3-5 years

  • Hadoop typically solves two types of problems. Data processing is the first step after collection: data is combined and prepared, and features are extracted and curated. Advanced analytics is where science is applied: extracting and understanding models of how the business operates. The results are then integrated back into business operations. These go by different terms in different industries. The applicability of these solutions is broad. We’ve successfully deployed Hadoop and helped solve a diverse set of business problems.
  • Speak to the size and scope of the problem. Problems with handling ~100PB of data using traditional methods.
  • Lose data as pipelines progress. Going back for information after the fact is hard, if not impossible.
  • This is where Hadoop fit for us.
  • But changing to Hadoop has bigger, more massive impacts overall. Things we couldn’t even consider doing are now feasible.

Presentation Transcript

  • THE BUSINESS ADVANTAGE OF HADOOP: LESSONS FROM THE FIELD
    • Matt Aslett, Research Manager, 451 Research
    • Mike Olson, CEO, Cloudera
    • Bill Theisinger, Executive Director, Platform Data Services, YP
    • Aaron Wiebe, BlackBerry Infrastructure Architect, Research In Motion
  • Introducing our Speakers: Matt Aslett, Mike Olson, Bill Theisinger, Aaron Wiebe
  • Big Data, Total Data… Hadoop
    • Matt Aslett - @maslett: Research manager, data management and analytics
    • Total Data:
      • Assesses data management approaches in an era of ‘big data’
      • Explores the drivers behind new approaches to data management and analytics
      • Explains the new and existing technologies used to store, process and deliver value from data
    © 2012 by The 451 Group. All rights reserved
  • ‘Big Data’: “Big data” describes the realization of greater business intelligence by storing, processing and analyzing data that was previously ignored due to the limitations of traditional data management technologies to handle its volume, velocity and/or variety.
    • Volume: The volume of data is too large for traditional database software tools to cope with
    • Velocity: The data is being produced at a rate that is beyond the performance limits of traditional systems
    • Variety: The data lacks the structure to make it suitable for storage and analysis in traditional databases and data warehouses
  • ‘Total Data’: The adoption of non-traditional data processing technologies is driven not just by the nature of the data, but also by the user’s particular data processing requirements.
    • Totality: The desire to process and analyze data in its entirety, rather than analyzing a sample of data and extrapolating the results.
    • Exploration: The interest in exploratory analytic approaches, in which schema is defined in response to the nature of the query.
    • Frequency: The desire to increase the rate of analysis in order to generate more accurate and timely business intelligence.
    • Dependency: The reliance on existing technologies and skills, and the need to balance investment in those existing technologies and skills with the adoption of new techniques.
  • A virtuous circle?
    • Increased use of interactive applications and data-generating machines
    • New commercial opportunities for analyzing previously ignored data
    • Increased desire to store and process all available data
    • More economically feasible to store and process previously ignored data
    • New infrastructure investments to support new data processing software
  • What is Apache Hadoop? Distributed data storage (HDFS) and processing (MapReduce) Multiple associated data management projects • Open source • Vendor-supported Chukwa Sqoop ZooKeeper Pig • Clusters of commodity servers HBase Avro Mahout Flume • Storage of large data volumes • Structured, unstructured and MapReduce Whirr semi-structured data • Flexible, schema-on-read Hama processing HDFS Hive • Complex data sets • Connectors to existing Hadoop Common databases, data integration and business intelligence tools © 2012 by The 451 Group. All rights reserved
  • What is Apache Hadoop for?
    • Big-data storage: Hadoop as a platform for storing data that could not previously be efficiently stored.
    • Big-data integration: Hadoop as a large scale data ingestion/ETL layer that complements existing databases.
    • Big-data analytics: Hadoop as a platform for new exploratory analytic applications.
  • THE EVOLUTION OF HADOOP: And how it’s used in the real world today
    Mike Olson, CEO & Co-Founder, Cloudera
  • Fastest sort of a TB: 62 secs over 1,460 nodes. Sorted a PB in 16.25 hours over 3,658 nodes.
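The sort figures on this slide can be turned into rough throughput numbers. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slide does not state, and the helper names `aggregate_rate` and `per_node_rate` are hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def aggregate_rate(total_bytes, seconds):
    """Bytes sorted per second across the whole cluster."""
    return total_bytes / seconds


def per_node_rate(total_bytes, seconds, nodes):
    """Bytes sorted per second, averaged over the nodes in the run."""
    return aggregate_rate(total_bytes, seconds) / nodes


# (data size, elapsed seconds, node count) taken from the slide
RUNS = {
    "1 TB sort": (TB, 62, 1460),
    "1 PB sort": (PB, 16.25 * 3600, 3658),
}

if __name__ == "__main__":
    for name, (size, secs, nodes) in RUNS.items():
        print(
            f"{name}: {aggregate_rate(size, secs) / 1e9:.1f} GB/s aggregate, "
            f"{per_node_rate(size, secs, nodes) / 1e6:.1f} MB/s per node"
        )
```

Under those assumptions the terabyte run works out to roughly 16 GB/s across the cluster (about 11 MB/s per node) and the petabyte run to roughly 17 GB/s (under 5 MB/s per node), so the aggregate rate stays roughly flat while the data set grows a thousandfold.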
  • CORE HADOOP COMPONENTS: Apache Hadoop is a platform for data storage and processing that is scalable, fault tolerant and open source.
    • Hadoop Distributed File System (HDFS): File sharing & data protection across physical servers
    • MapReduce: Distributed computing across physical servers
    • Has the flexibility to store and mine any type of data: Ask questions across structured and unstructured data that were previously impossible to ask or solve. Not bound by a single schema.
    • Excels at processing complex data: Scale-out architecture divides workloads across multiple nodes. Flexible file system eliminates ETL bottlenecks.
    • Scales economically: Can be deployed on commodity hardware. Open source platform guards against vendor lock-in.
    ©2011 Cloudera, Inc. All Rights Reserved.
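The "not bound by a single schema" point is often described as schema-on-read: raw records are stored as-is, and each query decides which fields it cares about when it reads them. The following is a small plain-Python sketch of that idea, assuming JSON-formatted log lines; the sample data and the `read_with_schema` helper are hypothetical, and none of this is Hadoop or Hive code.

```python
import json

# Raw events stored untouched, including one malformed line.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/search", "query": "pizza"}',
    "not json at all",
]


def read_with_schema(raw_lines, fields):
    """Project each raw line onto the requested fields at read time."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are skipped, not rejected at load
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of failing the whole read.
        yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    print(list(read_with_schema(RAW_EVENTS, ["user", "page"])))
    print(list(read_with_schema(RAW_EVENTS, ["user", "query"])))
```

Two different "schemas" are applied to the same stored lines without reloading or transforming them first, which is the contrast with a load-time ETL step that the slide is drawing.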
  • CHANGING THE WORLD ONE PETABYTE AT A TIME
    • 2008: Cloudera founded by Mike Olson, Amr Awadallah & Jeff Hammerbacher
    • 2009: CDH: first commercial Apache Hadoop distribution
    • 2009: Hadoop creator Doug Cutting joins Cloudera
    • 2010: Cloudera Manager: first management application for Hadoop
    • 2011: Cloudera reaches 100 production customers
    • 2011: Cloudera University expands to 140 countries
    • 2012: Cloudera Enterprise 4: the standard for Hadoop in the enterprise
    • 2012: Cloudera Connect reaches 300 partners
    • Beyond…: Transforming how companies think about data
  • CLOUDERA ENTERPRISE
    • Cloudera Support: Our team of experts on call to help you meet your service level agreements (SLAs)
    • Cloudera Manager: End-to-end management application for the deployment & operation of CDH
    • CDH: Big data storage, processing & analytics platform based on Apache Hadoop – 100% open source
    EDUCATION
    • Developers
    • Administrators
    • Data scientists
    • Certification programs
    PROFESSIONAL SERVICES
    • Use case discovery
    • New Hadoop deployment
    • Proof of concept
    • Production pilots
    • Process & team development
    • Deployment certification
  • Cloudera’s software is never installed all by itself
    • It’s always deployed alongside mission-critical systems that represent enormous investment
    • Extracting value from data requires sharing it across boundaries and among systems
    Goal: The right storage and the right processing in the right place at the right time
    ©2012 Cloudera, Inc. All Rights Reserved.
  • ✛ Disparate data sources
    ✛ Disparate systems for transforming, processing and analyzing data
    ✛ Disparate systems for capturing and reporting data, and for enforcing business and legislative governance requirements
    All need to be connected for usability and to unlock the unique value of each
  • Consulting Services; Cloudera University
    • Users: Operators, Engineers, Analysts, Business Users, Customers
    • Tools: Management Tools, IDEs, BI / Analytics, Enterprise Reporting, Web Application
    • Platforms: Enterprise Data Warehouse; Cloudera Enterprise (CDH, Cloudera Manager, Technical Support); Operational Rules Engines
    • Data sources: Logs, Files, Web Data, Relational Databases
  • Industry: Data Processing / Advanced Analytics
    • Web: Clickstream Sessionization / Social Network Analysis
    • Media: Engagement / Content Optimization
    • Telecom: Mediation / Network Analytics
    • Retail: Data Factory / Loyalty & Promotions
    • Financial: Trade Reconciliation / Fraud Analysis
    • Government: Signal Intelligence (SIGINT) / Entity Analysis
    • Biotech / Pharma: Genome Mapping / Sequencing Analysis
  • Hadoop@YP
    Sept 26, 2012
    William Theisinger, Executive Director, Platform Computing
    © 2012 YP Holdings LLC Intellectual Property. All rights reserved. YP Holdings LLC, the YP Holdings LLC logo and all other YP Holdings LLC marks contained herein are trademarks of YP Holdings LLC Intellectual Property and/or YP Holdings LLC affiliated companies. All other marks contained herein are the property of their respective owners. (INTERNAL USE ONLY)
  • Challenges
  • What we were facing
    • Increasing volume of traffic data through our distribution network
    • Need for a system to support changing data complexity and detail
    • Adhere to tighter SLAs
    • Provide intra-day reporting
    • Benefit from the intelligence trapped in our data
  • Legacy processing flow
    Diagram components: Application Log Data Layer, ETL, Data processing, Data Warehouse, and three Data Load steps
    • Drop reportable events on the floor
    • Loading multiple DBs
    • Processing time was significant
    • Reporting lag was in days, not hours
    • High maintainability required
  • Hadoop Platform
  • Hadoop processing flow
    Diagram components: Applications, LWES, Data Collection Layer, Hadoop Platform, Data Warehouse
    • All ETL processing in Hadoop
    • Several systems integrate to Hadoop platform
    • All Java MapReduce with some Hive for end user and dependent systems
    • Reporting lag in hours, not days
    • Actual reduction in maintainability needs
  • Next Generation
  • Hadoop processing flow
    Diagram components: Applications, LWES, Data Collection Layer, Hadoop Platform, HBase Platform, Data Warehouse
    • Migrating some reporting to HBase
    • Exposing core business KPIs via APIs
    • Replacing various data marts with HBase tables/schemas
    • Reducing TCO
    • Alignment of core skill sets
  • Hadoop @ Research In Motion
    Aaron Wiebe, BlackBerry Infrastructure Architect
  • Internal Use Only
    The Problem
    1. BlackBerry Services currently generate 500TB of instrumentation data daily (and growing rapidly).
    2. Traditional systems unable to cope with both growth and access requests.
    3. Total global dataset of ~100PB.
    Confidential and Proprietary
  • The Old Way
    Diagram components: Services, Filter and Split, Event Monitoring, Alerting, Streaming ETL, Complex Correlation, Streaming ETL, Data Warehouse, Archive Storage
    1. Focus on reducing data to required data set
    2. Pipeline data flows to avoid hitting disk
    3. Scalability issues at most stages
    4. Going back to the Archive was really time consuming
  • The Hadoop Way
    Diagram components: Services, Filter and Split, Event Monitoring, Alerting, Hadoop (Archive Storage, ETL, Correlation, Stage 1 DWH), Data Warehouse
    1. Archive storage moved to HDFS
    2. ETL processes converted to Hadoop (Pig+Hive)
    3. Some data warehouse functions migrating to Hadoop
  • Real Results
    1. 90% code base reduction for ETL tools
    2. Example performance: a previous ad-hoc query would take around 4 days; it now takes 53 minutes
    3. Significant capital cost reductions over previous system
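The ad-hoc query example on this slide can be expressed as a speedup factor. The sketch below treats "around 4 days" as exactly 4 × 24 hours, which is an assumption since the slide gives only an approximate figure; the `speedup` and `time_saved_fraction` helpers are hypothetical names.

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the new run is than the old one."""
    return before_minutes / after_minutes


def time_saved_fraction(before_minutes, after_minutes):
    """Fraction of the original elapsed time that is no longer needed."""
    return 1 - after_minutes / before_minutes


BEFORE_MINUTES = 4 * 24 * 60  # assumption: "around 4 days" taken as 96 hours
AFTER_MINUTES = 53            # from the slide

if __name__ == "__main__":
    print(f"speedup: {speedup(BEFORE_MINUTES, AFTER_MINUTES):.0f}x")
    print(f"time saved: {time_saved_fraction(BEFORE_MINUTES, AFTER_MINUTES):.1%}")
```

With those inputs the query runs on the order of a hundred times faster, which is the difference between a multi-day wait and a result within the same working hour.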