Distilling Hadoop Patterns of Use and How You Can Use Them for Your Big Data Analytics


Shaun Connolly's presentation at SAS Global Conference

Published in: Software, Technology

  1. Distilling Hadoop Patterns of Use. Shaun Connolly, Hortonworks (@shaunconnolly). March 25, 2014. Hortonworks © 2014.
  2. Our Mission: Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop. Our Commitment: Open Leadership (drive innovation in the open exclusively via the Apache community-driven open source process); Enterprise Rigor (engineer, test and certify Apache Hadoop with the enterprise in mind); Ecosystem Endorsement (focus on deep integration with existing data center technologies and skills). Headquarters: Palo Alto, CA. Employees: 300+ and growing. Reseller Partners.
  3. Data Continues to Grow Sharply. 2012: digital universe = 20 Zettabytes. 2020: digital universe = 40 Zettabytes. 1 Zettabyte (ZB) = 1 billion Terabytes (TB). 2014: 31% of enterprises managing more than 1 Petabyte. Data sources: Social Networks; Machine Generated; Documents, Emails; OLTP, ERP, CRM Systems; Geolocation Data; Sensor Data; Web Logs, Click Streams. 85% of growth comes from new types of data, with machine-generated data increasing 15x. Sources: IDC and IDG Enterprise.
  4. Cameras and microphones widely deployed; new routes to market via intelligent objects; content and services via connected products; everything has a URL; remote sensing of objects and environment; augmented reality; situational decision support; building and infrastructure management. Over 50% of Internet connections are things. 2011: 15+ billion permanent, 50+ billion intermittent. 2020: 30+ billion permanent, >200 billion intermittent. Source: Gartner Keynote at Hadoop Summit 2013.
  5. Harnessing Big Data is transformational to business models. It enables the move from post-transaction, reactive analysis of subsets of data stored in silos to a world of pre-transaction, interactive insights across all data, which impacts both the top and bottom lines.
  6. Modern Data Architecture with Hadoop (diagram).
     Applications: Statistical Analysis; BI / Reporting, Ad Hoc Analysis; Interactive Web & Mobile Applications; Enterprise Applications.
     Data Systems: Repositories (RDBMS, EDW, MPP); Enterprise Hadoop (Governance & Integration, Security, Operations, Data Access, Data Management).
     Sources: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data.
     Operations Tools: Provision, Manage & Monitor. Dev & Data Tools: Build & Test.
  7. MDA Unlocks New Approach to Insight.
     Current Approach (SQL; single query engine; repeatable linear process): apply schema on write; dependent on IT. Steps: determine list of questions; design solutions; collect structured data; ask questions from list; detect additional questions.
     Augment with Hadoop (Enterprise Hadoop; multiple query engines; iterative process: explore, transform, analyze): apply schema on read; support a range of access patterns to data stored in HDFS (batch, interactive, real-time, streaming).
  8. Schema-on-Write vs. Schema-on-Read. Standard Digital Camera: zoom & focus first; capture limited set of pixels; crop around the focused area. Lytro Lightfield Camera: capture entire lightfield; infinite zoom & focus; crop any captured areas.
  9. MDA Uses Commodity Compute + Storage. Hadoop enables scalable compute & storage at a compelling cost structure. Chart: Fully Loaded Cost per Raw TB of Data (min – max cost), on an axis from $0 to $180,000, comparing Cloud Storage, Hadoop, NAS, Engineered System, EDW/MPP and SAN.
  10. MDA Optimizes Data Warehouse Analytics.
     Current Reality (Analytics 20%, ETL Process 30%, Operations 50%): EDW at capacity, with some usage from low-value workloads; older transformed data archived and unavailable for ongoing exploration; source data often discarded.
     Augment with Hadoop (Operations 50%, Analytics 50%; Hadoop takes over parse, cleanse, apply structure, transform): free up EDW resources from low-value tasks; keep 100% of source data and historical data for ongoing exploration; mine data for value after loading it, because of schema-on-read.
  11. Integrating with Existing Investments (diagram). Layers: Applications; Data System (RDBMS, EDW, MPP); Sources: Emerging Sources (Sensor, Sentiment, Geo, Unstructured) and Existing Sources (CRM, ERP, Clickstream, Logs); Operational Tools; Dev & Data Tools; Infrastructure. Products named: HANA, BusinessObjects BI.
  12. Powering the Modern Data Architecture. Multi-Use Data Platform: store all data in one place, process in many ways (Batch, Interactive, Real-time, Streaming). Enables deep insight across a large, broad, diverse set of data at efficient scale. YARN: Data Operating System. Data Lake that contains ALL data: raw sources and any processed data over extended periods of time.
  13. How Hadoop? “Hadoop can be used to create a ‘data lake’ – an integrated repository of data from internal and external data sources... Data combined from multiple silos can help your organization find answers to complex questions that no one has previously dared ask or known how to ask.” -- Forrester
  14. The Common Journey with Hadoop (diagram; axes: Scale and Scope). More data and analytic apps. New Analytic Apps: new types of data, LOB-driven. A Modern Data Architecture: RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management.
  15. Unlock Value in New Types of Data.
     (1) Social: understand how people are feeling and interacting, right now.
     (2) Clickstream: capture and analyze website visitors’ data trails and optimize your website.
     (3) Sensor/Machine: discover patterns in data streaming from remote sensors and machines.
     (4) Geographic: analyze location-based data to manage operations where they occur.
     (5) Server Logs: diagnose process failures and prevent security breaches.
     (6) Unstructured (txt, video, pictures, etc.): understand patterns in files across millions of web pages, emails, and documents.
     Plus online archive: data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value.
  16. 20 Business Applications of Hadoop. Table columns: Industry; Use Case (Type of Data).
     Financial Services: New Account Risk Screens (Text, Server Logs); Trading Risk (Server Logs); Insurance Underwriting (Geographic, Sensor, Text).
     Telecom: Call Detail Records (CDRs) (Machine, Geographic); Infrastructure Investment (Machine, Server Logs); Real-time Bandwidth Allocation (Server Logs, Text, Social).
     Retail: 360° View of the Customer (Clickstream, Text); Localized, Personalized Promotions (Geographic); Website Optimization (Clickstream).
     Manufacturing: Supply Chain and Logistics (Sensor); Assembly Line Quality Assurance (Sensor); Crowdsourced Quality Assurance (Social).
     Healthcare: Use Genomic Data in Medical Trials (Structured); Monitor Patient Vitals in Real-Time (Sensor).
     Pharmaceuticals: Recruit and Retain Patients for Drug Trials (Social, Clickstream); Improve Prescription Adherence (Social, Unstructured, Geographic).
     Oil & Gas: Unify Exploration & Production Data (Sensor, Geographic & Unstructured); Monitor Rig Safety in Real-Time (Sensor, Unstructured).
     Government: ETL Offload in Response to Federal Budgetary Pressures (Structured); Sentiment Analysis for Government Programs (Social).
  17. 360° Customer View for Home Supply Retailer (Creating Opportunity; Data: Clickstream, Unstructured, Structured).
     Problem: Disjoint customer engagement across all channels. Data repositories on website traffic, POS transactions and in-home services exist in separate silos; unable to perform analytics on customer buying behavior across all channels; limited ability for targeted marketing to specific segments.
     Solution: Unified system of engagement via “golden record”. Golden record enables targeted marketing capabilities: customized coupons, promotions and emails; deep visibility into all customers and all market segments; unlocks rich, informed cross-sell & up-sell opportunities.
     Company: Retail; major home improvement retailer; >$74B in revenue; >300K employees; >2,200 stores.
  18. Monetize Anonymous & Aggregate Banking Data (Creating Opportunity; Data: Structured, Clickstream, Social & Unstructured).
     Problem: Unable to unlock valuable cross-sell banking data. Bank possesses data that indicates larger macro-economic trends, which can be monetized in secondary markets; data sets are isolated in legacy silos controlled by LOBs; regulations and company policies protect customer privacy; IT challenged by joining data while guaranteeing anonymity.
     Solution: Create cross-LOB data lake of de-identified data. Mortgage bankers, consumer bankers, credit card group and treasury bankers have access to the same cross-sell data; single point of security & privacy for de-identification, masking, encryption, authentication and access control; interoperability with SAS, Red Hat & Splunk.
     Company: Banking; one of the largest US banks.
  19. Optimize High-Tech Manufacturing (Improving Efficiency; Data: Sensor).
     Problem: Ineffective root cause analysis on product defects. 200 million digital storage devices manufactured yearly; >10K faulty devices returned by customers every month; limited data available for root cause analysis means that diagnosing problems is highly manual (physical inspections); subset of sensor data from QA testing retained 3-12 months.
     Solution: Created sensor data lake for 10x quality improvement. Repository holds 24 months of data for each device; manufacturing dashboard allows >1,000 employees to search data, with results returned in less than 1 second; quality improved 10x: rate down to ~1K faulty devices / month.
     Company: Manufacturing; digital storage devices; >$15B in revenue; >85K employees.
  20. Think Pigabyte, Not Petabyte
  21. Enabling Hadoop for the Enterprise Journey.
     (1) Capabilities: ensure enterprise capabilities are delivered in 100% open source to benefit all.
     (2) Integration: interoperable with existing data center investments.
     (3) Skills: leverage your existing skills: development, analytics, operations.
     Journey diagram (axes: Scale and Scope): More data and analytic apps. New Analytic Apps: new types of data, LOB-driven. A Modern Data Architecture: RDBMS, MPP, EDW; Governance & Integration, Security, Operations, Data Access, Data Management.
  22. Try Hadoop Today: download the Hortonworks Sandbox (Learn Hadoop; Build Your Analytic App; Try Hadoop 2). Get Involved: San Jose, CA, June 3 - 5, 2014; Amsterdam, April 2 - 3, 2014.
  23. Questions? @shaunconnolly
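
Slides 7 and 8 contrast schema-on-write with schema-on-read. The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw events are assumed to be newline-delimited JSON, and the event fields, `RAW_EVENTS`, and `read_with_schema` are hypothetical names that do not come from the presentation or from any Hadoop API.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# Nothing is validated or dropped at write time (assumption for illustration).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"user": "u1", "page": "/cart", "ms": 95}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select and cast only the fields a question needs."""
    for line in raw_lines:
        record = json.loads(line)
        # A field missing from a record becomes None; the record is not rejected.
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Question 1: how long did each page view take?
latency_view = list(read_with_schema(RAW_EVENTS, {"page": str, "ms": int}))

# Question 2, asked later: which users used a coupon?
# The same raw data is read again with a different schema.
coupon_view = list(read_with_schema(RAW_EVENTS, {"user": str, "coupon": str}))
```

Because the raw lines are kept untouched, the second question needs no reload or redesign of the stored data, which is the point of the lightfield camera analogy on slide 8. A schema-on-write design would have had to anticipate the `coupon` field before the data was stored.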