Powering Next Generation Data Architecture With Apache Hadoop


Shaun Connolly's presentation at Strata London, October 1-2

Published in: Education

  1. Powering Next-Generation Data Architectures with Apache Hadoop
     Shaun Connolly, Hortonworks
     @shaunconnolly
     September 25, 2012
     © Hortonworks Inc. 2012
  2. Big Data: Changing The Game for Organizations
     Transactions + Interactions + Observations = BIG DATA
     Axes: data volume (Megabytes to Petabytes) against Increasing Data Variety and Complexity
     •  ERP (Megabytes): Purchase detail; Purchase record; Payment record
     •  CRM (Gigabytes): Segmentation; Offer details; Offer history; Customer Touches; Support Contacts
     •  WEB (Terabytes): Web logs; A/B testing; Behavioral Targeting; Dynamic Pricing; Search Marketing; Affiliate Networks; Dynamic Funnels
     •  BIG DATA (Petabytes): Mobile Web; Sentiment; User Click Stream; SMS/MMS; Speech to Text; Social Interactions & Feeds; Spatial & GPS Coordinates; Sensors / RFID / Devices; Business Data Feeds; External Demographics; User Generated Content; HD Video, Audio, Images; Product/Service Logs
  3. Connecting Transactions + Interactions + Observations
     Data sources feeding the Big Data Platform:
     •  Audio, Video, Images
     •  Docs, Text, XML
     •  Web Logs, Clicks
     •  Social, Graph, Feeds
     •  Sensors, Devices, RFID
     •  Spatial, GPS
     •  Events, Other
     Connected systems:
     •  Business Transactions & Interactions: Web, Mobile, CRM, ERP, SCM, …
     •  Business Intelligence & Analytics: Dashboards, Reports, Visualization, …
     Numbered flows:
     1. Classic ETL processing
     2. Capture and exchange multi-structured data to unlock value
     3. Deliver refined data and runtime models
     4. Retain runtime models and historical data for ongoing refinement & analysis
     5. Retain historical data to unlock additional value
  4. Goal: Optimize Outcomes at Scale
     •  Media: optimize Content
     •  Intelligence: optimize Detection
     •  Finance: optimize Algorithms
     •  Advertising: optimize Performance
     •  Fraud: optimize Prevention
     •  Retail / Wholesale: optimize Inventory turns
     •  Manufacturing: optimize Supply chains
     •  Healthcare: optimize Patient outcomes
     •  Education: optimize Learning outcomes
     •  Government: optimize Citizen services
     Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
  5. Customer: UC Irvine Medical Center
     Optimizing patient outcomes while lowering costs
     About the customer:
     •  UC Irvine Medical Center is ranked among the nation's best hospitals by U.S. News & World Report for the 12th year
     •  More than 400 specialty and primary care physicians
     •  Opened in 1976
     •  422-bed medical facility
     Current system, Epic, holds 22 years of patient data across admissions and clinical information
     –  Significant cost to maintain and run system
     –  Difficult to access, not integrated into any systems, standalone
     Apache Hadoop sunsets legacy system and augments new electronic medical records
     1. Migrate all legacy Epic data to Apache Hadoop
        –  Replaced existing ETL and temporary databases with Hadoop, resulting in faster, more reliable transforms
        –  Captures all legacy data, not just a subset. Exposes this data to EMR and other applications
     2. Eliminate maintenance of legacy system and database licenses
        –  $500K in annual savings
     3. Integrate data with EMR and clinical front-end
        –  Better service with complete patient history provided to admissions and doctors
        –  Enable improved research through complete information
  6. Emerging Patterns of Use
     Big Data: Transactions + Interactions + Observations
     •  Refine
     •  Explore
     •  Enrich
     $ Business Case $
  7. Operational Data Refinery
     Hadoop as platform for ETL modernization (pattern: Refine)
     Flow: Unstructured, Log files, DB data → Capture and archive → Parse & Cleanse → Structure and join → Upload → Enterprise Data Warehouse
     Capture
     •  Capture new unstructured data along with log files, all alongside existing sources
     •  Retain inputs in raw form for audit and continuity purposes
     Process
     •  Parse the data & cleanse
     •  Apply structure and definition
     •  Join datasets together across disparate data sources
     Exchange
     •  Push to existing data warehouse for downstream consumption
     •  Feeds operational reporting and online systems
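The Process steps on this slide (parse and cleanse, apply structure, join across sources) can be illustrated with a minimal single-machine sketch in Python. Everything specific here is an assumption for illustration: the `timestamp,customer_id,url` log format, the field names, and the in-memory customer lookup are hypothetical and do not come from the slides. In a Hadoop deployment the same logic would more likely be written as a Pig, Hive, or MapReduce job.

```python
import csv


def parse_and_cleanse(raw_lines):
    """Parse 'timestamp,customer_id,url' lines and drop malformed rows.

    The three-column layout is a hypothetical example format.
    """
    for row in csv.reader(raw_lines):
        if len(row) != 3:
            continue  # cleanse: wrong number of fields
        timestamp, customer_id, url = (field.strip() for field in row)
        if not customer_id:
            continue  # cleanse: no key to join on
        # Apply structure: name each field.
        yield {"timestamp": timestamp, "customer_id": customer_id, "url": url}


def join_with_customers(events, customers):
    """Inner-join structured events to a customer lookup keyed by customer_id."""
    for event in events:
        customer = customers.get(event["customer_id"])
        if customer is not None:
            yield {**event, **customer}
```

Calling `join_with_customers(parse_and_cleanse(lines), customers)` yields one merged record per well-formed log line whose customer is known; the raw input lines are left untouched, in keeping with the slide's advice to retain inputs in raw form.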
  8. “Big Bank” Key Benefits
     •  Capture and archive
        –  Retain 3 to 5 years instead of 2 to 10 days
        –  Lower costs
        –  Improved compliance
     •  Transform, change, refine
        –  Turn upstream raw dumps into small list of “new, update, delete” customer records
        –  Convert fixed-width EBCDIC to UTF-8 (Java and DB compatible)
        –  Turn raw weblogs into sessions and behaviors
     •  Upload
        –  Insert into Teradata for downstream “as-is” reporting and tools
        –  Insert into new exploration platform for scientists to play with
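The "convert fixed-width EBCDIC to UTF-8" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bank's actual job: the record layout (`customer_id`, `name`, `status` and their widths) is hypothetical, and `cp037` (EBCDIC US/Canada) is assumed as the code page; a real feed may use a different code page or contain packed-decimal fields, which this sketch does not handle.

```python
# Hypothetical fixed-width layout: (field name, width in bytes). Not from the slides.
LAYOUT = [("customer_id", 8), ("name", 20), ("status", 1)]
RECORD_LEN = sum(width for _, width in LAYOUT)


def decode_record(raw, codec="cp037"):
    """Decode one fixed-width EBCDIC record into a dict of stripped strings."""
    if len(raw) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} bytes, got {len(raw)}")
    # cp037 is a single-byte code page, so byte offsets equal character offsets.
    text = raw.decode(codec)
    fields = {}
    offset = 0
    for name, width in LAYOUT:
        fields[name] = text[offset:offset + width].strip()
        offset += width
    return fields


def to_utf8_line(fields):
    """Serialize a decoded record as a tab-separated UTF-8 line."""
    return "\t".join(fields[name] for name, _ in LAYOUT).encode("utf-8") + b"\n"


def convert(stream, codec="cp037"):
    """Yield UTF-8 lines for each fixed-width record in a bytes buffer."""
    for start in range(0, len(stream), RECORD_LEN):
        yield to_utf8_line(decode_record(stream[start:start + RECORD_LEN], codec))
```

Because each record is decoded independently, the per-record function would map naturally onto a parallel job over file splits, provided the splits are aligned to the record length.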
  9. Big Data Exploration & Visualization
     Hadoop as agile, ad-hoc data mart (pattern: Explore)
     Flow: Unstructured, Log files, DB data → Capture and archive → Structure and join → Categorize into tables → Upload via JDBC / ODBC → Visualization Tools, EDW / Datamart (optional)
     Capture
     •  Capture multi-structured data and retain inputs in raw form for iterative analysis
     Process
     •  Parse the data into queryable format
     •  Explore & analyze using Hive, Pig, Mahout and other tools to discover value
     •  Label data and type information for compatibility and later discovery
     •  Pre-compute stats, groupings, patterns in data to accelerate analysis
     Exchange
     •  Use visualization tools to facilitate exploration and find key insights
     •  Optionally move actionable insights into EDW or datamart
  10. “Hardware Manufacturer” Key Benefits
      •  Capture and archive
         –  Store 10M+ survey forms/year for > 3 years
         –  Capture text, audio, and systems data in one platform
      •  Structure and join
         –  Unlock freeform text and audio data
         –  Un-anonymize customers
      •  Categorize into tables
         –  Create HCatalog tables “customer”, “survey”, “freeform text”
      •  Upload, JDBC
         –  Visualize natural satisfaction levels and groups
         –  Tag customers as “happy” and report back to CRM database
  11. Application Enrichment
      Deliver Hadoop analysis to online apps (pattern: Enrich)
      Flow: Unstructured, Log files, DB data → Capture → Parse → Derive/Filter → NoSQL, HBase (scheduled & near real time) → Online Applications (low latency)
      Capture
      •  Capture data that was once too bulky and unmanageable
      Process
      •  Uncover aggregate characteristics across data
      •  Use Hive, Pig, and MapReduce to identify patterns
      •  Filter useful data from mass streams (Pig)
      •  Micro or macro batch-oriented schedules
      Exchange
      •  Push results to HBase or other NoSQL alternative for real-time delivery
      •  Use patterns to deliver the right content/offer to the right person at the right time
  12. “Clothing Retailer” Key Benefits
      •  Capture
         –  Capture weblogs together with sales order history, customer master
      •  Derive useful information
         –  Compute relationships between products over time: “people who buy shirts eventually need pants”
         –  Score customer web behavior / sentiment
         –  Connect product recommendations to customer sentiment
      •  Share
         –  Load customer recommendations into HBase for rapid website service
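One simple way to read "compute relationships between products over time" is to count, across customers, how often one product is bought and a different product is bought later. The Python sketch below shows that idea only; the `(customer_id, order_date, product)` tuple shape and ISO-formatted date strings are assumptions, and the retailer's actual method is not described in the slides.

```python
from collections import Counter, defaultdict


def follow_on_counts(orders):
    """Count customers who bought `first` and, on a later date, `later`.

    orders: iterable of (customer_id, order_date, product), with order_date
    as an ISO 'YYYY-MM-DD' string (a hypothetical input shape).
    Returns a Counter keyed by (first, later).
    """
    by_customer = defaultdict(list)
    for customer_id, order_date, product in orders:
        by_customer[customer_id].append((order_date, product))

    counts = Counter()
    for purchases in by_customer.values():
        purchases.sort()  # ISO date strings sort chronologically
        pairs = set()
        for i, (date_i, first) in enumerate(purchases):
            for date_j, later in purchases[i + 1:]:
                if later != first and date_j > date_i:
                    pairs.add((first, later))
        counts.update(pairs)  # each customer counts once per ordered pair
    return counts
```

`follow_on_counts(orders).most_common(10)` then gives the strongest "buys X, later buys Y" pairs, which could be turned into per-customer recommendations and loaded into a low-latency store such as the HBase table the slide mentions.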
  13. Hadoop in Enterprise Data Architectures
      •  Existing Business Infrastructure, Web, New Tech: IDE & Dev Tools; ODS & Datamarts; Applications & Spreadsheets; Visualization & Intelligence; Web Applications; Datameer; Tableau; Karmasphere; Splunk
      •  Operations; Discovery Tools; EDW; Low Latency/NoSQL; Custom; Existing
      •  Hadoop components: Templeton; WebHDFS; Sqoop; Flume; HCatalog; HBase; Pig; Hive; MapReduce; HDFS; Ambari; Oozie; HA; ZooKeeper
      •  Big Data Sources (transactions, observations, interactions): Social Media; Exhaust Data; logs; files; CRM; ERP; financials
  14. Hortonworks Vision & Role
      We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
      1. Be diligent stewards of the open source core
      2. Be tireless innovators beyond the core
      3. Provide robust data platform services & open APIs
      4. Enable vibrant ecosystem at each layer of the stack
      5. Make Hadoop platform enterprise-ready & easy to use
  15. What’s Needed to Drive Success?
      •  Enterprise tooling to become a complete data platform
         –  Open deployment & provisioning
         –  Higher quality data loading
         –  Monitoring and management
         –  APIs for easy integration
      •  Ecosystem needs support & development
         –  Existing infrastructure vendors need to continue to integrate
         –  Apps need to continue to be developed on this infrastructure
         –  Well defined use cases and solution architectures need to be promoted
      •  Market needs to rally around core Apache Hadoop
         –  To avoid splintering/market distraction
         –  To accelerate adoption
      www.hortonworks.com/moore
  16. Next Steps?
      1. Download Hortonworks Data Platform: hortonworks.com/download
      2. Use the getting started guide: hortonworks.com/get-started
      3. Learn more… get support
         Training (hortonworks.com/training)
         •  Expert role based training
         •  Course for admins, developers and operators
         •  Certification program
         •  Custom onsite options
         Hortonworks Support (hortonworks.com/support)
         •  Full lifecycle technical support across four service levels
         •  Delivered by Apache Hadoop Experts/Committers
         •  Forward-compatible
  17. Thank You!
      Questions & Answers
      Follow: @hortonworks & @shaunconnolly