Hadoop's Role in the Big Data Architecture, OW2con'12, Paris


  1. Hadoop & Hortonworks: Open Source Wild Fire
     November 2012, OW2 Con
     © Hortonworks Inc. 2012
  2. Big data changes the game
     Transactions + Interactions + Observations = BIG DATA
     [Figure: data volume grows from Megabytes (ERP) to Gigabytes (CRM) to Terabytes (WEB) to Petabytes (BIG DATA), plotted against Increasing Data Variety and Complexity]
     Data types shown in the figure: Purchase detail, Purchase record, Payment record, Offer details, Offer history, Segmentation, Customer Touches, Support Contacts, Web logs, A/B testing, Dynamic Pricing, Dynamic Funnels, Affiliate Networks, Search Marketing, Behavioral Targeting, Mobile Web, Sentiment, User Click Stream, SMS/MMS, Speech to Text, Social Interactions & Feeds, Spatial & GPS Coordinates, Sensors / RFID / Devices, Business Data Feeds, External Demographics, User Generated Content, HD Video, Audio, Images, Product/Service Logs
  3. Big Data: Optimize Outcomes at Scale
       • Sports optimize Championships
       • Intelligence optimize Detection
       • Finance optimize Algorithms
       • Advertising optimize Performance
       • Fraud optimize Prevention
       • Retail / Wholesale optimize Inventory turns
       • Manufacturing optimize Supply chains
       • Healthcare optimize Patient outcomes
       • Education optimize Learning outcomes
       • Government optimize Citizen services
     Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
  4. Apache Hadoop: open source data management with scale-out storage & distributed processing
     HDFS (Storage)
       • Distributed across “nodes”
       • Natively redundant
       • Name node tracks locations
     Map Reduce (Processing)
       • Splits a task across processors “near” the data & assembles results
       • Self-Healing, High Bandwidth Clustered Storage
     Key Characteristics
       • Scalable
         – Efficiently store and process petabytes of data
         – Linear scale driven by additional processing and storage
       • Reliable
         – Redundant storage
         – Failover across nodes and racks
       • Flexible
         – Store all types of data in any format
         – Apply schema on analysis and sharing of the data
       • Economical
         – Use commodity hardware
         – Open source software guards against vendor lock-in
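The Map Reduce idea on this slide (split a task across processors, then assemble the results) can be sketched in a few lines of plain Python. This is only an illustration that runs in a single process: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for this sketch and are not part of any Hadoop API, and each input string stands in for one block of data that a real cluster would process on the node that stores it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(block: str) -> List[Tuple[str, int]]:
    """Map step: turn one block of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> int:
    """Reduce step: assemble the partial results for one key."""
    return sum(values)


def run_job(blocks: List[str]) -> Dict[str, int]:
    """Run map over every block, then shuffle and reduce.

    On a cluster each block would be mapped in parallel near its data;
    here the blocks are simply processed one after another.
    """
    mapped: List[Tuple[str, int]] = []
    for block in blocks:
        mapped.extend(map_phase(block))
    grouped = shuffle(mapped)
    return {key: reduce_phase(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(run_job(["big data big", "data lake"]))
```

Because the map step only ever sees one block, adding more blocks (and more machines to hold them) does not change the code, which is the sense in which the slide talks about linear scale.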
  5. What is a Hadoop “Distribution”
     A complementary set of open source technologies that make up a complete data platform
     Components shown: Talend, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper
       • Tested and pre-packaged to ease installation and usage
       • Collects the right versions of the components, which all have different release cycles, and ensures they work together
  6. Hadoop in Enterprise Data Architectures
     [Figure: Hadoop placed between big data sources and the existing enterprise tools]
       • Existing Business Infrastructure: IDE & Dev Tools, ODS & Datamarts, Applications & Spreadsheets, Visualization & Intelligence, Discovery Tools, EDW, Low Latency/NoSQL, Custom, Existing
       • Web: Web Applications, Operations
       • New Tech: Datameer, Tableau, Karmasphere, Splunk
       • Hadoop platform: Templeton, WebHDFS, Sqoop, Flume, HCatalog, HBase, Pig, Hive, MapReduce, HDFS, Ambari, Oozie, HA, ZooKeeper
       • Big Data Sources (transactions, observations, interactions): Social Media, Exhaust Data, logs, files, CRM, ERP, financials
  7. Apache Hadoop & Big Data Use Cases
     Big Data (Transactions, Interactions, Observations) feeds three use cases that drive the Business Case:
       • Refine
       • Explore
       • Enrich
  8. Operational Data Refinery: Hadoop as platform for ETL modernization (use case: Refine)
     [Figure: Unstructured data, Log files and DB data flow through Capture and archive, Parse & Cleanse, Structure and join, then Upload to the Enterprise Data Warehouse]
     Capture
       • Capture new unstructured data along with log files, all alongside existing sources
       • Retain inputs in raw form for audit and continuity purposes
     Process
       • Parse the data & cleanse
       • Apply structure and definition
       • Join datasets together across disparate data sources
     Exchange
       • Push to existing data warehouse for downstream consumption
       • Feeds operational reporting and online systems
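The Process steps of the refinery (parse and cleanse, apply structure, join across sources) might look roughly like the following in miniature. Everything here is an assumption made for illustration: the log line layout, the field names, the `CRM` lookup table and the `parse_line` and `refine` helpers are invented, and plain Python stands in for what would be Pig, Hive or Map Reduce jobs on a real cluster.

```python
from typing import Dict, List, Optional

# Hypothetical raw inputs; in the refinery these would be retained in raw form.
RAW_LOG_LINES = [
    "2012-11-01T10:00:00 c42 /checkout 200",
    "garbage line",
    "2012-11-01T10:00:05 c7 /home 200",
    "2012-11-01T10:00:09 c42 /cart 500",
]

# Hypothetical existing source (for example a CRM extract), keyed by customer id.
CRM: Dict[str, Dict[str, str]] = {
    "c42": {"segment": "gold"},
    "c7": {"segment": "new"},
}


def parse_line(line: str) -> Optional[dict]:
    """Parse one log line and apply structure; return None if it is malformed."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    timestamp, customer_id, path, status = parts
    return {
        "timestamp": timestamp,
        "customer_id": customer_id,
        "path": path,
        "status": int(status),
    }


def refine(raw_lines: List[str], crm: Dict[str, Dict[str, str]]) -> List[dict]:
    """Cleanse the raw lines, then join each record with its CRM profile."""
    rows = []
    for line in raw_lines:
        record = parse_line(line)
        if record is None:  # cleanse: drop lines that do not parse
            continue
        profile = crm.get(record["customer_id"], {})
        rows.append({**record, "segment": profile.get("segment", "unknown")})
    return rows


if __name__ == "__main__":
    for row in refine(RAW_LOG_LINES, CRM):
        print(row)
```

The returned rows are the structured, joined records that the Exchange step would push to the data warehouse; the raw lines themselves are left untouched, matching the slide's point about retaining inputs for audit.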
  9. Big Data Exploration & Visualization: Hadoop as agile, ad-hoc data mart (use case: Explore)
     [Figure: Unstructured data, Log files and DB data flow through Capture and archive, Structure and join, Categorize into tables, then upload via JDBC / ODBC to Visualization Tools and, optionally, an EDW / Datamart]
     Capture
       • Capture multi-structured data and retain inputs in raw form for iterative analysis
     Process
       • Parse the data into queryable format
       • Explore & analyze using Hive, Pig, Mahout and other tools to discover value
       • Label data and type information for compatibility and later discovery
       • Pre-compute stats, groupings, patterns in data to accelerate analysis
     Exchange
       • Use visualization tools to facilitate exploration and find key insights
       • Optionally move actionable insights into EDW or datamart
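As a rough, single-machine analogy for parsing data into a queryable format and pre-computing groupings, the sketch below loads a few raw lines into a typed table and runs a SQL aggregate over it. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL-style tool such as Hive; the `events` table, its columns, the sample lines and the helper names are all hypothetical.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical raw click records: page, customer id, time on page in ms.
RAW_EVENTS = ["home,c1,120", "cart,c1,340", "home,c2,80", "home,c3,100"]


def load_events(conn: sqlite3.Connection, raw_lines: List[str]) -> None:
    """Parse raw lines into a labeled, typed table so they can be queried."""
    conn.execute(
        "CREATE TABLE events (page TEXT, customer_id TEXT, duration_ms INTEGER)"
    )
    rows = []
    for line in raw_lines:
        page, customer_id, duration = line.split(",")
        rows.append((page, customer_id, int(duration)))
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)


def page_stats(conn: sqlite3.Connection) -> List[Tuple[str, int, float]]:
    """Pre-compute a grouping: visits and average duration per page."""
    cursor = conn.execute(
        "SELECT page, COUNT(*), AVG(duration_ms) "
        "FROM events GROUP BY page ORDER BY page"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_events(connection, RAW_EVENTS)
    for page, visits, avg_ms in page_stats(connection):
        print(page, visits, avg_ms)
```

A small summary table like the one `page_stats` returns is the kind of pre-computed result a visualization tool could read directly instead of rescanning the raw data.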
  10. Application Enrichment: Deliver Hadoop analysis to online apps (use case: Enrich)
     [Figure: Unstructured data, Log files and DB data flow through Capture, Parse and Derive/Filter on a scheduled and near-real-time basis, into low-latency NoSQL / HBase storage that serves Online Applications]
     Capture
       • Capture data that was once too bulky and unmanageable
     Process
       • Uncover aggregate characteristics across data
       • Use Hive, Pig and Map Reduce to identify patterns
       • Filter useful data from mass streams (Pig)
       • Micro or macro batch oriented schedules
     Exchange
       • Push results to HBase or other NoSQL alternative for real time delivery
       • Use patterns to deliver right content/offer to the right person at the right time
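The enrichment flow (filter a mass stream, derive a pattern in a batch, push the result to a low-latency store, then read it from an online application) can be sketched as below. This is a toy under stated assumptions: `KeyValueStore` is an in-memory dictionary standing in for HBase or another NoSQL store, not a real client, and the event fields, the "most viewed category" pattern and the function names are invented for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


class KeyValueStore:
    """In-memory stand-in for a low-latency store such as HBase (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._rows[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._rows.get(key)


def derive_top_category(events: Iterable[dict]) -> Dict[str, str]:
    """Filter the stream to 'view' events and find each customer's top category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for event in events:
        if event["action"] != "view":  # filter useful data from the stream
            continue
        counts[event["customer_id"]][event["category"]] += 1
    return {cust: c.most_common(1)[0][0] for cust, c in counts.items()}


def run_batch(events: Iterable[dict], store: KeyValueStore) -> None:
    """One scheduled batch: derive the pattern and push it to the store."""
    for customer_id, category in derive_top_category(events).items():
        store.put(customer_id, category)


def offer_for(store: KeyValueStore, customer_id: str, default: str = "bestsellers") -> str:
    """What an online application would call at request time."""
    return store.get(customer_id) or default


if __name__ == "__main__":
    sample = [
        {"customer_id": "c1", "action": "view", "category": "shoes"},
        {"customer_id": "c1", "action": "view", "category": "shoes"},
        {"customer_id": "c1", "action": "view", "category": "hats"},
        {"customer_id": "c2", "action": "purchase", "category": "hats"},
    ]
    kv = KeyValueStore()
    run_batch(sample, kv)
    print(offer_for(kv, "c1"), offer_for(kv, "c2"))
```

The split matters: the heavy aggregation happens in the batch, while the online application only performs a single key lookup, which is why the slide pairs batch schedules with a low-latency store.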
  11. Balancing Innovation & Stability
       • Hadoop is “pre-chasm”
       • Ecosystem still evolving
       • Enterprises endure 1-3 year adoption cycle
     [Figure: technology adoption curve, relative % of customers over time. Segments in order: Innovators, technology enthusiasts; Early adopters, visionaries; The CHASM; Early majority, pragmatists; Late majority, conservatives; Laggards, skeptics. Before the chasm, customers want technology & performance; after it, customers want solutions & convenience.]
     Source: Geoffrey Moore, Crossing the Chasm
  12. What Hortonworks does…
     We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop.
     Strategy: invest in Apache Hadoop to make it “The enterprise big data platform”
     Distribution
       • Hortonworks Data Platform (HDP)
       • Enterprise Ready, Stable, Reliable, Tested
       • 100% open source
       • Built by the architects, builders and operators of Apache Hadoop
     Ecosystem
       • Enable an Ecosystem of Big Data Apps
       • Our goal is to make sure all your tools work WITH Hadoop
       • HDP is Hadoop for
         – Microsoft
         – Teradata
     Support
       • Deliver highest quality support and expertise
       • Access to Apache Hadoop Experts
       • Hadoop training and certification by the Hadoop experts (web, public, private)
  13. AMSTERDAM, March 20-21, 2013
     Enabling the Next Generation Enterprise Data Platform
       • LEARN: Dozens of Sessions
       • INTERACT: Community Focused Event
     Register today! @ hadoopsummit.org