Hadoop: Extending your Data Warehouse


Offload your big data workloads to reduce cost, increase flexibility, meet critical SLAs.

Published in: Technology, Business

  • The rationale for multi-tiered DW architecture
  • EDWs are straining to keep up with the new demands being placed on them. Data volumes are snowballing, and increasingly, familiar analytic problems are consuming new forms of data. Customer retention: for mass markets, social networks are generating new insights into what customers really think and who is influencing them. For telcos, preventing customer churn goes well beyond dropped calls. Internet, IM, text, and yes, email are becoming the bulk of mobile carrier traffic, multiplying the volumes and types of log files that must be dissected to understand the customer experience. Operational efficiency means tapping into the Internet of Things (M2M), in addition to traditional OLTP systems, for logistics providers to deliver goods on time, airlines to sequence airport operations efficiently, utilities to manage generation & transmission, or manufacturers to fine-tune operations on the shop floor. Risk mitigation must expand the range of transactions covered and track externalities so that organizations understand their exposure to loss. Data retention requirements, especially in regulated industries, are forcing organizations to keep more data longer and to make hard decisions about what data to keep live. This is breaking the established EDW model, which was optimized for transforming MBytes/GBytes of well-understood structured data. It is also breaking the ETL model, as data transformations are bursting their batch windows. Internet players such as Facebook discovered this years ago, when their nightly batch windows to MySQL DWs were exceeding 24 hours. The same issue is now crossing over to mainstream enterprises.
  • Surging data volumes drove the need to flatten BI architecture, shifting data transformation loads onto the target system in a pattern known as Extract/Load/Transform (ELT). The DW was still separate from source systems, but in place of a staging server where data was drawn in, transformed, and then moved to a target, data transformation was placed inside the EDW. The emergence of the ELT pattern reflected the reality of shrinking batch windows and the need to minimize data movements.
  • The obvious advantage is that data movements are reduced; only a single movement from source to target is needed. The transformation workload is co-located with where the data is stored and analyzed.
  • Inexpensive data transformation platform: Compute cycles are cheap because the platform is low-cost. Well suited for performing batch transformation of data for downstream SQL DWs/data marts. Can replace your ETL staging server. Accommodates all kinds of data: No need to lay out tables ahead of time because you are loading to a file system. HDFS is efficient for sequential loading & processing of all kinds of data. Keep your options open: Don’t force-fit data & analytics. Late-binding approach to structuring data. Data schema can evolve over time as new sources of data become available. Hadoop has multiple options for representing structured data; you can add a Hive metadata layer and/or persist it in HBase tables. Extensibility: Hadoop is becoming a platform with multiple personalities. It originated with MapReduce processing, and many alternatives are emerging for different styles of processing: stream processing, graph, HPC patterns, etc. Applying data mining algorithms to uncover new insights. New frameworks are emerging rapidly. Best of both worlds: SQL convergence via Hive and emerging frameworks such as Impala. SQL querying for exploring and understanding your data through familiar BI tools. Extensibility: Low cost of storage allows raw data & transformed data to be kept side by side. The ability to accommodate variably structured data allows orgs to gain visibility into data & data sources traditionally outside the reach of SQL DWs.
  • Hadoop can replace your ETL server
  • Ovum believes that the hot spot for Hadoop development in 2013 is convergence with SQL. Cloudera has been an active player in making Hadoop SQL-friendly. It has long partnered with leading ETL, BI, and data warehousing platform and tool providers to offer connectivity between Hadoop and SQL platforms. In turn, many of these technology providers are taking connectivity to new levels by extending their offerings beyond interfacing with Hadoop to operating natively within it. Cloudera’s introduction of the Impala open source framework takes Hadoop-SQL convergence to the next level. Impala, an Apache open source project developed by Cloudera, brings interactive SQL query directly to Hadoop. It offers a high-performance, massively parallel processing framework that works against any Hadoop file format. While Impala utilizes the Hive metadata store, it provides a higher-performance alternative to relying on batch-oriented MapReduce and Hive processes. Impala helps business analysts iterate on modeling for data that may eventually be migrated to a data warehouse, making Hadoop a low-cost platform for iteratively discovering and structuring data.
  • Cloudera Navigator, a new feature of Cloudera Manager, tracks how data is utilized; specifically, it compiles an audit trail detailing what operations were performed against specific pieces of data, by whom, and when. In its initial release, Navigator will track activity against HDFS, Hive, and HBase.
  • Our design strategy is to tightly integrate and couple Impala within the Hadoop system. Impala (and interactive SQL) is just another application that you bring to your data. It’s integrated with Hadoop’s existing security and resource management frameworks and is completely interoperable with existing data formats and processing engines. One pool of data: storage platforms (HDFS & HBase); open data formats (files & records); shared across multiple processing frameworks. One metadata model: no synchronization of metadata between two different systems (analytical DBMS and Hadoop); same metadata used by other components within Hadoop itself (Hive, Pig, Impala, etc.). One security framework: single model for all of Hadoop; doesn’t require “turning off” any portion of native Hadoop security. One set of system resources: one set of nodes (storage, CPU, memory); one management console; integrated resource management; scale linearly as capacity or performance needs grow.
  • Interactive BI/analytics on more data: raw, full-fidelity data, with nothing lost through aggregation or ETL/ELT; new sources & types, structured and unstructured; historical data. Asking new questions: exploration and data discovery for analytics and machine learning (finding a data set for a model requires lots of simple queries to summarize, count, and validate); hypothesis testing, which avoids having to subset and fit the data to a warehouse just to ask a single question. Data processing with tight SLAs: cost-effective platform; minimize data movement; reduce strain on the data warehouse. Query-able storage: replace production data warehouse for DR/active archive; store decades of data cost-effectively (for better modeling or data retention mandates) without sacrificing the capability to analyze.
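
The ELT pattern described in the notes above (land the extracted data in the target first, then transform it where it is stored) can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the target DW, and the table names `raw_calls` and `calls_by_customer`, the sample rows, and the function `run_elt` are all hypothetical, not taken from the presentation.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows unchanged, then transform them inside the target database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the target DW (assumption)

    # Load: land the extracted rows as-is; no staging server sits in between.
    conn.execute("CREATE TABLE raw_calls (customer TEXT, duration_sec TEXT)")
    conn.executemany("INSERT INTO raw_calls VALUES (?, ?)", rows)

    # Transform: a single SQL statement executed by the target engine itself,
    # so the data makes only one movement from source to target.
    conn.execute(
        """
        CREATE TABLE calls_by_customer AS
        SELECT customer, SUM(CAST(duration_sec AS INTEGER)) AS total_sec
        FROM raw_calls
        GROUP BY customer
        """
    )

    result = conn.execute(
        "SELECT customer, total_sec FROM calls_by_customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data: durations arrive as untyped text, as in a log file.
SAMPLE_ROWS = [("alice", "60"), ("bob", "30"), ("alice", "15")]

if __name__ == "__main__":
    print(run_elt(SAMPLE_ROWS))
```

Because the raw table stays in place next to the transformed one, the sketch also shows the tradeoff the notes raise: the transformation competes for the same engine that serves analytic queries.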

    1. Hadoop: Extending Your Data Warehouse. Tony Baer | Principal Analyst, Ovum. Moderated by Matt Brandwein | Product Marketing Manager, Cloudera. May 9, 2013.
    2. Welcome to the webinar! • All lines are muted • Q&A after the presentation • Ask questions at any time by typing them in the “Questions” pane on your WebEx panel • Recording of this webinar will be available on-demand at cloudera.com • Join the conversation on Twitter: @cloudera @TonyBaer #EDWHadoop
    3. Who is Cloudera? What the Enterprise Requires: Only 100% open source Hadoop-based platform with both batch and real-time processing engines, enterprise-ready with native high availability; Suite of system and data management software; Comprehensive support and consulting services; Broadest Hadoop training and certification programs. Extensive Partner Ecosystem: Over 600 partners across hardware, software and services. The Leader in Big Data Management: Deliver a revolutionary data management platform powered by Apache Hadoop; World’s leading commercial vendor of Apache Hadoop; Enable organizations to improve operational efficiency and Ask Bigger Questions of all their data. Customers & Users Across Industries: More production deployments than all other vendors combined.
    4. Hadoop: Extending your Data Warehouse. Tony Baer, tony.baer@ovum.com. May 9, 2013. Twitter: @TonyBaer. © Copyright Ovum. All rights reserved. Ovum is a subsidiary of Informa plc.
    5. Agenda: The BI Bottleneck; Hadoop & Enterprise Data Warehousing strategy; How Cloudera supports Hadoop as extended DW.
    6. Traditional BI/Data warehousing architecture. [Diagram: Sources → Extract → Staging Server (Transform, ETL Tool) → Load → Target(s): DW, Data Marts]
    7. DW: The base case. DWs conceived for MBytes/GBytes of structured data; Data structured based on expected queries & analytics; Multiple tiers to separate distinct workloads (OLTP: ongoing, shallow interactions, simple queries; Transform: batch-oriented, IOPS-intensive; BI/analytics: data-intensive, spiky); Reduced or eliminated impact on OLTP; More complex architecture, more tradeoffs.
    8. EDW hitting the wall. Data growing in volume & complexity; Use cases require more, richer data (Customer retention; Operational efficiency; Risk mitigation); Data retention mandates/policies forcing hard decisions; ETL bursting batch windows; EDWs straining to accommodate volumes, varieties of data.
    9. The ELT pattern. [Diagram: Sources → Extract → Load/Transform → Target(s): DW, Data Marts]
    10. The benefits, and limits, of ELT. Pros: Fewer data movements; Flatter architecture; Reduced errors with fewer data movements. Cons: Transform vs. analytic workload tradeoffs; SLAs jeopardized; Triggers arms race for more infrastructure. [Chart labels: Processing Times, Infrastructure Costs, Data Volumes; assuming constant SLAs]
    11. Enterprise DWs: Size has its limits. SLAs hit the wall; Software licensing costs; PBytes @ $20k - $50k/TByte get $$$$$$; Managing/transforming new data types consumes resources.
    12. But what if... You don’t have to worry about batch windows; You don’t have to trade off transformation vs. analytic processing cycles; You can control s/w license cost escalation; You can keep that archived data live; You can more readily consume new types of data & keep your analytic options open.
    13. Agenda: The BI Bottleneck; Hadoop & Enterprise Data Warehousing strategy; How Cloudera supports Hadoop as extended DW.
    14. Introducing Hadoop. Originally, a data processing framework for solving unique Internet-scale problems; Based on Google File System (GFS) & MapReduce; Apache Hadoop community emerged to develop the platform for wider-scale adoption; Financial services, telcos, retail, and media discovered Hadoop’s benefits.
    15. Hadoop benefits. Scalability: near-linear performance up to 1000s of nodes. Cost: leverages commodity h/w & open source s/w. Flexibility: versatility with data, analytics & operation.
    16. Hadoop’s trump card: Flexibility. Accommodates all kinds of data; Accommodates multiple workloads; Keeps your options open; Extensibility (Life beyond MapReduce; Many personalities); Best of both worlds (Convergence with SQL). Get the best of both worlds.
    17. Hadoop as data transformation platform. [Diagram: Sources → Extract → Hadoop (Load/Transform) → Target: existing DW/Data Mart environment (DW, Data Marts)]
    18. Why Hadoop as your data transformation platform? Inexpensive cycles/storage; Low-cost platform reduces or eliminates tradeoff contingencies (No more transformation vs. analytics choice; Keep your archive active); Flexible division of labor (Data can remain in Hadoop or be moved to SQL; Raw data sits alongside transformed data).
    19. Why Hadoop as extension to your DW? Efficient division of labor (Run time-consuming, resource-intensive analytic workloads inside Hadoop; Routine query, analytics, & reporting in SQL DW or data mart); Query Hadoop directly (Most commercial BI tools read Hive metadata); Query Hadoop interactively (Emerging MapReduce alternatives supporting interactive query).
    20. Agenda: The BI Bottleneck; Hadoop & Enterprise Data Warehousing strategy; How Cloudera supports Hadoop as extended DW.
    21. Cloudera supports SQL convergence. Partners with leading ETL, BI, and data warehousing platform & tool providers (Connect Hadoop & SQL platforms; Emerging trend: BI, ETL tools are working natively inside Hadoop); Introducing Impala (Brings high-performance interactive SQL inside Hadoop; Turns Hadoop into an MPP SQL analytic data target; Extends, doesn’t replace, your SQL EDW or data mart; Makes your DW strategy more flexible, iterative).
    22. Taming Hadoop. Cloudera Manager (Automates deployment and health monitoring; Automates Hadoop configuration; New side-by-side deployment support); Cloudera Navigator (New feature of Cloudera Manager; Tracks data utilization activity from HDFS, Hive & HBase; Stepping stone for data security/stewardship… watch this space); Backup & Disaster Recovery (BDR) (New feature to automate recovery workflows).
    23. Hadoop: Takeaways. Economical platform for offloading data transformation cycles; Extends enterprise analytics; Hadoop & SQL are converging, broadening your analytic options; Hadoop won’t replace your EDW, but will take more of the workload; Cloudera actively broadening CDH to support & extend your EDW (SQL convergence; Platform manageability; Data security & stewardship).
    24. Impala: Cloudera’s Design Strategy. An Integrated Part of the Hadoop Platform. Complement MapReduce with interactive MPP SQL engine; One pool of data; One metadata model; One security framework; One set of system resources; 100% open source. [Diagram layers: Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, etc.; HBase: records); Integration; Resource Management; Metadata; Engines: Batch Processing (MapReduce, Hive & Pig…), Interactive SQL (Impala), Math, Machine Learning, Analytics]
    25. Impala Use Cases. Cost-effective, ad hoc query environment that offloads the data warehouse for: Interactive BI/analytics on more data; Asking new questions; Data processing with tight SLAs; Query-able archive w/ full fidelity.
    26. Leading BI tools work with Impala.
    27. Questions? • Type in the “Questions” panel • Tweet @cloudera #EDWHadoop • Recording will be available on-demand at cloudera.com • Contact us: tony.baer@ovum.com, Twitter: @TonyBaer; mbrandwein@cloudera.com, Twitter: @MattBrandwein. Thank you for attending! Try Cloudera today: cloudera.com/downloads. Learn more about Impala: cloudera.com/impala. Get Hadoop Training: university.cloudera.com. Ready to go? Check out Cloudera Quickstart: cloudera.com/quickstart
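
Slide 11 puts enterprise DW costs at $20k to $50k per TByte once volumes reach PBytes. A minimal sketch of that arithmetic follows, assuming the decimal convention of 1,000 TBytes per PByte; the function name `dw_cost_range` is hypothetical, and the per-TByte figures are simply the slide's own range.

```python
TB_PER_PB = 1000  # assumption: decimal convention, 1 PByte = 1,000 TBytes


def dw_cost_range(petabytes, low_per_tb=20_000, high_per_tb=50_000):
    """Return the (low, high) cost in dollars for a DW of the given size."""
    terabytes = petabytes * TB_PER_PB
    return terabytes * low_per_tb, terabytes * high_per_tb


if __name__ == "__main__":
    low, high = dw_cost_range(1)
    print(f"1 PByte: ${low:,} to ${high:,}")
```

Under these assumptions a single PByte lands between $20 million and $50 million, which is the scale the slide abbreviates as "$$$$$$".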