EDWs are straining to keep up with new demands being placed on them. Data volumes are snowballing, and increasingly, familiar analytic problems are consuming new forms of data.
• Customer retention: for mass markets, social networks are generating new insights on what customers really think and who is influencing them. For telcos, preventing customer churn goes well beyond dropped calls. Internet, IM, text, and yes, email are becoming the bulk of mobile carrier traffic, multiplying the volumes and types of log files that must be dissected to understand the customer experience.
• Operational efficiency means tapping into the Internet of Things (M2M) in addition to traditional OLTP systems: for logistics providers to deliver goods on time, for airlines to efficiently sequence airport operations, for utilities to manage generation & transmission, or for manufacturers to fine-tune operations on the shop floor.
• Risk mitigation must expand the range of transactions and track externalities so that organizations understand their exposure to loss.
• Data retention requirements, especially in regulated industries, are forcing organizations to keep more data longer, forcing hard decisions about what data to keep live.
This is breaking the established EDW model, optimized for transforming MBytes/GBytes of well-understood structured data. It is breaking the ETL model as data transformations are bursting their batch windows. Internet players such as Facebook discovered this years ago as their nightly batch windows to MySQL DWs were exceeding 24 hrs. The same issue is now crossing over to mainstream enterprises.
Surging data volumes drove the need to flatten BI architecture, shifting data transformation loads onto the target system in a pattern known as Extract/Load/Transform (ELT). The DW was still separate from source systems. But in place of a staging server where data was drawn in, transformed, and then moved to a target, data transformation was placed inside the EDW. The emergence of the ELT pattern reflected the reality of shrinking batch windows and the need to minimize data movement.
The obvious advantage is that data movement is reduced: only a single movement from source to target is needed. The transformation workload is co-located with the data, where it is stored and analyzed.
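As a rough illustration of the ELT pattern described above, the sketch below loads untransformed rows into the target in one movement and then runs the transformation as SQL inside that target. It is only a sketch under stated assumptions: SQLite stands in for the EDW, and the table names (raw_orders, daily_revenue), the sample rows, and the run_elt helper are hypothetical, not taken from the presentation.

```python
import sqlite3

# Hypothetical source rows; in practice these would be extracted from a source system.
SOURCE_ROWS = [
    ("2013-05-01", "alice", "19.99"),
    ("2013-05-01", "bob", "5.00"),
    ("2013-05-02", "alice", "7.50"),
]


def run_elt(rows):
    """Load raw rows into the target, then transform them inside the target."""
    target = sqlite3.connect(":memory:")  # stand-in for the EDW (assumption)

    # Extract + Load: a single movement of untransformed data into the target.
    target.execute(
        "CREATE TABLE raw_orders (order_date TEXT, customer TEXT, amount TEXT)"
    )
    target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: runs as SQL inside the target, next to the stored data,
    # instead of on a separate staging server.
    target.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
        FROM raw_orders
        GROUP BY order_date
        """
    )
    return target.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    for order_date, revenue in run_elt(SOURCE_ROWS):
        print(order_date, revenue)
```

In a classic ETL flow the CAST and SUM step would run on a staging server before the load; here the raw table and the derived table live side by side in the same system.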
Inexpensive data transformation platform
• Compute cycles cheap because low-cost platform.
• Well suited for performing batch transformation of data to downstream SQL DWs/data marts. Can replace your ETL staging server.
Accommodates all kinds of data
• No need to lay out tables ahead of time because you are loading to a file system.
• HDFS efficient for sequential loading & processing of all kinds of data.
Keep your options open
• Don’t force-fit data & analytics
• Late-binding approach to structuring data
• Data schema can evolve over time as new sources of data become available
• Hadoop has multiple options for representing structured data. You can add a Hive metadata layer and/or persist it in HBase tables.
Extensibility: Hadoop becoming a platform with multiple personalities
• Originated with MapReduce processing
• Many alternatives emerging for different styles of processing: stream processing, graph, HPC patterns, etc.
• Applying data mining algorithms to uncover new insights
• New frameworks emerging rapidly
Best of both worlds
• SQL convergence via Hive, emerging frameworks such as Impala
• SQL querying for exploring and understanding your data through familiar BI tools
• Extensibility
• Low cost of storage allows raw data & transformed data to be kept side-by-side
• Ability to accommodate variably structured data allows orgs to gain visibility into data & data sources traditionally outside the reach of SQL DWs
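The late-binding idea in the notes above (load raw files first, decide on structure when reading) can be sketched in a few lines. This is an illustration only: the raw lines, the field names, and the read_with_schema helper are hypothetical, and a plain Python list stands in for files on HDFS.

```python
import json

# Hypothetical raw log lines, stored as-is with no table laid out ahead of time.
RAW_LINES = [
    '{"user": "alice", "event": "call_dropped", "ts": 1368000000}',
    '{"user": "bob", "event": "sms_sent", "ts": 1368000060, "network": "3G"}',
    "not json at all",
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields, with defaults."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input stays in raw storage; this reader skips it
        if not isinstance(record, dict):
            continue
        yield {field: record.get(field, default) for field, default in schema.items()}


# The schema can evolve without reloading or rewriting the raw data.
SCHEMA_V1 = {"user": None, "event": None}
SCHEMA_V2 = {"user": None, "event": None, "network": "unknown"}

if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SCHEMA_V2):
        print(row)
```

Because the raw lines are never rewritten, a later schema (SCHEMA_V2 here) can expose a field that the first schema ignored.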
Hadoop can replace your ETL server
Ovum believes that the hot spot for Hadoop development in 2013 is convergence with SQL. Cloudera has been an active player in making Hadoop SQL-friendly. It has long partnered with leading ETL, BI, and data warehousing platform and tool providers to offer connectivity between Hadoop and SQL platforms. In turn, many of these technology providers are taking connectivity to new levels by extending their offerings beyond interfacing to Hadoop to operating natively within it.
Cloudera’s introduction of the Impala open source framework takes Hadoop-SQL convergence to the next level. Impala, an Apache-licensed open source project developed by Cloudera, brings interactive SQL query directly to Hadoop. It offers a high-performance, massively parallel processing framework that works against any Hadoop file format. While Impala utilizes the Hive metadata store, it provides a higher-performance alternative to relying on batch-oriented MapReduce and Hive processes.
Impala helps business analysts iterate modeling for data that may eventually be migrated to a data warehouse. The result is a low-cost platform for iteratively discovering and structuring data.
Cloudera Navigator, a new feature of Cloudera Manager, tracks how data is utilized; specifically, it compiles an audit trail detailing what operations were performed against specific pieces of data, by whom, and when. In its initial release, Navigator will track activity against HDFS, Hive, and HBase.
Our design strategy is to tightly integrate and couple Impala within the Hadoop system. Impala (and interactive SQL) is just another application that you bring to your data. It’s integrated with Hadoop’s existing security and resource management frameworks and is completely interoperable with existing data formats and processing engines.
One pool of data
• Storage platforms (HDFS & HBase)
• Open data formats (files & records)
• Shared across multiple processing frameworks
One metadata model
• No synchronization of metadata between 2 different systems (analytical DBMS and Hadoop)
• Same metadata used by other components within Hadoop itself (Hive, Pig, Impala, etc.)
One security framework
• Single model for all of Hadoop
• Doesn’t require “turning off” any portion of native Hadoop security
One set of system resources
• One set of nodes: storage, CPU, memory
• One management console
• Integrated resource management
• Scale linearly as capacity or performance needs grow
Interactive BI/Analytics on more data
• Raw, full fidelity data: nothing lost through aggregation or ETL/LT
• New sources & types: structured/unstructured
• Historical data
Asking new questions
• Exploration and data discovery for analytics and machine learning: need to find a data set for a model, which requires lots of simple queries to summarize, count, and validate.
• Hypothesis testing: avoid having to subset and fit the data to a warehouse just to ask a single question
Data processing with tight SLAs
• Cost-effective platform
• Minimize data movement
• Reduce strain on data warehouse
Query-able storage
• Replace production data warehouse for DR/active archive
• Store decades of data cost effectively (for better modeling or data retention mandates) without sacrificing the capability to analyze
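The "lots of simple queries to summarize, count, and validate" mentioned under exploration might look roughly like the sketch below. It makes several assumptions: SQLite stands in for an interactive SQL engine such as Impala, and the calls table, its columns, and the profile and demo_connection helpers are hypothetical.

```python
import sqlite3


def profile(conn, table, column):
    """Summarize, count, and validate one column with simple aggregate queries."""
    # Identifiers are interpolated directly, so they must come from trusted code.
    row = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    total, non_null, distinct, lowest, highest = row
    return {
        "rows": total,
        "nulls": total - non_null,  # COUNT(column) ignores NULLs
        "distinct": distinct,
        "min": lowest,
        "max": highest,
    }


def demo_connection():
    """Build a small in-memory table of hypothetical call records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calls (customer TEXT, duration_sec INTEGER)")
    conn.executemany(
        "INSERT INTO calls VALUES (?, ?)",
        [("alice", 120), ("bob", None), ("alice", 45), ("carol", 300)],
    )
    return conn


if __name__ == "__main__":
    print(profile(demo_connection(), "calls", "duration_sec"))
```

Each call issues one cheap aggregate query, which is the kind of workload that benefits from interactive response times rather than batch jobs.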
Hadoop: Extending your Data Warehouse
Tony Baer | Principal Analyst, Ovum
Moderated by Matt Brandwein | Product Marketing Manager, Cloudera
May 9, 2013
Welcome to the webinar!
• All lines are muted
• Q&A after the presentation
• Ask questions at any time by typing them in the “Questions” pane on your WebEx panel
• Recording of this webinar will be available on-demand at cloudera.com
• Join the conversation on Twitter: @cloudera @TonyBaer #EDWHadoop
Who is Cloudera?
The Leader in Big Data Management
• Deliver a revolutionary data management platform powered by Apache Hadoop
• World’s leading commercial vendor of Apache Hadoop
• Enable organizations to improve operational efficiency and Ask Bigger Questions of all their data
What the Enterprise Requires
• Only 100% open source Hadoop-based platform with both batch and real-time processing engines, enterprise-ready with native high availability
• Suite of system and data management software
• Comprehensive support and consulting services
• Broadest Hadoop training and certification programs
Extensive Partner Ecosystem
• Over 600 partners across hardware, software and services
Customers & Users Across Industries
• More production deployments than all other vendors combined
Impala: Cloudera’s Design Strategy
An Integrated Part of the Hadoop Platform
[Diagram: platform layers]
• Engines: Batch Processing (MapReduce, Hive & Pig…), Interactive SQL (Impala), Math (Machine Learning, Analytics)
• Metadata
• Resource Management
• Storage: HDFS (Text, RCFile, Parquet, Avro, etc.), HBase (records)
• Integration
Key points
• Complement MapReduce with interactive MPP SQL engine
• One pool of data
• One metadata model
• One security framework
• One set of system resources
• 100% open source
Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions
• Data processing with tight SLAs
• Query-able archive w/ full fidelity
Questions?
• Type in the “Questions” panel
• Tweet @cloudera #EDWHadoop
• Recording will be available on-demand at cloudera.com
• Contact us:
firstname.lastname@example.org, Twitter: @TonyBaer
mbrandwein@cloudera.com, Twitter: @MattBrandwein
Thank you for attending!
Try Cloudera today: cloudera.com/downloads
Learn more about Impala: cloudera.com/impala
Get Hadoop Training: university.cloudera.com
Ready to go? Check out Cloudera Quickstart: cloudera.com/quickstart