Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution - analyzing existing SQL workloads to provide instant insights into your workloads and turns that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera’s enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
The original scalable, general, processing engine of Hadoop ecosystem - Useful across diverse problem domains - Fueled initial ecosystem explosion
What’s really significant about this architecture is how it unifies diverse access to common data.
In traditional approaches, you’d have separate systems to collect, store, process, explore, model, and serve data. Different teams would use different systems for each workload, and users whose roles span multiple systems would have to use several of them to achieve their objectives.
With Cloudera’s enterprise data hub: You can perform end-to-end data workflows in a single system, dramatically lowering time to value. Each workload can access unlimited data, thanks to the underlying data platform, enhancing the value of each workload. Power users can now access their data in new ways: SQL, search, machine learning, programming, etc. At the same time, new users are enabled by these diverse workloads to interact with data.
Cloudera Enterprise provides comprehensive support for batch, interactive, and real-time workloads: Batch Data integration with Apache Sqoop Data processing with MapReduce, Apache Hive, Apache Pig Memory-centric processing with Apache Spark Interactive Analytic SQL with Impala Search with Apache Solr Machine Learning with Apache Spark Real-Time Data integration with Apache Kafka, Apache Flume Stream processing with Apache Spark Data serving with Apache HBase
Shared resource management ensures that each workload is handled appropriately and abides by IT policy.
What’s more, 3rd party tools, such as SAS or Informatica can run as native workloads inside Cloudera’s enterprise data hub.
Our goal is to provide the best tools for a particular job
* Hive is the best for batch, and of course we want to make that experience better. * Impala is purpose built for interactive BI on Hadoop. Latency, concurrency, vendor ecosystem, and partner certification. * Spark SQL in the future will enable Spark developers to inline SQL as steps within their Spark application
Link to account record in SFDC (valid for Cloudera employees only): https://na6.salesforce.com/0018000000zmcRQ?srPos=0&srKp=001
Premier’s enterprise data hub improves healthcare efficiency, analyzing $41 billion in spend.
Background: Premier is an alliance whose mission is to improve the health of communities. By collecting, integrating, and analyzing clinical, financial, and operational data from the 3,000 U.S. hospitals and 110,000 other healthcare providers in its alliance, Premier’s database is one of the deepest and most comprehensive in the industry. The company has insight into $41 billion in purchases, and one out of every three health system discharges nationwide.
Challenge: Premier must find the most efficient way to collect, cleanse, and load data from thousands of different data sets into its solution, which provides a six to nine month rolling window of history to healthcare providers. Providers use this information for clinical quality and cost analysis, and to guide their medical supply chain management decisions.
In Premier’s incumbent environment, the multi-step process to ingest and make data available for analysis had grown complex, expensive, time-consuming, and wouldn’t scale. As the number of members in Premier’s alliance continued to grow, both the types and volumes of data collected started to balloon, leading to two primary data management challenges for Premier: data ingestion and supply chain data matching. Solution: In December 2013, Premier deployed a multi-tenant enterprise data hub (EDH) on Cloudera in production, supporting four use cases: First, clinical data integration: Premier ingests clinical data into the EDH, and through Impala, makes it available to healthcare practitioners for analysis and visualization via business intelligence tools including IBM Cognos, MicroStrategy, and Tableau. With the MapReduce data processing framework, Premier can automatically eliminate data that hasn’t changed from incoming data sets -- it can just process the subset of data that is new or different. The second use case is data transformation, processing and cleansing: Premier is migrating key ETL processes from traditional tools to Cloudera to simplify the big data environment, streamline data processing, and reduce costs. Cloudera processes all incoming data and then feeds two operational data stores in addition to Premier’s IBM Netezza data warehouse appliance.
The third use case is supply chain data matching: Using MapReduce, Cloudera Search, and Cloudera Impala, Premier can index data in batch, process incoming data sets and match them against the existing index. The results are then made available for querying through Hive and Impala. The fourth use case is interactive, analytical member spend application: Premier combined clinical, financial, and operational data into its centralized EDH to empower an interactive application that analyzes the $41 billion in spend across all of its member organizations.
Results: Premier’s EDH reduces the total amount of data that needs to be processed with each data set provided by hospitals, and total processing time has been significantly reduced -- meaning Premier can deliver new, fresh, and comprehensive data and analytics to healthcare providers faster than before.
And because Hadoop can handle both structured and unstructured data, healthcare providers don’t need to manually enter data that isn’t captured electronically, such as information about images or handwritten nurses’ notes. Raw data in all formats and from numerous sources can be quickly loaded into the system. Eliminating manual data entry also reduces data duplication and keying errors that would otherwise result.
With Premier’s new solution for supply chain data matching, Premier can match 98% of datasets, which was unprecedented. And its member spend application leveraging Cloudera Search for predictive analytics has increased spend categorization across all members from 49.8% to 72%. This enables members to perform market share calculations, spend trending, and identify savings opportunities.
Internal teams are also finding other interesting use cases to take advantage of the processing speed, efficiencies, and analytic flexibility offered by the EDH.
For example, Premier is starting to compare product masters with supplies. This use case is similar to the supply chain data matching operation. Premier will be able to, for example, compare cardiologists to see which ones are performing better, and identify whether specific products lead to higher quality results. Analysts can then drill down to evaluate those products, when they’re being purchased, and at what price, and under what contract. Because Premier’s system has visibility into thousands of healthcare providers’ systems and operations, Premier can ultimately make data driven recommendations to help healthcare providers secure better products at a lower cost.
In response, many organizations have turned to a new architecture – an enterprise data hub – to complement and extend existing investments.
An enterprise data hub can store unlimited data, cost-effectively and reliably, for as long as you need, and lets users access that data in a variety of ways. Data can be collected, stored, processed, explored, modeled, and served in one unified platform. It’s connected to the systems you already rely on.
Cloudera’s enterprise data hub, powered by Apache Hadoop, the popular open source distributed data platform, is differentiated in several crucial areas. We provide: Leading query performance. The enterprise management and governance that you require of all of your mission-critical infrastructure. Comprehensive, transparent, compliance-ready security at the core. An open source platform that is also built of open standards – projects that are supported by multiple vendors to ensure sustainability, portability, and compatibility.
Our platform runs in your choice of environment, whether on-premises or in the cloud.
Cheat Sheet version: Our enterprise data hub is: One place for unlimited data Accessible to anyone Connected to the systems you already depend on Secure, governed, managed & compliant Built on open source and open standards Deployed however you want Coupled with the support and enablement you need to succeed.
Important Note: Our EDH emphasizes “unified analytics” over “unified data”: It’s not practical or probable that customers will actually unify all their data. Much of it lives in the cloud or on storage (e.g. Isilon), in remote datacenters, is of uncertain value vs. cost of moving it to a hub, or security mandates preclude collocation. We enable customers to gather unlimited data, while bringing diverse processing and analytics to that data.
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5