
Hybrid Data Warehouse Hadoop Implementations


Data Warehouse vendors are evolving to incorporate the best Hadoop has to offer. Similarly, the Hadoop ecosystem is growing to include capabilities previously available only to large scale (MPP) DW platforms.
  1. The future of hybrid Data Warehouse-Hadoop implementations. David Portnoy, Datalytx, Inc. 312.970.9740. © Copyright 2013 David Portnoy and Datalytx, Inc.
  2. Why this topic? A note on terminology: in this context, RDBMS (relational databases) are synonymous with DW (data warehouses).  Data Warehouse vendors are evolving to incorporate the best Hadoop has to offer. Similarly, the Hadoop ecosystem is growing to include capabilities previously available only to large-scale (MPP) DW platforms.  Understanding the trends and alternatives helps your organization identify the most effective long-term solution.  Launched in the TDWI forum on LinkedIn (see the discussion "Who are the winners in the race for the ultimate hybrid DBMS-Hadoop implementation?"). As described there, the industry is moving to a hybrid DBMS-Hadoop model that leverages the best of both worlds. (Microsoft, for example, is building its Polybase with cost-based query optimization that decides whether to push processing to the Hadoop data nodes or the PDW compute nodes.) Which vendors do you see as the current leaders in this race? And for the visionaries and philosophers among you... how do you see it ultimately shaking out? This group extends the TDWI community online and is designed to foster peer networking and discussion of key issues relevant to business intelligence and data warehousing managers. TDWI (The Data Warehousing Institute™) provides education, training, certification, news, and research for executives and information technology (IT) professionals worldwide. Founded in 1995, TDWI is the premier educational institute for business intelligence and data warehousing.
  3. Relationship between Hadoop & DW implementations To leverage the strengths of each platform, traditionally...  Hadoop is used for storage & transformation (specifically, ELT) of vast volumes of raw data  ...while the DW is used for analytics on a subset of the processed data. Pipeline: Extract from source systems -> Load to DW -> Transform in place -> Reporting & OLAP using traditional tools.
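The ELT pattern on this slide (load raw data first, transform inside the engine) can be sketched in miniature with Python's stdlib sqlite3 standing in for the warehouse; the table names, columns, and sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the DW engine.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw records land as untyped text, untransformed (ELT).
conn.execute("CREATE TABLE staging (sale_date TEXT, product TEXT, qty TEXT)")
rows = [("2013-01-05", "widget", "3"), ("2013-01-06", "gadget", "7")]
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

# Transform in place: the engine's own SQL does the typing and shaping,
# rather than transforming before the load (ETL).
conn.execute("""
    CREATE TABLE sales AS
    SELECT sale_date, product, CAST(qty AS INTEGER) AS qty
    FROM staging
""")

total = conn.execute("SELECT SUM(qty) FROM sales").fetchone()[0]
print(total)  # 10
```

In a real deployment the staging layer would be HDFS and the transform would run as a Hadoop job or in-database SQL; only the shape of the flow carries over.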
  4. Why do DW vendors care about Hadoop? ...And why not just ignore it as a special use case solution? 1. Compelling price point for storing high-volume data, especially if it's not needed for real-time access 2. It has become the de facto standard for ambitious big data projects 3. HDFS and MapReduce are becoming mature and stable technologies (in development since 2005), despite the fact that the rest of the ecosystem is still rapidly evolving 4. DBMS vendors have been missing a scalable distributed file system, such as HDFS, to provide the capability to store and manipulate variable and unstructured data
  5. Do companies with an MPP DW still need Hadoop? For companies that have an MPP (Massively Parallel Processing) data warehouse, such as Teradata or Netezza... Couldn't the DW platform do everything Hadoop could do? Yes and no. 1. Yes, you can store a lot of unstructured data in text fields, simulate the operations of key-value pairs, and scale processing capacity horizontally, just like Hadoop 2. But the types of processing that can be done against the DW are more limited than with Hadoop. (Although some DW vendors now allow open source tools, including MapReduce and R, to crunch the data.) 3. And the cost of storing high volumes of data – especially for low usage frequency and high latency operations – is much lower for Hadoop. So adding Hadoop to the mix can keep the size of the more expensive DW platform in check.
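Point 1 above, simulating key-value operations on a relational platform by keeping unstructured payloads in text fields, can be sketched with stdlib sqlite3; the schema, helper names, and sample record are all hypothetical.

```python
import sqlite3
import json

# A relational table standing in for a key-value store: arbitrary
# structures are serialized into a TEXT column, as the slide describes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def put(key, value):
    # Upsert the JSON-serialized payload under its key.
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                 (key, json.dumps(value)))

def get(key):
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

put("user:42", {"name": "Ada", "tags": ["mpp", "hadoop"]})
print(get("user:42")["name"])  # Ada
```

The point of the slide stands: this works, but the engine cannot optimize inside the opaque text payload the way it can over typed columns, which is one reason the processing options remain more limited than Hadoop's.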
  6. What paths are DW vendors taking? DW vendors can choose from 3 typical paths for responding to competing technologies. 1. Ignore: Hope it's a fad that goes away. None of the major vendors see this as a viable option. 2. Reactionary: Interoperate with existing Hadoop products. This seems to be the most common path for established commercial vendors. 3. Proactive: Embrace Hadoop and contribute to extending the ecosystem. This seems to be the approach for new entrants competing in a targeted niche.
  7. What will happen to major DW vendors? It's safe to say that major commercial vendors like IBM, Oracle, Teradata and Microsoft will continue to be key players. 1. Each one already has a product roadmap involving some way of responding to or incorporating the Hadoop ecosystem. 2. Many of the most successful new entrants will be acquired by these vendors and incorporated into their product lines. 3. This pattern can be seen in the historical evolution of revolutionary or niche technologies, such as columnar databases, in-memory databases, and self-service BI capabilities.
  8. Partnerships & Alliances Most large commercial DBMS vendors have partnerships with specific Hadoop distribution developers: Oracle (Big Data Appliance); IBM Netezza; Microsoft (HDInsight); Teradata & AsterData appliance; Greenplum (prior to GreenplumHD).
  9. Possible phases of a DW platform's evolution into a hybrid There are many ways to get to the end goal, but here's a possible evolution path for commercial DW vendors toward hybrid DBMS-Hadoop solutions. Independent: DW and Hadoop operate independently, storing completely different data sets depending on size and structure and processing them in completely different ways. Batch data movement: There's an efficient method to shuttle data back and forth between DW and Hadoop, focused primarily on loading a subset of data into the DW for analytics. Transformations typically happen in the context of ELT, rather than ETL. Integrated storage: Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, the system pulls it into the DW prior to executing the query against it. Optimized processing in place: Queries are issued against the system, which in turn determines where the requested data resides – DW or Hadoop. If the data resides on Hadoop, the query is executed in place within Hadoop, possibly after converting the logic to MapReduce, and the result is brought back to the DW.
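The last two phases differ only in the routing decision, which can be sketched as a toy planner. The catalog, size figures, and pull threshold below are invented for illustration; a real optimizer would use cost estimates, not a single cutoff.

```python
# Hypothetical catalog: table name -> (where it lives, size in GB).
CATALOG = {
    "customers": ("dw", 2),
    "clickstream": ("hadoop", 5000),
}

# Invented rule of thumb: pulling more than this into the DW is too costly.
PULL_THRESHOLD_GB = 100

def plan(table):
    location, size_gb = CATALOG[table]
    if location == "dw":
        return "run in DW"
    # Data lives on Hadoop: small sets are pulled in ("integrated storage");
    # large ones are processed in place ("optimized processing in place").
    if size_gb <= PULL_THRESHOLD_GB:
        return "pull into DW, then run"
    return "push query to Hadoop, return result to DW"

print(plan("customers"))    # run in DW
print(plan("clickstream"))  # push query to Hadoop, return result to DW
```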
  10. How about Hadoop distribution vendors? Hadoop distribution vendors like Cloudera, Hortonworks and MapR are developing products to add DW capabilities and bridge the gap between the two worlds. Their solution:  Improve on the limitations of MapReduce  Eliminate the data silos and overhead of moving data between DW and Hadoop
  11. What’s wrong with MapReduce? MapReduce is the original and purest processing environment for Hadoop. But it’s not ideal for a number of reasons  Performance: The long lag between query and results makes it difficult to do interactive analytics  Requires a high degree of skill with Java for processing data. Existing resources with SQL skills would be underutilized.  Doesn’t leverage existing company investment in ETL, reporting, and analytics tools
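For readers who haven't written one, the programming model being critiqued can be shown in a few lines of pure Python: map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. A real job would be Java classes submitted to a cluster; this single-process sketch only illustrates the shape of the model.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: aggregate all values seen for one key.
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                # map over every input record
        for key, value in map_fn(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["big data", "big warehouse"]))
# {'big': 2, 'data': 1, 'warehouse': 1}
```

Even this toy shows the slide's point: expressing a simple aggregation takes custom map and reduce code, where SQL would need one line.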
  12. Why not use DW-Hadoop connectors? The original approach of integrating DW with Hadoop – using connectors to move data back and forth – is not ideal: it introduces the costs and inefficiencies of dealing with data silos. To solve this problem, a more practical “SQL-on-Hadoop” architecture is being adopted. Typically, SQL-on-Hadoop capabilities include:  Interactive analytical queries (read-only)  Parallelism / distributed processing  Efficient joins across multiple tables  ANSI SQL compliance  Query caching  Ability to use existing ETL, OLAP and reporting tools from commercial vendors. SQL-on-Hadoop players:  Cloudera’s Impala  Hortonworks’ Stinger (faster Hive via ORCFile & Tez)  Apache Drill (supported by MapR)  Hadapt / HadoopDB  Greenplum’s HAWQ (on Pivotal HD)  Teradata’s SQL-H (on Aster/PostgreSQL)
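One of the listed capabilities, query caching, is easy to illustrate: repeated read-only queries are answered from memory instead of being re-executed. This is a generic sketch, not any vendor's implementation; the executor is a stand-in callable and the class name is invented.

```python
class QueryCache:
    """Serve repeated read-only queries from an in-memory cache."""

    def __init__(self, execute):
        self._execute = execute  # underlying engine call (stand-in here)
        self._cache = {}
        self.hits = 0

    def run(self, sql):
        if sql in self._cache:   # identical query text -> cached result
            self.hits += 1
            return self._cache[sql]
        result = self._execute(sql)
        self._cache[sql] = result
        return result

# Fake executor so the sketch runs without a database.
engine = QueryCache(lambda sql: f"rows for: {sql}")
engine.run("SELECT 1")  # miss: executes
engine.run("SELECT 1")  # hit: served from cache
print(engine.hits)  # 1
```

Caching like this is safe precisely because the workload is read-only, which is why the capability pairs naturally with interactive analytical queries; writes would require invalidation.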
  13. DW & Hadoop vendors approach from opposite directions DW vendors start with their relational DBMS platform and add interactivity with Hadoop:  Initially all processing might occur on the DBMS, with Hadoop being used for storage  Ultimately this evolves to pushing processing to the Hadoop cluster  Examples: Microsoft PDW / Polybase, Greenplum / Pivotal HD. Hadoop vendors start with their Hadoop distribution and add DBMS features:  This is also known as “SQL-on-Hadoop”  Add a query optimizer, real-time capabilities, etc.  Examples: Cloudera Impala, Hortonworks Stinger, Hadapt. In the end, both of these approaches might end up with very similar hybrid solutions.
  14. Utopian vision Ultimately, the user ideally doesn’t know (and doesn’t care) where the data is stored and how it’s processed, as made possible by...  Single toolset: Use a single set of ETL, reporting and analytics tools, regardless of where data resides  Automated optimization: The DW automatically decides between RDBMS and Hadoop...  Where to store data, based on its structure and directives on how it’s to be used  Where to push processing, based on query cost optimization
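The "where to store data" decision above can be sketched as a tiny placement rule that looks at a record's structure and its usage directives. The rule, directive names, and store labels are invented for illustration; a real system would weigh many more factors.

```python
def choose_store(record, directives):
    """Pick a backing store from a record's structure and usage directives.

    Invented rule: raw or schemaless payloads go to the cheap distributed
    file system; structured records needed for interactive analytics go
    to the RDBMS; everything else defaults to the cheaper store.
    """
    if not isinstance(record, dict):      # raw payload, no schema
        return "hadoop"
    if directives.get("interactive_analytics"):
        return "rdbms"
    return "hadoop"

print(choose_store("<raw log line>", {}))                        # hadoop
print(choose_store({"id": 1}, {"interactive_analytics": True}))  # rdbms
print(choose_store({"id": 1}, {}))                               # hadoop
```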
  15. Evolution of SQL-on-Hadoop: Operational data store Eventually, Hadoop implementations might evolve into supporting “operational” transactions:  Ability to handle the workloads that power websites and applications  Transaction-oriented write capability, rather than the read-only access of analytical queries  “ACID” database capabilities, including concurrency, distributed transactional support, and guarantees of data consistency
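The ACID guarantee that analytical-only engines lack can be demonstrated with stdlib sqlite3 as a stand-in single-node engine: a transfer that fails mid-transaction rolls back, so no partial write becomes visible. The account table and the simulated failure are invented for the demo; doing this across a distributed Hadoop cluster is exactly the hard part the slide alludes to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

try:
    with conn:  # atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated node failure mid-transaction")
except RuntimeError:
    pass  # the engine has already rolled the debit back

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
print(balance)  # 100 -- the partial debit never became visible
```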
  16. Microsoft's Approach Microsoft's strategy involves the Polybase initiative and the ability to leverage its extensive range of BI tools.  Polybase is the hybrid RDBMS-Hadoop platform, which spans queries across HDFS on HDInsight (Hortonworks’ distribution that runs on Windows) and Microsoft’s MPP DW platform, SQL Server PDW (Parallel Data Warehouse).  Polybase development is phased: Phase 1: Parallel data transfer between SQL Server Compute Nodes and Hadoop Data Nodes, but all processing is done on the DBMS. Phase 2: Use the query optimizer to decide where to process jobs; selectively push work to Hadoop by converting queries to MapReduce.  Integration with Microsoft’s BI stack, including:  Reporting (SQL Server Reporting Services)  OLAP (SQL Server Analysis Services)  Self-service BI (Excel, PowerPivot, Power View)
  17. Microsoft's Approach (cont.) The advantages of going the Microsoft route are numerous...  Make use of Hadoop using familiar, high-productivity self-service BI tools  Excel itself can handle data extracts from Hadoop  PowerPivot for large-scale data exploration using the xVelocity in-memory analytics engine  Power View for ad-hoc visualization in SSRS, accessible via SharePoint or Excel  In Office 2013, PowerPivot and Power View are natively integrated with Excel  Leverage existing and widely available .NET developers  Management of a Hadoop cluster (using Apache Ambari) is integrated with Microsoft System Center, already used by IT operators for database management  Deliver tighter security through integration with Windows Server Active Directory  Cloud-based Hadoop available through the Windows Azure HDInsight Service  Interactive access to Hadoop through Hortonworks' Hive ODBC driver
  18. Microsoft's Approach (cont.) And some of the disadvantages include...  Licensing costs for both the Windows Server instances that run HDInsight nodes and the SQL Server nodes associated with Polybase  Uncertain performance and adoption of the HDInsight distribution  Many of the advantages in integration and leveraging resources don’t apply for non-Microsoft shops
  19. Greenplum’s Pivotal HD Greenplum’s Pivotal HD implements Greenplum on top of HDFS:  Yet still capable of running MapReduce jobs if needed  More mature than many of its rivals  Capabilities extend well beyond those of open source distributions. But...  Proprietary technology means vendor lock-in and an inability to take advantage of a vibrant developer community  Licenses are expensive, and open source alternatives (Impala, Drill, Shark) are becoming available.
  20. Example SQL-on-Hadoop: HadoopDB / Hadapt Hadapt is a commercialized version of Daniel Abadi's HadoopDB project  For structured data, each Data Node uses a DBMS (Postgres or VectorWise) instead of HDFS  Load balancing and performance optimization across nodes [Figure: architecture of Hadapt, with Hive in the stack]
  21. Example SQL-on-Hadoop: BigSQL BigSQL is PostgreSQL implemented on Hadoop.