SQL on Hadoop: Defining the New Generation of Analytics Databases

The analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google's MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive's data model. Then, we will unravel the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.

Published in: Technology
  • Databases are tools that let you ask questions about data. The architecture of a database depends heavily on the design of the system that stores the data. Hadoop, and HDFS in particular, represents a radical change to the underlying storage infrastructure. In order to capitalize on these changes we need to redesign the database from the ground up. That’s the goal of these new systems.
  • Make sure we’re on the same page. Next: Enterprise Storage Model
  • Availability: fault tolerance through RAID. Accessibility: shared files, POSIX file API. Problems: cost, scalability. Outro: Folks at Google were aware of these problems when they were building their search engine. - Fibre channel,
  • Distributed Block Store. ACM interview: Sean Quinlan and Kirk McKusick: http://queue.acm.org/detail.cfm?id=1594206
  • Did this solve the problem? Commodity: yes. Fault tolerance: yes. Scalability: no. MR is the missing piece. Outro: 2005: Mike Cafarella, Doug Cutting, Nutch. Doug Cutting and Mike Cafarella launched the Hadoop project a year later. HDFS + MapReduce
  • SQL on Hadoop: Defining the New Generation of Analytics Databases

    1. Hadoop Summit, June 2013. SQL on Hadoop: Defining the New Generation of Analytic Databases
    2. Speaker Bio: Carl Steinbach. Currently: Engineer @ Citus Data; PMC Chair, Committer -- Apache Hive Project. Formerly: Oracle, NetApp, Informatica, Cloudera. Twitter: @cwsteinbach. LinkedIn: carlsteinbach
    3. This talk is about: A New Type of Distributed Analytic Database
    4. What Is an Analytic Database? OLAP (Online Analytical Processing): Consolidation (Roll-up); Drill-down; Slicing and Dicing. No Transactions; Large Sequential Scans; I/O Bound
    5. Motivation: The Problem with Enterprise Storage. [Diagram: a Storage Tier (NAS/SAN) and a Server/Worker Tier of twelve servers, joined by a "Really Big Pipe"]
    6. Google File System (’03): A Possible Solution? Design Priorities: Commodity Hardware; Fault Tolerance; Big Files / Big Blocks; Big Sequential Reads/Writes. Design Tradeoffs: No random writes (write once/read many); Slow random reads; Not POSIX compliant
    7. So GFS Solved the Problem? Yes, but not because of anything described in the original paper. Client/Server approach won’t scale. Full scope of GFS revealed one year later with publication of the MapReduce (’04) paper. GFS + MapReduce Key Idea: Eliminate the I/O Bottleneck by Colocating Compute and Storage Resources on the Same Node
    8. What’s Good About Hadoop? Commodity Storage; Scale-out; Fault Tolerance; Flexibility; MapReduce; Multi-structured Data
    9. What’s Bad About Hadoop? MapReduce! No Schemas! Missing Features: Optimizer, Indexes, Views. Incompatibility with Existing Tools: BI, ETL, IDEs
    10. Apache Hive Solved Many of These Problems: SQL to MapReduce Compiler + Execution Engine; Pluggable Storage Layer (SerDes); Schema-on-Read
    11. But Other Problems Remained. Many Missing Features: ANSI SQL; Cost Based Optimizer; UDFs; Data Types; Security; … Biggest Problem: MapReduce Latency Overhead
    12. Work in Progress: Hive Improvements. Stinger Initiative: Columnar Query Engine; ORCFile File Format; Replace MR with Tez (Apache Incubator)
    13. One Solution: MPP Database + Hadoop Connector. [Diagram: an MPP Database Cluster (an MPP Master Node with a Global Query Executor, plus four MPP Worker Nodes, each with a Local Query Executor) and a Hadoop Cluster of four HDFS datanodes]
    14. One Solution: MPP Database + Hadoop Connector. [Same diagram as slide 13, annotated "Pull Data"]
    15. One Solution: MPP Database + Hadoop Connector. [Same diagram as slide 13, annotated "Pull Data" and "IO Bottleneck"]
    16. A Better Solution: New Architecture for SQL on Hadoop. [Diagram: the MPP Master Node with Global Query Executor, four MPP Worker Nodes with Local Query Executors, and four HDFS datanodes, annotated "Maintain Data Locality" and "Push Work To Data"]
    17. New Architecture for SQL on Hadoop. Data Locality: Block-Aware Query Planner Pushes Work to Data. Real-Time Query Performance: Replace MapReduce. Schema-on-Read: Pluggable Storage Format Handlers. Tight Integration with SQL Ecosystem Tools
    18. Examples of the New Architecture. Google Dremel: Interactive ad hoc query system for read-only nested data. Powers BigQuery. Apache Drill: Open source version of Dremel. Implemented in Java. Work in progress. Cloudera Impala: Heavily influenced by MonetDB/X100. Runtime codegen. CPU cache aware. Implemented in C++. Citus Data: Built on PostgreSQL. Powerful cost based optimizer for disk I/O. Handles failures.
    19. The New Architecture in Detail: CitusDB. [Diagram: PostgreSQL Tools and ODBC/JDBC Clients; a CitusDB Master Node (Metadata, Distributed Query Planner, Distributed Query Executor); Hadoop Metadata on the HDFS NameNode; three worker nodes, each with an HDFS datanode, a Local Query Planner, a Local Query Executor, and Foreign Data Wrappers]
    20. CitusDB: Metadata Synchronization. [Same diagram as slide 19, annotated "Metadata Sync" with the statements CREATE TABLE emp and CREATE FOREIGN TABLE emp_{block_id} …]
    21. CitusDB: Query Execution. [Same diagram as slide 19, annotated with the query] SELECT AVG(sal) FROM emp WHERE job = 'manager';
    22. CitusDB: Query Execution. [Same diagram as slide 19, annotated "Local Queries"] SELECT SUM(sal), COUNT(sal) FROM emp_{block_id} WHERE job = 'manager';
    23. CitusDB: Query Execution. [Same diagram as slide 19, annotated "Local Results"] {842176.53, 8} {1234283.00, 12} {0.00, 0} {125500.00, 1} {523100.00, 3} {785300.32, 5}
    24. CitusDB: Query Execution. [Same diagram as slide 19, annotated with the final result] {121046.58}
    25. Why We Chose PostgreSQL. Powerful Cost-Based Optimizer; Designed to minimize disk I/O; Extensible, Rich Type System; Pluggable Storage Format Handlers; Lots of Extensions: Geospatial, Full Text Search, JSON, etc.; Enterprise Features: ODBC/JDBC, Security, Internationalization
    26. Defining the New Generation of Distributed Analytic Databases. SQL → Ease of Use, Increased Productivity. Real-time responsiveness → Faster. Data Locality → Proven Scalability. Schema-on-Read → Flexibility, Lower Cost
    27. Where Are We At? CitusDB SQL on Hadoop is in Open Beta. Download our Binary Packages or Use Our EC2 AMI: http://citusdata.com/docs/sql-on-hadoop
    28. We’re Hiring! http://citusdata.com/job
    29. Questions?
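Slides 10 and 17 name schema-on-read as a defining feature: data is stored as raw files, and a schema is applied only when a query reads them. The following is a minimal Python sketch of that idea, not Hive's or CitusDB's actual implementation. The sample rows, the `EMP_SCHEMA` list, and the `read_with_schema` function are all hypothetical names chosen for illustration; the reader function loosely plays the role that the slides assign to a SerDe or storage format handler.

```python
import csv
import io

# Raw data as it might sit in a file: no schema is attached at write time.
# (Hypothetical sample rows, not taken from the talk.)
RAW_LINES = "1,alice,manager,5200.50\n2,bob,clerk,3100.00\n3,carol,manager,6100.25\n"

# Hypothetical schema, declared separately and applied only when reading.
EMP_SCHEMA = [("id", int), ("name", str), ("job", str), ("sal", float)]


def read_with_schema(raw_text, schema):
    """Yield one dict per row, converting each field with the schema's type."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: convert(value) for (name, convert), value in zip(schema, fields)}


rows = list(read_with_schema(RAW_LINES, EMP_SCHEMA))
manager_salaries = [row["sal"] for row in rows if row["job"] == "manager"]
```

Because the schema lives outside the data, the same bytes could be read again with a different schema (for example, every column as `str`) without rewriting the file.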
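Slides 16 and 17 describe a block-aware query planner that pushes work to the nodes already holding the data. The sketch below shows one simple way such a planner could assign blocks to nodes, assuming it has a map from block IDs to replica locations. The block map, node names, and the greedy least-loaded rule are assumptions made for illustration; the talk does not say how CitusDB or Impala actually choose among replicas.

```python
# Hypothetical block map: block id -> nodes that hold a replica of that block.
BLOCK_LOCATIONS = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-b", "node-c"],
    "blk_3": ["node-a", "node-c"],
    "blk_4": ["node-c", "node-a"],
}


def plan_local_tasks(block_locations):
    """Assign each block to the least-loaded node that already stores it."""
    assignments = {}
    for block_id in sorted(block_locations):
        replicas = block_locations[block_id]
        # Ties are broken by node name so the plan is deterministic.
        target = min(replicas, key=lambda node: (len(assignments.get(node, [])), node))
        assignments.setdefault(target, []).append(block_id)
    return assignments


plan = plan_local_tasks(BLOCK_LOCATIONS)
for node, blocks in sorted(plan.items()):
    print(node, blocks)
```

Every task in the resulting plan reads a block from local disk, which is the property the connector-based design on slides 13 to 15 lacks.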
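Slides 21 to 24 show how a global `AVG(sal)` is rewritten into per-block `SUM(sal), COUNT(sal)` queries whose results are merged on the master node. The Python sketch below reproduces only the merge step, using the six local result pairs from slide 23; the function name `merge_average` is hypothetical. With those six pairs the arithmetic gives roughly 121046.89 rather than the {121046.58} shown on slide 24, so the slide figures appear to be illustrative.

```python
# Local results copied from slide 23: one (SUM(sal), COUNT(sal)) pair per block.
LOCAL_RESULTS = [
    (842176.53, 8),
    (1234283.00, 12),
    (0.00, 0),
    (125500.00, 1),
    (523100.00, 3),
    (785300.32, 5),
]


def merge_average(partials):
    """Combine per-block (sum, count) pairs into one global average."""
    total_sum = sum(block_sum for block_sum, _ in partials)
    total_count = sum(block_count for _, block_count in partials)
    if total_count == 0:
        return None  # Mirrors SQL, where AVG over zero rows is NULL.
    return total_sum / total_count


global_avg = merge_average(LOCAL_RESULTS)

# For contrast: averaging the per-block averages weights every non-empty block
# equally, regardless of how many rows it holds, and gives a different answer.
block_avgs = [s / c for s, c in LOCAL_RESULTS if c > 0]
naive_avg = sum(block_avgs) / len(block_avgs)
```

This is why the local queries return a sum and a count rather than a local average: sums and counts can be added across blocks, while averages cannot.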
