SQL on Hadoop: Defining the New Generation of Analytics Databases
 


The analytics and data warehousing industries are in the midst of a major period of transformation. Since the publication of Google's MapReduce paper, we have witnessed the appearance of Apache Hadoop, followed by the arrival of batch-oriented SQL systems like Apache Hive, and the scramble by established SQL vendors to implement Hadoop connectors. This talk addresses the recent emergence of a new generation of analytic databases inspired by Google Dremel. These databases have been designed with the goal of running real-time SQL natively on Hadoop in a manner that fully exploits the flexibility and performance of the underlying platform. Characterized by features including schema-on-read, support for semi-structured data, and pluggable storage engines, these new systems share important architectural details that distinguish them from the previous generation of analytic databases. In this talk, we will discuss the performance limitations of the connector-based approach employed by many established vendors and explain the long-term significance of Apache Hive's data model. Then, we will unravel the novel architectural features common to next-generation analytic database systems like CitusDB and Impala that make real-time SQL-on-Hadoop feasible. Finally, we will conclude by reviewing several important database lessons learned over the previous decades that remain relevant today.



Usage Rights

© All Rights Reserved

  • Databases are tools that let you ask questions about data. The architecture of a database depends heavily on the design of the system that stores the data. Hadoop, and HDFS in particular, represent a radical change to the underlying storage infrastructure. In order to capitalize on these changes we need to redesign the database from the ground up. That’s the goal of these new systems.
  • Make sure we’re on the same page. Next: Enterprise Storage Model.
  • Availability: fault tolerance through RAID. Accessibility: shared files, POSIX file API. Problems: cost, scalability. Outro: Folks at Google were aware of these problems when they were building their search engine. Fibre channel,
  • Distributed block store. ACM interview with Sean Quinlan and Kirk McKusick: http://queue.acm.org/detail.cfm?id=1594206
  • Did this solve the problem? Commodity: yes. Fault tolerance: yes. Scalability: no. MR is the missing piece. Outro: 2005: Mike Cafarella, Doug Cutting, Nutch. Doug Cutting and Mike Cafarella launched the Hadoop project a year later. HDFS + MapReduce.

Presentation Transcript

  • Hadoop Summit, June 2013. SQL on Hadoop: Defining the New Generation of Analytic Databases
  • Slide 1: Speaker Bio. Carl Steinbach. Currently: Engineer at Citus Data; PMC Chair and Committer, Apache Hive Project. Formerly: Oracle, NetApp, Informatica, Cloudera. Twitter: @cwsteinbach. LinkedIn: carlsteinbach.
  • Slide 2: This talk is about a new type of distributed analytic database.
  • Slide 3: What Is an Analytic Database? OLAP (Online Analytical Processing): consolidation (roll-up), drill-down, slicing and dicing. No transactions; large sequential scans; I/O bound.
  • Slide 4: Motivation: The Problem with Enterprise Storage. [Diagram: a storage tier (NAS/SAN) and a server/worker tier of servers, joined by a "really big pipe".]
  • Slide 5: Google File System (’03): A Possible Solution? Design priorities: commodity hardware; fault tolerance; big files / big blocks; big sequential reads/writes. Design tradeoffs: no random writes (write once/read many); slow random reads; not POSIX compliant.
  • Slide 6: So GFS Solved the Problem? Yes, but not because of anything described in the original paper. The client/server approach won’t scale. The full scope of GFS was revealed one year later with the publication of the MapReduce (’04) paper. GFS + MapReduce key idea: eliminate the I/O bottleneck by colocating compute and storage resources on the same node.
  • Slide 7: What’s Good About Hadoop? Commodity storage; scale-out; fault tolerance; flexibility; MapReduce; multi-structured data.
  • Slide 8: What’s Bad About Hadoop? MapReduce! No schemas! Missing features: optimizer, indexes, views. Incompatibility with existing tools: BI, ETL, IDEs.
  • Slide 9: Apache Hive Solved Many of These Problems. SQL-to-MapReduce compiler + execution engine; pluggable storage layer (SerDes); schema-on-read.
  • Slide 10: But Other Problems Remained. Many missing features: ANSI SQL, cost-based optimizer, UDFs, data types, security, … Biggest problem: MapReduce latency overhead.
  • Slide 11: Work in Progress: Hive Improvements. Stinger Initiative: columnar query engine; ORCFile file format; replace MR with Tez (Apache Incubator).
  • Slide 12: One Solution: MPP Database + Hadoop Connector. [Diagram: an MPP database cluster (an MPP master node with a global query executor, and four MPP worker nodes, each with a local query executor) next to a Hadoop cluster of four HDFS datanodes.]
  • Slide 13: One Solution: MPP Database + Hadoop Connector. [Same diagram, annotated "Pull Data".]
  • Slide 14: One Solution: MPP Database + Hadoop Connector. [Same diagram, annotated "Pull Data" and "IO Bottleneck".]
  • Slide 15: A Better Solution: New Architecture for SQL on Hadoop. [Diagram: the MPP master node (global query executor), four MPP worker nodes (local query executors), and four HDFS datanodes, annotated "Maintain Data Locality" and "Push Work To Data".]
  • Slide 16: New Architecture for SQL on Hadoop. Data locality: block-aware query planner pushes work to data. Real-time query performance: replace MapReduce. Schema-on-read: pluggable storage format handlers. Tight integration with SQL ecosystem tools.
  • Slide 17: Examples of the New Architecture. Google Dremel: interactive ad hoc query system for read-only nested data; powers BigQuery. Apache Drill: open source version of Dremel; implemented in Java; work in progress. Cloudera Impala: heavily influenced by MonetDB/X100; runtime codegen; CPU cache aware; implemented in C++. Citus Data: built on PostgreSQL; powerful cost-based optimizer for disk I/O; handles failures.
  • Slide 18: The New Architecture in Detail: CitusDB. [Diagram components: PostgreSQL tools and ODBC/JDBC clients; the CitusDB master node (metadata, distributed query planner, distributed query executor); three HDFS datanodes, each with a local query planner, local query executor, and foreign data wrappers; Hadoop metadata in the HDFS NameNode.]
  • Slide 19: CitusDB: Metadata Synchronization. [Same diagram, with a "Metadata Sync" step.] Statements shown: CREATE TABLE emp; CREATE FOREIGN TABLE emp_{block_id} …
  • Slide 20: CitusDB: Query Execution. [Same diagram.] Client query: SELECT AVG(sal) FROM emp WHERE job = 'manager';
  • Slide 21: CitusDB: Query Execution. [Same diagram.] Local queries: SELECT SUM(sal), COUNT(sal) FROM emp_{block_id} WHERE job = 'manager';
  • Slide 22: CitusDB: Query Execution. [Same diagram.] Local results: {842176.53, 8}, {1234283.00, 12}, {0.00, 0}, {125500.00, 1}, {523100.00, 3}, {785300.32, 5}
  • Slide 23: CitusDB: Query Execution. [Same diagram.] Result: {121046.58}
  • Slide 24: Why We Chose PostgreSQL. Powerful cost-based optimizer, designed to minimize disk I/O. Extensible, rich type system. Pluggable storage format handlers. Lots of extensions: geospatial, full text search, JSON, etc. Enterprise features: ODBC/JDBC, security, internationalization.
  • Slide 25: Defining the New Generation of Distributed Analytic Databases. SQL → Ease of Use, Increased Productivity. Real-time responsiveness → Faster. Data Locality → Proven Scalability. Schema-on-Read → Flexibility, Lower Cost.
  • Slide 26: Where Are We At? CitusDB SQL on Hadoop is in open beta. Download our binary packages or use our EC2 AMI: http://citusdata.com/docs/sql-on-hadoop
  • Slide 27: We’re Hiring! http://citusdata.com/job
  • Slide 28: Questions?
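
The "CitusDB: Query Execution" slides describe a three-step pattern: a block-aware planner sends one local query per block to a node that already stores that block, each local query returns a (SUM, COUNT) pair instead of rows, and the master merges the pairs into the final AVG. The Python sketch below illustrates that pattern only. It is not CitusDB code: the block names, node names, sample rows, and the helpers `plan`, `local_query`, `merge_avg`, and `run_avg` are all hypothetical, and the least-loaded placement rule is an assumption made to keep the example small.

```python
from collections import Counter

# Hypothetical block placement: block id -> nodes that hold a replica.
# In a real cluster this information would come from the HDFS NameNode.
BLOCK_REPLICAS = {
    "blk_1": ["node_a", "node_b"],
    "blk_2": ["node_b", "node_c"],
    "blk_3": ["node_a", "node_c"],
}

# Hypothetical contents of each block of the emp table: (job, sal) rows.
BLOCK_ROWS = {
    "blk_1": [("manager", 100.0), ("clerk", 40.0)],
    "blk_2": [("manager", 200.0), ("manager", 300.0)],
    "blk_3": [("clerk", 50.0)],
}


def plan(block_replicas):
    """Assign each block's local query to the least-loaded node holding it."""
    load = Counter()
    assignment = {}
    for block_id, nodes in sorted(block_replicas.items()):
        # Only nodes that store the block are candidates (data locality).
        node = min(nodes, key=lambda n: (load[n], n))
        assignment[block_id] = node
        load[node] += 1
    return assignment


def local_query(rows, job):
    """Per-block equivalent of SELECT SUM(sal), COUNT(sal) ... WHERE job = <job>."""
    salaries = [sal for row_job, sal in rows if row_job == job]
    return (sum(salaries), len(salaries))


def merge_avg(partials):
    """Combine (sum, count) pairs into a global average; None if no rows matched."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


def run_avg(job):
    """Plan, run one local query per block, then merge on the master."""
    assignment = plan(BLOCK_REPLICAS)
    # Each local query would run on the node chosen for its block, so only a
    # (sum, count) pair per block has to cross the network.
    partials = [local_query(BLOCK_ROWS[b], job) for b in sorted(assignment)]
    return merge_avg(partials)


# Partial results as listed on the "Local Results" slide.
SLIDE_PARTIALS = [
    (842176.53, 8), (1234283.00, 12), (0.00, 0),
    (125500.00, 1), (523100.00, 3), (785300.32, 5),
]

if __name__ == "__main__":
    print(plan(BLOCK_REPLICAS))
    print(run_avg("manager"))
    print(round(merge_avg(SLIDE_PARTIALS), 2))
```

The rewrite from AVG to SUM and COUNT matters because per-block averages cannot simply be averaged again: blocks contribute different row counts, and a block with no matching rows (the `{0.00, 0}` entry) has no average at all. Merging the six partial results exactly as listed gives roughly 121046.89, slightly different from the 121046.58 shown on the final slide, so the slide figures appear to be illustrative.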