© 2014 IBM Corporation
Challenges of Building a First Class SQL-on-Hadoop Engine
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
Agenda
► Why and what is Big SQL 3.0?
• Not a sales pitch, I promise!
► Overview of the challenges
► How we solved (some of) them
• Architecture and interaction with Hadoop
• Query rewrite
• Query optimization
► Future challenges
The Perfect Storm
► Increased business interest in SQL on Hadoop to improve the pace and efficiency of adopting Hadoop
► SQL engines on Hadoop are moving away from MapReduce towards MPP architectures
► SQL users expect the same level of language expressiveness, features, and (to some extent) performance as RDBMSs
► IBM has decades of experience and assets in building SQL engines… why not leverage them?
The Result? Big SQL 3.0
► MapReduce replaced with a modern MPP shared-nothing architecture
► Architected from the ground up for low latency and high throughput
► Same SQL expressiveness as relational RDBMSs, which allows application portability
► Rich enterprise capabilities…
Big SQL 3.0 At a Glance
Application Portability & Integration
Data shared with Hadoop ecosystem
Comprehensive file formats...
How did we do it?
► Big SQL is derived from an existing IBM shared-nothing RDBMS
• A very mature MPP architecture
• Already understands distributed joins and optimization
► Behavior is sufficiently different that it is considered a separate product
• Certain SQL constructs are disabled
• Traditional data warehouse partitioning is unavailable
• New SQL constructs were introduced
► On the surface, porting a shared-nothing RDBMS to a shared-nothing cluster (Hadoop) seems easy, but…
(Diagram: a traditional distributed RDBMS architecture, with data spread across multiple database partitions)
Challenges for a traditional RDBMS on Hadoop
► Data placement
• Traditional databases expect to have full control over data placement
• Data placement plays an important role in performance (e.g., co-located joins)
• Hadoop's randomly scattered data goes against the grain of this
► Reading and writing Hadoop files
• Normally an RDBMS has its own storage format
• The format is highly optimized to minimize the cost of moving data into memory
• Hadoop has a practically unbounded number of storage formats, all with different capabilities
Challenges for a traditional RDBMS on Hadoop
► Query optimization
• Statistics on Hadoop are a relatively new concept
• They are frequently not available
• The database optimizer can use statistics not traditionally available in Hive
• Hive-style partitioning (grouping data into different files/directories) is a new concept
► Resource management
• A database server almost always runs in isolation
• In Hadoop, the nodes must be shared with many other tasks:
– Data nodes
– MapReduce task trackers and tasks
– HBase region servers, etc.
• We needed to learn to play nice with others
Architecture Overview
(Architecture diagram) Management nodes host the Big SQL master node, the Big SQL scheduler, the DDL and UDF fenced-mode processes, and a database service with the Hive metastore and Hive server. Each compute node runs a Big SQL worker node containing Java I/O and native I/O fenced-mode processes and a UDF fenced-mode process, holding temp data, and sharing the machine with the HDFS data node, the MapReduce task tracker, and other services over the HDFS data. (*FMP = fenced-mode process)
Big SQL Scheduler
► The Scheduler is the main RDBMS↔Hadoop service interface
• Interfaces with the Hive metastore for table metadata
• Acts like the MapReduce job tracker for Big SQL:
– Big SQL provides query predicates for the scheduler to perform partition elimination (sketched below)
– Determines splits for each "table" involved in the query
– Schedules splits on available Big SQL nodes (favoring scheduling local to the data)
– Serves work (splits) to the I/O engines
– Coordinates "commits" after INSERTs
► The scheduler allows the database engine to be largely unaware of the Hadoop world
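As a minimal sketch of partition elimination (the sales table and its storekey, amount, and sale_date columns are hypothetical, not from the deck): a Hive-style partitioned table maps each partition value to its own HDFS directory, so a predicate on the partition column lets the scheduler generate splits only for the matching directories.

    -- Hypothetical table: each distinct sale_date value becomes its own
    -- HDFS directory of files under the table's location.
    create hadoop table sales (
      storekey int not null,
      amount   decimal(10,2)
    )
    partitioned by (sale_date date)
    row format delimited fields terminated by '|'
    stored as textfile;

    -- The compiler hands the sale_date predicate to the scheduler, which
    -- consults the Hive metastore and creates splits only for the matching
    -- partition; the other directories are never read.
    select storekey, sum(amount)
    from sales
    where sale_date = '2014-05-01'
    group by storekey;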
I/O Fence Mode Processes
► Native I/O FMP
• The high-speed interface for a limited number of common file formats
► Java I/O FMP
• Handles all other formats via the standard Hadoop/Hive APIs
► Both perform multi-threaded direct I/O on local data
► The database engine had to be taught the capabilities of each storage format
• The projection list is pushed into the I/O format
• Predicates are pushed as close to the data as possible (into the storage format, if possible)
• Predicates that cannot be pushed down are evaluated within the database engine
► The database engine is only aware of which nodes need to read
• The scheduler directs the readers to their portion of the work
Query Compilation
There is a lot involved in SQL compilation:
► Parsing
• Catch syntax errors
• Generate an internal representation of the query
► Semantic checking
• Determine whether the query makes sense
• Incorporate view definitions
• Add logic for constraint checking
► Query optimization
• Modify the query to improve performance (query rewrite)
• Choose the most efficient "access plan"
► Pushdown analysis
• Federation "optimization"
► Threaded code generation
• Generate efficient "executable" code
Query Rewrite
► Why is query re-write important?
• There are many ways to express the same query
• Query generators often produce suboptimal queries and don't permit "hand optimization"
• Complex queries often result in redundancy, especially with views
• For large data volumes, optimal access plans are more crucial, as the penalty for poor planning is greater

Original query:

    select sum(l_extendedprice) / 7.0 avg_yearly
    from tpcd.lineitem, tpcd.part
    where p_partkey = l_partkey
      and p_brand = 'Brand#23'
      and p_container = 'MED BOX'
      and l_quantity < (select 0.2 * avg(l_quantity)
                        from tpcd.lineitem
                        where l_partkey = p_partkey);

Rewritten query:

    select sum(l_extendedprice) / 7.0 as avg_yearly
    from temp (l_quantity, avgquantity, l_extendedprice) as
      (select l_quantity,
              avg(l_quantity) over (partition by l_partkey) as avgquantity,
              l_extendedprice
       from tpcd.lineitem, tpcd.part
       where p_partkey = l_partkey
         and p_brand = 'Brand#23'
         and p_container = 'MED BOX')
    where l_quantity < 0.2 * avgquantity

• Query correlation eliminated
• The lineitem table is accessed only once
• Execution time reduced by half!
Query Rewrite
► Most existing query rewrite rules remain unchanged
• 140+ existing query re-writes are leveraged
• Almost none are impacted by "the Hadoop world"
► There were, however, a few modifications that were required…
Query Rewrite and Indexes
► Column nullability and indexes can help drive query optimization
• Can produce more efficiently decorrelated subqueries and joins
• Used to prove uniqueness of joined rows ("early-out" join)
► Very few Hadoop data sources support the concept of an index
► In the Hive metastore, all columns are implicitly nullable
► Big SQL introduces advisory constraints and nullability indicators (see the sketch below)
• The user can specify whether or not constraints can be "trusted" for query rewrites

    create hadoop table users
    (
      id        int not null primary key,   -- nullability indicator + constraint
      office_id int null,                   -- nullability indicator
      fname     varchar(30) not null,
      lname     varchar(30) not null,
      salary    timestamp(3) null,
      constraint fk_ofc foreign key (office_id)
        references office (office_id)       -- advisory constraint
    )
    row format delimited
    fields terminated by '|'
    stored as textfile;
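A minimal sketch of how a trusted constraint can drive rewrites, assuming DB2-style informational-constraint syntax (the NOT ENFORCED clause and the office table's city column are assumptions, not from the deck):

    -- Declare the key advisory: the engine does not enforce it against the
    -- HDFS data, but the optimizer may trust it during query rewrite.
    create hadoop table office (
      office_id int not null,
      city      varchar(30),
      primary key (office_id) not enforced
    )
    stored as textfile;

    -- With office_id trusted as unique, each users row can match at most
    -- one office row, so the join can stop probing after the first match
    -- (the "early-out" join mentioned above).
    select u.fname, u.lname, o.city
    from users u
    join office o on u.office_id = o.office_id;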
Query Pushdown
► Pushdown moves processing down as close to the data as possible
• Projection pushdown – retrieve only the necessary columns
• Selection pushdown – push search criteria down
► Big SQL understands the capabilities of the readers and storage formats involved
• As much as possible is pushed down
• Residual processing is done in the server
• The optimizer costs queries based upon how much can be pushed down

Optimizer explain fragment (the predicates pushed down for the rewritten query shown earlier):

    3) External Sarg Predicate,
       Comparison Operator:     Equal (=)
       Subquery Input Required: No
       Filter Factor:           0.04
       Predicate Text:
       --------------
       (Q1.P_BRAND = 'Brand#23')

    4) External Sarg Predicate,
       Comparison Operator:     Equal (=)
       Subquery Input Required: No
       Filter Factor:           0.025
       Predicate Text:
       --------------
       (Q1.P_CONTAINER = 'MED BOX')
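A minimal sketch of what pushdown means for a scan, using the deck's own tpcd.part table (the assumption here is only that part has more columns than the two referenced):

    -- Projection pushdown: the reader materializes only p_partkey and
    -- p_brand, even though tpcd.part has several other columns.
    -- Selection pushdown: the p_brand predicate is evaluated in the reader
    -- (or in the storage format itself, where the format supports it), so
    -- non-matching rows never reach the database engine.
    select p_partkey
    from tpcd.part
    where p_brand = 'Brand#23';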
Statistics
► Big SQL utilizes Hive statistics collection, with some extensions (collection is sketched below):
• Additional support for column groups, histograms, and frequent values
• Automatic determination of the partitions that require statistics collection, vs. explicit
• Partitioned tables: added table-level versions of NDV, min, max, null count, and average column length
• The Hive catalogs as well as the database engine catalogs are populated
• We are restructuring the relevant code for submission back to Hive
► Capability for statistics fabrication if no stats are available at compile time

Table statistics:
• Cardinality (count)
• Number of files
• Total file size

Column statistics:
• Minimum value (all types)
• Maximum value (all types)
• Cardinality (non-nulls)
• Distribution (number of distinct values, NDV)
• Number of null values
• Average length of the column value (all types)
• Histogram (number of buckets configurable)
• Frequent values (MFV; number configurable)

Column group statistics are also collected.
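A minimal sketch of collecting these statistics, using Hive-style ANALYZE syntax (the exact Big SQL statement, and in particular the parenthesized column-group notation, are assumptions):

    -- Table-level stats: cardinality, number of files, total file size.
    analyze table tpcd.lineitem compute statistics;

    -- Column stats (min, max, NDV, null count, histogram, frequent values),
    -- plus a column-group pair for better selectivity estimates on
    -- correlated predicates.
    analyze table tpcd.lineitem compute statistics
      for columns l_partkey, l_quantity, (l_partkey, l_quantity);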
Costing Model
► Few extensions required to the Cost Model
► The TBSCAN operator cost model was extended to evaluate the cost of reading from Hadoop
► New elements taken into account: number of files, size of files, number of partitions, number of nodes
► The optimizer now knows in which subset of nodes the data resides → better costing!
(Example access plan: an HSJOIN plan over the TPCH5TB_PARQ tables ORDERS and CUSTOMER, with estimated cardinality and cost annotated on each operator)
New Access Plans
► Data is not hash-partitioned on a particular column (a.k.a. "scatter" partitioned), so a new parallel join strategy was introduced
► We can access a Hadoop table as:
• "Scattered" partitioned: only accesses data local to the node
• Replicated: accesses local and remote data
– The optimizer could also use a broadcast table queue
– The HDFS shared file system provides replication
Parallel Join Strategies
Replicated vs. Broadcast join
All tables are “scatter” partitioned
Join predicate:
STORE.STOREKEY = DAILY_SALES.STOREKEY
► Replicate the smaller table to the partitions of the larger table using either:
• A broadcast table queue, or
• A replicated HDFS scan
► A table queue represents communication between nodes or subagents
(Diagram: JOIN of Store and Daily Sales, with Store delivered via a broadcast TQ or a replicated scan)
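A minimal sketch of the query shape this strategy serves (only the join predicate comes from the slide; the amount column and the aggregation are assumptions):

    -- STORE is small and DAILY_SALES is large, so the optimizer ships
    -- STORE to every node holding DAILY_SALES data (broadcast TQ or
    -- replicated scan) rather than moving the large table.
    select s.storekey, sum(d.amount) as total_sales
    from store s
    join daily_sales d on s.storekey = d.storekey
    group by s.storekey;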
Parallel Join Strategies
Repartitioned join
All tables are “scatter” partitioned
Join predicate:
DAILY_FORECAST.STOREKEY = DAILY_SALES.STOREKEY
► Both tables are large
► It is too expensive to broadcast or replicate either one
► Repartition both tables on the join columns
► Use a directed table queue (DTQ)
(Diagram: JOIN of Daily Forecast and Daily Sales, with each input flowing through its own directed TQ)
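And a matching sketch for the repartitioned case (again, only the join predicate is from the slide; the selected columns are assumptions):

    -- Both inputs are large, so each side is hash-repartitioned on
    -- storekey through a directed TQ; rows with equal keys land on the
    -- same node and are joined there.
    select f.storekey, f.forecast_qty, d.sold_qty
    from daily_forecast f
    join daily_sales d on f.storekey = d.storekey;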
Future Challenges
► The challenges never end!
• That’s what makes this job fun!
• The Hadoop ecosystem continues to expand
• New storage techniques, indexing techniques, etc.
► Here are a few areas we're exploring…
Future Challenges
► Dynamic split allocation
• React to competing workloads
• If one node is slow, hand the work you would have given it to another node
► More pushdown!
• Currently we push projection/selection down
• Should we push more advanced operations? Aggregation? Joins?
► Join co-location
• Perform co-located joins when tables are partitioned on the same join key
► Explicit MapReduce-style parallelism ("SQL MR")
• Expand SQL to explicitly perform partitioned operations
Queries?
(Optimized, of course)
Try Big SQL 3.0 Beta on the cloud!
https://bigsql.imdemocloud.com/
Scott C. Gray sgray@us.ibm.com @ScottCGrayIBM
Adriana Zubiri zubiri@ca.ibm.com @adrianaZubiri
Slide notes
• Query rewrite: rewriting a given SQL query into a semantically equivalent form that may be processed more efficiently.
• Statistics: MFVs and histograms obtain better selectivity estimates for range predicates over data that is non-uniformly distributed.
– Stats are stored in the Hive metastore for the stats Hive currently supports, and in our internal catalog tables for all of them
– Min/max are kept in Hive only for a subset of types
– Average column-value length is kept in Hive only for strings
– Column and table stats are collected together
– Next: automatic stats collection