Stinger.Next by Alan Gates of Hortonworks


Over the last 13 months the Apache Hive community, which included 145 developers and 44 companies working together through the Stinger initiative, delivered 390,000 lines of code and 1600 resolved JIRA tickets. This is only the beginning. The Hive community has already started the next phase of extending the Speed, Scale, and SQL compliance in Hive. As Hadoop 2.0 with YARN evolves to enable a dizzying array of powerful engines that allow us to interact with ever-growing data in new ways, well-known tools such as SQL need to scale with it. This session will provide a technical illustration of the challenges facing SQL on Hadoop today and what the road ahead looks like as the user community drives more innovation. Stinger.Next is the next multi-phase initiative to evolve Hive as the de facto SQL engine for Hadoop, designed to deliver Speed, Scale and better SQL.

Published in: Technology

  1. 1. Alan F Gates @alanfgates December 2014 Page 1 © Hortonworks Inc. 2014
  2. 2. Disclaimer: This document may contain product features and technology directions that are under development or may be under development in the future. Technical feasibility, market demand, user feedback, and the Apache Software Foundation community development process can all affect timing and final delivery. This document’s description of these features and technology directions does not represent a contractual commitment from Hortonworks to deliver these features in any generally available product. Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
  3. 3. Hadoop Summit EU: Call For Abstracts Open
     Open until December 5, 2014
     Share your Hadoop knowledge and experience with the wider community
     Summit is April 15-16, 2015 in Brussels, Belgium
     Tracks:
     • Committer Track
     • Data Science & Hadoop
     • Hadoop Governance, Security & Operations
     • Hadoop Access Engines
     • Applications of Hadoop and the Data Driven Business
     • The Future of Apache Hadoop
  4. 4. Interactive SQL-IN-Hadoop Delivered
     Stinger Initiative – DELIVERED: Next generation SQL based interactive query in Hadoop
     • Speed: Hive query performance has increased by 100X to allow for interactive query times (seconds)
     • Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
     • SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
     Apache Hive Contribution… an Open Community at its finest (13 Months)
     • 1,672 Jira Tickets Closed
     • 145 Developers
     • 44 Companies
     • ~390,000 Lines Of Code Added… (2x)
     Stinger Project
     Stinger Phase 1:
     • Base Optimizations
     • SQL Types
     • SQL Analytic Functions
     • ORCFile Modern File Format
     Stinger Phase 2: HDP 2.1
     • SQL Types
     • SQL Analytic Functions
     • Advanced Optimizations
     • Performance Boosts via YARN
     Stinger Phase 3
     • Hive on Apache Tez
     • Query Service (always on)
     • Buffer Cache
     • Cost Based Optimizer (Optiq)
     [Diagram labels: Business Analytics, Custom SQL Apps, Window Functions, Apache Hive, Apache MapReduce, Apache Tez, Apache YARN, HDFS (Hadoop Distributed File System) nodes 1 to N, ORC File; Governance & Integration, Security, Operations, Data Access, Data Management]
  5. 5. Hive – Single tool for all SQL use cases
     [Diagram labels, inputs: OLTP, ERP, CRM Systems; Unstructured documents, emails; Server logs; Clickstream; Sentiment, Web Data; Sensor, Machine Data; Geolocation]
     [Diagram labels, Hive - SQL: Interactive Analytics; Batch Reports / Deep Analytics; ETL / ELT]
  6. 6. Stinger.Next - Delivery Themes
     Beyond Read-Only (2nd Half 2014)
     • Transactions with ACID allowing insert, update and delete
     • Temporary Tables
     • Cost Based Optimizer optimizes star and bushy join queries
     Sub-Second (1st Half 2015)
     • Sub-Second queries with LLAP
     • Hive-Spark Machine Learning integration
     • Operational reporting with Hive Streaming Ingest and Transactions
     • Replication and SQL/CBO improvements
     Richer Analytics (2nd Half 2015)
     • Toward SQL:2011 Analytics
     • Materialized Views
     • Cross-Geo Queries
     • Workload Management via YARN and LLAP integration
  7. 7. Deep Dive: Cost Based Optimizer
     • Phase 1
       • CBO Introduced
       • CBO does join re-ordering
       • Initial collection of statistics
     • Phase 2
       • Handle queries with more joins
       • Better plans for star and bushy (multi-star) join schemas
       • Opportunistic improvements based on sample queries
       • Better integration of Calcite into Hive infrastructure
       • More statistics with better usability
       • Better predicate handling
     • Phase 3
       • Move existing simple optimizations into cost based optimizer
       • Build more complex optimization into Calcite
     Timeline labels on slide: [Done], [Hive 0.14], [2015]
     [Diagram labels: SQL, CBO Based on Calcite, Hive Rule Based Optimizations, Query Plan]
  8. 8. Performance Improvement – Query 17
     Scale = 30TB, input records ~186mil
     CBO   Elapsed Time (sec)   Elapsed Time   Intermediate data (GB)   Output and Intermediate Records
     OFF   10,683               ~3 hrs         5,017                    135,647,792,123
     ON    1,284                ~20 mins       275                      8,543,232,360
  9. 9. Transaction Use Cases
     • Reporting with Analytics (YES)
       • Reporting on data with occasional updates
       • Corrections to the fact tables, evolving dimension tables
       • Low concurrency updates, low TPS
     • Operational Reporting (YES)
       • High throughput ingest from operational (OLTP) database
       • Periodic inserts every 5-30 minutes
       • Requires tool support
     • Operational (OLTP) Database (NO)
       • Small Transactions, each doing single line inserts
       • High Concurrency - Hundreds to thousands of connections
     [Diagram labels: Analytics Modifications, Replication, OLTP, High Concurrency OLTP, Hive]
  10. 10. Deep Dive: Transactions
      Transaction Support in Hive with ACID semantics
      • Hive native support for INSERT, UPDATE, DELETE.
      • Split Into Phases:
        • Phase 1: Hive Streaming Ingest (append) [Hive 0.13]
        • Phase 2: INSERT / UPDATE / DELETE Support [Hive 0.14]
        • Phase 3: BEGIN / COMMIT / ROLLBACK Txn
      [Diagram: how tasks read ORCFile data as edits arrive]
      1. Original File: Task reads the latest Read-Optimized ORCFile
      2. Edits Made: Task reads the ORCFile and merges the delta file with the edits
      3. Edits Merged: Task reads the updated (merged) Read-Optimized ORCFile
      Hive ACID Compactor periodically merges the delta files in the background
  11. 11. Sub-Second: Tez with LLAP
      LLAP = Live Long And Process
      • LLAP is a node resident daemon process
      • Low latency by reducing setup cost
      • Multi-threaded engine that runs smaller tasks for query including reads, filter and some joins
      • Use regular Tez tasks for larger shuffle and other operators
      • LLAP has In-memory columnar data cache
      • Low latency by providing data from in-memory cache instead of going to HDFS
      • Store data in columnar format for vectorization irrespective of underlying file type
      • Security enforced across queries and users
      • Uses YARN for resource management
      [Diagram labels: Node, Query Fragment, LLAP Process (LLAP process running a task for a query), LLAP In-Memory columnar cache, HDFS]
  12. 12. Deeper Dive: Tez with LLAP engine
      LLAP is an optional daemon process running on multiple nodes that provides the following:
      • Caching and data reuse across queries with compressed columnar data in-memory (off-heap)
      • Multi-threaded execution including reads with predicate pushdown and hash joins
      • High throughput IO using Async IO Elevator with dedicated thread and core per disk
      • Granular column level security across applications
      • YARN will provide workload management in LLAP by using delegation
      LLAP process runs on multiple nodes, accelerating Tez tasks
      [Diagram labels: Hive Query, Node, Query Fragment, LLAP Process (LLAP process running read task for a query), LLAP In-Memory columnar cache, HDFS]
  13. 13. Deep Dive: Engines
      • Tez
        • Phase 1
          • Pipelined, Vectorized Execution
          • Low latency startup
            – Hold on to sessions
            – Hold on to pre-warmed containers
        • Phase 2
          • Dynamic Partition Pruning
          • Improved Tez Shuffle
            – Compression / Vectorization
      • Tez + LLAP for Sub-Second Queries
        • Phase 3
          • LLAP Processes with:
            • Multi-threaded Execution Engine
            • In-Memory Columnar Cache
        • Phase 4
          • YARN workload management for LLAP
      Timeline labels on slide: [Done], [Champlain], [1H, 2015], [2H, 2015]
      [Diagram: three execution models. Map – Reduce: intermediate results in HDFS. Tez: optimized pipeline. Tez with LLAP: resident process on nodes; Map tasks read HDFS; LLAP process running read task; LLAP In-Memory columnar cache]
  14. 14. SQL Support
      SQL Datatypes:
      • INT/TINYINT/SMALLINT/BIGINT
      • FLOAT/DOUBLE
      • BOOLEAN
      • ARRAY, MAP, STRUCT, UNION
      • STRING
      • BINARY
      • TIMESTAMP
      • DECIMAL
      • DATE
      • VARCHAR
      • CHAR
      • Interval Types
      SQL Semantics:
      • SELECT, INSERT
      • GROUP BY, ORDER BY, HAVING
      • Inner, outer, cross and semi joins
      • Sub-queries in the FROM clause
      • ROLLUP and CUBE
      • UNION
      • Standard aggregations (sum, avg, etc.)
      • Custom Java UDFs
      • Windowing functions (OVER, RANK, etc.)
      • Advanced UDFs (ngram, XPath, URL)
      • Sub-queries for IN/NOT IN, HAVING
      • JOINs in WHERE Clause
      • Common Table Expressions (WITH Clause)
      • INSERT / UPDATE / DELETE
      • Non-equi joins
      • Set functions - Union, Except, Intersect
      • All sub-queries
      • Minor syntax differences resolved – rollup, case
      Goal: SQL 2011 Analytic Functions
      Legend (color coding not preserved): Available Now; HDP Champlain
  15. 15. Questions?
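
Slide 7 says the CBO does join re-ordering. The idea can be illustrated with a toy Python sketch that tries every left-deep join order and keeps the one with the smallest estimated intermediate output. This is not Calcite's or Hive's algorithm: the cost model, the `plan_cost` and `best_join_order` names, and the table names and statistics are all hypothetical, chosen only to show why statistics matter for picking an order.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Estimate cost as the sum of intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    current = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivities of every join predicate linking this table
        # to tables already joined; no predicate means a cross product (1.0).
        sel = 1.0
        for prior in joined:
            sel *= selectivity.get(frozenset((prior, table)), 1.0)
        current = current * rows[table] * sel
        cost += current
        joined.append(table)
    return cost


def best_join_order(rows, selectivity):
    """Exhaustively pick the cheapest order (only viable for a handful of tables)."""
    return min(permutations(rows), key=lambda order: plan_cost(order, rows, selectivity))


# Hypothetical star schema: row counts after filters, plus join selectivities.
rows = {"sales": 1_000_000, "dates": 10, "stores": 100}
selectivity = {
    frozenset(("sales", "dates")): 1 / 1000,
    frozenset(("sales", "stores")): 1 / 100,
}

written = ("sales", "stores", "dates")  # order as a user might write the query
best = best_join_order(rows, selectivity)
print("as written:", plan_cost(written, rows, selectivity))
print("chosen:", best, plan_cost(best, rows, selectivity))
```

A real optimizer would prune the search space and use richer statistics; the exhaustive search here only scales to a few tables.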
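
The Query 17 table on slide 8 gives raw numbers with CBO off and on, but not the ratios between them. As a quick worked check in Python, using only the figures from that table (the dictionary and function names are illustrative, not from the deck):

```python
# Figures copied from slide 8 (Query 17, scale 30 TB).
cbo_off = {"elapsed_sec": 10_683, "intermediate_gb": 5_017, "records": 135_647_792_123}
cbo_on = {"elapsed_sec": 1_284, "intermediate_gb": 275, "records": 8_543_232_360}


def reduction_factors(before, after):
    """Return the before/after ratio for each metric."""
    return {name: before[name] / after[name] for name in before}


factors = reduction_factors(cbo_off, cbo_on)
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x lower with CBO on")

# Cross-check the rounded durations shown on the slide.
print(f"CBO off: {cbo_off['elapsed_sec'] / 3600:.2f} hours")
print(f"CBO on: {cbo_on['elapsed_sec'] / 60:.1f} minutes")
```

On these figures the elapsed time drops by roughly 8x, while intermediate data and record counts drop by roughly 18x and 16x respectively.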
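
Slide 10's read path (a task reads the base ORCFile, merges any delta files with the edits, and the compactor later folds the deltas into a new base) can be illustrated with a toy in-memory model. This is only a sketch of the idea: the row-id keyed dictionaries, the `merge_on_read` and `compact` names, and the use of `None` as a delete marker are assumptions for illustration and do not reflect Hive's actual ORC delta file format.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
# A delta is an ordered list of (row_id, new_row); new_row=None marks a delete.
Delta = List[Tuple[int, Optional[Row]]]


def merge_on_read(base: Dict[int, Row], deltas: List[Delta]) -> Dict[int, Row]:
    """Return the view a reader sees: base rows with every delta applied in order."""
    view = dict(base)
    for delta in deltas:
        for row_id, new_row in delta:
            if new_row is None:
                view.pop(row_id, None)   # delete
            else:
                view[row_id] = new_row   # insert or update
    return view


def compact(base: Dict[int, Row], deltas: List[Delta]) -> Tuple[Dict[int, Row], List[Delta]]:
    """Fold all deltas into a new base, leaving no deltas to merge at read time."""
    return merge_on_read(base, deltas), []


# 1. Original file: readers see the base as-is.
base = {1: {"name": "a"}, 2: {"name": "b"}}
# 2. Edits made: an update plus an insert, then a delete, arrive as deltas.
deltas = [[(2, {"name": "B"}), (3, {"name": "c"})], [(1, None)]]
before_compaction = merge_on_read(base, deltas)
# 3. Edits merged: after compaction the same view comes from the base alone.
new_base, remaining = compact(base, deltas)
print(before_compaction == merge_on_read(new_base, remaining))
```

The point of the design is that readers see the same rows before and after compaction; compaction only removes the per-read merge work.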