Stinger.Next by Alan Gates of Hortonworks

2. Disclaimer
This document may contain product features and technology directions that are under
development or may be under development in the future.
Technical feasibility, market demand, user feedback, and the Apache Software
Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not
represent a contractual commitment from Hortonworks to deliver these features in any
generally available product.
Product features and technology directions are subject to change, and must not be
included in contracts, purchase orders, or sales agreements of any kind.
Page 2 © Hortonworks Inc. 2014
3. Hadoop Summit EU Call For Abstracts Open
Open until December 5, 2014
Share your Hadoop knowledge and experience with the wider community
Summit is April 15–16, 2015 in Brussels, Belgium
Tracks:
• Committer Track
• Data Science & Hadoop
• Hadoop Governance, Security & Operations
• Hadoop Access Engines
• Applications of Hadoop and the Data Driven Business
• The Future of Apache Hadoop
4. Interactive SQL-in-Hadoop Delivered
Stinger Initiative – DELIVERED
Next-generation SQL-based interactive query in Hadoop
Speed
Hive query performance improved by 100x, enabling interactive query times
(seconds)
Scale
The only SQL interface to Hadoop designed for queries that scale
from TB to PB
SQL
Supports the broadest range of SQL semantics for analytic applications
running against Hadoop
[Diagram: Business Analytics and Custom SQL Apps (using Window Functions) run on Apache Hive; Hive executes on Apache MapReduce or Apache Tez over Apache YARN and HDFS (Hadoop Distributed File System), scaling from 1 to N nodes.]
Apache Hive Contribution… an Open Community at its finest:
1,672 Jira tickets closed · 145 developers · 44 companies · ~390,000 lines of code added (2x)
Stinger Project
Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format
Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN
Stinger Phase 3 (HDP 2.1):
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)
Delivered in 13 months.
[Diagram: HDP platform stack – Data Management, Data Access, Governance & Integration, Security, Operations – with Hive and the ORC File format in the Data Access layer.]
5. Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP, CRM systems · unstructured documents, emails · server logs · clickstream · sentiment, web data · sensor / machine data · geolocation
Hive (SQL) serves: Interactive Analytics · Batch Reports / Deep Analytics · ETL / ELT
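As a sketch of the ETL/ELT use case above, raw server logs might be exposed as an external table and transformed into an ORC-backed table for interactive analytics. The table names, columns, and HDFS path here are hypothetical, not from the slides:

```sql
-- Hypothetical landing table over raw server-log files already in HDFS.
CREATE EXTERNAL TABLE raw_server_logs (
  log_time STRING,
  host     STRING,
  message  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/server_logs';

-- Cleaned, columnar target table for reporting and interactive query.
CREATE TABLE server_logs_orc (
  log_time TIMESTAMP,
  host     STRING,
  message  STRING
)
STORED AS ORC;

-- ELT step: parse and load inside Hive itself.
INSERT INTO TABLE server_logs_orc
SELECT CAST(log_time AS TIMESTAMP), host, message
FROM raw_server_logs;
```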
6. Stinger.next – Delivery Themes
Beyond Read-Only (2nd Half 2014)
• Transactions with ACID allowing insert, update and delete
• Temporary Tables
• Cost Based Optimizer optimizes star and bushy join queries
Sub-Second (1st Half 2015)
• Sub-second queries with LLAP
• Hive–Spark Machine Learning integration
• Operational reporting with Hive Streaming Ingest and Transactions
• Replication and SQL/CBO improvements
Richer Analytics (2nd Half 2015)
• Toward SQL:2011 Analytics
• Materialized Views
• Cross-Geo Queries
• Workload Management via YARN and LLAP integration
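A minimal sketch of the Beyond Read-Only features, assuming a cluster with Hive transactions enabled (hive.txn.manager set to the DbTxnManager); the table name and data are hypothetical:

```sql
-- In this release, ACID tables must be bucketed and stored as ORC.
CREATE TABLE dim_customer (
  id    INT,
  name  STRING,
  state STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level DML with ACID semantics (Hive 0.14).
INSERT INTO TABLE dim_customer VALUES (1, 'Ada', 'CA');
UPDATE dim_customer SET state = 'OR' WHERE id = 1;  -- correction to a dimension row
DELETE FROM dim_customer WHERE id = 1;

-- Temporary tables live only for the duration of the session.
CREATE TEMPORARY TABLE scratch AS
SELECT * FROM dim_customer WHERE state = 'CA';
```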
7. Deep Dive: Cost Based Optimizer
• Phase 1 [Done]
  • CBO introduced
  • CBO does join re-ordering
  • Initial collection of statistics
• Phase 2 [Hive 0.14]
  • Handle queries with more joins
  • Better plans for star and bushy (multi-star) join schemas
  • Opportunistic improvements based on sample queries
  • Better integration of Calcite into Hive infrastructure
  • More statistics with better usability
  • Better predicate handling
• Phase 3 [2015]
  • Move existing simple optimizations into the cost based optimizer
  • Build more complex optimizations into Calcite
[Diagram: SQL → CBO (based on Calcite) → Hive rule-based optimizations → query plan]
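Since the CBO's plans are only as good as its statistics, the settings and ANALYZE commands below sketch how statistics are typically gathered before relying on cost-based plans; these are standard Hive configuration properties, and the table name is taken from the query in the Editor's Notes:

```sql
-- Enable the cost-based optimizer and statistics use.
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;

-- Gather table-level and column-level statistics for the optimizer.
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
```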
8. Performance Improvement – Query 17
Scale = 30 TB, input records ~186 million

CBO | Elapsed Time (sec) | Elapsed Time | Intermediate Data (GB) | Output and Intermediate Records
OFF |             10,683 | ~3 hrs       |                  5,017 |                 135,647,792,123
ON  |              1,284 | ~20 mins     |                    275 |                   8,543,232,360
9. Transaction Use Cases
• Reporting with Analytics (YES)
  • Reporting on data with occasional updates
  • Corrections to the fact tables, evolving dimension tables
  • Low-concurrency updates, low TPS
• Operational Reporting (YES)
  • High-throughput ingest from an operational (OLTP) database
  • Periodic inserts every 5–30 minutes
  • Requires tool support
• Operational (OLTP) Database (NO)
  • Small transactions, each doing single-row inserts
  • High concurrency – hundreds to thousands of connections
[Diagram: analytics with modifications runs directly on Hive; operational reporting replicates from an OLTP store into Hive; high-concurrency OLTP stays outside Hive.]
10. Deep Dive: Transactions
Transaction support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE
• Split into phases:
  • Phase 1: Hive Streaming Ingest (append) [Hive 0.13]
  • Phase 2: INSERT / UPDATE / DELETE support [Hive 0.14]
  • Phase 3: BEGIN / COMMIT / ROLLBACK transactions
[Diagram: 1. Original file – a task reads the latest read-optimized ORCFile. 2. Edits made – the task reads the ORCFile and merges in the delta file holding the edits. 3. Edits merged – the task reads the updated read-optimized ORCFile. The Hive ACID compactor periodically merges the delta files in the background.]
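Compaction of the delta files described above can also be inspected and triggered manually. A sketch, assuming an ACID table named dim_customer (a hypothetical name) and transactions enabled:

```sql
-- Request a minor compaction (merge delta files into larger deltas)
-- or a major one (rewrite base + deltas into a new base file).
ALTER TABLE dim_customer COMPACT 'minor';
ALTER TABLE dim_customer COMPACT 'major';

-- Inspect compactions queued or run by the background compactor,
-- and currently open transactions.
SHOW COMPACTIONS;
SHOW TRANSACTIONS;
```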
11. Sub-Second: Tez with LLAP
• LLAP is a node-resident daemon process
  • Low latency by reducing setup cost
  • Multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins
  • Uses regular Tez tasks for larger shuffles and other operators
• LLAP has an in-memory columnar data cache
  • Low latency by serving data from the in-memory cache instead of going to HDFS
  • Stores data in columnar format for vectorization, irrespective of the underlying file type
• Security enforced across queries and users
• Uses YARN for resource management
LLAP = Live Long And Process
[Diagram: a query fragment runs as a task inside the LLAP process on a node, backed by the LLAP in-memory columnar cache over HDFS.]
12. Deeper Dive: Tez with LLAP engine
LLAP is an optional daemon process running on multiple nodes that provides the following:
• Caching and data reuse across queries, with compressed columnar data in memory (off-heap)
• Multi-threaded execution, including reads with predicate pushdown and hash joins
• High-throughput IO using an async IO elevator with a dedicated thread and core per disk
• Granular column-level security across applications
• Workload management in LLAP, provided by YARN using delegation
[Diagram: a Hive query fans out across nodes; on each node the LLAP process runs read tasks for query fragments against its in-memory columnar cache over HDFS, accelerating Tez tasks.]
13. Deep Dive: Engines
• Tez
  • Phase 1 [Done]
    • Pipelined, vectorized execution
    • Low-latency startup
      – Hold on to sessions
      – Hold on to pre-warmed containers
  • Phase 2 [Champlain]
    • Dynamic partition pruning
    • Improved Tez shuffle
      – Compression / vectorization
• Tez + LLAP for sub-second queries
  • Phase 3 [1H 2015]
    • LLAP processes with:
      • Multi-threaded execution engine
      • In-memory columnar cache
  • Phase 4 [2H 2015]
    • YARN workload management for LLAP
[Diagram: execution model evolution – Map-Reduce writes intermediate results to HDFS between every map/reduce stage; Tez runs an optimized pipeline of tasks with only the initial map tasks reading HDFS; Tez with LLAP runs read tasks inside a resident LLAP process on each node, backed by an in-memory columnar cache over HDFS.]
14. SQL Support

SQL Datatypes               | SQL Semantics
INT/TINYINT/SMALLINT/BIGINT | SELECT, INSERT
FLOAT/DOUBLE                | GROUP BY, ORDER BY, HAVING
BOOLEAN                     | Inner, outer, cross and semi joins
ARRAY, MAP, STRUCT, UNION   | Sub-queries in the FROM clause
STRING                      | ROLLUP and CUBE
BINARY                      | UNION
TIMESTAMP                   | Standard aggregations (sum, avg, etc.)
DECIMAL                     | Custom Java UDFs
DATE                        | Windowing functions (OVER, RANK, etc.)
VARCHAR                     | Advanced UDFs (ngram, XPath, URL)
CHAR                        | Sub-queries for IN/NOT IN, HAVING
Interval Types              | JOINs in the WHERE clause
                            | Common Table Expressions (WITH clause)
                            | INSERT / UPDATE / DELETE
                            | Non-equi joins
                            | Set functions – UNION, EXCEPT, INTERSECT
                            | All sub-queries
                            | Minor syntax differences resolved – rollup, case
Goal: SQL:2011 Analytic Functions
Legend (color-coded in the original slide): Available Now · HDP Champlain · Stinger.next
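To make the semantics column concrete, here is a sketch combining a common table expression, a windowing function, and ROLLUP; it reuses the store_sales, item, and store tables from the query in the Editor's Notes, but the query itself is illustrative, not from the slides:

```sql
-- CTE plus a windowing function: rank items by quantity within each state.
WITH state_sales AS (
  SELECT s_state, i_item_id, SUM(ss_quantity) AS qty
  FROM store_sales
  JOIN item  ON ss_item_sk  = i_item_sk
  JOIN store ON ss_store_sk = s_store_sk
  GROUP BY s_state, i_item_id
)
SELECT s_state,
       i_item_id,
       qty,
       RANK() OVER (PARTITION BY s_state ORDER BY qty DESC) AS rank_in_state
FROM state_sales;

-- ROLLUP adds per-state subtotals and a grand total to the grouped rows.
SELECT s_state, i_item_id, SUM(ss_quantity)
FROM store_sales
JOIN item  ON ss_item_sk  = i_item_sk
JOIN store ON ss_store_sk = s_store_sk
GROUP BY s_state, i_item_id WITH ROLLUP;
```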
Editor's Notes
explain select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,avg(ss_quantity) as store_sales_quantityave
,stddev_samp(ss_quantity) as store_sales_quantitystdev
,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
,count(sr_return_quantity) as store_returns_quantitycount
,avg(sr_return_quantity) as store_returns_quantityave
,stddev_samp(sr_return_quantity) as store_returns_quantitystdev
,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as store_returns_quantitycov
,count(cs_quantity) as catalog_sales_quantitycount
,avg(cs_quantity) as catalog_sales_quantityave
,stddev_samp(cs_quantity) as catalog_sales_quantitystdev
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov
from store_sales
,store_returns
,catalog_sales
,date_dim d1
,date_dim d2
,date_dim d3
,store
,item
where d1.d_quarter_name = '2000Q1'
and d1.d_date_sk = store_sales.ss_sold_date_sk
and ss_sold_date between '2000-01-01' and '2000-03-31'
and item.i_item_sk = store_sales.ss_item_sk
and store.s_store_sk = store_sales.ss_store_sk
and store_sales.ss_customer_sk = store_returns.sr_customer_sk
and store_sales.ss_item_sk = store_returns.sr_item_sk
and store_sales.ss_ticket_number = store_returns.sr_ticket_number
and store_returns.sr_returned_date_sk = d2.d_date_sk
and d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
and sr_returned_date between '2000-01-01' and '2000-09-01'
and store_returns.sr_customer_sk = catalog_sales.cs_bill_customer_sk
and store_returns.sr_item_sk = catalog_sales.cs_item_sk
and catalog_sales.cs_sold_date_sk = d3.d_date_sk
and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3')
and cs_sold_date between '2000-01-01' and '2000-09-30'
group by i_item_id
,i_item_desc
,s_state
order by i_item_id
,i_item_desc
,s_state
limit 100;