© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Hive Present and Future
Yifeng Jiang
Solutions Engineer, Hortonworks, Inc.
July 23, 2015
About Me
蒋 燚峰 (Yifeng Jiang)
• Solutions Engineer, Hortonworks, Inc.
• HBase book author
• Hobbies: hiking, watching movies
Agenda
• Apache Hive Present
• How Hive Achieved 100x Performance
• Sub-second Response
Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 430+ customers (as of March 31, 2015)
• 105 customers added in Q1 2015
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for
resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focused on enterprise needs
• Unrivaled world-class support
• Founded in 2011
• Original 24 architects, developers, and operators of Hadoop from Yahoo!
• 600+ Employees
• 1100+ Ecosystem Partners
Apache Project   Committers   PMC Members
Hadoop           27           21
Pig              5            5
Hive             18           6
Tez              16           15
HBase            6            4
Phoenix          4            4
Accumulo         2            2
Storm            3            2
Slider           11           11
Falcon           5            3
Flume            1            1
Sqoop            1            1
Ambari           36           28
Oozie            3            2
Zookeeper        2            1
Knox             13           3
Ranger           11           n/a
TOTAL            164          109
Hortonworks Data Platform (HDP) 2.2 Stack
Hive: SQL on Hadoop
Apache Hive Present
Transaction, Security, Performance
Apache Hive: SQL on Hadoop
• OSS data warehouse built on top of Hadoop
• First Apache Hive released in 2009
• Initial goal was to write MapReduce jobs in SQL
– Most queries ran from minutes to hours
– Primarily used for batch processing
Hive – Single tool for all SQL use cases
Data sources:
• OLTP, ERP, CRM systems
• Unstructured documents, emails
• Clickstream
• Server logs
• Sentiment, web data
• Sensor, machine data
• Geolocation
Hive SQL workloads:
• Interactive analytics
• Batch reports / deep analytics
• ETL / ELT
Hive Scales to Any Workload
Hive at Facebook
• 100+ PB of data under management
• 15+ TB of data loaded daily
• 60,000+ Hive queries per day
• More than 1,000 users per day
Transactions
Insert, Update and Delete SQL Statements
Transaction Use Cases
Reporting with Analytics (YES)
Reporting on data with occasional updates
Corrections to the fact tables, evolving dimension tables
Low concurrency updates, low TPS
Operational (OLTP) Database (NO)
Small Transactions, each doing single line inserts
High Concurrency - Hundreds to thousands of connections
[Diagram: an OLTP system replicates into Hive, which serves analytics and modifications (supported); Hive used directly as a high-concurrency OLTP store (not supported)]
Deep Dive: Transaction
Transaction Support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE.
• Split into phases:
– Phase 1: Hive Streaming Ingest (append) [Done]
– Phase 2: INSERT / UPDATE / DELETE Support [Done]
– Phase 3: BEGIN / COMMIT / ROLLBACK Txn [Next]
1. Original File: task reads the latest read-optimized ORCFile.
2. Edits Made: task reads the ORCFile and merges the delta file with the edits.
3. Edits Merged: task reads the updated (merged) read-optimized ORCFile.
Hive ACID Compactor periodically merges the delta files in the background.
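The three read states above can be sketched in a few lines of Python. This is only an illustration of merge-on-read and compaction: the row-id keyed dictionaries, the `read_merged` and `compact` names, and the sample rows are assumptions made for the example, not Hive's on-disk format or API.

```python
# Illustrative model only: base file and delta files as Python structures.
# read_merged/compact and the (op, row_id, value) layout are hypothetical.

def read_merged(base, deltas):
    """Merge-on-read: apply delta edits, oldest first, over the base rows."""
    rows = dict(base)                      # row_id -> row value
    for delta in deltas:
        for op, row_id, value in delta:
            if op == "delete":
                rows.pop(row_id, None)
            else:                          # "insert" or "update"
                rows[row_id] = value
    return rows

def compact(base, deltas):
    """Compaction: fold every delta into a new base, leaving no deltas."""
    return read_merged(base, deltas), []

base = {1: "alice", 2: "bob"}              # 1. original file
deltas = [                                 # 2. edits made
    [("update", 2, "robert")],
    [("insert", 3, "carol"), ("delete", 1, None)],
]

before = read_merged(base, deltas)         # reader merges base + deltas
new_base, new_deltas = compact(base, deltas)   # 3. edits merged
after = read_merged(new_base, new_deltas)  # reader sees only the new base
```

Readers get the same rows before and after compaction; the difference is that after compaction no merge work is left for the read path.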
Hive Compaction
Minor / Major compaction
• Minor compaction (10% local): merges multiple delta files into one delta file
• Major compaction (10% global): merges delta files into the read-optimized ORCFile
Security
Hive User’s perspective
Ranger: Central Security Administration
Apache Ranger
• Security dashboard
• Centralizes administration of security policy
• Ensures consistent coverage across the entire Hadoop stack
Setup Authorization Policy (Hive)
• File-level access control, flexible definition
• Control permissions
How Hive Achieved 100x Performance
ORC, Tez, CBO, Vectorization
Need for Speed: The Stinger Initiative
Stinger: An Open Roadmap to improve Apache Hive’s performance 100x.
Launched: February 2013; Delivered: April 2014.
Delivered in 100% Apache Open Source.
SQL Engine (Vectorized SQL Engine) + Columnar Storage (ORCFile) + Distributed Execution (Apache Tez) = 100X
TPC-DS Benchmark at 30 Terabyte Scale
Sample of 50 queries from TPC-DS at 30 terabyte scale.
Average 52x Query Speedup, Maximum 160x Query Speedup.
Total benchmark time decreased from 7.8 days to 9.3 hours.
Cost-Based Optimizer added in Hive 0.14 gave an additional 2.5x speedup.
ORC File Format
Columnar Storage for Hive
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types
–Uses type-specific encoders
–Stores statistics (min, max, sum, count)
• Has light-weight index
–Skip over blocks of rows that don’t matter
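How stored min/max statistics let a reader skip blocks of rows can be shown with a small Python model. It is a sketch of the idea only: `build_blocks`, `scan_equal`, the block size of 4, and the sample zip codes are all assumptions for illustration, not the ORC reader API or its real index granularity.

```python
# Hypothetical in-memory model of per-block min/max statistics.
# This is not the ORC reader; it only demonstrates the skipping idea.

def build_blocks(values, block_size):
    """Split one column into blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks

def scan_equal(blocks, target):
    """Return matching values and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block["min"] or target > block["max"]:
            continue                       # statistics rule out a match: skip
        blocks_read += 1
        matches.extend(v for v in block["rows"] if v == target)
    return matches, blocks_read

zip_codes = list(range(10000, 10012))      # sorted data skips best
blocks = build_blocks(zip_codes, block_size=4)
matches, blocks_read = scan_equal(blocks, 10009)
```

With three blocks in this sample, only the block whose range covers the target is read; the other two are skipped from their statistics alone.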
ORCFile – Columnar Storage for Hive
Large block size ideal for map/reduce.
Columnar format enables high compression and high performance.
ORCFile – Create Table
• Defined at table or partition level
• Configurable compression codec
create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="ZLIB");
ORCFile – Convert Text to ORC
• Always use ORC
• One SQL statement to convert text to ORC
-- Create Text & ORC tables (the text table is comma-delimited to match the CSV file)
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;
Tez Engine
Beyond MapReduce
MR vs. Tez Example
Hive - MR: four separate jobs, with an I/O synchronization barrier between jobs
• Job 1 (Join a & b)
• Job 2 (Group by of a Join b)
• Job 3 (Group by of c)
• Job 4 (Join of S & R)
Hive - Tez: a single job
• Join a & b
• Group by of a Join b
• Group by of c
• Join of S & R
Tez – Introduction
• Distributed execution framework for data-processing applications
– Targets applications (frameworks), not end users
– Hive on Tez, Pig on Tez, Cascading on Tez, …
• Lessons learned from MapReduce
– Significant performance improvement
– Batch, interactive
– Petabyte scale
• Runs on YARN
– Utilizes cluster resources
Tez – Switch from MapReduce
• One command to switch from MapReduce to Tez
set hive.execution.engine=tez;
SELECT * FROM my_table;
• Set Tez as default engine on Hadoop 2
$ vi hive-site.xml
hive.execution.engine=tez
Cost Based Optimizer
Making the SQL smarter
Cost Based Optimizer in Hive
Cost-Based Optimizer (CBO) creates an optimized execution plan using Hive table statistics
Why cost-based optimization?
• Simple to use – e.g., adjusts join order automatically
• Reduces the need for SQL tuning
• An optimized plan leads to better cluster utilization
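A toy version of statistics-driven join ordering, written in Python, may make the idea concrete. Everything here is an assumption for illustration: the table names, row counts, selectivities, and the cost formula (sum of estimated intermediate row counts over a left-deep plan) are invented, and Hive's actual optimizer uses a much richer cost model.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical table statistics (in Hive these would come from ANALYZE TABLE).
row_counts = {"orders": 1_000_000, "customers": 10_000, "regions": 10}

# Assumed fraction of the cross product that survives each pairwise join.
selectivity = {
    frozenset(["orders", "customers"]): Fraction(1, 10_000),
    frozenset(["customers", "regions"]): Fraction(1, 100),
    frozenset(["orders", "regions"]): Fraction(1),   # no join key between them
}

def plan_cost(order):
    """Sum of estimated intermediate row counts for a left-deep join order."""
    joined = [order[0]]
    rows = Fraction(row_counts[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        sel = Fraction(1)
        for left in joined:                # apply every applicable predicate
            sel *= selectivity[frozenset([left, table])]
        rows = rows * row_counts[table] * sel
        cost += rows
        joined.append(table)
    return cost

# Enumerate all join orders and keep the cheapest one.
best = min(permutations(row_counts), key=plan_cost)
worst = max(permutations(row_counts), key=plan_cost)
```

In this sample the cheapest order joins the two small tables first and brings in the large `orders` table last, which is the kind of reordering a user would otherwise have to do by hand.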
Performance Improvement – Query 17
Scale = 30TB, input records ~186M
CBO   Elapsed Time (sec)   Elapsed Time   Intermediate Data (GB)   Output and Intermediate Records
OFF   10,683               ~3 hrs         5,017                    135,647,792,123
ON    1,284                ~20 mins       275                      8,543,232,360
CBO – Enable CBO
• Enable CBO before submitting a query
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
• Refresh statistics
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
Vectorized Query Execution
Process 1024 Rows at a Time
Vectorization – Vectorized SQL Engine
• Feature:
–Process a block of 1024 rows instead of one row at a time
–Leverage modern hardware architecture
• Benefit:
–Up to 3x faster for large queries
–Reduces CPU time and makes better use of cluster resources
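The difference between row-at-a-time and batch processing can be sketched in Python. This is a conceptual model only: the function names and sample data are assumptions, and plain Python lists stand in for column vectors, so it shows the change in control flow (one operator call per 1024-row batch) rather than the real CPU-level gains.

```python
BATCH_SIZE = 1024  # batch size stated on the slide

def sum_row_at_a_time(prices, threshold):
    """Baseline: the filter/aggregate logic is entered once per row."""
    total, calls = 0, 0
    for price in prices:
        calls += 1
        if price > threshold:
            total += price
    return total, calls

def sum_vectorized(prices, threshold):
    """Batch version: one call handles a whole slice of up to 1024 values."""
    total, calls = 0, 0
    for start in range(0, len(prices), BATCH_SIZE):
        batch = prices[start:start + BATCH_SIZE]
        calls += 1
        total += sum(p for p in batch if p > threshold)   # tight inner loop
    return total, calls

prices = list(range(5000))                 # one column of sample values
row_total, row_calls = sum_row_at_a_time(prices, 2500)
vec_total, vec_calls = sum_vectorized(prices, 2500)
```

Both versions return the same total; the batch version enters the operator 5 times for 5,000 rows instead of 5,000 times.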
Vectorization – Enable Vectorization
• Enable vectorized SQL engine
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
• Supports ORC only
• A few data types and features are not supported
Hive on Tez: Conclusion
Hive on Tez delivers fast batch and interactive SQL today. But users need more speed!
• Scale: Proven at petabyte scale.
• SQL: The most comprehensive open-source SQL on Hadoop.
• Speed: More than 90 Hortonworks customers use Hive-on-Tez today for fast SQL.
Hortonworks Customer Support metrics as of Feb/2015
Sub-second Query Response
Solving Hive’s Top Performance Challenges
Next Stop: Stinger.next and Sub-Second SQL
The emergence of LLAP and Hive-on-Spark brings sub-second response within reach.
Apache Hive: Modern Architecture
Storage
• Columnar Storage: ORCFile, Parquet
• Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog
Engine
• SQL Engines: Row Engine, Vector Engine
SQL
• SQL Support: SQL:2011, Optimizer, HCatalog, HiveServer2
Cache
• Block Cache: Linux Cache
Distributed Execution
• Hadoop 1: MapReduce
• Hadoop 2: Tez
Legend: Historical, Current, In Development
Apache Hive: Modern Architecture
Storage
• Columnar Storage: ORCFile, Parquet
• Unstructured Data: JSON, CSV, Text, Avro, Custom Weblog
Engine
• SQL Engines: Row Engine, Vector Engine
SQL
• SQL Support: SQL:2011, Optimizer, HCatalog, HiveServer2
Cache
• Block Cache: Linux Cache
Distributed Execution
• Hadoop 1: MapReduce
• Hadoop 2: Tez
Added in this view: Vector Cache, LLAP, Persistent Server
Legend: Historical, Current, In Development
HBase Metastore: Why?
700+ metastore queries to create an execution plan!
LLAP: What
LLAP = Live Long And Process
• LLAP process runs on multiple nodes, accelerating Tez tasks
• Each LLAP process runs a read task (query fragment) for a query
• Each LLAP process holds an in-memory columnar cache in front of HDFS
[Diagram: a Hive query is distributed to LLAP processes running on multiple nodes]
LLAP: Why?
• LLAP is a node-resident daemon process
– Low latency by reducing setup cost
• LLAP has an in-memory columnar data cache
– Hot data sits in memory, not HDFS
– Stores data in columnar format for vectorized processing
• Uses YARN for resource management
– Utilizes cluster resources
[Diagram: a node running an LLAP process; a query fragment runs as a task inside the process and reads through the LLAP in-memory columnar cache, backed by HDFS]
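The two benefits listed above (no per-query setup, hot data served from memory) can be modeled with a short Python sketch. The `LlapDaemonSketch` class, its methods, and the dictionary standing in for HDFS are all hypothetical names invented for this example; LLAP was still in development at the time of this talk and its real interfaces are not shown here.

```python
class LlapDaemonSketch:
    """Hypothetical stand-in for a long-lived daemon with a column cache."""

    def __init__(self, storage):
        self.storage = storage        # stands in for HDFS: (table, column) -> values
        self.cache = {}               # in-memory columnar cache
        self.storage_reads = 0        # how often we had to go to storage

    def read_column(self, table, column):
        key = (table, column)
        if key not in self.cache:     # cold: fetch from storage once
            self.storage_reads += 1
            self.cache[key] = list(self.storage[key])
        return self.cache[key]        # hot: served from memory

    def run_fragment(self, table, column, predicate):
        """Run a small query fragment (scan + filter) inside the daemon."""
        return [v for v in self.read_column(table, column) if predicate(v)]

hdfs = {("visits", "store_id"): [1, 2, 2, 3, 2]}
daemon = LlapDaemonSketch(hdfs)       # started once, reused across queries
first = daemon.run_fragment("visits", "store_id", lambda v: v == 2)
second = daemon.run_fragment("visits", "store_id", lambda v: v > 1)
```

The second fragment reuses both the running daemon and the cached column, so storage is touched only once across the two queries.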
Hive Sub-second Response
Hive Metadata (Fast, Scalable Metadata Catalog) + Persistent Server (LLAP) + SQL Engine (Vectorized Hash Join) + Choice of Execution Engines (Tez) = Sub-Second
Key Takeaways
Hive Present and Future
Hive Present and Future
• Hive is the de facto standard for SQL on Hadoop
– One tool for batch and interactive processing
– One tool for all big data SQL use cases: ETL, reporting, BI and analytics
• Hive keeps evolving
– SQL:2011 Analytics support
– Enhanced transactions
– Sub-second query response
Try Hive Today
• Try the latest Hive features today
– Hive on Tez
– ORC file format
– CBO
– Vectorization
• Just a few lines of configuration/SQL change
• Stay tuned for Hive evolution
Thank you
Yifeng Jiang, Solutions Engineer, Hortonworks
@uprush


Editor's Notes

  • #3 Hay fever...
  • #4 Raise your hand if you have heard of it
  • #5 Hortonworks has a singular focus: enabling Apache Hadoop as an enterprise data platform for any app and any data type. We were founded in 2011 by 24 developers from Yahoo, where Hadoop was conceived to address data challenges at internet scale. What we now know as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search. Their challenge was essentially two-fold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly, traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical. The result was the Apache Hadoop project, which delivered large-scale storage (HDFS) and processing (MapReduce). Today we are over 600 employees and have partnered with over 1000 companies who are the leaders in the data center. We have also been very fortunate to achieve very significant customer adoption, with over 330 customers as of the end of 2014, spanning nearly every vertical. Hortonworks was founded with the sole intent to make Hadoop an enterprise data platform. With YARN as its foundation, HDP delivers a centralized architecture with true multi-tenancy for data processing and shared services for Security, Governance and Operations to satisfy enterprise requirements, all deeply integrated and certified with leading datacenter technologies. We are uniquely focused on this transformation of Hadoop and do our work completely in open source. This is all predicated on our leadership in the community, which enables us not only to best support users but also to represent customer requirements within this open, thriving community.
  • #30 With my Japanese ability...
  • #35 Vectorized query execution improves the performance of operations like scans, aggregations, filters and joins by performing them in batches of 1024 rows at once instead of a single row at a time.
  • #41 LLAP: Persistent servers cache vectors and start queries instantly. Pluggable integrations with Tez. Vectorized Hash Join solves CPU boundedness for Hive on Tez. Improved metadata catalog allows instant query planning and optimization for any engine.
  • #44 LLAP is a node-resident daemon process: low latency by reducing setup cost; a multi-threaded engine that runs smaller tasks for a query, including reads, filters and some joins; regular Tez tasks are used for larger shuffles and other operators. LLAP has an in-memory columnar data cache: high-throughput IO using an Async IO Elevator with a dedicated thread and core per disk; low latency by providing data from the in-memory (off-heap) cache instead of going to HDFS; data stored in columnar format for vectorization irrespective of the underlying file type; security enforced across queries and users. Uses YARN for resource management.