Overview of the Hive Stinger Initiative

Dr. Eric N. Hanson, Principal Software Development Engineer at Microsoft and Apache Hive committer, presents the recent improvements in Hive.

Published in: Technology

Transcript of "Overview of the Hive Stinger Initiative"

  1. Overview of the Hive Stinger Initiative
     Eric N. Hanson
     Principal Software Development Engineer
     Microsoft HDInsight Team
     30 June 2014
  2. What is Stinger? Umbrella term for:
     • Faster query in Hive
       – ORC
       – Vectorization
       – Tez
     • Better language features for analysis
       – Window functions etc.
  3. Why Stinger?
     • Hive has good functionality
     • But it started out sloooowww
     • Need to speed it up
       – keep it competitive
       – make it fun to use
  4. ORC
     • A good columnstore format
     • Run length encoding, value encoding, dictionary encoding
     • Layers stream compression over the top
     • Written by Owen O’Malley
     • http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
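
The encodings named on the ORC slide can be illustrated with a minimal sketch. This shows textbook run length and dictionary encoding only; it is not ORC's actual on-disk format, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: textbook encodings, not ORC's real file layout.
class ColumnEncodingSketch {

    // Run length encode a long column as {value, runLength} pairs.
    static List<long[]> runLengthEncode(long[] column) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            // Advance j to the end of the run that starts at i.
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new long[] {column[i], j - i});
            i = j;
        }
        return runs;
    }

    // Dictionary encode a string column: each distinct value gets a small
    // integer id, assigned in order of first appearance.
    static int[] dictionaryEncode(String[] column, Map<String, Integer> dictionary) {
        int[] ids = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer id = dictionary.get(column[i]);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(column[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        for (long[] run : runLengthEncode(new long[] {7, 7, 7, 7, 3, 3, 9})) {
            System.out.println(run[0] + " x " + run[1]);
        }
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        int[] ids = dictionaryEncode(new String[] {"CA", "WA", "CA", "CA"}, dictionary);
        System.out.println(dictionary + " " + Arrays.toString(ids));
    }
}
```

Both encodings pay off on columns with long runs or few distinct values, which is why a columnstore applies them per column before layering general stream compression on top.
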
  5. Using ORC
     • create table Tbl (col int) stored as orc;
     • orc.compress defaults to ZLIB
     • See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
  6. TPC-DS File Sizes (chart)
     *Courtesy of Hortonworks
  7. Vectorization
  8. How the code works (simplified)

     class LongColumnAddLongScalarExpression {
       int inputColumn;
       int outputColumn;
       long scalar;

       void evaluate(VectorizedRowBatch batch) {
         long[] inVector = ((LongColumnVector) batch.columns[inputColumn]).vector;
         long[] outVector = ((LongColumnVector) batch.columns[outputColumn]).vector;
         if (batch.selectedInUse) {
           // Only the rows listed in batch.selected are still live.
           for (int j = 0; j < batch.size; j++) {
             int i = batch.selected[j];
             outVector[i] = inVector[i] + scalar;
           }
         } else {
           // Every row in the batch is live.
           for (int i = 0; i < batch.size; i++) {
             outVector[i] = inVector[i] + scalar;
           }
         }
       }
     }

     • No method calls
     • Low instruction count
     • Cache locality to 1024 values
     • No pipeline stalls
     • SIMD in Java 8
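
To make the slide's loop runnable without a Hive dependency, here is a minimal sketch with simplified stand-ins for the batch types. `SketchRowBatch`, `SketchLongColumn` and `SketchLongColAddScalar` are hypothetical names that mirror only the fields the slide uses; they are not Hive's real classes.

```java
// Simplified stand-in for a column of longs; not Hive's LongColumnVector.
class SketchLongColumn {
    final long[] vector;

    SketchLongColumn(int capacity) {
        vector = new long[capacity];
    }
}

// Simplified stand-in for a row batch; not Hive's VectorizedRowBatch.
class SketchRowBatch {
    static final int DEFAULT_SIZE = 1024; // batch size quoted on the slide

    final SketchLongColumn[] columns;
    final int[] selected = new int[DEFAULT_SIZE]; // indexes of live rows
    boolean selectedInUse; // true when only rows listed in 'selected' are live
    int size;              // number of live rows

    SketchRowBatch(int numColumns) {
        columns = new SketchLongColumn[numColumns];
        for (int c = 0; c < numColumns; c++) {
            columns[c] = new SketchLongColumn(DEFAULT_SIZE);
        }
    }
}

// Adds a scalar to every live row of one column, writing to another column.
class SketchLongColAddScalar {
    final int inputColumn;
    final int outputColumn;
    final long scalar;

    SketchLongColAddScalar(int inputColumn, int outputColumn, long scalar) {
        this.inputColumn = inputColumn;
        this.outputColumn = outputColumn;
        this.scalar = scalar;
    }

    void evaluate(SketchRowBatch batch) {
        long[] in = batch.columns[inputColumn].vector;
        long[] out = batch.columns[outputColumn].vector;
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        SketchRowBatch batch = new SketchRowBatch(2);
        for (int i = 0; i < 4; i++) {
            batch.columns[0].vector[i] = i + 1;
        }
        batch.size = 4;
        new SketchLongColAddScalar(0, 1, 10).evaluate(batch);
        for (int i = 0; i < batch.size; i++) {
            System.out.println(batch.columns[1].vector[i]);
        }
    }
}
```

The `selected` array lets an upstream filter mark surviving rows without copying column data: a filter shrinks `size` and fills `selected`, and later expressions touch only those positions.
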
  9. Vectorization and Compilation
     • Vectorization “instructions” generated from templates
     • Examples:
       – Int add col-col
       – Int add col-scalar
       – Int add scalar-col
       – Double add col-col
       – Double add col-scalar
       – Double add scalar-col
       – And hundreds more!
     • Pre-compilation of expressions
     • Reduces the number of function calls and instructions at runtime
     • Expressions like (a + 2) / b are interpreted with these primitives
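
As a rough sketch of how an expression such as (a + 2) / b could be evaluated from per-operator primitives, the example below chains a col-scalar add into a col-col divide through a scratch column. The method names and the use of plain `double[]` arrays are assumptions for illustration, not Hive's API.

```java
import java.util.Arrays;

// Illustrative composition of two vectorized primitives; names are hypothetical.
class ExpressionCompositionSketch {

    // Primitive: out[i] = in[i] + scalar for the first n rows.
    static void addColScalar(double[] in, double scalar, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = in[i] + scalar;
        }
    }

    // Primitive: out[i] = left[i] / right[i] for the first n rows.
    static void divideColCol(double[] left, double[] right, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = left[i] / right[i];
        }
    }

    // (a + 2) / b: each primitive runs once per batch, not once per row.
    static double[] evaluateAPlus2OverB(double[] a, double[] b, int n) {
        double[] scratch = new double[n]; // holds the intermediate a + 2
        double[] result = new double[n];
        addColScalar(a, 2.0, scratch, n);
        divideColCol(scratch, b, result, n);
        return result;
    }

    public static void main(String[] args) {
        double[] a = {4.0, 10.0, 0.0};
        double[] b = {2.0, 4.0, 1.0};
        System.out.println(Arrays.toString(evaluateAPlus2OverB(a, b, 3)));
    }
}
```

The expression tree is walked once per batch, so the interpretation overhead is spread across up to 1024 rows instead of being paid for every row.
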
  10. Example of vectorized template code

      } else {
        if (batch.selectedInUse) {
          for (int j = 0; j != n; j++) {
            int i = sel[j];
            outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
          }
        } else {
          for (int i = 0; i != n; i++) {
            outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
          }
        }
      }
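
One way to picture the template step is plain text substitution: a generator reads a template and replaces placeholders to emit one class per operator. The sketch below assumes this approach; `<OperatorSymbol>` comes from the slide, while `<ClassName>`, the generated class names and the generator itself are hypothetical and not Hive's actual build tooling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative template expansion by string substitution; not Hive's generator.
class TemplateExpansionSketch {

    // Simplified template: only the dense (no selection vector) loop is shown.
    static final String TEMPLATE =
        "class <ClassName> {\n"
      + "  static void evaluate(long[] vector1, long[] vector2, long[] outputVector, int n) {\n"
      + "    for (int i = 0; i != n; i++) {\n"
      + "      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
      + "    }\n"
      + "  }\n"
      + "}\n";

    // Produce Java source for one operator by filling in both placeholders.
    static String expand(String className, String operatorSymbol) {
        return TEMPLATE
            .replace("<ClassName>", className)
            .replace("<OperatorSymbol>", operatorSymbol);
    }

    public static void main(String[] args) {
        // Hypothetical class names, one per arithmetic operator.
        Map<String, String> operators = new LinkedHashMap<>();
        operators.put("LongColAddLongColumn", "+");
        operators.put("LongColSubtractLongColumn", "-");
        operators.put("LongColMultiplyLongColumn", "*");
        for (Map.Entry<String, String> e : operators.entrySet()) {
            System.out.println(expand(e.getKey(), e.getValue()));
        }
    }
}
```

Generating the classes ahead of time keeps each inner loop free of per-row dispatch on the operator, at the cost of many near-identical source files.
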
  11. Using vectorization in Hive
      • set hive.vectorized.execution.enabled = true;
      • Run query over ORC
      • Only works for scalar types
      • https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
      • ~5X CPU reduction
  12. Apache Tez (“Speed”)
      • Replaces MapReduce as the primitive for Pig, Hive, Cascading etc.
        – Lower latency for interactive queries
        – Higher throughput for batch queries
        – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
      • YARN ApplicationMaster to run DAG of Tez Tasks
      • Task with pluggable Input, Processor and Output
      • Tez Task - <Input, Processor, Output>
      *Courtesy of Hortonworks
  13. Tez: Building blocks for scalable data processing
      • Classical ‘Map’: Map Processor with HDFS Input and Sorted Output
      • Classical ‘Reduce’: Reduce Processor with Shuffle Input and HDFS Output
      • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Reduce Processor with Shuffle Input and Sorted Output
      *Courtesy of Hortonworks
  14. Hive-on-MR vs. Hive-on-Tez

      SELECT a.x, AVERAGE(b.y) AS avg
      FROM a JOIN b ON (a.id = b.id) GROUP BY a
      UNION SELECT x, AVERAGE(y) AS AVG
      FROM c GROUP BY x
      ORDER BY AVG;

      [Diagram: the same kind of query plan drawn twice, as a chain of MapReduce jobs with intermediate HDFS writes (Hive – MR) and as a single graph of map and reduce stages (Hive – Tez)]
      • Tez avoids unneeded writes to HDFS
      *Courtesy of Hortonworks
  15. Tez Sessions
      … because Map/Reduce query startup is expensive
      • Tez Sessions
        – Hot containers ready for immediate use
        – Removes task and job launch overhead (~5s – 30s)
      • Hive
        – Session launch/shutdown in background (seamless, user not aware)
        – Submits query plan directly to Tez Session
      • Native Hadoop service, not ad-hoc
      *Courtesy of Hortonworks
  16. Stinger Phase 3: Interactive Query In Hadoop
      • TPC-DS Query 27 (Pricing Analytics using Star Schema Join): 1400s on Hive 10, 39s on Hive 0.11 (Phase 1), 7.2s on Trunk (Phase 3); 190x improvement
      • TPC-DS Query 82 (Inventory Analytics Joining 2 Large Fact Tables): 3200s on Hive 10, 65s on Hive 0.11 (Phase 1), 14.9s on Trunk (Phase 3); 200x improvement
      • All results at Scale Factor 200 (approximately 200GB data)
      *Courtesy of Hortonworks
  17. How you can use Stinger enhancements
      • Use Hive 13
      • Use ORC: create table … stored as ORC
      • Enable vectorization: set hive.vectorized.execution.enabled=true
      • Enable Tez: set hive.execution.engine=tez
      • See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
  18. Reference(s)
      • Stinger overview, Strata, fall 2013: http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1
      • Slides marked “Courtesy of Hortonworks” are from Hortonworks talks