1. Overview of the Hive Stinger Initiative
Eric N. Hanson
Principal Software Development Engineer
Microsoft HDInsight Team
30 June 2014
2. What is Stinger? Umbrella term for…
• Faster queries in Hive
–ORC
–Vectorization
–Tez
• Better language features for analysis
–Window functions etc.
3. Why Stinger?
• Hive has good functionality
• But it started out sloooowww
• Need to speed it up to
–keep it competitive
–make it fun to use
4. ORC
• A good columnstore format
• Run length encoding, value encoding, dictionary encoding
• Layers stream compression over the top
• Written by Owen O’Malley
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
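To make the run-length encoding bullet concrete, here is a minimal Java sketch of the idea on a column of longs. It is a toy illustration only: the class and method names (RunLengthSketch, Run, encode, decode) are invented for this example, and ORC's real encoders and on-disk layout are more elaborate.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy run-length encoder, NOT ORC's actual format.
class RunLengthSketch {
    // One run: `count` consecutive copies of `value`.
    static final class Run {
        final long value;
        final int count;
        Run(long value, int count) {
            this.value = value;
            this.count = count;
        }
    }

    // Collapse each stretch of equal adjacent values into a single Run.
    static List<Run> encode(long[] column) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new Run(column[i], j - i));
            i = j;
        }
        return runs;
    }

    // Expand the runs back into the original column.
    static long[] decode(List<Run> runs) {
        int n = 0;
        for (Run r : runs) {
            n += r.count;
        }
        long[] out = new long[n];
        int pos = 0;
        for (Run r : runs) {
            for (int k = 0; k < r.count; k++) {
                out[pos++] = r.value;
            }
        }
        return out;
    }
}
```

Columns with long stretches of repeated values (sorted keys, flags, dates) shrink the most, which is one reason a column-oriented layout compresses well.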
5. Using ORC
• create table Tbl (col int) stored as orc;
• orc.compress defaults to ZLIB
• See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
8. How the code works (simplified)
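import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

class LongColumnAddLongScalarExpression {
  int inputColumn;   // index of the input column in the batch
  int outputColumn;  // index of the column that receives the result
  long scalar;       // constant added to every value

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // A filter has run: visit only the rows listed in batch.selected
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // No filter has run: rows 0..size-1 are all live
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}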
• No method calls in the inner loop
• Low instruction count
• Cache locality: a batch holds up to 1024 values
• No pipeline stalls
• SIMD in Java 8
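The role of batch.selected is easier to see when a filter runs before the arithmetic. The following self-contained Java sketch uses stand-in types: MiniBatch and VectorSketch are hypothetical names invented here, not Hive classes, and Hive's real VectorizedRowBatch and LongColumnVector also track nulls and repeated values, which this sketch ignores.

```java
// Stand-in for a row batch; hypothetical, much simpler than Hive's class.
class MiniBatch {
    final long[][] columns;        // columns[c][row]
    final int[] selected;          // indices of rows that survived filtering
    boolean selectedInUse = false; // true once a filter has narrowed the batch
    int size;                      // number of live rows

    MiniBatch(int numColumns, int rows) {
        columns = new long[numColumns][rows];
        selected = new int[rows];
        size = rows;
    }
}

class VectorSketch {
    // Filter: keep rows where columns[col] > threshold by rewriting selected[].
    // No data is copied; only the list of live row indices shrinks.
    static void filterGreaterThan(MiniBatch b, int col, long threshold) {
        long[] v = b.columns[col];
        int newSize = 0;
        if (b.selectedInUse) {
            for (int j = 0; j < b.size; j++) {
                int i = b.selected[j];
                if (v[i] > threshold) {
                    b.selected[newSize++] = i;
                }
            }
        } else {
            for (int i = 0; i < b.size; i++) {
                if (v[i] > threshold) {
                    b.selected[newSize++] = i;
                }
            }
        }
        b.size = newSize;
        b.selectedInUse = true;
    }

    // Same loop shape as the slide's evaluate(): out = in + scalar, live rows only.
    static void addScalar(MiniBatch b, int in, int out, long scalar) {
        long[] inVector = b.columns[in];
        long[] outVector = b.columns[out];
        if (b.selectedInUse) {
            for (int j = 0; j < b.size; j++) {
                int i = b.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < b.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }
}
```

After the filter, the add touches only the surviving row indices, so rows that failed the predicate cost nothing in later expressions.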
9. Vectorization and Compilation
• Vectorization “instructions” generated from templates
• Examples:
–Int add col-col
–Int add col-scalar
–Int add scalar-col
–Double add col-col
–Double add col-scalar
–Double add scalar-col
–And hundreds more!
• Pre-compilation of expressions
• Reduces # of function calls and instructions at runtime
• Expressions like (a + 2) / b are interpreted with these primitives
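As a rough sketch of how (a + 2) / b could be evaluated from such primitives, the Java below runs two whole-column passes with a scratch column in between. The names (ExpressionSketch, longColAddLongScalar, longColDivideLongCol, evalAPlus2OverB) are made up for this illustration, and producing a double from the division is an assumption of the sketch, not a statement about Hive's planner.

```java
class ExpressionSketch {
    // Primitive: out[i] = in[i] + scalar
    static void longColAddLongScalar(long[] in, long scalar, long[] out, int n) {
        for (int i = 0; i != n; i++) {
            out[i] = in[i] + scalar;
        }
    }

    // Primitive: out[i] = left[i] / right[i], computed as a double
    static void longColDivideLongCol(long[] left, long[] right, double[] out, int n) {
        for (int i = 0; i != n; i++) {
            out[i] = (double) left[i] / right[i];
        }
    }

    // (a + 2) / b as two passes: the add writes to a scratch column,
    // and the divide reads that scratch column.
    static double[] evalAPlus2OverB(long[] a, long[] b) {
        int n = a.length;
        long[] scratch = new long[n];
        double[] result = new double[n];
        longColAddLongScalar(a, 2, scratch, n);
        longColDivideLongCol(scratch, b, result, n);
        return result;
    }
}
```

The expression tree is walked once per batch rather than once per row, so the per-row work is just the two tight loops.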
10. Example of vectorized template code
  } else {
    if (batch.selectedInUse) {
      for(int j = 0; j != n; j++) {
        int i = sel[j];
        outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
      }
    } else {
      for(int i = 0; i != n; i++) {
        outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
      }
    }
  }
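Template expansion of this kind can be sketched as plain string substitution. In the Java below, only the <OperatorSymbol> placeholder comes from the slide; the <ClassName> placeholder, the template text, the generated class names and TemplateSketch itself are assumptions made for the example, and this is not Hive's actual build-time generator.

```java
class TemplateSketch {
    // A tiny template in the spirit of the slide's fragment (hypothetical text).
    static final String TEMPLATE =
          "class <ClassName> {\n"
        + "  static void evaluate(long[] vector1, long[] vector2,\n"
        + "                       long[] outputVector, int n) {\n"
        + "    for (int i = 0; i != n; i++) {\n"
        + "      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
        + "    }\n"
        + "  }\n"
        + "}\n";

    // Produce one concrete class source per (name, operator) pair.
    static String expand(String className, String operatorSymbol) {
        return TEMPLATE
            .replace("<ClassName>", className)
            .replace("<OperatorSymbol>", operatorSymbol);
    }

    public static void main(String[] args) {
        // Hypothetical class names, chosen only for the demonstration.
        System.out.print(expand("LongColAddLongColumn", "+"));
        System.out.print(expand("LongColSubtractLongColumn", "-"));
    }
}
```

Generating one specialized class per operator and type combination keeps each inner loop free of branching on the operator at runtime.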
11. Using vectorization in Hive
• set hive.vectorized.execution.enabled = true;
• Run query over ORC
• Only works for scalar types
• https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
• ~5X CPU reduction
12. Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• YARN ApplicationMaster to run DAG of Tez Tasks
• Task with pluggable Input, Processor and Output
[Diagram: Tez Task = <Input, Processor, Output>; Input → Processor → Output]
*Courtesy of Hortonworks
13. Tez: Building blocks for scalable data processing
[Diagram: three tasks, each built as Input → Processor → Output]
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
*Courtesy of Hortonworks
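The <Input, Processor, Output> task shape can be illustrated with a few lines of Java. Everything named here (TaskInput, TaskProcessor, TaskOutput, SketchTask) is hypothetical and chosen for this sketch; it is not the Tez API, and it models records as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical plug points; NOT the real Tez interfaces.
interface TaskInput { List<String> read(); }
interface TaskProcessor { List<String> process(List<String> records); }
interface TaskOutput { void write(List<String> records); }

// A task is just the composition of one input, one processor, one output.
class SketchTask {
    private final TaskInput input;
    private final TaskProcessor processor;
    private final TaskOutput output;

    SketchTask(TaskInput input, TaskProcessor processor, TaskOutput output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    void run() {
        output.write(processor.process(input.read()));
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        // A "map-like" task: in-memory input, sorting processor, in-memory output.
        SketchTask task = new SketchTask(
            () -> Arrays.asList("pear", "apple", "fig"),
            records -> {
                List<String> sorted = new ArrayList<>(records);
                Collections.sort(sorted);
                return sorted;
            },
            sink::addAll);
        task.run();
        System.out.println(sink);
    }
}
```

Swapping only the input or output (for example a shuffle input in place of an HDFS input) turns the same processor slot into a different kind of stage, which is the point of making the three parts pluggable.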
14. Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;
[Diagram: one query plan (scans of a, b and c; JOIN(a, c); JOIN(a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price)) drawn twice. Hive – MR: a chain of separate map (M) / reduce (R) jobs, with HDFS writes between jobs. Hive – Tez: a single graph of M and R tasks.]
Tez avoids unneeded writes to HDFS
*Courtesy of Hortonworks
15. Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
–Hot containers ready for immediate use
–Removes task and job launch overhead (~5s – 30s)
• Hive
–Session launch/shutdown in background (seamless, user not aware)
–Submits query plan directly to Tez Session
Native Hadoop service, not ad-hoc
*Courtesy of Hortonworks
16. Stinger Phase 3: Interactive Query In Hadoop
Query            Hive 10   Hive 0.11 (Phase 1)   Trunk (Phase 3)   Improvement
TPC-DS Query 27  1400s     39s                   7.2s              190x
TPC-DS Query 82  3200s     65s                   14.9s             200x
Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
*Courtesy of Hortonworks
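The improvement factors on this slide can be checked against the reported timings. The small Java sketch below assumes each factor compares the slowest and fastest time shown for a query; the class and method names are invented for the example.

```java
class SpeedupSketch {
    // Speedup factor = time before / time after.
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        // Timings taken from the slide.
        System.out.printf("Query 27: %.0fx%n", speedup(1400, 7.2));
        System.out.printf("Query 82: %.0fx%n", speedup(3200, 14.9));
    }
}
```

Under that assumption the ratios come out at roughly 194 and 215, which the slide rounds down to 190x and 200x.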
17. How you can use Stinger enhancements
• Use Hive 0.13
• Use ORC: create table … stored as ORC
• Enable vectorization:
set hive.vectorized.execution.enabled=true
• Enable Tez: set hive.execution.engine=tez
• See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
18. Reference(s)
• Stinger overview, Strata, fall 2013:
http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1
Slides marked “Courtesy of Hortonworks” are from Hortonworks talks