Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in Apache Tajo

Apache Tajo is an open-source big data warehouse system on Hadoop. This deck presents two efforts to improve performance in the Tajo project: query optimization, including cost-based join ordering and progressive optimization, and JIT-based vectorized processing.

  1. Query Optimization and JIT-based Vectorized Execution in Apache Tajo
     Hyunsik Choi, Research Director, Gruter
     Hadoop Summit North America 2014
  2. Talk Outline
     • Introduction to Apache Tajo
     • Key Topics
       – Query Optimization in Apache Tajo
         • Join Order Optimization
         • Progressive Optimization
       – JIT-based Vectorized Engine
  3. About Me
     • Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
     • PhD (Computer Science & Engineering, 2013), Korea University
     • Director of Research, Gruter Corp, Seoul, South Korea
     • Open-source involvement
       – Full-time contributor to Apache Tajo (2013.6 ~ )
       – Apache Tajo PMC member and committer (2013.3 ~ )
       – Apache Giraph PMC member and committer (2011.8 ~ )
     • Contact info
       – Email: hyunsik@apache.org
       – LinkedIn: http://linkedin.com/in/hyunsikchoi/
  4. Apache Tajo
     • Open-source “SQL-on-Hadoop” big data warehouse system
     • Apache Top-Level Project since March 2014
     • Supports SQL standards
     • Handles both low-latency queries and long-running batch queries
     • Features
       – Joins (inner and all outer), group-by, and sort
       – Window functions
       – Most SQL data types supported (except Decimal)
     • Recent 0.8.0 release
       – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
  5. Overall Architecture
  6. Query Optimization
  7. Optimization in Tajo: Query Optimization Steps
  8. Logical Plan Optimization in Tajo
     • Rewrite rules
       – Projection push-down
         • Pushes expressions down to operators as low as possible
         • Narrows the set of columns to read
         • Removes duplicated evaluation when expressions share a common subexpression
       – Selection push-down
         • Filters rows as early as possible to reduce the rows to be processed
       – Extensible rewrite rule interfaces (see the sketch below)
         • Allow developers to write their own rewrite rules
     • Join order optimization
       – Enumerates possible join orders
       – Determines the join order in a greedy manner
       – Currently uses a simple cost model based on table volumes
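For illustration, a minimal sketch of what an extensible rewrite rule interface might look like; the interface and type names below are hypothetical, not Tajo's actual API.

      // Hypothetical sketch of an extensible logical rewrite rule interface; not Tajo's actual API.
      interface LogicalRewriteRule {
        // Name used for logging and diagnostics.
        String getName();

        // Quick check so the optimizer can skip rules that do not apply to this plan.
        boolean isEligible(LogicalPlanStub plan);

        // Returns a rewritten plan (e.g., with selections pushed down); must preserve query semantics.
        LogicalPlanStub rewrite(LogicalPlanStub plan);
      }

      // Placeholder plan type, present only to keep this sketch self-contained.
      class LogicalPlanStub { /* root operator, relations, expressions ... */ }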
  9. Join Optimization – Greedy Operator Ordering
     findBestOrder() in GreedyHeuristicJoinOrderAlgorithm.java:

      Set<LogicalNode> remainRelations = new LinkedHashSet<LogicalNode>();
      for (RelationNode relation : block.getRelations()) {
        remainRelations.add(relation);
      }
      LogicalNode latestJoin;
      JoinEdge bestPair;
      while (remainRelations.size() > 1) {
        // Find the best join pair among all joinable operators in the candidate set.
        bestPair = getBestPair(plan, joinGraph, remainRelations);
        // remainRels = remainRels - {Ti}
        remainRelations.remove(bestPair.getLeftRelation());
        // remainRels = remainRels - {Tj}
        remainRelations.remove(bestPair.getRightRelation());
        latestJoin = createJoinNode(plan, bestPair);
        remainRelations.add(latestJoin);
      }
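As a rough illustration of the "simple cost model based on table volumes" mentioned on slide 8, the sketch below picks the joinable pair with the smallest combined input volume; the class and field names are hypothetical, and Tajo's real getBestPair() logic differs.

      import java.util.List;

      // Hypothetical volume-based pair selection; not Tajo's actual getBestPair().
      class VolumeBasedPairSelection {
        static class JoinPair { double leftVolumeBytes; double rightVolumeBytes; }

        static JoinPair pickSmallestPair(List<JoinPair> joinablePairs) {
          JoinPair best = null;
          double bestCost = Double.MAX_VALUE;
          for (JoinPair pair : joinablePairs) {
            // Cost is approximated by the sum of the estimated input volumes (bytes).
            double cost = pair.leftVolumeBytes + pair.rightVolumeBytes;
            if (cost < bestCost) {
              bestCost = cost;
              best = pair;
            }
          }
          return best;
        }
      }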
  10. Progressive Optimization (in the DAG controller)
      • Query plans are often suboptimal because they are based on estimates
      • Progressive optimization:
        – Collects statistics over the running query at runtime
        – Re-optimizes the remaining plan stages
      • Optimal ranges and partitions chosen at runtime based on operator type (join, aggregation, and sort) (since v0.2)
      • In-progress work (planned for 1.0)
        – Re-optimize join orders
        – Re-optimize the distributed join plan
          • Symmetric shuffle join >>> broadcast join
        – Shrink multiple stages into fewer stages
  11. JIT-based Vectorized Query Engine
  12. Vectorized Processing – Motivation
      • So far we have focused on I/O throughput
      • Achieved 70–110 MB/s in disk-bound queries
      • Increasing customer demand for faster storage such as SAS disks and SSDs
      • Benchmark tests with fast storage indicate performance is likely CPU-bound rather than disk-bound
      • The current execution engine is based on a tuple-at-a-time approach
  13. What is the Tuple-at-a-time Model?
      • Every physical operator produces a tuple by recursively calling next() on its child operators (see the sketch below)
      • Upside
        – Simple interface
        – Supports arbitrary operator combinations
      • Downside (performance degradation)
        – Too many function calls
        – Too many branches
        – Bad for CPU pipelining
        – Poor data/instruction cache hit rates
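A minimal sketch of the tuple-at-a-time (Volcano-style) pattern described above; the operator and class names are simplified and hypothetical, not Tajo's actual physical operator classes.

      // Hypothetical, simplified tuple-at-a-time (Volcano-style) operators; not Tajo's actual classes.
      interface PhysicalOp {
        Tuple next();                            // returns the next tuple, or null when exhausted
      }

      class Tuple { Object[] values; }

      // A filter pulls from its child via next() and returns each qualifying tuple:
      // one call chain, and at least one branch, per tuple.
      class FilterOp implements PhysicalOp {
        private final PhysicalOp child;
        private final java.util.function.Predicate<Tuple> predicate;

        FilterOp(PhysicalOp child, java.util.function.Predicate<Tuple> predicate) {
          this.child = child;
          this.predicate = predicate;
        }

        @Override
        public Tuple next() {
          Tuple t;
          while ((t = child.next()) != null) {   // a function call per input tuple
            if (predicate.test(t)) {
              return t;
            }
          }
          return null;
        }
      }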
  14. Performance Degradation
      The current implementation also uses:
      • Immutable Datum classes wrapping Java primitives (sketched below)
        – Used in expression evaluation and serialization
      • Expression trees
        – Each primitive operator evaluation involves a function call
      Resulting in:
      • Object creation overheads
      • A big memory footprint (particularly inefficient for in-memory operations)
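For illustration only, a sketch of the kind of per-value wrapper object the slide refers to; the class below is hypothetical and simplified, not Tajo's actual Datum hierarchy.

      // Hypothetical, simplified per-value wrapper. Evaluating an expression over millions of rows
      // with such wrappers allocates one object per value, unlike a primitive long[] vector.
      final class Int8DatumSketch {
        private final long value;                // the wrapped primitive

        Int8DatumSketch(long value) { this.value = value; }

        long asLong() { return value; }

        // Each binary operation allocates yet another wrapper object.
        Int8DatumSketch plus(Int8DatumSketch other) {
          return new Int8DatumSketch(this.value + other.value);
        }
      }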
  15. Benchmark Breakdown
      • TPC-H Q1:

        select
          l_returnflag, l_linestatus,
          sum(l_quantity) as sum_qty,
          sum(l_extendedprice) as sum_base_price,
          sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
          sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
          avg(l_quantity) as avg_qty,
          avg(l_extendedprice) as avg_price,
          avg(l_discount) as avg_disc,
          count(*) as count_order
        from lineitem
        where l_shipdate <= '1998-09-01'
        group by l_returnflag, l_linestatus
        order by l_returnflag, l_linestatus
  16. Benchmark Breakdown
      • TPC-H dataset (scale factor = 3)
        – 17,996,609 (about 18M) rows
      • Plain-text lineitem table (2.3 GB)
      • CSV dataset converted to Parquet format
        – To minimize other factors that may affect CPU cost
        – No compression
        – 256MB block size, 1MB page size
      • Single 1GB Parquet file
  17. Benchmark Breakdown
      • H/W environment
        – CPU: i7-4770 (3.4GHz), 32GB RAM
        – 1 SATA disk (WD2003FZEX)
          • Read throughput: 105–167 MB/s (avg. 144 MB/s) according to http://hdd.userbenchmark.com
      • Single thread and single machine
      • Directly calls next() on the root of the physical operator tree (see the driver sketch below)
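To make the measurement setup concrete, a minimal sketch of a single-threaded driver that drains and times the plan by pulling from the root operator; the types mirror the hypothetical interface sketched after slide 13 and are not Tajo's real classes.

      // Hypothetical single-threaded timing driver: pull from the plan root until exhausted.
      class BenchmarkDriverSketch {
        interface PhysicalOp { Object next(); }   // returns null when exhausted

        static void drain(PhysicalOp root) {
          long rows = 0;
          long start = System.nanoTime();
          while (root.next() != null) {           // directly call next() of the plan root
            rows++;
          }
          long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
          System.out.println(rows + " rows in " + elapsedMs + " ms");
        }
      }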
  18. Benchmark Breakdown
      (Chart, in milliseconds: CPU accounts for about 50% of total query processing time in TPC-H Q1; scan throughput about 100 MB/s)
  19. Benchmark Breakdown
      (Chart, in milliseconds: TPC-H Q1 time broken down by phase — scan of FROM lineitem, GROUP BY l_returnflag, GROUP BY l_returnflag and l_linestatus, sum(…) x 4, and avg(…) x 3)
  20. Benchmark Analysis
      • Much room for improvement
      • Each tuple evaluation may involve overhead in the tuple-at-a-time model
        – Not easy to measure cache misses and branch mispredictions
      • Each expression incurs non-trivial CPU cost
        – Interpretation overhead
        – Composite keys seem to degrade performance
      • Too many objects created (YourKit profiler analysis)
        – Hard to avoid object creation while retaining all tuple and datum instances used in in-memory operators
      • Hash aggregation
        – Java HashMap: effective, but not cheap
        – Non-trivial GC time observed in other tests when distinct keys > 10M
        – Java objects: big memory footprint, cache misses
  21. Our Solution
      • Vectorized processing
        – Columnar processing on primitive arrays
      • JIT helps the vectorization engine
        – Eliminates vectorization impediments
      • Unsafe-based in-memory structure for vectors
        – No object creation
      • Unsafe-based Cuckoo hash table
        – Fast lookups and no GC
  22. Vectorized Processing
      • Originated in database research
        – C-Store, MonetDB, and Vectorwise
      • Recently adopted in Hive 0.13
      • Key ideas:
        – Use primitive-type arrays as column values
        – Small, simple processing loops
        – In-cache processing
        – Fewer branches, better CPU pipelining
        – SIMD
      • SIMD in Java??
        – http://hg.openjdk.java.net/hsx/hotspot-main/hotspot/file/tip/src/share/vm/opto/superword.cpp
  23. Vectorized Processing
      (Figure: a sample relation with columns Id, Name, Age — rows such as (101, abc, 22), (102, def, 37), … — stored row by row in the N-ary storage model (NSM) versus as separate column value arrays in the decomposition storage model (DSM))
  24. Vectorized Processing
      (Figure: the decomposition storage model with full Id/Name/Age columns — bad cache hits — versus the vectorized model, where each column is split into vector blocks A and B that fit in cache — better cache hits; see the data-structure sketch below)
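To make the vectorized model concrete, a minimal on-heap sketch of a vector block holding per-column primitive arrays plus a selection vector; this layout is an assumption for illustration only, not Tajo's actual structure (which is off-heap and Unsafe-based, as slides 28–29 describe).

      // Hypothetical on-heap vector block for illustration; Tajo's real structure is Unsafe-based.
      class VectorBlock {
        final int vecNum;          // number of rows in this block (sized to fit in cache)
        final long[] id;           // Id column values
        final int[] age;           // Age column values
        int[] selVec;              // selection vector: indexes of rows that passed earlier filters
        int selected;              // number of valid entries in selVec

        VectorBlock(int vecNum) {
          this.vecNum = vecNum;
          this.id = new long[vecNum];
          this.age = new int[vecNum];
          this.selVec = new int[vecNum];
        }
      }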
  25. Vectorized Processing
      Example: add primitive for long and int vectors

      void MapAddLongIntColCol(int vecNum, long [] result, long [] col1, int [] col2,
                               int [] selVec) {
        if (selVec == null) {
          for (int i = 0; i < vecNum; i++) {
            result[i] = col1[i] + col2[i];
          }
        } else {
          int selIdx;
          for (int i = 0; i < vecNum; i++) {
            selIdx = selVec[i];
            result[selIdx] = col1[selIdx] + col2[selIdx];
          }
        }
      }
  26. Vectorized Processing
      Example: less-than-or-equal filter primitive for long and int vectors

      int SelLEQLongIntColCol(int vecNum, int [] resSelVec, long [] col1, int [] col2,
                              int [] selVec) {
        int selected = 0;
        if (selVec == null) {
          for (int rowIdx = 0; rowIdx < vecNum; rowIdx++) {
            resSelVec[selected] = rowIdx;                        // branch-free compaction
            selected += col1[rowIdx] <= col2[rowIdx] ? 1 : 0;
          }
        } else {
          // … (walk only the rows already listed in selVec)
        }
        return selected;                                         // number of selected rows
      }
  27. Vectorized Processing
      (Figure: an example of vectorized processing for TPC-H Q1 — column values for l_shipdate, l_discount, l_extprice, l_tax, and returnflag are split into vector blocks 1–3, and the primitives l_shipdate <= '1998-09-01', 1 - l_discount, l_extprice * l_tax, and aggregation are applied block by block; a worked example follows below)
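As a worked illustration of the pipeline in the figure, and a usage example for the two primitives on slides 25–26 (repeated here in compact form so the sketch is self-contained), the code below runs a selection primitive to build a selection vector and then feeds it to a map primitive; the class and method names are assumptions for illustration, not Tajo's generated code.

      // Hypothetical driver for one vector block: a selection primitive builds a selection vector,
      // and the following map primitive evaluates arithmetic only for the qualifying rows.
      class VectorPipelineSketch {

        static int selLEQ(int vecNum, int[] resSelVec, long[] col1, int[] col2) {
          int selected = 0;
          for (int rowIdx = 0; rowIdx < vecNum; rowIdx++) {
            resSelVec[selected] = rowIdx;                         // branch-free compaction
            selected += col1[rowIdx] <= col2[rowIdx] ? 1 : 0;
          }
          return selected;
        }

        static void mapAdd(int vecNum, long[] result, long[] col1, int[] col2, int[] selVec) {
          for (int i = 0; i < vecNum; i++) {
            int selIdx = selVec[i];
            result[selIdx] = col1[selIdx] + col2[selIdx];
          }
        }

        public static void main(String[] args) {
          long[] colA = {10, 50, 20, 70};
          int[] colB  = {30, 40, 25, 60};
          long[] result = new long[colA.length];
          int[] selVec  = new int[colA.length];

          int selected = selLEQ(colA.length, selVec, colA, colB); // rows where colA <= colB
          mapAdd(selected, result, colA, colB, selVec);           // add only for those rows

          // Rows 0 and 2 qualify: result[0] = 40, result[2] = 45; the other slots stay 0.
          System.out.println(java.util.Arrays.toString(result));
        }
      }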
  28. Vectorized Processing in Tajo
      • Unsafe-based in-memory structure for vectors
        – Fast direct memory access
        – More opportunities to use byte-level operations
      • Vectorization + just-in-time compilation
        – Byte code for vectorization primitives is generated at runtime
        – Significantly reduces branches and interpretation overhead
  29. Unsafe-based In-memory Structure for Vectors
      • One memory chunk is divided into multiple fixed-length vectors (see the sketch below)
      • Variable-length values are stored in pages of variable areas
        – Only pointers are stored in the fixed-length vector
      • Less data copying and object creation
      • Fast direct access
      • Easy byte-level operations
        – Guava’s FastByteComparisons, which compares two strings via long comparisons
        – Forked to directly access string vectors
      (Figure: a fixed area holding fixed-length vectors, including a variable-length field vector of pointers into pages in the variable area)
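A minimal sketch of an Unsafe-backed fixed-length long vector, to show the kind of direct off-heap access the slide refers to; Tajo's actual layout (one chunk split into fixed-length vectors plus variable-length pages) is more elaborate, and this class is an assumption for illustration.

      import java.lang.reflect.Field;
      import sun.misc.Unsafe;

      // Hypothetical Unsafe-backed fixed-length long vector; not Tajo's actual memory layout.
      class UnsafeLongVector {
        private static final Unsafe UNSAFE = loadUnsafe();
        private final long address;                  // base address of the off-heap chunk
        private final int vecNum;                    // number of 8-byte slots

        UnsafeLongVector(int vecNum) {
          this.vecNum = vecNum;
          this.address = UNSAFE.allocateMemory((long) vecNum * 8L);   // no Java object per value
        }

        void set(int idx, long value) { UNSAFE.putLong(address + (long) idx * 8L, value); }

        long get(int idx) { return UNSAFE.getLong(address + (long) idx * 8L); }

        void free() { UNSAFE.freeMemory(address); }  // off-heap memory is not GC-managed

        private static Unsafe loadUnsafe() {
          try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
          } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
          }
        }
      }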
  30. Vectorization + Just-in-time Compilation
      • For a single operation type, many type combinations are required:
        – INT vector (+,-,*,/,%) INT vector
        – INT vector (+,-,*,/,%) INT single value
        – INT single value (+,-,*,/,%) INT vector
        – INT column (+,-,*,/,%) LONG vector
        – …
        – FLOAT column …
      • ASM is used to generate Java byte code at runtime for the various primitives (an example of the generated shape follows below)
        – Cheaper code maintenance
        – Composite keys for sort, group-by, and hash functions
      • Fewer branches and nested loops
      • Complex vectorization primitive generation (planned)
        – Combining multiple primitives into one primitive
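For illustration, a hand-written example of the kind of specialized class such a generator might emit for one (operator, left type, right type) combination, shown here as Java source rather than ASM bytecode; the interface and class names are hypothetical, not Tajo's generated code.

      // Hypothetical shape of a generated vectorization primitive; in Tajo the equivalent
      // bytecode would be emitted at runtime with ASM rather than written by hand.
      interface MapBinaryPrimitive {
        void map(int vecNum, long[] result, long[] col1, long[] col2, int[] selVec);
      }

      // One specialization per type/operator combination, e.g. LONG + LONG, column-column:
      class MapAddLongLongColCol implements MapBinaryPrimitive {
        @Override
        public void map(int vecNum, long[] result, long[] col1, long[] col2, int[] selVec) {
          if (selVec == null) {
            for (int i = 0; i < vecNum; i++) {
              result[i] = col1[i] + col2[i];          // tight, branch-free inner loop
            }
          } else {
            for (int i = 0; i < vecNum; i++) {
              int idx = selVec[i];
              result[idx] = col1[idx] + col2[idx];
            }
          }
        }
      }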
  31. Unsafe-based Cuckoo Hash Table
      • Advantages of a Cuckoo hash table (a simplified sketch follows below)
        – Uses multiple hash functions
        – No linked lists
        – Only one item in each bucket
        – Worst-case constant lookup time
      • Single direct memory allocation for the whole hash table
        – Indexed chunks used as buckets
        – No GC overhead, even when rehashing all buckets
      • Simple and fast lookup
      • The current implementation only supports fixed-length hash buckets
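A minimal on-heap sketch of cuckoo hashing with two hash functions, to show the lookup and displacement mechanics; Tajo's table is off-heap (Unsafe-based) with fixed-length buckets, so this is an illustration under assumed hash functions, not its implementation.

      // Minimal on-heap cuckoo hash sketch (long keys -> long values) with two hash functions.
      class CuckooHashSketch {
        private static final int MAX_KICKS = 32;   // displacement limit before a rehash is needed
        private final int capacity;                // buckets per half; must be a power of two
        private final long[] keys;                 // halves: [0, capacity) and [capacity, 2*capacity)
        private final long[] values;
        private final boolean[] used;

        CuckooHashSketch(int capacity) {
          this.capacity = capacity;
          this.keys = new long[capacity * 2];
          this.values = new long[capacity * 2];
          this.used = new boolean[capacity * 2];
        }

        private int slot1(long key) {
          return (int) ((key * 0x9E3779B97F4A7C15L >>> 33) & (capacity - 1));
        }

        private int slot2(long key) {
          return capacity + (int) ((Long.rotateLeft(key, 31) * 0xC2B2AE3D27D4EB4FL >>> 33) & (capacity - 1));
        }

        // Worst-case constant lookup: a key can only live in one of its two candidate buckets.
        Long get(long key) {
          int s1 = slot1(key), s2 = slot2(key);
          if (used[s1] && keys[s1] == key) return values[s1];
          if (used[s2] && keys[s2] == key) return values[s2];
          return null;
        }

        // Insert by displacement ("kicking out" occupants) up to MAX_KICKS times;
        // returns false when the table should be grown and fully rehashed.
        boolean put(long key, long value) {
          int s1 = slot1(key), s2 = slot2(key);
          if (used[s1] && keys[s1] == key) { values[s1] = value; return true; }   // update in place
          if (used[s2] && keys[s2] == key) { values[s2] = value; return true; }
          int slot = s1;
          for (int kick = 0; kick < MAX_KICKS; kick++) {
            if (!used[slot]) { keys[slot] = key; values[slot] = value; used[slot] = true; return true; }
            long evictedKey = keys[slot], evictedValue = values[slot];             // evict the occupant
            keys[slot] = key; values[slot] = value;
            key = evictedKey; value = evictedValue;                                // re-insert the evictee
            slot = (slot == slot1(key)) ? slot2(key) : slot1(key);                 // into its alternate bucket
          }
          return false;
        }
      }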
  32. Benchmark Breakdown: Tajo JIT + Vectorized Engine
      (Chart, in milliseconds: TPC-H Q1 time broken down by phase — scanning lineitem (throughput 138 MB/s), expression evaluation (projection), hashing group-by key columns, finding all hash bucket ids, and aggregation)
  33. Summary
      • Tajo performs join order optimization and re-optimizes special cases while queries are running
      • JIT-based vectorized engine prototype
        – Significantly reduces CPU time through:
          • Vectorized processing
          • An Unsafe-based in-memory vector structure
          • Unsafe-based Cuckoo hashing
      • Future work
        – Generating a single complex primitive that processes multiple operators at a time
        – Improvements toward production quality
  34. Get Involved!
      • We are recruiting contributors!
      • General
        – http://tajo.apache.org
      • Getting started
        – http://tajo.apache.org/docs/0.8.0/getting_started.html
      • Downloads
        – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
      • JIRA issue tracker
        – https://issues.apache.org/jira/browse/TAJO
      • Join the mailing lists
        – dev-subscribe@tajo.apache.org
        – issues-subscribe@tajo.apache.org
