Query Optimization in Apache Tajo
Jihoon Son / Gruter inc.
About Me
● Jihoon Son (@jihoonson)
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
Outline
● Introduction to Tajo
● Query processing in Tajo
○ Query plans in Tajo
○ Query processing example
● Query optimization in Tajo
○ Introduction to query optimization
○ Query optimization techniques in Tajo
What is Tajo?
● Apache Top-level Project
○ Data warehouse system
■ Efficient processing of analytic queries
■ ANSI-SQL compliant
○ Scalable and rapid query execution with its own engine
■ Distributed query processing
■ Fault tolerance
○ Beyond SQL-on-Hadoop
■ Supports various types of storage
● HDFS, S3, HBase, RDBMS, ...
Highlighted Features
● Support long-running batch queries as well as
interactive ad-hoc queries
○ Fast query processing
■ Optimized scan performance
● 120 MB/sec per physical disk (SATA)
○ Reliability
■ Fault tolerance
■ No single point of failure with HA support
Highlighted Features
● Support of various kinds of data sources
○ HDFS, Amazon S3, Google Cloud Storage, HBase,
RDBMS, ...
● Mature SQL support
○ Various kinds of join support
○ Window function support
○ Cost-based query optimization
● Integration with other systems
○ Notebooks like Zeppelin
○ BI tools
Recent Release: 0.11
● Feature highlights
○ Query federation
○ JDBC-based storage support
○ Self-describing data formats support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
Architecture Overview
● Clients: JDBC client, TSQL, Web UI, REST API
○ Submit a query to the Tajo Master
● Tajo Master (with Catalog Server)
○ Manages metadata, backed by a DBMS or HCatalog
○ Allocates a query to a Query Master
● Tajo Worker
○ Query Master: sends tasks to the workers and monitors them
○ Query Executor
○ Storage Service
● Storage
Query Execution Steps
[Figure: Tajo Client, Tajo Master (Catalog Server, DBMS), Tajo Workers (Query Master, Query Executor, Storage Service), Storage]
● Initializing a query execution
○ ① Submit a query
○ ② Assign a query
○ ③ Build a query execution plan
● Executing a query
○ ④ Send tasks & monitor
○ ⑤ Read and process data
○ ⑥ Send status and progress
● Finalizing the query execution
○ ⑦ Store the result on storage
○ ⑧ Notify that query execution is completed
○ ⑨ Send the result location
○ ⑩ Read the result
Query Processing in Tajo
Query Execution Plan
● Given a user query, a query execution plan is an ordered set of steps to execute the query
○ Example
■ Read data from storage, then join on some join keys, and finally aggregate with some aggregation keys
● In Tajo, there are three kinds of query plans
○ The query master generates a logical query plan and a distributed query plan
○ The query executor of each Tajo worker generates a local query plan
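Query Planning Steps in Tajo
● SQL → SQL Analyzer → Algebraic Expression → Logical Planner → Logical Query Plan → Global Planner → Distributed Query Plan → Physical Planner → Local Query Plan
● Query Master: produces the plans up to the distributed query plan
● The distributed query plan is distributed to Tajo workers
● Query Executor: runs the Physical Planner to produce the local query plan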
Logical Query Plan
● A tree of relational algebra operators
● Example
○ SQL
    SELECT item.brand, sum(price)
    FROM sales, item
    WHERE sales.item_key = item.item_key
    GROUP BY item.brand
○ Logical query plan
    Group by (key: brand, func: sum(price))
      Join (key: item_key)
        Scan on item
        Scan on sales
Distributed Query Plan
● A plan with additional annotations for distributed execution
○ Data exchange (shuffle) keys, methods, ...
● Example: the logical query plan annotated with shuffles
    Group by (key: brand, func: sum(price))
      [Range shuffle with brand]
      Join (key: item_key)
        [Hash shuffle with item_key]
        Scan on item
        [Hash shuffle with item_key]
        Scan on sales
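The two shuffle methods named in the distributed query plan can be sketched as partitioning functions. This is a minimal illustration, not Tajo's implementation: the function names, the use of CRC32 as the hash, and the boundary list are all assumptions made for the example.

```python
import zlib
from bisect import bisect_right

def hash_partition(key: str, num_partitions: int) -> int:
    """Hash shuffle: rows with the same key always land in the same partition."""
    # CRC32 is used instead of the built-in hash() so results are stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def range_partition(key: str, boundaries: list[str]) -> int:
    """Range shuffle: partition i holds keys k with boundaries[i-1] <= k < boundaries[i]."""
    # boundaries must be sorted; len(boundaries) + 1 partitions are produced.
    return bisect_right(boundaries, key)

if __name__ == "__main__":
    for brand in ["apple", "kiwi", "zebra"]:
        print(brand, hash_partition(brand, 4), range_partition(brand, ["g", "p"]))
```

A hash shuffle only guarantees co-location of equal keys, which is enough for a join. A range shuffle additionally keeps partitions ordered relative to each other.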
Local Query Plan
● A plan with additional annotations for local execution
○ In-memory algorithm, disk-based algorithm, …
● Example: the distributed query plan annotated with local algorithms
    Group by (key: brand, func: sum(price)) [Hash aggregation]
      [Range shuffle with brand]
      Join (key: item_key) [Sort-merge join]
        [Hash shuffle with item_key]
        Scan on item
        [Hash shuffle with item_key]
        Scan on sales
Query Processing in Tajo
● A query is executed as a sequence of stages
○ A stage is the minimum unit of execution and contains at least one operator
● Each stage is processed in parallel by multiple query executors on Tajo workers
[Figure: a Join (key: item_key) over Scan on item and Scan on sales, divided into Stage 1 and Stage 2]
Query Processing Example
● SQL
    SELECT item.brand, sum(price)
    FROM sales, item
    WHERE sales.item_key = item.item_key
    GROUP BY item.brand
● Logical query plan
    Group by (key: brand, func: sum(price))
      Join (key: item_key)
        Scan on item
        Scan on sales
● Distributed query plan
○ Stage 1: Scan on item, Scan on sales, followed by a hash shuffle with item_key
○ Stage 2: Join (key: item_key), followed by a range shuffle with brand
○ Stage 3: Group by (key: brand, func: sum(price))
● Distributed processing
○ Stage 1: workers scan the splits of item and sales in parallel
○ shuffle
○ Stage 2: workers join in parallel
○ shuffle
○ Stage 3: workers group by in parallel
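The three stages of the example can be simulated in a single process. The sketch below assumes a hypothetical cluster of three workers, assumes that item_key is unique in item, and uses a hash shuffle for both exchanges for brevity (the slide uses a range shuffle with brand before the group by). All names are illustrative.

```python
from collections import defaultdict

NUM_WORKERS = 3  # hypothetical cluster size

def shuffle(rows, key_fn, n=NUM_WORKERS):
    """Route each row to the partition chosen by its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key_fn(row)) % n].append(row)
    return parts

def run_query(sales, item):
    # Stage 1: scan both tables and shuffle them by the join key.
    sales_parts = shuffle(sales, lambda r: r["item_key"])
    item_parts = shuffle(item, lambda r: r["item_key"])

    # Stage 2: each worker joins its own pair of partitions.
    joined = []
    for s_part, i_part in zip(sales_parts, item_parts):
        brand_of = {r["item_key"]: r["brand"] for r in i_part}
        for s in s_part:
            if s["item_key"] in brand_of:
                joined.append({"brand": brand_of[s["item_key"]], "price": s["price"]})
    brand_parts = shuffle(joined, lambda r: r["brand"])

    # Stage 3: each worker aggregates the brands routed to it.
    result = {}
    for part in brand_parts:
        sums = defaultdict(int)
        for r in part:
            sums[r["brand"]] += r["price"]
        result.update(sums)
    return result

if __name__ == "__main__":
    sales = [{"item_key": 1, "price": 10}, {"item_key": 2, "price": 5},
             {"item_key": 1, "price": 7}, {"item_key": 3, "price": 4}]
    item = [{"item_key": 1, "brand": "A"}, {"item_key": 2, "brand": "B"},
            {"item_key": 3, "brand": "A"}]
    print(run_query(sales, item))
```

Because each shuffle sends equal keys to the same partition, no worker ever needs rows held by another worker within a stage.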
Query Optimization in Tajo
Query Optimization
● Mostly, user queries are not optimized for
performance
● The query optimizer attempts to determine the most
efficient way to execute a user query
○ Considering the possible query plans, and choosing the
best one
Extreme Example
● Query
○ select * from t where name like 'tajo%' order by id;
● Possible plans
○ Naive plan: Scan → Sort → Filter
■ Filtering out tuples after sort
■ Large cost for sort
○ Better plan: Scan with Filter → Sort
■ Filtering out tuples immediately after scan
■ Small cost for sort
■ Reduced number of operations
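The difference between the two plans can be shown with a small in-memory sketch. The rows and function names below are made up for illustration. Each function returns the query result together with the number of tuples that had to be sorted.

```python
def naive_plan(rows):
    """Scan -> Sort -> Filter: every tuple is sorted, then most are discarded."""
    ordered = sorted(rows, key=lambda r: r["id"])
    result = [r for r in ordered if r["name"].startswith("tajo")]
    return result, len(rows)

def better_plan(rows):
    """Scan with Filter -> Sort: only matching tuples reach the sort."""
    kept = [r for r in rows if r["name"].startswith("tajo")]
    return sorted(kept, key=lambda r: r["id"]), len(kept)

if __name__ == "__main__":
    t = [{"id": 3, "name": "tajo-a"}, {"id": 1, "name": "hive"},
         {"id": 2, "name": "tajo-b"}, {"id": 0, "name": "pig"}]
    print(naive_plan(t))
    print(better_plan(t))
```

Both plans return the same rows in the same order, but the better plan sorts only the tuples that survive the filter.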
Two Kinds of Query Optimization
● Rule-based optimization
○ A set of predefined rules is used to choose a good plan
○ Usually, heuristic approaches are used
■ Ex) filters should be pushed down to the lower part of the
query plan as much as possible
● Cost-based optimization
○ Enumerating possible query plans and choosing the one
having the lowest cost
○ Cost function has an important role
● Tajo utilizes both types of optimization
Query Optimization in Tajo
● Difference from traditional query optimization
○ Unlike in traditional database systems, pre-collected statistics are less important
■ Data may be added or updated by several systems including Flume, Kafka, Tajo, …
■ Pre-collected statistics can be useful, but are not fully trustworthy
○ It is important to optimize query plans with minimal statistics
■ Volume of input relations
Query Optimization in Tajo
● Tajo has two different approaches to query optimization
○ Static optimization
■ Traditional approach
■ Optimizing the plan during the query planning phase
○ Progressive optimization
■ Optimizing the plan based on the intermediate statistics while executing the query
● A query plan can be optimized without pre-collected statistics
● Especially effective for queries which require multi-stage execution
Logical Query Plan Optimization
● Rule-based optimization
○ Access path rewrite rule
■ Choosing access path to data
■ Index scan has the highest priority if available
○ Distributivity rule
■ Reducing filters based on distributivity
○ Filter pushdown rule
■ Pushing down filters to the lowest part as much as
possible
○ In-subquery rewrite rule
■ Transforming subqueries in 'IN' filters into semi (anti) joins
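A rewrite rule of this kind can be sketched on a toy plan tree. The example below implements only one case of filter pushdown, moving a filter below a sort. The `Node` class and the function names are hypothetical and do not mirror Tajo's planner classes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    op: str                          # "scan", "sort", or "filter"
    child: Optional["Node"] = None   # single-input operators only, for brevity

def push_down_filter(plan: Node) -> Node:
    """Rewrite Filter(Sort(x)) into Sort(Filter(x)) wherever it occurs."""
    if plan.child is None:
        return plan
    plan.child = push_down_filter(plan.child)
    if plan.op == "filter" and plan.child.op == "sort":
        sort = plan.child
        plan.child = sort.child                # the filter now reads the sort's input
        sort.child = push_down_filter(plan)    # keep pushing through further sorts
        return sort
    return plan

def describe(plan: Optional[Node]) -> str:
    """Render the plan from the root operator down to the scan."""
    ops = []
    while plan is not None:
        ops.append(plan.op)
        plan = plan.child
    return " -> ".join(ops)

if __name__ == "__main__":
    naive = Node("filter", Node("sort", Node("scan")))
    print(describe(push_down_filter(naive)))
```

The swap is safe because filtering and sorting commute: removing tuples before or after ordering yields the same output.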
Logical Query Plan Optimization
● Rule-based optimization (cont'd)
○ Projection pushdown rule
■ Pushing down projections to the lowest part as much as possible
● Cost-based optimization
○ Join order optimization
■ Finding a join order of lowest cost
■ Greedy heuristic: ordering relations from small ones to large ones
● Very effective in a single-node computing environment
● Needs improvement for parallel computing environments
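The greedy heuristic can be sketched in a few lines. The relation sizes below are illustrative row counts, and the sketch ignores join predicates, so it shows only the small-to-large ordering rather than a complete join enumerator.

```python
from functools import reduce

def greedy_join_order(sizes: dict[str, int]) -> list[str]:
    """Order relations from smallest to largest (ties broken by name)."""
    return sorted(sizes, key=lambda name: (sizes[name], name))

def left_deep_plan(order: list[str]) -> str:
    """Render the order as a left-deep join tree."""
    return reduce(lambda left, right: f"({left} JOIN {right})", order)

if __name__ == "__main__":
    sizes = {"lineitem": 6_000_000, "nation": 25, "region": 5}  # illustrative only
    order = greedy_join_order(sizes)
    print(order)
    print(left_deep_plan(order))
```

Joining small relations first tends to keep intermediate results small, which is why the heuristic works well on a single node.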
Distributed Query Plan Optimization
● Rule-based optimization
○ Two-phase execution of operators
■ Operators which require data shuffling, like aggregation, join, or sort, are executed in two phases
■ The first phase is local computation to reduce the amount of shuffled data
■ The second phase produces the final result of the operation
Two-phase Execution Example
● Logical query plan
○ Scan → Group by → Sort
● Distributed query plan
○ Stage 1: Scan, Local group by
○ Stage 2: Group by, Local sort
○ Stage 3: Sort
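Two-phase aggregation can be sketched for a sum. The function names are hypothetical, and the second phase is simplified to a single merge instead of one merge per shuffle partition.

```python
from collections import Counter

def local_group_by(rows):
    """Phase 1: each worker pre-aggregates its own rows before the shuffle."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial

def final_group_by(partials):
    """Phase 2: merge the partial sums that arrive after the shuffle."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)

if __name__ == "__main__":
    worker_a = [("x", 1), ("x", 2), ("y", 5)]
    worker_b = [("x", 4), ("y", 1), ("y", 1)]
    partials = [local_group_by(worker_a), local_group_by(worker_b)]
    print(sum(len(p) for p in partials), "records shuffled instead of",
          len(worker_a) + len(worker_b))
    print(final_group_by(partials))
```

The first phase shrinks each worker's output to one record per key, so less data crosses the network.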
Distributed Query Plan Optimization
● Distributed join algorithm selection
○ Two representative distributed join algorithms
■ Join cannot be performed within a single stage in
distributed systems
● Tuples of the same join key may be distributed over cluster
nodes
■ Repartition join
● Both input relations are shuffled with the join key columns
■ Broadcast join
● Small relations are broadcast to every node before the join
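The two join algorithms can be contrasted with a single-process sketch. Inputs are lists of splits, one per worker, holding (join key, value) pairs. All function names and sample rows are assumptions made for illustration.

```python
def hash_join(left, right):
    """Join (key, value) pairs that sit on the same worker."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]

def repartition_join(left_splits, right_splits, n):
    """Shuffle both inputs by join key, then join partition by partition."""
    def shuffle(splits):
        parts = [[] for _ in range(n)]
        for split in splits:
            for key, value in split:
                parts[hash(key) % n].append((key, value))
        return parts
    left_parts, right_parts = shuffle(left_splits), shuffle(right_splits)
    return [row for lp, rp in zip(left_parts, right_parts) for row in hash_join(lp, rp)]

def broadcast_join(large_splits, small_table):
    """Copy the small table to every worker; the large input is never moved."""
    return [row for split in large_splits for row in hash_join(split, small_table)]

if __name__ == "__main__":
    employee = [[("eng", "ann"), ("ops", "bob")], [("eng", "cho")]]
    department = [[("eng", "bldg1")], [("ops", "bldg2")]]
    flat_department = [row for split in department for row in split]
    print(sorted(repartition_join(employee, department, 2)))
    print(sorted(broadcast_join(employee, flat_department)))
```

Both produce the same rows. The repartition join moves every tuple of both inputs, while the broadcast join moves only the small relation, once per worker.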
Example of Repartition Join
● select … from employee e, department d where e.DeptName = d.DeptName
Example of Broadcast Join
● select … from employee e, department d where e.DeptName = d.DeptName
Distributed Join Algorithm Selection
● Repartition join vs. broadcast join
○ Given a set of joins, some parts can be executed with broadcast join while the remaining parts are executed with repartition join
● Which parts will be executed with broadcast join?
○ Greedy heuristic: broadcast join is used as much as possible
■ The size of each input relation should be smaller than a pre-defined threshold
■ The total volume of broadcast relations should not exceed a pre-defined threshold
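The greedy selection can be sketched as follows. The parameter names `per_table_limit` and `total_limit` are hypothetical stand-ins for the two thresholds and are not Tajo configuration keys. Sizes are in arbitrary units.

```python
def choose_broadcast(sizes, per_table_limit, total_limit):
    """Pick relations to broadcast, smallest first, while both thresholds hold."""
    chosen, used = [], 0
    for name, size in sorted(sizes.items(), key=lambda kv: (kv[1], kv[0])):
        if size >= per_table_limit:
            break  # this relation and every later (larger) one is too big
        if used + size > total_limit:
            break  # the broadcast budget is exhausted
        chosen.append(name)
        used += size
    return chosen

if __name__ == "__main__":
    sizes = {"lineitem": 700_000, "supplier": 9, "nation": 2, "region": 1}
    print(choose_broadcast(sizes, per_table_limit=5, total_limit=10))
```

Relations that are not chosen would be joined with repartition join instead.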
Distributed Join Algorithm Selection Example
● select … from lineitem, nation, region …
Local Query Plan Optimization
● Selecting the best algorithm based on the current
resource status
○ Aggregation
■ Hash aggregation, sort aggregation
○ Join
■ Hash join, sort-merge join
● For sort, hash sort is used by default, spilling data to disk when it does not fit into memory
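One way to picture algorithm selection by resource status is a simple memory check. The slides do not state the actual criterion, so the rule below (use the hash-based algorithm when the input fits in free memory, otherwise the sort-based one) is purely an assumption for illustration, as are the function and parameter names.

```python
def choose_algorithm(operator: str, input_bytes: int, free_memory_bytes: int) -> str:
    """Assumed rule: hash-based if the input fits in memory, sort-based otherwise."""
    fits = input_bytes <= free_memory_bytes
    if operator == "join":
        return "hash join" if fits else "sort-merge join"
    if operator == "aggregation":
        return "hash aggregation" if fits else "sort aggregation"
    raise ValueError(f"unsupported operator: {operator}")

if __name__ == "__main__":
    print(choose_algorithm("join", input_bytes=64 << 20, free_memory_bytes=512 << 20))
    print(choose_algorithm("aggregation", input_bytes=4 << 30, free_memory_bytes=512 << 20))
```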
Progressive Optimization
● Data repartition
○ Some operators like join or aggregation require shuffling data by keys
○ The number of result partitions of a shuffle should be carefully decided
■ The number of partitions is related to the number of tasks of the next stage
● At the beginning of each stage, the number of partitions is decided based on the input size
Progressive Optimization Example
● Plan: Scan on item → Group by → Sort, executed as Stage 1, Stage 2, and Stage 3
● If the default task size is 1GB,
○ Stage 1 (Scan on item, 100GB): # of partitions: 100, # of tasks: 100
○ Stage 2 (Group by, 50GB): # of partitions: 50, # of tasks: 50
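The arithmetic in this example amounts to dividing the observed input size by the task size and rounding up. The function name and the 1 GiB default are assumptions for the sketch.

```python
GIB = 1 << 30

def num_partitions(input_bytes: int, task_bytes: int = GIB) -> int:
    """Partitions = ceil(input size / task size), with at least one partition."""
    return max(1, (input_bytes + task_bytes - 1) // task_bytes)

if __name__ == "__main__":
    print(num_partitions(100 * GIB))  # 100
    print(num_partitions(50 * GIB))   # 50
```

Because the size is measured when the stage starts, the count follows the actual intermediate data volume rather than an estimate made at planning time.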
Future Work
● Adding more optimization methods
● Improving cost functions for more effective cost-based optimization
● Adding new approaches for progressive optimization
○ Runtime query rewriting
○ Integrating with genetic algorithms
○ …
Get Involved!
● General
○ http://tajo.apache.org
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Jira – Issue Tracker
○ https://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
Thanks!
