Developing a Tez Execution Engine for Apache Pig
Cheolsoo Park, Engineer / Netflix Big Data Platform Team
Netflix Big Data Platform
CONTENTS 
1. Background 
2. What is Pig on Tez? 
3. Why Apache Tez? 
4. Shortcomings and What’s Next
1. Background
1.1 Netflix Data Pipeline 
Events Data Pipeline: Cloud apps → Suro → Ursula → S3 DW (every 15 min)
Stateful Data Pipeline: Cassandra → SSTables → Aegisthus → S3 DW (daily)
1.2 Netflix Big Data Platform 
• S3 DW and Hadoop clusters
• Federated execution engine
• Federated metadata service
• Tooling: Data Lineage, Data Visualization, Data Movement, Data Quality, Pig Workflow Visualization, Job/Cluster Performance Visualization
1.3 Data Volume 
~200 billion events/day 
~40 TB incoming data/day (compressed) 
~1.2 PB data read/day 
~100 TB data written/day 
10+ PB DW on S3
1.4 Netflix Big Data Platform 
(Same platform diagram as 1.2: S3 DW, Hadoop clusters, federated execution engine, federated metadata service, and the surrounding tooling.)
With ever-growing data, ETL runs slower and slower.
1.5 ETL Completion Trend
1.6 Common Problems 
Common problems across organizations 
1. Similar data platform architecture 
• Pig for ETL jobs 
• Hive/Presto for ad-hoc queries
1.7 Pig on Tez Team 
• Alex Bain (LinkedIn: 2013/08~2014/01, Dev) 
• Mark Wagner (LinkedIn: 2013/08~2014/01, Dev) 
• Cheolsoo Park (Netflix: 2013/08~2014/08, Dev) 
• Olga Natkovich (Yahoo: 2013/08~present, PM) 
• Rohini Palaniswamy (Yahoo: 2013/08~present, Dev) 
• Daniel Dai (Hortonworks: 2013/08~present, Dev)
2. What is Pig on Tez?
2.1 Pig Concepts 
Non-blocking operators 
1. LOAD / STORE 
2. FOREACH __ GENERATE __ 
3. FILTER __ BY __ 
Blocking operators 
1. GROUP __ BY __ 
2. ORDER __ BY __ 
3. JOIN __ BY __ 
Translated to a MapReduce shuffle
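For instance, in a minimal sketch with hypothetical relation and file names, non-blocking operators pipeline inside a single task, while the first blocking operator forces a shuffle:

a = LOAD 'input' AS (x, y);
b = FILTER a BY y IS NOT NULL;  -- non-blocking: pipelined in the same task
c = FOREACH b GENERATE x, y;    -- non-blocking
d = GROUP c BY x;               -- blocking: compiles to a MapReduce shuffle
STORE d INTO 'output';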
2.2 MapReduce Plan 
Logical Plan: LOAD → FOREACH → GROUP BY → FOREACH → STORE
Physical Plan: LOAD → FOREACH → LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE → FOREACH → STORE
MR Plan: Map (LOAD → FOREACH → LOCAL REARRANGE) → Shuffle → Reduce (PACKAGE → FOREACH → STORE)
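A minimal script of this shape (hypothetical names): the GROUP BY expands into the local/global rearrange pair, and the trailing FOREACH and STORE run in the reducer:

a = LOAD 'input' AS (x, y);              -- LOAD
b = FOREACH a GENERATE x, y;             -- FOREACH
c = GROUP b BY x;                        -- GROUP BY: rearrange + package
d = FOREACH c GENERATE group, COUNT(b);  -- FOREACH over the grouped bag
STORE d INTO 'output';                   -- STORE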
2.3 What’s the Problem? 
Restrictions imposed by MapReduce 
1. Extra intermediate output on HDFS 
2. Artificial synchronization barriers 
3. Inefficient use of resources 
4. Multi-query optimization
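For example (hypothetical script), two blocking operators in a row force two chained MR jobs, with the intermediate result materialized on HDFS between them; Tez can run the same pipeline as a single DAG:

a = LOAD 'input' AS (x, y:long);
b = GROUP a BY x;                             -- blocking: MR job 1
c = FOREACH b GENERATE group, SUM(a.y) AS total;
d = ORDER c BY total DESC;                    -- blocking again: further MR jobs reading job 1's HDFS output
STORE d INTO 'output';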
2.4 Tez Concepts 
Low-level DAG Framework 
1. Build DAG by defining vertices and edges. 
2. Customize scheduling of DAG and movement of data. 
• Sequential and concurrent 
• 1-1, broadcasting, scatter and gather 
Flexible Input-Processor-Output Model 
1. Thin API layer to wrap around arbitrary application code. 
2. Compose inputs, processors, and outputs to execute arbitrary processing. 
Input: initialize → getReader → handleEvents → close
Processor: initialize → run → handleEvents → close
Output: initialize → getWriter → handleEvents → close
2.5 Pig on Tez 
Logical Plan → LogToPhyTranslationVisitor → Physical Plan
Physical Plan → TezCompiler → Tez Plan → Tez Execution Engine
Physical Plan → MRCompiler → MR Plan → MR Execution Engine
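In Pig 0.14+, the execution engine is chosen at launch time; for example:

pig -x tez script.pig   -- selects the TezCompiler path (MapReduce remains the default)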
2.6 Tez DAG: Split + Group By + Join 
MR: Load 'foo' → split/multiplex → group by y and group by z (de-multiplexed) → HDFS → Load g1, Load g2 → Join g1, g2
Tez: Load 'foo' (multiple outputs) → 'Group by y' and 'Group by z' vertices → 'Join g1, g2' vertex (reducer follows reducer, no intermediate HDFS writes)
a = LOAD 'foo' AS (x, y, z); 
b = GROUP a BY y; 
c = GROUP a BY z; 
d = JOIN b BY group, c BY group;
2.7 Tez DAG: Order By 
MR: Load, Sample → Aggregate (stage sample map on the distributed cache) → HDFS → Load, Partition → Sort
Tez: Load, Sample → Aggregate → broadcast sample map → Partition (1-1 unsorted edge) → Sort (sample map kept in the object cache)

a = LOAD 'foo' AS (x, y); 
b = FILTER a BY y IS NOT NULL; 
c = ORDER b BY x;
3. Why Apache Tez?
3.1 DAG Execution 
DAG Execution 
1. Eliminate HDFS writes between workflow jobs. 
2. Eliminate job launch overhead of workflow jobs. 
3. Eliminate identity mappers in every workflow job. 
Benefits 
1. Faster execution and higher predictability.
3.2 MR vs. Tez
3.3 AM / Container Reuse 
AM Reuse 
1. The Grunt shell uses one AM for all commands until it times out. 
2. More than one DAG is submitted for merge join, collected group, and exec. 
Container Reuse 
1. Run new tasks on an already warmed-up JVM. 
Benefits 
1. Reduce container launch overhead. 
2. Reduce network IO. 
• 1-1 edge tasks are launched on the same node.
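A merge join is one statement that submits more than one DAG within the same AM: an index-building DAG first, then the join itself. A sketch with hypothetical names (Pig's merge join assumes inputs pre-sorted on the join key):

big   = LOAD 'big_sorted'   AS (x, y);
other = LOAD 'other_sorted' AS (x, z);
j = JOIN big BY x, other BY x USING 'merge';  -- index DAG, then join DAG
STORE j INTO 'out';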
3.4 Broadcast Edge / Object Cache 
Broadcast Edge 
1. Broadcast same data to all tasks in successor vertex. 
Object Cache 
1. Share in-memory objects for the scope of a vertex or DAG. 
Benefits 
1. Replace use of distributed cache. 
2. Avoid input fetching if cache is available on container reuse. 
• Replicated join runs faster on small clusters.
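A replicated join is the typical consumer of the broadcast edge: the small relation is broadcast to every task of the join vertex, and with the object cache it need not be re-fetched on container reuse. A sketch with hypothetical names:

big   = LOAD 'big'   AS (x, y);
small = LOAD 'small' AS (x, z);
j = JOIN big BY x, small BY x USING 'replicated';  -- 'small' (listed last) must fit in memory
STORE j INTO 'out';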
3.5 Vertex Group 
Vertex Group 
1. Group multiple vertices into a vertex group that produces a combined output. 
Benefits 
1. Better performance due to elimination of an additional vertex. 
Without vertex group: Load a, Load b → Union → Group
With vertex group: Load a, Load b → Group (the union vertex is eliminated)

a = LOAD 'a'; 
b = LOAD 'b'; 
c = UNION a, b; 
d = GROUP c BY $0;
3.6 Slow Start/Pre-launch 
Slow Start/Pre-launch 
1. A pluggable vertex manager pre-launches reducers before all maps have completed so that shuffle can start early (e.g., LIMIT not following ORDER BY). 
Benefits 
1. Better performance due to parallel execution of multiple vertices.
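One shape this covers, as a sketch with hypothetical names: the LIMIT below does not follow an ORDER BY, so its vertex can be pre-launched and start pulling shuffle output before every upstream task has finished:

a = LOAD 'input' AS (x, y);
b = FILTER a BY y > 0;
c = LIMIT b 100;   -- no ORDER BY before it, so pre-launch applies
STORE c INTO 'out';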
3.7 Performance Numbers 
MR vs. Tez runtimes:
Job 1 (2x): 20m vs 10m
Job 2 (3x): 1h22m vs 28m
Job 3 (1.7x): 2h17m vs 1h15m
Job 4 (1.2x): 33m vs 28m
Job 5 (1.0x): 3h57m vs 3h54m
3.8 Performance Deep Dive 
This MR job blocks the DAG.
3.9 Performance Deep Dive 
A huge amount of intermediate data is written to HDFS.
4. Shortcomings and What’s Next
4.1 Shortcomings 
Auto Parallelism 
1. Eliminating mappers without adjusting parallelism can make jobs run slower. 
In MR, combiners run with 1600 tasks; in Tez, combiners run with 500 tasks.
4.2 Shortcomings 
Current Status 
1. User-specified parallelism always takes precedence. 
2. If no parallelism is specified, Pig estimates it using static rules. For example, if a vertex contains a FILTER BY, its parallelism is reduced by 50%. 
3. At execution time, parallelism is adjusted again based on per-vertex sampling. 
Problems 
1. In legacy Pig jobs, parallelism is tuned for MR, so honoring user-specified parallelism can hurt performance in Tez. 
2. Static-rule-based estimation cannot always be accurate. 
3. Sample-based estimation cannot always be accurate.
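For reference, the two ways parallelism is user-specified in Pig; per rule 1 above, either one overrides the estimates (hypothetical relation names):

SET default_parallel 200;        -- script-wide default parallelism
a = LOAD 'input' AS (x, y);
b = GROUP a BY x PARALLEL 400;   -- per-operator override; takes precedence over the default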
4.3 Shortcomings 
Web UI and Tools Integration 
1. The Tez AM has no web UI (i.e., no job page). 
2. Tez isn’t integrated with the YARN Application Timeline Server (ATS) yet (i.e., no job history page). 
3. Tez isn’t integrated with Netflix-internal tools such as Inviso and Lipstick.
4.4 What’s Next? 
Tez 
1. Resolve TEZ-8: a Tez UI for progress tracking and history. 
• The latest Tez release (0.5.x) doesn’t include TEZ-8. 
Pig on Tez 
1. Improve auto parallelism and usability. 
• Pig on Tez will be included in the Pig 0.14 release, but these issues may still remain.
Q&A
THANK YOU
