1.1 Netflix Data Pipeline
[Diagram: two pipelines feed the S3 data warehouse. Events data pipeline: cloud apps → Suro → Ursula → S3 DW, with ~15 min latency. Stateful data pipeline: Cassandra SSTables → Aegisthus → S3 DW, running daily.]
1.2 Netflix Big Data Platform
[Diagram: the S3 DW sits at the center, accessed by Hadoop clusters through a federated execution engine and a federated metadata service, with supporting services for data lineage, data visualization, data movement, data quality, Pig workflow visualization, and job/cluster performance visualization.]
1.3 Data Volume
• ~200 billion events/day
• ~40 TB incoming data/day (compressed)
• ~1.2 PB data read/day
• ~100 TB data written/day
• 10+ PB DW on S3
1.4 Netflix Big Data Platform
[Same platform diagram as 1.2.]
With ever-growing data, ETL runs slower and slower.
1.6 Common Problems
Common problems across organizations:
1. Similar data platform architecture
  1. Pig for ETL jobs
  2. Hive/Presto for ad-hoc queries
1.7 Pig on Tez Team
• Alex Bain (LinkedIn: 2013/08~2014/01, Dev)
• Mark Wagner (LinkedIn: 2013/08~2014/01, Dev)
• Cheolsoo Park (Netflix: 2013/08~2014/08, Dev)
• Olga Natkovich (Yahoo: 2013/08~present, PM)
• Rohini Palaniswamy (Yahoo: 2013/08~present, Dev)
• Daniel Dai (Hortonworks: 2013/08~present, Dev)
2.1 Pig Concepts
Non-blocking operators
1. LOAD / STORE
2. FOREACH __ GENERATE __
3. FILTER __ BY __
Blocking operators (each is translated to a MapReduce shuffle)
1. GROUP __ BY __
2. ORDER __ BY __
3. JOIN __ BY __
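A minimal Pig Latin sketch (relation and field names are hypothetical) showing both kinds of operators; the GROUP BY below is the blocking step that compiles to a MapReduce shuffle:
-- non-blocking: LOAD, FILTER, and FOREACH pipeline within a single map phase
a = LOAD 'clicks' AS (user, url, ts);
b = FILTER a BY url IS NOT NULL;
-- blocking: GROUP BY forces a shuffle between map and reduce
c = GROUP b BY user;
d = FOREACH c GENERATE group, COUNT(b);
STORE d INTO 'click_counts';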
2.2 MapReduce Plan
[Diagram: one script compiled through three plans. Logical plan: LOAD → FOREACH → GROUP BY → FOREACH → STORE. Physical plan: LOAD → FOREACH → LOCAL REARRANGE → GLOBAL REARRANGE → PACKAGE → FOREACH → STORE. MR plan: the map side runs LOAD → FOREACH → LOCAL REARRANGE, the shuffle replaces GLOBAL REARRANGE, and the reduce side runs PACKAGE → FOREACH → STORE.]
2.3 What's the Problem?
Restrictions imposed by MapReduce
1. Extra intermediate output on HDFS
2. Artificial synchronization barriers
3. Inefficient use of resources
4. Multi-query optimization
2.4 Tez Concepts
Low-level DAG framework
1. Build a DAG by defining vertices and edges.
2. Customize scheduling of the DAG and movement of data.
• Sequential and concurrent execution
• 1-1, broadcast, and scatter-gather edges
Flexible Input-Processor-Output model
1. Thin API layer to wrap around arbitrary application code.
2. Compose inputs, processors, and outputs to execute arbitrary processing.
[Diagram: the Input, Processor, and Output interfaces each expose initialize, handleEvents, and close; Input adds getReader, Processor adds run, and Output adds getWriter.]
2.5 Pig on Tez
[Diagram: the logical plan is translated by LogToPhyTranslationVisitor into the physical plan; from there, TezCompiler produces a Tez plan for the Tez execution engine, while MRCompiler produces an MR plan for the MR execution engine.]
2.6 Tez DAG: Split + Group By + Join
[Diagram: in MR, the script needs a split with multiplex/de-multiplex, writes both group-by outputs to HDFS, and then loads g1 and g2 back to join them. In Tez, a single Load 'foo' vertex feeds the Group by y and Group by z vertices through multiple outputs, and the Join g1, g2 vertex follows them directly (reducer follows reducer), with no HDFS intermediates.]
a = LOAD 'foo' AS (x, y, z);
b = GROUP a BY y;
c = GROUP a BY z;
d = JOIN b BY group, c BY group;
2.7 Tez DAG: Order By
[Diagram: in MR, ORDER BY takes multiple jobs: load and sample, aggregate, stage the sample map on the distributed cache through HDFS, then load, partition, and sort. In Tez, the Load, Sample vertex feeds Aggregate, which broadcasts the sample map to the Partition vertex over a broadcast edge; Partition connects to Sort over a 1-1 unsorted edge, and the sample map is cached.]
a = LOAD 'foo' AS (x, y);
b = FILTER a BY y IS NOT NULL;
c = ORDER b BY x;
3.3 AM / Container Reuse
AM Reuse
1. The Grunt shell uses one AM for all commands until timeout.
2. More than one DAG is submitted for merge join, collected group, and exec.
Container Reuse
1. Run new tasks on already warmed-up JVMs.
Benefits
1. Reduces container launch overhead.
2. Reduces network IO.
• 1-1 edge tasks are launched on the same node.
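A hedged sketch of AM reuse in the Grunt shell (assuming Pig 0.14+ launched with pig -x tez; relation and path names are hypothetical). Each STORE submits its own DAG, but the whole session is served by a single Tez AM:
grunt> a = LOAD 'foo' AS (x, y);
grunt> b = GROUP a BY x;
grunt> STORE b INTO 'out1'; -- first DAG; the AM is launched
grunt> c = FILTER a BY y > 0;
grunt> STORE c INTO 'out2'; -- second DAG; the same AM is reused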
3.4 Broadcast Edge / Object Cache
Broadcast Edge
1. Broadcasts the same data to all tasks in the successor vertex.
Object Cache
1. Shares in-memory objects for the scope of a vertex and a DAG.
Benefits
1. Replaces use of the distributed cache.
2. Avoids input fetching if the cache is available on container reuse.
• Replicated join runs faster on a small cluster.
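As an illustration, a replicated join in Pig Latin (hypothetical inputs): the small relation is shipped to every task of the join vertex, which is exactly where the broadcast edge and object cache replace the distributed cache:
big = LOAD 'big_table' AS (x, y);
small = LOAD 'small_table' AS (x, z);
-- 'replicated' hints that the small relation fits in memory and is
-- broadcast to all join tasks instead of being shuffled
j = JOIN big BY x, small BY x USING 'replicated';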
3.5 Vertex Group
Vertex Group
1. Groups multiple vertices into a vertex group that produces a combined output.
Benefits
1. Better performance due to elimination of an additional vertex.
[Diagram: with a separate Union vertex, Load a and Load b feed Union, which feeds Group; with a vertex group, the two load vertices are grouped and feed Group directly.]
a = LOAD 'a';
b = LOAD 'b';
c = UNION a, b;
d = GROUP c BY $0;
3.6 Slow Start / Pre-launch
Slow Start / Pre-launch
1. A pluggable vertex manager pre-launches the reducers before all maps have completed so that the shuffle can start (e.g. LIMIT not following ORDER BY).
Benefits
1. Better performance due to parallel execution of multiple vertices.
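A sketch of the LIMIT case noted above (hypothetical input): because the LIMIT does not follow an ORDER BY, downstream tasks need no total order and can be pre-launched while upstream tasks are still running:
a = LOAD 'foo' AS (x, y);
b = GROUP a BY x;
c = LIMIT b 10; -- any 10 groups will do; no sort required first
STORE c INTO 'out';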
3.7 Performance Numbers
[Bar chart: MR vs Tez runtimes in minutes for five production jobs.]
Job 1 (2x): 20m (MR) vs 10m (Tez)
Job 2 (3x): 1h22m vs 28m
Job 3 (1.7x): 2h17m vs 1h15m
Job 4 (1.2x): 33m vs 28m
Job 5 (1.0x): 3h57m vs 3h54m
4.1 Shortcomings
Auto Parallelism
1. Eliminating mappers without adjusting parallelism can make jobs run slower.
• In MR, combiners run with 1600 tasks; in Tez, combiners run with 500 tasks.
4.2 Shortcomings
Current Status
1. User-specified parallelism always takes precedence.
2. If no parallelism is specified, Pig estimates it using static rules. For example, if a vertex contains a FILTER BY, its parallelism is reduced by 50%.
3. At execution time, parallelism is adjusted again based on per-vertex sampling.
Problems
1. Legacy Pig jobs have parallelism tuned for MR, so honoring user-specified parallelism can hurt performance on Tez.
2. Static-rule-based estimation cannot always be accurate.
3. Sample-based estimation cannot always be accurate.
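For reference, a sketch of how parallelism gets pinned in Pig Latin (values are hypothetical); both forms take precedence over Pig's own estimates, which is exactly what can backfire when a script tuned for MR moves to Tez:
-- script-wide default reduce parallelism
SET default_parallel 100;
a = LOAD 'foo' AS (x, y);
-- per-operator parallelism overrides the default
b = GROUP a BY x PARALLEL 100;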
4.3 Shortcomings
Web UI and Tools Integration
1. The Tez AM has no UI (i.e. no job page).
2. Tez is not yet integrated with YARN ATS (i.e. no job history page).
3. Tez is not yet integrated with Netflix internal tools such as Inviso and Lipstick.
4.4 What's Next?
Tez
1. Resolve TEZ-8: Tez UI for progress tracking and history.
• The latest Tez 0.5.x release doesn't include TEZ-8.
Pig on Tez
1. Improve auto parallelism and usability.
• Pig on Tez will be included in the Pig 0.14 release, but these issues might still be present.