Pig on Tez: Low Latency Data Processing with Big Data
Daniel Dai
@daijy
Rohini Palaniswamy
@rohini_pswamy
Hadoop Summit 2015, Brussels
Agenda
• Team Introduction
• Apache Pig
• Why Pig on Tez?
• Pig on Tez
- Design
- Tez features in Pig
- Performance
- Current status
- Future Plan
2
3
Apache Pig on Tez Team
Daniel Dai
Pig PMC
Hortonworks
Rohini Palaniswamy
VP Pig, Pig PMC
Yahoo!
Olga Natkovich
Pig PMC
Yahoo!
Cheolsoo Park
Pig PMC
Netflix
Mark Wagner
Pig Committer
LinkedIn
Alex Bain
Pig Contributor
LinkedIn
Pig Latin
• Procedural data processing language
• More than SQL and feature rich
• Turing complete: macros, looping, branching
4
Multiquery, Nested Foreach, Scalars
Algebraic and Accumulator Java UDFs
Non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby)
Distributed Orderby, Skewed Join

a = load 'student.txt' as (name, age, gpa);
b = filter a by age > 20 and age <= 25;
c = group b by age;
d = foreach c generate age, AVG(gpa);
store d into 'output';
Pig users
• ETL users
- Pig syntax is very similar to ETL tools
• Data scientists
- Feature rich
- Python UDFs
- Looping
• At Yahoo!
- 60% of total Hadoop jobs run daily
- 12 million monthly Pig jobs
• Other heavy users
- Twitter
- Netflix
- LinkedIn
- eBay
- Salesforce
5
Why Pig on Tez?
• MR restrictions
- Too restrictive; Pig cannot process data as fast as it otherwise could
• Requirements for a new execution engine
- General DAG engine
- Powerful and rich API
- Leverages Hadoop
• Tez is a perfect fit
- Low level DAG framework
- Powerful: define vertex and edge semantics, customize with plugins
- Performance, without having to increase memory
- Resource efficient
- Natively built on top of YARN
  • Multi-tenancy and resource allocation come for free
  • Scale
  • Stable
  • Security
- Excellent support from the Tez community
  • Bikas Saha, Siddharth Seth, Hitesh Shah, Jeff Zhang, Rajesh Balamohan
6
PIG on TEZ
Design
8
[Diagram: a Pig script is compiled to a Logical Plan, then through LogToPhyTranslationVisitor to a Physical Plan; from there TezCompiler produces a Tez Plan for the Tez execution engine, and MRCompiler produces an MR Plan for the MR execution engine]
DAG Plan – Split Group by + Join
9
f = LOAD 'foo' AS (x, y, z);
g1 = GROUP f BY y;
g2 = GROUP f BY z;
j = JOIN g1 BY group, g2 BY group;
[Diagram: In the MR plan, one job loads foo and runs both group-bys (multiquery), writes to HDFS, then a second job loads g1 and g2 and joins them. In the Tez plan, a single "Load foo" vertex with multiple outputs (split/multiplex) feeds the "Group by y" and "Group by z" vertices, which feed the "Join" vertex directly (reduce follows reduce, de-multiplex) without intermediate HDFS writes]
DAG Execution - Visualization
10
[Diagram: Vertex 1 (Load, with MRInput from HDFS) feeds Vertex 2 (Group) and Vertex 3 (Group), which both feed Vertex 4 (Join, with MROutput to HDFS)]
DAG Plan – Distributed Orderby
11
A = LOAD 'foo' AS (x, y);
B = FILTER A by $0 is not null;
C = ORDER B BY x;

[Diagram: In the MR plan, a first job loads, filters and samples the data, an aggregate step builds the sample map and stages it on the distributed cache, then later Map/Reduce stages load and filter again, partition using the sample map, and sort, with HDFS writes in between. In the Tez plan, a single "Load/Filter & Sample" vertex feeds an "Aggregate Sample" vertex; the sample map is broadcast to the "Partition" vertex and cached; "Partition" feeds the "Sort" vertex over a 1-1 unsorted edge]
Custom Vertex Input/Output/Processor/Manager
• Vertex
- Data pipeline
• Edge (see the Java sketch below)
- Unsorted input/output
  • Union, sample
- Broadcast edge (replicated join, order by and skewed join)
- 1-1 edge (order by, skewed join, and multiquery off)
  • 1-1 edge tasks are launched on the same node
• Custom Vertex Manager – automatic parallelism estimation
12
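To make the edge semantics above concrete, here is a minimal, hedged sketch of how a broadcast edge and a 1-1 edge can be declared with the Tez Java DAG API. The processor class names (example.*) are illustrative placeholders, not Pig's actual implementation, and the edges are shown without the configuration payloads a real job would carry.

```java
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class EdgeTypesSketch {
  public static DAG buildDag() {
    // Hypothetical processors; Pig's real vertices run its own processor implementation.
    Vertex sample = Vertex.create("Sample", ProcessorDescriptor.create("example.SampleProcessor"), 1);
    Vertex partition = Vertex.create("Partition", ProcessorDescriptor.create("example.PartitionProcessor"), 100);
    Vertex sort = Vertex.create("Sort", ProcessorDescriptor.create("example.SortProcessor"), 100);

    // Broadcast edge: every Partition task receives the full (unsorted) sample output.
    EdgeProperty broadcast = EdgeProperty.create(
        DataMovementType.BROADCAST, DataSourceType.PERSISTED, SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("org.apache.tez.runtime.library.output.UnorderedKVOutput"),
        InputDescriptor.create("org.apache.tez.runtime.library.input.UnorderedKVInput"));

    // 1-1 edge: task i of Sort consumes only the output of task i of Partition,
    // and Tez tries to schedule both on the same node.
    EdgeProperty oneToOne = EdgeProperty.create(
        DataMovementType.ONE_TO_ONE, DataSourceType.PERSISTED, SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("org.apache.tez.runtime.library.output.UnorderedKVOutput"),
        InputDescriptor.create("org.apache.tez.runtime.library.input.UnorderedKVInput"));

    DAG dag = DAG.create("edge-types-sketch");
    dag.addVertex(sample).addVertex(partition).addVertex(sort);
    dag.addEdge(Edge.create(sample, partition, broadcast));
    dag.addEdge(Edge.create(partition, sort, oneToOne));
    return dag;
  }
}
```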
Session Reuse
• MapReduce problem
- Every MapReduce job requires a separate AM
- The AM is killed after every MapReduce job
- Resource congestion
• Tez solution
- Each DAG needs only a single AM
- Session reuse: the same AM can run further DAGs (sketch below)
• How Pig uses session reuse
- A typical Pig script produces only one DAG
- Pig maintains a Tez session pool
  • The Grunt shell uses one session for all commands until timeout
  • More than one DAG is submitted for merge join and 'exec'
  • Multiple DAGs launched by a Python script
13
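As a rough illustration of session reuse, here is a hedged sketch using the Tez client API: one TezClient started in session mode submits several DAGs to the same AM. The application name and the surrounding method are placeholders; Pig's own session pool logic is more involved.

```java
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;

public class SessionReuseSketch {
  public static void runAll(DAG... dags) throws Exception {
    TezConfiguration tezConf = new TezConfiguration();
    // The third argument enables session mode: one AM serves every DAG submitted below.
    TezClient session = TezClient.create("pig-session-sketch", tezConf, true);
    session.start();
    try {
      for (DAG dag : dags) {
        session.waitTillReady();               // block until the shared AM can accept a DAG
        DAGClient dagClient = session.submitDAG(dag);
        DAGStatus status = dagClient.waitForCompletion();
        System.out.println(dag.getName() + " finished: " + status.getState());
      }
    } finally {
      session.stop();                          // the AM is only torn down when the session ends
    }
  }
}
```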
Container Reuse and Object Caching
• MapReduce problem
- Containers are killed after every task
  • Launching a JVM takes time, which is more noticeable in small jobs
- Resource localization overhead
- Resource congestion
• Tez solution
- Reuse the container whenever possible
- Object caching
• User impact
- Custom LoadFunc/StoreFunc/UDFs have to be reviewed/profiled and fixed for static
variables and memory leaks, since the JVM is reused (sketch below)
14
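The following is a sketch of the kind of review the user-impact bullet asks for: a hypothetical Java UDF whose static state would have been cleaned up by JVM exit under MR, but which must be bounded and initialized lazily and safely once the same JVM starts running many tasks under Tez container reuse. The class name and cache contents are illustrative only.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: looks up a description for a code from a small reference table.
public class LookupDescription extends EvalFunc<String> {
  // Under MR each task ran in a fresh JVM, so careless static state was "reset" by JVM
  // exit. With Tez container reuse the same JVM runs many tasks, so static state must
  // be bounded, thread-safe, and valid across tasks (or explicitly reinitialized).
  private static Map<String, String> cache;

  private static synchronized Map<String, String> getCache() {
    if (cache == null) {
      cache = new HashMap<String, String>();
      // Load reference data here (e.g. from a small side file). Keep it bounded:
      // it now lives for the lifetime of the reused container, not a single task.
    }
    return cache;
  }

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    String key = input.get(0).toString();
    return getCache().get(key);
  }
}
```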
Vertex Groups
• MapReduce problem
- A separate MapReduce job is needed just to do the union
• Tez solution
- Group multiple vertices into one vertex group that produces a combined output (sketch below)
15
A = LOAD 'input';
B = GROUP A by $0;
C = GROUP A by $1;
D = UNION B, C;

[Diagram: a "Load A" vertex feeds the "GROUP B" and "GROUP C" vertices, whose outputs are combined by a UNION vertex group]
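For illustration, here is a hedged sketch of how a union like the one above can be expressed with Tez's vertex group API: the two group vertices share one data sink, so no separate union vertex (or MR job) is needed. Processor class names, the output path, and the omitted edge/output payloads are assumptions; Pig's actual translation differs in detail.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.VertexGroup;
import org.apache.tez.mapreduce.output.MROutput;

public class UnionVertexGroupSketch {
  public static DAG buildDag() {
    Vertex load = Vertex.create("LoadA", ProcessorDescriptor.create("example.LoadProcessor"), 10);
    Vertex groupB = Vertex.create("GroupByCol0", ProcessorDescriptor.create("example.GroupProcessor"), 5);
    Vertex groupC = Vertex.create("GroupByCol1", ProcessorDescriptor.create("example.GroupProcessor"), 5);

    // Shuffle (scatter-gather) edge, reused for both group-bys; partitioner config omitted.
    EdgeProperty shuffle = EdgeProperty.create(
        DataMovementType.SCATTER_GATHER, DataSourceType.PERSISTED, SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput"),
        InputDescriptor.create("org.apache.tez.runtime.library.input.OrderedGroupedKVInput"));

    DAG dag = DAG.create("union-vertex-group-sketch");
    dag.addVertex(load).addVertex(groupB).addVertex(groupC);
    dag.addEdge(Edge.create(load, groupB, shuffle));
    dag.addEdge(Edge.create(load, groupC, shuffle));

    // Both group vertices write to one shared data sink: their output files land in a
    // single output location, so no extra "union" vertex runs.
    VertexGroup union = dag.createVertexGroup("UnionBC", groupB, groupC);
    union.addDataSink("unionOut",
        MROutput.createConfigBuilder(new Configuration(), TextOutputFormat.class,
            "/tmp/union-output").build());
    return dag;
  }
}
```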
Automatic Parallelism
• Setting parallelism manually is hard
• Automatic parallelism
- Preliminary calculation at compile time
  • Rough, since we don't have statistics about the data
- Gradually adjust the parallelism as the DAG progresses
- Dynamically change parallelism before a vertex starts
16
Dynamic Parallelism
• Dynamically adjust parallelism before a vertex starts
• Tez VertexManagerPlugin
- Custom policy to determine parallelism at runtime
- Library of common policies: ShuffleVertexManager
17
Dynamic Parallelism - ShuffleVertexManager
18
[Diagram: "Load A" and "Load B" feed a "JOIN" vertex whose parallelism is reduced from 4 to 2 at runtime, with the eliminated partitions rerouted to the remaining tasks]
• Used by group by, hash join, etc. (see the sketch below)
• Dynamically reduces the parallelism of a vertex based on its estimated input size
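As a minimal, hedged sketch of the API involved: a vertex is given a deliberately generous parallelism and the library ShuffleVertexManager is attached so the parallelism can be lowered once enough upstream output has been produced. The processor name is a placeholder, and the auto-parallelism thresholds (desired bytes per task, min/max source-task fractions) that would normally ride in the descriptor's user payload are omitted.

```java
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.dag.api.VertexManagerPluginDescriptor;
import org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager;

public class AutoParallelismSketch {
  public static Vertex joinVertex() {
    // Start with a high compile-time estimate; the vertex manager can lower it at
    // runtime once the real input size is known from completed upstream tasks.
    Vertex join = Vertex.create("Join", ProcessorDescriptor.create("example.JoinProcessor"), 1000);
    join.setVertexManagerPlugin(
        VertexManagerPluginDescriptor.create(ShuffleVertexManager.class.getName()));
    return join;
  }
}
```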
Dynamic Parallelism – PartitionerDefinedVertexManager
• Custom VertexManager used by order by / skewed join
• Dynamically increases / decreases parallelism based on input size
19
[Diagram: a "Load/Filter & Sample" vertex feeds a "Sample Aggregate" vertex that calculates the required parallelism; that number is sent to the VertexManager of the "Sort" vertex, and the "Partition" vertex then partitions the data with the adjusted parallelism]
Pig Grace Parallelism
• Problem with dynamic parallelism
- When a vertex is about to start, its parents have already run, so:
  • Tez can only decrease parallelism, not increase it
  • Merging partitions is possible, but there is a cost associated with it
• Idea: change parallelism as the DAG progresses
20
Pig Grace Parallelism
21
[Diagram: an example DAG of vertices A–I with compile-time parallelism estimates (10, 15, 20, 200, 999, ...); as the DAG progresses, the rough estimates are revised at runtime (to values such as ->20, ->50, ->100, ->200, ->500) before the corresponding vertices start]
Tez UI
22
Tez UI
23
• Functionally equivalent or superior to the MR UI
• Rich information about an application
- DAG graph
- Application / DAG / Vertex / Task / TaskAttempts
- Swimlane
- Counters
• Built on the YARN Timeline Server
- In active development; will benefit from scalability improvements
- Possibility to extend
  • Pig-specific view
  • Vertices show a Pig code snippet
PERFORMANCE
Performance numbers
25
Production Pig scripts, run time in minutes (MR vs Tez):

Script         Speedup  MR jobs                    Tasks (MR vs Tez)  Time (MR vs Tez)
Prod script 1  1.5x     1 MR job                   3172 vs 3172       28 vs 18 min
Prod script 2  2.1x     12 MR jobs                 966 vs 941         11 vs 5 min
Prod script 3  1.5x     4 MR jobs on 8.4 TB input  21397 vs 21382     50 vs 35 min
Performance numbers
26
MapReduce vs Tez counters for the same workload:

MAPREDUCE
Job    HDFS_BYTES_READ  HDFS_BYTES_WRITTEN  CPU_MILLIS  WALL CLOCK (ms)
Job1   1,012,118,530    1,687,272,011       728,200
Job2   3,870,164,205    405,630,672         2,310,550
Job3   2,092,933,483    5,628,320,670       653,560
Total  6,975,216,218    7,721,223,353       3,692,310   503,069

TEZ
Job    HDFS_BYTES_READ  HDFS_BYTES_WRITTEN  CPU_MILLIS  WALL CLOCK (ms)
Job1   3,194,970,415    5,628,460,999       2,585,630   172,223

3.5 GB fewer bytes read, 2 GB fewer bytes written, 2.9x faster
Performance numbers
27
Production Pig scripts, run time in minutes (MR vs Tez):

Script         Speedup  MR jobs
Prod script 1  2.52x    5 MR jobs
Prod script 2  2.02x    5 MR jobs
Prod script 3  2.22x    12 MR jobs
Prod script 4  1.75x    15 MR jobs

(MR vs Tez run times from the chart: 25 vs 10 min, 34 vs 16 min, 2h 22m vs 1h 21m, 1h 46m vs 48m)
Performance Numbers – Interactive Query
28
[Chart: TPC-H Q10 run time in seconds, MR vs Tez, for input sizes of 10G, 5G, 1G and 500M; Tez speedups of 2.49x, 3.41x, 4.89x and 6x respectively]
• When the input data is small, latency dominates
• Tez significantly reduces latency through session/container reuse
Performance Numbers – Iterative Algorithm
29
• Pig can be used to implement iterative algorithms using embedding (sketch below)
• Iterative algorithms are ideal for container reuse
• Example: k-means algorithm
- Each iteration takes 1.48s on average after the first iteration (vs 27s for MR)
[Chart: k-means run time in seconds for 10, 50 and 100 iterations, MR vs Tez, with Tez speedups of 5.37x, 13.12x and 14.84x respectively]
* Source code can be downloaded at http://hortonworks.com/blog/new-apache-pig-features-part-2-embedding
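The linked blog post implements k-means with Pig embedded in Python (Jython). As a hedged illustration of the same iterate-until-done pattern, here is a minimal sketch using Pig's Java PigServer API instead; the paths, the placeholder query, and the omitted convergence test are assumptions, not the blog's actual code.

```java
import org.apache.pig.PigServer;

public class IterativeSketch {
  public static void main(String[] args) throws Exception {
    // "tez" selects the Tez execution engine (Pig 0.14+); reusing one PigServer lets
    // successive iterations benefit from Tez session and container reuse.
    PigServer pig = new PigServer("tez");
    String centroids = "/tmp/centroids/iter0";          // hypothetical seed centroids
    for (int i = 1; i <= 10; i++) {
      String next = "/tmp/centroids/iter" + i;
      pig.registerQuery("points = LOAD '/tmp/points' AS (x:double, y:double);");
      pig.registerQuery("centers = LOAD '" + centroids + "' AS (cx:double, cy:double);");
      // ... assign points to the nearest center and recompute centers (omitted);
      // the query below is only a placeholder that produces one output tuple ...
      pig.registerQuery("new_centers = FOREACH (GROUP points ALL) GENERATE AVG(points.x), AVG(points.y);");
      pig.store("new_centers", next);
      centroids = next;
      // a real implementation would compare old and new centers here and stop on convergence
    }
    pig.shutdown();
  }
}
```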
Performance is proportional to …
• Number of stages in the DAG
- The more stages in the DAG, the better Tez performs relative to MR, due to the
elimination of intermediate map read stages.
• Size of intermediate output
- The larger the intermediate output, the better Tez performs relative to MR, due to
reduced HDFS usage.
• Cluster/queue capacity
- The more congested a queue is, the better Tez performs relative to MR, due to
container reuse.
• Size of data in the job
- For smaller data and more stages, Tez performs better relative to MR, because
launch overhead is a larger percentage of total time for smaller jobs.
30
CURRENT & FUTURE
User Impact
• Tez
- Zero-pain deployment
- Install the Tez libraries on local disk and copy them to HDFS
• Pig
- No-pain migration from Pig on MR to Pig on Tez
  • Existing scripts work as is, without any modification
  • Only two additional steps to execute in Tez mode
    – export TEZ_HOME=/tez-install-location
    – pig -x tez myscript.pig
- Users should review/profile and fix custom LoadFunc/StoreFunc/UDFs for static
variables and memory leaks due to JVM reuse.
32
Pig on Tez Release
• Already released with Pig 0.14.0 in November 2014
- Illustrate is not implemented on Tez
- All 1000+ MR MiniCluster unit tests pass on Tez
- All 683 e2e tests pass on Tez
- Integrated with Oozie
33
Improvements in Pig 0.15.0
• Local mode stabilization
- Ported all MR local mode unit tests to Tez
• Bug fixes
- AM scalability
- Errors with complex scripts
• Tez UI integration
• Performance improvements
- Shared-edge support
• Grace automatic parallelism
34
Yahoo! Production Experience
• Both Pig 0.14 and Pig 0.11 are deployed on all clusters.
• Pig 0.14 is the current version on research clusters.
- 5K nodes and 10-15K Pig jobs a day in the biggest research cluster.
• Pig 0.11 is still the current version on production clusters. In the process of
fixing Tez and ATS scale issues before making them current on prod clusters
running 100-150K Pig jobs per day.
- Scalability issues with ATS and its backend. Rewrote the ATS LevelDB plugin.
Exploring RocksDB until ATS v2 with an HBase backend is available.
- Issues with the Tez UI.
- Scalability issues with Tez for huge DAGs (>50 vertices) with high parallelism.
• All the Pig fixes from Yahoo! production will be in Pig 0.15.0 (May 2015)
• All the Tez fixes will be in Tez 0.7.0 (May 2015)
35
What next?
• Improve the Tez UI
- Tez UI with a Pig-specific view
- Tez UI scalability
• Custom edge manager and data routing for skewed join
• Group by and join using hashing, avoiding sorting
• Dynamic reconfiguration of the DAG
- Automatically determine the type of join: replicated, skewed or hash join
• PIG-3839 – umbrella JIRA for more performance improvements
36
Questions
37

Editor's Notes

  • #4 First, let me introduce the Pig on Tez project team. This team is a bit special compared to other projects: there is no single company driving the development. Instead, we have a virtual team consisting of Rohini and Olga from Yahoo!, Cheolsoo from Netflix, Mark and Alex from LinkedIn, and me from Hortonworks. We work independently but in a very coordinated manner: we have a weekly standup, we have sprints, and we use Apache to cooperate. This model has turned out to work very well. Cheolsoo and Mark actually gave a talk at last year's ApacheCon about it.
  • #5 Let me spend one slide talking about what Pig is. Pig is a data processing language. It is SQL-like, but there are some differences. Pig is a procedural language: you process the data step by step. Here is one example of a Pig script. In this example… Unlike SQL, you don't have to scramble everything into a single SQL statement, which is more natural and intuitive. Pig can process an HDFS file directly, which means you don't have to create a table first; you don't even need a schema. Pig is feature rich: it has features SQL doesn't have, such as… Pig is also Turing complete. You can write Pig macros, and you can embed Pig inside a Python script so you can do things like looping and branching, which you cannot do in SQL.
  • #6 We typically see two major groups of people using Pig. ETL people use Pig because Pig syntax is very similar to ETL tools; Pig operators such as filter, foreach, group and join also exist in ETL tools. The other group of people using Pig are data scientists. They need richer features which SQL cannot provide, such as nested operators; they like to use Python to write UDFs; and they want to do looping in their scripts. We see a lot of companies using Pig. At Yahoo!… Other heavy users include…
  • #7 Pig was traditionally built on top of MapReduce. MapReduce is good: it is stable and scalable. MapReduce provides a simple API, but at the same time it is very restrictive. Things worked very well initially. However, as more and more people migrated their workloads to Pig and depended on Pig for their daily jobs, they asked for more. They want Pig to be fast. With the restrictions of MapReduce, though, Pig cannot process the data as fast as it could. So we went looking for a new engine. This new engine must be a general DAG engine, which is what a traditional database engine is based on, so that we can do everything a traditional database engine can do. We need a powerful API to deal with the complex requirements Pig has. At the same time, we don't want to change things completely. We want to continue to use the Hadoop cluster; we like the stability and scalability of Hadoop; and we want to keep the enterprise features of Hadoop, such as multi-tenancy, security, encryption and rolling upgrades. After evaluating different kinds of DAG execution engines, we felt Tez fit all our needs best. Tez is a low-level DAG execution engine. The API is much more complex than MapReduce, so it is not end-user facing. However, for things like Pig, Hive and Cascading that is not a problem: these tools can hide the engine's complexity internally and not expose it to the user. Tez offers a rich API: we can define vertices and edges with semantics, and plugins to customize the DAG behavior. Tez is fast whether you have a lot of memory or not; if you have memory, Tez will make use of it, and if not, Tez still performs much faster than MapReduce. Tez is resource efficient: compared to MapReduce, it needs fewer processing nodes. Tez is built on top of YARN, so it inherits a lot of YARN features natively: it is multi-tenancy enabled, it is scalable, it is relatively stable for a new engine since it leverages a lot of existing MapReduce code, you can use Kerberos for security, and so on. Last but not least, the Tez team is awesome. Thanks to Bikas, Sidd and Hitesh for providing excellent community support; if you hit issues, you can count on them to fix them quickly.
  • #11 Let's take a deeper look at the Tez DAG to see exactly what happens. Vertex 1 loads the data from HDFS only once. Just like MapReduce, it will partition the data and shuffle it to the downstream vertices. Notice Vertex 1 has two outputs, so the data is partitioned and shuffled on different keys independently to Vertex 2 and Vertex 3. At Vertex 2 and Vertex 3, the data is merged and sorted. After processing the data, Vertex 2 and Vertex 3 partition and shuffle the data on the same key to Vertex 4, which performs the join. We can see that the processing in Tez is very similar to MapReduce, except that we don't need to store the intermediate files in HDFS. All the data movement is necessary, since we need to shuffle to perform the group and the join. So this DAG is close to optimal for processing the Pig script.
  • #13 Pig uses a lot of Tez features and customizations. We need to customize the vertices, we need to customize the inputs and outputs between vertices, and we need to use vertex managers to tune the parallelism of vertices. Here I will give you some examples. Vertices need to be customized, obviously: a vertex is where Pig runs its data pipeline, so we need to customize it to process data properly. We need to customize edges as well. Unlike MR, which blindly partitions, shuffles and sorts data between map and reduce, we can do more flexible customization in Tez. In many cases sorting is not necessary; for example, union only needs to combine two inputs without sorting, and the sample file in order by and the right table in replicated join need to be broadcast without sorting. In MR, we only do scatter-gather between map and reduce. If we want to do a broadcast, such as distributing a scalar, we have to use a hack. The most common hack is to put the file on HDFS and have every task read it from HDFS. This is very problematic, since with thousands of reads from HDFS we can easily take the namenode down. A better hack is to use the distributed cache; while this is much better than the HDFS approach, we still need to hit HDFS multiple times, since every node manager needs to read from HDFS to localize the distributed cache. In Tez, we can use a standard broadcast edge to do this: we read from HDFS only once and broadcast the file over the network. Another edge type is the 1-1 edge, where one task processes the output of one particular upstream task and only that. Pig uses the 1-1 edge in order by and skewed join. In that case, Tez will schedule the two tasks on the same node manager, completely avoiding network traffic. Besides vertices and edges, we can further customize DAG behavior through a plugin called a VertexManager. As the DAG progresses, the VertexManager receives notifications about the DAG and adjusts vertices and edges accordingly. Currently Pig uses VertexManagers to adjust vertex parallelism dynamically.
  • #14 Every MapReduce job needs an AM. The job of the AM is to launch the map tasks and reduce tasks. After the MapReduce job completes, we kill the AM; the next MapReduce job will launch another AM. A Pig script usually contains dozens of MR jobs, so that is a significant delay and waste of resources, since the AM is only doing admin work, not the real work we need MR to do. This often leads to resource congestion. In one extreme case, we saw hundreds of AMs running on a cluster but no real work could be done, since there were no containers available to do the real work, so nothing could make progress. Tez solves this problem completely. A Tez DAG, which usually represents the workload of dozens of MR jobs, requires only one AM. Further, even when the DAG finishes, Tez does not kill the AM; instead, the same AM can be reused to launch another DAG, which is called session reuse. This benefits Pig a lot. Typically we compile a Pig script into a single DAG, so only one AM is required. There are a couple of cases where Pig needs to submit multiple DAGs, such as multiple exec statements, multiple DAGs launched from a single Grunt session, or multiple DAGs generated by a loop inside a Python script. For those, Pig maintains a session pool and submits the DAG to an existing AM whenever possible.
  • #15 Container reuse is even more useful. The container is where Pig runs its data pipelines. In MR, once a task finishes, we kill the container JVM. A large MapReduce job may consist of several thousand tasks, which means we kill and restart JVMs thousands of times. Further, in some vertices we need to spend time initializing resources; for example, in replicated join we need to load the right-side relation into memory, and we would need to do that thousands of times as well. In a congested cluster, requesting a new container means waiting in line and competing for resources with other jobs, and we would need to do that thousands of times too. That is a significant processing delay. In Tez, we don't kill the container immediately after the task finishes; instead, we try to reuse the container whenever possible. The first task can cache any in-memory object into a key-value store called the ObjectCache, and the next task retrieves the object by name from the ObjectCache. Pig uses the ObjectCache to store the right-table relation in replicated join, and the sample file in order by and skewed join. With container reuse and object caching, we minimize the cost of killing and restarting JVMs, initializing resources, and competing for resources in a busy cluster. We see significant speedups for replicated join, and order-of-magnitude speedups for smaller jobs. One thing to note, however: custom Pig LoadFunc/StoreFunc/UDFs might need to be hardened. Since Tez reuses the same JVM, static variables need to be reinitialized when a vertex starts, and some memory leaks that are not obvious in MR may cause issues in Tez.
  • #16 Another optimization is the vertex group. Look at the example script: if we run it on MapReduce, we need two MapReduce jobs to process it. The first MapReduce job processes the two group-bys simultaneously, thanks to multiquery optimization; otherwise we would need three MapReduce jobs. The second MapReduce job only does the union; its work is very simple, just combining the two outputs. In Tez, we introduce a concept called the vertex group. It is a virtual vertex which moves the outputs of the two group vertices into the same folder, thus avoiding a real Tez task.
  • #17 In Pig, users can set parallelism manually with the PARALLEL clause or by defining a global default_parallel in the Pig script. However, setting parallelism manually is hard; users usually have no idea how to set it properly. In MR we use automatic parallelism if the user leaves parallelism blank, and we continue to do that in Tez, and we do it better. There are multiple layers of parallelism adjustment. At compile time, we estimate the parallelism of each vertex based on the DAG input and the data pipeline inside each vertex. This is very rough, however: unlike Hive, which can collect data statistics and save them in the metastore, Pig doesn't have statistics about the input data. As the DAG progresses, we may find that we originally underestimated or overestimated the parallelism. The VertexManager provides an opportunity to monitor the DAG's progress and adjust the parallelism dynamically.
  • #18 When a vertex is about to start, the VertexManager estimates its input data and adjusts the parallelism of the vertex. This is called dynamic parallelism. The most common VertexManager is already implemented in Tez: the ShuffleVertexManager, which handles the typical scatter-gather edge. There are other VertexManagers available in Tez, and of course you can implement a custom VertexManager; Pig did that for order by and skewed join.
  • #19 Let's take a deeper look at the ShuffleVertexManager. It is widely used in Pig for most operations that need a shuffle, such as group by and hash join. In this scenario, we initially set the parallelism of the join vertex to 4. When the JOIN vertex is about to start, we find that the actual data coming out of A and B is smaller than estimated, and according to our policy we only need parallelism 2. The ShuffleVertexManager gets this notification and decides to change the parallelism, so we eliminate tasks 3 and 4. Notice that vertex A and vertex B have already started, so the output data is already partitioned into 4 partitions. Tez does something very sophisticated: it reroutes the partitions originally going to tasks 3 and 4 to tasks 1 and 2. Thus the parallelism of the JOIN vertex is changed to 2 at runtime. Notice how the ShuffleVertexManager estimates the input data size of the JOIN vertex: it waits for a certain percentage of the input tasks to finish, then estimates the input size based on the finished tasks. If the input tasks are highly skewed, this is problematic. Also, the ShuffleVertexManager only decreases parallelism, it does not increase it. Decreasing the parallelism only requires rerouting the eliminated partitions to existing tasks; increasing the parallelism, on the other hand, would require repartitioning the existing partitions, which is harder to do.
  • #20 Order by, on the other hand, estimates the parallelism based on the samples collected. In order by, we first sample the input data to decide how many tasks to use for sorting, and the key range for every task. This number is sent to the VertexManager of the sorting vertex. Since the upstream vertex has not started yet, which means the input data has not been partitioned, we can freely adjust the parallelism of the sorting vertex, either decreasing or increasing it. The upstream vertex will then partition the data with the right parallelism. And because the data size is estimated from a true random sample of the input data, it is much more accurate.
  • #21 When we discussed the ShuffleVertexManager, we already covered its limitations: we can only decrease parallelism, not increase it, and there is overhead involved in rerouting the partitions. So it is better to get things right before the upstream vertex starts. This is why we need another layer of parallelism adjustment, which we call Pig grace parallelism. The idea is that as the DAG progresses, we adjust the parallelism of downstream vertices using the better estimates. Then even when the ShuffleVertexManager cannot get everything right, it will not be off by too much.