Pig on Tez
PRESENTED BY

Cheolsoo Park, Netflix
Rohini Palaniswamy, Yahoo!

The Apache Software Foundation

Apache Pig on Tez Team

Name                 Role                                          Company
Alex Bain            Apache Pig Contributor                        LinkedIn
Cheolsoo Park        VP, Apache Pig                                Netflix
Daniel Dai           Apache Pig PMC                                Hortonworks
Mark Wagner          Apache Pig Committer                          LinkedIn
Olga Natkovich       Apache Pig PMC, Pig on Tez Project Manager    Yahoo!
Rohini Palaniswamy   Apache Pig PMC                                Yahoo!

Agenda
• Overview
• Pig and Hive
• Pig on Tez
  - Why Tez?
  - Benefits of Tez
  - Design
  - Operator DAGs
  - Performance
  - Known Issues
  - Where are we?
  - What next?

Pig Overview
• Apache top-level project for ETL on Hadoop.
• Pig Latin: a procedural scripting language whose sequence of data processing steps is translated into MapReduce jobs.
• Easy to write, read, and reuse, and very extensible.
• Feature parity with SQL (FILTER BY, CROSS, JOIN (OUTER, INNER), ORDER BY, LIMIT, RANK, ROLLUP, CUBE), custom Loader and Storer, user-defined functions (Java and non-Java), nested FOREACH, streaming, macros, and much more.

    PAGEVIEWS = LOAD '/data/pageviews' AS (user, url);
    GRP = GROUP PAGEVIEWS BY user;
    -- After GROUP, url is reached through the PAGEVIEWS bag.
    CNT = FOREACH GRP GENERATE group, COUNT(PAGEVIEWS.url) AS numvisits;
    STORE CNT INTO '/data/visited' USING PigStorage(',');

Pig and Hive

Language
  Pig:  Pig Latin (procedural)
  Hive: SQL (declarative)
Features
  Pig:  Feature rich. New operators and constructs are easy to add, e.g. nested FOREACH, switch case, macros, scalars.
  Hive: Limited to SQL operators.
Developer code
  Pig:  Load/StoreFunc, Algebraic and Accumulator Java UDFs, non-Java UDFs (Jython, Python, JavaScript, Groovy, JRuby), custom partitioners.
  Hive: StorageHandler, Java UDFs.
Complex processing
  Pig:  Well suited. Multi-query works well with Pig scripts of 1000s of lines.
  Hive: Not a good fit.
Server
  Pig:  Client only. Can work with the Hive Metastore using HCatLoader/HCatStorer.
  Hive: Requires a Metastore server, and data has to be registered in it. HiveServer2 for JDBC.

Pig and Hive - Continued

Tez as execution engine
  Pig:  Planned for 0.14.
  Hive: Planned for Hive 0.13.
ORCFile support
  Pig:  Patch available. Currently through HCatLoader.
  Hive: From Hive 0.12 onwards. Huge performance gains.
Vectorization
  Pig:  No. Maybe in the future.
  Hive: Yes. Huge performance gains.
Transactions
  Pig:  No.
  Hive: Yes. In the works.
Cost-based optimizer
  Pig:  No.
  Hive: Yes. In the works.
JDBC support, integration with BI tools
  Pig:  No.
  Hive: Yes. HiveServer2 with MicroStrategy/Tableau.
Area of application
  Pig:  Standard pipeline processing language.
  Hive: Interactive analytics / reporting platform.

Why Tez?
• Built on top of YARN
  - Multi-tenancy (queues, capacity management)
  - Resource allocation
• DAG execution framework
  - A more natural fit for Pig and Hive than MR, as their execution plans are DAGs.
  - Better than running a DAG of MR jobs that pass data between jobs using HDFS as the intermediate store.
• Different types of edges
  - ONE_ONE, BROADCAST, SHUFFLE
• Flexible Input-Processor-Output runtime model
  - Custom vertex processors, e.g. Map Processor, Reduce Processor, Pig Processor
  - Custom inputs, e.g. MRFileInput (input to map), ShuffledMergedInput (input to reduce)
  - Custom outputs, e.g. OnFileSortedOutput (output of map), MRFileOutput (output of reduce)
• Multiple inputs and outputs
• Highly extensible
• Security
• Support from the Tez community and the Hive community

Why Tez? – As an end user
• Better performance
• Reduced resource usage (containers/memory/CPU)
• Reduced network I/O
• Reduced Namenode and Datanode load

Benefits of Tez

Feature: No intermediate data storage
• Less pressure on the Namenode
  - Fewer calls for listing and getting block locations
  - Smaller namespace usage
  - Cuts down on GC
• Less pressure on the Datanodes
  - Cuts down on network IO for both writing and reading
  - Saves space, as intermediate data is not stored in 3 replicas
• Eliminates the extra step of map tasks reading from HDFS in every intermediate job in the DAG
  - Saves on capacity by eliminating the need for map task containers

Feature: Single AM for whole DAG
• Saves on capacity. A 5-stage MR job would launch 5 AM containers.
• Eliminates the queue and resource contention faced in MR by jobs that start only after the previous job in the DAG completes.
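
The capacity argument on this slide can be made concrete with a toy count. This is a minimal sketch, not a model of YARN or Tez scheduling: it assumes a linear pipeline in which every MR stage is its own job, and the function names are hypothetical.

```python
def mr_overheads(stages: int) -> dict:
    """Toy count for a linear pipeline run as a chain of MR jobs.

    Assumption: each stage is a separate MR job, so each launches its own
    ApplicationMaster (AM), and every job except the last writes its output
    to HDFS for the next job to read.
    """
    return {"am_containers": stages, "intermediate_hdfs_writes": stages - 1}


def tez_overheads(stages: int) -> dict:
    """Toy count for the same pipeline run as a single Tez DAG.

    Assumption: one AM serves the whole DAG and no intermediate data is
    stored in HDFS between stages. The stage count does not change either
    number, which is the point of the comparison.
    """
    return {"am_containers": 1, "intermediate_hdfs_writes": 0}


if __name__ == "__main__":
    print(mr_overheads(5))   # the 5-stage example from the slide
    print(tez_overheads(5))
```

Under these assumptions the MR cost grows with the number of stages while the Tez cost stays constant.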

Benefits of Tez - Continued

Feature: Container reuse
• Reduced launch overhead
  - Container request and release overhead
  - Resource localization overhead
  - JVM launch time overhead
• Reduced network IO
  - Reduce tasks can be launched on the same node as map tasks
  - 1-1 edge tasks can be launched on the same node

Feature: Vertex caching
• In-memory structures, such as small tables used for joins, can be cached in the JVM and reused by the next task when the container is reused. Provides a significant performance speedup.

Feature: Custom inputs and outputs
• Using unsorted input and output where possible saves a lot of CPU and increases performance.

Feature: Dynamic reducer estimation
• Saves on capacity. The number of reducers can be based on data size instead of being fixed.
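
Sizing reducers by data volume, as in the dynamic reducer estimation row, can be sketched as follows. This is an illustrative calculation only, not Pig's or Tez's estimator; the function and parameter names are hypothetical.

```python
def estimate_reducers(input_bytes: int, bytes_per_reducer: int, max_reducers: int) -> int:
    """Pick a reducer count proportional to the data size.

    All three parameters are hypothetical knobs for illustration:
    input_bytes       - size of the data reaching the reduce stage
    bytes_per_reducer - how much data one reducer should handle
    max_reducers      - upper bound on parallelism
    """
    if input_bytes <= 0:
        return 1
    # Integer ceiling division, so a partial chunk still gets a reducer.
    needed = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, needed))


if __name__ == "__main__":
    GIB = 1024 ** 3
    for size in (1, 10 * GIB, 5000 * GIB):
        print(size, estimate_reducers(size, GIB, 999))
```

A fixed reducer count would over-provision the tiny input and under-provision the large one; scaling with size avoids both.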

Pig on Tez - Design
[Diagram: compilation flow]
Logical Plan -> LogToPhyTranslationVisitor -> Physical Plan
Physical Plan -> TezCompiler -> Tez Plan -> Tez Execution Engine
Physical Plan -> MRCompiler -> MR Plan -> MR Execution Engine

Pig on Tez – Join

    l = LOAD 'left' AS (x, y);
    r = LOAD 'right' AS (x, z);
    j = JOIN l BY x, r BY x;

[Figure: two plans for the join, each reading left and right splits. In one, a single "Load L and R" stage feeds Join; in the other, separate "Load L" and "Load R" vertices feed Join. Annotations: "Configuration per job" and "Configuration per input".]

Pig on Tez – Split + Group-by

    f = LOAD 'foo' AS (x, y, z);
    g1 = GROUP f BY y;
    g2 = GROUP f BY z;
    j = JOIN g1 BY group, g2 BY group;

[Figure: two plans for the script. Labels in one: Load foo, Split multiplex, de-multiplex, Group by y, Group by z, HDFS, Load g1 and Load g2, Join. Labels in the other: Load foo, Multiple outputs, Group by y, Group by z, Join. Annotation: "Reduce follows reduce".]

Pig on Tez – Order-by

    f = LOAD 'foo' AS (x, y);
    o = ORDER f BY x;

[Figure: two plans for the order-by. Labels in one: Load & Sample, Aggregate, HDFS, "Stage sample map on distributed cache", Partition, Sort. Labels in the other: Sample, Aggregate, "Pass through input via 1-1 edge", "Broadcast sample map", Partition, Sort.]
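
The Sample, Aggregate, Partition, Sort stages named on this slide follow the general pattern of a sample-based range-partitioned sort. The sketch below illustrates that pattern in a single process; it is not Pig's implementation, and all function names and defaults are hypothetical.

```python
import bisect
import random


def split_points(sample_keys, num_partitions):
    """Aggregate sampled keys into num_partitions - 1 sorted split points."""
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    return [ordered[i * len(ordered) // num_partitions]
            for i in range(1, num_partitions)]


def partition(records, boundaries, key):
    """Route each record to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect.bisect_right(boundaries, key(rec))].append(rec)
    return parts


def order_by(records, key, num_partitions=4, sample_size=100, seed=0):
    """Sample, derive split points, range-partition, then sort each partition."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    boundaries = split_points([key(r) for r in sample], num_partitions)
    return [sorted(p, key=key) for p in partition(records, boundaries, key)]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.randint(0, 1000), i) for i in range(500)]
    parts = order_by(data, key=lambda r: r[0])
    print([len(p) for p in parts])  # partition sizes depend on the sample
```

Because every key in partition i is no larger than any key in partition i + 1, concatenating the sorted partitions in order gives a globally sorted result. The split points play the role of the "sample map" that every partitioning task needs.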

Pig on Tez – Skewed join

    l = LOAD 'left' AS (x, y);
    r = LOAD 'right' AS (x, z);
    j = JOIN l BY x, r BY x USING 'skewed';

[Figure: two plans for the skewed join. Labels in one: Load & Sample, Aggregate, HDFS, "Stage sample map on distributed cache", Partition L and Partition R, Join. Labels in the other: Sample L, Aggregate, "Pass through input via 1-1 edge", "Broadcast sample map", Partition L, Partition R, Join.]
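
The idea behind sampling one side and then partitioning both can be shown with a toy, single-process sketch. It assumes a common skew-handling scheme (spread a heavy key's left rows over several reducers and copy the matching right rows to each of them); it is not Pig's implementation, and every name and threshold here is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import count


def plan_skewed_keys(left_sample, num_reducers, max_rows_per_reducer):
    """Map each heavy key in the sample to the reducers that will share it."""
    plan = {}
    for key, n in Counter(k for k, _ in left_sample).items():
        if n > max_rows_per_reducer:
            width = min(num_reducers, -(-n // max_rows_per_reducer))
            start = hash(key) % num_reducers
            plan[key] = [(start + i) % num_reducers for i in range(width)]
    return plan


def skewed_join(left, right, num_reducers=4, max_rows_per_reducer=2):
    """Join (key, value) rows; for simplicity the 'sample' is all of left."""
    plan = plan_skewed_keys(left, num_reducers, max_rows_per_reducer)
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    round_robin = defaultdict(count)  # one counter per heavy key
    for k, v in left:
        if k in plan:  # heavy key: spread its rows over several reducers
            targets = plan[k]
            r = targets[next(round_robin[k]) % len(targets)]
        else:          # normal key: plain hash partitioning
            r = hash(k) % num_reducers
        left_parts[r].append((k, v))
    for k, v in right:  # copy right rows to every reducer holding that key
        for r in plan.get(k, [hash(k) % num_reducers]):
            right_parts[r].append((k, v))
    joined = []
    for r in range(num_reducers):  # each "reducer" joins its own partitions
        by_key = defaultdict(list)
        for k, v in right_parts[r]:
            by_key[k].append(v)
        for k, lv in left_parts[r]:
            joined.extend((k, lv, rv) for rv in by_key[k])
    return joined
```

Each left row lands on exactly one reducer and meets every matching right row there, so the output equals a plain join while no single reducer has to hold all rows of a heavy key.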

Performance numbers
[Bar chart: time in seconds (axis from 0 to 5000) for MR and Tez, with a multiplier labelled on each query:]
• Replicated Join (2.8x)
• Join + Groupby (1.5x)
• Join + Groupby + Orderby (1.5x)
• 3 way Split + Join + Groupby + Orderby (2.6x)

Factors affecting performance
• Number of stages in the DAG
  - The more stages in the DAG, the greater the advantage of Tez over MR.
• Cluster/queue capacity
  - The more congested a queue is, the greater the advantage of Tez over MR, due to container reuse.
• Size of intermediate output
  - The larger the intermediate output, the greater the advantage of Tez over MR, due to reduced HDFS usage.
• Size of data in the job
  - For smaller data and more stages, the advantage of Tez over MR is greater, because launch overhead is a larger share of total time for smaller jobs.
• Vertex caching
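
The point about launch overhead being a larger share of a small job can be illustrated with simple arithmetic. The numbers below are made up for illustration and are not measurements from the talk; the function name is hypothetical.

```python
def launch_overhead_share(stages: int, overhead_per_stage_s: float, work_s: float) -> float:
    """Fraction of total runtime spent on per-stage launch overhead.

    Assumption: total time = stages * overhead_per_stage_s + work_s,
    i.e. overhead is paid once per stage and work scales with data size.
    """
    overhead = stages * overhead_per_stage_s
    return overhead / (overhead + work_s)


if __name__ == "__main__":
    # Same 5-stage pipeline, same hypothetical 30 s overhead per stage.
    print(launch_overhead_share(5, 30, 100))    # small job: useful work is 100 s
    print(launch_overhead_share(5, 30, 3000))   # large job: useful work is 3000 s
```

With these invented figures, overhead is more than half of the small job's runtime but only a few percent of the large job's, so removing it helps the small job far more.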

Container usage

Query                                      MR     Tez    Savings   Tez with container reuse
Replicated Join                            7563   7562   1         180
Join + Groupby + Orderby                   7655   7603   52        180
Join + Groupby + Orderby                   7663   7609   54        180
3 way Split + Join + Groupby + Order by    622    563    59        180

Note: The cluster had 25 nodes with 180 containers (1.5 GB each), and Tez reused them repeatedly for tasks.
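
The Savings column is the MR count minus the Tez count, which the short script below recomputes from the table. The reuse factor is an added interpretation, assuming the Tez column counts containers without reuse; the names are hypothetical.

```python
# (query, containers under MR, containers under Tez), copied from the table.
ROWS = [
    ("Replicated Join", 7563, 7562),
    ("Join + Groupby + Orderby", 7655, 7603),
    ("Join + Groupby + Orderby", 7663, 7609),
    ("3 way Split + Join + Groupby + Order by", 622, 563),
]
REUSED_POOL = 180  # containers used by Tez with container reuse, per the note


def container_savings(rows):
    """Return (query, absolute savings, savings as a percentage of MR)."""
    return [(q, mr - tez, 100.0 * (mr - tez) / mr) for q, mr, tez in rows]


def reuse_factor(rows, pool):
    """Average number of container allocations each reused container replaces.

    Assumption: without reuse, Tez allocates the number of containers shown
    in the Tez column; with reuse, the same work runs in `pool` containers.
    """
    return [(q, tez / pool) for q, _mr, tez in rows]


if __name__ == "__main__":
    for q, saved, pct in container_savings(ROWS):
        print(f"{q}: {saved} fewer containers ({pct:.2f}% of MR)")
    for q, factor in reuse_factor(ROWS, REUSED_POOL):
        print(f"{q}: each reused container replaces about {factor:.1f} allocations")
```

The savings from the DAG alone are small next to the effect of reuse, where a fixed pool of 180 containers replaces thousands of allocations.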

Known issues
• Container reuse causes problems when there are
  - Static variables in LoadFunc, StoreFunc, UDFs
  - Memory leaks in LoadFunc, StoreFunc, UDFs
• With the whole script executed as a single DAG, AM retries can be very costly until Tez supports checkpointing and resuming.
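
Why static variables become a problem under container reuse can be shown with a small simulation. It models a reused container as one long-lived process running several tasks in turn; the classes are hypothetical stand-ins for a UDF, and a Python class attribute plays the role of a Java static field.

```python
class StaticStateUdf:
    """Hypothetical UDF that keeps its state in a class-level (static) field."""
    seen = []  # shared by every instance created in this process

    def evaluate(self, value):
        StaticStateUdf.seen.append(value)
        return len(StaticStateUdf.seen)


class InstanceStateUdf:
    """Same logic, but the state belongs to the instance."""

    def __init__(self):
        self.seen = []

    def evaluate(self, value):
        self.seen.append(value)
        return len(self.seen)


def run_task(udf_cls, values):
    """Simulate one task: a fresh UDF instance processes the task's records."""
    udf = udf_cls()
    return [udf.evaluate(v) for v in values][-1]


if __name__ == "__main__":
    # Two tasks run back to back in the same process, as with container reuse.
    print(run_task(StaticStateUdf, ["a", "b"]), run_task(StaticStateUdf, ["c", "d"]))
    print(run_task(InstanceStateUdf, ["a", "b"]), run_task(InstanceStateUdf, ["c", "d"]))
```

The second task sees the first task's leftovers only in the static version. When every task gets a fresh process, as in MR without JVM reuse, the static field starts empty each time and the bug stays hidden.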

Where are we?
• Major operators
  - Load, Store, Filter-by, Foreach
  - Split, Union
  - Group-by, Distinct, Limit
  - Order-by
  - Hash join, Replicated join, Skewed join, Merge join
• UDFs (Java and non-Java)
• Streaming
• Multi-query on and off
• Macros
• Scalars
• 95% of e2e tests pass for finished features.

What next?
• Feature parity with MR
  - Local mode
  - Port all unit and e2e tests
  - Support for remaining operators: CROSS, RANK, CUBE, ROLLUP
  - Support for native MapReduce (low priority)
• Merge tez branch with trunk
• Stability
  - Handling failures
  - Testing and tuning for large data and DAGs with > 10 stages
• Usability
  - Counters
  - Progress information
  - Log information and debuggability

What next? – Performance Improvements
• Dynamic reducer estimation
• Better memory management
• Calculate input splits in the AM and let Tez combine input splits according to pig.maxCombinedSplitSize
• Vertex grouping, to write data directly into one output directory from multiple vertices in the case of union
• Unsorted shuffle in Union, Order-by, Skewed Join, etc. to improve performance
• Shared edges for multiple outputs when the same data has to go to multiple downstream vertices, e.g. multi-query off, skewed join sample aggregation output
• HDFS caching

Contributors Welcome

Pig User Group Meetup at LinkedIn
14th March 2014

Questions?

February 2014 HUG : Pig On Tez


Editor's Notes

  • #17 PigMix queries
  • #20 - Either turn off container reuse or fix code
  • #21 - Both Algebraic and Accumulator UDFs
  • #22 - Both Algebraic and Accumulator UDFs