Developing Pig on Tez (ApacheCon 2014)

  1. Developing Pig on Tez
     Mark Wagner, Committer, Apache Pig, LinkedIn
     Cheolsoo Park, VP, Apache Pig, Netflix
  2. What is Pig
     ● Apache project since 2008
     ● Higher-level language for Hadoop that provides a dataflow language with a MapReduce-based execution engine

         A = LOAD 'input.txt';
         B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word;
         C = GROUP B BY word;
         D = FOREACH C GENERATE group, COUNT(B);
         STORE D INTO './output.txt';
  3. Pig Concepts
     ● LOAD
     ● STORE
     ● FOREACH ___ GENERATE ___
     ● FILTER ___ BY ___
  4. Pig Concepts
     GROUP ___ BY ___
     ● 'Blocking' operator
     ● Translates to a MapReduce shuffle
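
The shuffle behind GROUP BY can be pictured with a small Python sketch. It is only an illustration of the idea, not Pig or Hadoop code: `shuffle_group_by`, `key`, and `num_reducers` are names made up here. Each record is routed to a reducer by hashing its key, and a group is only complete once every input record has been routed, which is why the operator is 'blocking'.

```python
from collections import defaultdict


def shuffle_group_by(records, key, num_reducers):
    """Route each record to a reducer by key hash and group it there.

    Returns one dict per reducer, mapping key -> list of records.
    No group is final until all records have been consumed.
    """
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for rec in records:
        k = key(rec)
        # Same key always lands on the same reducer.
        reducers[hash(k) % num_reducers][k].append(rec)
    return reducers


if __name__ == "__main__":
    rows = [(1, "a"), (2, "b"), (1, "c")]
    for i, groups in enumerate(shuffle_group_by(rows, lambda r: r[0], 2)):
        print(i, dict(groups))
```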
  5. Pig Concepts
     Joins:
     ● Hash Join
     ● Replicated Join
     ● Skewed Join
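
As a rough illustration of the replicated join idea, the Python sketch below assumes the smaller relation fits in memory, so it can be indexed once and probed while the larger relation streams past, with no shuffle. The function name and its parameters are hypothetical; this is not how Pig implements the operator internally.

```python
from collections import defaultdict


def replicated_join(large, small, key_large=0, key_small=0):
    """Join two lists of tuples by holding `small` in an in-memory index."""
    index = defaultdict(list)
    for row in small:
        index[row[key_small]].append(row)

    joined = []
    for row in large:  # the large side is streamed once
        for match in index.get(row[key_large], []):
            joined.append(row + match)
    return joined


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (1, "c")]
    right = [(1, "x"), (3, "y")]
    print(replicated_join(left, right))
```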
  6. Pig Latin

         A = LOAD 'input.txt';
         B = FOREACH A GENERATE flatten(TOKENIZE((chararray)$0)) AS word;
         C = GROUP B BY word;
         D = FOREACH C GENERATE group, COUNT(B);
         STORE D INTO './output.txt';
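
For readers new to Pig Latin, the following Python sketch mirrors what the script computes, one step per alias. It is a simplification: `word_count` is a made-up name, and splitting on whitespace only approximates TOKENIZE.

```python
from collections import defaultdict


def word_count(lines):
    """Mirror the script: tokenize each line, group by word, count each group."""
    # B: one record per word (whitespace split is a simplifying assumption).
    words = [w for line in lines for w in line.split()]

    # C: GROUP B BY word
    groups = defaultdict(list)
    for w in words:
        groups[w].append(w)

    # D: (group, COUNT(B))
    return {word: len(bag) for word, bag in groups.items()}


if __name__ == "__main__":
    print(word_count(["pig on tez", "tez on yarn"]))
```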
  7. Logical Plan
  8. Physical Plan
  9. MapReduce Plan (diagram labels: Map, Reduce)
  10. What's the problem
      ● Extra intermediate output
      ● Artificial synchronization barriers
      ● Inefficient use of resources
      ● Multiquery Optimizer
        ● Alleviates some problems
        ● Has its own
  11. Apache Tez
      ● Incubating project
      ● Express data processing as a directed acyclic graph
      ● Runs on YARN
      ● Aims for lower latency and higher throughput than MapReduce
  12. Tez Concepts
      ● Job expressed as directed acyclic graph (DAG)
      ● Processing done at vertices
      ● Data flows along edges
      Diagram labels: Mapper, Reducer, Processor (four times)
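
The vertex-and-edge model can be sketched in a few lines of Python. This is a toy illustration only and does not use the Tez API: `run_dag`, the processor names, and the edge list are all hypothetical. Each vertex runs a function over the outputs of its upstream vertices, and vertices run in an order that respects the edges.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(processors, edges):
    """Run a job expressed as a DAG.

    processors: vertex name -> function(list of upstream outputs) -> output
    edges: list of (upstream, downstream) pairs; data flows along them
    """
    preds = {name: [] for name in processors}
    for up, down in edges:
        preds[down].append(up)

    outputs = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        outputs[name] = processors[name]([outputs[p] for p in preds[name]])
    return outputs


if __name__ == "__main__":
    processors = {
        "load_a": lambda ins: [1, 2],
        "load_b": lambda ins: [3],
        "merge": lambda ins: sorted(ins[0] + ins[1]),
        "total": lambda ins: sum(ins[0]),
    }
    edges = [("load_a", "merge"), ("load_b", "merge"), ("merge", "total")]
    print(run_dag(processors, edges)["total"])
```

Unlike a fixed Map then Reduce pair, nothing here limits the graph to two stages, which is the point of the model.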
  13. Benefits & Optimizations
      ● Fewer synchronization barriers
      ● Container reuse
      ● Object caches at the vertices
      ● Dynamic parallelism estimation
      ● Custom data transfer between processors
  14. What we've done for Pig
      ● New execution engine based on Tez
      ● Physical Plan translated to Tez Plan instead of MapReduce Plan
      ● Same Physical Plan and operators
      ● Custom processors run the execution plan on Tez
  15. Along the way
      ● New pluggable execution backend
      ● Made operator set more generic
      ● Motivated Tez improvements
  16. Group By
      Diagram labels: LOAD; GROUP BY, SUM; Identity; GROUP BY; HDFS; LOAD; GROUP BY, STORE; GROUP BY, SUM

          f = LOAD 'foo' AS (x:int, y:int);
          g = GROUP f BY x;
          h = FOREACH g GENERATE group AS r, SUM(f.y) AS s;
          i = GROUP h BY s;
  17. Join
      Diagram labels: LOAD l, r; JOIN, STORE; LOAD r; JOIN, STORE; LOAD l

          l = LOAD 'left' AS (x, y);
          r = LOAD 'right' AS (x, z);
          j = JOIN l BY x, r BY x;
  18. Group By
      Diagram labels: LOAD; GROUP f BY x, GROUP f BY y; LOAD g, h; JOIN; HDFS; LOAD; JOIN; GROUP BY; GROUP BY

          f = LOAD 'foo' AS (x:int, y:int);
          g = GROUP f BY x;
          h = GROUP f BY y;
          i = JOIN g BY group, h BY group;
  19. Order By
      Diagram labels: SAMPLE; AGGREGATE; PARTITION; SORT; HDFS; LOAD, SAMPLE; SORT; PARTITION; AGGREGATE

          f = LOAD 'foo' AS (x, y);
          o = ORDER f BY x;
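
The four labels on the Order By slide (sample, aggregate, partition, sort) suggest a sample-based range partitioning scheme. The Python sketch below shows one plausible reading of those steps; it is an assumption about the general technique, not a description of Pig's actual implementation, and `sampled_order_by` and its parameters are hypothetical.

```python
import bisect
import random


def sampled_order_by(records, num_partitions, sample_size, seed=0):
    """Globally sort `records` by sampling, choosing ranges, then sorting each range."""
    rng = random.Random(seed)

    # SAMPLE: look at a subset of the input.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))

    # AGGREGATE: pick partition boundaries from the sorted sample.
    boundaries = []
    if sample:
        boundaries = [
            sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)
        ]

    # PARTITION: route each record to the range it falls into.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[bisect.bisect_right(boundaries, rec)].append(rec)

    # SORT: each partition is sorted independently; read in order,
    # the partitions form one fully sorted output.
    return [sorted(p) for p in partitions]


if __name__ == "__main__":
    print(sampled_order_by([5, 3, 9, 1, 7, 2, 8], num_partitions=3, sample_size=4))
```

The sample only affects how evenly the work is spread across partitions; the concatenated result is sorted whatever boundaries are chosen.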
  20. Performance Comparison
      Bar chart comparing MapReduce and Tez (axis from 0 to 5000; units not given):
      ● Replicated Join (2.8x)
      ● Join + Group By (1.5x)
      ● Join + Group By + Order By (1.5x)
      ● 3 way Split + Join + Group By (2.6x)
  21. How it started
      Shared interests across organizations
      • Similar data platform architecture.
      • Pig for ETL jobs
      • Hive for ad-hoc queries
  22. How it started
      Shared interests across organizations
      • Hortonworks wants Tez to succeed.
  23. Organizing team
      Community meet-ups helped
      • Twitter presented summer intern’s POC work at Tez meet-up.
      • Pig devs exchanged interests.
  24. Organizing team
      Community meet-ups helped
      • Tez team hosted tutorial sessions for Pig devs.
      • Pig team got together to brainstorm implementation design.
  25. Building trust
      Companies showed commitment to the project
      • Hortonworks: Daniel Dai
      • LinkedIn: Alex Bain, Mark Wagner
      • Netflix: Cheolsoo Park
      • Yahoo: Olga Natkovich, Rohini Palaniswamy
  26. Setting goals
      Make Pig 2x faster within 6 months
      • Hive-on-Tez showed 2x performance gain.
      • Rewriting the Pig backend within 6 months seemed reasonable.
  27. Acting as team
      Sprint
      • Monthly planning meetings
      • Twice-a-week stand-up conference calls
      Issues / discussions
      • PIG-3446 umbrella JIRA for Pig on Tez
      • Whiteboard discussions at meetings
  28. Knowledge transfer
      • Pig old timer Daniel Dai acted as mentor.
      • Everyone got to work on core functionalities.
      • Everyone became an expert on the Pig backend.
  29. Sharing credit
      • Elected as a new committer and PMC chair.
      • Gave talks at Hadoop User Group and Pig User Group meet-ups.
      • Speaking at ApacheCon and upcoming Hadoop Summit.
  30. Further collaborations
      Looking for more collaborations
      • Parquet Hive SerDe improvements.
      • Sharing experiences with SQL-on-Hadoop solutions.
  31. Mind shift
      “If we can’t hire all these good people, why don’t we use them in a collaboration?”
      • Collaboration instead of competition.
  32. Mind shift
      “Why do we reinvent the wheel?”
      • Share the same technologies while creating different services.
  33. Believe in the Apache way