Poster Hadoop summit 2011: pig embedding in scripting languages

1,503 views

Published on

Poster presented at Hadoop summit 2011

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,503
On SlideShare
0
From Embeds
0
Number of Embeds
31
Actions
Shares
0
Downloads
28
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Poster Hadoop summit 2011: pig embedding in scripting languages

  1. 1. PIG Scripting<br />Making Pig Turing-complete through embedding in a scripting language<br />Julien Le Dem - Yahoo<br />Overview<br />Pig execution model<br /> - Map/Reduce: a solution to process big data. <br /> - Pig: makes it easier to manipulate data on the grid. <br /> - Pig scripting: makes Pig easier with iterative algorithms and User Defined Functions.<br />Jiras: PIG-1479, PIG-1794 (in 0.9), PIG-928 (in 0.8)<br />Example: Transitive closure<br /> - Iterative process: requires a loop and a termination condition<br /> - Requires multiple join/group by: typical Pig usage<br /> - Requires User Defined Functions<br />Solution 2: using Pig scripting<br />Solution 1: plain Pig<br />1 file required:<br />7 files required:<br />UDFs take the elements of the tuple as parameters, not a tuple.<br />The output schema is specified using a decorator (it can be a function if you need to manipulate the input schema)<br />Modification flow:<br />Modification flow:<br />UDFs use standard Python constructs (tuple/list/dictionary) automatically converted to Pig.<br />Python functions are automatically available as UDFs<br />Embedded Pig calls in Python<br />Python variables can be used in the pig scripts $n, $i, ...<br />Unfortunately this solution does not fit here.<br />It requires 7 artifacts:<br /><ul><li> 3 Java UDFs: 3 classes must be compiled and packaged in a jar. The average UDF size is 50 to 100 lines of code.
  2. 2. 3 Pig scripts in 3 separate files: init, main loop content, finalization. The average Pig script size is 5 to 10 lines.
  3. 3. Main program: executes and coordinates the Pig scripts. It contains about 50 lines of code.</li></ul>This solution does fit on the poster.<br />It requires 1 artifact, containing:<br /><ul><li> 3 UDFs provided as functions: The average UDF size is 8 to 12 lines of code.
  4. 4. 3 Pig queries: init, main loop content, finalization. Each Pig query adds an average 5 to 10 lines.
  5. 5. Main function: executes and coordinates the Pig scripts. It adds about 10 lines to the script.</li></ul>Access to the output of pig scripts<br />

×