Yahoo! Hadoop User Group - May 2010 Meetup - What's new with Pig? Alan Gates, Yahoo!

  • Brief description of combiner and algebraic. The combiner is only used if all UDFs in a foreach can use it

    1. Pig 0.6 and 0.7
       Alan Gates
       What’s New With Pig
    2. Accumulator
           A = load 'clicks';
           B = group A by user;
           C = foreach B {
               C1 = order A by timestamp;
               generate user, sessionize(C1);
           }
           …
         • Many aggregate operations cannot use the combiner but do not need all records for a single key together
         • New in 0.6: the Accumulator interface, which can be implemented by UDFs
         • Pig calls accumulate multiple times with a partial list of tuples, then calls getValue when the key changes
    3. Also in 0.6
         • UDFContext allows UDFs to pass info from frontend to backend and to access JobConf
         • A lot of work on the memory manager to reduce the number of GC overhead and out-of-heap errors
    4. New Load and Store Interfaces
       0.6 and before
         • Want to write a LoadFunc that works on files and uses standard splits? Easy
         • Want to write a LoadFunc that works on something other than files or uses non-standard splits? Hard; have to write a Slicer (which mostly duplicates Hadoop’s InputFormat)
         • Want to write a StoreFunc that works on something other than files? Sorry
       0.7
         • LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy
         • StoreFunc now sits atop OutputFormat, …
         • Not backward compatible, will require rewrite of custom Load and StoreFuncs
    5. Also in 0.7
         • Moved local mode to Hadoop’s LocalJobRunner; this means the debugging environment is much closer to the runtime environment
         • More aggressive use of Hadoop distributed cache for features such as replicated join and order by
    6. What We Are Working On Now
         • Runtime statistics: track what features your script used, how many records it processed, etc. Results stored in Pig logs and job history files
         • Adding UDFs in scripting languages (Python initially) - PIG-928
         • Allow users to set a custom partitioner in some cases - PIG-282
         • Make Pig available in Maven repositories - PIG-1334
         • Label interfaces for audience and stability - PIG-1311
           Part of Hadoop’s compatibility plan, see the following blog post: http://bit.ly/9yRDlH
    7. Questions
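
The accumulate/getValue call pattern from the Accumulator slide can be sketched roughly as follows. This is a plain-Python stand-in, not Pig's Java API: the class `SessionCount`, the driver `run_accumulator`, the 1800-second session gap, and the sample data are all hypothetical, and the UDF counts sessions rather than returning sessionized records as the slide's `sessionize` presumably does. The three method names mirror the accumulate, getValue, and cleanup calls of Pig's Accumulator interface.

```python
from itertools import groupby
from operator import itemgetter


class SessionCount:
    """Hypothetical accumulator-style UDF: counts sessions per user.

    A new session starts when the gap between consecutive clicks
    exceeds `gap` seconds. Only the last timestamp and a counter are
    kept, so the full bag for a key never has to be held at once.
    """

    def __init__(self, gap=1800):
        self.gap = gap
        self.cleanup()

    def accumulate(self, batch):
        # Called repeatedly with a partial, timestamp-ordered list of tuples.
        for _user, ts in batch:
            if self.last_ts is None or ts - self.last_ts > self.gap:
                self.sessions += 1
            self.last_ts = ts

    def get_value(self):
        # Called once every batch for the current key has been seen.
        return self.sessions

    def cleanup(self):
        # Reset state before the next key arrives.
        self.last_ts = None
        self.sessions = 0


def run_accumulator(udf, records, batch_size=2):
    """Stand-in for the engine: feed records (sorted by key) in batches."""
    results = {}
    for user, group in groupby(records, key=itemgetter(0)):
        rows = list(group)  # a real engine would stream these instead
        for i in range(0, len(rows), batch_size):
            udf.accumulate(rows[i:i + batch_size])
        results[user] = udf.get_value()  # the key changed: collect the result
        udf.cleanup()
    return results


# Sample clicks as (user, timestamp) pairs, sorted by user and then timestamp.
clicks = [
    ("alice", 0), ("alice", 100), ("alice", 5000),
    ("bob", 10), ("bob", 20),
]
session_counts = run_accumulator(SessionCount(), clicks)
```

The point of the pattern is that the UDF's state carries across batches, so the result does not depend on how the engine chooses to split a key's records.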
