
Yahoo! Hadoop User Group - May 2010 Meetup - What's new with Pig? Alan Gates, Yahoo!



Published in: Technology, Business

  1. Pig 0.6 and 0.7
     Alan Gates
     What’s New With Pig
  2. Accumulator
         A = load 'clicks';
         B = group A by user;
         C = foreach B {
             C1 = order A by timestamp;
             -- after the group, the user key is exposed as "group"
             generate group, sessionize(C1);
         }
         …
     • Many aggregate operations cannot use the combiner, but do not need all records for a single key in memory at once.
     • New in 0.6: an Accumulator interface that UDFs can implement.
     • Pig calls accumulate multiple times, each time with a partial list of tuples for the current key; when the key changes, it calls getValue.
  3. Also in 0.6
     • UDFContext, which allows UDFs to pass information from the frontend to the backend and to access the JobConf.
     • A lot of work on the memory manager to reduce the number of GC overhead and out-of-heap errors.
  4. New Load and Store Interfaces
     0.6 and before
     • Want to write a LoadFunc that works on files and uses standard splits? Easy.
     • Want to write a LoadFunc that works on something other than files or uses non-standard splits? Hard; you have to write a Slicer (which mostly duplicates Hadoop’s InputFormat).
     • Want to write a StoreFunc that works on something other than files? Sorry.
     0.7
     • LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy.
     • StoreFunc now sits atop OutputFormat, …
     • Not backward compatible; custom LoadFuncs and StoreFuncs will need to be rewritten.
  5. Also in 0.7
     • Moved local mode to Hadoop’s LocalJobRunner, which means the debugging environment is much closer to the runtime environment.
     • More aggressive use of the Hadoop distributed cache for features such as replicated join and order by.
  6. What We Are Working On Now
     • Runtime statistics: track what features your script used, how many records it processed, etc. Results are stored in Pig logs and job history files.
     • Adding UDFs in scripting languages (Python initially) - PIG-928
     • Allow users to set a custom partitioner in some cases - PIG-282
     • Make Pig available in Maven repositories - PIG-1334
     • Label interfaces for audience and stability - PIG-1311
       Part of Hadoop’s compatibility plan, see the following blog post
  7. Questions
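
The calling pattern described on the Accumulator slide may be easier to follow in code. The following is a minimal sketch in Python, not Pig's actual Java interface: the class SessionCounter, the driver run_accumulator, the 1800-second session gap, and the sample clicks are all hypothetical, chosen only to illustrate how a UDF might receive partial batches for one key and hand back its result when the key changes.

```python
from itertools import groupby


class SessionCounter:
    """Hypothetical accumulator-style UDF: counts sessions for one user."""

    def __init__(self, gap=1800):
        self.gap = gap  # assumed session timeout in seconds
        self.cleanup()

    def accumulate(self, timestamps):
        # Receives a partial, time-ordered batch for the current key.
        for ts in timestamps:
            if self.last is None or ts - self.last > self.gap:
                self.sessions += 1
            self.last = ts

    def get_value(self):
        # Called once the key changes: return the result for that key.
        return self.sessions

    def cleanup(self):
        # Reset state so the same instance can serve the next key.
        self.sessions = 0
        self.last = None


def run_accumulator(records, udf, batch_size=2):
    """Feed (user, timestamp) records, sorted by user, to the UDF in batches."""
    results = {}
    for user, group in groupby(records, key=lambda r: r[0]):
        batch = []
        for _, ts in group:
            batch.append(ts)
            if len(batch) == batch_size:
                udf.accumulate(batch)
                batch = []
        if batch:
            udf.accumulate(batch)
        results[user] = udf.get_value()
        udf.cleanup()
    return results


clicks = [("alice", 0), ("alice", 100), ("alice", 5000), ("bob", 10), ("bob", 20)]
result = run_accumulator(clicks, SessionCounter(gap=1800))
```

Because the UDF keeps only a running count and the last timestamp, the result does not depend on the batch size, which is the property that lets the full bag for a key stay out of memory.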