Yahoo! Hadoop User Group - May 2010 Meetup - What's new with Pig? Alan Gates, Yahoo!
  • Speaker note: brief description of the combiner and the Algebraic interface. The combiner is only used if all UDFs in a foreach can use it.
  • Transcript

    • 1. Pig 0.6 and 0.7
      Alan Gates
      What’s New With Pig
    • 2. Accumulator
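      A = load 'clicks' as (user, timestamp);  -- schema assumed; the slide omits it
      B = group A by user;
      C = foreach B {
          C1 = order A by timestamp;
          generate group, sessionize(C1);  -- after a group, the key (user) is named 'group'
      }
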
      Many aggregate operations cannot use the combiner, but they also do not need all records for a single key held in memory at once
      New in 0.6: an Accumulator interface that UDFs can implement
      Pig calls accumulate multiple times, each time with a partial list of the tuples for a key, then calls getValue when the key changes
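
The calling pattern described on this slide can be sketched outside Pig. The following is a minimal Python simulation, not Pig's API: Pig's actual Accumulator interface is Java, and the class `Sessionize`, the driver `run_accumulator`, the 1800-second gap, and the batch size are all hypothetical names and values chosen for illustration. For brevity the UDF returns a session count per user rather than sessionized records.

```python
from itertools import groupby
from operator import itemgetter


class Sessionize:
    """Hypothetical accumulator-style UDF: counts sessions per user.

    A new session starts when the gap between consecutive timestamps
    exceeds `gap` seconds. Only the last timestamp and a counter are
    kept between calls, so the UDF never needs the whole bag at once.
    """

    def __init__(self, gap=1800):
        self.gap = gap
        self.cleanup()

    def accumulate(self, tuples):
        # Called repeatedly with a partial, timestamp-ordered batch
        # of tuples for the current key.
        for _user, ts in tuples:
            if self.last_ts is None or ts - self.last_ts > self.gap:
                self.sessions += 1
            self.last_ts = ts

    def get_value(self):
        # Called once every batch for the current key has been passed in.
        return self.sessions

    def cleanup(self):
        # Reset state before the next key starts.
        self.last_ts = None
        self.sessions = 0


def run_accumulator(udf, records, batch_size=2):
    """Feed the UDF batches per key, then ask for the value when the key changes."""
    results = {}
    ordered = sorted(records, key=itemgetter(0, 1))  # by user, then timestamp
    for user, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)  # kept simple here; a real engine would stream these
        for i in range(0, len(rows), batch_size):
            udf.accumulate(rows[i:i + batch_size])
        results[user] = udf.get_value()
        udf.cleanup()
    return results


clicks = [
    ("alice", 0), ("alice", 100), ("alice", 5000), ("alice", 5100),
    ("bob", 50), ("bob", 10000), ("bob", 20000),
]
session_counts = run_accumulator(Sessionize(), clicks)
```

The point of the design is that the UDF carries only a small amount of state between `accumulate` calls, which is what lets the engine hand over the records for one key in pieces.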
    • 3. Also in 0.6
      UDFContext: allows UDFs to pass information from the frontend to the backend and to access the JobConf
      Substantial work on the memory manager to reduce the number of GC overhead and out-of-heap errors
    • 4. New Load and Store Interfaces
      0.6 and before
      Want to write a LoadFunc that works on files and uses standard splits? Easy
      Want to write a LoadFunc that works on something other than files or uses non-standard splits? Hard; you have to write a Slicer (which mostly duplicates Hadoop's InputFormat)
      Want to write a StoreFunc that works on something other than files? Sorry
      0.7
      LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy
      StoreFunc now sits atop OutputFormat, …
      Not backward compatible; custom LoadFuncs and StoreFuncs will need to be rewritten
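
The layering described for 0.7 can be illustrated with a small model. This is a conceptual Python sketch, not Pig's or Hadoop's Java API: `LineInputFormat`, `TabLoadFunc`, `load`, and every method name below are hypothetical stand-ins. It only shows the division of labor, where the input format owns splitting and record reading and the load function only converts raw records into tuples.

```python
class LineInputFormat:
    """Hypothetical stand-in for an input format: owns splits and record reading."""

    def __init__(self, lines, split_size=2):
        self.lines = lines
        self.split_size = split_size

    def get_splits(self):
        # Each split is simply a slice of the input lines.
        return [self.lines[i:i + self.split_size]
                for i in range(0, len(self.lines), self.split_size)]

    def create_record_reader(self, split):
        # A record reader yields the raw records of one split.
        return iter(split)


class TabLoadFunc:
    """Hypothetical load function: only turns raw records into tuples."""

    def get_input_format(self, lines):
        return LineInputFormat(lines)

    def prepare_to_read(self, reader):
        # The engine hands over a reader for the split about to be processed.
        self.reader = reader

    def get_next(self):
        # Return the next tuple, or None once the split is exhausted.
        line = next(self.reader, None)
        return None if line is None else tuple(line.split("\t"))


def load(loader, lines):
    """Toy engine loop: one reader per split, drained through the loader."""
    fmt = loader.get_input_format(lines)
    rows = []
    for split in fmt.get_splits():
        loader.prepare_to_read(fmt.create_record_reader(split))
        row = loader.get_next()
        while row is not None:
            rows.append(row)
            row = loader.get_next()
    return rows


loaded = load(TabLoadFunc(), ["alice\t10", "bob\t20", "carol\t30"])
```

Under this split of responsibilities, supporting a new data source means supplying a different input format; the tuple-conversion code in the load function stays small.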
    • 5. Also in 0.7
      Moved local mode to Hadoop's LocalJobRunner, so the debugging environment is much closer to the runtime environment
      More aggressive use of Hadoop distributed cache for features such as replicated join and order by
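
For readers unfamiliar with replicated join, the idea is that the small relation is copied to every task and held in memory while the large relation is streamed past it. The sketch below is a plain Python illustration of that idea under those assumptions; it is not Pig's implementation, and the function `replicated_join` and the sample data are hypothetical.

```python
def replicated_join(large, small, large_key=0, small_key=0):
    """Join a streamed large relation against a small one held in memory."""
    # Build a hash table from the small relation. In a cluster, each task
    # would build this from its local copy of the small relation's file.
    table = {}
    for row in small:
        table.setdefault(row[small_key], []).append(row)

    # Stream the large relation and probe the table; no shuffle is needed.
    for row in large:
        for match in table.get(row[large_key], []):
            yield row + match


clicks = [("alice", "/a"), ("bob", "/b"), ("dave", "/c")]
users = [("alice", "US"), ("bob", "UK")]
joined = list(replicated_join(clicks, users))
```

Because every task needs its own copy of the small relation, distributing that copy efficiently matters, which is where a distributed cache helps.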
    • 6. What We Are Working On Now
      Runtime statistics: track which features your script used, how many records it processed, etc. Results are stored in Pig logs and job history files
      Adding UDFs in scripting languages (Python initially) - PIG-928
      Allow users to set a custom partitioner in some cases - PIG-282
      Make Pig available in Maven repositories - PIG-1334
      Label interfaces for audience and stability - PIG-1311
      Part of Hadoop's compatibility plan; see the following blog post: http://bit.ly/9yRDlH
    • 7. Questions