Yahoo! Hadoop User Group - May 2010 Meetup - What's new with Pig? Alan Gates, Yahoo!
  • Brief description of combiner and algebraic. Only used if all UDFs in a foreach can use it.

Presentation Transcript

  • Pig 0.6 and 0.7
    Alan Gates
    What’s New With Pig
  • Accumulator
    A = load 'clicks';
    B = group A by user;
    C = foreach B {
        C1 = order A by timestamp;
        generate user, sessionize(C1);
    }

    Many aggregate operations cannot use the combiner, but do not need all records for a single key in memory together
    New in 0.6: an Accumulator interface that UDFs can implement
    Pig calls accumulate multiple times, each time with a partial list of the key's tuples; when the key changes, it calls getValue
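
The calling pattern described on this slide can be imitated in a few lines. The following is only a sketch in Python, not Pig's API (Pig UDFs of this era are written in Java). The class `SessionCount`, the driver `run_accumulator`, the 1800-second session gap, and the sample data are all hypothetical names and values chosen for illustration. It shows a UDF that receives a key's tuples in batches and keeps only a small running state.

```python
from itertools import groupby, islice


class SessionCount:
    """Hypothetical accumulator-style UDF: counts sessions per user.

    A new session starts when the gap between consecutive timestamps
    exceeds `gap` seconds. Only the last timestamp and a counter are
    kept, so the full bag for a key never has to be held in memory.
    """

    def __init__(self, gap=1800):
        self.gap = gap
        self.cleanup()

    def accumulate(self, batch):
        # Called repeatedly with a partial list of (user, timestamp) tuples.
        for _user, ts in batch:
            if self.last_ts is None or ts - self.last_ts > self.gap:
                self.sessions += 1
            self.last_ts = ts

    def get_value(self):
        # Called once, after the last batch for the current key.
        return self.sessions

    def cleanup(self):
        # Reset state before the next key.
        self.last_ts = None
        self.sessions = 0


def run_accumulator(udf, sorted_rows, batch_size=2):
    """Imitates the driver; rows must be sorted by (user, timestamp)."""
    results = {}
    for user, rows in groupby(sorted_rows, key=lambda r: r[0]):
        rows = iter(rows)
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            udf.accumulate(batch)
        # The key is about to change: collect the result and reset.
        results[user] = udf.get_value()
        udf.cleanup()
    return results


clicks = [("alice", 0), ("alice", 100), ("alice", 5000), ("bob", 50)]
print(run_accumulator(SessionCount(), clicks))
```

With a batch size of 2, the three tuples for `alice` arrive in two `accumulate` calls, yet the UDF never sees the whole bag at once.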
  • Also in 0.6
    UDFContext: lets UDFs pass information from the frontend to the backend and access the JobConf
    A lot of work on the memory manager to reduce the number of GC overhead and out-of-heap errors
  • New Load and Store Interfaces
    0.6 and before
    Want to write a LoadFunc that works on files and uses standard splits? Easy
    Want to write a LoadFunc that works on something other than files or uses non-standard splits? Hard; you have to write a Slicer (which mostly duplicates Hadoop's InputFormat)
    Want to write a StoreFunc that works on something other than files? Sorry
    0.7
    LoadFunc now sits atop InputFormat, so if you have an InputFormat for your data, writing a LoadFunc is easy
    StoreFunc now sits atop OutputFormat, …
    Not backward compatible; custom LoadFuncs and StoreFuncs will need to be rewritten
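
The layering on the 0.7 side of this slide can be pictured with a toy model. This is a sketch in Python, not Pig or Hadoop code. `LineInputFormat`, `TabLoader`, and `load` are hypothetical stand-ins, and the method names only loosely echo the kind of calls a load function makes. The point is the division of labour: the input format owns splitting and record reading, and the loader only turns records into tuples.

```python
class LineInputFormat:
    """Hypothetical stand-in for an InputFormat: splits data, reads records."""

    def __init__(self, lines_per_split=2):
        self.lines_per_split = lines_per_split

    def get_splits(self, data):
        # Cut the input into fixed-size chunks, one per "map task".
        n = self.lines_per_split
        return [data[i:i + n] for i in range(0, len(data), n)]

    def create_record_reader(self, split):
        # Yields (key, value) pairs, here (line index within split, line).
        return iter(enumerate(split))


class TabLoader:
    """Hypothetical stand-in for a load function layered on an input format."""

    def get_input_format(self):
        return LineInputFormat()

    def prepare_to_read(self, reader):
        self.reader = reader

    def get_next(self):
        record = next(self.reader, None)
        if record is None:
            return None  # end of this split
        _key, line = record
        return tuple(line.split("\t"))


def load(loader, data):
    """Drives the loader over every split, as the framework would."""
    fmt = loader.get_input_format()
    tuples = []
    for split in fmt.get_splits(data):
        loader.prepare_to_read(fmt.create_record_reader(split))
        while True:
            t = loader.get_next()
            if t is None:
                break
            tuples.append(t)
    return tuples


print(load(TabLoader(), ["a\t1", "b\t2", "c\t3"]))
```

In this model, supporting a new data source means supplying a different input format object; the loader's tuple conversion is the only part specific to the load function.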
  • Also in 0.7
    Moved local mode to Hadoop's LocalJobRunner; this means the debugging environment is much closer to the runtime environment
    More aggressive use of Hadoop distributed cache for features such as replicated join and order by
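
As background on why a replicated join benefits from the distributed cache: the small relation has to reach every map task, where it is held in memory and probed while the large relation streams past. The sketch below, in Python, illustrates that general map-side hash join idea under stated assumptions. It is not Pig's implementation, and `build_lookup`, `map_side_join`, and the sample rows are hypothetical.

```python
from collections import defaultdict


def build_lookup(small_rows, key_index=0):
    """Loads the small (replicated) relation into an in-memory hash table."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key_index]].append(row)
    return table


def map_side_join(big_split, lookup, key_index=0):
    """Streams one split of the large relation and probes the hash table."""
    for row in big_split:
        for match in lookup.get(row[key_index], ()):
            yield row + match


# Hypothetical data: `users` is the small relation copied to every task.
users = [("alice", "US"), ("bob", "DE")]
clicks_split = [("alice", "/home"), ("carol", "/x"), ("bob", "/cart")]

joined = list(map_side_join(clicks_split, build_lookup(users)))
print(joined)
```

No shuffle or reduce phase is needed in this scheme, which is why it only pays off when one side is small enough to fit in each task's memory.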
  • What We Are Working On Now
    Runtime statistics – track what features your script used, how many records it processed, etc. Results stored in Pig logs and job history files
    Adding UDFs in scripting languages (Python initially) - PIG-928
    Allow users to set a custom partitioner in some cases - PIG-282
    Make Pig available in Maven repositories - PIG-1334
    Label Interfaces for audience and stability - PIG-1311
    Part of Hadoop's compatibility plan; see the following blog post: http://bit.ly/9yRDlH
  • Questions