Alan F. Gates
Pig 0.8 New Features
Who am I?
  • Pig committer and PMC member
  • Architect in the grid team at Yahoo
Photo credit: Steven Guarnaccia, The Three Little Pigs
Focus of Pig 0.8
  • Usability
  • Integration
  • Performance
  • Backwards compatibility with 0.7
UDFs in Scripting Languages
  • Evaluation functions can now be written in scripting languages that compile down to the JVM
  • Reference implementation provided in Jython
  • JRuby, others, could be added with minimal code
  • JavaScript implementation in progress
  • Jython sold separately
Example Python UDF
test.py:
    @outputSchema("sqr:long")
    def square(num):
        return ((num)*(num))
test.pig:
    register 'test.py' using jython as myfuncs;
    A = load 'input' as (i:int);
    B = foreach A generate myfuncs.square(i);
    dump B;
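The UDF in test.py can be tried out under a plain Python interpreter before it is registered with Pig. The sketch below assumes that Pig's Jython script engine supplies the `outputSchema` decorator when the script is registered. The fallback decorator and the `None` check are illustrative additions, not part of the slide.

```python
# Sketch of test.py that also loads outside Pig.
try:
    outputSchema  # assumed to be injected by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        """Stand-in decorator (an assumption): it only records the schema string."""
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("sqr:long")
def square(num):
    # Assumption: a null field reaches the UDF as None, so pass it through.
    if num is None:
        return None
    return num * num
```

Calling `square(4)` in an interactive session is a quick check before running test.pig against real input.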
Better statistics
  • Statistics printed out at end of job run
  • Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage
  • Loader for reading job history files included in Piggybank
  • New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
  • Can also pass listener to track Pig jobs as they run
  • Done for Oozie so it can show users Pig statistics
Sample stats info
    Job Stats (time in seconds):
    JobId   Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
    job_0   2     1        15    3     9    27    27    27   a,b,c,d,e
    job_1   1     1        3     3     3    12    12    12   g,h
    job_2   1     1        3     3     3    12    12    12   i
    job_3   1     1        3     3     3    12    12    12   i
    Input(s):
    Successfully read 10000 records from: "studenttab10k"
    Successfully read 10000 records from: "votertab10k"
    Output(s):
    Successfully stored 6 records (150 bytes) in: "outfile"
    Counters:
    Total records written : 6
    Total bytes written : 150
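A captured copy of the job table is easy to analyze once it is parsed. The sketch below is one possible way to do that, assuming the abbreviated column layout shown on the slide (real Pig output may use longer column names and more columns). The function name `parse_job_stats` is made up for this example.

```python
def parse_job_stats(text):
    """Turn the whitespace-separated 'Job Stats' table into a list of dicts."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = {}
        for name, value in zip(header, line.split()):
            # Numeric columns become ints; JobId and Alias stay strings.
            row[name] = int(value) if value.isdigit() else value
        rows.append(row)
    return rows


# Two rows copied from the slide's sample output.
SAMPLE = """
JobId   Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
job_0   2     1        15    3     9    27    27    27   a,b,c,d,e
job_1   1     1        3     3     3    12    12    12   g,h
"""

jobs = parse_job_stats(SAMPLE)
# Find the job with the longest maximum reduce time.
slowest = max(jobs, key=lambda job: job["MxRT"])
print(slowest["JobId"], slowest["Alias"])
```

For programmatic access from Java, the PigRunner statistics object mentioned earlier is probably a better source than scraped console text.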
Invoke Static Java Functions as UDFs
  • Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs
      define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
      A = load 'encoded.txt' as (e:chararray);
      B = foreach A generate UrlDecode(e, 'UTF-8');
  • Currently only works with simple types and static functions
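For comparison, the same decoding could be written as a scripting-language UDF instead of invoking the Java method. This is only a sketch: the function name `url_decode` is hypothetical, and the `None` handling assumes null fields arrive as `None`. It relies on the standard library's `unquote_plus`, which, like `URLDecoder.decode()`, turns `+` into a space and expands `%XX` escapes.

```python
try:
    from urllib.parse import unquote_plus  # Python 3
except ImportError:
    from urllib import unquote_plus  # Python 2 layout, which Jython 2.x follows


def url_decode(encoded):
    # Assumption: a null field reaches the UDF as None, so pass it through.
    if encoded is None:
        return None
    return unquote_plus(encoded)
```

When the function already exists in Java, the invoker approach from the slide avoids writing and registering a separate script.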
Improved HBase Integration
  • Can now read records as bytes instead of auto converting to strings
  • Filters can be pushed down
  • Can store data in HBase as well as load from it
  • Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.
Casting Relations to Scalars
  • Say you want to calculate the percentage of page views per browser type (e.g. IE, Firefox)
      views = load 'views' as (url, browser);
      gv = group views all;
      numviews = foreach gv generate COUNT(views) as total;
      gb = group views by browser;
      perbrowser = foreach gb generate group, COUNT(views) / (double)numviews.total;
  • Now it is possible to cast the relation numviews to a scalar value for use in later calculations
  • Pig handles storing the results in a file and retrieving it when needed
  • Only works for single row results
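The arithmetic the script performs can be sketched in plain Python: count views per browser, then divide each count by one grand total, which plays the role of the single-row relation used as a scalar. The function name `browser_shares` and the sample tuples are invented for illustration.

```python
from collections import Counter


def browser_shares(views):
    """views: iterable of (url, browser) tuples, like the 'views' relation."""
    per_browser = Counter(browser for _url, browser in views)
    total = sum(per_browser.values())  # the single-row total used as a scalar
    # Floating-point division, so each share is a fraction rather than 0.
    return {browser: count / float(total) for browser, count in per_browser.items()}


sample = [("/a", "IE"), ("/b", "Firefox"), ("/a", "IE"), ("/c", "IE")]
shares = browser_shares(sample)
```

Here `shares` maps each browser to its fraction of all views. The total is computed once and reused for every group, which is what the scalar cast makes possible in Pig Latin.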
Integrating MapReduce Jobs
  • Sometimes you need to integrate MR and Pig jobs
      • Legacy code
      • Algorithm that’s hard to implement in Pig
      A = load 'WordcountInput.txt';
      B = mapreduce 'wordcount.jar'
          store A into 'inputDir'
          load 'outputDir' as (word:chararray, count:int)
          `org.myorg.WordCount inputDir outputDir`;
      C = foreach B …
Plus a Whole Lot More
  • Custom Partitioners
      B = group A by $0 partition by YourPartitioner parallel 2;
  • Greatly expanded string and math built-in UDFs
  • Performance Improvements
      • Automatic merging of small files
      • Compression of intermediate results
  • Safety Features
      • Parallel set automatically when not specified
      • Monitor your UDF by annotating it with @MonitoredUDF. If it takes too long to return, Pig will kill it and return a default value instead.
  • PigUnit for unit testing your Pig Latin scripts
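The idea behind @MonitoredUDF, giving up on a slow call and substituting a default value, can be illustrated with a generic timeout wrapper. This sketch is not Pig's implementation: the names `call_with_timeout` and `slow_upper` are hypothetical, and unlike Pig it cannot kill the slow call, only stop waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(func, arg, timeout_seconds, default):
    """Return func(arg), or `default` if it has not finished in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, arg)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return default
    finally:
        # Do not block on the worker thread; it keeps running in the background.
        executor.shutdown(wait=False)


def slow_upper(value):
    time.sleep(0.5)  # stands in for a UDF that hangs on bad input
    return value.upper()
```

With a 0.05 second limit, `call_with_timeout(slow_upper, "pig", 0.05, "N/A")` falls back to the default, while a fast function returns its normal result.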
Plus Even More I Probably Don’t Have Time to Talk About
  • New option for UNION to merge schemas
  • Map side COGROUP
  • DESCRIBE now works in nested FOREACH
  • Local shell commands can now be run from Grunt
  • Support for jars and scripts stored on dfs
  • Arbitrary jobconf key-value pairs can be set inside Pig Latin script using SET
  • Merge join extended
      • Support for more than two tables for inner join
      • Support for left, right, or full outer join for 2 tables
  • Pig artifacts now available via maven
  • Significant memory improvements.
What’s Next?
  • Preview of Pig 0.9
      • Integrate Pig with scripting languages for control flow
      • Add macros to Pig Latin
      • Revive ILLUSTRATE
      • Fix most runtime type errors
      • Rewrite parser to give useful error messages
  • Programming Pig from O’Reilly Press
Acknowledgements
  • Much of the content of this talk was taken from Dmitriy Ryaboy’s very nice summary of features in Pig 0.8: http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/
  • The Pig team, for writing and testing all this code, including many non-Yahoo Pig team contributors who contributed significantly to this release

January 2011 HUG: Pig Presentation

Editor's Notes

  • #6 Can’t yet inline the Python functions in Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.
  • #11 Before 0.8 this was hard in Pig because you could not reuse the result of one Pig Latin operation in another operation without joining them, even if the result was a scalar value.