January 2011 HUG: Pig Presentation




  1. Alan F. Gates
     Pig 0.8 New Features
  2. Who am I?
     Pig committer and PMC member
     Architect in the grid team at Yahoo
     Photo credit: Steven Guarnaccia, The Three Little Pigs
  3. Focus of Pig 0.8
     Usability
     Integration
     Performance
     Backwards compatibility with 0.7
  4. UDFs in Scripting Languages
     Evaluation functions can now be written in scripting languages that compile down to the JVM
     Reference implementation provided in Jython
     JRuby and others could be added with minimal code
     JavaScript implementation in progress
     Jython sold separately
  5. Example Python UDF
     test.py:
       @outputSchema("sqr:long")
       def square(num):
           return ((num)*(num))
     test.pig:
       register 'test.py' using jython as myfuncs;
       A = load 'input' as (i:int);
       B = foreach A generate myfuncs.square(i);
       dump B;
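The `outputSchema` decorator in test.py is supplied by Pig when the script is registered, so the file does not run as-is in a plain Python interpreter. A minimal sketch for trying the UDF locally follows; the stub decorator is an assumption made for testing only and is not Pig's implementation.

```python
def outputSchema(schema):
    """Stand-in for the decorator Pig provides to Jython UDFs (local-testing assumption)."""
    def decorate(func):
        func.outputSchema = schema  # remember the declared schema string on the function
        return func
    return decorate


@outputSchema("sqr:long")
def square(num):
    # Same body as the slide's UDF: multiply the input by itself.
    return num * num


if __name__ == "__main__":
    print([square(i) for i in (1, 2, 3)])
```

With the stub in place the function can be unit tested before it is registered in a Pig script.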
  6. Better statistics
     Statistics printed out at end of job run
     Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage
     Loader for reading job history files included in Piggybank
     New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
       Can also pass listener to track Pig jobs as they run
       Done for Oozie so it can show users Pig statistics
  7. Sample stats info
     Job Stats (time in seconds):
     JobId  Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
     job_0  2     1        15    3     9    27    27    27   a,b,c,d,e
     job_1  1     1        3     3     3    12    12    12   g,h
     job_2  1     1        3     3     3    12    12    12   i
     job_3  1     1        3     3     3    12    12    12   i
     Input(s):
     Successfully read 10000 records from: "studenttab10k"
     Successfully read 10000 records from: "votertab10k"
     Output(s):
     Successfully stored 6 records (150 bytes) in: "outfile"
     Counters:
     Total records written : 6
     Total bytes written : 150
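As an illustration of mining this output, the job table in the sample above can be read into dictionaries with a few lines of Python. This is only a sketch: it assumes the whitespace-separated layout and the abbreviated column names shown on the slide (with the fused header split into `MxMT MnMT` and `MxRT MnRT`), and `parse_job_stats` is a hypothetical helper, not part of Pig or the PigRunner interface.

```python
SAMPLE = """\
JobId Maps Reduces MxMT MnMT AMT MxRT MnRT ART Alias
job_0 2 1 15 3 9 27 27 27 a,b,c,d,e
job_1 1 1 3 3 3 12 12 12 g,h
job_2 1 1 3 3 3 12 12 12 i
job_3 1 1 3 3 3 12 12 12 i
"""


def parse_job_stats(text):
    """Turn the whitespace-separated job table into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    jobs = []
    for row in rows:
        job = dict(zip(header, row))
        for key in header[1:-1]:          # every column between JobId and Alias is numeric
            job[key] = int(job[key])
        job["Alias"] = job["Alias"].split(",")
        jobs.append(job)
    return jobs


jobs = parse_job_stats(SAMPLE)
slowest = max(jobs, key=lambda j: j["MxRT"])  # job with the longest reduce task

if __name__ == "__main__":
    print(slowest["JobId"], slowest["MxRT"])
```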
  8. Invoke Static Java Functions as UDFs
     Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs
       define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
       A = load 'encoded.txt' as (e:chararray);
       B = foreach A generate UrlDecode(e, 'UTF-8');
     Currently only works with simple types and static functions
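To make concrete what the `UrlDecode` call computes for each row, here is a rough Python 3 analogue. It is a sketch under the assumption that form-style decoding (`+` becomes a space, `%xx` escapes are decoded as UTF-8) is what is wanted; it uses Python's `urllib.parse.unquote_plus` rather than the Java function, and `url_decode` and the sample strings are made up for illustration.

```python
from urllib.parse import unquote_plus


def url_decode(encoded, charset="utf-8"):
    """Decode a form-encoded string: '+' becomes a space, %xx escapes are decoded."""
    return unquote_plus(encoded, encoding=charset)


# Hypothetical rows standing in for the contents of 'encoded.txt'.
rows = ["pig%20latin+rocks%21", "a%2Bb%3Dc"]
decoded = [url_decode(r) for r in rows]

if __name__ == "__main__":
    for line in decoded:
        print(line)
```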
  9. Improved HBase Integration
     Can now read records as bytes instead of auto converting to strings
     Filters can be pushed down
     Can store data in HBase as well as load from it
     Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.
  10. Casting Relations to Scalars
      Say you want to calculate the percentage of page views per browser type (e.g. IE, Firefox)
        views = load 'views' as (url, browser);
        gv = group views all;
        numviews = foreach gv generate COUNT(views) as total;
        gb = group views by browser;
        perbrowser = foreach gb generate group, COUNT(views) / (long)numviews.total;
      Now it is possible to cast the relation numviews to a scalar value for use in later calculations
      Pig handles storing the results in a file and retrieving it when needed
      Only works for single row results
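The arithmetic behind this script can be sketched in plain Python: one pass produces a single total (the role played by the scalar `numviews.total`), and each per-browser count is then divided by it. The sample rows and the helper name `share_per_browser` are assumptions for illustration, and the sketch uses true division so the result is a fraction.

```python
from collections import Counter

# Hypothetical (url, browser) rows standing in for the 'views' input.
views = [
    ("/a", "IE"),
    ("/b", "Firefox"),
    ("/a", "IE"),
    ("/c", "Chrome"),
]


def share_per_browser(rows):
    """Return each browser's fraction of all page views."""
    total = len(rows)                                  # the single-row "scalar" total
    counts = Counter(browser for _, browser in rows)   # group by browser, then count
    return {browser: n / total for browser, n in counts.items()}


shares = share_per_browser(views)

if __name__ == "__main__":
    for browser, share in sorted(shares.items()):
        print(browser, share)
```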
  11. Integrating MapReduce Jobs
      Sometimes you need to integrate MR and Pig jobs
        Legacy code
        Algorithm that’s hard to implement in Pig
      A = load 'WordcountInput.txt';
      B = mapreduce 'wordcount.jar' store A into 'inputDir' load 'outputDir' as (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;
      C = foreach B …
  12. Plus a Whole Lot More
      Custom Partitioners
        B = group A by $0 partition by YourPartitioner parallel 2;
      Greatly expanded string and math built in UDFs
      Performance Improvements
        Automatic merging of small files
        Compression of intermediate results
      Safety Features
        Parallel set automatically when not specified
        Monitor your UDF by annotating it with @MonitoredUDF. If it takes too long to return Pig will kill it and return a default value instead.
      PigUnit for unit testing your Pig Latin scripts
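The idea behind @MonitoredUDF (give up on a slow call and return a default) can be sketched in Python. This is an analogy only, not how Pig implements the annotation: `monitored`, `slow_udf`, and `fast_udf` are hypothetical names, and unlike Pig the sketch merely stops waiting for the worker thread rather than killing it.

```python
import concurrent.futures
import functools
import time


def monitored(timeout_s, default):
    """Run the wrapped function in a worker thread; return `default` if it is too slow."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                return default                 # too slow: fall back to the default value
            finally:
                pool.shutdown(wait=False)      # do not block on the still-running worker
        return inner
    return wrap


@monitored(timeout_s=0.1, default=-1)
def slow_udf(x):
    time.sleep(0.5)   # simulate a UDF that takes too long
    return x * x


@monitored(timeout_s=1.0, default=-1)
def fast_udf(x):
    return x * x


if __name__ == "__main__":
    print(fast_udf(3), slow_udf(3))
```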
  13. Plus Even More I Probably Don’t Have Time to Talk About
      New option for UNION to merge schemas
      Map side COGROUP
      DESCRIBE now works in nested FOREACH
      Local shell commands can now be run from Grunt
      Support for jars and scripts stored on dfs
      Arbitrary jobconf key-value pairs can be set inside Pig Latin script using SET
      Merge join extended
        Support for more than two tables for inner join
        Support for left, right, or full outer join for 2 tables
      Pig artifacts now available via Maven
      Significant memory improvements
  14. What’s Next?
      Preview of Pig 0.9
        Integrate Pig with scripting languages for control flow
        Add macros to Pig Latin
        Revive ILLUSTRATE
        Fix most runtime type errors
        Rewrite parser to give useful error messages
      Programming Pig from O’Reilly Press
  15. Acknowledgements
      Much of the content of this talk was taken from Dmitriy Ryaboy’s very nice summary of features in Pig 0.8: http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/
      The Pig team, for writing and testing all this code, including many non-Yahoo Pig team contributors who contributed significantly to this release