2. Who am I?
- Pig committer and PMC member
- Architect in the grid team at Yahoo
- Photo credit: Steven Guarnaccia, The Three Little Pigs
3. Focus of Pig 0.8
- Usability
- Integration
- Performance
- Backwards compatibility with 0.7
4. UDFs in Scripting Languages
- Evaluation functions can now be written in scripting languages that compile down to the JVM
- Reference implementation provided in Jython
- JRuby and others could be added with minimal code
- JavaScript implementation in progress
- Jython sold separately
5. Example Python UDF
test.py:
    @outputSchema("sqr:long")
    def square(num):
        return ((num)*(num))
test.pig:
    register 'test.py' using jython as myfuncs;
    A = load 'input' as (i:int);
    B = foreach A generate myfuncs.square(i);
    dump B;
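The UDF in test.py can be exercised outside Pig before it is registered. This is a minimal sketch, not part of the talk: the outputSchema decorator defined here is a hypothetical stand-in for the one Pig makes available when it loads the script, and it only records the schema string on the function.

```python
# Hypothetical stand-in for the decorator that Pig supplies when it
# registers a Jython script. Outside Pig the name is not defined, so
# we provide a minimal version that just remembers the schema string.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("sqr:long")
def square(num):
    # Same logic as the UDF on the slide.
    return num * num


print(square(7))
print(square.outputSchema)
```

Checking the function this way catches plain Python mistakes early. It says nothing about how Pig converts types at runtime.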
6. Better statistics
- Statistics printed out at end of job run
- Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage
- Loader for reading job history files included in Piggybank
- New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information
- Can also pass a listener to track Pig jobs as they run
- Done for Oozie so it can show users Pig statistics
7. Sample stats info
Job Stats (time in seconds):
JobId  Maps  Reduces  MxMT  MnMT  AMT  MxRT  MnRT  ART  Alias
job_0  2     1        15    3     9    27    27    27   a,b,c,d,e
job_1  1     1        3     3     3    12    12    12   g,h
job_2  1     1        3     3     3    12    12    12   i
job_3  1     1        3     3     3    12    12    12   i

Input(s):
Successfully read 10000 records from: "studenttab10k"
Successfully read 10000 records from: "votertab10k"

Output(s):
Successfully stored 6 records (150 bytes) in: "outfile"

Counters:
Total records written : 6
Total bytes written : 150
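The job table above can be mined with a few lines of script. The sketch below is an illustration only: it assumes the abbreviated column names used on this slide and whitespace-separated columns, which may not match the headers a real Pig run prints. The function name parse_job_stats is made up for this example.

```python
# Two rows copied from the sample above; the header uses the slide's
# abbreviated column names (an assumption about the layout).
SAMPLE = """\
JobId Maps Reduces MxMT MnMT AMT MxRT MnRT ART Alias
job_0 2 1 15 3 9 27 27 27 a,b,c,d,e
job_1 1 1 3 3 3 12 12 12 g,h
"""


def parse_job_stats(text):
    """Turn the whitespace-separated job table into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    jobs = []
    for row in rows:
        record = dict(zip(header, row))
        # Every column between JobId and Alias is an integer.
        for key in header[1:-1]:
            record[key] = int(record[key])
        # Alias is a comma-separated list of relation names.
        record["Alias"] = record["Alias"].split(",")
        jobs.append(record)
    return jobs


jobs = parse_job_stats(SAMPLE)
# Find the job with the longest maximum reduce time.
slowest = max(jobs, key=lambda job: job["MxRT"])
print(slowest["JobId"], slowest["Alias"])
```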
8. Invoke Static Java Functions as UDFs
- Often the UDF you need already exists as a Java function, e.g. Java’s URLDecoder.decode() for decoding URLs
    define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    A = load 'encoded.txt' as (e:chararray);
    B = foreach A generate UrlDecode(e, 'UTF-8');
- Currently only works with simple types and static functions
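The idea behind the invoker is to turn a fully qualified function name, given as a string, into something callable. The sketch below shows that idea in Python rather than Pig or Java. It is an analogy and not how Pig implements InvokeForString; the helper name invoker is hypothetical, and urllib.parse.unquote_plus stands in for java.net.URLDecoder.decode because both decode form-style URL encoding.

```python
import importlib


def invoker(qualified_name):
    """Resolve 'package.module.function' to a callable by name."""
    module_name, _, func_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# Look the decoder up by its dotted name, then call it like any function.
url_decode = invoker("urllib.parse.unquote_plus")
print(url_decode("a%20b%2Fc+d"))
```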
9. Improved HBase Integration
- Can now read records as bytes instead of auto converting to strings
- Filters can be pushed down
- Can store data in HBase as well as load from it
- Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.
10. Casting Relations to Scalars
- Say you want to calculate the percentage of page views per browser type (e.g. IE, Firefox)
    views = load 'views' as (url, browser);
    gv = group views all;
    numviews = foreach gv generate COUNT(views) as total;
    gb = group views by browser;
    perbrowser = foreach gb generate group, COUNT(views) / (long)numviews.total;
- Now it is possible to cast the relation numviews to a scalar value for use in later calculations
- Pig handles storing the results in a file and retrieving it when needed
- Only works for single row results
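The calculation on this slide can be sketched in plain Python to show where the scalar is used. The sample rows are invented for illustration. The variable total plays the role of numviews.total: one number computed over the whole data set and then reused in every per-group row.

```python
from collections import Counter

# Invented sample data: (url, browser) pairs, as in the 'views' relation.
views = [
    ("/a", "IE"),
    ("/b", "Firefox"),
    ("/c", "IE"),
    ("/d", "Chrome"),
]

# Equivalent of "group views all" followed by COUNT: a single scalar.
total = len(views)

# Equivalent of "group views by browser" followed by COUNT per group,
# with each count divided by the scalar computed above.
counts = Counter(browser for _url, browser in views)
per_browser = {browser: n / total for browser, n in counts.items()}

print(per_browser)
```

Python's / yields a float here. In Pig, dividing one long by another likely truncates to an integer, so a cast to double may be needed to get a fraction.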
11. Integrating MapReduce Jobs
- Sometimes you need to integrate MR and Pig jobs
  - Legacy code
  - Algorithm that’s hard to implement in Pig
    A = load 'WordcountInput.txt';
    B = mapreduce 'wordcount.jar'
        store A into 'inputDir'
        load 'outputDir' as (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;
    C = foreach B …
12. Plus a Whole Lot More
- Custom Partitioners
    B = group A by $0 partition by YourPartitioner parallel 2;
- Greatly expanded string and math built in UDFs
- Performance Improvements
  - Automatic merging of small files
  - Compression of intermediate results
- Safety Features
  - Parallel set automatically when not specified
  - Monitor your UDF by annotating it with @MonitoredUDF. If it takes too long to return, Pig will kill it and return a default value instead.
- PigUnit for unit testing your Pig Latin scripts
14. What’s Next?
- Preview of Pig 0.9
  - Integrate Pig with scripting languages for control flow
  - Add macros to Pig Latin
  - Revive ILLUSTRATE
  - Fix most runtime type errors
  - Rewrite parser to give useful error messages
- Programming Pig from O’Reilly Press
15. Acknowledgements
- Much of the content of this talk was taken from Dmitriy Ryaboy’s very nice summary of features in Pig 0.8: http://squarecog.wordpress.com/2010/12/19/new-features-in-apache-pig-0-8/
- The Pig team, for writing and testing all this code, including many non-Yahoo Pig team contributors who contributed significantly to this release
Editor's Notes
You can’t yet inline the Python functions in a Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.
Before 0.8 this was hard in Pig because you could not reuse the result of one Pig Latin operation in another operation without joining them, even if the result was a scalar value.