Practical Pig and PigUnit (Michael Noll, Verisign)


This talk was held at the second meeting of the Swiss Big Data User Group on July 16 at ETH Zürich.

Published in: Technology

  1. Practical Pig + PigUnit. Michael G. Noll, Verisign, July 2012
  2. This talk is about Apache Pig
     • High-level data flow language (think: DSL) for writing Hadoop MapReduce jobs
     • Why and when should you care about Pig?
       • You are a Hadoop beginner
         • … and want to implement a JOIN, for instance
       • You are a Hadoop expert
         • You only scratch your head when you see public static void main(String args...)
         • You think Java is not the best tool for this job [pun!]
         • Think: too low-level, too many lines of code, no interactive mode for exploratory analysis, readability > performance, et cetera
     Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation. Java is a trademark of Oracle Corporation.
     Verisign Public
  3. A basic Pig script
     • Example: sorting user records by users’ age
         records = LOAD '/path/to/input' AS (user:chararray, age:int);
         sorted_records = ORDER records BY age DESC;
         STORE sorted_records INTO '/path/to/output';
     • Popular alternatives to Pig
       • Hive: ~ SQL for Hadoop
       • Hadoop Streaming: use any programming language for MR
         • Even though you still write code in a “real” programming language, Streaming provides an environment that makes it more convenient than native Hadoop Java code.
  4. Preliminaries
     • Talk is based on Pig 0.10.0, released in April 2012
     • Some notable 0.10.0 improvements
       • Hadoop 2.0 support
       • Loading and storing JSON
       • Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
       • Amazon S3 support
  5. Testing Pig – a primer
  6. “Testing” Pig scripts – some examples
         DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP
         $ pig -x local
         $ pig [-debug | -dryrun]
         $ pig -param input=/path/to/small-sample.txt
  7. “Testing” Pig scripts (cont.)
     • JobTracker UI
     • PigStats, JobStats, HadoopJobHistoryLoader
     Now what have you been using?
     Also: inspecting Hadoop log files, …
  8. However…
     • Previous approaches are primarily useful (and used) for creating the Pig script in the first place
       • Like ILLUSTRATE
     • None of them are really geared towards unit testing
     • Difficult to automate (think: production environment)
         #!/bin/bash
         pig -param date=$1 -param output=$2 myscript.pig
         hadoop fs -copyToLocal $2 /tmp/jobresult
         if [ ARGH!!! ] ...
     • Difficult to integrate into a typical development workflow, e.g. backed by Maven, Java and a CI server
         $ mvn clean test ??
     Maven is a trademark of the Apache Software Foundation.
  9. PigUnit
  10. PigUnit
      • Available in Pig since version 0.8
        “PigUnit provides a unit-testing framework that plugs into JUnit to help you write unit tests that can be run on a regular basis.” -- Alan F. Gates, Programming Pig
      • Easy way to add Pig unit testing to your dev workflow iff you are a Java developer
        • See “Tips and Tricks” later for working around this constraint
      • Works with both JUnit and TestNG
      • PigUnit docs have “potential”
        • Some basic examples, then it’s looking at the source code of both PigUnit and Pig (but it’s manageable)
  11. Getting PigUnit up and running
      • PigUnit is not included in current Pig releases :(
      • You must manually build the PigUnit jar file
          $ cd /path/to/pig-sources   # can be a release tarball
          $ ant jar pigunit-jar
          ...
          $ ls -l pig*jar
          -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
          -rw-r--r-- 1 mnoll mnoll 285627 ... pigunit.jar
      • Add these jar(s) to your CLASSPATH, done!
  12. PigUnit and Maven
      • Unfortunately the Apache Pig project does not yet publish an official Maven artifact for PigUnit
        WILL NOT WORK IN pom.xml :(
          <dependency>
            <groupId>org.apache.pig</groupId>
            <artifactId>pigunit</artifactId>
            <version>0.10.0</version>
          </dependency>
      • Alternatives:
        • Publish to your local Artifactory instance
        • Use a local file-based <repository>
        • Use the system scope (<scope>system</scope>) in pom.xml (not recommended)
        • Use trusted third-party repos like Cloudera’s
      Artifactory is a trademark of JFrog Ltd.
  13. A simple PigUnit test
  14. A simple PigUnit test
      • Here, we provide input + output data in the Java code
      • Pig script is read from file wordcount.pig
          @Test
          public void testSimpleExample() throws Exception {
              PigTest simpleTest = new PigTest("wordcount.pig");
              String[] input = { "foo", "bar", "foo" };
              String[] expectedOutput = { "(foo,2)", "(bar,1)" };
              simpleTest.assertOutput(
                  "aliasInput", input,
                  "aliasOutput", expectedOutput );
          }
  15. A simple PigUnit test (cont.)
      • wordcount.pig
          -- PigUnit populates the alias 'aliasInput'
          -- with the test input data
          aliasInput = LOAD '<tmpLoc>' AS <schema>;
          -- here comes your actual code...
          -- PigUnit will treat the contents of the alias
          -- 'aliasOutput' as the actual output data in
          -- the assert statement
          aliasOutput = <your_final_statement>;
          -- Note: PigUnit ignores STORE operations by default
          STORE aliasOutput INTO 'output';
  16. A simple PigUnit test (cont.)
          simpleTest.assertOutput(
              "aliasInput", input,              // (1)
              "aliasOutput", expectedOutput );  // (2)
      (1) Pig injects input[] = { "foo", "bar", "foo" } into the alias named aliasInput in the Pig script. For this purpose Pig creates a temporary file, writes the equivalent of StringUtils.join(input, "\n") to the file, and finally makes its location available to the LOAD operation.
      (2) Pig opens an iterator on the content of aliasOutput, and runs assertEquals() based on StringUtils.join(..., "\n") with expectedOutput and the actual content.
      See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
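
The two steps described on this slide can be imitated with the JDK alone. The following is a minimal sketch, not PigUnit's source code: the class `AssertOutputSketch` and its method names are made up for illustration, and it only assumes what the slide states, namely that input records are joined with newlines into a temporary file and that expected and actual output are compared as newline-joined strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the mechanism described above; NOT PigUnit code.
class AssertOutputSketch {

    // Step (1): join the input records with newlines and write them to a
    // temporary file whose location a LOAD statement could then read from.
    static Path writeInputToTempFile(String[] input) {
        try {
            Path tmp = Files.createTempFile("pigunit-sketch-", ".txt");
            Files.write(tmp, String.join("\n", input).getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for inspecting what was written to the temporary file.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step (2): join both sides with newlines and compare them as one string each.
    static boolean outputMatches(String[] expected, List<String> actual) {
        return String.join("\n", expected).equals(String.join("\n", actual));
    }

    public static void main(String[] args) {
        Path tmp = writeInputToTempFile(new String[] { "foo", "bar", "foo" });
        System.out.println("temp file holds: " + readLines(tmp));
        String[] expected = { "(foo,2)", "(bar,1)" };
        System.out.println("same order: "
                + outputMatches(expected, Arrays.asList("(foo,2)", "(bar,1)")));
        System.out.println("reordered:  "
                + outputMatches(expected, Arrays.asList("(bar,1)", "(foo,2)")));
    }
}
```

If the comparison really is string-based as described, the order of output tuples matters, so a script whose final alias has no defined order may need an explicit ORDER BY before the tested alias.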
  17. PigUnit drawbacks
      • How to divide your “main” Pig script into testable units?
        • Only run a single end-to-end test for the full script?
        • Extract testable snippets from the main script?
          • Argh, code duplication!
        • Split the main script into logical units = smaller scripts; then run individual tests and include the smaller scripts in the main script
          • Ok-ish but splitting too much makes the Pig code hard to understand (too many trees, no forest).
      • PigUnit is a nice tool but batteries are not included
        • It does work but it is not as convenient or powerful as you’d like.
        • Notably you still need to know and write Java to use it. But one compelling reason for Pig is that you can do without Java.
        • You may end up writing your own wrapper/helper lib around it.
          • Consider contributing this back to the Apache Pig project!
  18. Tips and tricks
  19. Connecting to a real cluster (default: local mode)
          // this is not enough to enable cluster mode in PigUnit
          pigServer = new PigServer(ExecType.MAPREDUCE);
          // PigUnit stuff...
          // rather:
          Properties props = System.getProperties();
          if (clusterMode)
              props.setProperty("pigunit.exectype.cluster", "true");
          else
              props.remove("pigunit.exectype.cluster");
      • $HADOOP_CONF_DIR must be in CLASSPATH
      • Similar approach for enabling LZO support
        • mapred.output.compress => "true"
        • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
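
Because system properties are global to the JVM, a test that switches to cluster mode can leak that setting into later tests. Below is a minimal sketch of a helper that sets the `pigunit.exectype.cluster` property named on this slide, runs some code, and then restores the previous state. The class `PigUnitExecMode` and its methods are hypothetical; the sketch uses `Properties.remove(...)` because `java.util.Properties` does not define a `removeProperty` method, and it assumes that only the presence of the key matters, which the slide's set-or-remove logic suggests but does not state.

```java
import java.util.Properties;

// Hypothetical helper around the system property named on the slide.
class PigUnitExecMode {

    static final String CLUSTER_PROPERTY = "pigunit.exectype.cluster";

    // Mirrors the slide: set the property for cluster mode, remove it for local mode.
    static void setClusterMode(boolean clusterMode) {
        Properties props = System.getProperties();
        if (clusterMode) {
            props.setProperty(CLUSTER_PROPERTY, "true");
        } else {
            props.remove(CLUSTER_PROPERTY);
        }
    }

    // Assumption: presence of the key is what signals cluster mode.
    static boolean isClusterModeRequested() {
        return System.getProperty(CLUSTER_PROPERTY) != null;
    }

    // Runs the body with the requested mode, then restores the previous value
    // so that other tests in the same JVM are not affected.
    static void withClusterMode(boolean clusterMode, Runnable body) {
        String previous = System.getProperty(CLUSTER_PROPERTY);
        setClusterMode(clusterMode);
        try {
            body.run();
        } finally {
            if (previous == null) {
                System.clearProperty(CLUSTER_PROPERTY);
            } else {
                System.setProperty(CLUSTER_PROPERTY, previous);
            }
        }
    }

    public static void main(String[] args) {
        withClusterMode(true, () ->
                System.out.println("inside: cluster mode requested = " + isClusterModeRequested()));
        System.out.println("after:  cluster mode requested = " + isClusterModeRequested());
    }
}
```

In a real test the `Runnable` would construct the `PigTest` and run its assertions, so that the property is already in place when PigUnit decides which execution mode to use.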
  20. Write a convenient PigUnit runner for your users
      • Pig user != Java developer
      • Pig users should only need to provide three files:
        • pig/myscript.pig
        • input/testdata.txt
        • output/expected.txt
      • PigUnit runner discovers and runs tests for users
        • PigTest#assertOutput() can also handle files
        • But you must manage file uploads and similar “glue” yourself
          pigUnitRunner.runPigTest(
              new Path(scriptFile), new Path(inputFile), new Path(expectedOutputFile) );
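
The discovery half of such a runner needs nothing beyond the JDK. The sketch below is one possible shape, under stated assumptions: each test lives in its own directory below a common root and contains exactly the three files named on this slide. The class `PigTestDiscoverySketch`, the nested `PigTestCase` type, and the one-directory-per-test layout are all hypothetical; the slide does not describe how its runner locates tests. Actually running a discovered case would go through PigUnit and is left out here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical discovery step for a PigUnit runner; layout is an assumption.
class PigTestDiscoverySketch {

    // The three files a Pig user provides for one test.
    static final class PigTestCase {
        final Path script;
        final Path input;
        final Path expectedOutput;

        PigTestCase(Path script, Path input, Path expectedOutput) {
            this.script = script;
            this.input = input;
            this.expectedOutput = expectedOutput;
        }
    }

    // Treats every entry below testRoot that contains all three files as a test
    // case; incomplete directories and plain files are skipped silently.
    static List<PigTestCase> discover(Path testRoot) {
        List<PigTestCase> cases = new ArrayList<>();
        try (DirectoryStream<Path> candidates = Files.newDirectoryStream(testRoot)) {
            for (Path dir : candidates) {
                Path script = dir.resolve("pig").resolve("myscript.pig");
                Path input = dir.resolve("input").resolve("testdata.txt");
                Path expected = dir.resolve("output").resolve("expected.txt");
                if (Files.isRegularFile(script)
                        && Files.isRegularFile(input)
                        && Files.isRegularFile(expected)) {
                    cases.add(new PigTestCase(script, input, expected));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Sort by script path so that test order does not depend on the file system.
        cases.sort(Comparator.comparing((PigTestCase c) -> c.script.toString()));
        return cases;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (PigTestCase c : discover(root)) {
            System.out.println(c.script + " <- " + c.input + " -> " + c.expectedOutput);
        }
    }
}
```

Skipping incomplete directories keeps the sketch short; a runner meant for Pig users would probably report them instead, since a missing expected.txt is more likely a mistake than an intentional omission.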
  21. Slightly off-topic: Java/Pig combo
      • Pig API provides nifty features to control Pig workflows through Java
        • Similar to how working with PigUnit feels
        • Definitely worth a look!
          // 'pigParams' is the main glue between Java and Pig here,
          // e.g. to specify the location of input data
          pigServer.registerScript(scriptInputStream, pigParams);
          ExecJob job = pigServer.store(
              "aliasOutput", "/path/to/output", "PigStorage()" );
          if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
              System.out.println("Happy world!");
  22. Thank You
      © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.