Practical Pig + PigUnit




 Michael G. Noll, Verisign
 July 2012
This talk is about Apache Pig

   • High-level data flow language (think: DSL) for writing
     Hadoop MapReduce jobs
   • Why and when should you care about Pig?
           • You are a Hadoop beginner
                  • … and want to implement a JOIN, for instance
           • You are a Hadoop expert
           • You only scratch your head when you see
                public static void main(String[] args)
           • You think Java is not the best tool for this job [pun!]
                  • Think: too low-level, too many lines of code, no interactive mode
                    for exploratory analysis, readability > performance, et cetera




                     Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation.
Verisign Public      Java is a trademark of Oracle Corporation.                                      2
A basic Pig script

   • Example: sorting user records by users’ age
           records = LOAD '/path/to/input'
                        AS (user:chararray, age:int);

           sorted_records = ORDER records BY age DESC;

           STORE sorted_records INTO '/path/to/output';



   • Popular alternatives to Pig
           • Hive: ~ SQL for Hadoop
           • Hadoop Streaming: use any programming language for MR
                  • Even though you still write code in a “real” programming
                    language, Streaming provides an environment that makes it more
                    convenient than native Hadoop Java code.

Preliminaries

   • Talk is based on Pig 0.10.0, released in April ’12
   • Some notable 0.10.0 improvements
           •      Hadoop 2.0 support
           •      Loading and storing JSON
           •      Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
           •      Amazon S3 support




Testing Pig – a primer




“Testing” Pig scripts – some examples


              DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP


              $ pig -x local


              $ pig [-debug | -dryrun]


              $ pig -param input=/path/to/small-sample.txt




“Testing” Pig scripts (cont.)

   • JobTracker UI              • PigStats, JobStats,
                                  HadoopJobHistoryLoader



  Now what have you been using?



     Also: inspecting Hadoop log files, …


However…

   • Previous approaches are primarily useful (and used)
     for creating the Pig script in the first place
           • Like ILLUSTRATE
   • None of them are really geared towards unit testing
   • Difficult to automate (think: production environment)
                  #!/bin/bash
                  pig -param date=$1 -param output=$2 myscript.pig
                  hadoop fs -copyToLocal $2 /tmp/jobresult
                  if [ ARGH!!! ] ...


   • Difficult to integrate into a typical development
     workflow, e.g. backed by Maven, Java and a CI server
                  $ mvn clean test              ??

                  Maven is a trademark of the Apache Software Foundation.
PigUnit




PigUnit

   • Available in Pig since version 0.8
              “PigUnit provides a unit-testing framework that plugs into JUnit
              to help you write unit tests that can be run on a regular basis.”
              -- Alan F. Gates, Programming Pig

   • Easy way to add Pig unit testing to your dev workflow
     iff you are a Java developer
           • See “Tips and Tricks” later for working around this constraint
   • Works with both JUnit and TestNG
   • PigUnit docs have “potential”
           • Some basic examples, then it’s looking at the source code of
             both PigUnit and Pig (but it’s manageable)
   • http://pig.apache.org/docs/r0.10.0/test.html#pigunit

Getting PigUnit up and running

   • PigUnit is not included in current Pig releases :(
   • You must manually build the PigUnit jar file

         $ cd /path/to/pig-sources # can be a release tarball
         $ ant jar pigunit-jar
         ...
         $ ls -l pig*jar
         -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
         -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar



   • Add these jar(s) to your CLASSPATH, done!




PigUnit and Maven

   • Unfortunately the Apache Pig project does not yet
     publish an official Maven artifact for PigUnit
                  WILL NOT WORK IN pom.xml :(
                  <dependency>
                      <groupId>org.apache.pig</groupId>
                      <artifactId>pigunit</artifactId>
                      <version>0.10.0</version>
                  </dependency>

   • Alternatives:
           •      Publish to your local Artifactory instance
           •      Use a local file-based <repository>
           •      Use a <system> scope in pom.xml (not recommended)
           •      Use trusted third-party repos like Cloudera’s
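
   • A hedged sketch of the file-based <repository> route: check the pigunit.jar you
     built into a repo/ directory inside your project, then reference it from pom.xml.
     The repository id and path below are illustrative, not an official convention.

     ```xml
     <!-- In pom.xml: resolve pigunit from a repository inside the project tree.
          First publish the jar there, e.g.:
          mvn install:install-file -Dfile=pigunit.jar -DgroupId=org.apache.pig \
              -DartifactId=pigunit -Dversion=0.10.0 -Dpackaging=jar \
              -DlocalRepositoryPath=repo -->
     <repositories>
       <repository>
         <id>project-local</id>
         <url>file://${project.basedir}/repo</url>
       </repository>
     </repositories>

     <dependencies>
       <dependency>
         <groupId>org.apache.pig</groupId>
         <artifactId>pigunit</artifactId>
         <version>0.10.0</version>
       </dependency>
     </dependencies>
     ```

     This keeps the build reproducible for teammates and CI without a shared Artifactory.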


                  Artifactory is a trademark of JFrog Ltd.
A simple PigUnit test




A simple PigUnit test

   • Here, we provide input + output data in the Java code
   • Pig script is read from file wordcount.pig
           @Test
           public void testSimpleExample() throws Exception {
               PigTest simpleTest = new PigTest("wordcount.pig");

               String[] input = { "foo", "bar", "foo" };
               String[] expectedOutput = {
                   "(foo,2)",
                   "(bar,1)"
               };

               simpleTest.assertOutput(
                   "aliasInput", input,
                   "aliasOutput", expectedOutput
               );
           }
A simple PigUnit test (cont.)

   • wordcount.pig
           -- PigUnit populates the alias 'aliasInput'
           -- with the test input data
           aliasInput = LOAD '<tmpLoc>' AS <schema>;

           -- ...here comes your actual code...

           -- PigUnit will treat the contents of the alias
           -- 'aliasOutput' as the actual output data in
           -- the assert statement
           aliasOutput = <your_final_statement>;

           -- Note: PigUnit ignores STORE operations by default
           STORE aliasOutput INTO 'output';
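
   • For illustration, a minimal wordcount.pig matching the test on the previous
     slide might look like this. The one-column schema and the GROUP/COUNT logic
     are my assumptions, not the talk's actual script:

     ```pig
     -- PigUnit redirects this LOAD to a temp file holding the test input
     aliasInput = LOAD '/path/to/input' AS (word:chararray);

     grouped = GROUP aliasInput BY word;

     -- yields tuples like (foo,2) and (bar,1) for input {foo, bar, foo}
     aliasOutput = FOREACH grouped GENERATE group, COUNT(aliasInput);

     -- ignored by PigUnit, used in production runs
     STORE aliasOutput INTO 'output';
     ```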




A simple PigUnit test (cont.)
                   simpleTest.assertOutput(
       1               "aliasInput", input,
       2               "aliasOutput", expectedOutput
                   );



       1          Pig injects input[] = { "foo", "bar", "foo" } into the
                  alias named aliasInput in the Pig script.
                  For this purpose Pig creates a temporary file, writes the
                  equivalent of StringUtils.join(input, "\n") to the file,
                  and finally makes its location available to the LOAD operation.


       2          Pig opens an iterator on the content of aliasOutput, and runs
                  assertEquals() based on StringUtils.join(..., "\n")
                  with expectedOutput and the actual content.

           See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.

PigUnit drawbacks

• How to divide your “main” Pig script into testable units?
       • Only run a single end-to-end test for the full script?
       • Extract testable snippets from the main script?
                  • Argh, code duplication!
       • Split the main script into logical units = smaller scripts; then run
         individual tests and include the smaller scripts in the main script
                  • Ok-ish but splitting too much makes the Pig code hard to
                    understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
       • It does work but it is not as convenient or powerful as you’d like.
                  • Notably you still need to know and write Java to use it. But one
                    compelling reason for Pig is that you can do without Java.
       • You may end up writing your own wrapper/helper lib around it.
                  • Consider contributing this back to the Apache Pig project!


Tips and tricks




Connecting to a real cluster (default: local mode)

     // this is not enough to enable cluster mode in PigUnit
     pigServer = new PigServer(ExecType.MAPREDUCE);
     // ...do PigUnit stuff...

     // rather:
     Properties props = System.getProperties();
     if (clusterMode)
         props.setProperty("pigunit.exectype.cluster", "true");
     else
         props.remove("pigunit.exectype.cluster"); // Properties has no removeProperty()

   • $HADOOP_CONF_DIR must be in CLASSPATH
   • Similar approach for enabling LZO support
           • mapred.output.compress => "true"
           • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
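
   • The property toggling above can be wrapped in a small helper. A sketch: the
     class and method names are mine, the property keys come from this slide, and
     the fully qualified LZO codec class name is an assumption based on the
     hadoop-lzo project (the slide abbreviates it as "c.h.c.lzo.LzopCodec").

     ```java
     import java.util.Properties;

     public class PigUnitClusterConfig {

         // Toggles PigUnit's cluster mode (and, while we're at it, LZO output
         // compression) via the system-property mechanism that PigUnit reads.
         public static void enableClusterMode(Properties props, boolean clusterMode) {
             if (clusterMode) {
                 props.setProperty("pigunit.exectype.cluster", "true");
                 props.setProperty("mapred.output.compress", "true");
                 // Assumed expansion of the abbreviated codec name on the slide
                 props.setProperty("mapred.output.compression.codec",
                         "com.hadoop.compression.lzo.LzopCodec");
             } else {
                 // java.util.Properties has no removeProperty(); use remove()
                 props.remove("pigunit.exectype.cluster");
             }
         }

         public static void main(String[] args) {
             Properties props = new Properties();
             enableClusterMode(props, true);
             System.out.println(props.getProperty("pigunit.exectype.cluster")); // prints "true"
         }
     }
     ```

     In a real test setup you would pass System.getProperties() so that PigUnit
     picks the flag up when it decides between local and cluster execution.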



Write a convenient PigUnit runner for your users

   • Pig user != Java developer
   • Pig users should only need to provide three files:
           •    pig/myscript.pig
           • input/testdata.txt
           • output/expected.txt
   • PigUnit runner discovers and runs tests for users
           • PigTest#assertOutput() can also handle files
           • But you must manage file uploads and similar “glue” yourself

      pigUnitRunner.runPigTest(
          new Path(scriptFile),
          new Path(inputFile),
          new Path(expectedOutputFile)
      );
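
   • One way such a runner might look, building on PigTest's file handling. This
     is a sketch, not PigUnit API: it assumes Pig and PigUnit are on the classpath,
     and that the user's script LOADs from a $input parameter; the class name and
     directory convention are illustrative.

     ```java
     import java.io.File;
     import org.apache.pig.pigunit.PigTest;

     public class PigUnitRunner {

         // Runs one discovered test case: a Pig script, a test input file,
         // and a file with the expected output of the final alias.
         public void runPigTest(File scriptFile, File inputFile, File expectedOutputFile)
                 throws Exception {
             // Assumes the script reads its input via a parameter, e.g.
             //   aliasInput = LOAD '$input' AS (...);
             String[] params = { "input=" + inputFile.getAbsolutePath() };
             PigTest test = new PigTest(scriptFile.getAbsolutePath(), params);
             // assertOutput(File) compares the final alias against the file contents
             test.assertOutput(expectedOutputFile);
         }
     }
     ```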


Slightly off-topic: Java/Pig combo

   • Pig API provides nifty features to control Pig workflows
     through Java
           • Similar to how working with PigUnit feels
   • Definitely worth a look!
   // 'pigParams' is the main glue between Java and Pig here,
   // e.g. to specify the location of input data
   pigServer.registerScript(scriptInputStream, pigParams);

   ExecJob job = pigServer.store(
           "aliasOutput",
           "/path/to/output",
           "PigStorage()"
       );

   if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
       System.out.println("Happy world!");

Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.

More Related Content

What's hot

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 

What's hot (20)

Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
DBD::Gofer 200809
DBD::Gofer 200809DBD::Gofer 200809
DBD::Gofer 200809
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 

Viewers also liked

Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya GargBig Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
QA or the Highway
 

Viewers also liked (14)

Unit testing pig
Unit testing pigUnit testing pig
Unit testing pig
 
Coscup 2013 : Continuous Integration on top of hadoop
Coscup 2013 : Continuous Integration on top of hadoopCoscup 2013 : Continuous Integration on top of hadoop
Coscup 2013 : Continuous Integration on top of hadoop
 
How to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest GroupHow to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest Group
 
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
 
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya GargBig Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Introduction to Big data tdd and pig unit
Introduction to Big data tdd and pig unitIntroduction to Big data tdd and pig unit
Introduction to Big data tdd and pig unit
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and Hadoop
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 

Similar to Practical Pig and PigUnit (Michael Noll, Verisign)

Similar to Practical Pig and PigUnit (Michael Noll, Verisign) (20)

Testing your puppet code
Testing your puppet codeTesting your puppet code
Testing your puppet code
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
How I hack on puppet modules
How I hack on puppet modulesHow I hack on puppet modules
How I hack on puppet modules
 
Pipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as CodePipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as Code
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Puppet and the HashiCorp Suite
Puppet and the HashiCorp SuitePuppet and the HashiCorp Suite
Puppet and the HashiCorp Suite
 
From SaltStack to Puppet and beyond...
From SaltStack to Puppet and beyond...From SaltStack to Puppet and beyond...
From SaltStack to Puppet and beyond...
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development Workflow
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
 
Using the puppet debugger for lightweight exploration
Using the puppet debugger for lightweight explorationUsing the puppet debugger for lightweight exploration
Using the puppet debugger for lightweight exploration
 
ASP.NET 5 auf Raspberry PI & docker
ASP.NET 5 auf Raspberry PI & dockerASP.NET 5 auf Raspberry PI & docker
ASP.NET 5 auf Raspberry PI & docker
 
Arbeiten mit distribute, pip und virtualenv
Arbeiten mit distribute, pip und virtualenvArbeiten mit distribute, pip und virtualenv
Arbeiten mit distribute, pip und virtualenv
 
Django dev-env-my-way
Django dev-env-my-wayDjango dev-env-my-way
Django dev-env-my-way
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Virtualenv
VirtualenvVirtualenv
Virtualenv
 
How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2
 
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE LabHow to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
 
PyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deploymentPyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deployment
 
Puppet getting started by Dirk Götz
Puppet getting started by Dirk GötzPuppet getting started by Dirk Götz
Puppet getting started by Dirk Götz
 

More from Swiss Big Data User Group

Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
Swiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
Swiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
Swiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
Swiss Big Data User Group
 

More from Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Practical Pig and PigUnit (Michael Noll, Verisign)

  • 1. Practical Pig + PigUnit Michael G. Noll, Verisign July 2012
  • 2. This talk is about Apache Pig • High-level data flow language (think: DSL) for writing Hadoop MapReduce jobs • Why and when should you care about Pig? • You are an Hadoop beginner • … and want to implement a JOIN, for instance • You are an Hadoop expert • You only scratch your head when you see public static void main(String args...) • You think Java is not the best tool for this job [pun!] • Think: too low-level, too many lines of code, no interactive mode for exploratory analysis, readability > performance, et cetera Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation. Verisign Public Java is a trademark of Oracle Corporation. 2
• 3. A basic Pig script
     • Example: sorting user records by users’ age

       records = LOAD '/path/to/input'
                    AS (user:chararray, age:int);
       sorted_records = ORDER records BY age DESC;
       STORE sorted_records INTO '/path/to/output';

     • Popular alternatives to Pig
        • Hive: ~ SQL for Hadoop
        • Hadoop Streaming: use any programming language for MR
           • Even though you still write code in a "real" programming language, Streaming provides an environment that makes it more convenient than native Hadoop Java code.
• 4. Preliminaries
     • Talk is based on Pig 0.10.0, released in April ’12
     • Some notable 0.10.0 improvements
        • Hadoop 2.0 support
        • Loading and storing JSON
        • Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
        • Amazon S3 support
• 5. Testing Pig – a primer
• 6. “Testing” Pig scripts – some examples

     DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP

     $ pig -x local
     $ pig [-debug | -dryrun]
     $ pig -param input=/path/to/small-sample.txt
• 7. “Testing” Pig scripts (cont.)
     • JobTracker UI
     • PigStats, JobStats, HadoopJobHistoryLoader
     • Also: inspecting Hadoop log files, …
     Now what have you been using?
• 8. However…
     • Previous approaches are primarily useful (and used) for creating the Pig script in the first place
        • Like ILLUSTRATE
     • None of them are really geared towards unit testing
     • Difficult to automate (think: production environment)

       #!/bin/bash
       pig -param date=$1 -param output=$2 myscript.pig
       hadoop fs -copyToLocal $2 /tmp/jobresult
       if [ ARGH!!! ] ...

     • Difficult to integrate into a typical development workflow, e.g. backed by Maven, Java and a CI server

       $ mvn clean test   ??

     Apache Maven is a trademark of the Apache Software Foundation.
• 10. PigUnit
     • Available in Pig since version 0.8
       “PigUnit provides a unit-testing framework that plugs into JUnit to help you write unit tests that can be run on a regular basis.”
       -- Alan F. Gates, Programming Pig
     • Easy way to add Pig unit testing to your dev workflow iff you are a Java developer
        • See “Tips and Tricks” later for working around this constraint
     • Works with both JUnit and TestNG
     • PigUnit docs have “potential”
        • Some basic examples, then it’s looking at the source code of both PigUnit and Pig (but it’s manageable)
        • http://pig.apache.org/docs/r0.10.0/test.html#pigunit
• 11. Getting PigUnit up and running
     • PigUnit is not included in current Pig releases :(
     • You must manually build the PigUnit jar file

       $ cd /path/to/pig-sources   # can be a release tarball
       $ ant jar pigunit-jar
       ...
       $ ls -l pig*jar
       -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
       -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar

     • Add these jar(s) to your CLASSPATH, done!
• 12. PigUnit and Maven
     • Unfortunately the Apache Pig project does not yet publish an official Maven artifact for PigUnit

       WILL NOT WORK IN pom.xml :(
       <dependency>
         <groupId>org.apache.pig</groupId>
         <artifactId>pigunit</artifactId>
         <version>0.10.0</version>
       </dependency>

     • Alternatives:
        • Publish to your local Artifactory instance
        • Use a local file-based <repository>
        • Use a <system> scope in pom.xml (not recommended)
        • Use trusted third-party repos like Cloudera’s
     Artifactory is a trademark of JFrog ltd.
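To make the file-based <repository> alternative concrete, here is a hedged sketch (not from the deck): install the hand-built pigunit.jar into a repository directory inside your project, then point the pom at it. The directory name `repo` and the repository id are our own choices; the groupId/artifactId mirror the dependency shown above.

```xml
<!-- Sketch, assuming you first install the jar into ./repo, e.g.:
     mvn install:install-file -Dfile=pigunit.jar -DgroupId=org.apache.pig \
         -DartifactId=pigunit -Dversion=0.10.0 -Dpackaging=jar \
         -DlocalRepositoryPath=repo
-->
<repositories>
  <repository>
    <id>project-local</id>
    <url>file://${project.basedir}/repo</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.10.0</version>
  </dependency>
</dependencies>
```

Checking `repo/` into version control keeps the build reproducible for teammates and the CI server without a shared Artifactory instance.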
• 13. A simple PigUnit test
• 14. A simple PigUnit test
     • Here, we provide input + output data in the Java code
     • Pig script is read from file wordcount.pig

       @Test
       public void testSimpleExample() {
         PigTest simpleTest = new PigTest("wordcount.pig");
         String[] input = { "foo", "bar", "foo" };
         String[] expectedOutput = { "(foo,2)", "(bar,1)" };
         simpleTest.assertOutput(
             "aliasInput", input,
             "aliasOutput", expectedOutput);
       }
• 15. A simple PigUnit test (cont.)
     • wordcount.pig

       -- PigUnit populates the alias 'aliasInput'
       -- with the test input data
       aliasInput = LOAD '<tmpLoc>' AS <schema>;

       -- ...here comes your actual code...

       -- PigUnit will treat the contents of the alias
       -- 'aliasOutput' as the actual output data in
       -- the assert statement
       aliasOutput = <your_final_statement>;

       -- Note: PigUnit ignores STORE operations by default
       STORE aliasOutput INTO 'output';
• 16. A simple PigUnit test (cont.)

       simpleTest.assertOutput(
           "aliasInput", input,              // 1
           "aliasOutput", expectedOutput);   // 2

     1. Pig injects input[] = { "foo", "bar", "foo" } into the alias named aliasInput in the Pig script. For this purpose Pig creates a temporary file, writes the equivalent of StringUtils.join(input, "\n") to the file, and finally makes its location available to the LOAD operation.
     2. Pig opens an iterator on the content of aliasOutput, and runs assertEquals() based on StringUtils.join(..., "\n") with expectedOutput and the actual content.
     See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
• 17. PigUnit drawbacks
     • How to divide your “main” Pig script into testable units?
        • Only run a single end-to-end test for the full script?
        • Extract testable snippets from the main script? Argh, code duplication!
        • Split the main script into logical units = smaller scripts; then run individual tests and include the smaller scripts in the main script
           • Ok-ish, but splitting too much makes the Pig code hard to understand (too many trees, no forest).
     • PigUnit is a nice tool but batteries are not included
        • It does work but it is not as convenient or powerful as you’d like.
        • Notably you still need to know and write Java to use it. But one compelling reason for Pig is that you can do without Java.
        • You may end up writing your own wrapper/helper lib around it.
        • Consider contributing this back to the Apache Pig project!
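One concrete way to do the “split into smaller scripts” option (our illustration, not from the deck) is Pig’s macro support plus IMPORT, available since Pig 0.9. The file names and alias names below are our own; the macro file is the smaller, individually testable unit:

```pig
-- counting.macro (hypothetical file): a reusable, testable unit
DEFINE count_words(lines) RETURNS counts {
    words   = FOREACH $lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    $counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
};

-- main.pig: include the smaller script in the main script
IMPORT 'counting.macro';
lines  = LOAD '/path/to/input' AS (line:chararray);
counts = count_words(lines);
STORE counts INTO '/path/to/output';
```

A thin wrapper script around the macro can then serve as the PigUnit test target, while main.pig stays readable.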
• 19. Connecting to a real cluster (default: local mode)

       // this is not enough to enable cluster mode in PigUnit
       pigServer = new PigServer(ExecType.MAPREDUCE);
       // ...do PigUnit stuff...

       // rather:
       Properties props = System.getProperties();
       if (clusterMode)
         props.setProperty("pigunit.exectype.cluster", "true");
       else
         props.remove("pigunit.exectype.cluster");

     • $HADOOP_CONF_DIR must be in CLASSPATH
     • Similar approach for enabling LZO support
        • mapred.output.compress => "true"
        • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
• 20. Write a convenient PigUnit runner for your users
     • Pig user != Java developer
     • Pig users should only need to provide three files:
        • pig/myscript.pig
        • input/testdata.txt
        • output/expected.txt
     • PigUnit runner discovers and runs tests for users
        • PigTest#assertOutput() can also handle files
        • But you must manage file uploads and similar “glue” yourself

          pigUnitRunner.runPigTest(
              new Path(scriptFile),
              new Path(inputFile),
              new Path(expectedOutputFile));
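The discovery half of such a runner needs no Pig at all. Below is a hedged sketch (our code, not the deck’s): it pairs each pig/<name>.pig with input/<name>.txt and output/<name>.txt under a base directory, skipping incomplete test cases. The class name, layout, and matching-by-stem convention are all assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: discover (script, input, expected-output) triples
// so that non-Java Pig users only have to drop in three files.
class PigTestDiscovery {

    static class TestCase {
        final File script, input, expectedOutput;
        TestCase(File script, File input, File expectedOutput) {
            this.script = script;
            this.input = input;
            this.expectedOutput = expectedOutput;
        }
    }

    // For every pig/<name>.pig, look for input/<name>.txt and
    // output/<name>.txt; test cases missing either file are skipped.
    static List<TestCase> discover(File baseDir) {
        List<TestCase> cases = new ArrayList<TestCase>();
        File[] scripts = new File(baseDir, "pig").listFiles();
        if (scripts == null) return cases;
        for (File script : scripts) {
            String name = script.getName();
            if (!name.endsWith(".pig")) continue;
            String stem = name.substring(0, name.length() - ".pig".length());
            File input = new File(new File(baseDir, "input"), stem + ".txt");
            File expected = new File(new File(baseDir, "output"), stem + ".txt");
            if (input.isFile() && expected.isFile()) {
                cases.add(new TestCase(script, input, expected));
            }
        }
        return cases;
    }
}
```

Each discovered triple would then be handed to the file-based PigTest#assertOutput() call shown above.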
• 21. Slightly off-topic: Java/Pig combo
     • Pig API provides nifty features to control Pig workflows through Java
        • Similar to how working with PigUnit feels
        • Definitely worth a look!

       // 'pigParams' is the main glue between Java and Pig here,
       // e.g. to specify the location of input data
       pigServer.registerScript(scriptInputStream, pigParams);
       ExecJob job = pigServer.store(
           "aliasOutput", "/path/to/output", "PigStorage()");
       if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
         System.out.println("Happy world!");
• 22. Thank You
     © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.