Practical Pig + PigUnit




 Michael G. Noll, Verisign
 July 2012
This talk is about Apache Pig

   • High-level data flow language (think: DSL) for writing
     Hadoop MapReduce jobs
   • Why and when should you care about Pig?
           • You are a Hadoop beginner
                  • … and want to implement a JOIN, for instance
           • You are a Hadoop expert
           • You only scratch your head when you see
                public static void main(String[] args)
           • You think Java is not the best tool for this job [pun!]
                  • Think: too low-level, too many lines of code, no interactive mode
                    for exploratory analysis, readability > performance, et cetera




                     Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation.
Verisign Public      Java is a trademark of Oracle Corporation.                                      2
A basic Pig script

   • Example: sorting user records by users’ age
           records = LOAD '/path/to/input'
                        AS (user:chararray, age:int);

           sorted_records = ORDER records BY age DESC;

           STORE sorted_records INTO '/path/to/output';



   • Popular alternatives to Pig
           • Hive: ~ SQL for Hadoop
           • Hadoop Streaming: use any programming language for MR
                  • Even though you still write code in a “real” programming
                    language, Streaming provides an environment that makes it more
                    convenient than native Hadoop Java code.

Preliminaries

   • Talk is based on Pig 0.10.0, released in April ’12
   • Some notable 0.10.0 improvements
           •      Hadoop 2.0 support
           •      Loading and storing JSON
           •      Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
           •      Amazon S3 support




Testing Pig – a primer




“Testing” Pig scripts – some examples


              DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP


              $ pig -x local


              $ pig [-debug | -dryrun]


              $ pig -param input=/path/to/small-sample.txt




“Testing” Pig scripts (cont.)

   • JobTracker UI              • PigStats, JobStats,
                                  HadoopJobHistoryLoader



  Now what have you been using?



     Also: inspecting Hadoop log files, …


However…

   • Previous approaches are primarily useful (and used)
     for creating the Pig script in the first place
           • Like ILLUSTRATE
   • None of them are really geared towards unit testing
   • Difficult to automate (think: production environment)
                  #!/bin/bash
                  pig -param date=$1 -param output=$2 myscript.pig
                  hadoop fs -copyToLocal $2 /tmp/jobresult
                  if [ ARGH!!! ] ...


   • Difficult to integrate into a typical development
     workflow, e.g. backed by Maven, Java and a CI server
                  $ mvn clean test              ??

                  Maven is a trademark of the Apache Software Foundation.
PigUnit




PigUnit

   • Available in Pig since version 0.8
              “PigUnit provides a unit-testing framework that plugs into JUnit
              to help you write unit tests that can be run on a regular basis.”
              -- Alan F. Gates, Programming Pig

   • Easy way to add Pig unit testing to your dev workflow
     iff you are a Java developer
           • See “Tips and Tricks” later for working around this constraint
   • Works with both JUnit and TestNG
   • PigUnit docs have “potential”
           • Some basic examples, then it’s looking at the source code of
             both PigUnit and Pig (but it’s manageable)
   • http://pig.apache.org/docs/r0.10.0/test.html#pigunit

Getting PigUnit up and running

   • PigUnit is not included in current Pig releases :(
   • You must manually build the PigUnit jar file

         $ cd /path/to/pig-sources # can be a release tarball
         $ ant jar pigunit-jar
         ...
         $ ls -l pig*jar
         -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
         -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar



   • Add these jar(s) to your CLASSPATH, done!




PigUnit and Maven

   • Unfortunately the Apache Pig project does not yet
     publish an official Maven artifact for PigUnit
                  WILL NOT WORK IN pom.xml :(
                  <dependency>
                      <groupId>org.apache.pig</groupId>
                      <artifactId>pigunit</artifactId>
                      <version>0.10.0</version>
                  </dependency>

   • Alternatives:
           •      Publish to your local Artifactory instance
           •      Use a local file-based <repository>
           •      Use a <system> scope in pom.xml (not recommended)
           •      Use trusted third-party repos like Cloudera’s
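
   • A hedged sketch of the file-based <repository> route: check the pigunit.jar you
     built into a repo/ directory inside your project, then reference it from pom.xml.
     The repository id and path below are illustrative, not an official convention.

     ```xml
     <!-- In pom.xml: resolve pigunit from a repository inside the project tree.
          First publish the jar there, e.g.:
          mvn install:install-file -Dfile=pigunit.jar -DgroupId=org.apache.pig \
              -DartifactId=pigunit -Dversion=0.10.0 -Dpackaging=jar \
              -DlocalRepositoryPath=repo -->
     <repositories>
       <repository>
         <id>project-local</id>
         <url>file://${project.basedir}/repo</url>
       </repository>
     </repositories>

     <dependencies>
       <dependency>
         <groupId>org.apache.pig</groupId>
         <artifactId>pigunit</artifactId>
         <version>0.10.0</version>
       </dependency>
     </dependencies>
     ```

     This keeps the build reproducible for teammates and CI without a shared Artifactory.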


                  Artifactory is a trademark of JFrog Ltd.
A simple PigUnit test




A simple PigUnit test

   • Here, we provide input + output data in the Java code
   • Pig script is read from file wordcount.pig
           @Test
           public void testSimpleExample() throws Exception {
               PigTest simpleTest = new PigTest("wordcount.pig");

               String[] input = { "foo", "bar", "foo" };
               String[] expectedOutput = {
                   "(foo,2)",
                   "(bar,1)"
               };

               simpleTest.assertOutput(
                   "aliasInput", input,
                   "aliasOutput", expectedOutput
               );
           }
A simple PigUnit test (cont.)

   • wordcount.pig
           -- PigUnit populates the alias 'aliasInput'
           -- with the test input data
           aliasInput = LOAD '<tmpLoc>' AS <schema>;

           -- ...here comes your actual code...

           -- PigUnit will treat the contents of the alias
           -- 'aliasOutput' as the actual output data in
           -- the assert statement
           aliasOutput = <your_final_statement>;

           -- Note: PigUnit ignores STORE operations by default
           STORE aliasOutput INTO 'output';
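
   • For illustration, a minimal wordcount.pig matching the test on the previous
     slide might look like this. The one-column schema and the GROUP/COUNT logic
     are my assumptions, not the talk's actual script:

     ```pig
     -- PigUnit redirects this LOAD to a temp file holding the test input
     aliasInput = LOAD '/path/to/input' AS (word:chararray);

     grouped = GROUP aliasInput BY word;

     -- yields tuples like (foo,2) and (bar,1) for input {foo, bar, foo}
     aliasOutput = FOREACH grouped GENERATE group, COUNT(aliasInput);

     -- ignored by PigUnit, used in production runs
     STORE aliasOutput INTO 'output';
     ```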




A simple PigUnit test (cont.)
                   simpleTest.assertOutput(
       1               "aliasInput", input,
       2               "aliasOutput", expectedOutput
                   );



       1          Pig injects input[] = { "foo", "bar", "foo" } into the
                  alias named aliasInput in the Pig script.
                  For this purpose Pig creates a temporary file, writes the
                  equivalent of StringUtils.join(input, "\n") to the file,
                  and finally makes its location available to the LOAD operation.


       2          Pig opens an iterator on the content of aliasOutput, and runs
                  assertEquals() based on StringUtils.join(..., "\n")
                  with expectedOutput and the actual content.

           See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.

PigUnit drawbacks

• How to divide your “main” Pig script into testable units?
       • Only run a single end-to-end test for the full script?
       • Extract testable snippets from the main script?
                  • Argh, code duplication!
       • Split the main script into logical units = smaller scripts; then run
         individual tests and include the smaller scripts in the main script
                  • Ok-ish but splitting too much makes the Pig code hard to
                    understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
       • It does work but it is not as convenient or powerful as you’d like.
                  • Notably you still need to know and write Java to use it. But one
                    compelling reason for Pig is that you can do without Java.
       • You may end up writing your own wrapper/helper lib around it.
                  • Consider contributing this back to the Apache Pig project!


Tips and tricks




Connecting to a real cluster (default: local mode)

     // this is not enough to enable cluster mode in PigUnit
     pigServer = new PigServer(ExecType.MAPREDUCE);
     // ...do PigUnit stuff...

     // rather:
     Properties props = System.getProperties();
     if (clusterMode)
         props.setProperty("pigunit.exectype.cluster", "true");
     else
         props.remove("pigunit.exectype.cluster"); // Properties has no removeProperty()

   • $HADOOP_CONF_DIR must be in CLASSPATH
   • Similar approach for enabling LZO support
           • mapred.output.compress => "true"
           • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
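
   • The property toggling above can be wrapped in a small helper. A sketch: the
     class and method names are mine, the property keys come from this slide, and
     the fully qualified LZO codec class name is an assumption based on the
     hadoop-lzo project (the slide abbreviates it as "c.h.c.lzo.LzopCodec").

     ```java
     import java.util.Properties;

     public class PigUnitClusterConfig {

         // Toggles PigUnit's cluster mode (and, while we're at it, LZO output
         // compression) via the system-property mechanism that PigUnit reads.
         public static void enableClusterMode(Properties props, boolean clusterMode) {
             if (clusterMode) {
                 props.setProperty("pigunit.exectype.cluster", "true");
                 props.setProperty("mapred.output.compress", "true");
                 // Assumed expansion of the abbreviated codec name on the slide
                 props.setProperty("mapred.output.compression.codec",
                         "com.hadoop.compression.lzo.LzopCodec");
             } else {
                 // java.util.Properties has no removeProperty(); use remove()
                 props.remove("pigunit.exectype.cluster");
             }
         }

         public static void main(String[] args) {
             Properties props = new Properties();
             enableClusterMode(props, true);
             System.out.println(props.getProperty("pigunit.exectype.cluster")); // prints "true"
         }
     }
     ```

     In a real test setup you would pass System.getProperties() so that PigUnit
     picks the flag up when it decides between local and cluster execution.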



Write a convenient PigUnit runner for your users

   • Pig user != Java developer
   • Pig users should only need to provide three files:
           •    pig/myscript.pig
           • input/testdata.txt
           • output/expected.txt
   • PigUnit runner discovers and runs tests for users
           • PigTest#assertOutput() can also handle files
           • But you must manage file uploads and similar “glue” yourself

      pigUnitRunner.runPigTest(
          new Path(scriptFile),
          new Path(inputFile),
          new Path(expectedOutputFile)
      );
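
   • One way such a runner might look, building on PigTest's file handling. This
     is a sketch, not PigUnit API: it assumes Pig and PigUnit are on the classpath,
     and that the user's script LOADs from a $input parameter; the class name and
     directory convention are illustrative.

     ```java
     import java.io.File;
     import org.apache.pig.pigunit.PigTest;

     public class PigUnitRunner {

         // Runs one discovered test case: a Pig script, a test input file,
         // and a file with the expected output of the final alias.
         public void runPigTest(File scriptFile, File inputFile, File expectedOutputFile)
                 throws Exception {
             // Assumes the script reads its input via a parameter, e.g.
             //   aliasInput = LOAD '$input' AS (...);
             String[] params = { "input=" + inputFile.getAbsolutePath() };
             PigTest test = new PigTest(scriptFile.getAbsolutePath(), params);
             // assertOutput(File) compares the final alias against the file contents
             test.assertOutput(expectedOutputFile);
         }
     }
     ```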


Slightly off-topic: Java/Pig combo

   • Pig API provides nifty features to control Pig workflows
     through Java
           • Similar to how working with PigUnit feels
   • Definitely worth a look!
   // 'pigParams' is the main glue between Java and Pig here,
   // e.g. to specify the location of input data
   pigServer.registerScript(scriptInputStream, pigParams);

   ExecJob job = pigServer.store(
           "aliasOutput",
           "/path/to/output",
           "PigStorage()"
       );

   if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
       System.out.println("Happy world!");

Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.

More Related Content

What's hot

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 

What's hot (20)

Real-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache StormReal-Time Streaming with Apache Spark Streaming and Apache Storm
Real-Time Streaming with Apache Spark Streaming and Apache Storm
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
DBD::Gofer 200809
DBD::Gofer 200809DBD::Gofer 200809
DBD::Gofer 200809
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)Scaling Apache Storm (Hadoop Summit 2015)
Scaling Apache Storm (Hadoop Summit 2015)
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Overview of Spark for HPC
Overview of Spark for HPCOverview of Spark for HPC
Overview of Spark for HPC
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 

Viewers also liked

Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya GargBig Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
QA or the Highway
 

Viewers also liked (14)

Unit testing pig
Unit testing pigUnit testing pig
Unit testing pig
 
Coscup 2013 : Continuous Integration on top of hadoop
Coscup 2013 : Continuous Integration on top of hadoopCoscup 2013 : Continuous Integration on top of hadoop
Coscup 2013 : Continuous Integration on top of hadoop
 
How to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest GroupHow to Test Big Data Systems | QualiTest Group
How to Test Big Data Systems | QualiTest Group
 
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
Feb 2013 HUG: HIT (Hadoop Integration Testing) for Automated Certification an...
 
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya GargBig Data - Hadoop and MapReduce for QA and testing by Aditya Garg
Big Data - Hadoop and MapReduce for QA and testing by Aditya Garg
 
Scalding for Hadoop
Scalding for HadoopScalding for Hadoop
Scalding for Hadoop
 
Introduction to Big data tdd and pig unit
Introduction to Big data tdd and pig unitIntroduction to Big data tdd and pig unit
Introduction to Big data tdd and pig unit
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Scalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of codeScalding - Hadoop Word Count in LESS than 70 lines of code
Scalding - Hadoop Word Count in LESS than 70 lines of code
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and Hadoop
 
Big Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit KharabeBig Data Testing Approach - Rohit Kharabe
Big Data Testing Approach - Rohit Kharabe
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 

Similar to Practical Pig and PigUnit (Michael Noll, Verisign)

Similar to Practical Pig and PigUnit (Michael Noll, Verisign) (20)

Testing your puppet code
Testing your puppet codeTesting your puppet code
Testing your puppet code
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
How I hack on puppet modules
How I hack on puppet modulesHow I hack on puppet modules
How I hack on puppet modules
 
Pipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as CodePipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as Code
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Puppet and the HashiCorp Suite
Puppet and the HashiCorp SuitePuppet and the HashiCorp Suite
Puppet and the HashiCorp Suite
 
From SaltStack to Puppet and beyond...
From SaltStack to Puppet and beyond...From SaltStack to Puppet and beyond...
From SaltStack to Puppet and beyond...
 
Puppet Development Workflow
Puppet Development WorkflowPuppet Development Workflow
Puppet Development Workflow
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
Continuous Infrastructure: Modern Puppet for the Jenkins Project - PuppetConf...
 
Using the puppet debugger for lightweight exploration
Using the puppet debugger for lightweight explorationUsing the puppet debugger for lightweight exploration
Using the puppet debugger for lightweight exploration
 
ASP.NET 5 auf Raspberry PI & docker
ASP.NET 5 auf Raspberry PI & dockerASP.NET 5 auf Raspberry PI & docker
ASP.NET 5 auf Raspberry PI & docker
 
Arbeiten mit distribute, pip und virtualenv
Arbeiten mit distribute, pip und virtualenvArbeiten mit distribute, pip und virtualenv
Arbeiten mit distribute, pip und virtualenv
 
Django dev-env-my-way
Django dev-env-my-wayDjango dev-env-my-way
Django dev-env-my-way
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Virtualenv
VirtualenvVirtualenv
Virtualenv
 
How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2How to deploy spark instance using ansible 2.0 in fiware lab v2
How to deploy spark instance using ansible 2.0 in fiware lab v2
 
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE LabHow to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
How to Deploy Spark Instance Using Ansible 2.0 in FIWARE Lab
 
PyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deploymentPyParis2018 - Python tooling for continuous deployment
PyParis2018 - Python tooling for continuous deployment
 
Puppet getting started by Dirk Götz
Puppet getting started by Dirk GötzPuppet getting started by Dirk Götz
Puppet getting started by Dirk Götz
 

More from Swiss Big Data User Group

Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
Swiss Big Data User Group
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
Swiss Big Data User Group
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
Swiss Big Data User Group
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
Swiss Big Data User Group
 

More from Swiss Big Data User Group (20)

Making Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to useMaking Hadoop based analytics simple for everyone to use
Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Practical Pig and PigUnit (Michael Noll, Verisign)

  • 1. Practical Pig + PigUnit Michael G. Noll, Verisign July 2012
  • 2. This talk is about Apache Pig • High-level data flow language (think: DSL) for writing Hadoop MapReduce jobs • Why and when should you care about Pig? • You are an Hadoop beginner • … and want to implement a JOIN, for instance • You are an Hadoop expert • You only scratch your head when you see public static void main(String args...) • You think Java is not the best tool for this job [pun!] • Think: too low-level, too many lines of code, no interactive mode for exploratory analysis, readability > performance, et cetera Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation. Verisign Public Java is a trademark of Oracle Corporation. 2
• 3. A basic Pig script
     • Example: sorting user records by users’ age

       records = LOAD '/path/to/input'
                    AS (user:chararray, age:int);
       sorted_records = ORDER records BY age DESC;
       STORE sorted_records INTO '/path/to/output';

     • Popular alternatives to Pig
        • Hive: ~ SQL for Hadoop
        • Hadoop Streaming: use any programming language for MR
           • Even though you still write code in a "real" programming language, Streaming provides an environment that makes it more convenient than native Hadoop Java code.
• 4. Preliminaries
     • Talk is based on Pig 0.10.0, released in April ’12
     • Some notable 0.10.0 improvements
        • Hadoop 2.0 support
        • Loading and storing JSON
        • Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
        • Amazon S3 support
• 5. Testing Pig – a primer
• 6. “Testing” Pig scripts – some examples

     DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP

     $ pig -x local
     $ pig [-debug | -dryrun]
     $ pig -param input=/path/to/small-sample.txt
• 7. “Testing” Pig scripts (cont.)
     • JobTracker UI
     • PigStats, JobStats, HadoopJobHistoryLoader
     • Also: inspecting Hadoop log files, …
     Now what have you been using?
• 8. However…
     • Previous approaches are primarily useful (and used) for creating the Pig script in the first place
        • Like ILLUSTRATE
     • None of them are really geared towards unit testing
     • Difficult to automate (think: production environment)

       #!/bin/bash
       pig -param date=$1 -param output=$2 myscript.pig
       hadoop fs -copyToLocal $2 /tmp/jobresult
       if [ ARGH!!! ] ...

     • Difficult to integrate into a typical development workflow, e.g. backed by Maven, Java and a CI server

       $ mvn clean test   ??

     Apache Maven is a trademark of the Apache Software Foundation.
• 10. PigUnit
     • Available in Pig since version 0.8
       “PigUnit provides a unit-testing framework that plugs into JUnit to help you write unit tests that can be run on a regular basis.”
       -- Alan F. Gates, Programming Pig
     • Easy way to add Pig unit testing to your dev workflow iff you are a Java developer
        • See “Tips and Tricks” later for working around this constraint
     • Works with both JUnit and TestNG
     • PigUnit docs have “potential”
        • Some basic examples, then it’s looking at the source code of both PigUnit and Pig (but it’s manageable)
        • http://pig.apache.org/docs/r0.10.0/test.html#pigunit
• 11. Getting PigUnit up and running
     • PigUnit is not included in current Pig releases :(
     • You must manually build the PigUnit jar file

       $ cd /path/to/pig-sources   # can be a release tarball
       $ ant jar pigunit-jar
       ...
       $ ls -l pig*jar
       -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
       -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar

     • Add these jar(s) to your CLASSPATH, done!
• 12. PigUnit and Maven
     • Unfortunately the Apache Pig project does not yet publish an official Maven artifact for PigUnit

       WILL NOT WORK IN pom.xml :(
       <dependency>
         <groupId>org.apache.pig</groupId>
         <artifactId>pigunit</artifactId>
         <version>0.10.0</version>
       </dependency>

     • Alternatives:
        • Publish to your local Artifactory instance
        • Use a local file-based <repository>
        • Use a <system> scope in pom.xml (not recommended)
        • Use trusted third-party repos like Cloudera’s
     Artifactory is a trademark of JFrog ltd.
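To make the file-based <repository> alternative concrete, here is a hedged sketch (not from the deck): install the hand-built pigunit.jar into a repository directory inside your project, then point the pom at it. The directory name `repo` and the repository id are our own choices; the groupId/artifactId mirror the dependency shown above.

```xml
<!-- Sketch, assuming you first install the jar into ./repo, e.g.:
     mvn install:install-file -Dfile=pigunit.jar -DgroupId=org.apache.pig \
         -DartifactId=pigunit -Dversion=0.10.0 -Dpackaging=jar \
         -DlocalRepositoryPath=repo
-->
<repositories>
  <repository>
    <id>project-local</id>
    <url>file://${project.basedir}/repo</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.10.0</version>
  </dependency>
</dependencies>
```

Checking `repo/` into version control keeps the build reproducible for teammates and the CI server without a shared Artifactory instance.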
• 13. A simple PigUnit test
• 14. A simple PigUnit test
     • Here, we provide input + output data in the Java code
     • Pig script is read from file wordcount.pig

       @Test
       public void testSimpleExample() {
         PigTest simpleTest = new PigTest("wordcount.pig");
         String[] input = { "foo", "bar", "foo" };
         String[] expectedOutput = { "(foo,2)", "(bar,1)" };
         simpleTest.assertOutput(
             "aliasInput", input,
             "aliasOutput", expectedOutput);
       }
• 15. A simple PigUnit test (cont.)
     • wordcount.pig

       -- PigUnit populates the alias 'aliasInput'
       -- with the test input data
       aliasInput = LOAD '<tmpLoc>' AS <schema>;

       -- ...here comes your actual code...

       -- PigUnit will treat the contents of the alias
       -- 'aliasOutput' as the actual output data in
       -- the assert statement
       aliasOutput = <your_final_statement>;

       -- Note: PigUnit ignores STORE operations by default
       STORE aliasOutput INTO 'output';
• 16. A simple PigUnit test (cont.)

       simpleTest.assertOutput(
           "aliasInput", input,              // 1
           "aliasOutput", expectedOutput);   // 2

     1. Pig injects input[] = { "foo", "bar", "foo" } into the alias named aliasInput in the Pig script. For this purpose Pig creates a temporary file, writes the equivalent of StringUtils.join(input, "\n") to the file, and finally makes its location available to the LOAD operation.
     2. Pig opens an iterator on the content of aliasOutput, and runs assertEquals() based on StringUtils.join(..., "\n") with expectedOutput and the actual content.
     See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
• 17. PigUnit drawbacks
     • How to divide your “main” Pig script into testable units?
        • Only run a single end-to-end test for the full script?
        • Extract testable snippets from the main script? Argh, code duplication!
        • Split the main script into logical units = smaller scripts; then run individual tests and include the smaller scripts in the main script
           • Ok-ish, but splitting too much makes the Pig code hard to understand (too many trees, no forest).
     • PigUnit is a nice tool but batteries are not included
        • It does work but it is not as convenient or powerful as you’d like.
        • Notably you still need to know and write Java to use it. But one compelling reason for Pig is that you can do without Java.
        • You may end up writing your own wrapper/helper lib around it.
        • Consider contributing this back to the Apache Pig project!
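One concrete way to do the “split into smaller scripts” option (our illustration, not from the deck) is Pig’s macro support plus IMPORT, available since Pig 0.9. The file names and alias names below are our own; the macro file is the smaller, individually testable unit:

```pig
-- counting.macro (hypothetical file): a reusable, testable unit
DEFINE count_words(lines) RETURNS counts {
    words   = FOREACH $lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    $counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
};

-- main.pig: include the smaller script in the main script
IMPORT 'counting.macro';
lines  = LOAD '/path/to/input' AS (line:chararray);
counts = count_words(lines);
STORE counts INTO '/path/to/output';
```

A thin wrapper script around the macro can then serve as the PigUnit test target, while main.pig stays readable.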
• 19. Connecting to a real cluster (default: local mode)

       // this is not enough to enable cluster mode in PigUnit
       pigServer = new PigServer(ExecType.MAPREDUCE);
       // ...do PigUnit stuff...

       // rather:
       Properties props = System.getProperties();
       if (clusterMode)
         props.setProperty("pigunit.exectype.cluster", "true");
       else
         props.remove("pigunit.exectype.cluster");

     • $HADOOP_CONF_DIR must be in CLASSPATH
     • Similar approach for enabling LZO support
        • mapred.output.compress => "true"
        • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
• 20. Write a convenient PigUnit runner for your users
     • Pig user != Java developer
     • Pig users should only need to provide three files:
        • pig/myscript.pig
        • input/testdata.txt
        • output/expected.txt
     • PigUnit runner discovers and runs tests for users
        • PigTest#assertOutput() can also handle files
        • But you must manage file uploads and similar “glue” yourself

          pigUnitRunner.runPigTest(
              new Path(scriptFile),
              new Path(inputFile),
              new Path(expectedOutputFile));
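The discovery half of such a runner needs no Pig at all. Below is a hedged sketch (our code, not the deck’s): it pairs each pig/<name>.pig with input/<name>.txt and output/<name>.txt under a base directory, skipping incomplete test cases. The class name, layout, and matching-by-stem convention are all assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: discover (script, input, expected-output) triples
// so that non-Java Pig users only have to drop in three files.
class PigTestDiscovery {

    static class TestCase {
        final File script, input, expectedOutput;
        TestCase(File script, File input, File expectedOutput) {
            this.script = script;
            this.input = input;
            this.expectedOutput = expectedOutput;
        }
    }

    // For every pig/<name>.pig, look for input/<name>.txt and
    // output/<name>.txt; test cases missing either file are skipped.
    static List<TestCase> discover(File baseDir) {
        List<TestCase> cases = new ArrayList<TestCase>();
        File[] scripts = new File(baseDir, "pig").listFiles();
        if (scripts == null) return cases;
        for (File script : scripts) {
            String name = script.getName();
            if (!name.endsWith(".pig")) continue;
            String stem = name.substring(0, name.length() - ".pig".length());
            File input = new File(new File(baseDir, "input"), stem + ".txt");
            File expected = new File(new File(baseDir, "output"), stem + ".txt");
            if (input.isFile() && expected.isFile()) {
                cases.add(new TestCase(script, input, expected));
            }
        }
        return cases;
    }
}
```

Each discovered triple would then be handed to the file-based PigTest#assertOutput() call shown above.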
• 21. Slightly off-topic: Java/Pig combo
     • Pig API provides nifty features to control Pig workflows through Java
        • Similar to how working with PigUnit feels
        • Definitely worth a look!

       // 'pigParams' is the main glue between Java and Pig here,
       // e.g. to specify the location of input data
       pigServer.registerScript(scriptInputStream, pigParams);
       ExecJob job = pigServer.store(
           "aliasOutput", "/path/to/output", "PigStorage()");
       if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
         System.out.println("Happy world!");
• 22. Thank You
     © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners.