Pig and Pig Latin
What is Pig?
• Apache Pig is a Hadoop platform for creating MapReduce jobs. Pig uses a high-level,
SQL-like programming language named Pig Latin.
The benefits of Pig include:
• Run a MapReduce job with a few simple lines of code.
• Process structured data with a schema, or unstructured data without one.
(Pigs eat anything!)
• Pig Latin uses a familiar SQL-like syntax.
• Pig scripts read and write data from HDFS.
• Pig Latin is a data flow language, which makes it a natural fit for many MapReduce algorithms.
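To make the "few simple lines of code" point concrete, here is a minimal word-count sketch in Pig Latin. The input and output paths are hypothetical, and the relation names are chosen only for illustration.

```pig
-- Hypothetical paths; adjust to your own HDFS layout.
lines   = LOAD 'input/lines.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN turns the bag into rows.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Each step names a new relation, so the script reads as a data flow.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'output/wordcount';
```

Pig compiles a script like this into one or more MapReduce jobs, so no mapper or reducer classes need to be written by hand.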
Pig Latin
• Pig Latin is a high-level data flow scripting language.
Pig Latin scripts can be executed in one of three ways:
• Pig script: write a Pig Latin program in a text file and execute it using the pig
executable.
• Grunt shell: enter Pig statements manually, one at a time, from a CLI tool known
as the Grunt interactive shell.
• Embedded in Java: use the PigServer class to execute a Pig query from within Java
code.
The Grunt Shell
• Grunt is an interactive shell that enables users to enter Pig Latin statements
and also interact with HDFS.
• To enter the Grunt shell, run the pig executable in the bin folder under PIG_HOME.
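As a rough illustration, a short Grunt session might look like the sketch below. The statements are typed at the grunt> prompt one at a time; the file path and field names are assumptions made for this example.

```pig
-- HDFS interaction: fs passes its arguments to the Hadoop file system shell.
fs -ls input
-- A Pig Latin statement; nothing runs until output is requested.
records = LOAD 'input/sample.txt' AS (year:chararray, temperature:int, quality:int);
-- Show the schema of the relation.
DESCRIBE records;
-- Run the data flow and print the result to the console.
DUMP records;
-- Leave the shell.
quit
```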
Pig Latin Types
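A schema in a LOAD statement is one place where Pig Latin types appear. The sketch below declares a few scalar types (chararray, int, double) and two complex types (a bag of tuples and a map); the file name and field names are hypothetical.

```pig
-- Hypothetical input file; each field is given a name and a type.
students = LOAD 'input/students.txt'
    AS (name:chararray,                  -- scalar: string
        age:int,                         -- scalar: 32-bit integer
        gpa:double,                      -- scalar: double-precision float
        scores:bag{t:tuple(score:int)},  -- complex: bag of tuples
        info:map[]);                     -- complex: map with chararray keys
DESCRIBE students;
```

Fields declared without a type default to bytearray.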
Functions
• Functions in Pig come in four types:
• Eval function: A function that takes one or more expressions and returns another
expression.
• Filter function: A special type of eval function that returns a logical Boolean result.
• Load function: A function that specifies how to load data into a relation from
external storage.
• Store function: A function that specifies how to save the contents of a relation to
external storage.
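The four function types can be seen together in one short script. This is a sketch using built-in functions (PigStorage, IsEmpty, MAX); the paths and field names are assumptions for illustration.

```pig
-- Load function: PigStorage parses tab-delimited text into fields.
records = LOAD 'input/sample.txt' USING PigStorage('\t')
    AS (year:chararray, temperature:int, quality:int);
grouped = GROUP records BY year;
-- Filter function: IsEmpty returns a Boolean, so it can be used in FILTER.
nonempty = FILTER grouped BY NOT IsEmpty(records);
-- Eval function: MAX takes a bag expression and returns a single value.
max_temp = FOREACH nonempty GENERATE group AS year,
           MAX(records.temperature) AS max_temperature;
-- Store function: PigStorage writes the relation as comma-delimited text.
STORE max_temp INTO 'output/max_temp' USING PigStorage(',');
```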
Eval Function
Filter, Load, Store Functions
Data Processing Operators
• Loading and Storing Data
• Filtering Data
• Grouping and Joining Data
• Combining Data
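The operator categories above can be sketched in a single script: LOAD and STORE for loading and storing, FILTER for filtering, GROUP and JOIN for grouping and joining, and UNION for combining. All file names and fields here are hypothetical.

```pig
-- Loading data
users    = LOAD 'input/users.txt' USING PigStorage(',')
           AS (id:int, name:chararray, age:int);
orders_a = LOAD 'input/orders_a.txt' USING PigStorage(',')
           AS (user_id:int, amount:double);
orders_b = LOAD 'input/orders_b.txt' USING PigStorage(',')
           AS (user_id:int, amount:double);
-- Combining data: UNION concatenates two relations with the same schema.
orders   = UNION orders_a, orders_b;
-- Filtering data
adults   = FILTER users BY age >= 18;
-- Joining data: an inner join on the user id.
joined   = JOIN adults BY id, orders BY user_id;
-- Grouping data, then aggregating each group.
by_user  = GROUP joined BY name;
totals   = FOREACH by_user GENERATE group AS name, SUM(joined.amount) AS total;
-- Storing data
STORE totals INTO 'output/totals' USING PigStorage(',');
```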
User-Defined Functions: Filter UDF
• Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc.
• A filter UDF overrides exec(), EvalFunc's only abstract method, and returns a Boolean.
Filter UDF (continued)
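import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

// Accepts a record only when its first field is one of the listed quality codes.
public class IsGoodQuality extends FilterFunc {
  @Override
  public Boolean exec(Tuple tuple) throws IOException {
    if (tuple == null || tuple.size() == 0) {
      return false;
    }
    try {
      Object object = tuple.get(0);
      if (object == null) {
        return false;
      }
      int i = (Integer) object;
      return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
    } catch (ExecException e) {
      throw new IOException(e);
    }
  }
}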
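Once compiled and packaged, a filter UDF like this can be called from Pig Latin. The sketch below assumes the class sits in the default package and has been packaged into a jar named pig-udfs.jar (a hypothetical name); if the class lives in a package, use its fully qualified name instead. The quality field is declared as int because exec() casts its argument to Integer; an untyped field would arrive as a bytearray and the cast would fail.

```pig
-- Hypothetical jar containing the compiled IsGoodQuality class.
REGISTER 'pig-udfs.jar';
records = LOAD 'input/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
-- The UDF is applied to each record; only rows where it returns true are kept.
good = FILTER records BY IsGoodQuality(quality);
DUMP good;
```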
Exploring Data with Apache Pig from the Grunt shell
LAB

Pig and Pig Latin - Module 5
