Apache Pig – Big Data Analytics
Prepared by: K H Hari Priya
Subject: Big Data Analytics
Introduction to Apache Pig
• Apache Pig is a high-level data flow platform for
executing MapReduce programs on Hadoop.
• Developed by Yahoo to simplify data analysis
tasks.
• Uses a scripting language called Pig Latin.
• Pig scripts are automatically converted into
MapReduce jobs and executed on HDFS.
• Handles structured, semi-structured, and
unstructured data.
Features of Apache Pig
• Rich set of operators: Join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL.
• Optimization: Automatically optimizes tasks.
• Extensibility: Users can define custom functions
(UDFs).
• Supports all kinds of data (structured,
semi/unstructured).
• Stores output in HDFS.
Advantages of Apache Pig
• Less code – concise scripts compared to Java
MapReduce.
• Reusability – flexible and easy to reuse code.
• Supports nested data types like tuple, bag, and
map.
• Efficient execution on large datasets.
• Portable across different Hadoop
environments.
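The conciseness claim can be illustrated with a classic word count, which takes dozens of lines in Java MapReduce but only a few in Pig Latin. This is a hedged sketch; the file name 'lines.txt' and the field name are assumptions, not from the original slides:

```pig
-- Hypothetical word count; 'lines.txt' is an assumed input file.
lines   = LOAD 'lines.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
DUMP counts;
```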
Architecture of Apache Pig
• Pig Latin scripts are converted into MapReduce jobs
internally.
• Main Components:
1. Parser – checks syntax and generates logical plan
(DAG).
2. Optimizer – applies logical optimizations.
3. Compiler – converts optimized plan into MapReduce
jobs.
4. Execution Engine – executes jobs on Hadoop.
• Execution modes: Local mode / MapReduce mode.
Pig Latin Data Model
Pig Latin supports complex nested data models:
• Atom – single value like int, float, or string (e.g.,
'30', 'Raja')
• Tuple – ordered set of fields (e.g., (Raja, 30))
• Bag – collection of tuples (e.g., {(Raja,30),
(Mohammad,45)})
• Map – key-value pairs (e.g., [name#Raja, age#30])
• Relation – bag of tuples (like a table).
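These types nest: a relation's tuples can themselves contain bags and maps. A minimal illustrative schema (the file name and field names are assumptions) might look like:

```pig
-- Illustrative nested schema: each tuple holds an atom, a bag of
-- (subject, score) tuples, and a map of contact details.
students = LOAD 'students.txt' AS (
    name:chararray,                                      -- atom
    marks:bag{t:tuple(subject:chararray, score:int)},    -- bag of tuples
    contact:map[chararray]                               -- key-value pairs
);
DESCRIBE students;    -- prints the schema of the relation
```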
Grunt Shell in Pig
• Grunt Shell is the interactive shell for running Pig Latin commands.
Modes:
- Local Mode: $ ./pig -x local
- MapReduce Mode: $ ./pig -x mapreduce
Commands:
• sh – execute Linux shell commands
• fs – execute HDFS commands
• clear – clear screen
• history – view previous commands
• exec – run Pig scripts
• quit – exit shell
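A short Grunt session showing these commands might look as follows (the paths and script name are placeholders):

```pig
-- Start Grunt in local mode with:  $ pig -x local
grunt> sh ls
grunt> fs -ls /user
grunt> exec sample.pig
grunt> quit
```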
Pig Latin Data Types
Basic Data Types:
• int, long, float, double
• chararray (string), bytearray, boolean,
datetime
Complex Data Types:
• Tuple – ordered set of fields
• Bag – collection of tuples
• Map – key-value pairs
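A schema in a LOAD statement names these types explicitly. The following is a sketch only; the file name, field names, and the assumption that the datetime column is in a parseable format are illustrative:

```pig
-- Assumed input: comma-separated file with one employee per line.
emp = LOAD 'emp.txt' USING PigStorage(',') AS (
    id:int,
    name:chararray,
    salary:double,
    joined:datetime,                        -- assumes ISO-formatted dates
    skills:bag{t:tuple(skill:chararray)}    -- complex type in the schema
);
DESCRIBE emp;
```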
Operators in Pig Latin
• Arithmetic Operators: +, -, *, /, %, and the
bincond (conditional) operator ?:
• Comparison Operators: ==, !=, >, <, >=, <=,
matches (regular-expression match)
• Relational Operators:
• LOAD – load data
• STORE – save data
• FILTER – remove unwanted rows
• DISTINCT – remove duplicates
• JOIN / GROUP – combine or group data
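The relational operators above can be chained in a single script. This sketch assumes hypothetical input files 'emp.txt' and 'dept.txt' with the fields shown:

```pig
-- Assumed inputs: comma-separated employee and department files.
emp  = LOAD 'emp.txt'  USING PigStorage(',')
       AS (id:int, name:chararray, age:int, dept:chararray);
dept = LOAD 'dept.txt' USING PigStorage(',')
       AS (dept:chararray, city:chararray);

seniors = FILTER emp BY age > 25;               -- remove unwanted rows
unique  = DISTINCT seniors;                     -- remove duplicate tuples
by_dept = GROUP unique BY dept;                 -- one bag of tuples per dept
joined  = JOIN unique BY dept, dept BY dept;    -- inner join on dept
STORE joined INTO 'out' USING PigStorage(',');  -- save result to HDFS
```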
Executing Pig Scripts
• Scripts can be executed:
1. In Local Mode
2. In MapReduce Mode
3. From Grunt Shell using ‘exec’ command
Example:
Employee = LOAD 'Employee.txt' USING PigStorage(',') AS
(id:int, name:chararray, age:int);
Ordered = ORDER Employee BY age DESC;
Limited = LIMIT Ordered 4;
DUMP Limited;
Testing Pig Scripts with PigUnit
• PigUnit enables unit testing of Pig scripts using the JUnit framework.
• Helps in rapid prototyping and regression testing.
• Can run in local mode (no cluster needed).
Steps:
1. Install Maven and the Pig Eclipse plugin.
2. Write JUnit class with PigTest object.
3. Use assertOutput() to compare expected and actual output.
Example:
pigTest.assertOutput("D", output);
Right-click → Run As → JUnit Test.
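The alias passed to assertOutput() names a relation in the script under test. A minimal script whose final alias is D might look like this (the file name and fields are assumptions for illustration):

```pig
-- Hypothetical script under test; its last alias is D.
A = LOAD 'input.txt' AS (name:chararray, n:int);
B = FILTER A BY n > 0;
C = GROUP B BY name;
D = FOREACH C GENERATE group AS name, COUNT(B) AS total;
-- In the JUnit class, pigTest.assertOutput("D", output) compares the
-- tuples produced for alias D against the expected strings in `output`.
```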
Summary
• Apache Pig simplifies MapReduce
programming.
• Pig Latin is a powerful data flow language.
• Supports complex data types and automatic
optimization.
• Ideal for data transformation, filtering, and
analysis on Hadoop.
• Testing made easy using PigUnit.
