What’s in it for you?
1. Why Pig?
2. What is Pig?
3. MapReduce vs Hive vs Pig
4. Pig architecture
5. Working of Pig
6. Pig Latin data model
7. Pig Execution modes
8. Use case – Twitter
9. Features of Pig
Let’s get started with Pig!
Why Pig?
As we all know, Hadoop uses MapReduce to analyze and process big data.
Before: processing big data consumed more time. After: processing big data was faster using MapReduce.
Then, what is the problem with MapReduce?
Prior to 2006, all MapReduce programs were written in Java.
Non-programmers found it difficult to write lengthy Java code.
They faced issues incorporating the map, sort, and reduce fundamentals of MapReduce (map phase, shuffle and sort, reduce phase) while creating a program.
Eventually, maintaining and optimizing the code became a difficult task, which increased the processing time.
Why Pig?
Problem: Yahoo faced problems processing and analyzing large datasets using Java, as the code was complex and lengthy.
Necessity: There was a need for an easier way to analyze large datasets without writing time-consuming, complex Java code.
Solution:
• Apache Pig was developed by Yahoo researchers.
• It was developed with a vision to analyze and process large datasets without using complex Java code; Pig was developed especially for non-programmers.
• Pig used simple steps to analyze datasets, which was time-efficient.
What is Pig?
Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large datasets.
It uses SQL-like queries to analyze data.
Pig operates on various types of data: structured, semi-structured, and unstructured.
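To make this concrete, here is a minimal Pig Latin sketch of such a script; the file path, delimiter, and field names are illustrative assumptions, not part of the original deck.

    -- Load a tab-separated log file from HDFS (path and schema are assumed)
    logs = LOAD '/data/weblogs' USING PigStorage('\t')
           AS (user:chararray, url:chararray, bytes:int);
    -- SQL-like filtering without writing any MapReduce code by hand
    big_requests = FILTER logs BY bytes > 1024;
    -- Print the result to the console
    DUMP big_requests;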
MapReduce vs Hive vs Pig
MapReduce: a compiled language (Java); you need to write long, complex code; lower level of abstraction; can process structured, semi-structured, and unstructured data.
Hive: an SQL-like query language; no need to write complex code; higher level of abstraction; can process only structured data.
Pig: a scripting language; no need to write complex code; higher level of abstraction; can process structured, semi-structured, and unstructured data. This is the advantage Pig has over Hive.

MapReduce: uses Java and Python; code performance is good; used by programmers; supports the partitioning feature.
Hive: uses an SQL-like query language known as HiveQL; code performance is lower than MapReduce and Pig; used by data analysts; supports the partitioning feature.
Pig: uses Pig Latin, a procedural data-flow language; code performance is lower than MapReduce but better than Hive; used by researchers and programmers; there is no concept of partitioning in Pig.
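As a quick illustration of the "no need to write complex code" point, here is the classic word count in Pig Latin; in plain MapReduce the same job typically requires a full Java program with mapper, reducer, and driver classes. The input path is an assumption.

    -- Word count in a few lines of Pig Latin
    lines  = LOAD '/data/input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grps   = GROUP words BY word;
    counts = FOREACH grps GENERATE group AS word, COUNT(words) AS cnt;
    DUMP counts;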
Components of Pig
Pig has two components: Pig Latin and the runtime engine.
Pig Latin is the procedural data-flow language used in Pig to analyze data. It is easy to program in Pig Latin, as it is similar to SQL.
The runtime engine is the execution environment created to run Pig Latin programs. It is also a compiler that produces MapReduce programs, and it uses HDFS for storing and retrieving data.
Pig architecture
There are three ways to execute a written Pig script: Pig Latin scripts, the Grunt shell, and the Pig Server.
Pig Latin scripts: programmers write a script in Pig Latin to analyze data using Pig.
Grunt shell: Pig’s interactive shell, which is used to execute all Pig scripts.
Pig Server: if the Pig script is written in a script file, the execution is done by the Pig Server.
Parser: checks the syntax of the Pig script. After checking, the output is a DAG (Directed Acyclic Graph).
Optimizer: the DAG (logical plan) is passed to the logical optimizer, where optimizations take place.
Compiler: converts the optimized DAG into MapReduce jobs.
Execution engine: the MapReduce jobs are executed here, over MapReduce and HDFS. The results are displayed using the DUMP statement and stored in HDFS using the STORE statement.
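The sketch below shows how a script exercises this pipeline; the dataset path and schema are assumptions. EXPLAIN prints the logical, physical, and MapReduce plans produced by the parser, optimizer, and compiler, while DUMP and STORE trigger actual execution.

    -- Load and filter a small employee dataset (path and fields assumed)
    emp = LOAD '/data/employees' USING PigStorage(',')
          AS (id:int, name:chararray, dept:chararray);
    it_staff = FILTER emp BY dept == 'IT';

    EXPLAIN it_staff;                       -- inspect the generated plans
    DUMP it_staff;                          -- execute and display on screen
    STORE it_staff INTO '/output/it_staff'; -- or write the result to HDFS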
Working of Pig
1. Load data and write the Pig script: the Pig Latin script is written by the user.
2. Pig operations: all the Pig operations are performed by the parser, optimizer, and compiler.
3. Execution of the plan: the results are shown on the screen or stored in HDFS, as specified in the code.
Pig Latin data model
The data model of Pig Latin helps Pig handle various types of data.
Atom: any single value of a primitive data type in Pig Latin, such as int, float, or string; it is stored as a string. Example: ‘Rob’ or 50.
Tuple: an ordered sequence of fields that can be of any data type; it is the same as a row in an RDBMS, i.e. a set of data from a single row. Example: (Rob,5).
Bag: a collection of tuples; it is the same as a table in an RDBMS and is represented by ‘{}’. Example: {(Rob,5),(Mike,10)}.
Map: a set of key-value pairs; the key is of chararray type and the value can be of any type. It is represented by ‘[]’. Example: [name#Mike, age#10].
Pig Latin has a fully nestable data model, which means one data type can be nested within another.

Here is a diagrammatic representation of the Pig Latin data model:

Sl. no   Name   Age   Place
01       Jack   23    Goa
02       Bob    25    London
03       Joe    29    California

Each cell (e.g. Jack or 23) is a field, each row is a tuple, and the whole collection of rows is a bag.
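The same types can be declared directly in a LOAD schema, as in the sketch below; the file name and field names are assumptions made for illustration.

    -- One field of each kind from the data model above
    players = LOAD '/data/players' AS (
        name:chararray,                              -- atom
        best_score:tuple(game:chararray, pts:int),   -- tuple
        friends:bag{t:(fname:chararray)},            -- bag of tuples
        info:map[]                                   -- map of key-value pairs
    );
    DESCRIBE players;                                -- print the nested schema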
Pig Execution modes
Pig works in two execution modes, depending on where the data resides and where the Pig script is going to run: Local mode and MapReduce mode.
Local mode: the Pig engine takes input from the local (Linux) file system, and the output is stored in the same file system. Local mode is useful for analyzing small datasets with Pig.
MapReduce mode: the Pig engine interacts directly with HDFS and MapReduce. Queries written in Pig Latin are translated into MapReduce jobs and run on a Hadoop cluster. By default, Pig runs in this mode.

There are also three modes in Pig, depending on how the Pig Latin code is written:
Interactive mode: coding and executing the script line by line in the interactive shell.
Batch mode: all the code is placed in a file with the extension .pig, and the file is executed directly.
Embedded mode: Pig lets its users define their own functions (UDFs) in programming languages such as Java.
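For example, a batch-mode script might look like the sketch below (the file name, input path, and schema are assumptions). The same statements can be typed one at a time in the Grunt shell for interactive mode, and the file can be run with "pig -x local daily_report.pig" in local mode or simply "pig daily_report.pig" in the default MapReduce mode.

    -- daily_report.pig: a small batch-mode script
    sales  = LOAD 'sales.csv' USING PigStorage(',') AS (day:chararray, amount:double);
    by_day = GROUP sales BY day;
    totals = FOREACH by_day GENERATE group AS day, SUM(sales.amount) AS total;
    STORE totals INTO 'daily_totals';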
Use case – Twitter
Users on Twitter generate about 500 million tweets on a daily basis. Hadoop MapReduce was used to process and analyze this data: analyzing the number of tweets created by a user in the tweet table was done with MapReduce written in Java.
It was difficult to perform MapReduce operations, as users were not well versed in writing complex Java code.
The problems Twitter faced while analyzing datasets using MapReduce were joining datasets, sorting datasets, and grouping datasets. Performing these operations in MapReduce consumed more time, since the Java code was lengthy and complex. Twitter used Apache Pig to overcome these problems. Let’s see how.

Problem statement: analyze the user table and the tweet table and find out how many tweets are created by each person.

User Table            Tweet Table
ID   Name             ID   Tweet
1    Alice            1    Google...
2    Tim              2    Tennis...
3    John             1    Spacecraft...
                      3    Oscar...
                      1    Politics...
                      2    Olympics...

The following operations were performed to analyze the given data.

First, the Twitter data is loaded into Pig storage using the LOAD command.

In the join and group operation, the tweet and user tables are joined and grouped using the COGROUP command:

ID   Name    Tweet
1    Alice   Google...
1    Alice   Spacecraft...
1    Alice   Politics...
2    Tim     Tennis...
2    Tim     Olympics...
3    John    Oscar...

The next operation is aggregation: the tweets are counted per user ID using the COUNT command:

ID   Count
1    3
2    2
3    1

Finally, the result of the count operation is joined with the user table to find out the user name:

ID   Name    Count
1    Alice   3
2    Tim     2
3    John    1

Pig reduces the complexity of these operations, which would have been lengthier using MapReduce. Finally, we could find out the number of tweets created by each user in a simple way.
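A hedged Pig Latin sketch of this pipeline is shown below; the input paths, delimiters, and output location are assumptions, but the operators (LOAD, COGROUP, COUNT, JOIN) are the ones named above.

    -- Load the user and tweet tables (paths and format assumed)
    users  = LOAD '/data/users'  USING PigStorage(',') AS (id:int, name:chararray);
    tweets = LOAD '/data/tweets' USING PigStorage(',') AS (id:int, tweet:chararray);

    -- Join and group: gather each user's tweets by id
    grouped = COGROUP tweets BY id, users BY id;

    -- Aggregation: count the tweets in every group
    counts = FOREACH grouped GENERATE group AS id, COUNT(tweets) AS cnt;

    -- Join the counts back to the user table to pick up the name
    joined = JOIN counts BY id, users BY id;
    result = FOREACH joined GENERATE users::name AS name, counts::cnt AS tweet_count;

    DUMP result;                                  -- display on screen
    -- STORE result INTO '/output/tweet_counts';  -- or persist to HDFS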
Features of Pig
Ease of programming: Pig Latin is similar to SQL, so fewer lines of code need to be written.
Short development time, as the code is simpler.
Handles all kinds of data: structured, semi-structured, and unstructured.
Offers a large set of operators, such as join, filter, and so on.
Allows multiple queries to be processed in parallel.
Optimization and compilation are easy, as they are done automatically and internally.
Pig lets us create user-defined functions (UDFs).
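As a brief illustration of the UDF feature, the sketch below registers a hypothetical Java UDF; the jar name and class are invented for illustration only, and a real UDF would need to be written and packaged first.

    -- Register a jar containing a custom Java function (names are hypothetical)
    REGISTER my-udfs.jar;
    DEFINE TO_UPPER com.example.pig.ToUpper();

    users = LOAD '/data/users' AS (id:int, name:chararray);
    loud  = FOREACH users GENERATE id, TO_UPPER(name);
    DUMP loud;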
Demo