Topics: Apache Pig basics, commands to work with Pig, advantages and disadvantages of Pig, history of Pig, components of Pig, execution of Pig, Pig Latin basics, EVAL functions, embedded Pig, Pig versus MapReduce, Pig running environment, Hadoop environment, logical plan, physical plan, ordering, joins
1. I EAT BIG!!!
I HANDLE BIG!!!
J.Ramsingh
Ph.D Research Scholar
Department of Computer Applications
Bharathiar University
2. History of Pig
J.Ramsingh., MCA., M.Phil., (Ph.D Research Scholar, DCA, Bharathiar University)
• Began as a research project in Yahoo! Research
• Created to overcome the rigidity of the MapReduce paradigm
• Pig was open-sourced via the Apache Incubator
• The first Pig release came in September 2008
• In 2009, Amazon and Yahoo! were using Pig
• In 2010, Pig became a top-level Apache project
5. WHY Pig?
Drawbacks of writing raw MapReduce:
• Not well suited for ad-hoc data analytics
• Roughly 200 lines of MapReduce code can be replaced by about 10 lines of Pig Latin
• Not rich in built-in functions
9. Facts on Pig
• Pigs eat anything
– (relational, nested, unstructured data, files, etc.)
• Pigs live anywhere
– (parallel data processing)
• Pigs are domestic animals
– (easily controlled, modified, integrated)
• Pigs fly
– (processes data quickly)
10. Why Is It Called Pig?
• Entertaining nomenclature
– Pig Latin for the language,
– Grunt for the shell, and
– Piggybank for the shared repository of user functions.
11. Overview of PIG
Pig is a platform that can handle large data sets ("I love to eat more and more").
12. PIG vs MAPREDUCE
PIG:
• Provides common data-processing operations directly
• Can analyze a Pig Latin script and understand the data flow, so it can do early error checking and optimization
• Pig Latin is much lower cost to write and maintain
• Pig has a rich type system
MAPREDUCE:
• Provides the group-by operation directly, but the order-by operation only indirectly
• The data processing inside the map and reduce phases is opaque to the system (no opportunity to optimize or check the user's code)
• Java code for MapReduce is difficult to write
• Has no type system, which limits the ability to check users' code for errors both before and during runtime
15. Pig Execution in Hadoop Cluster
(diagram of Pig execution in a Hadoop cluster)
17. Components of PIG
• Pig Latin – a command-based (data flow) language
• Grunt – the interactive execution environment (shell)
• Pig Server – the compiler, which strives to optimize execution
18. Compilation
The Pig system does two tasks:
• Builds a Logical Plan from a Pig Latin script
– Supports execution-platform independence
– No processing of data is performed at this stage
• Compiles the Logical Plan to a Physical Plan and executes it
– Converts the Logical Plan into a series of MapReduce jobs to be executed by Hadoop MapReduce
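Both plans can be inspected without running any job, using EXPLAIN. A minimal sketch (the file name and fields here are hypothetical):

```pig
-- Load a hypothetical comma-separated file and define a data flow
emp = LOAD 'employees.csv' USING PigStorage(',')
      AS (name:chararray, dob:chararray, designation:chararray);
by_desig = GROUP emp BY designation;
counts = FOREACH by_desig GENERATE group AS designation, COUNT(emp) AS n;

-- EXPLAIN prints the logical, physical and MapReduce plans;
-- no data is processed because there is no STORE or DUMP
EXPLAIN counts;
```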
24. Building a logical plan
A = LOAD 'dataset1.dat' AS (name, dob, designation);
B = GROUP A BY designation;
C = FOREACH B GENERATE group AS designation, COUNT(A);
D = FILTER C BY designation == 'XXX' OR designation == 'yyy';
STORE D INTO 'result.dat';
Logical plan (Pig pushes the FILTER ahead of the GROUP during optimization):
LOAD DATA → FILTER → GROUP → FOREACH
Execution only happens when output is specified by STORE or DUMP.
26. Building a Physical plan
Step 1: Create a MapReduce job for each COGROUP
Step 2: Push other commands into the map and reduce functions where possible
Step 3: It may be the case that certain commands require their own MapReduce job (e.g. ORDER needs separate MapReduce jobs)
For the running example:
Map: Load('user.dat') → Filter
Reduce: Group → Foreach
27. I Need All These
• Linux (version 10 or above)
• Java (version 6 or above)
• Hadoop
• Pig
29. PIG Latin Basics
• Pig Latin is a data-flow language rather than a procedural or declarative one; a program consists of a collection of statements.
• A statement can be thought of as an operation or a command.
Building blocks (complex data types)
• Field – a piece of data
[e.g. student_id = 01]
• Tuple – an ordered set of fields
[e.g. (01, Raja, MCA, C++)]
• Bag – a collection of tuples
[e.g. {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]
30. PIG Data Types
Simple type   Description
int           Signed 32-bit integer
long          Signed 64-bit integer
float         32-bit floating point
double        64-bit floating point
chararray     Character array (string) in Unicode UTF-8 format
bytearray     Byte array (blob)
boolean       true/false
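Types are attached to fields in the AS clause of LOAD. A small sketch (the file name and fields are hypothetical):

```pig
-- Typed schema: fields declared with simple types in the AS clause
staff = LOAD 'staff.csv' USING PigStorage(',')
        AS (id:int, name:chararray, salary:double);
-- Because salary is declared double, arithmetic works directly on it
raised = FOREACH staff GENERATE name, salary * 1.1 AS new_salary;
```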
31. PIG Basic Commands
Statement       Description
LOAD            Read data from the file system
STORE           Write data to the file system
DUMP            Display output on the screen
FOREACH         Apply an expression to each record and generate one or more records
FILTER          Apply a predicate to each record and remove records where it is false
GROUP / COGROUP Collect records with the same key from one or more inputs
JOIN            Join two or more inputs based on a key
ORDER           Sort records based on a key
DISTINCT        Remove duplicate records
UNION           Merge two data sets
LIMIT           Limit the number of records
SPLIT           Split data into two or more sets, based on filter conditions
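Several of these statements typically chain into one data flow. A minimal sketch (file names and fields here are hypothetical):

```pig
members = LOAD 'members.csv' USING PigStorage(',')
          AS (name:chararray, age:int);
adults  = FILTER members BY age >= 18;                    -- FILTER
by_name = GROUP adults BY name;                           -- GROUP
counts  = FOREACH by_name GENERATE group, COUNT(adults);  -- FOREACH
sorted  = ORDER counts BY $1 DESC;                        -- ORDER
top5    = LIMIT sorted 5;                                 -- LIMIT
STORE top5 INTO 'top_members';                            -- STORE triggers execution
```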
32. Load Command
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file
– Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage(), which parses each line into fields using a delimiter
– The default delimiter is tab ('\t')
• AS – assigns a schema to the incoming data
– Assigns names to fields
– Declares types of fields
33. LOAD Command Example
data = LOAD '$dir/age.csv' USING PigStorage(',') AS (name:chararray, age:chararray);
Here data is the relation, and the AS clause supplies its schema.
34. DUMP and STORE statements
• No action is taken until a DUMP or STORE command is encountered
– Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results to the screen
• STORE – saves results (typically to a file)
data = LOAD '$dir/newfine.csv' USING PigStorage(',') AS (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
...
DUMP data;
Ram,22
...
Nothing is executed until the DUMP is reached; Pig will optimize this entire chunk of the script.
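STORE writes a relation out much as LOAD reads one in, and PigStorage accepts an output delimiter too. A sketch using the hypothetical paths from the LOAD example:

```pig
data = LOAD '$dir/age.csv' USING PigStorage(',')
       AS (name:chararray, age:chararray);
-- Write the relation back out, pipe-delimited, into an output directory
STORE data INTO '$dir/age_out' USING PigStorage('|');
```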
35. FOREACH
• FOREACH <bag> GENERATE <data>
– Iterates over each element in the bag and produces a result
result = FOREACH data GENERATE name;
36. FOREACH with Functions
FOREACH B GENERATE group, FUNCTION(A);
• Pig comes with many built-in functions, including COUNT, FLATTEN, CONCAT, etc.
• You can also implement a custom function
Example
counts = FOREACH data GENERATE group, COUNT(name);
DUMP counts;
Ram,3
Raj,4
Sam,2
Mani,1
37. Diagnostic Tools
DESCRIBE
Display the structure (schema) of a bag
DESCRIBE <bag_name>;
EXPLAIN
Display the execution plan; produces various reports
• Logical Plan
• MapReduce Plan
EXPLAIN <bag_name>;
ILLUSTRATE
Illustrate how the Pig engine transforms the data
ILLUSTRATE <bag_name>;
38. FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it is an operator; it re-arranges the output
Example
grunt> DUMP data;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH data GENERATE FLATTEN($0);
grunt> DUMP flatBag;
(this)
(is)
(a)
......
The input is a nested structure (a bag of bags of tuples); each row is flattened, resulting in a bag of simple tokens.
39. Group
• Groups the data in one or more relations.
• The GROUP operator groups together tuples that have the same group key (key field).
• The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as the group key.
Example
groupme = GROUP data BY name;
DUMP groupme;
(Ram,{(Ram,30),(Ram,22),(Ram,25)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)})
(Sam,{(Sam,15),(Sam,22)})
40. Co-Group
• COGROUP is essentially the same operator as GROUP.
• It groups two data sets together by a common attribute, collecting the data into nested bags.
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved"
Example
Data1 = LOAD '$dir/data.csv' USING PigStorage(',') AS (name:chararray, age:chararray);
Data2 = LOAD '$dir/data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
X = COGROUP Data1 BY name, Data2 BY name;
41. cont …
DUMP X;
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
(Sam,{(Sam,15),(Sam,22)},{})
COGROUP is by default an OUTER JOIN. You can remove records with empty bags by specifying INNER on each bag:
X = COGROUP Data1 BY name INNER, Data2 BY name INNER;
DUMP X;
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
In each output tuple, the first bag comes from Data1 (the first data set) and the second bag from Data2 (the second data set).
42. Filtering
• Selects a subset of the tuples in a bag
FILTER bag BY expression;
• Expressions use simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT)
Example
Filterdata = FILTER data BY age > 20;
DUMP Filterdata;
(Ram,30)
(Ram,22)
(Ram,25)
(Raj,22)
(Raj,52)
(Raj,62)
(Sam,22)
43. Ordering
• Sorts a relation based on one or more fields.
alias = ORDER alias BY field [ASC|DESC];
Example
ordereddata = ORDER data BY age DESC;
DUMP ordereddata;
(Raj,62)
(Raj,52)
(Ram,30)
(Ram,22)
(Raj,22)
(Sam,22)
44. Join
• Joins two data sets together by a common attribute.
• By default the JOIN operator always performs an inner join.
• Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
(diagram: Data 1 and Data 2 joined)
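The slide shows no syntax; a minimal sketch, reusing the hypothetical Data1/Data2 relations from the COGROUP example:

```pig
Data1 = LOAD '$dir/data.csv'  USING PigStorage(',') AS (name:chararray, age:chararray);
Data2 = LOAD '$dir/data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
-- Inner join on the common attribute; output records are flat, not nested
J = JOIN Data1 BY name, Data2 BY name;
DUMP J;
-- e.g. (Ram,30,Ram,Cbe) -- one flat record per matching pair
```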
45. Outer Join
• Records which do not join with the 'other' record set are still included when using an outer join.
• Left Outer
– Records from the first data set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null.
(diagram: Data 1 left-outer-joined with Data 2)
46. cont …
• Right Outer
– The opposite of a left outer join: records from the second data set are included no matter what. Fields from the unmatched (first) bag are set to null.
• Full Outer
– Records from both sides are included. For unmatched records the fields from the 'other' bag are set to null.
(diagrams: right outer and full outer joins of Data 1 and Data 2)
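The outer variants are expressed with the LEFT/RIGHT/FULL keywords; a sketch using the same hypothetical relations:

```pig
L = JOIN Data1 BY name LEFT OUTER,  Data2 BY name;  -- keep all of Data1
R = JOIN Data1 BY name RIGHT OUTER, Data2 BY name;  -- keep all of Data2
F = JOIN Data1 BY name FULL OUTER,  Data2 BY name;  -- keep both sides
```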
48. EVAL FUNCTIONS
• AVG
• CONCAT
• COUNT
• ISEMPTY
• MAX
• MIN
• SUM
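Eval functions usually run over the bag produced by a GROUP; a sketch using the hypothetical (name, age) data from the earlier examples:

```pig
-- Assumes age was loaded as a numeric type (e.g. int), not chararray
by_name = GROUP data BY name;
stats = FOREACH by_name GENERATE
            group,
            COUNT(data)   AS n,
            AVG(data.age) AS avg_age,
            MIN(data.age) AS min_age,
            MAX(data.age) AS max_age;
```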
49. UDF's
• User Defined Functions
• A way to operate on fields
• But not on groups
• Can be called from within a Pig script
50. UDFs to the Rescue [Embedded Mode]
• Easy to use
• Easy to code
• Keep the power of PIG
• You are free to write in the language of your choice (classically Java)
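The slides do not show the calling convention; on the Pig side, a UDF packaged in a jar is typically wired in like this (the jar name, class path, and function below are hypothetical):

```pig
-- Register the jar containing the compiled UDF class
REGISTER myudfs.jar;
-- Give the fully-qualified class a short alias
DEFINE UPPER com.example.pig.Upper();
data = LOAD 'names.txt' AS (name:chararray);
shouted = FOREACH data GENERATE UPPER(name);
```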
52. Does Whatever You Want
• Image feature extraction
• Geo computations
• Data cleaning
• Retrieving web pages
• NLP
• … and even more
53. Why Is PIG Faster?
• Fewer bugs
• Fewer lines of code
• Easier to read (the purpose of the analytics is straightforward)
54. Pitfalls
• Pig and Hadoop versions must match
55. Pitfalls
• Bugs in older versions require registering jars
56. Conclusion
• Pig is a data-processing environment in Hadoop that targets procedural programmers who do large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.
Note: MapReduce provides the group-by operation only through the way it implements the grouping. Filter and projection can be implemented trivially in the map phase, but other operators, particularly join, are not provided and must instead be written by the user.