3. Pig Latin is called a DATA FLOW LANGUAGE.
Used as a SCRIPTING LANGUAGE in Big Data technology.
Runs on Hadoop and reads and writes its data through HDFS.
HDFS is based on the GOOGLE FILE SYSTEM (GFS).
9. The Pig system performs two tasks:
Builds a Logical Plan from a Pig Latin script
◦ Supports execution-platform independence
◦ No processing of data is performed at this stage
Compiles the Logical Plan to a Physical Plan and executes it
◦ Converts the Logical Plan into a series of Map-Reduce jobs to be executed by Hadoop Map-Reduce
10. A = LOAD 'dataset 1.dat' AS (name, dob, designation);   -- LOAD DATA
B = GROUP A BY designation;                                 -- GROUP DATA
C = FOREACH B GENERATE group AS designation, COUNT(A);      -- FOREACH
D = FILTER C BY designation == 'XXX'
       OR designation == 'yyy';                             -- FILTER
STORE D INTO 'result.dat';
16. Pig Latin is a data flow language rather than a procedural or declarative one: a program consists of a collection of statements, and a statement can be thought of as an operation or a command.
17. Field – a field is a piece of data
[e.g. student_id = 01]
Tuple – a tuple is an ordered set of fields
[e.g. (01, Raja, MCA, C++)]
Bag – a bag is a collection of tuples
[e.g. { (01, Raja, MCA, C++),
(22, Ramesh, MBA, C) }]
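A minimal sketch of how these shapes show up in practice, assuming a hypothetical tab-separated file students.txt with the fields above:

```pig
-- Each line of the file becomes a tuple; each column is a field.
A = LOAD 'students.txt' AS (id:int, name:chararray, course:chararray, lang:chararray);

-- Grouping packs the matching tuples into a bag per key,
-- e.g. (MCA, {(1,Raja,MCA,C++)}) and (MBA, {(22,Ramesh,MBA,C)}).
B = GROUP A BY course;
DUMP B;
```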
18. SIMPLE TYPE   DESCRIPTION
int               Signed 32-bit integer
long              Signed 64-bit integer
float             32-bit floating point
double            64-bit floating point
chararray         Character array (string) in Unicode UTF-8 format
bytearray         Byte array (blob)
boolean           true/false
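A short sketch of declaring these types in a LOAD schema (the file and field names are hypothetical):

```pig
-- Types are declared per field in the AS clause;
-- fields without a declared type default to bytearray.
emp = LOAD 'employees.csv' USING PigStorage(',')
      AS (name:chararray, age:int, salary:double, active:boolean);
```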
19. Statement        Description
Load                 Read data from the file system
Store                Write data to the file system
Dump                 Display output on the screen
Foreach              Apply an expression to each record and generate one or more records
Filter               Apply a predicate to each record and remove records where it is false
Group / Cogroup      Collect records with the same key from one or more inputs
Join                 Join two or more inputs based on a key
20. Order            Sort records based on a key
Distinct             Remove duplicate records
Union                Merge two datasets
Limit                Limit the number of records
Split                Split data into two or more sets, based on filter conditions
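A hedged sketch of a few of the operators above, assuming a relation A with an age field:

```pig
-- SPLIT partitions one relation into several, each with its own condition.
SPLIT A INTO minors IF age < 18, adults IF age >= 18;

-- LIMIT and DISTINCT compose in the same statement-by-statement style.
top10  = LIMIT adults 10;
unique = DISTINCT top10;
```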
21. LOAD 'data' [USING function] [AS schema];
Example:
data = LOAD '$dir/age.csv' USING PigStorage(',')
       AS (name:chararray, age:chararray);
22. No action is taken until DUMP or STORE commands
are encountered
Pig will parse, validate and analyze statements but not
execute them
DUMP – displays the results to the screen
STORE – saves results (typically to a file)
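A small sketch of this lazy-execution behavior (filenames are hypothetical):

```pig
-- Nothing runs yet: Pig only parses and validates these statements.
logs  = LOAD 'access.log' AS (line:chararray);
short = LIMIT logs 5;

-- Execution is triggered here.
DUMP short;                  -- display the results on screen
STORE short INTO 'sample';   -- or save them to the file system
```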
23. FOREACH B GENERATE group, FUNCTION(A);
Pig comes with many built-in functions, including COUNT, FLATTEN, CONCAT, etc.
Custom functions (UDFs) can also be implemented.
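A hedged sketch of FOREACH with built-in functions after a GROUP (file and field names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, dept:chararray, salary:int);
B = GROUP A BY dept;

-- Aggregate functions operate on the bag of grouped tuples.
C = FOREACH B GENERATE group, COUNT(A), MAX(A.salary);

-- FLATTEN un-nests a bag back into top-level tuples.
D = FOREACH B GENERATE group, FLATTEN(A.name);
```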
24. Groups the data in one or more relations.
The GROUP operator groups together tuples that have the same group key (key field).
The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as the group key.
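A short sketch of the single-field versus multi-field key cases (names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, dept:chararray, role:chararray);

-- Single-field key: the key keeps its own type (chararray here).
B = GROUP A BY dept;

-- Multi-field key: the key becomes a tuple (dept, role).
C = GROUP A BY (dept, role);
```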
25. COGROUP is the same operation as GROUP.
Groups two datasets together by a common attribute.
Groups data into nested bags.
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved."
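A hedged COGROUP sketch, assuming two hypothetical comma-separated files sharing a pet field:

```pig
owners = LOAD 'owners.csv' USING PigStorage(',') AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets.csv'   USING PigStorage(',') AS (pet:chararray, legs:int);

-- One nested bag per input relation, collected under the shared key.
grouped = COGROUP owners BY pet, pets BY pet;
```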
26. Selects a subset of the tuples in a bag.
FILTER bag BY expression;
The expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT).
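A minimal FILTER sketch combining comparisons and logical connectors (names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, age:int, dept:chararray);

-- Comparison operators combined with AND / NOT.
adults_it = FILTER A BY (age >= 18) AND (dept == 'IT');
not_hr    = FILTER A BY NOT (dept == 'HR');
```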
27. Sorts a relation based on one or more fields.
alias = ORDER alias BY { * | field_alias } [ASC|DESC];
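A short ORDER sketch on hypothetical fields:

```pig
A = LOAD 'emp.dat' AS (name:chararray, salary:int);

-- Sort descending by salary, breaking ties by name ascending.
sorted = ORDER A BY salary DESC, name ASC;
```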
28. Joins two datasets together by a common attribute.
By default, the JOIN operator performs an inner join.
Inner joins ignore null keys, so it makes sense to filter them out before the join.
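A sketch of the filter-then-join pattern described above (relations and fields are hypothetical):

```pig
A = LOAD 'a.dat' AS (id:int, name:chararray);
B = LOAD 'b.dat' AS (id:int, dept:chararray);

-- Drop null keys first, since an inner join would discard them anyway.
A1 = FILTER A BY id IS NOT NULL;
B1 = FILTER B BY id IS NOT NULL;
J  = JOIN A1 BY id, B1 BY id;
```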
29. An outer join also includes records that do not match any record in the other dataset.
Left Outer:
Records from the first (left) dataset are included whether they have a match or not; fields from the unmatched (second) bag are set to null.
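A minimal left outer join sketch (relations and fields are hypothetical):

```pig
A = LOAD 'a.dat' AS (id:int, name:chararray);
B = LOAD 'b.dat' AS (id:int, dept:chararray);

-- Every tuple of A appears; unmatched fields from B come out as null.
J = JOIN A BY id LEFT OUTER, B BY id;
```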
31. User Defined Functions (UDFs)
A way to operate on fields (but not on groups).
Can be called from a Pig script.
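One way to sketch this, assuming Pig's Jython UDF support; the file and function names here are hypothetical:

```pig
-- string_utils.py (hypothetical Python file):
--   @outputSchema("upper:chararray")
--   def to_upper(s): return s.upper()

REGISTER 'string_utils.py' USING jython AS myudfs;

A = LOAD 'emp.dat' AS (name:chararray);
B = FOREACH A GENERATE myudfs.to_upper(name);
```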
32. Easy to use
Easy to code
Keeps the power of PIG
You are free to write in
34. Image Feature Extraction
Geo Computations
Data Cleaning
Retrieve Web Pages
NLP
… and even more
35. Fewer bugs
Fewer lines of code
Easier to read (the purpose of the analysis is straightforward)
38. Pig is a data processing environment in Hadoop that targets procedural programmers who do large-scale data analysis.
Pig Latin offers high-level data manipulation in a procedural style.