Topics: Apache Pig basics, commands to work with Pig, advantages and disadvantages of Pig, history of Pig, components of Pig, execution of Pig, Pig Latin basics, EVAL functions, embedded Pig, Pig versus MapReduce, Pig running environment, Hadoop environment, logical plan, physical plan, ordering, joins
1. I EAT BIG!!!
I HANDLE BIG!!!
J.Ramsingh
Ph.D Research Scholar
Department of Computer Applications
Bharathiar University
2. History of Pig
J.Ramsingh., MCA., M.Phil., (Ph.D Research Scholar, DCA, Bharathiar University)
• Began as a research project in Yahoo! Research
• Created to overcome the rigidity of the MapReduce paradigm
• Pig was open-sourced via the Apache Incubator
• The first Pig release came in September 2008
• In 2009, Amazon and Yahoo! were using Pig
• In 2010, Pig became a top-level Apache project
5. WHY Pig?
Drawbacks of writing raw MapReduce:
• Not well suited for ad-hoc data analytics
• Roughly 200 lines of MapReduce code can be replaced by about 10 lines of Pig Latin
• Not rich in built-in functions
9. Facts on Pig
• Pigs eat anything
– (relational, nested, unstructured data, files, etc.)
• Pigs live anywhere
– (parallel data processing)
• Pigs are domestic animals
– (easily controlled, modified, integrated)
• Pigs fly
– (processes data quickly)
10. Why Is It Called Pig?
• Entertaining nomenclature
– Pig Latin for the language,
– Grunt for the shell, and
– Piggybank for the shared repository of user functions.
11. Overview of PIG
Pig is a platform that can handle large data sets ("I love to eat more and more").
12. PIG vs MAPREDUCE
PIG:
• Provides common data-processing operations directly
• Can analyze a Pig Latin script and understand the data flow, so it can do early error checking and optimization
• Pig Latin is much lower cost to write and maintain
• Pig has a rich type system
MAPREDUCE:
• Provides the group-by operation directly, but the order-by operation only indirectly
• The data processing inside the map and reduce phases is opaque to the system (no opportunity to optimize or check the user's code)
• Java code for MapReduce is difficult to write
• Has no type system, which limits the ability to check users' code for errors both before and during runtime
15. Pig Execution in Hadoop Cluster
(diagram of Pig execution in a Hadoop cluster)
17. Components of PIG
• Pig Latin – a command-based (data flow) language
• Grunt – the interactive execution environment (shell)
• Pig Server – the compiler, which strives to optimize execution
18. Compilation
The Pig system does two tasks:
• Builds a Logical Plan from a Pig Latin script
– Supports execution-platform independence
– No processing of data is performed at this stage
• Compiles the Logical Plan to a Physical Plan and executes it
– Converts the Logical Plan into a series of MapReduce jobs to be executed by Hadoop MapReduce
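Both plans can be inspected without running any job, using EXPLAIN. A minimal sketch (the file name and fields here are hypothetical):

```pig
-- Load a hypothetical comma-separated file and define a data flow
emp = LOAD 'employees.csv' USING PigStorage(',')
      AS (name:chararray, dob:chararray, designation:chararray);
by_desig = GROUP emp BY designation;
counts = FOREACH by_desig GENERATE group AS designation, COUNT(emp) AS n;

-- EXPLAIN prints the logical, physical and MapReduce plans;
-- no data is processed because there is no STORE or DUMP
EXPLAIN counts;
```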
24. Building a logical plan
A = LOAD 'dataset1.dat' AS (name, dob, designation);
B = GROUP A BY designation;
C = FOREACH B GENERATE group AS designation, COUNT(A);
D = FILTER C BY designation == 'XXX' OR designation == 'yyy';
STORE D INTO 'result.dat';
Logical plan (Pig pushes the FILTER ahead of the GROUP during optimization):
LOAD DATA → FILTER → GROUP → FOREACH
Execution only happens when output is specified by STORE or DUMP.
26. Building a Physical plan
Step 1: Create a MapReduce job for each COGROUP
Step 2: Push other commands into the map and reduce functions where possible
Step 3: It may be the case that certain commands require their own MapReduce job (e.g. ORDER needs separate MapReduce jobs)
For the running example:
Map: Load('user.dat') → Filter
Reduce: Group → Foreach
27. I Need All These
• Linux (version 10 or above)
• Java (version 6 or above)
• Hadoop
• Pig
29. PIG Latin Basics
• Pig Latin is a data-flow language rather than a procedural or declarative one; a program consists of a collection of statements.
• A statement can be thought of as an operation or a command.
Building blocks (complex data types)
• Field – a piece of data
[e.g. student_id = 01]
• Tuple – an ordered set of fields
[e.g. (01, Raja, MCA, C++)]
• Bag – a collection of tuples
[e.g. {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]
30. PIG Data Types
Simple type   Description
int           Signed 32-bit integer
long          Signed 64-bit integer
float         32-bit floating point
double        64-bit floating point
chararray     Character array (string) in Unicode UTF-8 format
bytearray     Byte array (blob)
boolean       true/false
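Types are attached to fields in the AS clause of LOAD. A small sketch (the file name and fields are hypothetical):

```pig
-- Typed schema: fields declared with simple types in the AS clause
staff = LOAD 'staff.csv' USING PigStorage(',')
        AS (id:int, name:chararray, salary:double);
-- Because salary is declared double, arithmetic works directly on it
raised = FOREACH staff GENERATE name, salary * 1.1 AS new_salary;
```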
31. PIG Basic Commands
Statement       Description
LOAD            Read data from the file system
STORE           Write data to the file system
DUMP            Display output on the screen
FOREACH         Apply an expression to each record and generate one or more records
FILTER          Apply a predicate to each record and remove records where it is false
GROUP / COGROUP Collect records with the same key from one or more inputs
JOIN            Join two or more inputs based on a key
ORDER           Sort records based on a key
DISTINCT        Remove duplicate records
UNION           Merge two data sets
LIMIT           Limit the number of records
SPLIT           Split data into two or more sets, based on filter conditions
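Several of these statements typically chain into one data flow. A minimal sketch (file names and fields here are hypothetical):

```pig
members = LOAD 'members.csv' USING PigStorage(',')
          AS (name:chararray, age:int);
adults  = FILTER members BY age >= 18;                    -- FILTER
by_name = GROUP adults BY name;                           -- GROUP
counts  = FOREACH by_name GENERATE group, COUNT(adults);  -- FOREACH
sorted  = ORDER counts BY $1 DESC;                        -- ORDER
top5    = LIMIT sorted 5;                                 -- LIMIT
STORE top5 INTO 'top_members';                            -- STORE triggers execution
```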
32. Load Command
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file
– Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage(), which parses each line into fields using a delimiter
– The default delimiter is tab ('\t')
• AS – assigns a schema to the incoming data
– Assigns names to fields
– Declares types of fields
33. LOAD Command Example
data = LOAD '$dir/age.csv' USING PigStorage(',') AS (name:chararray, age:chararray);
Here data is the relation, and the AS clause supplies its schema.
34. DUMP and STORE statements
• No action is taken until a DUMP or STORE command is encountered
– Pig will parse, validate and analyze statements but not execute them
• DUMP – displays the results to the screen
• STORE – saves results (typically to a file)
data = LOAD '$dir/newfine.csv' USING PigStorage(',') AS (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
...
DUMP data;
Ram,22
...
Nothing is executed until the DUMP is reached; Pig will optimize this entire chunk of the script.
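STORE writes a relation out much as LOAD reads one in, and PigStorage accepts an output delimiter too. A sketch using the hypothetical paths from the LOAD example:

```pig
data = LOAD '$dir/age.csv' USING PigStorage(',')
       AS (name:chararray, age:chararray);
-- Write the relation back out, pipe-delimited, into an output directory
STORE data INTO '$dir/age_out' USING PigStorage('|');
```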
35. FOREACH
• FOREACH <bag> GENERATE <data>
– Iterates over each element in the bag and produces a result
result = FOREACH data GENERATE name;
36. FOREACH with Functions
FOREACH B GENERATE group, FUNCTION(A);
• Pig comes with many built-in functions, including COUNT, FLATTEN, CONCAT, etc.
• You can also implement a custom function
Example
counts = FOREACH data GENERATE group, COUNT(name);
DUMP counts;
Ram,3
Raj,4
Sam,2
Mani,1
37. Diagnostic Tools
DESCRIBE
Display the structure (schema) of a bag
DESCRIBE <bag_name>;
EXPLAIN
Display the execution plan; produces various reports
• Logical Plan
• MapReduce Plan
EXPLAIN <bag_name>;
ILLUSTRATE
Illustrate how the Pig engine transforms the data
ILLUSTRATE <bag_name>;
38. FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it is an operator; it re-arranges the output
Example
grunt> DUMP data;
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH data GENERATE FLATTEN($0);
grunt> DUMP flatBag;
(this)
(is)
(a)
......
The input is a nested structure (a bag of bags of tuples); each row is flattened, resulting in a bag of simple tokens.
39. Group
• Groups the data in one or more relations.
• The GROUP operator groups together tuples that have the same group key (key field).
• The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as the group key.
Example
groupme = GROUP data BY name;
DUMP groupme;
(Ram,{(Ram,30),(Ram,22),(Ram,25)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)})
(Sam,{(Sam,15),(Sam,22)})
40. Co-Group
• COGROUP is essentially the same operator as GROUP.
• It groups two data sets together by a common attribute, collecting the data into nested bags.
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved"
Example
Data1 = LOAD '$dir/data.csv' USING PigStorage(',') AS (name:chararray, age:chararray);
Data2 = LOAD '$dir/data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
X = COGROUP Data1 BY name, Data2 BY name;
41. cont …
DUMP X;
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
(Sam,{(Sam,15),(Sam,22)},{})
COGROUP is by default an OUTER JOIN. You can remove records with empty bags by specifying INNER on each bag:
X = COGROUP Data1 BY name INNER, Data2 BY name INNER;
DUMP X;
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
In each output tuple, the first bag comes from Data1 (the first data set) and the second bag from Data2 (the second data set).
42. Filtering
• Selects a subset of the tuples in a bag
FILTER bag BY expression;
• Expressions use simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT)
Example
Filterdata = FILTER data BY age > 20;
DUMP Filterdata;
(Ram,30)
(Ram,22)
(Ram,25)
(Raj,22)
(Raj,52)
(Raj,62)
(Sam,22)
43. Ordering
• Sorts a relation based on one or more fields.
alias = ORDER alias BY field [ASC|DESC];
Example
ordereddata = ORDER data BY age DESC;
DUMP ordereddata;
(Raj,62)
(Raj,52)
(Ram,30)
(Ram,22)
(Raj,22)
(Sam,22)
44. Join
• Joins two data sets together by a common attribute.
• By default the JOIN operator always performs an inner join.
• Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note: The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
(diagram: Data 1 and Data 2 joined)
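The slide shows no syntax; a minimal sketch, reusing the hypothetical Data1/Data2 relations from the COGROUP example:

```pig
Data1 = LOAD '$dir/data.csv'  USING PigStorage(',') AS (name:chararray, age:chararray);
Data2 = LOAD '$dir/data2.csv' USING PigStorage(',') AS (name:chararray, address:chararray);
-- Inner join on the common attribute; output records are flat, not nested
J = JOIN Data1 BY name, Data2 BY name;
DUMP J;
-- e.g. (Ram,30,Ram,Cbe) -- one flat record per matching pair
```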
45. Outer Join
• Records which do not join with the 'other' record set are still included when using an outer join.
• Left Outer
– Records from the first data set are included whether they have a match or not. Fields from the unmatched (second) bag are set to null.
(diagram: Data 1 left-outer-joined with Data 2)
46. cont …
• Right Outer
– The opposite of a left outer join: records from the second data set are included no matter what. Fields from the unmatched (first) bag are set to null.
• Full Outer
– Records from both sides are included. For unmatched records the fields from the 'other' bag are set to null.
(diagrams: right outer and full outer joins of Data 1 and Data 2)
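The outer variants are expressed with the LEFT/RIGHT/FULL keywords; a sketch using the same hypothetical relations:

```pig
L = JOIN Data1 BY name LEFT OUTER,  Data2 BY name;  -- keep all of Data1
R = JOIN Data1 BY name RIGHT OUTER, Data2 BY name;  -- keep all of Data2
F = JOIN Data1 BY name FULL OUTER,  Data2 BY name;  -- keep both sides
```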
48. EVAL FUNCTIONS
• AVG
• CONCAT
• COUNT
• ISEMPTY
• MAX
• MIN
• SUM
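Eval functions usually run over the bag produced by a GROUP; a sketch using the hypothetical (name, age) data from the earlier examples:

```pig
-- Assumes age was loaded as a numeric type (e.g. int), not chararray
by_name = GROUP data BY name;
stats = FOREACH by_name GENERATE
            group,
            COUNT(data)   AS n,
            AVG(data.age) AS avg_age,
            MIN(data.age) AS min_age,
            MAX(data.age) AS max_age;
```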
49. UDF's
• User Defined Functions
• A way to operate on fields
• But not on groups
• Can be called from within a Pig script
50. UDFs to the Rescue [Embedded Mode]
• Easy to use
• Easy to code
• Keep the power of PIG
• You are free to write in the language of your choice (classically Java)
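The slides do not show the calling convention; on the Pig side, a UDF packaged in a jar is typically wired in like this (the jar name, class path, and function below are hypothetical):

```pig
-- Register the jar containing the compiled UDF class
REGISTER myudfs.jar;
-- Give the fully-qualified class a short alias
DEFINE UPPER com.example.pig.Upper();
data = LOAD 'names.txt' AS (name:chararray);
shouted = FOREACH data GENERATE UPPER(name);
```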
52. Does Whatever You Want
• Image feature extraction
• Geo computations
• Data cleaning
• Retrieving web pages
• NLP
• … and even more
53. Why Is PIG Faster?
• Fewer bugs
• Fewer lines of code
• Easier to read (the purpose of the analytics is straightforward)
54. Pitfalls
• Pig and Hadoop versions must match
55. Pitfalls
• Bugs in older versions require registering jars
56. Conclusion
• Pig is a data-processing environment in Hadoop that targets procedural programmers who do large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.
Note: MapReduce provides the group-by operation only through the way it implements the grouping. Filter and projection can be implemented trivially in the map phase, but other operators, particularly join, are not provided and must instead be written by the user.