3. Pig Latin is called a DATA FLOW LANGUAGE.
Used as a SCRIPTING LANGUAGE in Big Data technology.
Runs on Hadoop and reads and writes its data through HDFS.
HDFS is based on the GOOGLE FILE SYSTEM (GFS).
9. The Pig system performs two tasks:
Builds a Logical Plan from a Pig Latin script
◦ Supports execution-platform independence
◦ No processing of data is performed at this stage
Compiles the Logical Plan to a Physical Plan and executes it
◦ Converts the Logical Plan into a series of Map-Reduce jobs to be executed by Hadoop Map-Reduce
10. A = LOAD 'dataset 1.dat' AS (name, dob, designation);   -- LOAD DATA
B = GROUP A BY designation;                                 -- GROUP DATA
C = FOREACH B GENERATE group AS designation, COUNT(A);      -- FOREACH
D = FILTER C BY designation == 'XXX'
       OR designation == 'yyy';                             -- FILTER
STORE D INTO 'result.dat';
16. Pig Latin is a data flow language rather than a procedural or declarative one: a program consists of a collection of statements, and a statement can be thought of as an operation or a command.
17. Field – a field is a piece of data
[e.g. student_id = 01]
Tuple – a tuple is an ordered set of fields
[e.g. (01, Raja, MCA, C++)]
Bag – a bag is a collection of tuples
[e.g. { (01, Raja, MCA, C++),
(22, Ramesh, MBA, C) }]
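A minimal sketch of how these shapes show up in practice, assuming a hypothetical tab-separated file students.txt with the fields above:

```pig
-- Each line of the file becomes a tuple; each column is a field.
A = LOAD 'students.txt' AS (id:int, name:chararray, course:chararray, lang:chararray);

-- Grouping packs the matching tuples into a bag per key,
-- e.g. (MCA, {(1,Raja,MCA,C++)}) and (MBA, {(22,Ramesh,MBA,C)}).
B = GROUP A BY course;
DUMP B;
```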
18. SIMPLE TYPE   DESCRIPTION
int               Signed 32-bit integer
long              Signed 64-bit integer
float             32-bit floating point
double            64-bit floating point
chararray         Character array (string) in Unicode UTF-8 format
bytearray         Byte array (blob)
boolean           true/false
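A short sketch of declaring these types in a LOAD schema (the file and field names are hypothetical):

```pig
-- Types are declared per field in the AS clause;
-- fields without a declared type default to bytearray.
emp = LOAD 'employees.csv' USING PigStorage(',')
      AS (name:chararray, age:int, salary:double, active:boolean);
```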
19. Statement        Description
Load                 Read data from the file system
Store                Write data to the file system
Dump                 Display output on the screen
Foreach              Apply an expression to each record and generate one or more records
Filter               Apply a predicate to each record and remove records where it is false
Group / Cogroup      Collect records with the same key from one or more inputs
Join                 Join two or more inputs based on a key
20. Order            Sort records based on a key
Distinct             Remove duplicate records
Union                Merge two datasets
Limit                Limit the number of records
Split                Split data into two or more sets, based on filter conditions
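A hedged sketch of a few of the operators above, assuming a relation A with an age field:

```pig
-- SPLIT partitions one relation into several, each with its own condition.
SPLIT A INTO minors IF age < 18, adults IF age >= 18;

-- LIMIT and DISTINCT compose in the same statement-by-statement style.
top10  = LIMIT adults 10;
unique = DISTINCT top10;
```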
21. LOAD 'data' [USING function] [AS schema];
Example:
data = LOAD '$dir/age.csv' USING PigStorage(',')
       AS (name:chararray, age:chararray);
22. No action is taken until DUMP or STORE commands
are encountered
Pig will parse, validate and analyze statements but not
execute them
DUMP – displays the results to the screen
STORE – saves results (typically to a file)
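A small sketch of this lazy-execution behavior (filenames are hypothetical):

```pig
-- Nothing runs yet: Pig only parses and validates these statements.
logs  = LOAD 'access.log' AS (line:chararray);
short = LIMIT logs 5;

-- Execution is triggered here.
DUMP short;                  -- display the results on screen
STORE short INTO 'sample';   -- or save them to the file system
```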
23. FOREACH B GENERATE group, FUNCTION(A);
Pig comes with many built-in functions, including COUNT, FLATTEN, CONCAT, etc.
Custom functions (UDFs) can also be implemented.
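A hedged sketch of FOREACH with built-in functions after a GROUP (file and field names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, dept:chararray, salary:int);
B = GROUP A BY dept;

-- Aggregate functions operate on the bag of grouped tuples.
C = FOREACH B GENERATE group, COUNT(A), MAX(A.salary);

-- FLATTEN un-nests a bag back into top-level tuples.
D = FOREACH B GENERATE group, FLATTEN(A.name);
```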
24. Groups the data in one or more relations.
The GROUP operator groups together tuples that have the same group key (key field).
The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as the group key.
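A short sketch of the single-field versus multi-field key cases (names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, dept:chararray, role:chararray);

-- Single-field key: the key keeps its own type (chararray here).
B = GROUP A BY dept;

-- Multi-field key: the key becomes a tuple (dept, role).
C = GROUP A BY (dept, role);
```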
25. COGROUP is the same operation as GROUP.
Groups two datasets together by a common attribute.
Groups data into nested bags.
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved."
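A hedged COGROUP sketch, assuming two hypothetical comma-separated files sharing a pet field:

```pig
owners = LOAD 'owners.csv' USING PigStorage(',') AS (owner:chararray, pet:chararray);
pets   = LOAD 'pets.csv'   USING PigStorage(',') AS (pet:chararray, legs:int);

-- One nested bag per input relation, collected under the shared key.
grouped = COGROUP owners BY pet, pets BY pet;
```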
26. Selects a subset of the tuples in a bag.
FILTER bag BY expression;
The expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT).
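A minimal FILTER sketch combining comparisons and logical connectors (names are hypothetical):

```pig
A = LOAD 'emp.dat' AS (name:chararray, age:int, dept:chararray);

-- Comparison operators combined with AND / NOT.
adults_it = FILTER A BY (age >= 18) AND (dept == 'IT');
not_hr    = FILTER A BY NOT (dept == 'HR');
```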
27. Sorts a relation based on one or more fields.
alias = ORDER alias BY { * | field_alias } [ASC|DESC];
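A short ORDER sketch on hypothetical fields:

```pig
A = LOAD 'emp.dat' AS (name:chararray, salary:int);

-- Sort descending by salary, breaking ties by name ascending.
sorted = ORDER A BY salary DESC, name ASC;
```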
28. Joins two datasets together by a common attribute.
By default, the JOIN operator performs an inner join.
Inner joins ignore null keys, so it makes sense to filter them out before the join.
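A sketch of the filter-then-join pattern described above (relations and fields are hypothetical):

```pig
A = LOAD 'a.dat' AS (id:int, name:chararray);
B = LOAD 'b.dat' AS (id:int, dept:chararray);

-- Drop null keys first, since an inner join would discard them anyway.
A1 = FILTER A BY id IS NOT NULL;
B1 = FILTER B BY id IS NOT NULL;
J  = JOIN A1 BY id, B1 BY id;
```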
29. An outer join also includes records that do not match any record in the other dataset.
Left Outer:
Records from the first (left) dataset are included whether they have a match or not; fields from the unmatched (second) bag are set to null.
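A minimal left outer join sketch (relations and fields are hypothetical):

```pig
A = LOAD 'a.dat' AS (id:int, name:chararray);
B = LOAD 'b.dat' AS (id:int, dept:chararray);

-- Every tuple of A appears; unmatched fields from B come out as null.
J = JOIN A BY id LEFT OUTER, B BY id;
```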
31. User Defined Functions (UDFs)
A way to operate on fields (but not on groups).
Can be called from a Pig script.
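One way to sketch this, assuming Pig's Jython UDF support; the file and function names here are hypothetical:

```pig
-- string_utils.py (hypothetical Python file):
--   @outputSchema("upper:chararray")
--   def to_upper(s): return s.upper()

REGISTER 'string_utils.py' USING jython AS myudfs;

A = LOAD 'emp.dat' AS (name:chararray);
B = FOREACH A GENERATE myudfs.to_upper(name);
```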
32. Easy to use
Easy to code
Keeps the power of PIG
You are free to write in
34. Image Feature Extraction
Geo Computations
Data Cleaning
Retrieve Web Pages
NLP
… and even more
35. Fewer bugs
Fewer lines of code
Easier to read (the purpose of the analysis is straightforward)
38. Pig is a data processing environment in Hadoop that targets procedural programmers who do large-scale data analysis.
Pig Latin offers high-level data manipulation in a procedural style.