I EAT BIG!!!
I HANDLE BIG!!!
J. Ramsingh
Ph.D. Research Scholar
Department of Computer Applications
Bharathiar University
History of Pig
• A research project at Yahoo! Research
• Created to overcome the rigidity of the MapReduce paradigm
• Pig was open sourced via the Apache Incubator
• The first Pig release came in September 2008
• In 2009, Amazon and Yahoo! were using Pig
• In 2010, Pig became a top-level Apache project
Why Pig?
• MapReduce is not well suited to ad-hoc data analytics
• Roughly 200 lines of MapReduce code = 10 lines of Pig Latin
• MapReduce is not rich in built-in functions
Facts on Pig
• Pigs eat anything
– (relational, nested, unstructured data, files, etc.)
• Pigs live anywhere
– (parallel data processing)
• Pigs are domestic animals
– (easily controlled, modified, integrated)
• Pigs fly
– (processes data quickly)
Why Is It Called Pig?
• Entertaining nomenclature:
– Pig Latin for the language,
– Grunt for the shell, and
– Piggybank for the shared repository.
Overview of Pig
• Pig is a platform that can handle large data sets
• "I LOVE TO EAT MORE N MORE"
Pig vs MapReduce
• Pig provides a rich set of data-processing operations; MapReduce provides the group-by operation directly but the order-by operation only indirectly.
• Pig can analyze a Pig Latin script and understand its data flow, so it can do early error checking and optimization; in MapReduce, the data processing inside the map and reduce phases is opaque to the system (there is no opportunity to optimize or check the user's code).
• Pig Latin is much cheaper to write and maintain; Java code for MapReduce is difficult to write.
• Pig has a rich type system; MapReduce does not have a type system, which limits the ability to check users' code for errors both before and during runtime.
Running Environment
• Local mode
• Hadoop mode
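The two modes are selected with the -x flag when launching Pig (a sketch, assuming Pig is installed and on the PATH; myscript.pig is a placeholder name):

```
# Local mode: runs in a single JVM against the local file system (good for testing)
pig -x local myscript.pig

# Hadoop (MapReduce) mode: the default; runs against HDFS and the cluster
pig -x mapreduce myscript.pig

# Launching pig without a script opens the interactive Grunt shell
pig -x local
```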
Pig Execution in Hadoop Cluster
(architecture diagram)
Components of Pig
• Pig Latin – a command-based language
• Grunt – the execution environment (interactive shell)
• Pig Server – the compiler, which strives to optimize execution
Compilation
The Pig system does two tasks:
• Builds a Logical Plan from a Pig Latin script
– Supports execution-platform independence
– No processing of data is performed at this stage
• Compiles the Logical Plan to a Physical Plan and executes it
– Converts the Logical Plan into a series of MapReduce jobs to be executed by Hadoop MapReduce
Building a Logical Plan
A = LOAD 'dataset1.dat' AS (name, dob, designation);
B = GROUP A BY designation;
C = FOREACH B GENERATE group AS designation, COUNT(A) AS cnt;
D = FILTER C BY designation == 'XXX' OR designation == 'yyy';
STORE D INTO 'result.dat';

Logical plan: LOAD DATA → FILTER → GROUP → FOREACH
(the optimizer pushes the filter ahead of the group)
Execution only happens when output is specified by STORE or DUMP.
Building a Physical Plan
Step 1: Create a MapReduce job for each (CO)GROUP
Step 2: Push other commands into the map and reduce functions where possible
Step 3: Certain commands may require their own MapReduce job (e.g., ORDER needs separate MapReduce jobs)

Map: Load(user.dat) → Filter
Reduce: Group → Foreach
I Need All These
• Linux (version above 10)
• Java (version above 6)
• Hadoop
• Pig
Execution of Pig
Pig Latin Basics
• Pig Latin is a data-flow language rather than a procedural or declarative one; a program consists of a collection of statements.
• A statement can be thought of as an operation, or a command.
Building Blocks (Complex Data Types)
• Field – a piece of data
[e.g., student_id = 01]
• Tuple – an ordered set of fields
[e.g., (01, Raja, MCA, C++)]
• Bag – a collection of tuples
[e.g., {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]
Pig Data Types
• int – signed 32-bit integer
• long – signed 64-bit integer
• float – 32-bit floating point
• double – 64-bit floating point
• chararray – character array (string) in Unicode UTF-8 format
• bytearray – byte array (blob)
• boolean – true/false
Pig Basic Commands
• LOAD – read data from the file system
• STORE – write data to the file system
• DUMP – generate output (display on screen)
• FOREACH – apply an expression to each record and generate one or more records
• FILTER – apply a predicate to each record and remove records where it is false
• GROUP / COGROUP – collect records with the same key from one or more inputs
• JOIN – join two or more inputs based on a key
• ORDER – sort records based on a key
• DISTINCT – remove duplicate records
• UNION – merge two data sets
• LIMIT – limit the number of records
• SPLIT – split data into two or more sets, based on filter conditions
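As a sketch of how these statements combine, assuming a hypothetical students.csv with name and age columns:

```
data    = LOAD 'students.csv' USING PigStorage(',') AS (name:chararray, age:int);
adults  = FILTER data BY age > 20;            -- keep records where the predicate holds
grouped = GROUP adults BY name;               -- collect records with the same key
counts  = FOREACH grouped GENERATE group AS name, COUNT(adults) AS cnt;
sorted  = ORDER counts BY cnt DESC;           -- sort on a key
STORE sorted INTO 'out';                      -- nothing runs until STORE or DUMP
```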
Load Command
LOAD 'data' [USING function] [AS schema];
• data – the name of a directory or file
Must be in single quotes
• USING – specifies the load function to use
By default uses PigStorage(), which parses each line into fields using a delimiter
The default delimiter is tab ('\t')
• AS – assigns a schema to incoming data
Assigns names to fields
Declares types for fields
LOAD Command Example
data = load '$dir/age.csv' using PigStorage(',') as (name:chararray, age:chararray);
(data is the relation; the as clause supplies the schema)
DUMP and STORE Statements
• No action is taken until a DUMP or STORE command is encountered
Pig will parse, validate, and analyze statements but not execute them
Nothing is executed; Pig will optimize the entire chunk of script
• DUMP – displays the results on the screen
• STORE – saves the results (typically to a file)

data = load '$dir/newfine.csv' using PigStorage(',') as (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray);
...
DUMP data
Ram,22
...
FOREACH
• FOREACH <bag> GENERATE <data>
Iterates over each element in the bag and produces a result
result = FOREACH data GENERATE name;
FOREACH with Functions
FOREACH B GENERATE group, FUNCTION(A);
• Pig comes with many built-in functions, including COUNT, FLATTEN, CONCAT, etc.
• You can also implement a custom function
Example (assumes data has been grouped by name)
counts = FOREACH data GENERATE group, COUNT(name);
dump counts
(Ram,3)
(Raj,4)
(Sam,2)
(Mani,1)
Diagnostic Tools
• DESCRIBE – displays the structure (schema) of a bag
DESCRIBE <bag_name>;
• EXPLAIN – displays the execution plan; produces various reports
– Logical Plan
– MapReduce Plan
EXPLAIN <bag_name>;
• ILLUSTRATE – illustrates how the Pig engine transforms the data
ILLUSTRATE <bag_name>;
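From the Grunt shell these look roughly as follows (a sketch; the relation data is assumed loaded with the schema from the LOAD example):

```
grunt> DESCRIBE data;
data: {name: chararray,age: chararray}

grunt> EXPLAIN data;     -- prints the logical, physical and MapReduce plans
grunt> ILLUSTRATE data;  -- shows sample rows at each step of the pipeline
```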
FLATTEN Operator
• Flattens nested bags and data types
• FLATTEN is not a function, it is an operator; it re-arranges the output
Example
grunt> dump data
({(this),(is),(a),(line),(of),(text)})
({(yet),(another),(line),(of),(text)})
({(third),(line),(of),(words)})
grunt> flatBag = FOREACH data GENERATE flatten($0);
grunt> dump flatBag
(this)
(is)
(a)
......
The input is a nested structure (a bag of bags of tuples); each row is flattened, resulting in a bag of simple tokens.
Group
• Groups the data in one or more relations.
• The GROUP operator groups together tuples that have the same group key (key field).
• The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as that of the group key.
Example
groupme = group data by name;
dump groupme
(Ram,{(Ram,30),(Ram,22),(Ram,25)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)})
(Sam,{(Sam,15),(Sam,22)})
Co-Group
• COGROUP is the same operation as GROUP.
• It groups two data sets together by a common attribute.
• It groups data into nested bags.
"Use GROUP when only one relation is involved and COGROUP when multiple relations are involved."
Example
Data1 = load '$dir/data.csv' using PigStorage(',') as (name:chararray, age:chararray);
Data2 = load '$dir/data2.csv' using PigStorage(',') as (name:chararray, address:chararray);
X = COGROUP Data1 BY name, Data2 BY name;
cont …
dump X
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
(Sam,{(Sam,15),(Sam,22)},{})
The second field is a bag that came from Data1 (the first data set); the third is a bag from Data2 (the second data set).
COGROUP is by default an OUTER JOIN. You can remove records with empty bags by specifying INNER on each input:
X = COGROUP Data1 BY name INNER, Data2 BY name INNER;
dump X
(Ram,{(Ram,30),(Ram,22),(Ram,25)},{(Ram,Cbe),(Ram,Che)})
(Raj,{(Raj,5),(Raj,22),(Raj,52),(Raj,62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)})
Filtering
• Selects a subset of the tuples in a bag
FILTER bag BY expression;
• Expressions use simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT)
Example
filterdata = filter data by age > 20;
dump filterdata
(Ram,30)
(Ram,22)
(Ram,25)
(Raj,22)
(Raj,52)
(Raj,62)
(Sam,22)
Ordering
• Sorts a relation based on one or more fields.
alias = ORDER alias BY field [ASC|DESC];
Example
ordereddata = order data by age DESC;
dump ordereddata
(Raj,62)
(Raj,52)
(Ram,30)
(Ram,22)
(Raj,22)
(Sam,22)
Join
• Joins two data sets together by a common attribute.
• By default the JOIN operator performs an inner join.
• Inner joins ignore null keys, so it makes sense to filter them out before the join.
Note: The JOIN and COGROUP operators perform similar functions; JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
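Reusing Data1 and Data2 from the COGROUP example, an inner join on name can be sketched as below; note the flat records, in contrast to COGROUP's nested bags:

```
J = JOIN Data1 BY name, Data2 BY name;
dump J
(Ram,30,Ram,Cbe)
(Ram,30,Ram,Che)
(Ram,22,Ram,Cbe)
...
```

One flat record per matching pair; Sam, with no address record, is dropped by the inner join.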
Outer Join
• Records that do not join with the 'other' record set are still included when using an outer join.
• Left Outer
Records from the first data set are included whether they have a match or not. Fields from the unmatched (second) data set are set to null.
• Right Outer
The opposite of a left outer join: records from the second data set are included no matter what. Fields from the unmatched (first) data set are set to null.
• Full Outer
Records from both sides are included. For unmatched records the fields from the 'other' data set are set to null.
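With the same Data1/Data2, the three variants are written by adding LEFT, RIGHT, or FULL to the JOIN (the OUTER keyword is optional):

```
L = JOIN Data1 BY name LEFT OUTER,  Data2 BY name;   -- all of Data1, nulls for unmatched Data2 fields
R = JOIN Data1 BY name RIGHT OUTER, Data2 BY name;   -- all of Data2, nulls for unmatched Data1 fields
F = JOIN Data1 BY name FULL OUTER,  Data2 BY name;   -- everything, nulls on whichever side is missing
```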
Eval Functions
• AVG
• CONCAT
• COUNT
• IsEmpty
• MAX
• MIN
• SUM
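Applied to the grouped relation from the GROUP slide, the aggregates can be sketched as follows (this assumes age was loaded as int rather than chararray, since AVG and SUM need a numeric field):

```
groupme = GROUP data BY name;
stats   = FOREACH groupme GENERATE group,
                                   COUNT(data), AVG(data.age),
                                   MIN(data.age), MAX(data.age), SUM(data.age);
```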
UDFs
• User Defined Functions
• A way to operate on fields, but not on groups
• Can be called from a Pig script
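On the Pig side, a UDF is registered from its jar and then called like a built-in; a hypothetical sketch (myudfs.jar and the com.example.pig.Upper class are placeholders, not artifacts from this deck):

```
REGISTER 'myudfs.jar';
DEFINE UPPER com.example.pig.Upper();        -- alias for a hypothetical Java EvalFunc
shouted = FOREACH data GENERATE UPPER(name);
```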
UDFs to the Rescue [Embedded Mode]
• Easy to use
• Easy to code
• Keeps the power of Pig
• You are free to write in …
Pig as a Wrapper
Does Whatever You Want
• Image feature extraction
• Geo computations
• Data cleaning
• Retrieving web pages
• NLP
• … and even more
Why Is Pig Faster?
• Fewer bugs
• Fewer lines of code
• Easier to read (the purpose of the analytics is straightforward)
Pitfalls
• Version mismatches
• Bugs in older versions require registering jars
Conclusion
• Pig is a data-processing environment in Hadoop that targets procedural programmers who do large-scale data analysis.
• Pig Latin offers high-level data manipulation in a procedural style.
References
• http://hadooptutorials.co.in
• https://www.youtube.com
• https://flume.apache.org
• http://hortonworks.com
• http://www-01.ibm.com
• http://kafka.apache.org
• Thank you!!!!

More Related Content

Similar to Pig

apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsHyunjung Park
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceGigaScience, BGI Hong Kong
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environmentizahn
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slidesAnandMHadoop
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012   Kostas Tzouma - Linking and analyzing bigdata - StratosphereEDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - StratosphereEuropean Data Forum
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRoku
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the CloudDataMine Lab
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and researchBrianna McHorse
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014thiruvel
 

Similar to Pig (20)

apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
 
Data-Intensive Scalable Science
Data-Intensive Scalable ScienceData-Intensive Scalable Science
Data-Intensive Scalable Science
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Lec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptxLec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptx
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
4.1-Pig.pptx
4.1-Pig.pptx4.1-Pig.pptx
4.1-Pig.pptx
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slides
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012   Kostas Tzouma - Linking and analyzing bigdata - StratosphereEDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
EDF2012 Kostas Tzouma - Linking and analyzing bigdata - Stratosphere
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Recommender Systems in the Linked Data era
Recommender Systems in the Linked Data eraRecommender Systems in the Linked Data era
Recommender Systems in the Linked Data era
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the Cloud
 
PyRate for fun and research
PyRate for fun and researchPyRate for fun and research
PyRate for fun and research
 
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Hadoop Summit San Jose 2014: Data Discovery on Hadoop
Hadoop Summit San Jose 2014: Data Discovery on Hadoop
 
Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014Data discoveryonhadoop@yahoo! hadoopsummit2014
Data discoveryonhadoop@yahoo! hadoopsummit2014
 

Recently uploaded

School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Electromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptxElectromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptxNANDHAKUMARA10
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
Post office management system project ..pdf
Post office management system project ..pdfPost office management system project ..pdf
Post office management system project ..pdfKamal Acharya
 
Linux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using PipesLinux Systems Programming: Inter Process Communication (IPC) using Pipes
Linux Systems Programming: Inter Process Communication (IPC) using PipesRashidFaridChishti
 
Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementDr. Deepak Mudgal
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxpritamlangde
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxMustafa Ahmed
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)Introduction to Artificial Intelligence ( AI)
Introduction to Artificial Intelligence ( AI)ChandrakantDivate1
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
Introduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxIntroduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxhublikarsn
 
Query optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsQuery optimization and processing for advanced database systems
Query optimization and processing for advanced database systemsmeharikiros2
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiessarkmank1
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessorAshwiniTodkar4
 

Recently uploaded (20)

School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Electromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptxElectromagnetic relays used for power system .pptx
Electromagnetic relays used for power system .pptx
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Post office management system project ..pdf
Post office management system project ..pdfPost office management system project ..pdf
Post office management system project ..pdf
 
Linux Systems Programming: Inter Process Communication (IPC) using Pipes

Pig

  • 1. I EAT BIG!!! I HANDLE BIG!!! J.Ramsingh Ph.D Research Scholar Department of Computer Applications Bharathiar University
  • 2. History of Pig • Began as a research project in Yahoo! Research • Built to overcome the rigidity of the MapReduce paradigm • Pig was open-sourced via the Apache Incubator • First Pig release in September 2008 • By 2009, Amazon and Yahoo! were using Pig • In 2010, Pig became a top-level Apache project
  • 5. WHY Pig? • MapReduce is not well suited to ad hoc data analytics • Roughly 200 lines of Java MapReduce code can shrink to about 10 lines of Pig Latin • MapReduce is not rich in built-in functions
  • 9. Facts on Pig • Pigs eat anything – (relational, nested, unstructured data, files, etc.) • Pigs live anywhere – (parallel data processing) • Pigs are domestic animals – (easily controlled, modified, integrated) • Pigs fly – (processes data quickly)
  • 10. Why Is It Called Pig? • Entertaining nomenclature – Pig Latin for the language – Grunt for the shell – Piggybank for the shared repository of user-defined functions
  • 11. Overview of PIG • Pig is a platform that can handle large data sets
  • 12. PIG vs MAPREDUCE • Pig provides common data-processing operations directly; MapReduce provides the GROUP BY operation directly but ORDER BY only indirectly • Pig can analyze a Pig Latin script, understand the data flow, and perform early error checking and optimization; in MapReduce, the data processing inside the map and reduce phases is opaque to the system, so there is no opportunity to optimize or check the user's code • Pig Latin is much cheaper to write and maintain than Java MapReduce code, which is difficult to write • Pig has a rich type system; MapReduce has none, which limits the ability to check users' code for errors both before and during runtime
  • 13. PIG Vs MAPREDUCE 13 J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 14. Running Environment • Local mode • Hadoop (MapReduce) mode
  • 15. Pig Execution in Hadoop Cluster 15J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 16. Pig Execution in Hadoop Cluster 16J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 17. Components of PIG • Pig Latin – the command-based data-flow language • Grunt – the interactive shell and execution environment • Pig Server – the compiler, which strives to optimize execution
  • 18. Compilation • The Pig system performs two tasks: • Builds a Logical Plan from a Pig Latin script – supports execution-platform independence – no data is processed at this stage • Compiles the Logical Plan into a Physical Plan and executes it – converts the Logical Plan into a series of MapReduce jobs to be executed by Hadoop MapReduce
  • 24. Building a logical plan A = LOAD 'dataset1.dat' AS (name, dob, designation); B = GROUP A BY designation; C = FOREACH B GENERATE group AS designation, COUNT(A); D = FILTER C BY name == 'XXX' OR name == 'yyy'; STORE D INTO 'result.dat'; Optimized plan: LOAD DATA → FILTER → GROUP → FOREACH. Execution only happens when output is requested by STORE or DUMP
  • 26. Building a Physical plan • Step 1: Create a MapReduce job for each (CO)GROUP • Step 2: Push other commands into the map and reduce functions where possible • Step 3: Certain commands may require their own MapReduce job (e.g., ORDER needs separate MapReduce jobs) • Resulting plan: Map – Load(user.dat), Filter; Reduce – Group, Foreach
  • 27. I Need All These • Linux (version 10 or later of the distribution) • Java 6 or later • Hadoop • Pig
  • 28. Execution of Pig 28J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 29. PIG Latin Basics • Pig Latin is a data-flow language rather than a procedural or declarative one, in which a program consists of a collection of statements • A statement can be thought of as an operation or a command Building blocks • Field – a piece of data [e.g., student_id = 01] • Tuple – an ordered set of fields [e.g., (01, Raja, MCA, C++)] • Bag – a collection of tuples [e.g., {(01, Raja, MCA, C++), (22, Ramesh, MBA, C)}]
  • 30. PIG Data Types • int – signed 32-bit integer • long – signed 64-bit integer • float – 32-bit floating point • double – 64-bit floating point • chararray – character array (string) in Unicode UTF-8 format • bytearray – byte array (blob) • boolean – true/false
  • 31. PIG Basic Commands • LOAD – read data from the file system • STORE – write data to the file system • DUMP – generate output • FOREACH – apply an expression to each record and generate one or more records • FILTER – apply a predicate to each record and remove records where it is false • GROUP / COGROUP – collect records with the same key from one or more inputs • JOIN – join two or more inputs based on a key • ORDER – sort records based on a key • DISTINCT – remove duplicate records • UNION – merge two datasets • LIMIT – limit the number of records • SPLIT – split data into two or more sets based on filter conditions
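Taken together, these statements compose into a small data-flow script. A hedged sketch, assuming a hypothetical comma-separated file students.csv with name, dept, and mark columns:

```pig
-- Hypothetical input: students.csv with lines like  Raja,MCA,72
raw    = LOAD 'students.csv' USING PigStorage(',')
         AS (name:chararray, dept:chararray, mark:int);
passed = FILTER raw BY mark >= 50;           -- keep passing students only
bydept = GROUP passed BY dept;               -- one bag of tuples per department
counts = FOREACH bydept GENERATE group AS dept, COUNT(passed) AS n;
sorted = ORDER counts BY n DESC;             -- departments by pass count
DUMP sorted;                                 -- triggers actual execution
```

Nothing runs until the final DUMP; Pig optimizes the whole pipeline as one unit.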
  • 32. Load Command LOAD 'data' [USING function] [AS schema]; • data – name of the directory or file; must be in single quotes • USING – specifies the load function to use; by default uses PigStorage(), which parses each line into fields using a delimiter (the default delimiter is tab, '\t') • AS – assigns a schema to the incoming data; names the fields and declares their types
  • 33. LOAD Command Example data • data = load '$dir/age.csv' using PigStorage(',') as (name:chararray, age:chararray) Schema 33J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 34. DUMP and STORE statements • No action is taken until a DUMP or STORE command is encountered; Pig will parse, validate, and analyze statements but not execute them • DUMP – displays the results on screen • STORE – saves the results (typically to a file) Example: data = load '$dir/newfine.csv' using PigStorage(',') as (MemberCode:chararray, IssueDate:chararray, ReturnDate:chararray); … DUMP data Ram,22 … Nothing is executed until then; Pig will optimize this entire chunk of script
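A minimal sketch of STORE, which (unlike DUMP) writes the results back to the file system as a directory of part files; the output path here is hypothetical:

```pig
-- 'output/fines' is a hypothetical output directory on the file system
STORE data INTO 'output/fines' USING PigStorage(',');
```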
  • 35. FOREACH • FOREACH <bag> GENERATE <data> Iterate over each element in the bag and produce a result result = FOREACH data GENERATE name; 35J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 36. FOREACH with Functions FOREACH B GENERATE group, FUNCTION(A); • Pig comes with many functions including COUNT, FLATTEN, CONCAT, etc... • Can implement a custom function Example counts = FOREACH data GENERATE group, COUNT(name); Dump counts Ram,3 Raj,4 Sam,2 Mani,1 36J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 37. Diagnostic Tools DESCRIBE Display the structure of the Bag DESCRIBE <bag_name>; EXPLAIN Display Execution Plan Produces Various reports • Logical Plan • MapReduce Plan EXPLAIN <bag_name>; ILLUSTRATE Illustrate how Pig engine transforms the data ILLUSTRATE <bag_name>; 37J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
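A minimal sketch of the three diagnostic statements, assuming a relation named data has been loaded from a hypothetical file:

```pig
data = LOAD 'students.csv' USING PigStorage(',')
       AS (name:chararray, age:int);
DESCRIBE data;    -- prints the schema of the relation
EXPLAIN data;     -- prints the logical, physical, and MapReduce plans
ILLUSTRATE data;  -- runs the pipeline on a small sample of the input
```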
  • 38. FLATTEN Operator • Flattens nested bags and data types • FLATTEN is not a function, it is an operator that re-arranges output Example grunt> dump data ({(this),(is),(a),(line),(of),(text)}) ({(yet),(another),(line),(of),(text)}) ({(third),(line),(of),(words)}) grunt> flatBag = FOREACH data GENERATE flatten($0); (this) (is) (a) … The nested structure is a bag of bags of tuples; each row is flattened, resulting in a bag of simple tokens
  • 39. Group • Groups the data in one or multiple relations. • The GROUP operator groups together tuples that have the same group key (key field). • The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. Example groupme= group data by name; Dump groupme (Ram,{(Ram, 30),(Ram, 22), (Ram, 25)}) (Raj ,{(Raj, 5),(Raj, 22), (Raj, 52), (Raj, 62)}) (Sam,{(Sam, 15),(Sam, 22)}) 39J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 40. Co-Group • COGROUP works like GROUP: it groups two datasets together by a common attribute, into nested bags • "Use GROUP when only one relation is involved and COGROUP when multiple relations are involved" Example Data1 = load '$dir/data.csv' using PigStorage(',') as (name:chararray, age:chararray); Data2 = load '$dir/data2.csv' using PigStorage(',') as (name:chararray, address:chararray); X = COGROUP Data1 BY name, Data2 BY name;
  • 41. cont … Dump X (Ram,{(Ram, 30),(Ram, 22),(Ram, 25)},{(Ram,Cbe),(Ram,Che)}) (Raj,{(Raj, 5),(Raj, 22),(Raj, 52),(Raj, 62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)}) (Sam,{(Sam, 15),(Sam, 22)},{}) COGROUP is by default an OUTER JOIN; you can remove records with empty bags by specifying INNER on each bag: X = COGROUP Data1 BY name INNER, Data2 BY name INNER; Dump X (Ram,{(Ram, 30),(Ram, 22),(Ram, 25)},{(Ram,Cbe),(Ram,Che)}) (Raj,{(Raj, 5),(Raj, 22),(Raj, 52),(Raj, 62)},{(Raj,Mdu),(Raj,Mumbai),(Raj,Delhi)}) The second field is a bag that came from Data1 (the first dataset); the third field is a bag from Data2 (the second dataset)
  • 42. Filtering • Selects a subset of the tuples in a bag FILTER bag BY expression; • The expression uses simple comparison operators (==, !=, <, >, …) and logical connectors (AND, OR, NOT) Example Filterdata = filter data by age > 20; Dump Filterdata (Ram, 30),(Ram, 22),(Ram, 25) (Raj, 22),(Raj, 52),(Raj, 62) (Sam, 22)
  • 43. Ordering • Sorts a relation based on one or more fields alias = ORDER alias BY { * [ASC|DESC] }; Example orderddata = order data by age DESC; Dump orderddata (Raj, 62) (Raj, 52) (Ram, 30) (Ram, 22) (Raj, 22) (Sam, 22)
  • 44. Join • Joins two datasets together by a common attribute. • By default JOIN operator always performs an inner join. • Inner joins ignore null keys, so it makes sense to filter them out before the join. Note : The JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records while COGROUP creates a nested set of output records Data 1 Data 2 Join 44J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
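A sketch of the default inner join over the two hypothetical datasets introduced with COGROUP above:

```pig
Data1 = LOAD '$dir/data.csv'  USING PigStorage(',')
        AS (name:chararray, age:int);
Data2 = LOAD '$dir/data2.csv' USING PigStorage(',')
        AS (name:chararray, address:chararray);
-- Inner join (the default); records whose join key is null are dropped
Joined = JOIN Data1 BY name, Data2 BY name;
```

Each output record of JOIN is flat (name, age, name, address), whereas COGROUP would keep the matching tuples in nested bags.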
  • 45. Outer Join • Records which will not join with the ‘other’ record-set are still included in an outer join • Left Outer – records from the first dataset are included whether they have a match or not; fields from the unmatched (second) bag are set to null
  • 46. cont … • Right Outer The opposite of Left Outer Join: Records from the second data-set are included no matter what. Fields from the unmatched (first) bag are set to null. • Full Outer Records from both sides are included. For unmatched records the fields from the ‘other’ bag are set to null. Data 1 Data 2 Data 1 Data 2 46J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
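The three outer-join variants share one syntax; a sketch using the same hypothetical Data1 and Data2 relations:

```pig
L = JOIN Data1 BY name LEFT OUTER,  Data2 BY name;  -- keep all Data1 records
R = JOIN Data1 BY name RIGHT OUTER, Data2 BY name;  -- keep all Data2 records
F = JOIN Data1 BY name FULL OUTER,  Data2 BY name;  -- keep records from both sides
```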
  • 48. EVAL FUNCTIONS • AVG • CONCAT • COUNT • ISEMPTY • MAX • MIN • SUM 48J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
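These built-in eval functions are typically applied to the bags produced by GROUP; a hedged sketch, assuming a relation data with a numeric age field:

```pig
grouped = GROUP data BY name;
stats   = FOREACH grouped GENERATE
            group,            -- the grouping key
            COUNT(data),      -- number of tuples per key
            AVG(data.age),    -- average of the age field within the bag
            MAX(data.age);    -- maximum age within the bag
```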
  • 49. UDFs • User Defined Functions • A way to operate on fields (but not on groups) • Can be called from a Pig script
  • 50. UDF to the Rescue [Embedded Mode] • Easy to use • Easy to code • Keeps the power of Pig • You are free to write in the language of your choice (e.g., Java)
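Calling a UDF from Pig Latin amounts to registering the jar and invoking the class; the jar name and class below are hypothetical:

```pig
REGISTER myudfs.jar;   -- hypothetical jar containing the compiled UDF class
-- myudfs.UPPER is a hypothetical Java UDF (an EvalFunc over chararray)
upper_names = FOREACH data GENERATE myudfs.UPPER(name);
```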
  • 51. PIG a WRAPPER 51 J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 52. Does What Ever You Want • Image Feature Extraction • Geo Computations • Data Cleaning • Retrieve Web Pages • NLP ……… • Even more……. 52J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 53. Why Is PIG Faster? • Fewer bugs • Fewer lines of code • Easier to read (the purpose of the analytics is straightforward)
  • 54. Pit Falls • Version match 54J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 55. Pit Falls • Bugs in older version requires register of jars 55J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 56. Conclusion • Pig is a data processing environment in Hadoop which targets procedural programmers, who do large-scale data analysis. • Pig-Latin offers high-level data manipulation in a procedural style. 56J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)
  • 57. References • http://hadooptutorials.co.in • https://www.youtube.com • https://flume.apache.org • http://hortonworks.com • http://www-01.ibm.com • http://kafka.apache.org
  • 58. • Thank you!!!! 58J.Ramsingh.,MCA.,M.Phil.,(Ph.D Research Scholar,DCA,Bharathiar University)

Editor's Notes

  1. Pig differs in the way it implements grouping. Filter and projection can be implemented trivially in the map phase, but other operators, particularly join, are not provided by MapReduce and must instead be written by the user.